CN114416449A - Method, device, electronic equipment and medium for mining operation and maintenance fault node - Google Patents

Method, device, electronic equipment and medium for mining operation and maintenance fault node

Info

Publication number
CN114416449A
Authority
CN
China
Prior art keywords
maintenance
performance
node
detection result
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210051440.4A
Other languages
Chinese (zh)
Inventor
尤明超
赵雁
杨镇宇
潘佳文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210051440.4A
Publication of CN114416449A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2273 Test methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Computer Hardware Design (AREA)
  • Development Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present disclosure provides a method, an apparatus, an electronic device, a medium, and a computer program product for mining an operation and maintenance fault node, which can be used in the technical fields of artificial intelligence and computer operation and maintenance. The method comprises the following steps: determining an operation and maintenance node, wherein the operation and maintenance node comprises at least one of a first physical machine, a virtual machine, an application container, a second physical machine and a database container; acquiring a performance index of the operation and maintenance node; performing anomaly detection on the performance index of the operation and maintenance node; when the performance index is abnormal, outputting an anomaly detection result; verifying the anomaly detection result by using a detection model to obtain a verification result; and when the verification result indicates that the anomaly detection result passes the verification of the detection model, determining the operation and maintenance node corresponding to the anomaly detection result as an operation and maintenance fault node.

Description

Method, device, electronic equipment and medium for mining operation and maintenance fault node
Technical Field
The present disclosure relates to the field of artificial intelligence and computer operation and maintenance technologies, and more particularly, to a method, an apparatus, an electronic device, a medium, and a computer program product for mining an operation and maintenance fault node.
Background
Financial technology can provide key support for business innovation and transformation in banks, and can even play a leading, driving role. At present, in order to cope with a complicated international situation and meet the requirements of the times, an IT architecture transformation strategy has been proposed: a large number of applications are migrated from mainframes to open-system equipment, and the open-system equipment carries a new generation of core banking systems. As the IT architecture transformation proceeds and the services are rolled out more widely, more and more cloud computing containers and distributed databases are deployed in machine rooms, which poses great challenges to the operation and maintenance of the open platform.
Disclosure of Invention
In view of the above, the present disclosure provides a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for mining operation and maintenance fault nodes with high timeliness, high operation and maintenance intelligence level, small problem impact, and low risk degree.
One aspect of the present disclosure provides a method for mining an operation and maintenance fault node, including: determining an operation and maintenance node, wherein the operation and maintenance node comprises at least one of a first physical machine, a virtual machine, an application container, a second physical machine and a database container; acquiring a performance index of the operation and maintenance node; performing anomaly detection on the performance index of the operation and maintenance node; when the performance index is abnormal, outputting an anomaly detection result; verifying the anomaly detection result by using a detection model to obtain a verification result; and when the verification result indicates that the anomaly detection result passes the verification of the detection model, determining the operation and maintenance node corresponding to the anomaly detection result as an operation and maintenance fault node.
According to the method for mining the operation and maintenance fault node, at least one of the first physical machine, the virtual machine, the application container, the second physical machine and the database container is used as the operation and maintenance node. The method can therefore integrate information collected through multiple channels, has the potential for mining historical patterns and for self-updating optimization, offers ideas and inspiration for fault diagnosis and emergency response in other applications and platforms, and lays a foundation for later automating the emergency response to simple faults. In addition, the method readily supports an integrated solution spanning monitoring, analysis, automatic diagnosis and decision assistance, and can resolve difficulties encountered in the actual operation and maintenance of the platform. The disclosure can improve the timeliness of fault emergency response, raise the intelligence level of operation and maintenance, and reduce the impact and risk of problems.
In some embodiments, the performing anomaly detection on the performance index of the operation and maintenance node includes: performing, by a dynamic baseline model, anomaly detection on the performance index of the operation and maintenance node according to a judgment rule. The method further includes: when the verification result indicates that the anomaly detection result does not pass the verification of the detection model, optimizing the judgment rule by the dynamic baseline model.
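The feedback loop between the dynamic baseline model and the detection model can be illustrated with a minimal sketch. The disclosure does not specify the form of the judgment rule or how it is optimized, so the rule below (a band of k standard deviations around the historical mean, widened by a fixed step after each rejected detection) and the names `DynamicBaseline`, `optimize_rule` and `process` are assumptions made purely for illustration.

```python
from statistics import mean, stdev


class DynamicBaseline:
    """Hypothetical dynamic baseline: flags values outside mean +/- k * stdev."""

    def __init__(self, k=2.0, step=0.5):
        self.k = k        # judgment rule: width of the normal band, in standard deviations
        self.step = step  # how much the rule is loosened after a rejected detection

    def detect(self, history, value):
        """Return True when `value` falls outside the band built from `history`."""
        return abs(value - mean(history)) > self.k * stdev(history)

    def optimize_rule(self):
        """Assumed optimization: widen the band after a failed verification."""
        self.k += self.step


def process(baseline, history, value, verify):
    """Detect, verify, and feed a failed verification back into the baseline."""
    if not baseline.detect(history, value):
        return "normal"
    if verify(value):
        return "fault"
    baseline.optimize_rule()  # verification failed: adjust the judgment rule
    return "rejected"
```

In this sketch a rejected detection makes the baseline less sensitive, so repeated false alarms on the same kind of deviation gradually stop being reported; a real system might instead retrain the baseline or adjust per-indicator rules.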
In some embodiments, the verifying the abnormal detection result by using the detection model includes: setting a standard value based on the performance index of the operation and maintenance node; and comparing the abnormal detection result with the standard value to obtain a verification result.
In some embodiments, the obtaining the performance index of the operation and maintenance node includes: acquiring the performance index of the operation and maintenance node within a time period t, wherein the performance index within the time period t forms a performance trend. The performing anomaly detection on the performance index of the operation and maintenance node to obtain an anomaly detection result includes: performing anomaly detection on the performance trend of the operation and maintenance node to obtain the anomaly detection result. The setting of the standard value based on the performance index of the operation and maintenance node includes: setting the standard value based on the performance trend of the operation and maintenance node.
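As a rough illustration of working on a performance trend rather than a single sample, the sketch below keeps the samples inside the last t seconds, derives a standard value from that window, and applies a simple trend rule. The specific choices (median scaled by a tolerance factor as the standard value, "strictly rising for several consecutive steps" as the trend anomaly) are assumptions; the disclosure does not state how either is computed.

```python
from statistics import median


def performance_trend(samples, now, t):
    """Keep the (timestamp, value) samples that fall inside the last t seconds."""
    return [(ts, v) for ts, v in samples if now - t <= ts <= now]


def standard_value(trend, tolerance=1.5):
    """Hypothetical standard value: the trend's median scaled by a tolerance factor."""
    return median(v for _, v in trend) * tolerance


def detect_on_trend(trend, min_rise=3):
    """Hypothetical trend rule: flag the trend when its last `min_rise` steps all increase."""
    values = [v for _, v in trend]
    tail = values[-(min_rise + 1):]
    return len(tail) == min_rise + 1 and all(a < b for a, b in zip(tail, tail[1:]))


# Example: response-time samples taken every 10 seconds (made-up data).
samples = [(0, 10), (10, 11), (20, 12), (30, 20), (40, 35), (50, 60)]
trend = performance_trend(samples, now=50, t=40)
abnormal = detect_on_trend(trend)
verified = abnormal and trend[-1][1] > standard_value(trend)
```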
In some embodiments, the obtaining the performance index of the operation and maintenance node includes: acquiring, for each type of operation and maintenance node, the performance index corresponding to that type.
In some embodiments, the performance indicators of the virtual machine, the first physical machine, and the second physical machine include a speed of a central processing unit, a memory space, a disk read-write speed, and a traffic of a transmission control protocol.
In some embodiments, the performance indicators of the application container include a transaction rate, a first response time, and a success rate.
In some embodiments, the performance indicators of the database container include a concurrency rate, a second response time, and a database read-write speed.
In some embodiments, the method further comprises: and when the verification result is that the abnormal detection result passes the verification of the detection model, sending the abnormal detection result and the operation and maintenance fault node corresponding to the abnormal detection result.
In some embodiments, the method further comprises: displaying, in the form of a view and/or a report, a plurality of the performance indicators, the verification result corresponding to each performance indicator, and the operation and maintenance node corresponding to each performance indicator.
In some embodiments, the displaying, in a view and/or a report, a plurality of the performance indicators, the verification result corresponding to each of the performance indicators, and the operation and maintenance node corresponding to each of the performance indicators includes: displaying first icons of the plurality of performance indicators; rendering the first icon of each performance indicator according to the verification result corresponding to that performance indicator; and in response to a click request on a first icon, displaying the operation and maintenance node corresponding to that first icon.
In some embodiments, the displaying first icons of the plurality of performance indicators comprises: classifying the plurality of performance indicators, wherein performance indicators of the same class are shown by one first icon. The displaying, in response to a click request on a first icon, the operation and maintenance node corresponding to that first icon comprises: in response to a click request on the first icon of a class, displaying the g operation and maintenance nodes corresponding to that first icon in the form of second icons or a report, wherein g is an integer greater than or equal to 1.
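The display logic described above can be sketched as a small data transformation, independent of any UI toolkit. The record layout, the class names ("server", "application"), and the red/green rendering are hypothetical; the disclosure only states that indicators of one class share a first icon, that the icon is rendered according to the verification result, and that a click reveals the corresponding nodes.

```python
from collections import defaultdict

# Hypothetical records: (indicator name, indicator class, node id, verified as abnormal?)
records = [
    ("cpu_speed", "server", "vm-1", False),
    ("memory_space", "server", "vm-2", True),
    ("transaction_rate", "application", "app-1", False),
]


def build_icons(records):
    """Group indicators by class: one first icon per class, rendered by verification result."""
    icons = defaultdict(lambda: {"color": "green", "nodes": []})
    for _name, cls, node, verified_abnormal in records:
        icon = icons[cls]
        icon["nodes"].append((node, verified_abnormal))
        if verified_abnormal:
            icon["color"] = "red"  # any verified anomaly turns the class icon red
    return dict(icons)


def on_click(icons, cls):
    """Return the g >= 1 nodes behind a first icon, to be shown as second icons or report rows."""
    return icons[cls]["nodes"]


icons = build_icons(records)
```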
Another aspect of the present disclosure provides an apparatus for mining an operation and maintenance fault node, including: the determining module is used for determining an operation and maintenance node, wherein the operation and maintenance node comprises at least one of a first physical machine, a virtual machine, an application container, a second physical machine and a database container; the acquisition module is used for executing the performance index acquisition of the operation and maintenance node; the detection module is used for executing abnormal detection on the performance indexes of the operation and maintenance nodes; the output module is used for outputting an abnormal detection result when the performance index is abnormal; the verification module is used for verifying the abnormal detection result by using a detection model to obtain a verification result; and the fault determining module is used for determining the operation and maintenance node corresponding to the abnormal detection result as an operation and maintenance fault node when the verification result is that the abnormal detection result passes the verification of the detection model.
Another aspect of the present disclosure provides an electronic device comprising one or more processors and one or more memories, wherein the memories are configured to store executable instructions that, when executed by the processors, implement the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program product comprising a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an exemplary system architecture to which the method and apparatus may be applied, in accordance with an embodiment of the present disclosure;
FIG. 2 is a flow chart schematically illustrating a method for mining an operation and maintenance fault node according to an embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of obtaining performance indicators of an operation and maintenance node according to an embodiment of the present disclosure;
FIG. 4 is a flow chart schematically illustrating the detection of an anomaly in the performance index of an operation and maintenance node according to an embodiment of the present disclosure;
FIG. 5 is a flow chart schematically illustrating a method for mining an operation and maintenance fault node according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flowchart of verifying an anomaly detection result using a detection model to obtain a verification result, according to an embodiment of the present disclosure;
FIG. 7 is a flowchart schematically illustrating obtaining performance indicators of an operation and maintenance node according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a flowchart of performing anomaly detection on a performance index of an operation and maintenance node to obtain an anomaly detection result according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flowchart of setting the standard value based on the performance index of the operation and maintenance node according to an embodiment of the present disclosure;
FIG. 10 is a flow chart that schematically illustrates a method for mining an operation and maintenance failure node, in accordance with an embodiment of the present disclosure;
FIG. 11 is a flow chart that schematically illustrates a method for mining an operation and maintenance failure node, in accordance with an embodiment of the present disclosure;
FIG. 12 schematically illustrates a flowchart of displaying a plurality of performance indicators, a verification result corresponding to each performance indicator, and an operation and maintenance node corresponding to each performance indicator in a view and/or a report according to an embodiment of the present disclosure;
FIG. 13 schematically illustrates a flow chart showing a first icon of a plurality of performance indicators, according to an embodiment of the disclosure;
FIG. 14 schematically illustrates a diagram of a presentation page after responding to a click on performance indicator 1, according to an embodiment of the disclosure;
FIG. 15 is a flow chart that schematically illustrates presentation of an operation and maintenance node corresponding to a first icon in response to a click request for the first icon, in accordance with an embodiment of the present disclosure;
FIG. 16 schematically illustrates a schematic diagram of a configuration topology according to an embodiment of the present disclosure;
FIG. 17 is a schematic diagram illustrating performance indicator behavior and data sources according to an embodiment of the disclosure;
FIG. 18 is a flow chart illustrating a general concept of a design of a method for mining an operation and maintenance fault node according to an embodiment of the present disclosure;
FIG. 19 is a block diagram schematically illustrating an apparatus for mining an operation and maintenance fault node according to an embodiment of the present disclosure;
FIG. 20 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure. In the technical solution of the present disclosure, the acquisition, storage, and use of users' personal information comply with relevant laws and regulations, the necessary security measures are taken, and public order and good customs are not violated.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
Where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features.
Financial technology can provide key support for business innovation and transformation in banks, and can even play a leading, driving role. At present, in order to cope with a complicated international situation and meet the requirements of the times, an IT architecture transformation strategy has been proposed: a large number of applications are migrated from mainframes to open-system equipment, and the open-system equipment carries a new generation of core banking systems. As the IT architecture transformation proceeds and the services are rolled out more widely, more and more cloud computing containers and distributed databases are deployed in machine rooms, which poses great challenges to the operation and maintenance of the open platform. Currently, limited by cloud computing, the distributed technology framework and the level of IT operation and maintenance automation, IT operation and maintenance personnel face the following problems when locating the root cause of a production fault:
1. The service link is too long: banking transactions flow through multiple application systems, each developed and maintained by different technicians. When a production problem occurs in a certain type of banking business, the application teams tend to shift responsibility onto one another.
2. The operation and maintenance complexity is high: each application is internally divided into a plurality of application layers, and each application layer is deployed with a plurality of cloud computing containers connected to a plurality of distributed databases. As the topology grows more complex, locating a problem becomes inefficient and consumes a great deal of labor.
3. There are many interfering alarms: lacking a global visual operation interface and status display, operation and maintenance personnel can perceive the running state of services only through the alarm monitoring system. Once a fault occurs, it often triggers an alarm storm, which greatly hampers the operation and maintenance personnel.
Embodiments of the present disclosure provide a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for mining an operation and maintenance fault node. The method for mining the operation and maintenance fault node comprises the following steps: determining an operation and maintenance node, wherein the operation and maintenance node comprises at least one of a first physical machine, a virtual machine, an application container, a second physical machine and a database container; acquiring performance indexes of operation and maintenance nodes; carrying out abnormity detection on the performance indexes of the operation and maintenance nodes; when the performance index is abnormal, outputting an abnormal detection result; verifying the abnormal detection result by using the detection model to obtain a verification result; and when the verification result is that the abnormal detection result passes the verification of the detection model, determining the operation and maintenance node corresponding to the abnormal detection result as an operation and maintenance fault node.
It should be noted that the method, apparatus, electronic device, computer-readable storage medium, and computer program product for mining an operation and maintenance fault node of the present disclosure may be used in the fields of artificial intelligence and computer operation and maintenance, and may also be used in other fields, such as the field of finance; the field of application is not limited herein.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which the method, apparatus, electronic device, computer-readable storage medium, and computer program product for mining an operation and maintenance failure node may be applied, according to embodiments of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the method for mining operation and maintenance fault nodes provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the device for mining operation and maintenance fault nodes provided by the embodiment of the present disclosure may be generally disposed in the server 105. The method for mining operation and maintenance fault nodes provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the device for mining operation and maintenance fault nodes provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The method for mining operation and maintenance fault nodes according to the embodiment of the present disclosure will be described in detail below with reference to fig. 2 to 14 based on the scenario described in fig. 1.
Fig. 2 schematically shows a flowchart of a method for mining an operation and maintenance fault node according to an embodiment of the present disclosure.
As shown in fig. 2, the method for mining an operation and maintenance fault node of this embodiment includes operations S210 to S260.
In operation S210, an operation and maintenance node is determined, wherein the operation and maintenance node includes at least one of a first physical machine, a virtual machine, an application container, a second physical machine, and a database container. For example, m virtual machines may be disposed on the first physical machine, n application containers may be disposed on each virtual machine, s database containers may be disposed on the second physical machine, each application container is in communication connection with at least one database container, and m, n, and s are integers greater than or equal to 1.
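The example layout (m virtual machines on the first physical machine, n application containers per virtual machine, s database containers on the second physical machine) can be captured in a small data structure. The following is a sketch only: the node naming scheme and the round-robin linking of application containers to database containers are assumptions, chosen so that every application container is connected to at least one database container as the text requires.

```python
def build_topology(m, n, s):
    """Build the example layout: m VMs on physical machine 1, n application
    containers per VM, and s database containers on physical machine 2."""
    if min(m, n, s) < 1:
        raise ValueError("m, n and s must be integers >= 1")
    databases = [f"db-{k}" for k in range(s)]
    topology = {"physical-1": {}, "physical-2": databases, "links": {}}
    for i in range(m):
        vm = f"vm-{i}"
        apps = [f"{vm}/app-{j}" for j in range(n)]
        topology["physical-1"][vm] = apps
        for idx, app in enumerate(apps):
            # Assumption: round-robin assignment, so each application
            # container is linked to at least one database container.
            topology["links"][app] = [databases[(i * n + idx) % s]]
    return topology
```

Every key in this structure (each physical machine, virtual machine, application container and database container) is a candidate operation and maintenance node for the later operations.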
In operation S220, a performance index of the operation and maintenance node is obtained.
As a possible implementation manner, as shown in fig. 3, the operation S220 of acquiring the performance index of the operation and maintenance node includes an operation S221.
In operation S221, the performance indicators corresponding to each type of operation and maintenance node are obtained according to the type of the node. It is to be understood that the first physical machine, the virtual machine, and the second physical machine may be server class operation and maintenance nodes, and the application container and the database container may be database class operation and maintenance nodes. The server class operation and maintenance node has a performance index corresponding to the server class operation and maintenance node, and the database class operation and maintenance node has a performance index corresponding to the database class operation and maintenance node.
In some specific examples, the performance indicators of the virtual machine, the first physical machine, and the second physical machine may include a speed of the central processing unit, a memory space, a read/write speed of the disk, and a traffic volume of the transmission control protocol.
In some specific examples, the performance indicators of the application container may include a transaction rate, a first response time, and a success rate. Here, the transaction rate can be understood as the number of external requests initiated by the application container; the first response time can be understood as the response time of the application container; and the success rate can be understood as the number of times the application container receives a response to an external request it has initiated.
In some specific examples, the performance indicators for the database container may include a concurrency rate, a second response time, and a database read-write speed. Wherein, the concurrency rate can be understood as how many threads work per second; the second response time may be understood as the response time of the database container.
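The per-type selection of performance indicators in operation S221 amounts to a lookup from node type to indicator names. The sketch below uses the indicators listed above; the identifier spellings, the `NodeType` enum, and the `read_metric` callback (standing in for whatever monitoring source is actually used) are hypothetical.

```python
from enum import Enum


class NodeType(Enum):
    FIRST_PHYSICAL_MACHINE = "first_physical_machine"
    VIRTUAL_MACHINE = "virtual_machine"
    SECOND_PHYSICAL_MACHINE = "second_physical_machine"
    APPLICATION_CONTAINER = "application_container"
    DATABASE_CONTAINER = "database_container"


# Indicators shared by the physical machines and the virtual machine.
SERVER_INDICATORS = ("cpu_speed", "memory_space", "disk_rw_speed", "tcp_traffic")

INDICATORS = {
    NodeType.FIRST_PHYSICAL_MACHINE: SERVER_INDICATORS,
    NodeType.VIRTUAL_MACHINE: SERVER_INDICATORS,
    NodeType.SECOND_PHYSICAL_MACHINE: SERVER_INDICATORS,
    NodeType.APPLICATION_CONTAINER: ("transaction_rate", "first_response_time", "success_rate"),
    NodeType.DATABASE_CONTAINER: ("concurrency_rate", "second_response_time", "db_rw_speed"),
}


def collect(node_type, read_metric):
    """Acquire only the indicators that apply to this node type."""
    return {name: read_metric(name) for name in INDICATORS[node_type]}
```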
In operation S230, an anomaly detection is performed on the performance index of the operation and maintenance node.
In operation S240, when the performance index is abnormal, an abnormality detection result is output. It can be understood that the performance indexes of the virtual machine, the first physical machine, and the second physical machine, that is, the speed of the central processing unit, the memory space, the disk read-write speed, and the traffic of the transmission control protocol, may be detected, and when any of these is detected to be abnormal, the abnormality detection result may be output.
The performance indexes of the application container, namely the transaction rate, the first response time and the success rate can be detected, and when the transaction rate, the first response time and the success rate are detected to be abnormal, the abnormal detection result can be output. The performance indexes of the database container, namely the concurrency rate, the second response time and the database reading and writing speed can be detected, and when the concurrency rate, the second response time and the database reading and writing speed are detected to be abnormal, an abnormal detection result can be output.
In operation S250, the anomaly detection result is verified by using the detection model, and a verification result is obtained. It should be noted that the detection model is provided with a verification parameter for the abnormal detection result, that is, a standard value mentioned below, and the verification result of comparing the abnormal detection result with the verification parameter can be obtained through the verification parameter.
In operation S260, when the verification result is that the abnormal detection result passes the verification of the detection model, the operation and maintenance node corresponding to the abnormal detection result is determined as an operation and maintenance fault node.
According to the method for mining the operation and maintenance fault node, at least one of the first physical machine, the virtual machine, the application container, the second physical machine and the database container is used as the operation and maintenance node, so that the method can integrate information collected through a plurality of channels, has the potential for mining historical patterns and for self-updating optimization, provides ideas and inspiration for fault diagnosis and emergency response in other applications and platforms, and lays a foundation for subsequently realizing automatic emergency handling of simple faults. In addition, operations S210 to S260 conveniently realize an integrated solution from monitoring to analysis to automatic diagnosis and decision assistance, and can resolve difficulties encountered in platform operation and maintenance practice. The present disclosure can improve the timeliness of fault emergency response, raise the intelligence level of operation and maintenance, and reduce the impact and risk of problems.
According to some embodiments of the present disclosure, as shown in fig. 4, the operation S230 of performing anomaly detection on the performance index of the operation and maintenance node includes an operation S231.
In operation S231, the dynamic baseline model performs anomaly detection on the performance index of the operation and maintenance node according to the determination rule. It can be understood that, a determination rule for performing anomaly detection on the performance index may be set in the dynamic baseline model, for example, the performance indexes of the virtual machine, the first physical machine, and the second physical machine, that is, the speed of the central processing unit, the memory space, the disk read-write speed, and the traffic of the transmission control protocol may be subjected to anomaly detection according to the determination rule. Specifically, for the speed of the central processing unit, the determination rule may be that the speed of the central processing unit is normal when the speed is higher than a speed threshold value, and is abnormal when the speed is lower than the speed threshold value; the determination rule can also be that the speed of the central processing unit is normal within a threshold range and is abnormal if not within the threshold range; of course, the speed determination rule for the cpu is not limited thereto, and is only illustrated here by way of example.
For the memory space, the judgment rule can be that the memory space is normal when the memory space is higher than a space threshold value and is abnormal when the memory space is lower than the space threshold value; the determination rule can also be that the memory space is normal within a threshold range and is abnormal if not within the threshold range; of course, the memory space determination rule is not limited thereto, and is only illustrated here by way of example.
For the disk read-write speed, the determination rule may be that the disk read-write speed is normal when it is higher than a read-write speed threshold and abnormal when it is lower than the read-write speed threshold; the determination rule may also be that the disk read-write speed is normal within a threshold range and abnormal if not within the threshold range; of course, the determination rule for the disk read-write speed is not limited thereto, and is only illustrated here by way of example.
For the traffic of the transmission control protocol, the judgment rule can be that the traffic of the transmission control protocol is normal when the traffic is higher than a traffic threshold value and is abnormal when the traffic is lower than the traffic threshold value; the determination rule may also be that the traffic of the transmission control protocol is normal within a threshold range and is abnormal if not within the threshold range; of course, the traffic decision rule for the tcp is not limited thereto, and is only illustrated here by way of example.
For another example, the performance indexes of the application container, that is, the transaction rate, the first response time and the success rate, may be subjected to anomaly detection according to the determination rule. Specifically, for the transaction rate, the determination rule may be that the transaction rate is normal when it is higher than a transaction threshold and abnormal when it is lower than the transaction threshold; the determination rule may also be that the transaction rate is normal within a threshold range and abnormal if not within the threshold range; of course, the determination rule for the transaction rate is not limited thereto, and is only illustrated here by way of example. The determination rules for the first response time and the success rate follow the same principle, and are not described in detail here.
For another example, the performance indexes of the database container, that is, the concurrency rate, the second response time, and the database read-write speed, may be subjected to anomaly detection according to the determination rule. Specifically, for the concurrency rate, the determination rule may be that the concurrency rate is normal when it is higher than a concurrency threshold and abnormal when it is lower than the concurrency threshold; the determination rule may also be that the concurrency rate is normal within a threshold range and abnormal if not within the threshold range; of course, the determination rule for the concurrency rate is not limited thereto, and is only illustrated here by way of example. The determination rules for the second response time and the database read-write speed follow the same principle, and are not described in detail here.
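The two kinds of determination rule described above (a single lower threshold, or a threshold range) can be expressed with one small structure. This is a sketch only: the class name `DeterminationRule` is hypothetical, and treating a value exactly equal to a bound as normal is an assumption, since the text does not say how boundary values are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeterminationRule:
    """A threshold rule for one performance index.

    lower only      -> normal when value >= lower, abnormal below it
    lower and upper -> normal inside [lower, upper], abnormal outside
    Boundary values count as normal (an assumption, not from the text).
    """
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_abnormal(self, value: float) -> bool:
        if self.lower is not None and value < self.lower:
            return True
        if self.upper is not None and value > self.upper:
            return True
        return False


# Single-threshold rule, e.g. for the transaction rate (values are made up).
transaction_rule = DeterminationRule(lower=50.0)
# Range rule, e.g. for the traffic of the transmission control protocol.
traffic_rule = DeterminationRule(lower=10.0, upper=900.0)
```

The same structure applies to every index in the three node classes; only the bounds differ.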
As shown in fig. 5, the method for mining the operation and maintenance fault node further includes operation S310.
In operation S310, when the verification result is that the abnormality detection result fails the verification of the detection model, the dynamic baseline model optimizes the determination rule. It can be understood that when the verification result is that the abnormal detection result fails to pass the verification of the detection model, it indicates that the abnormal detection result may be incorrect and there is a possibility of false detection, so that optimizing the determination rule of the dynamic baseline model can improve the accuracy of the dynamic baseline model in performing the abnormal detection on the performance index of the operation and maintenance node.
Fig. 6 schematically shows a flowchart for verifying an abnormal detection result by using a detection model according to an embodiment of the present disclosure.
Operation S250 of verifying the abnormality detection result using the detection model to obtain a verification result includes operation S251 and operation S252.
In operation S251, a standard value is set based on the performance index of the operation and maintenance node. It should be noted that each performance index of the operation and maintenance node has a corresponding standard value, for example, the speed of the central processing unit has a corresponding standard value, the memory space has a corresponding standard value, the read-write speed of the magnetic disk has a corresponding standard value, the traffic of the transmission control protocol has a corresponding standard value, and the transaction rate, the first response time, the success rate, the concurrency rate, the second response time, and the read-write speed of the database also have corresponding standard values, which is not described herein any more. The standard value may be a value or a range of values empirically determined by an expert.
In operation S252, the abnormality detection result is compared with the standard value to obtain a verification result. For example, when the speed of the central processing unit is abnormal, the speed value of the central processing unit is used as an abnormal detection result, and the abnormal detection result is compared with a standard value, so that a verification result can be obtained, wherein the verification result can be that the abnormal detection result is abnormal relative to the standard value, and at this time, the verification result is that the abnormal detection result passes the verification of the detection model, so that the operation and maintenance node corresponding to the abnormal detection result can be determined as an operation and maintenance fault node; the verification result can be that the abnormal detection result is normal relative to the standard value, at this time, the verification result is that the abnormal detection result fails to pass the verification of the detection model, and the dynamic baseline model optimizes the judgment rule.
Similarly, the verification results obtained by comparing the abnormal detection results of the memory space, the disk read-write speed, the traffic of the transmission control protocol, the transaction rate, the first response time, the success rate, the concurrency rate, the second response time, and the database read-write speed with their respective standard values follow the same principle and are not repeated here. Operation S251 and operation S252 make it convenient to verify the anomaly detection result by using the detection model, so as to obtain the verification result.
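The comparison in operations S251 and S252 and the two possible outcomes (operation S260 or operation S310) can be sketched as follows. The sketch assumes each standard value is a numeric range, which the text allows ("a value or a range of values"); the type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AnomalyDetectionResult:
    node: str      # operation and maintenance node that produced the value
    index: str     # name of the performance index
    value: float   # value flagged as abnormal by the dynamic baseline model


def passes_verification(result: AnomalyDetectionResult,
                        standard_values: Dict[str, Tuple[float, float]]) -> bool:
    """S252: the result passes when it is also abnormal relative to the standard value."""
    low, high = standard_values[result.index]
    return not (low <= result.value <= high)


def next_action(result: AnomalyDetectionResult,
                standard_values: Dict[str, Tuple[float, float]]) -> str:
    """Map the verification result to the follow-up operation."""
    if passes_verification(result, standard_values):
        return "S260: mark node as operation and maintenance fault node"
    return "S310: optimize the determination rule of the dynamic baseline model"
```

A result that the baseline model flagged but that lies inside the expert range is treated as a likely false detection, which is what triggers rule optimization.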
In some embodiments of the present disclosure, as shown in fig. 7, the operation S220 of obtaining the performance index of the operation and maintenance node includes an operation S222 of: and acquiring the performance index of the operation and maintenance node in the time period t, wherein the performance index in the time period t forms a performance trend.
As shown in fig. 8, the operation S230 performs anomaly detection on the performance index of the operation and maintenance node, and obtaining an anomaly detection result includes operation S232: and carrying out anomaly detection on the performance trend of the operation and maintenance node to obtain an anomaly detection result.
As shown in fig. 9, the operation S251 of setting the criterion value based on the performance index of the operation and maintenance node includes an operation S2511: and setting a standard value based on the performance trend of the operation and maintenance node.
It is understood that, owing to external environmental factors (such as specific activities or the implementation of company policies), the performance index at a certain moment may differ from the normal value or fall outside the normal range. For example, when a popular financial product goes on sale, the traffic of the transmission control protocol of the financial application at that moment may be so large that it exceeds the normal value or normal range, but this does not mean that the performance index is abnormal. Therefore, the performance index of the operation and maintenance node is acquired over the time period t, and the performance index over the time period t forms a performance trend; anomaly detection is then performed on the performance trend of the operation and maintenance node to obtain the anomaly detection result, so that false detection of performance indexes caused by external environmental factors can be avoided. Setting the standard value based on the performance trend of the operation and maintenance node likewise improves the accuracy with which the detection model verifies the anomaly detection result.
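The disclosure does not state how a performance trend is evaluated. One possible reading, shown here purely as an assumption, derives the standard range from a historical trend (mean plus or minus `k` standard deviations) and reports an anomaly only when a sufficient share of the samples in the time period t fall outside that range, so that a single momentary spike is not flagged. The parameters `k` and `min_ratio` are illustrative.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def standard_range_from_trend(trend: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """S2511 (sketch): derive a standard range from a historical performance trend."""
    center = mean(trend)
    spread = pstdev(trend)
    return center - k * spread, center + k * spread


def trend_is_abnormal(current: Sequence[float],
                      historical: Sequence[float],
                      k: float = 3.0,
                      min_ratio: float = 0.5) -> bool:
    """S232 (sketch): flag the trend only if enough samples leave the standard range."""
    if not current:
        raise ValueError("the trend for time period t must contain at least one sample")
    low, high = standard_range_from_trend(historical, k)
    outside = sum(1 for value in current if not low <= value <= high)
    return outside / len(current) >= min_ratio
```

With this rule, one burst of traffic inside an otherwise ordinary window leaves the trend classified as normal, while a window that sits almost entirely outside the historical range is reported.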
In some embodiments of the present disclosure, as shown in fig. 10, the method of mining the operation and maintenance fault node may further include operation S410.
In operation S410, when the verification result is that the anomaly detection result passes the verification of the detection model, the anomaly detection result and the operation and maintenance fault node corresponding to the anomaly detection result are sent. Therefore, the operation and maintenance fault node corresponding to the abnormal detection result and the abnormal detection result can be conveniently and timely known by the staff, and quick repair is realized.
Fig. 11 schematically shows a flowchart of a method for mining an operation and maintenance fault node according to an embodiment of the present disclosure.
The method of mining the operation and maintenance fault node further includes operation S510.
In operation S510, the performance indexes, the verification result corresponding to each performance index, and the operation and maintenance node corresponding to each performance index are displayed in a view and/or a report.
As a possible implementation manner, as shown in fig. 12, operation S510 of displaying, in a view and/or a report, a plurality of performance indexes, the verification result corresponding to each performance index, and the operation and maintenance node corresponding to each performance index includes operations S511 to S513.
In operation S511, a first icon of a plurality of performance indicators is presented. For example, there may be a plurality of performance indicators for the first physical machine, m virtual machines on the first physical machine, n application containers on each virtual machine, the second physical machine, and s database containers on the second physical machine, and each performance indicator may correspond to a first icon, and the first icon is an icon showing the performance indicator. Of course, the performance indicators may also be classified, and each type of performance indicator corresponds to a first icon.
In operation S512, a first icon of a performance index corresponding to the verification result is rendered according to the verification result. It should be noted that, when the verification result is that the abnormal detection result passes the verification of the detection model, the performance index corresponding to the verification result is abnormal, so that the abnormal performance index can be rendered into a first color, which may be red, yellow, blue, gray, and the like, and is not limited specifically herein; when the verification result is that the abnormal detection result fails to pass the verification of the detection model, the performance index corresponding to the verification result is normal, so that the normal performance index can be rendered into a second color, which may be red, yellow, blue, gray, and the like, and is not limited herein. The first color and the second color are different in order to distinguish between a normal performance indicator and an abnormal performance indicator.
In operation S513, in response to a click request for a first icon, an operation and maintenance node corresponding to the first icon is presented. Therefore, the multiple performance indexes, the verification result corresponding to each performance index, and the operation and maintenance node corresponding to each performance index can be conveniently displayed in a view and/or a report form through operations S511 to S513.
According to some embodiments of the present disclosure, as shown in fig. 13 and 14, the operation S511 of displaying the first icon of the plurality of performance indicators includes an operation S5111 of: a plurality of performance indicators are classified, and the performance indicators of the same class are displayed by a first icon. For example, a plurality of performance indexes may be provided for m virtual machines on the first physical machine, n application containers on each virtual machine, and s database containers on the second physical machine, where the first physical machine, the m virtual machines, and the second physical machine have the same performance index, the n application containers on each virtual machine have the same performance index, and the s database containers on the second physical machine have the same performance index, and at this time, the same performance indexes may be classified into one group.
For example, the speed of the central processing unit may be used as a class of performance indexes, and the operation and maintenance nodes having the class of performance indexes are the first physical machine, the m virtual machines on the first physical machine, and the second physical machine; the memory space can be used as a class of performance indexes, and the operation and maintenance nodes with the class of performance indexes are a first physical machine, m virtual machines on the first physical machine and a second physical machine; the transaction rate can be used as a type of performance index, and the operation and maintenance nodes with the type of performance index are n application containers on each virtual machine.
The disk read-write speed, the traffic of the transmission control protocol, the first response time, the success rate, the concurrency rate, the second response time, and the database read-write speed are classified in the same way as the speed of the central processing unit, the memory space and the transaction rate, and are not described in detail herein.
As shown in fig. 14 and 15, the operation S513 of presenting, in response to a click request for the first icon, the operation and maintenance node corresponding to the first icon includes operation S5131: in response to a click request for a first icon of a given class, displaying the g operation and maintenance nodes corresponding to that first icon in the form of second icons or a report, where g is an integer greater than or equal to 1.
Therefore, operation S5111 and operation S5131 display each operation and maintenance node intuitively, with the key points highlighted and a clear structure, so that staff of different specialties and at different levels can grasp the fault situation more quickly, which improves the efficiency of locating and resolving faults.
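Operations S5111, S512 and S5131 can be pictured as grouping records by index class, coloring each group, and resolving a click to the nodes in the group. The sketch below is an illustration under assumptions: the record layout, the function names, and the choice of red and gray for the first and second colors are not taken from the disclosure (the text only requires that the two colors differ).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

FIRST_COLOR = "red"    # abnormal performance index (color choice is illustrative)
SECOND_COLOR = "gray"  # normal performance index

# One record per (node, index class, whether the anomaly passed model verification).
Record = Tuple[str, str, bool]


def build_first_icons(records: Iterable[Record]) -> Dict[str, dict]:
    """S5111 + S512: one first icon per index class, colored by verification result."""
    nodes: Dict[str, List[str]] = defaultdict(list)
    abnormal: Dict[str, bool] = defaultdict(bool)
    for node, index, verified_abnormal in records:
        nodes[index].append(node)
        abnormal[index] = abnormal[index] or verified_abnormal
    return {
        index: {
            "color": FIRST_COLOR if abnormal[index] else SECOND_COLOR,
            "nodes": nodes[index],
        }
        for index in nodes
    }


def on_click(icons: Dict[str, dict], index: str) -> List[str]:
    """S5131: return the g nodes (g >= 1) behind the clicked first icon."""
    return icons[index]["nodes"]
```

Here a class is colored as abnormal when any node in it has a verified anomaly; a real view might instead color the second icons per node.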
The method for mining operation and maintenance fault nodes according to the embodiment of the disclosure is described in detail below with reference to fig. 16 to 18. It is to be understood that the following description is illustrative only and is not intended to be in any way limiting of the present disclosure.
An application-oriented configuration topology is constructed by periodically triggering collection scripts that query configuration management, the cloud management platform and the like. The structure of the configuration topology is explained in detail with reference to fig. 16, and the specific method includes the following steps one to four.
Step one: acquire the virtual machines corresponding to each computing node from the OpenStack (open source cloud computing management platform) system.
Step two: acquire the virtual machine address corresponding to each cloud container from the HMP (holographic monitoring platform) system.
Step three: periodically send a command from the database master to check the persistent connections, that is, the cloud container addresses connected to each database container.
Step four: acquire the account information of each database container, that is, the address information of the host corresponding to the database container, from the DMP (data management platform), and finally construct the application-oriented configuration topology.
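Steps one to four each yield a set of pairwise relations, and the topology is their union. A minimal sketch, assuming each source has already been queried and reduced to `(from_address, to_address)` pairs; the function name and the sample addresses are hypothetical, and no real OpenStack, HMP or DMP API is called here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]  # (from_address, to_address) as reported by one source


def build_configuration_topology(*link_sources: Iterable[Link]) -> Dict[str, List[str]]:
    """Merge the relations collected in steps one to four into one adjacency map."""
    adjacency = defaultdict(set)
    for source in link_sources:
        for origin, target in source:
            adjacency[origin].add(target)
    # Sort for a stable, reproducible view of the topology.
    return {origin: sorted(targets) for origin, targets in adjacency.items()}


topology = build_configuration_topology(
    [("compute-1", "vm-1"), ("compute-1", "vm-2")],  # step one: compute node -> VMs
    [("vm-1", "app-1"), ("vm-2", "app-2")],          # step two: VM -> cloud containers
    [("app-1", "db-1"), ("app-2", "db-1")],          # step three: container -> database container
    [("db-1", "host-1")],                            # step four: database container -> host
)
```

Because every relation is keyed by address, performance indexes collected later can be attached to the same nodes by IP.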
The performance index conditions and data sources collected at each configuration node are different, and they will be described in detail below with reference to fig. 17.
Step one: acquire performance indexes such as QPS (queries per second), response time and database I/O (input/output) of each database container from the DMP platform.
Step two: acquire performance indexes such as transaction rate, response time and success rate of each cloud container from the HMP system.
Step three: acquire the performance indexes of each server, such as the CPU, memory, disk I/O and TCP traffic of the virtual machines, computing nodes and hosts, from the operation and maintenance platform system.
The general idea of the design of the present disclosure is described in detail with reference to fig. 18, and the specific method includes the following steps one to twelve.
The method comprises the following steps: an application-oriented configuration topology is constructed by triggering acquisition scripts at regular time, inquiring configuration management, a cloud management platform and the like.
Step two: and collecting the performance index condition on each configuration node.
Step three: and (4) forming historical modeling data by carrying out IP granularity convergence association configuration topology on each performance index collected before.
Step four: the historical modeling data contains a large amount of index information, but the index meanings of the historical modeling data are different, and if the historical modeling data is not subjected to uniform warehousing management, the statistical analysis of the data is not facilitated. Therefore, the index information is pre-processed in time series (data is collected and gathered for a period of time), and then is imported into a data warehouse, so as to lay a foundation for subsequent data mining.
Step five: and (3) realizing the abnormal detection of each performance index by adopting a parameter-free and threshold-free self-learning dynamic baseline algorithm.
Step six: and verifying whether the abnormal detection result obtained by the dynamic baseline algorithm is reasonable or not by using model detection, wherein the detailed steps of the model detection method are shown in the tenth.
Step seven: and extracting the monitoring alarm information within the last 5 minutes, matching the IP of the monitoring alarm information with the nodes of the configuration topology, and finding out the nodes suspected to have abnormality.
Step eight: and inquiring the performance trend condition of the suspected abnormal node IP according to a predefined expert rule, and judging whether the suspected abnormal node IP is abnormal or not.
Step nine: and (4) assisting fault positioning according to expert rules, comparing the result with the abnormal detection result obtained by the model, and if the result is similar, passing the model detection, otherwise, not passing the model detection.
Step ten: if the model is checked, the model can dig out the association relation according to the position information of the abnormal index in the configuration topology to obtain an intelligent diagnosis result and push the intelligent diagnosis result in a short message.
Step eleven: if the data does not pass the model test, the parameters can be automatically updated to carry out optimization and reconstruction according to the change of the data characteristics, and the accuracy of the data is further improved.
Step twelve: and feeding back the intelligent diagnosis result to a WEB foreground, and carrying out dynamic view display on results obtained by data analysis and data mining in the modes of view, report and the like.
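Step seven above (matching recent alarm IPs against the configuration topology) can be sketched as a time-window filter followed by a set intersection. The alarm layout `(timestamp, ip)` and the function name are assumptions made for illustration; the 5-minute window comes from the text.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Alarm = Tuple[datetime, str]  # (time the alarm was raised, IP it refers to)


def suspected_abnormal_nodes(alarms: Iterable[Alarm],
                             topology_ips: Iterable[str],
                             now: datetime,
                             window: timedelta = timedelta(minutes=5)) -> List[str]:
    """Return topology node IPs that appear in alarms raised within the window."""
    recent_ips = {
        ip for raised_at, ip in alarms
        if timedelta(0) <= now - raised_at <= window
    }
    return sorted(recent_ips & set(topology_ips))
```

The nodes returned here are only suspects; in steps eight and nine they are still checked against the expert rules before being compared with the model's result.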
The present disclosure employs the following techniques:
1. the JAVA programming language.
JAVA is an object-oriented programming language. It absorbs many advantages of the C++ language while abandoning concepts in C++ that are difficult to understand, such as multiple inheritance and pointers, so that the JAVA language is both powerful and easy to use. As a representative static object-oriented programming language, JAVA implements object-oriented theory well and allows programmers to carry out complex programming in an elegant way of thinking. JAVA is characterized by simplicity, object orientation, distribution, robustness, security, platform independence and portability, multithreading, dynamism and the like. JAVA can be used to write desktop applications, WEB applications, distributed systems, embedded system applications and so on.
2. MySQL database systems.
MySQL is a relational database management system developed by the Swedish company MySQL AB and is now a product of Oracle. MySQL is one of the most popular relational database management systems, and for WEB applications it is one of the best RDBMS (Relational Database Management System) application software. As a relational database management system, MySQL keeps data in different tables instead of putting all the data in one large repository, which increases speed and flexibility. The SQL language used by MySQL is the most common standardized language for accessing databases. MySQL adopts a dual-licensing policy and is divided into a community edition and a commercial edition. Because of its small size, high speed, low total cost of ownership and, in particular, its open source code, MySQL is generally chosen as the website database for the development of small and medium-sized websites.
3. HTML (HyperText Markup Language).
HTML, known as hypertext markup language, is a markup language. The document format on the network can be unified through the labels, so that the scattered Internet resources are connected into a logic whole. HTML text is descriptive text consisting of HTML commands that can specify words, graphics, animations, sounds, tables, links, etc. Hypertext is a way to organize information by associating words and diagrams in text with other information media through a hyperlink method. These interrelated information media may be in the same text, may be other files, or may be files on a computer that is geographically remotely located. The information resources distributed at different positions are connected in a random mode by the information organization mode, and convenience is provided for people to search and retrieve information.
The method realizes an integrated solution from monitoring to analysis to automatic diagnosis and decision assistance, and resolves difficulties encountered in actual platform operation and maintenance. In terms of implementation, the core algorithm integrates information collected through a plurality of channels, has the potential for mining historical patterns and for self-updating optimization, provides ideas and inspiration for fault diagnosis and emergency response in other applications and platforms, and lays a foundation for subsequently realizing automatic emergency handling of simple faults. In terms of the product itself, the display view is intuitive, highlights the key points and is clearly structured, so that staff of different specialties and at different levels can grasp the fault situation more quickly. In conclusion, the disclosure delivers value by improving the timeliness of fault emergency response, raising the intelligence level of operation and maintenance, and reducing the impact and risk of problems.
Based on the method for mining the operation and maintenance fault node, the present disclosure also provides a device 10 for mining the operation and maintenance fault node. The device 10 for mining operation and maintenance fault nodes will be described in detail below with reference to fig. 19.
Fig. 19 schematically shows a block diagram of the apparatus 10 for mining an operation and maintenance fault node according to an embodiment of the present disclosure.
The device 10 for mining operation and maintenance fault nodes comprises a determining module 1, an obtaining module 2, a detecting module 3, an output module 4, a verifying module 5 and a fault determining module 6.
The determining module 1 is configured to perform operation S210: determining an operation and maintenance node, wherein the operation and maintenance node comprises at least one of a first physical machine, a virtual machine, an application container, a second physical machine and a database container.
The obtaining module 2 is configured to perform operation S220: acquiring the performance index of the operation and maintenance node.
The detecting module 3 is configured to perform operation S230: performing anomaly detection on the performance index of the operation and maintenance node.
The output module 4 is configured to perform operation S240: outputting an abnormal detection result when the performance index is abnormal.
The verifying module 5 is configured to perform operation S250: verifying the abnormal detection result by using the detection model to obtain a verification result.
The fault determining module 6 is configured to perform operation S260: when the verification result is that the abnormal detection result passes the verification of the detection model, determining the operation and maintenance node corresponding to the abnormal detection result as an operation and maintenance fault node.
According to the device for mining the operation and maintenance fault node, at least one of the first physical machine, the virtual machine, the application container, the second physical machine and the database container is used as the operation and maintenance node, so that the device can integrate information collected through a plurality of channels, has the potential for mining historical patterns and for self-updating optimization, provides ideas and inspiration for fault diagnosis and emergency response in other applications and platforms, and lays a foundation for subsequently realizing automatic emergency handling of simple faults. In addition, the device conveniently realizes an integrated solution from monitoring to analysis to automatic diagnosis and decision assistance, and can resolve difficulties encountered in actual platform operation and maintenance. The present disclosure can improve the timeliness of fault emergency response, raise the intelligence level of operation and maintenance, and reduce the impact and risk of problems.
In addition, according to the embodiment of the present disclosure, any plurality of the determination module 1, the acquisition module 2, the detection module 3, the output module 4, the verification module 5, and the failure determination module 6 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module.
According to an embodiment of the present disclosure, at least one of the determination module 1, the acquisition module 2, the detection module 3, the output module 4, the verification module 5 and the failure determination module 6 may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, or an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of software, hardware and firmware, or in an appropriate combination of any of them.
Alternatively, at least one of the determination module 1, the acquisition module 2, the detection module 3, the output module 4, the verification module 5 and the failure determination module 6 may be at least partially implemented as a computer program module, which when executed may perform a corresponding function.
Fig. 20 schematically illustrates a block diagram of an electronic device adapted to implement the method of mining an operation and maintenance fault node according to an embodiment of the present disclosure.
As shown in Fig. 20, an electronic device 900 according to an embodiment of the present disclosure includes a processor 901 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. Processor 901 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 901 may also include on-board memory for caching purposes. The processor 901 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. The processor 901 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the programs may also be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 900 may also include an input/output (I/O) interface 905, which is also connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage section 908 as necessary.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 902 and/or the RAM 903 described above and/or one or more memories other than the ROM 902 and the RAM 903.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. The program code is for causing a computer system to perform the methods of the embodiments of the disclosure when the computer program product is run on the computer system.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 901. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal on a network medium, and downloaded and installed through the communication section 909 and/or installed from the removable medium 911. The program code contained in the computer program may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the processor 901, performs the above-described functions defined in the system of the embodiment of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In accordance with embodiments of the present disclosure, program code for executing computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages, and in particular, these computer programs may be implemented using high level procedural and/or object oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, C, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined and/or associated in various ways, even if such combinations or associations are not expressly recited in the present disclosure. In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined and/or associated in various ways without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (16)

1. A method for mining operation and maintenance fault nodes, characterized by comprising the following steps:
determining an operation and maintenance node, wherein the operation and maintenance node comprises at least one of a first physical machine, a virtual machine, an application container, a second physical machine and a database container;
acquiring a performance index of the operation and maintenance node;
performing anomaly detection on the performance index of the operation and maintenance node;
when the performance index is abnormal, outputting an anomaly detection result;
verifying the anomaly detection result by using a detection model to obtain a verification result; and
when the verification result is that the anomaly detection result passes the verification of the detection model, determining the operation and maintenance node corresponding to the anomaly detection result as an operation and maintenance fault node.
2. The method of claim 1, wherein the performing anomaly detection on the performance index of the operation and maintenance node comprises: performing, by a dynamic baseline model, anomaly detection on the performance index of the operation and maintenance node according to a judgment rule;
the method further comprises: when the verification result is that the anomaly detection result does not pass the verification of the detection model, optimizing, by the dynamic baseline model, the judgment rule.
3. The method of claim 1, wherein the verifying the anomaly detection result by using the detection model comprises:
setting a standard value based on the performance index of the operation and maintenance node; and
comparing the anomaly detection result with the standard value to obtain the verification result.
4. The method of claim 3, wherein:
the acquiring the performance index of the operation and maintenance node comprises: acquiring the performance index of the operation and maintenance node within a time period t, wherein the performance index within the time period t forms a performance trend;
the performing anomaly detection on the performance index of the operation and maintenance node comprises: performing anomaly detection on the performance trend of the operation and maintenance node to obtain the anomaly detection result; and
the setting a standard value based on the performance index of the operation and maintenance node comprises: setting the standard value based on the performance trend of the operation and maintenance node.
5. The method of claim 1, wherein the acquiring the performance index of the operation and maintenance node comprises:
acquiring, according to the type of each operation and maintenance node, the performance index corresponding to the operation and maintenance node of that type.
6. The method of claim 1, wherein the performance indicators of the virtual machine, the first physical machine, and the second physical machine each include a speed of a central processing unit, a memory space, a disk read-write speed, and a traffic of a transmission control protocol.
7. The method of claim 1, wherein the performance metrics of the application container include a transaction rate, a first response time, and a success rate.
8. The method of claim 1, wherein the performance metrics of the database container include a concurrency rate, a second response time, and a database read-write speed.
9. The method of claim 1, further comprising:
when the verification result is that the anomaly detection result passes the verification of the detection model, sending the anomaly detection result and the operation and maintenance fault node corresponding to the anomaly detection result.
10. The method according to any one of claims 1-9, further comprising:
displaying a plurality of the performance indicators, the verification result corresponding to each performance indicator and the operation and maintenance node corresponding to each performance indicator in the form of a view and/or a report.
11. The method according to claim 10, wherein the displaying the plurality of performance indicators, the verification result corresponding to each performance indicator, and the operation and maintenance node corresponding to each performance indicator in the form of a view and/or a report comprises:
displaying first icons of the plurality of performance indicators;
rendering, according to the verification result, the first icon of the performance indicator corresponding to the verification result; and
in response to a click request on the first icon, displaying the operation and maintenance node corresponding to the first icon.
12. The method of claim 11, wherein:
the displaying first icons of the plurality of performance indicators comprises: classifying the plurality of performance indicators, the performance indicators of the same class being shown by one of the first icons; and
the displaying the operation and maintenance node corresponding to the first icon in response to the click request on the first icon comprises: in response to a click request on the first icon of a given class, displaying the g operation and maintenance nodes corresponding to that first icon in the form of second icons or a report, wherein g is an integer greater than or equal to 1.
13. An apparatus for mining operation and maintenance fault nodes, comprising:
a determining module, configured to determine an operation and maintenance node, wherein the operation and maintenance node comprises at least one of a first physical machine, a virtual machine, an application container, a second physical machine and a database container;
an acquisition module, configured to acquire a performance index of the operation and maintenance node;
a detection module, configured to perform anomaly detection on the performance index of the operation and maintenance node;
an output module, configured to output an anomaly detection result when the performance index is abnormal;
a verification module, configured to verify the anomaly detection result by using a detection model to obtain a verification result; and
a fault determining module, configured to determine the operation and maintenance node corresponding to the anomaly detection result as an operation and maintenance fault node when the verification result is that the anomaly detection result passes the verification of the detection model.
14. An electronic device, comprising:
one or more processors;
one or more memories for storing executable instructions that, when executed by the one or more processors, implement the method of any of claims 1-12.
15. A computer-readable storage medium having stored thereon executable instructions that when executed by a processor implement a method according to any one of claims 1 to 12.
16. A computer program product comprising a computer program comprising one or more executable instructions which, when executed by a processor, implement a method according to any one of claims 1 to 12.
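As an illustration of how the dynamic baseline of claim 2 and the trend-based standard value of claims 3 and 4 might interact, the following Python sketch may help. It is a sketch under stated assumptions, not the claimed implementation: the judgment rule is assumed to be a k-sigma deviation from the mean of the trend, the standard value is assumed to be 1.2 times the peak of the trend, "optimizing" the rule is assumed to mean relaxing k by a fixed step, and only high-side anomalies are verified. All names are hypothetical.

```python
from statistics import mean, pstdev


class DynamicBaseline:
    """Hypothetical dynamic baseline: flags a reading that deviates from the
    mean of the recent trend by more than k standard deviations."""

    def __init__(self, k=2.0, step=0.5):
        self.k = k        # parameter of the assumed judgment rule
        self.step = step  # how far to relax the rule after a rejected alert

    def detect(self, trend, reading):
        mu, sigma = mean(trend), pstdev(trend)
        return abs(reading - mu) > self.k * sigma

    def optimize(self):
        # Claim 2: adjust the judgment rule when verification rejects an alert.
        self.k += self.step


def standard_value(trend):
    """Claims 3 and 4: a standard value set from the performance trend
    (assumed here to be 1.2 times the peak of the trend)."""
    return 1.2 * max(trend)


def check(baseline, trend, reading):
    """Return True only when an anomaly is detected and passes verification."""
    if not baseline.detect(trend, reading):
        return False
    if reading > standard_value(trend):
        return True
    baseline.optimize()  # treated as a false alarm: relax the rule
    return False
```

With a trend of `[50, 52, 48, 51, 49]`, a reading of 54 is flagged by the baseline but does not exceed the standard value, so the rule is relaxed; a reading of 80 is flagged and verified.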
CN202210051440.4A 2022-01-17 2022-01-17 Method, device, electronic equipment and medium for mining operation and maintenance fault node Pending CN114416449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210051440.4A CN114416449A (en) 2022-01-17 2022-01-17 Method, device, electronic equipment and medium for mining operation and maintenance fault node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210051440.4A CN114416449A (en) 2022-01-17 2022-01-17 Method, device, electronic equipment and medium for mining operation and maintenance fault node

Publications (1)

Publication Number Publication Date
CN114416449A true CN114416449A (en) 2022-04-29

Family

ID=81274132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210051440.4A Pending CN114416449A (en) 2022-01-17 2022-01-17 Method, device, electronic equipment and medium for mining operation and maintenance fault node

Country Status (1)

Country Link
CN (1) CN114416449A (en)

Similar Documents

Publication Publication Date Title
US11847574B2 (en) Systems and methods for enriching modeling tools and infrastructure with semantics
US10936479B2 (en) Pluggable fault detection tests for data pipelines
US10977293B2 (en) Technology incident management platform
US11562304B2 (en) Preventative diagnosis prediction and solution determination of future event using internet of things and artificial intelligence
CN110351150B (en) Fault source determination method and device, electronic equipment and readable storage medium
US11457029B2 (en) Log analysis based on user activity volume
US20220083531A1 (en) Abnormal event analysis
US11171835B2 (en) Automated generation of an information technology asset ontology
CN111754123B (en) Data monitoring method, device, computer equipment and storage medium
CN112527774A (en) Data center building method and system and storage medium
CN112927082A (en) Credit risk prediction method, apparatus, device, medium, and program product
US20210392144A1 (en) Automated and adaptive validation of a user interface
CN114205216A (en) Root cause positioning method and device for micro-service fault, electronic equipment and medium
CN113420935A (en) Fault location method, apparatus, device and medium
JP2021503652A (en) Automatically connect external data to business analysis processing
US9104573B1 (en) Providing relevant diagnostic information using ontology rules
CN114416449A (en) Method, device, electronic equipment and medium for mining operation and maintenance fault node
CN114281586A (en) Fault determination method and device, electronic equipment and computer readable storage medium
US11494416B2 (en) Automated event processing system
US11392375B1 (en) Optimizing software codebases using advanced code complexity metrics
CN115238292A (en) Data security management and control method and device, electronic equipment and storage medium
CN115550141A (en) Event processing method and device, electronic equipment and readable storage medium
CN114416422A (en) Problem locating method, apparatus, device, medium and program product
US20220086183A1 (en) Enhanced network security based on inter-application data flow diagrams
CN113064834A (en) Abnormality detection method, abnormality detection apparatus, electronic device, abnormality detection medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination