CN116932269A - Fault processing method, device, electronic equipment and computer storage medium - Google Patents


Info

Publication number
CN116932269A
Authority
CN
China
Prior art keywords
execution
node
target
nodes
basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310946690.9A
Other languages
Chinese (zh)
Inventor
任振锋
李逶
刘林新
吴少红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310946690.9A
Publication of CN116932269A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0793 Remedial or corrective actions

Abstract

The disclosure provides a fault processing method, a fault processing device, electronic equipment and a computer storage medium, which can be applied to the technical fields of big data, blockchain and financial technology. The method comprises the following steps: determining M basic execution nodes related to a target execution node to be detected, wherein the basic execution nodes are used for executing specific data processing operations in a distributed system, and M is greater than or equal to 2; acquiring respective first failure rates of the M basic execution nodes; calculating a second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes according to the association relation between the M basic execution nodes and the target execution node; and determining a fault handling policy matched with the target execution node according to the second failure rate.

Description

Fault processing method, device, electronic equipment and computer storage medium
Technical Field
The present disclosure relates to the field of big data technology, the field of blockchain technology, and the field of financial technology, and more particularly, to a fault handling method, a device, an electronic apparatus, and a computer storage medium.
Background
With the continuous development of computer technology, distributed systems are increasingly being used in enterprise business frameworks. For example, a basic application is built on the underlying distributed hardware and operating system, and then a complex business system is built on the basic application. On the product side, business systems integrate hardware and software products from multiple sources; on the development side, a business system or business function is jointly developed and maintained by a plurality of departments.
In the course of realizing the inventive concept of the present disclosure, the inventors found that the prior art has at least the following technical problem: because a distributed business system involves complex software products, hardware and developers spread across different geographical areas, fault processing for the distributed system is inefficient, fault detection accuracy is low, and detection is difficult.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a fault handling method, apparatus, electronic device, and computer storage medium.
According to a first aspect of the present disclosure, there is provided a fault handling method, comprising:
determining M basic execution nodes related to a target execution node to be detected, wherein the basic execution nodes are used for executing specific data processing operations in a distributed system, and M is greater than or equal to 2;
acquiring respective first failure rates of the M basic execution nodes;
calculating a second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes according to the association relation between the M basic execution nodes and the target execution node; and
determining a fault handling policy matched with the target execution node according to the second failure rate.
According to an embodiment of the present disclosure, calculating, according to an association relationship between M base execution nodes and a target execution node, a second failure rate of the target execution node using respective first failure rates of the M base execution nodes includes:
determining a target link comprising the M basic execution nodes and the target execution node, wherein the target execution node is located at the uppermost layer of the target link, and the basic execution nodes are located at the lowermost layer of the target link; and
based on the association relation, calculating a second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes in the order from the lowest layer to the uppermost layer.
According to an embodiment of the present disclosure, wherein calculating, based on the association relation, the second failure rate of the target execution node using the respective first failure rates of the M base execution nodes in order from the lowest layer to the uppermost layer includes:
For the lowest layer of the target link,
determining an upper execution node connected with at least two basic execution nodes;
under the condition that the upper execution node is not the target execution node, determining the connection relation type between at least two basic execution nodes and the upper execution node; and
calculating the failure rate of the upper execution node by using the respective first failure rates of the at least two basic execution nodes based on the connection relation type.
According to an embodiment of the present disclosure, calculating, based on the connection relationship type, the failure rate of the upper execution node using the respective first failure rates of the at least two basic execution nodes includes:
under the condition that the connection relation type is determined to be a first connection relation type, processing the respective first failure rates of the at least two basic execution nodes by using a first failure calculation mode to obtain the failure rate of the upper execution node, wherein the first connection relation type characterizes that the data processing operation executed by the upper execution node is determined by the data processing operation executed by one of the at least two basic execution nodes;
and under the condition that the connection relation type is determined to be a second connection relation type, processing the respective first failure rates of the at least two basic execution nodes by using a second failure calculation mode to obtain the failure rate of the upper execution node, wherein the second connection relation type characterizes that the data processing operation executed by the upper execution node is determined by the data processing operations executed by the at least two basic execution nodes.
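By way of illustration only, the two failure calculation modes may be written out as follows. The passage above does not state the formulas, so the expressions below are an assumption: they take the failures of the k basic execution nodes to be statistically independent, read the first connection relation type as "any one working basic execution node is sufficient", and read the second as "every basic execution node is required". The symbols F_i (first failure rate of the i-th basic execution node) and F_upper (failure rate of the upper execution node) are introduced here for the sketch.

```latex
% First connection relation type (assumed): the upper node fails only if all k basic nodes fail
F_{\text{upper}}^{(1)} = \prod_{i=1}^{k} F_i

% Second connection relation type (assumed): the upper node fails if any basic node fails
F_{\text{upper}}^{(2)} = 1 - \prod_{i=1}^{k} \left(1 - F_i\right)
```

Under these assumptions, two basic execution nodes with F_1 = 0.01 and F_2 = 0.02 give 0.01 × 0.02 = 0.0002 for the first type and 1 − 0.99 × 0.98 = 0.0298 for the second type.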
According to an embodiment of the present disclosure, determining, according to the second failure rate, a failure processing policy that matches the target execution node includes:
under the condition that the second failure rate is greater than or equal to a failure threshold, determining the importance of each of the M basic execution nodes to the target execution node according to the failure calculation mode of the second failure rate; and
determining the basic execution node with the highest importance as a fault adjustment node, and generating a fault handling policy for the fault adjustment node.
According to an embodiment of the present disclosure, determining importance of M base execution nodes to a target execution node according to a failure calculation mode of a second failure rate includes:
for the m-th basic execution node, calculating the partial derivative of the failure calculation mode with respect to the first failure rate of the m-th basic execution node to obtain the importance of the m-th basic execution node to the target execution node, wherein M is greater than or equal to m, and m is greater than or equal to 1.
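By way of example only, the importance calculation may be sketched numerically as follows. The function names `importance`, `pick_adjustment_node` and `example_rate`, the central-difference step `eps`, and the example failure calculation are all hypothetical; the disclosure specifies only that the importance is a partial derivative with respect to each first failure rate.

```python
from typing import Callable, List, Optional, Sequence

def importance(rate_fn: Callable[[Sequence[float]], float],
               first_rates: Sequence[float],
               eps: float = 1e-6) -> List[float]:
    """Estimate d(rate_fn)/d(first_rates[m]) for each m by central differences."""
    scores = []
    for m in range(len(first_rates)):
        hi = list(first_rates)
        lo = list(first_rates)
        hi[m] += eps
        lo[m] -= eps
        scores.append((rate_fn(hi) - rate_fn(lo)) / (2 * eps))
    return scores

def pick_adjustment_node(rate_fn: Callable[[Sequence[float]], float],
                         first_rates: Sequence[float],
                         threshold: float) -> Optional[int]:
    """Return the index of the most important basic node, or None below the threshold."""
    if rate_fn(first_rates) < threshold:
        return None
    scores = importance(rate_fn, first_rates)
    return max(range(len(scores)), key=scores.__getitem__)

def example_rate(p: Sequence[float]) -> float:
    # Assumed failure calculation: the target fails if either basic node fails.
    return 1 - (1 - p[0]) * (1 - p[1])
```

With `example_rate` and first failure rates `[0.1, 0.2]`, the second failure rate is 0.28 and the partial derivatives are approximately 0.8 and 0.9, so under a threshold of 0.2 the second basic execution node (index 1) would be chosen as the fault adjustment node in this sketch.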
According to an embodiment of the present disclosure, determining M base execution nodes related to a target execution node to be detected includes:
obtaining a structural model of the distributed system, wherein the structural model comprises association relations among a plurality of execution nodes in the distributed system; and
determining, from the structural model, the M basic execution nodes related to the target execution node according to the node identification of the target execution node.
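As a non-limiting sketch, the lookup of basic execution nodes may be pictured as a graph traversal. The representation of the structural model as a dictionary from a node identification to the identifications of its lower-level nodes, and the function name `find_base_nodes`, are assumptions made for this example and are not taken from the disclosure.

```python
from typing import Dict, List

def find_base_nodes(model: Dict[str, List[str]], target_id: str) -> List[str]:
    """Collect the nodes reachable from target_id that have no lower-level nodes."""
    base: List[str] = []
    seen = set()
    stack = [target_id]
    while stack:
        node = stack.pop()
        if node in seen:
            continue                      # a shared node is visited only once
        seen.add(node)
        children = model.get(node, [])
        if not children:
            base.append(node)             # no lower-level nodes: a basic node
        else:
            stack.extend(reversed(children))
    return base

# Hypothetical structural model: C schedules B, and B schedules A1 and A2.
model = {"C": ["B"], "B": ["A1", "A2"], "A1": [], "A2": []}
```

For this hypothetical model, `find_base_nodes(model, "C")` returns `["A1", "A2"]`, that is, M = 2.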
According to an embodiment of the present disclosure, before determining M base execution nodes related to a target execution node to be detected, the method includes:
acquiring communication data of the distributed system through a mirror port of the distributed system;
preprocessing the communication data to obtain operation data related to data processing operations and sequence data related to communication processing logic;
inputting the operation data into an execution node classification model, and outputting a classification result, wherein the classification result comprises a plurality of execution nodes; and
constructing a structural model of the distributed system based on the classification result and the sequence data.
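Purely as an illustration of the last step, the following sketch builds a structural model from sequence data. It assumes that the classification step has already reduced each observed communication to an ordered list of execution node identifications running from caller to callee; that input format and the function name `build_structure` are hypothetical and are not specified by the disclosure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

def build_structure(sequences: Iterable[List[str]]) -> Dict[str, Set[str]]:
    """Map each execution node to the set of nodes it calls directly."""
    model: Dict[str, Set[str]] = defaultdict(set)
    for seq in sequences:
        # Each adjacent pair in a sequence is treated as a caller -> callee edge.
        for caller, callee in zip(seq, seq[1:]):
            model[caller].add(callee)
        if seq:
            model.setdefault(seq[-1], set())  # keep the last node even if it calls nothing
    return dict(model)
```

For example, the two sequences `["C", "B", "A1"]` and `["C", "B", "A2"]` yield a model in which C calls B, B calls A1 and A2, and A1 and A2 call nothing.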
A second aspect of the present disclosure provides a fault handling apparatus, comprising:
the determining module is used for determining M basic executing nodes related to the target executing node to be detected, wherein the basic executing nodes are used for executing specific data processing operation in the distributed system, and M is more than or equal to 2;
the acquisition module is used for acquiring the first failure rate of each of the M basic execution nodes;
the computing module is used for computing a second failure rate of the target execution node by utilizing the respective first failure rates of the M basic execution nodes according to the association relation between the M basic execution nodes and the target execution node; and
the processing module is used for determining a fault handling policy matched with the target execution node according to the second failure rate.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the fault handling method described above.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described fault handling method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described fault handling method.
In the embodiments of the disclosure, M basic execution nodes related to a target execution node to be detected are determined, the respective first failure rates of the M basic execution nodes are obtained, a second failure rate of the target execution node is calculated from those first failure rates according to the association relation between the M basic execution nodes and the target execution node, and a fault handling policy matched with the target execution node is determined according to the second failure rate, thereby achieving fault prediction and fault handling for the target execution node. Because the basic execution nodes execute specific data processing operations in the distributed system, and these operations do not change frequently as services are upgraded or new services go online, the distributed system does not need frequent manual analysis and modification, which improves the efficiency of fault prediction and fault handling. In addition, because an association relation exists between the target execution node and the basic execution nodes, the failure rate of the target execution node can be calculated flexibly, rapidly and accurately from the association relation and the first failure rates of the basic execution nodes, without multiple departments having to locate faults manually together.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario of a fault handling method according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a fault handling method according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart of a second failure rate determination method according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of a target link according to a particular embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of preprocessing communication data according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow chart of a method of generating a structural model according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of a fault handling apparatus according to an embodiment of the present disclosure; and
FIG. 8 schematically illustrates a block diagram of an electronic device adapted for a fault handling method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where expressions like "at least one of A, B and C, etc." are used, the expressions should generally be interpreted in accordance with the meaning as commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" shall include, but not be limited to, a system having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together, etc.).
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and application of the data involved (including but not limited to personal information of users) all comply with relevant laws and regulations, necessary security measures are adopted, and public order and good morals are not violated.
For enterprises such as banks, the business system is generally a distributed system. For example, the functions of the mobile banking client are implemented by a plurality of servers, and the data of the mobile banking client is stored in databases distributed in a plurality of cities.
In practical applications, the distributed business system of a bank has very complex components, including hardware and software developed in-house jointly by multiple departments of the enterprise, as well as hardware and software products purchased from vendors or outsourced. Thus, the practical operation of a distributed system involves a variety of risk factors that are difficult to predict, for example, performance risks such as transaction throughput caused by distributed storage, unmanageable distributed billing and data forking, as well as risks introduced by outsourced software and hardware, such as hacking and data leakage.
In the prior art, each department is generally responsible for fault detection and fault handling for part of the distributed system. Because the distributed architecture splits a single system into a plurality of services that span a large geographic area and have complex call relationships, when the distributed system fails it is difficult for a single department to locate and handle the fault source in a timely and accurate manner. In addition, if the fault source is not in the area for which the current department is responsible, multiple departments must work together to locate it, so fault detection and fault handling are inefficient.
The embodiment of the disclosure provides a fault processing method, which comprises the following steps: determining M basic execution nodes related to a target execution node to be detected, wherein the basic execution nodes are used for executing specific data processing operations in a distributed system, and M is greater than or equal to 2; acquiring respective first failure rates of the M basic execution nodes; calculating a second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes according to the association relation between the M basic execution nodes and the target execution node; and determining a fault handling policy matched with the target execution node according to the second failure rate.
Fig. 1 schematically illustrates an application scenario of a fault handling method according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
For example, a user can access a distributed system provided in the server 105 through the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
The server 105 may also be a server cluster that integrates multiple servers. A distributed system may be provided in the server cluster.
It should be noted that the fault handling method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the fault handling apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The fault handling method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105. Accordingly, the fault handling apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The fault handling method of the disclosed embodiment will be described in detail below with reference to fig. 2 to 6 based on the scenario described in fig. 1.
Fig. 2 schematically illustrates a flow chart of a fault handling method according to an embodiment of the present disclosure.
As shown in fig. 2, the method 200 includes operations S210 to S240.
In operation S210, M basic execution nodes related to the target execution node to be detected are determined, where the basic execution nodes are used to perform a specific data processing operation in the distributed system, and M is greater than or equal to 2.
According to an embodiment of the present disclosure, a distributed system includes a plurality of execution nodes, each for performing a portion of data processing operations in the distributed system. The target execution node may be any execution node in the distributed system.
According to an embodiment of the present disclosure, a distributed system may include a plurality of system layers, each of which may include at least one execution node. A basic execution node is an execution node that implements the lowest-level data processing operations in the distributed system.
For example, the base execution nodes include an execution node that performs a data query operation, an execution node that performs a data storage operation, an execution node that performs a delete operation, and so on.
According to an embodiment of the present disclosure, a distributed system includes, from bottom to top, a physical layer, a network layer, a base application layer, a business service layer, a data service layer, and a thin client layer.
According to an embodiment of the present disclosure, in performing fault detection, one execution node is randomly determined from a fault area, and the execution node is taken as a target execution node to be detected. When the distributed system is subjected to daily fault investigation, one execution node can be randomly selected from the distributed system, and the execution node is used as a target execution node to be detected.
In operation S220, a first failure rate of each of the M basic execution nodes is acquired.
According to the embodiment of the disclosure, the respective first failure rate of the basic execution nodes can be obtained by performing actual tests on the basic execution nodes.
According to an embodiment of the present disclosure, the base execution node is disposed at a physical layer of the distributed system, and the implementation of the data processing operation depends on the hardware device, and the production label of the hardware device includes the failure rate. Thus, the respective first failure rates of the base execution nodes may also be determined from the failure rates of the hardware devices on which the base execution nodes depend.
In operation S230, a second failure rate of the target execution node is calculated using the respective first failure rates of the M base execution nodes according to the association relationship between the M base execution nodes and the target execution node.
According to the embodiment of the disclosure, a service function is jointly implemented by at least one basic execution node in the distributed system, and an association relationship exists among the basic execution nodes that implement the same service function.
For example, the target execution node C is used to implement a transfer operation, and the target execution node C depends on the execution node B, which depends on the base execution node A1 and the base execution node A2. The execution node B has an association relation with the basic execution node A1 and the basic execution node A2, namely the execution node B schedules the basic execution node A1 and the basic execution node A2; there is an association between the target executing node C and the executing node B, i.e. the target executing node C schedules the executing node B.
According to the embodiment of the disclosure, because the scheduling relationships of the distributed system are complex, the dependencies between services are complex, and each service depends on different execution nodes, it is difficult to cover all services when fault analysis is performed on a per-service basis. Even if all services are covered, the fault detection workload of the distributed system keeps increasing as services are upgraded and new services go online.
In the embodiments of the present disclosure, the service functions of the entire distributed system depend on the bottommost basic execution nodes, i.e., the execution nodes of the physical layer, and the basic execution nodes generally do not change when service functions are upgraded or new services go online. Therefore, starting from the failure probabilities of the basic execution nodes, the failure probability of each target execution node to be detected can be determined flexibly and accurately according to the association relation between the target execution node and the basic execution nodes.
In operation S240, a fault handling policy matching the target execution node is determined according to the second fault rate.
According to an embodiment of the present disclosure, the second failure rate characterizes the probability that the target execution node fails. The higher the second failure rate, the greater that probability. Once the second failure rate of the target execution node is determined, a fault handling policy matched with the target execution node can be determined in advance; applying that policy to the target execution node allows the fault to be predicted and handled before it occurs, thereby avoiding the impact on the distributed system of a failure of the target execution node.
In the embodiments of the disclosure, M basic execution nodes related to a target execution node to be detected are determined, the respective first failure rates of the M basic execution nodes are obtained, a second failure rate of the target execution node is calculated from those first failure rates according to the association relation between the M basic execution nodes and the target execution node, and a fault handling policy matched with the target execution node is determined according to the second failure rate, thereby achieving fault prediction and fault handling for the target execution node. Because the basic execution nodes execute specific data processing operations in the distributed system, and these operations do not change frequently as services are upgraded or new services go online, the distributed system does not need frequent manual analysis and modification, which improves the efficiency of fault prediction and fault handling. In addition, because an association relation exists between the target execution node and the basic execution nodes, the failure rate of the target execution node can be calculated flexibly, rapidly and accurately from the association relation and the first failure rates of the basic execution nodes, without multiple departments having to locate faults manually together.
Fig. 3 schematically illustrates a flow chart of a second failure rate determination method according to an embodiment of the present disclosure.
As shown in fig. 3, the second failure rate determining method 300 of this embodiment includes operations S331 to S332, which may be a specific embodiment of operation S230.
In operation S331, a target link including the M basic execution nodes and the target execution node is determined, wherein the target execution node is located at the uppermost layer of the target link and the basic execution nodes are located at the lowermost layer of the target link.
In operation S332, a second failure rate of the target execution node is calculated using the respective first failure rates of the M base execution nodes in order from the lowest layer to the uppermost layer based on the association relationship.
According to embodiments of the present disclosure, a target link may be understood as a link made up of the target execution node together with the execution nodes and base execution nodes associated with it. The target execution node is located at the uppermost layer of the target link, and the base execution nodes are located at the lowermost layer of the target link; that is, the uppermost layer of the target link includes only the one target execution node, and the lowermost layer of the target link includes only the M base execution nodes.
According to an embodiment of the present disclosure, there may also be at least one layer of execution nodes between the target execution node and the M base execution nodes. An execution node having an association relationship with at least two base execution nodes may be referred to as an upper execution node of the at least two base execution nodes; conversely, the at least two base execution nodes may be referred to as lower execution nodes of that execution node.
For example, the target link includes three layers. The uppermost layer is the target execution node C, and the execution node that has an association relationship with the target execution node C and is located below it is the execution node B. The execution node B has an association relationship with the base execution node A1 and the base execution node A2, and is located at the layer above the base execution node A1 and the base execution node A2. The execution node B may be referred to as a lower execution node of the target execution node C, and as an upper execution node of the base execution node A1 and the base execution node A2.
According to the embodiment of the disclosure, after determining M basic execution nodes related to a target execution node to be detected, a target link may be determined according to an association relationship between the M basic execution nodes and the target execution node.
According to the embodiment of the disclosure, after the first failure rates of the M base execution nodes are obtained, the second failure rate of the target execution node may be calculated using the first failure rates of the M base execution nodes in order from the lowest layer to the uppermost layer based on the association relationship between the M base execution nodes and the target execution node.
In an embodiment of the present disclosure, after determining the first failure rate of the base execution node, the second failure rate is calculated from the base execution node toward the target execution node according to the association relationship between the target execution node and the M base execution nodes. Because the first failure rate of the basic execution node and the association relation are determined, the second failure rate can be rapidly calculated from the lowest layer to the uppermost layer, and the accurate second failure rate can be obtained.
According to an embodiment of the present disclosure, calculating the second failure rate of the target execution node using the respective first failure rates of the M base execution nodes in order from the lowest layer to the uppermost layer based on the association relationship includes: for the lowest layer of the target link, determining an upper execution node connected with at least two base execution nodes; in the case where the upper execution node is not the target execution node, determining the connection relation type between the at least two base execution nodes and the upper execution node; and calculating the failure rate of the upper execution node using the respective first failure rates of the at least two base execution nodes based on the connection relation type.
According to the embodiment of the disclosure, an association relationship exists among a plurality of execution nodes, and the failure rate of each execution node is affected by the execution node of the lower layer of the execution node. Therefore, the failure rate of the upper execution node can be calculated from the first failure rate of the base execution node from the lowest layer until the second failure rate of the target execution node is obtained.
According to an embodiment of the present disclosure, in the process of calculating from the lowest layer to the upper layer, the upper layer execution node connected to at least two base execution nodes is determined by comparing whether the execution nodes having the association relationship with each base execution node are identical.
According to an embodiment of the present disclosure, the upper execution node not being the target execution node indicates that the target link includes at least three layers, with the upper execution node located between the target execution node and the base execution nodes. It is therefore necessary to calculate the failure rate of the upper execution node first, and then calculate upward layer by layer until the second failure rate of the target execution node is obtained.
According to the embodiment of the disclosure, for an upper execution node connected with two or more execution nodes, the upper execution node may acquire an execution result returned by the two or more execution nodes and perform a subsequent operation on the execution result, that is, the upper execution node depends on the two or more execution nodes at the same time. The upper execution node may also obtain an execution result returned by one of the two or more execution nodes, and perform a subsequent operation on the execution result, that is, the upper execution node depends on only one execution node. Therefore, it is necessary to calculate the failure rate of the upper execution node according to the connection relationship type between at least two basic execution nodes and the upper execution node.
In the embodiment of the disclosure, in the process of calculating the second failure rate from the lowest layer to the uppermost layer, distinguishing the connection relation types allows the calculation to match the actual execution of the service system, so that the failure rate of the target execution node can be determined more flexibly and accurately from the first failure rates of the base execution nodes.
According to an embodiment of the present disclosure, calculating, based on the connection relation type, the failure rate of the upper execution node using the respective first failure rates of the at least two basic execution nodes includes the following. In the case where the connection relation type is determined to be a first connection relation type, the respective first failure rates of the at least two basic execution nodes are processed using a first failure calculation mode to obtain the failure rate of the upper execution node, wherein the first connection relation type characterizes that the data processing operation executed by the upper execution node is determined by the data processing operation executed by one of the at least two basic execution nodes. In the case where the connection relation type is determined to be a second connection relation type, the respective first failure rates of the at least two basic execution nodes are processed using a second failure calculation mode to obtain the failure rate of the upper execution node, wherein the second connection relation type characterizes that the data processing operation executed by the upper execution node is determined by the data processing operations executed by all of the at least two basic execution nodes.
According to an embodiment of the present disclosure, the first connection type may be understood as a relationship of "or" between at least two basic execution nodes and an upper execution node, and the second connection type may be understood as a relationship of "and" between at least two basic execution nodes and an upper execution node.
According to an embodiment of the present disclosure, the fault probability distribution function of the base execution node A is $F_A(x) = 1 - e^{-\lambda x}$, where $\lambda$ represents the first failure rate of the base execution node A, and the fault probability density function of the base execution node A is $f_A(x) = F_A'(x) = \lambda e^{-\lambda x}$.
The first failure calculation mode at time t may be determined by the following equation:
where $P(S \le t)$ represents the probability that the execution node S has failed by time t, i.e., the failure rate; the density function in the formula represents the fault probability density function of the execution node S under the first connection relation type; and $F_S(t)$ represents the fault probability distribution function of the execution node S at time t. The fault probability density function of the first connection relation type may be determined by the following formula:
where T represents the number of base execution nodes connected with the execution node S, and $\lambda_i$ represents the first failure rate of the i-th base execution node connected with the execution node S.
According to an embodiment of the present disclosure, the second failure calculation mode may be determined by the following formula:
where the density function in the formula represents the fault probability density function of the execution node S under the second connection relation type. The fault probability density function of the second connection relation type may be determined by the following formula:
where G and M denote the execution node G and the execution node M connected with the execution node S; $F_G(t)$ and $F_M(t)$ represent the fault probability distribution functions of the execution node G and the execution node M at time t, respectively; and $f_G(t)$ and $f_M(t)$ represent the fault probability density functions of the execution node G and the execution node M at time t, respectively.
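As a worked illustration consistent with the symbols defined above, the two calculation modes can be written in closed form under three assumptions that are not stated in the text: the connected nodes fail independently, an "or" parent (first connection relation type) fails only when all of its T connected nodes have failed, and an "and" parent (second connection relation type) fails as soon as either connected node fails. The expressions below are a sketch under those assumptions rather than a quotation of the original formulas.

```latex
% Failure rate of S at time t, from its density
P(S \le t) = \int_0^t f_S(x)\,dx = F_S(t)

% First connection relation type ("or"), assumed form
F_S(t) = \prod_{i=1}^{T} \left(1 - e^{-\lambda_i t}\right), \qquad
f_S(t) = \sum_{i=1}^{T} \lambda_i e^{-\lambda_i t}
         \prod_{j \ne i} \left(1 - e^{-\lambda_j t}\right)

% Second connection relation type ("and"), assumed form
F_S(t) = F_G(t) + F_M(t) - F_G(t)\,F_M(t), \qquad
f_S(t) = f_G(t) + f_M(t) - \left[F_G(t)\,F_M(t)\right]'
```

The "and" form has the same shape as the expression given for the execution node M in the Fig. 4 example.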
According to an embodiment of the present disclosure, in the case where it is determined that the upper execution node is the target execution node, the target link includes only two layers, namely the layer to which the target execution node belongs and the layer to which the base execution nodes belong. In this case, the failure rate of the target execution node may be calculated directly from the first failure rates of the at least two base execution nodes connected to it. Since the base execution nodes are determined according to the target execution node, if the upper execution node is the target execution node, the number of base execution nodes connected to the upper execution node is M.
According to an embodiment of the present disclosure, in a case where it is determined that an upper-layer execution node is a target execution node, calculating a failure rate of the target execution node directly from first failure rates of at least two base execution nodes connected thereto includes: and processing the first failure rate of each of the at least two basic execution nodes by using the second failure calculation mode to obtain the second failure rate of the target execution node.
Fig. 4 schematically illustrates a schematic diagram of a target link according to a specific embodiment of the present disclosure.
As shown in fig. 4, the target execution node is S, and the base execution nodes are A, B, C, E, I, and J. Three faults can occur in the data acquisition and display functions of the distributed system, namely a data source fault, data that cannot be displayed, and data that cannot be displayed in a preset order. The data receiving module may be understood as the execution node G, and the base execution nodes A and B represent two data sources. The data display module is the execution node M, in which the execution node N represents a display sub-module and the execution node O represents a sequence sub-module. The display sub-module is affected by two factors, characterized as the base execution nodes C and E, respectively, and the sequence sub-module is affected by two factors, characterized as the base execution nodes I and J, respectively. The first failure rates of the base execution nodes A, B, C, E, I, and J are $\lambda_1$, $\lambda_1$, $\lambda_2$, $\lambda_2$, $\lambda_3$, and $\lambda_3$, respectively.
As shown in fig. 4, the upper execution node of the base execution nodes A and B is G, the upper execution node of the base execution nodes C and E is N, and the upper execution node of the base execution nodes I and J is O. The upper execution node of the execution nodes N and O is M, and the upper execution node of the execution nodes G and M is S, namely the target execution node S.
The upper execution node G operates according to the execution results of both base execution nodes A and B, i.e., the upper execution node G may be understood as "and". The upper execution node N operates according to the execution result of one of the base execution nodes C and E, i.e., the upper execution node N may be understood as "or". Similarly, the upper execution node O is "and", and the upper execution node M is "and".
The first failure calculation mode can be used to calculate the fault probability density function of the upper execution node N. The fault probability density function of the upper execution node M is $f_M(m) = f_{M|N,O}(m \mid n, o) = f_N(t) + f_O(t) - [F_N(t) F_O(t)]'$. The second failure rate finally calculated for the target execution node S is:
where $F_G(t)$ and $F_M(t)$ represent the fault probability distribution functions of the execution node G and the execution node M connected with the target execution node, respectively.
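The layer-by-layer calculation for the Fig. 4 example can be sketched in Python as follows. The dictionary encoding of the target link, the function name, and the numeric rates are hypothetical. The sketch also assumes independent node failures, that an "or" parent fails only when all of its children have failed, and that an "and" parent fails when any child fails. It illustrates the evaluation order and is not the disclosed implementation.

```python
import math

# Hypothetical encoding of the Fig. 4 target link:
# upper node -> (connection type, lower nodes).
LINK = {
    "S": ("and", ["G", "M"]),
    "G": ("and", ["A", "B"]),
    "M": ("and", ["N", "O"]),
    "N": ("or", ["C", "E"]),
    "O": ("and", ["I", "J"]),
}


def failure_probability(node, rates, t, link=LINK):
    """Probability that `node` has failed by time t.

    Lower nodes are evaluated before their upper node, so the value
    propagates from the lowest layer up to the target execution node.
    """
    if node not in link:
        # Base execution node: F(t) = 1 - exp(-lambda * t).
        return 1.0 - math.exp(-rates[node] * t)
    kind, children = link[node]
    probs = [failure_probability(child, rates, t, link) for child in children]
    if kind == "or":
        # Assumption: one working child suffices, so the parent fails
        # only when every child has failed.
        return math.prod(probs)
    # "and": every child is needed, so the parent fails when any child fails.
    return 1.0 - math.prod(1.0 - p for p in probs)


if __name__ == "__main__":
    # Illustrative first failure rates (per hour); not taken from the source.
    rates = {"A": 1e-6, "B": 1e-6, "C": 2e-6, "E": 2e-6, "I": 3e-6, "J": 3e-6}
    print(failure_probability("S", rates, t=1e5))
```

Because each base execution node in Fig. 4 has exactly one upper execution node, the subtrees share no nodes, which is what makes the simple product rule applicable under the independence assumption.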
According to embodiments of the present disclosure, the analysis process from the base execution node to the target execution node may also be implemented by constructing a continuous-time bayesian network.
The embodiment of the disclosure effectively integrates expert experience, historical data and various incomplete and uncertain information by describing the dynamic behaviors and interactions of each execution node in the distributed system, and realizes the fault prediction and fault processing of the distributed system.
According to an embodiment of the present disclosure, determining a fault handling policy that matches the target execution node according to the second fault rate includes: under the condition that the second fault rate is larger than or equal to the fault threshold value, determining importance of M basic execution nodes to the target execution node according to a fault calculation mode of the second fault rate; and determining the basic execution node with the highest importance as a fault adjustment node, and generating a fault processing strategy aiming at the fault adjustment node.
According to embodiments of the present disclosure, the fault threshold may be determined from actual test results, for example 80%.
According to embodiments of the present disclosure, the fault handling policy may be to replace the hardware device on which the executing node depends.
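The policy decision described above can be sketched as a small function. The function name, the dictionary layout of the returned policy, and the action label `replace_hardware` are hypothetical. The default threshold of 0.8 mirrors the 80% example, and the importance values are assumed to have been computed already.

```python
def choose_policy(second_failure_rate, importance, threshold=0.8):
    """Return a fault handling policy, or None when below the threshold.

    `importance` maps each base execution node identifier to its
    importance for the target execution node.
    """
    if second_failure_rate < threshold:
        return None
    # The base execution node with the highest importance becomes
    # the fault adjustment node.
    node = max(importance, key=importance.get)
    # Hypothetical policy record: replace the hardware this node depends on.
    return {"fault_adjustment_node": node, "action": "replace_hardware"}
```

For example, `choose_policy(0.85, {"A": 3.0, "C": 1.0})` selects node `A`, while a second failure rate of 0.5 yields no policy.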
According to an embodiment of the present disclosure, determining the importance of the M base execution nodes to the target execution node according to the fault calculation mode of the second failure rate includes: for the m-th base execution node, computing the partial derivative of the fault calculation mode with respect to the first failure rate of the m-th base execution node to obtain the importance of the m-th base execution node to the target execution node, where $M \ge m \ge 1$.
For example, given a task period t, e.g., $t = 10^5$ hours, the partial derivative of the fault calculation mode with respect to its input is computed, and the result of the partial derivative can be used as the importance of the base execution node to the target execution node.
Still taking fig. 4 as an example, the partial derivative with respect to $\lambda_1$ is the importance of the base execution nodes A and B; similarly, the partial derivatives with respect to $\lambda_2$ and $\lambda_3$ are the importance of the base execution nodes C and E and of the base execution nodes I and J, respectively.
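The importance calculation for Fig. 4 can be sketched numerically with central finite differences. The closed form used for the failure probability of S assumes independent failures, "or" meaning that the parent fails only when all children have failed, and "and" meaning that the parent fails when any child fails. The function names, the relative step size, and the numeric rates are hypothetical.

```python
import math


def second_failure_rate(l1, l2, l3, t):
    """Failure probability of S in Fig. 4 under the assumed semantics."""
    f_n = (1.0 - math.exp(-l2 * t)) ** 2  # N is "or" over C and E
    # S survives only if A, B, N, I and J are all still working.
    survive = math.exp(-2 * l1 * t) * (1.0 - f_n) * math.exp(-2 * l3 * t)
    return 1.0 - survive


def importance(rates, t, rel_step=1e-6):
    """Approximate the partial derivative with respect to each first failure rate."""
    result = {}
    for i, name in enumerate(("lambda1", "lambda2", "lambda3")):
        h = rates[i] * rel_step
        up = list(rates)
        up[i] += h
        down = list(rates)
        down[i] -= h
        diff = second_failure_rate(*up, t) - second_failure_rate(*down, t)
        result[name] = diff / (2 * h)
    return result


if __name__ == "__main__":
    # Illustrative rates (per hour) and the task period t = 1e5 hours.
    print(importance((1e-6, 2e-6, 3e-6), t=1e5))
```

A symbolic derivative would serve equally well; the finite difference is used here only because it needs no extra dependencies.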
According to an embodiment of the present disclosure, determining M base execution nodes related to a target execution node to be detected includes: obtaining a structural model of the distributed system, wherein the structural model comprises association relations among a plurality of execution nodes in the distributed system; and determining M basic execution nodes related to the target execution node from the structural model according to the node identification of the target execution node.
According to an embodiment of the present disclosure, the structural model is determined from the multi-layer characteristics of the distributed system. Vertically, the distributed system is divided into a physical layer, a network layer, a basic application layer, a business service layer, a data service layer, and a thin client layer; horizontally, each layer is divided into a plurality of nodes, and the structural model is generated based on these nodes. Once the structural model is generated, the association relationships among the plurality of execution nodes in the distributed system are known.
According to embodiments of the present disclosure, an execution node may be represented in the structural model by its node identification. Thus, according to the node identification of the target execution node, the M base execution nodes associated with the target execution node may be determined from the structural model.
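Looking up the base execution nodes from a structural model can be sketched as a graph traversal. The representation of the structural model as a dictionary from a node identification to the identifications of its lower execution nodes is an assumption made for illustration, as is the function name.

```python
def base_nodes(model, target_id):
    """Return the base execution nodes reachable from `target_id`.

    `model` maps a node identification to the list of its lower
    execution nodes; a node with no lower nodes is treated as a
    base execution node.
    """
    found, stack, seen = [], [target_id], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        children = model.get(node, [])
        if children:
            stack.extend(children)
        else:
            found.append(node)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical structural model matching the Fig. 4 example.
    fig4 = {
        "S": ["G", "M"],
        "G": ["A", "B"],
        "M": ["N", "O"],
        "N": ["C", "E"],
        "O": ["I", "J"],
    }
    print(base_nodes(fig4, "S"))
```

The `seen` set keeps the traversal correct even if a lower node is shared by several upper nodes.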
According to embodiments of the present disclosure, the structural model is generated and updated from daily communication data of the distributed system.
According to an embodiment of the present disclosure, before determining the M base execution nodes related to the target execution node to be detected, the method further includes generating a structural model. Generating the structural model includes: acquiring communication data of the distributed system through a mirror port of the distributed system; preprocessing the communication data to obtain operation data related to data processing operations and sequence data related to communication processing logic; inputting the operation data into an execution node classification model and outputting a classification result, wherein the classification result includes a plurality of execution nodes; and constructing the structural model of the distributed system based on the classification result and the sequence data.
According to embodiments of the present disclosure, real-time communication data may be acquired from a distributed system through a bypass acquisition mode. The bypass acquisition mode is more flexible and convenient to deploy, and the structure of the existing distributed system is not affected.
The related art may acquire communication data through a serial mode. The serial mode typically uses the data acquisition device as a gateway, bridge, or proxy server, which changes the existing structure of the distributed system. In the serial mode the device is connected in series in the network, so all data must pass through the data acquisition device and be analyzed and inspected by it before being sent to each client, which slows down the network; and if the data acquisition device fails, the distributed system is interrupted.
In embodiments of the present disclosure, the bypass acquisition mode may copy communication data from the distributed system through the mirror port. Because the communication data is obtained from the mirror port, neither the fault detection process nor the structural model generation process delays the analysis of the original data packets or affects the network speed. In addition, if the device used in the bypass acquisition mode fails or stops running, the existing network is not affected.
Fig. 5 schematically illustrates a flow chart of preprocessing communication data according to an embodiment of the present disclosure.
As shown in fig. 5, after the communication data is acquired through the mirror port, the communication data is input into a filtering and splitting array, which separates the communication data of the plurality of service functions. For each service function, the communication data is converted by code conversion into a data form matched with the bypass acquisition mode. Structured information and unstructured information are then obtained through de-duplication and information extraction.
According to an embodiment of the present disclosure, the structured information includes general fields such as transaction account and gender, and the unstructured information includes scenario-specific fields such as acquisition channel, business scenario, and medium. After the structured information and the unstructured information are obtained, they are stored in corresponding databases for subsequent fault analysis.
According to embodiments of the present disclosure, although the technologies and communication structures employed by the various execution nodes differ, the application characteristics of the execution nodes and their basic behavior rules are common across node technologies, and these characteristics and rules are necessarily implicit in the communication data. Operation data related to data processing operations and sequence data related to communication processing logic are obtained by preprocessing the communication data. The operation data is then input into the execution node classification model, which identifies the implicit characteristics of the execution nodes, sorts the operation data into a plurality of execution nodes, and outputs them as the classification result.
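The splitting, de-duplication, and extraction steps can be sketched as follows. The record layout (a flat dictionary per message), the field names, and the `service` key are all hypothetical, and the code conversion step is omitted. The sketch only shows how records might be grouped by service function and separated into structured and unstructured parts.

```python
from collections import defaultdict

# Hypothetical field names for the two kinds of information.
STRUCTURED_FIELDS = ("account", "gender")
UNSTRUCTURED_FIELDS = ("channel", "scene", "medium")


def preprocess(records):
    """Group records by service function, drop exact duplicates, split fields."""
    by_service = defaultdict(list)
    seen = set()
    for raw in records:
        key = tuple(sorted(raw.items()))
        if key in seen:  # de-duplication of identical records
            continue
        seen.add(key)
        structured = {k: raw[k] for k in STRUCTURED_FIELDS if k in raw}
        unstructured = {k: raw[k] for k in UNSTRUCTURED_FIELDS if k in raw}
        service = raw.get("service", "unknown")
        by_service[service].append((structured, unstructured))
    return dict(by_service)
```

In practice the two parts would then be written to their respective databases, which is outside the scope of this sketch.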
According to an embodiment of the present disclosure, the execution node classification model includes a support vector machine (Support Vector Machines, SVM) model. The SVM model is utilized to classify the operation data, so that the execution nodes of the distributed system can be automatically identified, and various behavior identifiers of each execution node are obtained. According to the embodiment of the disclosure, a node list can be generated according to the output result of the SVM model so as to carry out node concatenation subsequently.
According to the embodiment of the disclosure, when training the SVM model, training data is input into the SVM model, and the problem of finding a "hyperplane" is solved as an optimization problem to obtain the trained SVM model. Finding a "hyperplane" means finding the classification boundary that maximizes the separation between the two classes of data.
According to an embodiment of the present disclosure, the classification plane is expressed as:
(w*x)+b=0 (6)
where x represents a multidimensional vector, and w and b are parameters of the hyperplane. The optimization problem of the SVM model is expressed as:
where the constraint requires that the distance from each data point $(x_i, y_i)$ to the classification plane be greater than or equal to 1, $y_i$ is the class of the data, and $2/\|w\|_2$ represents the classification interval.
Since the optimization function of the SVM model is quadratic and the constraint is linear, the optimization problem is a typical quadratic programming problem and can be solved by using the Lagrangian multiplier method. The dual problem of this optimization problem is:
where $a_i$ is a Lagrange multiplier.
According to an embodiment of the present disclosure, after the optimal solution $a^*$ of the dual problem is found, the corresponding $w^*$ and $b^*$ may be solved for.
According to embodiments of the present disclosure, once the optimal $w^*$ and $b^*$ are determined, the optimal hyperplane is also determined.
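For reference, the textbook hard-margin SVM formulation matches the description above (quadratic objective, linear constraints, Lagrange multipliers $a_i$). It is given here as the standard form, on the assumption that the disclosure follows it; n denotes the number of training points and is a symbol introduced for this illustration.

```latex
% Primal problem: maximizing the interval 2/||w|| is equivalent to
\min_{w,\,b} \ \frac{1}{2}\,\|w\|^2
\quad \text{s.t.} \quad y_i\big((w \cdot x_i) + b\big) \ge 1, \qquad i = 1, \dots, n

% Dual problem in the Lagrange multipliers a_i
\max_{a} \ \sum_{i=1}^{n} a_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j y_i y_j \,(x_i \cdot x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{n} a_i y_i = 0, \qquad a_i \ge 0

% Recovering the hyperplane from the optimal multipliers a^*
w^* = \sum_{i=1}^{n} a_i^* y_i x_i, \qquad
b^* = y_j - (w^* \cdot x_j) \ \ \text{for any } j \text{ with } a_j^* > 0
```

Only the points with $a_j^* > 0$ (the support vectors) contribute to $w^*$, which is why the hyperplane is fixed once the dual solution is known.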
According to embodiments of the present disclosure, for the linear inseparable problem, a look-ahead soft interval classifier method or a nonlinear hard interval classifier method may be employed.
According to an embodiment of the present disclosure, constructing a structural model of a distributed system based on classification results and sequence data includes: and connecting a plurality of execution nodes represented by the classification results in series according to the sequence data to obtain a structural model.
According to the embodiment of the disclosure, in the process of concatenation, element extraction can be performed on the behaviors of a plurality of execution nodes represented by the sequence information through a hierarchical conditional random field model, and automatic extraction is realized by forming an information extraction wrapper, so that the behavior information elements of the execution nodes are obtained.
According to the embodiment of the disclosure, the element extraction process for the execution nodes amounts to labeling the data property of each variable or element, so the behavior recognition and element extraction problems are similar to the classical part-of-speech tagging or semantic tagging problems for documents. Among the various approaches to part-of-speech tagging or semantic tagging, the conditional random field (CRF) is a commonly used model. The conditional random field is an extension of the maximum entropy model: given an observation sequence, the type label of a single information unit is extended to a label for the whole sequence, so that the concatenation of the plurality of execution nodes is completed once the sequence labeling process is completed.
Fig. 6 schematically illustrates a flow chart of a method of generating a structural model according to an embodiment of the present disclosure.
As shown in fig. 6, the distributed system 601 may transmit acquired communication data to the preprocessing module 602 through a mirrored interface. The preprocessing module may input operational data in the communication data into the execution node classification model 603 and sequential data into the hierarchical conditional random field model 604. The execution node classification model 603 is pre-trained with training data.
The execution node classification model 603 performs execution node recognition and classification on the input operation data, outputs the classified execution nodes, and forms an execution node list. The executing node classification model 603 then sends the list of executing nodes to the hierarchical conditional random field model 604. The hierarchical conditional random field model 604 extracts behavior information elements of the execution nodes from the received order data and the list of execution nodes, which together form a structural model 605.
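The concatenation step alone can be sketched without the conditional random field. The sketch assumes the sequence data has already been reduced to ordered lists of node identifications, with an upper node appearing immediately before its lower node. That input format and the function name are hypothetical, and the code stands in for only the final assembly, not for the hierarchical conditional random field model 604.

```python
def build_structural_model(node_list, sequences):
    """Connect classified execution nodes according to ordered sequences.

    Returns a mapping from each node identification to the sorted list
    of nodes that directly follow it in at least one sequence.
    """
    model = {node: set() for node in node_list}
    for seq in sequences:
        for upper, lower in zip(seq, seq[1:]):
            # Ignore identifiers absent from the execution node list.
            if upper in model and lower in model:
                model[upper].add(lower)
    return {node: sorted(children) for node, children in model.items()}
```

For example, the node list `["S", "G", "M", "A"]` with the sequences `["S", "G", "A"]` and `["S", "M"]` yields a model in which `S` is connected to `G` and `M`, and `G` is connected to `A`.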
Fig. 7 schematically shows a block diagram of a fault handling apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the fault handling apparatus 700 of this embodiment includes a determining module 710, an acquiring module 720, a calculating module 730, and a processing module 740.
The determining module 710 is configured to determine M basic execution nodes related to a target execution node to be detected, where the basic execution nodes are configured to perform specific data processing operations in the distributed system, and M is greater than or equal to 2. In an embodiment, the determining module 710 may be configured to perform the operation S210 described above, which is not repeated here.
The obtaining module 720 is configured to obtain the respective first failure rates of the M basic execution nodes. In an embodiment, the obtaining module 720 may be configured to perform the operation S220 described above, which is not repeated here.
The calculating module 730 is configured to calculate, according to the association relationship between the M basic execution nodes and the target execution node, a second failure rate of the target execution node by using the first failure rates of the M basic execution nodes. In an embodiment, the calculating module 730 may be configured to perform the operation S230 described above, which is not repeated here.
The processing module 740 is configured to determine, according to the second failure rate, a fault processing policy that matches the target execution node. In an embodiment, the processing module 740 may be configured to perform the operation S240 described above, which is not repeated here.
According to an embodiment of the present disclosure, the calculating module 730 includes a first determining sub-module and a calculating subunit.
The first determining submodule is used for determining a target link comprising M basic executing nodes and target executing nodes, wherein the target executing nodes are located at the uppermost layer of the target link, and the basic executing nodes are located at the lowermost layer of the target link. In an embodiment, the first determining sub-module may be used to perform the operation S331 described above, which is not described herein.
The calculating subunit is used for calculating the second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes according to the sequence from the lowest layer to the uppermost layer based on the association relation. In an embodiment, the computing subunit may be configured to perform the operation S332 described above, which is not described herein.
According to an embodiment of the present disclosure, the calculating subunit includes a first determining unit, a second determining unit, and a third determining unit.
For the lowest layer of the target link, the first determining unit is used for determining an upper execution node connected with at least two basic execution nodes.
The second determining unit is used for determining the connection relation type between at least two basic executing nodes and the upper executing node under the condition that the upper executing node is not the target executing node.
The third determining unit is used for calculating the failure rate of the upper execution node by using the respective first failure rates of the at least two basic execution nodes based on the connection relation type.
According to an embodiment of the present disclosure, the third determination unit includes a first determination subunit and a second determination subunit.
The first determining subunit is configured to process, when it is determined that the connection relationship type is the first connection relationship type, respective first failure rates of the at least two basic execution nodes by using a first failure calculation mode to obtain a failure rate of the upper execution node, where the first connection relationship type characterizes that a data processing operation performed by the upper execution node is determined by a data processing operation performed by one of the at least two basic execution nodes.
The second determining subunit is configured to process, when it is determined that the connection relationship type is the second connection relationship type, the respective first failure rates of the at least two basic execution nodes by using a second failure calculation mode to obtain the failure rate of the upper execution node, where the second connection relationship type characterizes that the data processing operation performed by the upper execution node is determined by the data processing operations performed by all of the at least two basic execution nodes.
According to an embodiment of the present disclosure, the processing module includes a first processing sub-module and a second processing sub-module.
The first processing sub-module is configured to determine the importance of each of the M basic execution nodes to the target execution node according to the fault calculation mode of the second failure rate, in the case that the second failure rate is greater than or equal to a fault threshold.
The second processing sub-module is configured to determine the basic execution node with the highest importance as a fault adjustment node and to generate a fault processing strategy for the fault adjustment node.
According to an embodiment of the present disclosure, the first processing sub-module includes a processing unit configured to, for the m-th basic execution node, compute the partial derivative of the fault calculation mode with respect to the first failure rate of the m-th basic execution node, thereby obtaining the importance of the m-th basic execution node to the target execution node, where M ≥ m ≥ 1.
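The partial-derivative importance measure can be sketched numerically as follows. This is a minimal sketch under stated assumptions: `example_fault_fn` is a hypothetical fault calculation mode invented for illustration (it is not taken from the disclosure), and the derivative is estimated with a central difference rather than derived symbolically.

```python
from typing import Callable, List, Sequence


def importance(fault_fn: Callable[[Sequence[float]], float],
               rates: Sequence[float], m: int, h: float = 1e-6) -> float:
    """Central-difference estimate of d(fault_fn)/d(rates[m])."""
    up: List[float] = list(rates)
    down: List[float] = list(rates)
    up[m] += h
    down[m] -= h
    return (fault_fn(up) - fault_fn(down)) / (2 * h)


def example_fault_fn(rates: Sequence[float]) -> float:
    # Hypothetical calculation mode (assumption): the target fails if node 0
    # fails, or if nodes 1 and 2 both fail; failures are independent.
    f0, f1, f2 = rates
    return 1 - (1 - f0) * (1 - f1 * f2)


# First failure rates of M = 3 basic execution nodes (illustrative values).
rates = [0.01, 0.10, 0.20]
scores = [importance(example_fault_fn, rates, m) for m in range(len(rates))]

# The node with the highest importance becomes the fault adjustment node.
adjust_index = max(range(len(rates)), key=scores.__getitem__)
```

In this example the node with the lowest failure rate has the highest importance, because the target depends on it directly; the measure ranks nodes by how strongly the target's failure rate reacts to a change in each node, not by how often the node itself fails.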
According to an embodiment of the present disclosure, the determination module 710 includes an acquisition sub-module and a second determination sub-module.
The acquisition sub-module is used for acquiring a structural model of the distributed system, wherein the structural model comprises an association relation among a plurality of execution nodes in the distributed system.
The second determining sub-module is configured to determine, from the structural model, the M basic execution nodes related to the target execution node according to the node identification of the target execution node.
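One way to picture this lookup is a traversal of the structural model starting from the target node's identifier. The sketch below assumes the structural model can be represented as a mapping from a node identifier to the identifiers of its directly connected lower-layer nodes; that representation, the function name `basic_nodes_for`, and the node names are hypothetical.

```python
from typing import Dict, List

# Hypothetical structural model: node id -> ids of directly connected lower-layer nodes.
StructuralModel = Dict[str, List[str]]


def basic_nodes_for(model: StructuralModel, target_id: str) -> List[str]:
    """Collect the leaf (basic) execution nodes reachable from target_id."""
    found: List[str] = []
    seen = set()
    stack = [target_id]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        children = model.get(node, [])
        # A node with no lower-layer nodes is treated as a basic execution node.
        if not children and node != target_id:
            found.append(node)
        stack.extend(children)
    return sorted(found)


model: StructuralModel = {
    "payment": ["auth", "ledger"],
    "auth": ["token-db", "token-cache"],
    "ledger": ["ledger-db"],
}
basic = basic_nodes_for(model, "payment")
```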
According to an embodiment of the present disclosure, the fault handling apparatus 700 includes a bypass acquisition module, a preprocessing module, a classification module, and a generation module.
The bypass acquisition module is configured to acquire communication data of the distributed system through a mirror port of the distributed system.
The preprocessing module is configured to preprocess the communication data to obtain operation data related to data processing operations and sequence data related to communication processing logic.
The classification module is configured to input the operation data into an execution node classification model and output a classification result, where the classification result includes a plurality of execution nodes.
The generation module is configured to construct a structural model of the distributed system based on the classification result and the sequence data.
According to an embodiment of the present disclosure, any number of the determination module 710, the acquisition module 720, the calculation module 730, and the processing module 740 may be combined and implemented in one module, or any one of the modules may be split into a plurality of modules. Alternatively, at least some of the functionality of one or more of the modules may be combined with at least some of the functionality of other modules and implemented in one module.
According to embodiments of the present disclosure, at least one of the determination module 710, the acquisition module 720, the calculation module 730, and the processing module 740 may be implemented at least in part as hardware circuitry, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on a chip, a system on a substrate, a system on a package, or an application specific integrated circuit (ASIC); or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging circuitry; or may be implemented in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the determination module 710, the acquisition module 720, the calculation module 730, and the processing module 740 may be at least partially implemented as a computer program module which, when executed, performs the corresponding functions.
Fig. 8 schematically illustrates a block diagram of an electronic device adapted for a fault handling method according to an embodiment of the disclosure.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the program may be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 802 and/or RAM 803 and/or one or more memories other than ROM 802 and RAM 803 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. The program code, when executed in a computer system, causes the computer system to perform the methods provided by embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed via the communication section 809, and/or installed from the removable medium 811. The program code contained in the computer program may be transmitted using any appropriate network medium, including but not limited to wireless, wired, or any suitable combination of the foregoing.
In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809, and/or installed from the removable media 811. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 801. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
According to embodiments of the present disclosure, program code for executing the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, C, and similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or incorporated in a variety of ways, even if such combinations or incorporations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or in the claims may be combined and/or incorporated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or incorporations fall within the scope of the present disclosure.
The foregoing embodiments have been provided to illustrate the general principles of the present disclosure and are not intended to limit the scope of the disclosure to the particular embodiments disclosed.

Claims (12)

1. A fault handling method, comprising:
determining M basic execution nodes related to a target execution node to be detected, wherein the basic execution nodes are used for executing specific data processing operations in a distributed system, and M is greater than or equal to 2;
acquiring respective first failure rates of the M basic execution nodes;
calculating a second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes according to the association relationship between the M basic execution nodes and the target execution node; and
determining a fault processing strategy matched with the target execution node according to the second failure rate.
2. The method of claim 1, wherein the calculating a second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes according to the association relationship between the M basic execution nodes and the target execution node comprises:
determining a target link comprising the M basic execution nodes and the target execution node, wherein the target execution node is located at the uppermost layer of the target link, and the basic execution nodes are located at the lowermost layer of the target link; and
calculating the second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes in order from the lowermost layer to the uppermost layer, based on the association relationship.
3. The method of claim 2, wherein the calculating the second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes in order from the lowermost layer to the uppermost layer, based on the association relationship, comprises:
for the lowermost layer of the target link,
determining an upper execution node connected to at least two basic execution nodes;
determining a connection relationship type between the at least two basic execution nodes and the upper execution node in the case that the upper execution node is not the target execution node; and
calculating the failure rate of the upper execution node by using the respective first failure rates of the at least two basic execution nodes based on the connection relationship type.
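The layer-by-layer calculation recited in claims 2 and 3 can be sketched as a recursive evaluation of a tree, in which each upper node's failure rate is computed only after the rates of the nodes below it are known. This is a minimal sketch under stated assumptions: the `Node` class, the relation labels "any" and "all", the independence of failures, and the two combination rules are all hypothetical stand-ins for the connection relationship types and failure calculation modes, which the claims do not spell out.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List, Optional


@dataclass
class Node:
    name: str
    rate: Optional[float] = None      # first failure rate, set only on basic nodes
    relation: Optional[str] = None    # "any" or "all", set only on upper nodes
    children: List["Node"] = field(default_factory=list)


def failure_rate(node: Node) -> float:
    """Evaluate a node's failure rate from the lowermost layer upward."""
    if not node.children:
        if node.rate is None:
            raise ValueError(f"basic node {node.name!r} has no failure rate")
        return node.rate
    child_rates = [failure_rate(child) for child in node.children]
    if node.relation == "any":
        # Assumed rule: the node fails if any lower node fails.
        return 1 - prod(1 - r for r in child_rates)
    if node.relation == "all":
        # Assumed rule: the node fails only if all lower nodes fail.
        return prod(child_rates)
    raise ValueError(f"unknown relation {node.relation!r} on {node.name!r}")


# Illustrative target link: two basic nodes feed an intermediate upper node,
# which together with a third basic node feeds the target node.
upper = Node("upper", relation="all",
             children=[Node("b1", rate=0.1), Node("b2", rate=0.2)])
target = Node("target", relation="any",
              children=[upper, Node("b3", rate=0.01)])
second_failure_rate = failure_rate(target)
```

The recursion visits nodes in post-order, which gives the same result as an explicit pass from the lowermost layer to the uppermost layer.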
4. The method of claim 3, wherein the calculating the failure rate of the upper execution node by using the respective first failure rates of the at least two basic execution nodes based on the connection relationship type comprises:
in the case that the connection relationship type is determined to be a first connection relationship type, processing the respective first failure rates of the at least two basic execution nodes by using a first failure calculation mode to obtain the failure rate of the upper execution node, wherein the first connection relationship type characterizes that the data processing operation executed by the upper execution node is determined by the data processing operation executed by one of the at least two basic execution nodes; and
in the case that the connection relationship type is determined to be a second connection relationship type, processing the respective first failure rates of the at least two basic execution nodes by using a second failure calculation mode to obtain the failure rate of the upper execution node, wherein the second connection relationship type characterizes that the data processing operation executed by the upper execution node is determined by the data processing operations executed by the at least two basic execution nodes.
5. The method of claim 1, wherein the determining a fault processing strategy matched with the target execution node according to the second failure rate comprises:
determining the importance of the M basic execution nodes to the target execution node according to a fault calculation mode of the second failure rate, in the case that the second failure rate is greater than or equal to a fault threshold; and
determining the basic execution node with the highest importance as a fault adjustment node, and generating a fault processing strategy for the fault adjustment node.
6. The method of claim 5, wherein the determining the importance of the M basic execution nodes to the target execution node according to the fault calculation mode of the second failure rate comprises:
for an m-th basic execution node, computing the partial derivative of the fault calculation mode with respect to the first failure rate of the m-th basic execution node to obtain the importance of the m-th basic execution node to the target execution node, wherein M ≥ m ≥ 1.
7. The method of claim 1, wherein the determining M base execution nodes associated with the target execution node to be detected comprises:
obtaining a structural model of the distributed system, wherein the structural model comprises association relationships among a plurality of execution nodes in the distributed system; and
determining, from the structural model, the M basic execution nodes related to the target execution node according to the node identification of the target execution node.
8. The method of claim 1, wherein, prior to the determining M basic execution nodes related to a target execution node to be detected, the method further comprises:
acquiring communication data of the distributed system through a mirror port of the distributed system;
preprocessing the communication data to obtain operation data related to data processing operations and sequence data related to communication processing logic;
inputting the operation data into an execution node classification model, and outputting a classification result, wherein the classification result comprises a plurality of execution nodes; and
constructing a structural model of the distributed system based on the classification result and the sequence data.
9. A fault handling apparatus comprising:
a determining module configured to determine M basic execution nodes related to a target execution node to be detected, wherein the basic execution nodes are used for executing specific data processing operations in a distributed system, and M is greater than or equal to 2;
an acquisition module configured to acquire respective first failure rates of the M basic execution nodes;
a calculation module configured to calculate a second failure rate of the target execution node by using the respective first failure rates of the M basic execution nodes according to the association relationship between the M basic execution nodes and the target execution node; and
a processing module configured to determine a fault processing strategy matched with the target execution node according to the second failure rate.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-8.
11. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN202310946690.9A 2023-07-28 2023-07-28 Fault processing method, device, electronic equipment and computer storage medium Pending CN116932269A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310946690.9A CN116932269A (en) 2023-07-28 2023-07-28 Fault processing method, device, electronic equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310946690.9A CN116932269A (en) 2023-07-28 2023-07-28 Fault processing method, device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN116932269A true CN116932269A (en) 2023-10-24

Family

ID=88378912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310946690.9A Pending CN116932269A (en) 2023-07-28 2023-07-28 Fault processing method, device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN116932269A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination