CN111108481B - Fault analysis method and related equipment - Google Patents

Info

Publication number
CN111108481B
CN111108481B CN201780094808.2A CN201780094808A CN111108481B CN 111108481 B CN111108481 B CN 111108481B CN 201780094808 A CN201780094808 A CN 201780094808A CN 111108481 B CN111108481 B CN 111108481B
Authority
CN
China
Prior art keywords
fault
reason
phenomenon
nodes
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201780094808.2A
Other languages
Chinese (zh)
Other versions
CN111108481A (en)
Inventor
张瑞荣
姚满海
李翠琴
石俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN111108481A
Application granted
Publication of CN111108481B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

An embodiment of the invention discloses a fault analysis method and related equipment. The method comprises: a fault detection device obtains fault description information, where the fault description information describes a fault phenomenon of a faulty device; and the fault detection device traverses a fault tree according to the fault phenomenon to obtain the fault cause of the faulty device, where the fault tree reflects the correspondence between fault phenomena and fault causes. With this embodiment, the fault cause of the faulty device can be accurately analyzed and located using the fault tree, which improves fault detection efficiency, reduces repair cost, and improves user experience.

Description

Fault analysis method and related equipment
Technical Field
The invention relates to the technical field of terminals, in particular to a fault analysis method and related equipment.
Background
For terminal equipment, faults exist objectively and occur randomly. At present, if a problem occurs while a device is in use, the user usually takes the device to a service outlet for repair. Generally, a maintenance engineer uses a corresponding maintenance detection tool to detect and repair the device according to the fault phenomenon reported by the user. Because the detection capability of such tools is still imperfect, this approach depends heavily on the experience of the maintenance engineer. For complex faults in particular, maintenance personnel are often at a loss, cannot quickly locate the cause of the fault, and must return the device to the factory for repair. As a result, the fault detection rate is low and the repair time is long, which directly affects the service experience after the product reaches the market.
To address these problems, the prior art takes devices whose faults the service outlet failed to detect, has research and development engineers and maintenance engineers jointly analyze and resolve them, integrates the resulting analysis capability into the fault detection tool, and improves the tool iteratively over successive product generations. Technically, this mainly relies on capturing and analyzing logs from the faulty device. In this scheme there are many log types, the log content structure is complex, and the analysis efficiency is low. In addition, newly added detection schemes and changes to existing detection schemes made by research and development are not reflected in the fault detection tool in time, so the tool's detection capability is incomplete and fault analysis at the service outlet is inefficient.
Disclosure of Invention
An embodiment of the invention provides a fault analysis method and related equipment that use fault tree theory to quickly and accurately analyze and locate the fault cause of a faulty device. On this basis, fault code delimitation and fault association detection are adopted to improve fault analysis efficiency and to reduce manual maintenance cost, and hence equipment maintenance cost.
In a first aspect, an embodiment of the present invention provides a fault analysis method, including:
a fault detection device obtains fault description information, where the fault description information describes a fault phenomenon of a faulty device; and
the fault detection device traverses a fault tree according to the fault phenomenon to obtain the fault cause of the faulty device, where the fault tree reflects the correspondence between fault phenomena and fault causes.
In some possible embodiments, the fault tree is separately set in a configuration file, wherein the configuration file can be separately updated through a wired or wireless manner.
In some possible embodiments, the fault tree includes fault phenomenon nodes and fault cause nodes distributed in multiple layers, where the fault cause node in the middle layer is used for indicating an intermediate cause causing the fault phenomenon, and the fault cause node in the bottom layer is used for indicating a root cause causing the fault phenomenon.
In some possible embodiments, the fault phenomenon node and the fault cause node may be identified by a fault code indicating an intermediate or root cause on a node that caused the fault phenomenon to occur.
In some possible embodiments, each of the fault cause nodes distributed in multiple layers has a corresponding fault determination rule, where the fault determination rule serves as the basis for determining whether the corresponding fault cause node is a cause of the fault phenomenon.
In some possible embodiments, the fault determination rule comprises at least one of: alarm class rules, command class rules, log class rules, performance class rules.
In some possible embodiments, the fault determination rule is composed of at least one of the following: a fault cause node, influence parameters, and a logical relationship, where the logical relationship comprises the logical relationship between the fault cause node and the influence parameters and/or the logical relationship among the influence parameters, and the influence parameters serve as the basis for determining whether the fault cause node causes the fault phenomenon.
In some possible embodiments, the fault phenomenon node and at least one of the fault reason nodes distributed in multiple layers are characterized by a pre-coded code character, and different nodes correspond to different code characters.
In some possible embodiments, the fault tree is edited and stored by a user, based on experience, through a visual editing interface.
In some possible embodiments, the method further comprises: and the fault detection equipment recommends a fault maintenance suggestion corresponding to the fault reason according to the fault reason.
In some possible embodiments, the cause of the failure comprises at least one of: component failure, environmental impact, software defects, human error, system failure.
In some possible embodiments, the fault tree is an N-ary tree, where N is a positive integer.
In some possible embodiments, the fault tree may be further designed with association detection. Specifically, fault nodes (i.e., events or fault cause nodes) from different domains that can cause the same or similar fault phenomena can be added to the fault tree. Taking performance faults (stuttering, slow response, lack of smoothness) as an example, such symptoms can be caused by any one or a combination of the following causes from different domains: system resource problems, device problems (device aging), bugs in the application itself, and so on.
In a second aspect, an embodiment of the present invention provides a fault detection device, including a functional unit for performing the method of the first aspect.
In a third aspect, an embodiment of the present invention provides a fault detection device, including a memory, a communication interface, and a processor coupled to the memory and the communication interface; the memory is used for storing instructions, the processor is used for executing the instructions, and the communication interface is used for communicating with other terminal equipment under the control of the processor; wherein the processor, when executing the instructions, performs the method described in the first aspect above.
In a fourth aspect, a computer-readable storage medium is provided that stores program code for failure analysis. The program code comprises instructions for performing the method described in the first aspect above.
In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method described in the first aspect above.
By implementing the embodiment of the invention, the fault reason of the fault equipment can be accurately analyzed and positioned by utilizing the fault tree, the fault detection efficiency is improved, the fault maintenance cost is reduced, and the user experience is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a schematic diagram of a motor driving circuit according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a motor fault tree provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a camera fault tree provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a stability fault tree provided by an embodiment of the present invention;
FIG. 5 is a diagram of a performance fault tree according to an embodiment of the present invention;
FIG. 6 is a schematic view of a visual inspection rule editing interface according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a visual inspection rule editing interface according to another embodiment of the present invention;
FIG. 8 is a schematic diagram of a fault tree provided by an embodiment of the present invention;
FIG. 9 is a schematic flow chart of a fault analysis method according to an embodiment of the present invention;
FIG. 10A is a schematic structural diagram of a terminal device according to an embodiment of the present invention;
FIG. 10B is a schematic structural diagram of another terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described in detail below with reference to the accompanying drawings of the present invention.
In the course of proposing the present application, the inventors found that the prior art determines the cause of a device fault using one of the following two schemes.
The first scheme relies on the experience of a maintenance engineer. This increases labor cost and makes repairs take too long. For complex faults in particular, the maintenance engineer cannot determine the fault cause from experience alone, so the fault detection rate is low and the repair time is prolonged.
The second scheme analyzes the device's log records and combines this with diagnosis by a maintenance engineer to locate the fault cause and output a corresponding solution. Its processing efficiency is low and it also increases labor cost.
To solve these problems, the present application locates the fault cause of a faulty device quickly and accurately on the basis of a fault tree and then gives a corresponding repair suggestion. The following briefly describes embodiments associated with the fault tree.
First, some theoretical knowledge involved in fault trees is introduced.
Fault tree analysis (FTA) starts from a possible event (the top event) and searches layer by layer, from top to bottom, for the direct and indirect causes of the top event until the basic causes (bottom-layer causes) are reached, and expresses the logical relationships among these events in a logic diagram. That is, a fault tree is a logical cause-and-effect diagram whose elements are events and logic gates. Events describe the state of the system and component failures, and logic gates connect the events and represent the logical relationships among them. The events include, but are not limited to, top events, intermediate events, and bottom events. The logic gates include, but are not limited to, AND gates, OR gates, NOT gates, voting gates, and XOR gates.
For example, FIG. 1 shows a schematic diagram of a motor driving circuit. When the circuit switch is closed and the motor does not rotate (does not work), the corresponding motor fault tree is as shown in FIG. 2. The direct cause of the motor not rotating after the switch is closed is either that the motor is faulty or that there is no power supply after the switch is closed. The indirect or bottom-layer causes under each direct cause are then determined: the cause of no power supply after the switch is closed is either a power failure or a line failure. Once the bottom-layer causes have been identified, the analysis ends.
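The motor example can be sketched in code. The following is a minimal illustration, not the patent's implementation: the `Node` class, the `occurs` function, and the node names are assumptions introduced here, and only AND and OR gates are modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One event in a fault tree; `gate` combines the children ("OR" or "AND")."""
    name: str
    gate: str = "OR"
    children: List["Node"] = field(default_factory=list)


def occurs(node: Node, basic_events: Dict[str, bool]) -> bool:
    """Return True if the event at `node` occurs given the observed bottom events."""
    if not node.children:
        # Bottom event: look up its observed state (unknown events count as False).
        return basic_events.get(node.name, False)
    results = [occurs(child, basic_events) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


# Tree for FIG. 2 as described in the text (node names are illustrative).
no_power = Node("no power after switch closed", "OR",
                [Node("power failure"), Node("line failure")])
motor_tree = Node("motor does not rotate", "OR",
                  [Node("motor fault"), no_power])

print(occurs(motor_tree, {"line failure": True}))  # True
print(occurs(motor_tree, {}))                      # False
```

A line failure alone is enough to make the top event occur because both gates on its path are OR gates.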
Next, a construction example of a fault tree for equipment fault analysis in the embodiment of the present invention is described. That is, some embodiments are introduced that are involved in building the fault tree.
First, a failure experience library is designed. The fault tree analysis method is applied to the terminal equipment to detect and locate the fault reason of the equipment, and the top event requirement of the fault tree analysis method is a fault phenomenon which can be sensed by a user. Therefore, the fault experience base is established based on the fault phenomenon perceivable by the user, and a statistical table of the fault phenomenon of the mobile phone perceivable by the user is exemplarily given in the following table 1.
TABLE 1
[Table 1: statistical table of user-perceivable mobile phone fault phenomena; provided as image GPA0000285919080000061 in the original publication]
As can be seen from Table 1 above, taking the camera as an example, the fault phenomena perceivable by the user include the main/secondary camera not working, focusing failure, jitter or abnormal noise when taking photos, and astigmatism.
In an alternative embodiment, the problem data generated in the production stage, the test (beta) stage, and the post-commercial-release stage of the equipment can be systematically analyzed by fault phenomenon and stored in the fault experience library.
An example of the determination of the failed node involved in the failure tree is described below.
1) Selecting top events
The developer systematically analyzes the problems that have occurred at the various stages of the device (e.g., the production stage, the test stage, and the commercial stage described above). Top events are selected based on these problems in combination with the fault phenomena recorded in work orders. A top event is required to be a fault phenomenon that appears when the equipment fails, and the top events must at least cover the work order fault phenomena. A work order fault phenomenon may refer to an equipment fault phenomenon recorded in a work order or statistical table fed back by a service outlet or a maintenance engineer, or to an equipment fault phenomenon configured in a user-defined manner by other users or equipment, which is not limited in the present invention.
2) Failure panoramic analysis
The developer can systematically analyze the problem data generated at each stage of the equipment, for example by analyzing how the problems (i.e., faults) are distributed across the layers of the system, taking the system hierarchy of the equipment as a reference. On the basis of the existing fault detection, more fault detection points are then identified to improve fault detection coverage.
Taking a camera as an example, fig. 3 shows a schematic diagram of a camera failure analysis. As can be seen from fig. 3, the faults that may occur in the camera are analyzed and subjected to data statistics from an Application Layer (APP), a framework Layer (framework), a Hardware Abstraction Layer (HAL), a kernel (kernel), and an Image Signal Processor (ISP), so as to obtain the fault expression and the probability of the faults (i.e., the fault ratio or the failure ratio) when the faults occur in each Layer.
3) Fault decomposition: determining intermediate events and bottom events
The top event is decomposed layer by layer in combination with the fault phenomenon until the bottom event is decomposed (i.e. the delimitation detection can be supported), and the delimitation detection is detailed below.
Taking a stability fault (crash and restart) as an example, FIG. 4 shows a stability fault tree. The direct causes of a crash and restart include an upper-layer restart or a complete-machine restart. The cause of an upper-layer restart is a problem with the system virtual machine restart (e.g., android vmrebot) or the virtual machine watchdog (Vm Watchdog). The cause of a complete-machine restart can include at least one of the following: system errors (panic), watchdog problems (e.g., Hw Watchdog), and hardware failures (Hardware Fault).
In an optional embodiment, an association detection design can be added. Specifically, to ensure that the fault tree covers more of the fault causes behind the same fault phenomenon, including direct and indirect causes (the intermediate events and bottom events mentioned above), fault nodes (i.e., events) from different domains that cause the same or similar fault phenomena can be added to the fault tree. How fault nodes with similar fault phenomena in different domains are obtained is not detailed or limited herein; for example, they can be detected using an associated fault analysis model.
Taking performance faults (stuttering, slow response, lack of smoothness) as an example, such symptoms can be caused by any one or a combination of the following causes from different domains: system resource problems, device problems (device aging), bugs in the application itself, and so on. FIG. 5 shows a performance fault tree, which includes the fault nodes for the fault causes, at each level currently considered, that may cause the system to stutter, respond slowly, or run unsmoothly; they are not described in detail here. The fault nodes may include a fault phenomenon node and fault cause nodes distributed in multiple layers. The fault phenomenon node is associated with the top event and indicates a fault phenomenon. A middle-layer fault cause node is associated with an intermediate event and indicates an intermediate cause of the fault phenomenon. A bottom-layer fault cause node is associated with a bottom event and indicates a bottom-layer cause of the fault phenomenon.
In alternative embodiments, the fault cause, in particular the bottom-layer cause (also referred to as the root cause), may include, but is not limited to, any one or more of the following: component failures, environmental impacts, software bugs, human error, system failures, or other factors.
Second, delimitation detection is designed. To locate equipment faults quickly and accurately, delimitation detection based on fault codes can be designed to provide an effective fault description (i.e., the fault phenomenon) and, optionally, fault repair suggestions. Each fault code is associated with a fault node and identifies that node.
Taking a battery failure as an example, the following table 2 shows a fault coding delimiting statistical table.
TABLE 2
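[Table 2: fault code delimitation statistical table for battery faults; provided as images GPA0000285919080000071 and GPA0000285919080000081 in the original publication]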
As can be seen from table 2 above, when a fault occurs in the terminal device, the fault code can be automatically recorded, so as to accurately know the corresponding fault type. Accordingly, when fault analysis is performed on the terminal device (also called fault device), the fault judgment rule can be utilized to judge whether the fault node corresponding to the fault code has a fault which affects normal use of a user, so that corresponding fault phenomena (namely fault description) and fault maintenance suggestions are rapidly given. Details about the failure determination rule will be described later.
Thirdly, designing a fault judgment rule. In order to determine whether a failed node is actually a cause node for causing a failure phenomenon, a corresponding failure determination rule needs to be designed for each failure cause node. The fault determination rules include, but are not limited to, any one or more of the following: alarm class rules, command class rules, log class rules, performance class rules, stability class rules, and the like.
In an optional embodiment, the system may provide a visual editing interface to the user, so that the user sets a corresponding fault determination rule for the fault node in the visual editing interface.
In an optional embodiment, the fault determination rule may include, but is not limited to, a fault code. Specifically, the rule may be that the number of times the fault phenomenon occurs at the fault node corresponding to the fault code within a preset time period exceeds a preset threshold, or another user-defined rule, which serves as the basis for determining whether the node causes the fault phenomenon. Optionally, the fault determination rule may also be a combined rule that includes both a fault code and influence parameters. The number of influence parameters is not limited. An influence parameter is a parameter that affects whether the fault node is determined to be a node causing the fault phenomenon, and a determination condition can be set for each parameter. When there are multiple influence parameters, the logical relationship among them also needs to be set, which is not limited here.
Taking a command class rule as an example, the fault determination rule may be set from a fault code (i.e., a fault node) alone or from a fault code combined with influence parameters, in order to determine whether the fault phenomenon has occurred at the fault node identified by the fault code. Accordingly, during fault detection of the faulty device, the fault codes can be obtained from the fault database together with the number of times the fault phenomenon corresponding to each fault code occurred within a preset time period. This is then combined with the fault phenomenon reported by the user to determine whether a fault affecting normal use has occurred and what its cause is, so that the faulty device can be repaired in a targeted manner.
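The count-within-a-period rule described above can be sketched as follows. This is only an illustration under stated assumptions: the log format of `(timestamp, fault code)` pairs, the function name `code_exceeds_threshold`, and the codes `BAT_001` and `CAM_007` are hypothetical and do not come from Table 2.

```python
from typing import Iterable, Tuple


def code_exceeds_threshold(events: Iterable[Tuple[float, str]],
                           fault_code: str,
                           now: float,
                           window_seconds: float,
                           threshold: int) -> bool:
    """True if `fault_code` was recorded more than `threshold` times in the window."""
    start = now - window_seconds
    count = sum(1 for ts, code in events
                if code == fault_code and start <= ts <= now)
    return count > threshold


# Hypothetical fault log: (timestamp in seconds, fault code).
log = [(10.0, "BAT_001"), (50.0, "BAT_001"), (55.0, "CAM_007"), (90.0, "BAT_001")]

# Two BAT_001 records fall inside the last 60 seconds, which exceeds a threshold of 1.
print(code_exceeds_threshold(log, "BAT_001", now=100.0,
                             window_seconds=60.0, threshold=1))  # True
```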
FIGS. 6 and 7 show schematic diagrams of two visual editing interfaces. FIG. 6 shows a visual editing interface for setting the fault determination rule. The alarm code is the fault code corresponding to a fault node; the number of times that fault code occurs within a given period is counted. The parameter list contains all parameters that influence the fault phenomenon corresponding to the fault code; for example, FIG. 6 shows parameter 1 (parameter name Hname), parameter 2 (CPUfreq), and the parameter types, which may be integer (int), array, and so on. Rule editing defines the logical relationship between the parameters in the parameter list. The figure shows that the motherboard temperature parameter and the CPU operating frequency are joined by a logical AND; that is, the number of occurrences of the fault code within the period must exceed the set count, and the conditions on parameter 1 and parameter 2 must both be satisfied.
FIG. 7 illustrates a visual editing interface for influence parameters. Through this interface the user can set the influence parameters and the conditions they must meet. FIG. 7 shows that the user-edited parameter 1 (Hname->SON1) needs to be equal to (equal, EQ) 1, and parameter 2 (Hname->SON2) needs to be greater than (greater than, GT) 2. The above is merely an example of setting the fault determination rule and is not limiting.
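The parameter conditions of FIG. 7 can be expressed as data and evaluated generically. The sketch below is one possible encoding, assuming a tuple layout of `(parameter name, operator, reference value)`; the `OPERATORS` table, the `LT` operator, and the function name `conditions_hold` are assumptions added for illustration.

```python
import operator
from typing import Any, Dict, List, Tuple

# Comparison operators named as in the editing interface (EQ, GT); LT is an added assumption.
OPERATORS = {"EQ": operator.eq, "GT": operator.gt, "LT": operator.lt}

Condition = Tuple[str, str, Any]  # (parameter name, operator name, reference value)


def conditions_hold(conditions: List[Condition],
                    params: Dict[str, Any],
                    logic: str = "AND") -> bool:
    """Evaluate every condition and combine the results with AND or OR."""
    results = [name in params and OPERATORS[op](params[name], ref)
               for name, op, ref in conditions]
    return all(results) if logic == "AND" else any(results)


# The two conditions described for FIG. 7, joined by a logical AND.
fig7_conditions = [("Hname->SON1", "EQ", 1), ("Hname->SON2", "GT", 2)]

print(conditions_hold(fig7_conditions, {"Hname->SON1": 1, "Hname->SON2": 3}))  # True
print(conditions_hold(fig7_conditions, {"Hname->SON1": 1, "Hname->SON2": 2}))  # False
```

A missing parameter is treated as an unmet condition here, which is a design choice of this sketch rather than something the text specifies.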
In an alternative embodiment, the fault determination rules may be decoupled from the detection tool code. Specifically, the fault determination rules are updated separately in the form of a detection configuration file. This solves the prior-art problem that the fault determination rules depend on the version of the detection tool, so that fault detection needs arising after product commercialization can be met more quickly. The detection configuration file can be updated or replaced to meet the real-time requirements of fault detection and is not limited by the version of the detection tool (software).
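One way to picture this decoupling is a rule set stored as data and parsed at run time. The patent does not specify the file format; the JSON layout, the field names, and the fault code `BAT_001` below are assumptions made for illustration.

```python
import json

# Hypothetical contents of a detection configuration file.
CONFIG_TEXT = """
{
  "version": 3,
  "rules": [
    {"fault_code": "BAT_001", "window_seconds": 3600, "threshold": 5}
  ]
}
"""


def load_rules(config_text: str) -> dict:
    """Parse the configuration and index the rules by fault code."""
    config = json.loads(config_text)
    return {rule["fault_code"]: rule for rule in config["rules"]}


rules = load_rules(CONFIG_TEXT)
print(rules["BAT_001"]["threshold"])  # 5
```

Because the rules are plain data, replacing the file changes the detection behavior without rebuilding the tool.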
Fourth, a detection engine is designed. In the process of detecting the equipment fault, the detection engine is actually used for realizing the detection and the positioning of the fault reason of the equipment. Therefore, the detection engine needs to satisfy the following four principles:
1) The engine can traverse the fault tree and supports parsing the logical relationships between fault nodes and fault phenomena, that is, the logical relationships between any two or more nodes among the fault phenomenon node and the fault cause nodes.
2) The engine supports parsing the fault determination rule bound to or associated with each fault node, so as to determine whether the fault phenomenon has occurred at that fault node.
3) The engine can determine the fault cause (i.e., a bottom-layer cause or an intermediate cause) from the fault tree traversal and the determination results of the fault nodes.
4) The engine outputs the fault cause corresponding to the detected fault cause node. Optionally, it may also output corresponding repair suggestions.
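The four principles can be sketched as a small recursive engine. This is a simplified illustration, not the patented engine: every gate is treated as an OR gate for brevity, and the `CauseNode` class, the `locate_causes` function, the fault codes, the measurements, and the advice strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Rule = Callable[[Dict[str, float]], bool]


@dataclass
class CauseNode:
    code: str                  # hypothetical fault code identifying the node
    rule: Rule                 # fault determination rule bound to the node
    advice: str = ""           # optional repair suggestion
    children: List["CauseNode"] = field(default_factory=list)


def locate_causes(node: CauseNode, data: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return (code, advice) for the deepest nodes whose rules hold along a true path."""
    if not node.rule(data):
        return []              # the rule fails, so this branch is ruled out
    found: List[Tuple[str, str]] = []
    for child in node.children:
        found.extend(locate_causes(child, data))
    # If no child explains the fault, report this node itself (an intermediate cause).
    return found or [(node.code, node.advice)]


# Illustrative performance tree: the root stands for the fault phenomenon node.
tree = CauseNode("PERF", lambda d: d["fps"] < 30, children=[
    CauseNode("PERF_MEM", lambda d: d["free_mem_mb"] < 200, "Close background apps"),
    CauseNode("PERF_TEMP", lambda d: d["board_temp_c"] > 45, "Let the device cool down"),
])

print(locate_causes(tree, {"fps": 20, "free_mem_mb": 100, "board_temp_c": 30}))
# [('PERF_MEM', 'Close background apps')]
```

When the phenomenon is confirmed but no child rule holds, the sketch falls back to reporting the intermediate node, mirroring the text's distinction between bottom-layer and intermediate causes.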
In an alternative embodiment, the developer or the system may design the fault tree constructed as described above as a detection tool (also referred to as a repair detection tool) for fault location analysis, so that fault detection of the faulty equipment can learn the corresponding fault cause and give a corresponding fault repair suggestion. The fault tree, including the fault determination rule and the fault phenomenon information, can be set/converted into a configuration file. Optionally, the system may provide a visual editing interface for a developer to edit to construct the fault tree, or for the developer to complete the fault tree at any time through the visual editing interface, which is not limited in the present invention.
In an alternative embodiment, the developer may update the configuration file as needed or periodically, for example by periodically refining the fault tree through the visual editing interface. The developer can then upload the configuration file to a cloud server so that users can promptly download the latest version. Correspondingly, a terminal device on which the maintenance detection tool is installed can periodically obtain a new version of the configuration file from the cloud server over a network. If the terminal device detects that its configuration file is not the latest version, it downloads the latest version and updates the file, and then uses the latest configuration file (i.e., fault tree) to detect and locate the fault cause of the faulty device and give a corresponding repair suggestion.
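The update decision itself reduces to a version comparison. The sketch below assumes integer version numbers and represents an unreachable server as `None`; the patent does not state the version format, and the function name `pick_update` is hypothetical.

```python
from typing import Optional


def pick_update(local_version: int, remote_version: Optional[int]) -> bool:
    """Return True when a newer configuration file should be downloaded."""
    if remote_version is None:
        # Server unreachable: keep using the local configuration file.
        return False
    return remote_version > local_version


print(pick_update(3, 4))     # True
print(pick_update(4, 4))     # False
print(pick_update(4, None))  # False
```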
Finally, related embodiments of how to determine the underlying cause (i.e., root cause for short) from the fault tree are described.
The criterion for identifying the root cause is as follows: there is a path in the fault tree on which every fault node is true (i.e., a fault has occurred at each node), and if a logical AND gate lies on the path, all branches of that AND gate must be true paths. Specifically, the states of all leaf nodes are matched against the minimal cut sets; if some minimal cut set is satisfied and every node on the path containing that minimal cut set is true, the root cause has been found.
FIG. 8 shows a schematic diagram of a fault tree. In FIG. 8, A = B or C; B = D or E; C = F and G; hence A = D or E or (F and G). The minimal cut sets are therefore (D), (E), and (F, G). The occurrence of any one of the minimal cut sets causes A to occur.
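The minimal cut sets for this tree can be derived mechanically. The following sketch assumes the gate structure stated in the text for FIG. 8 and handles only AND and OR gates; the `GATES` table and the `cut_sets` function are names introduced here for illustration.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set, Tuple

# Gate table: node -> (gate type, children), following the relations given for FIG. 8.
GATES: Dict[str, Tuple[str, List[str]]] = {
    "A": ("OR", ["B", "C"]),
    "B": ("OR", ["D", "E"]),
    "C": ("AND", ["F", "G"]),
}


def cut_sets(node: str) -> Set[FrozenSet[str]]:
    """Return the minimal cut sets of `node` as sets of leaf names."""
    if node not in GATES:
        # Leaf: the event on its own is a cut set.
        return {frozenset([node])}
    gate, children = GATES[node]
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child is enough to trigger an OR gate.
        combined = set().union(*child_sets)
    else:
        # AND gate: take one cut set from every child and merge them.
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    # Keep only minimal sets: drop any set that strictly contains another.
    return {s for s in combined if not any(t < s for t in combined)}


print(sorted(sorted(s) for s in cut_sets("A")))  # [['D'], ['E'], ['F', 'G']]
```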
In an alternative embodiment, if an intermediate node is true (a fault occurs) but its leaf nodes (i.e., bottom nodes) are false, and the intermediate node has been confirmed to be true, this indicates that the leaf nodes are incomplete: there may be an unknown leaf node, i.e., a bottom-layer cause that is not yet known and has not been written into the fault tree.
Accordingly, if the intermediate node is false (no failure occurs) and the leaf node (i.e., the bottom node) is true (failure occurs), it indicates that the leaf node is not the root cause, and there may be some unknown conditions (i.e., unknown failure determination rules) on the leaf node that make the leaf node not necessarily a sufficient condition for the parent node.
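These two inconsistency cases can be checked automatically after a traversal. The sketch below assumes the leaves are joined to the intermediate node by an OR gate; the function name `diagnose`, the leaf names, and the returned messages are illustrative assumptions.

```python
from typing import Dict


def diagnose(intermediate_true: bool, leaf_states: Dict[str, bool]) -> str:
    """Compare an intermediate node's state with its leaves (OR gate assumed)."""
    any_leaf_true = any(leaf_states.values())
    if intermediate_true and not any_leaf_true:
        return "incomplete leaves: an unknown bottom-layer cause is missing from the tree"
    if not intermediate_true and any_leaf_true:
        return "leaf not sufficient: an unknown condition is missing from its determination rule"
    return "consistent"


print(diagnose(True, {"power failure": False, "line failure": False}))
print(diagnose(False, {"power failure": True, "line failure": False}))
print(diagnose(True, {"power failure": True, "line failure": False}))
```

Either inconsistent result is a signal that the fault tree or its rules should be revised, for example through the visual editing interface.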
Based on the foregoing embodiments, embodiments of a specific fault analysis method according to the present invention are described below. Fig. 9 is a schematic flow chart of a fault analysis method according to an embodiment of the present invention. The fault analysis method shown in fig. 9 includes the following implementation steps:
step S902, the fault detection equipment acquires fault description information, wherein the fault description information is used for describing a fault phenomenon of the fault equipment;
step S904, the fault detection device traverses a fault tree according to the fault phenomenon, so as to obtain a fault cause of the fault device, where the fault tree reflects a correspondence between the fault phenomenon and the fault cause.
The fault tree may be configured separately in a configuration file, so that the configuration file can be updated periodically. The fault tree is decoupled from the detection tool code and neither depends on nor is limited by the version of the detection tool. Accordingly, the configuration file may be downloaded and updated in a wired or wireless manner, which is specifically described in the foregoing embodiments and will not be described in detail herein.
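One way to picture a fault tree kept apart from the tool code is a JSON configuration that the tool parses at run time. The patent does not specify a file format; the JSON schema below (keys `version`, `phenomenon`, `gates`, `gate`, `children`) and the function `load_fault_tree` are assumptions made purely for illustration.

```python
import json

# Hypothetical configuration content; in practice it would be read from a
# file that is downloaded and updated independently of the detection tool.
CONFIG_TEXT = """
{
  "version": 1,
  "phenomenon": "A",
  "gates": {
    "A": {"gate": "or",  "children": ["B", "C"]},
    "B": {"gate": "or",  "children": ["D", "E"]},
    "C": {"gate": "and", "children": ["F", "G"]}
  }
}
"""


def load_fault_tree(text):
    """Parse the configuration and return (phenomenon, gates).

    gates maps a node name to (gate_type, children); nodes that do not
    appear as keys are treated as bottom-layer (leaf) cause nodes.
    """
    config = json.loads(text)
    gates = {}
    for name, spec in config["gates"].items():
        gate, children = spec["gate"], list(spec["children"])
        if gate not in ("and", "or"):
            raise ValueError(f"node {name!r}: unsupported gate {gate!r}")
        if not children:
            raise ValueError(f"node {name!r}: a gate needs at least one child")
        gates[name] = (gate, children)
    if config["phenomenon"] not in gates:
        raise ValueError("the fault phenomenon node must be defined as a gate")
    return config["phenomenon"], gates
```

Because the tree is plain data, replacing the configuration changes the diagnosis logic without rebuilding or redeploying the tool that interprets it.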
In an alternative embodiment, the fault tree is obtained by a user (specifically, a developer) through editing and storing through a visual editing interface according to experience accumulation.
In an alternative embodiment, the fault tree includes fault phenomenon nodes, fault cause nodes distributed in multiple layers, and logical relationships (i.e., logic gates) between the nodes. And the fault phenomenon node is used for indicating the fault phenomenon when fault equipment fails. The fault reason nodes distributed in the multiple layers comprise a middle layer fault reason node and a bottom layer fault reason node, and the middle layer fault reason node is used for indicating an intermediate reason (intermediate event) causing the fault phenomenon. The underlying fault cause node is used to indicate the underlying cause (i.e., root cause, underlying event) that caused the fault phenomenon. For details, reference may be made to the related explanations in the foregoing embodiments, and details are not described here. Alternatively, the fault tree may be an N-ary tree, where N is a positive integer. When N is 2, the tree is a binary tree.
In an optional embodiment, if no fault determination rule is set on the nodes, the fault detection device may find the fault cause corresponding to the fault phenomenon directly from the logical relationships between the nodes. When the logical relationship between the nodes is not satisfied (for example, not every branch of a logical AND gate is true), none of those nodes is a fault cause node for the fault phenomenon, and no fault cause corresponding to the fault phenomenon is found.
In an optional embodiment, each fault cause node in the multi-layer distribution has a corresponding fault determination rule, where the fault determination rule is used to determine whether the cause indicated by the corresponding fault cause node actually causes the fault phenomenon.
Specifically, a corresponding fault determination rule may be set for each fault cause node in the fault tree, where the fault determination rule is used to determine whether the fault cause node is a cause node that causes the fault phenomenon. For the setting of the failure determination rule, reference may be made to the related explanations in the foregoing embodiments, and details are not described here.
Correspondingly, in step S904, after receiving the fault description information, the fault detection device may determine a fault sub-tree from the fault tree according to the fault phenomenon described by the fault description information, where the fault sub-tree is used to locate a fault cause of the fault phenomenon. The fault sub-tree is part of the fault tree and includes at least the fault phenomena and fault causes associated with the fault phenomena, i.e., fault phenomenon nodes and multi-level fault cause nodes.
Specifically, the fault detection device may determine, according to the fault determination rule, whether the corresponding fault cause node is a node that causes the fault phenomenon. If so, it continues with the next fault cause node on the same path. If the determination result is yes for every fault cause node on a path, the cause indicated by the last, bottom-layer fault cause node on that path is the fault cause of the fault device. For how to traverse the fault tree to determine the root cause of the fault phenomenon (i.e., the fault cause in the present application), reference may be made to the related explanation of root cause determination in the foregoing embodiments, and details are not described here.
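The rule-driven traversal described above can be sketched as a recursive walk. The `CauseNode` class, the `find_root_causes` function, and every node name, evidence key, and threshold in the example tree are hypothetical; the patent only states that each fault cause node carries a fault determination rule (for example an alarm, command, log, or performance rule).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CauseNode:
    name: str
    rule: Callable[[Dict], bool]  # fault determination rule for this node
    gate: str = "or"              # how the children combine: "or" / "and"
    children: List["CauseNode"] = field(default_factory=list)


def find_root_causes(node, evidence):
    """Return the bottom-layer causes that lie on all-true paths below `node`."""
    if not node.rule(evidence):
        return []                 # this path is not a true path
    if not node.children:
        return [node.name]        # bottom-layer node reached: a root cause
    results = [find_root_causes(child, evidence) for child in node.children]
    if node.gate == "and" and not all(results):
        return []                 # every branch of an AND gate must be true
    return [name for branch in results for name in branch]


# Hypothetical tree: one alarm-class, two log-class, two performance-class rules.
EXAMPLE_TREE = CauseNode(
    "service unreachable",
    lambda e: e["phenomenon"] == "service unreachable",
    children=[
        CauseNode("link failure", lambda e: "LINK_DOWN" in e["alarms"], children=[
            CauseNode("optical module fault", lambda e: "optical power low" in e["log"]),
        ]),
        CauseNode("overload", lambda e: e["cpu_percent"] > 90, gate="and", children=[
            CauseNode("traffic burst", lambda e: e["packets_per_second"] > 1_000_000),
            CauseNode("memory exhaustion", lambda e: "out of memory" in e["log"]),
        ]),
    ],
)
```

An empty result for a node whose own rule is true corresponds to the case discussed earlier in which the known bottom-layer causes are incomplete.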
In an optional embodiment, the fault detection device may further recommend a fault repair suggestion corresponding to the fault cause according to the fault cause.
In an optional embodiment, when the fault detection device is the fault device itself, the fault device first needs to obtain a configuration file including the fault tree from a maintenance detection tool, so as to perform fault analysis by using the fault tree. The fault description information may be recorded and reported by the fault device when it fails, or may be entered by a user while using the device, and so on, which is not limited in the present invention. The fault detection device may also be a device other than the fault device.
In an alternative embodiment, the failed device refers to a device in which a failure has occurred. The fault detection equipment is equipment that supports fault analysis and location by using a maintenance detection tool (or by using a fault tree). The device may be a user device, a server, a smart phone (such as an Android phone, an iOS phone, etc.), a personal computer, a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a wearable smart device, or another Internet device, and the embodiment of the present invention is not limited in this respect.
For details that are not described in the embodiments of the present invention, reference may be made to the related explanations in the foregoing embodiments, which are not described herein again.
By implementing the embodiment of the invention, the fault reason of the fault equipment can be accurately analyzed and positioned by utilizing the fault tree, the fault detection efficiency is improved, the fault maintenance cost is reduced, and the user experience is improved.
The above description mainly introduces the scheme provided by the embodiment of the present invention from the perspective of the interaction between the fault detection device and the fault device. It will be appreciated that the fault detection device, in order to implement the above-described functions, comprises corresponding hardware structures and/or software modules for performing the respective functions. The elements and algorithm steps of the various examples described in connection with the embodiments disclosed herein may be embodied in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present teachings.
The embodiment of the present invention may perform functional unit division on the fault detection device according to the above method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the units in the embodiment of the present invention is schematic and is only a logical function division; there may be another division manner in actual implementation.
In the case of an integrated unit, fig. 10A shows a schematic view of a possible structure of the fault detection device involved in the above-described embodiment. The fault detection device 900 includes: a processing unit 902 and a communication unit 903. The processing unit 902 is configured to control and manage the actions of the fault detection apparatus 900, for example, the processing unit 902 is configured to support the fault detection apparatus 900 to perform step S904 in fig. 9, and/or to perform other steps of the techniques described herein. The communication unit 903 is used to support the communication of the fault detection device 900 with a faulty device or other devices, e.g., the communication unit 903 is used to support the fault detection device 900 to perform step S902 in fig. 9, and/or to perform other steps of the techniques described herein. The fault detection device 900 may further comprise a storage unit 901 for storing program codes and data of the fault detection device 900.
The processing unit 902 may be a processor, such as a Central Processing Unit (CPU), a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination that implements computing functions, e.g., a combination of one or more microprocessors, or a combination of a DSP and a microprocessor. The communication unit 903 may be a communication interface, a transceiver circuit, etc., wherein the communication interface is a generic term and may include one or more interfaces, such as an interface between a fault detection device and a fault device. The storage unit 901 may be a memory.
When the processing unit 902 is a processor, the communication unit 903 is a communication interface, and the storage unit 901 is a memory, the fault detection device according to the embodiment of the present invention may be the fault detection device shown in fig. 10B.
Referring to fig. 10B, the fault detection apparatus 910 includes: a processor 912, a communication interface 913, and a memory 911. Optionally, the fault detection apparatus 910 may also include a bus 914. The communication interface 913, the processor 912, and the memory 911 may be connected to each other through the bus 914; the bus 914 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 914 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 10B, but this does not mean that there is only one bus or only one type of bus.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware or in software executed by a processor. The software instructions may be comprised of corresponding software modules that may be stored in a Random Access Memory (RAM), a flash memory, a Read Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), a register, a hard disk, a removable hard disk, a Compact Disc Read Only Memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a fault detection device. Of course, the processor and the storage medium may also reside as discrete components in a fault detection device.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. And the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Claims (14)

1. A method of fault analysis, the method comprising:
the fault detection equipment obtains fault description information, wherein the fault description information is used for describing a fault phenomenon of the fault equipment;
the fault detection equipment traverses a fault tree according to the fault phenomenon so as to obtain a fault reason of the fault equipment, wherein the fault tree reflects the corresponding relation between the fault phenomenon and the fault reason;
the fault tree is separately set in a configuration file, wherein the configuration file can be separately updated in a wired or wireless manner; the fault tree comprises fault phenomenon nodes and fault reason nodes distributed in a multi-layer mode, wherein the fault reason nodes in the middle layer are used for indicating middle reasons causing the fault phenomenon, and the fault reason nodes in the bottom layer are used for indicating root reasons causing the fault phenomenon;
each fault reason node in the fault reason nodes distributed in the multilayer mode is provided with a corresponding fault judgment rule, wherein the fault judgment rule is used for judging the correctness of the reason that the corresponding fault reason node indicates the fault phenomenon; the failure determination rules are individually set in the configuration file;
the fault decision rule is decoupled from the fault detection tool code, and the fault tree is decoupled from the fault detection tool code.
2. The method of claim 1, wherein the fault determination rule comprises at least one of: alarm class rules, command class rules, log class rules, performance class rules.
3. The method of claim 1, wherein at least one of the plurality of fault phenomenon nodes and the plurality of layers of fault cause nodes is characterized by a pre-encoded code character, and wherein different nodes correspond to different code characters.
4. The method of claim 1, wherein the fault tree is compiled and stored by a user through a visual compilation interface based on experience.
5. The method of claim 1, further comprising:
and the fault detection equipment recommends a fault maintenance suggestion corresponding to the fault reason according to the fault reason.
6. The method according to any of claims 1 to 5, wherein the cause of failure comprises at least one of: component failure, environmental impact, software defects, human error, system failure.
7. A fault detection device, comprising a communication unit and a processing unit,
the communication unit is used for acquiring fault description information, wherein the fault description information is used for describing fault phenomena of fault equipment;
the processing unit is used for traversing a fault tree according to the fault phenomenon so as to obtain a fault reason of the fault equipment, wherein the fault tree reflects the corresponding relation between the fault phenomenon and the fault reason;
the fault tree comprises fault phenomenon nodes and fault reason nodes distributed in a multi-layer mode, wherein the fault reason nodes in the middle layer are used for indicating intermediate reasons causing the fault phenomenon, and the fault reason nodes in the bottom layer are used for indicating root causes causing the fault phenomenon;
each fault reason node in the fault reason nodes distributed in the multilayer mode is provided with a corresponding fault judgment rule, wherein the fault judgment rule is used for judging the correctness of the reason of the fault phenomenon caused by the indication of the corresponding fault reason node;
the fault decision rule is decoupled from the fault detection tool code, and the fault tree is decoupled from the fault detection tool code.
8. The fault detection device of claim 7, wherein the fault decision rule comprises at least one of: alarm class rules, command class rules, log class rules, performance class rules.
9. The fault detection device of claim 7, wherein at least one of the plurality of fault phenomenon nodes and the plurality of layers of distributed fault cause nodes is characterized by a pre-encoded code character, and wherein different nodes correspond to different code characters.
10. The fault detection device of claim 7, wherein the fault tree is compiled and stored by a user through a visual compilation interface based on experience accumulation.
11. The fault detection device of claim 7,
the processing unit is also used for recommending a fault maintenance suggestion corresponding to the fault reason according to the fault reason.
12. The fault detection device of any one of claims 7 to 11, wherein the cause of the fault comprises at least one of: component failure, environmental impact, software defects, human error, system failure.
13. A fault detection device comprising a memory, a communication interface, and a processor coupled to the memory and the communication interface; the memory is configured to store instructions, the processor is configured to execute the instructions, and the communication interface is configured to communicate with a faulty device under control of the processor; wherein the processor, when executing the instructions, performs the method of any of claims 1 to 6.
14. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN201780094808.2A 2017-09-29 2017-09-29 Fault analysis method and related equipment Active CN111108481B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/104644 WO2019061364A1 (en) 2017-09-29 2017-09-29 Failure analyzing method and related device

Publications (2)

Publication Number Publication Date
CN111108481A CN111108481A (en) 2020-05-05
CN111108481B true CN111108481B (en) 2021-08-13

Family

ID=65900454

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780094808.2A Active CN111108481B (en) 2017-09-29 2017-09-29 Fault analysis method and related equipment

Country Status (2)

Country Link
CN (1) CN111108481B (en)
WO (1) WO2019061364A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446511A (en) * 2020-11-20 2021-03-05 中国建设银行股份有限公司 Fault handling method, device, medium and equipment
CN112416644A (en) * 2020-11-30 2021-02-26 中国航空工业集团公司西安航空计算技术研究所 Quick fault positioning method for airborne computer
CN114911513A (en) * 2021-02-09 2022-08-16 北京嘀嘀无限科技发展有限公司 Device detection method, device, storage medium, and computer program product
CN113672420B (en) * 2021-08-10 2022-08-09 荣耀终端有限公司 Fault detection method and electronic equipment
CN114859875B (en) * 2022-07-07 2022-11-15 深圳市信润富联数字科技有限公司 Fault management method, device, equipment and storage medium for multiple equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000073903A2 (en) * 1999-06-02 2000-12-07 Siemens Aktiengesellschaft Method and system for determining a fault tree of a technical system, computer program product and a computer readable storage medium
CN1553328A (en) * 2003-06-08 2004-12-08 华为技术有限公司 Fault tree analysis based system fault positioning method and device
CN1794187A (en) * 2004-12-21 2006-06-28 日本电气株式会社 Computer system and method for dealing with errors
CN101742540A (en) * 2010-02-05 2010-06-16 华为技术有限公司 Method and device for online self-diagnosis
CN102033789A (en) * 2010-12-03 2011-04-27 北京理工大学 Reliability analysis method for embedded safety-critical system
CN105335276A (en) * 2014-06-13 2016-02-17 联想(北京)有限公司 Fault detection method and electronic device
CN106130794A (en) * 2016-08-24 2016-11-16 上海卓易科技股份有限公司 A kind of fault handling method and device
CN106844145A (en) * 2016-12-29 2017-06-13 北京奇虎科技有限公司 A kind of server hardware fault early warning method and device
CN107025290A (en) * 2017-04-14 2017-08-08 北京航天发射技术研究所 The storage method and read method of a kind of fault tree data

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011132730A1 (en) * 2010-04-22 2011-10-27 日本電気株式会社 Runtime system fault tree analysis method, system and program
CN104218676B (en) * 2014-09-02 2015-11-18 广东电网有限责任公司茂名供电局 The intelligent warning system of power dispatching automation main website and method
CN105187533A (en) * 2015-09-10 2015-12-23 浪潮软件股份有限公司 Data transmission method and device

Also Published As

Publication number Publication date
WO2019061364A1 (en) 2019-04-04
CN111108481A (en) 2020-05-05

Similar Documents

Publication Publication Date Title
CN111108481B (en) Fault analysis method and related equipment
CN110928772B (en) Test method and device
US10901727B2 (en) Monitoring code sensitivity to cause software build breaks during software project development
CN105580032B (en) For reducing instable method and system when upgrading software
Ocariza et al. An empirical study of client-side JavaScript bugs
US11386154B2 (en) Method for generating a graph model for monitoring machinery health
US20130024842A1 (en) Software test automation systems and methods
US11327742B2 (en) Affinity recommendation in software lifecycle management
CN112241370B (en) API interface class checking method, system and device
Gholamian et al. A comprehensive survey of logging in software: From logging statements automation to log mining and analysis
CN113590454A (en) Test method, test device, computer equipment and storage medium
US11263072B2 (en) Recovery of application from error
CN110765007A (en) Crash information online analysis method for android application
CN114691403A (en) Server fault diagnosis method and device, electronic equipment and storage medium
CN116194894A (en) Fault localization of native cloud applications
Hassine et al. A framework for the recovery and visualization of system availability scenarios from execution traces
JP7190246B2 (en) Software failure prediction device
CN111679924B (en) Reliability simulation method and device for componentized software system and electronic equipment
CN114138537A (en) Crash information online analysis method for android application
CN105786865B (en) Fault analysis method and device for retrieval system
Murtaza et al. Identifying recurring faulty functions in field traces of a large industrial software system
CN113626288A (en) Fault processing method, system, device, storage medium and electronic equipment
CN115176233A (en) Performing tests in deterministic order
WO2004068347A1 (en) Method and apparatus for categorising test scripts
CN111694752A (en) Application testing method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant