CN117112283B - Parallel self-adaptive system level fault diagnosis method based on PMC model

Info

Publication number: CN117112283B
Application number: CN202311382342.XA
Authority: CN
Inventors: 樊卫北; 刘宣丽; 肖甫; 吕梦婕; 何昕; 王俊昌; 韩磊
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2023-10-24
Filing date: 2023-10-24
Publication date: 2024-01-26
Anticipated expiration: 2043-10-24
Also published as: CN117112283A

Abstract

The invention belongs to the field of fault diagnosis and discloses a parallel adaptive diagnosis method based on the PMC model. When a given system can be decomposed into several substructures, each containing a Hamiltonian cycle, each small Hamiltonian cycle is first tested in the clockwise direction, and the cycles are classified as suspicious or correct. The suspicious Hamiltonian cycles are then tested further and divided into several sequences; according to the number of sequences and the test rules of the PMC model, the nodes are classified as fault nodes, fault-free nodes and unknown nodes. A further round of tests is carried out on the remaining unknown nodes to determine whether each is faulty or fault-free, so that all fault nodes are diagnosed. The invention locates fault nodes quickly and accurately and has good universality: the diagnosis method can be adopted as long as the multiprocessor system can be divided into several Hamiltonian cycle structures, and it has broad market prospects in fault diagnosis of multiprocessor systems.

Description

Parallel self-adaptive system level fault diagnosis method based on PMC model

Technical Field

The invention belongs to the field of fault diagnosis, and particularly relates to a parallel self-adaptive system level fault diagnosis method based on a PMC model.

Background

With the continued development of information technology, multiprocessor systems play a vital role in modern computing. A multiprocessor system is composed of multiple processors or computing cores that execute tasks in parallel, increasing computing power and processing speed. Such high-performance computing systems are widely used in supercomputers, data centers, cloud computing and other high-performance computing tasks, and bring great convenience and benefit to scientific research, commercial applications and social life. The importance of multiprocessor systems in modern computing is therefore self-evident.

However, as the number of processors continues to grow, multiprocessor systems face more serious challenges. In such a large-scale system, processor failure is unavoidable. Processor failures can cause not only task interruption and system crashes, but also serious economic losses to enterprises and users. It is therefore particularly important to diagnose processor faults quickly and accurately and to ensure the reliability of the system. Fault diagnosis is the process of determining the cause and location of a fault, which is critical to taking appropriate fault-handling measures quickly. Reliability refers to the ability of a system to keep functioning properly for a predetermined period of time without being affected by faults. When fault nodes exist in a multiprocessor system, they need to be located and detected rapidly and then repaired or replaced, so that the reliability of the system is improved. Fault diagnosis is therefore an important factor in ensuring system reliability.

Disclosure of Invention

In order to fill this gap in the prior art, the invention provides a parallel adaptive fault diagnosis method based on the PMC model. When a given multiprocessor system can be decomposed into several subsystems, each containing a Hamiltonian cycle, the Hamiltonian cycles are first divided into suspicious Hamiltonian cycles and correct Hamiltonian cycles according to the PMC test result set. PMC diagnosis is then carried out in parallel on the suspicious Hamiltonian cycles: each suspicious Hamiltonian cycle is divided into several sequences according to the test results, and the state of each node in the network is preliminarily determined from the characteristics of the sequences and the test results. There are three states: fault node, fault-free node and unknown node. In the subsequent tests, the known fault-free nodes are used to test the unknown nodes, which completes the diagnosis of the fault states of all nodes in the system.

In order to achieve the above purpose, the invention is realized by the following technical scheme:

A given multiprocessor system is decomposed into substructures, each containing a Hamiltonian cycle; these substructures are referred to as clusters.

The invention assumes that the nodes in one cluster are never all faulty. In practical applications, the probability that every node in a cluster fails is very small, so the invention still has good universality and application prospects.

The invention relates to a parallel self-adaptive system level fault diagnosis method based on a PMC model, which specifically comprises the following steps:

Step 1: A Hamiltonian cycle is first constructed for each cluster, in parallel. Two rounds of tests are then carried out on each constructed Hamiltonian cycle in the clockwise direction: in the first round the odd-numbered nodes test the even-numbered nodes, and in the second round the even-numbered nodes test the odd-numbered nodes. This yields a unidirectional test symptom set.

Step 2: The Hamiltonian cycles are classified, according to the symptom set obtained in step 1, into suspicious Hamiltonian cycles and correct Hamiltonian cycles. A correct Hamiltonian cycle is one in which all nodes are fault-free; a suspicious Hamiltonian cycle is one that may contain fault nodes. Two rounds of tests are then carried out on each suspicious Hamiltonian cycle in the anticlockwise direction, following the test rule of step 1, to obtain a bidirectional symptom set.

Step 3: Each suspicious Hamiltonian cycle is divided into several sequences. A fault-free node set, a fault node set and an unknown node set are obtained for each sequence by the rules given below; the fault-free node set consists of fault-free nodes, the fault node set consists of fault nodes, and the unknown node set consists of unknown nodes.

Step 4: The unknown nodes between the sequences are tested by nodes that have already been diagnosed as fault-free. The testing in this step is performed within each cluster.

Step 5: Finally, a fifth round of tests is carried out on the remaining unknown nodes, which are tested by known fault-free nodes. The testing in this step is mutual testing among clusters, and nodes of a correct Hamiltonian cycle are preferentially selected to test nodes of a suspicious Hamiltonian cycle.
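For illustration only, the two-round test schedule used in steps 1 and 2 may be sketched in Python as follows. The function name ring_rounds, the representation of a cycle as a list, and the restriction to cycles with an even number of nodes are assumptions of this sketch and are not taken from the invention; nodes are numbered from 1 in list order.

```python
def ring_rounds(ring, clockwise=True):
    """Return (round_1, round_2) as lists of (tester, tested) pairs.

    ring lists the nodes in clockwise order; ring[0] is node number 1.
    Hypothetical helper: an even cycle length is assumed so that every
    odd-numbered node has an even-numbered successor.
    """
    m = len(ring)
    if m % 2 != 0:
        raise ValueError("this sketch assumes an even number of nodes")
    step = 1 if clockwise else -1
    round_1, round_2 = [], []
    for pos in range(m):
        pair = (ring[pos], ring[(pos + step) % m])
        # Node numbers are 1-based, so an even pos is an odd-numbered node.
        if pos % 2 == 0:
            round_1.append(pair)   # odd-numbered nodes test even-numbered ones
        else:
            round_2.append(pair)   # even-numbered nodes test odd-numbered ones
    return round_1, round_2


cw_1, cw_2 = ring_rounds(["v1", "v2", "v3", "v4"])
print(cw_1)  # [('v1', 'v2'), ('v3', 'v4')]
print(cw_2)  # [('v2', 'v3'), ('v4', 'v1')]
```

In each returned round every node appears at most once as a tester, which matches the requirement that each processor performs at most one test per round. Passing clockwise=False gives the anticlockwise schedule of the third and fourth rounds.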

Further, the step 2 specifically includes the following steps:

Step 2-1: The Hamiltonian cycles are classified according to the unidirectional test symptom set obtained in step 1. If the unidirectional test symptom set of a Hamiltonian cycle contains a 1 symptom, the cycle is called a suspicious Hamiltonian cycle, that is, a fault vertex most likely exists in the cycle. If the set contains only 0 symptoms, the cycle is called a correct Hamiltonian cycle: because the nodes of a subsystem cannot all be faulty, an all-0 symptom set means that every node in the subsystem is fault-free.

Step 2-2: For each suspicious Hamiltonian cycle identified in step 2-1, two rounds of tests are carried out in the anticlockwise direction. In the first round the odd-numbered nodes test the even-numbered nodes, and in the second round the even-numbered nodes test the odd-numbered nodes, which yields a bidirectional symptom set.
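As a minimal sketch of the classification in step 2-1, the following Python fragment separates cycles by whether their unidirectional symptom set contains a 1. The function name classify_cycles and the dictionary layout (cycle name mapped to a list of 0/1 results) are assumptions made for illustration.

```python
def classify_cycles(syndromes):
    """Split cycles into (suspicious, correct) by their clockwise results.

    syndromes maps a cycle name to its list of 0/1 test results.
    A cycle with any 1 result is suspicious; an all-0 cycle is correct,
    relying on the assumption that a cluster is never entirely faulty.
    """
    suspicious, correct = [], []
    for name, results in syndromes.items():
        if 1 in results:
            suspicious.append(name)
        else:
            correct.append(name)
    return suspicious, correct


suspicious, correct = classify_cycles({"C0": [0, 1, 0, 0], "C1": [0, 0, 0, 0]})
print(suspicious, correct)  # ['C0'] ['C1']
```

Only the cycles returned in the first list need the two additional anticlockwise rounds of step 2-2.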

Further, in step 3, each suspicious Hamiltonian cycle is divided into several sequences according to the unidirectional symptom set obtained in the clockwise direction, specifically as follows:

Step 3-1: Select a 0 result that follows a 1 result, and denote it by b0.

Step 3-2: Denote the result after b0 by b. If b is a 0 result, let b0 point to b and repeat step 3-2; otherwise, execute step 3-3.

Step 3-3: If the 1 result pointed to by b has not been marked before, mark the result after b with M, let b0 point to the result marked M, and execute step 3-2; otherwise, end.

Step 3-4: The results marked M give the division of the Hamiltonian cycle into sequences.
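The marking procedure of steps 3-1 to 3-4 may be read as the following Python sketch. It is one possible interpretation, not a definitive implementation: it assumes that results[k] is the clockwise result of node k testing node k+1 (read cyclically), that "marked before" in step 3-3 means the 1 result at b has already been visited, and that the indices marked M are what the caller needs. The name split_marks is hypothetical.

```python
def split_marks(results):
    """Return the sorted indices of the results marked M.

    results is the cyclic list of clockwise 0/1 test results of one
    suspicious cycle (assumed layout: results[k] is node k testing k+1).
    """
    m = len(results)
    # Step 3-1: find a 0 result that directly follows a 1 result.
    start = next(
        (k for k in range(m) if results[k] == 0 and results[k - 1] == 1),
        None,
    )
    if start is None:
        return []  # all 0s or all 1s: this sketch has nothing to anchor on
    marked, visited_ones = set(), set()
    b0 = start
    while True:
        b = (b0 + 1) % m          # Step 3-2: b is the result after b0
        if results[b] == 0:
            b0 = b                # advance over a run of 0 results
            continue
        if b in visited_ones:     # Step 3-3: this 1 was handled before, end
            break
        visited_ones.add(b)
        mark = (b + 1) % m        # mark the result after b with M
        marked.add(mark)
        b0 = mark
    return sorted(marked)         # Step 3-4: the marked results


print(split_marks([0, 0, 1, 0, 0, 1, 1, 0]))  # [3, 6]
```

In the example, the 1 results at indices 2 and 5 each cause the following result to be marked, so the cycle is cut at indices 3 and 6.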

According to the rules of the PMC model, for any two adjacent nodes u and v, the bidirectional symptom set has the following features:

(1) If the result of u testing v is 1 and v is a fault-free node, then u is a fault node. If the result of u testing v is 1 and the result of v testing u is 0, then v can only be a fault node.

(2) If the result of u testing v is 0 and v is a fault node, then u is also a fault node.
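Features (1) and (2) can be applied mechanically to a pair of adjacent nodes. The following Python sketch is an illustration under stated assumptions: the state dictionary, the labels FAULTY and GOOD, and the function name infer are hypothetical, and the function only ever fills in states that are still unknown.

```python
FAULTY, GOOD = "faulty", "fault-free"


def infer(u, v, r_uv, r_vu, state):
    """Apply features (1) and (2) to adjacent nodes u and v.

    r_uv is the result of u testing v, r_vu the result of v testing u.
    state maps node -> FAULTY or GOOD; missing nodes are unknown.
    Returns True if a new state was deduced.
    """
    changed = False

    def mark(node, value):
        nonlocal changed
        if state.get(node) is None:
            state[node] = value
            changed = True

    # The rules are stated for (u, v); apply them in both orientations.
    for a, b, r_ab, r_ba in ((u, v, r_uv, r_vu), (v, u, r_vu, r_uv)):
        if r_ab == 1 and state.get(b) == GOOD:
            mark(a, FAULTY)       # feature (1), first part
        if r_ab == 1 and r_ba == 0:
            mark(b, FAULTY)       # feature (1), second part
        if r_ab == 0 and state.get(b) == FAULTY:
            mark(a, FAULTY)       # feature (2)
    return changed


state = {}
infer("u", "v", 1, 0, state)
print(state)  # {'v': 'faulty'}
```

The second part of feature (1) needs no prior knowledge: if v were fault-free, its 0 result would make u fault-free, and a fault-free u could not report 1 on a fault-free v.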

Further, in step 5, one round of tests, namely the fifth round, is carried out on the remaining unknown nodes, specifically as follows:

Step 5-1: Testing is performed within a cluster, and the unknown nodes are tested by nodes already diagnosed as fault-free.

Step 5-2: Testing is performed among clusters. If an unknown node in a suspicious Hamiltonian cycle is adjacent to a node in a correct Hamiltonian cycle, the node in the correct Hamiltonian cycle is preferentially selected to test it; otherwise, a fault-free node in another suspicious Hamiltonian cycle adjacent to the unknown node is selected to test it.
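The tester selection order of steps 5-1 and 5-2 may be sketched as below. All names (pick_tester, neighbors, cluster_of, correct_clusters, fault_free) are hypothetical, and the sketch assumes that the caller has already placed every node of a correct Hamiltonian cycle in fault_free.

```python
def pick_tester(unknown, neighbors, cluster_of, correct_clusters, fault_free):
    """Choose a fault-free neighbor to test the node `unknown`.

    Preference order (an interpretation of steps 5-1 and 5-2):
    0: a fault-free neighbor in the same cluster,
    1: a neighbor in a correct Hamiltonian cycle,
    2: a fault-free neighbor in another suspicious cycle.
    Returns None when no fault-free neighbor is available.
    """
    candidates = [v for v in neighbors[unknown] if v in fault_free]

    def rank(v):
        if cluster_of[v] == cluster_of[unknown]:
            return 0
        if cluster_of[v] in correct_clusters:
            return 1
        return 2

    return min(candidates, key=rank, default=None)


neighbors = {"x": ["a", "b", "c"]}
cluster_of = {"x": "C0", "a": "C2", "b": "C1", "c": "C0"}
print(pick_tester("x", neighbors, cluster_of, {"C1"}, {"a", "b"}))  # b
```

Here node c shares the cluster of x but is not known to be fault-free, so the neighbor b in the correct cycle C1 is preferred over a, which lies in another suspicious cycle.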

The beneficial effects of the invention are as follows:

(1) The fault diagnosis method reduces the number of tests. For a t-diagnosable system with n vertices, non-adaptive diagnosis under the PMC model requires at least nt tests. An adaptive diagnosis scheme, unlike a non-adaptive one, does not carry out all tests at once and then give the result; instead, it dynamically selects and performs the next test based on the results of the previous tests. Under the diagnosis method proposed by the invention, testing is therefore carried out in several rounds, and each processor performs at most one test per round.

(2) The fault diagnosis method provided by the invention has high identification efficiency. The invention adopts parallel adaptive diagnosis, which aims to solve the fault diagnosis problem in large-scale multiprocessor systems effectively. By making full use of parallelism, the scheme locates and diagnoses faults in the system quickly and accurately. The Hamiltonian cycle, an important concept in graph theory, helps determine whether a fault node exists in the multiprocessor system and thereby accelerates the diagnosis process. Notably, the number of nodes in a Hamiltonian cycle directly affects the execution time and efficiency of the scheme. Through parallel adaptive diagnosis, the scheme processes several candidate Hamiltonian cycles simultaneously, which speeds up diagnosis; the efficiency gain is most pronounced when the topology is relatively simple or the number of fault nodes is small.

(3) The invention has good universality and a high diagnosis rate. Based on the property that a multiprocessor structure can be decomposed into subgraphs containing Hamiltonian cycles, together with the properties of the PMC test, the invention can accurately diagnose any system containing Hamiltonian cycles. Most existing network structures are recursive, and as long as a recursive system contains Hamiltonian cycles it can be decomposed into such substructures, so the invention has good universality. The fewer the nodes, the higher the diagnostic efficiency of the system, so the invention has a high diagnosis rate. The invention maintains high accuracy even when there is a relatively large number of fault nodes in the system.

(4) The PMC model adopted by the invention specifies that adjacent nodes can test each other. When the testing node is fault-free, a test result of 0 (1) means that the tested node is fault-free (faulty). If the testing node is faulty, the test result may be 0 or 1 regardless of whether the tested node is faulty or fault-free. The test result is therefore trustworthy only when the testing node is fault-free. In system-level fault diagnosis, testing is the basis of diagnosis. The PMC model makes full use of the communication capability among the processor nodes in the system: nodes send test information and feed back test results, and the results are processed and analyzed to determine the fault state of each node in the system.
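The outcome of a single PMC test can be simulated with a few lines of Python. The sketch uses the convention of step 2-1, in which a 1 result flags a fault. The function name pmc_test is hypothetical, and modelling a faulty tester as a fair coin is an assumption borrowed from the experimental constraint in the embodiment, not a property of the PMC model itself.

```python
import random


def pmc_test(tester_faulty, tested_faulty, rng=random):
    """Return the 0/1 result of one test under the PMC model.

    A fault-free tester reports the true state of the tested node
    (0 = fault-free, 1 = faulty). A faulty tester is unreliable; here it
    answers 0 or 1 with probability 0.5 each (an assumption of the sketch).
    """
    if not tester_faulty:
        return 1 if tested_faulty else 0
    return rng.randint(0, 1)


print(pmc_test(False, True))   # 1: a fault-free tester detects the fault
print(pmc_test(False, False))  # 0: a fault-free tester passes a good node
```

Passing a seeded random.Random instance as rng makes simulated symptom sets reproducible.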

Drawings

Fig. 1 shows a Hamiltonian cycle to be diagnosed.

Fig. 2 is a schematic diagram of the Hamiltonian cycles after the bidirectional symptom set has been obtained and the cycles have been classified.

Fig. 3 is a schematic diagram of the sequences obtained after dividing a Hamiltonian cycle.

Fig. 4 is a flow chart of the method.

Fig. 5 is a schematic diagram showing the effect of different numbers of fault nodes on the accuracy.

Fig. 6 is a schematic diagram showing the effect of different numbers of fault nodes on the error rate.

Fig. 7 is a schematic diagram showing the effect of different numbers of fault nodes on the false alarm rate.

Detailed Description

Embodiments of the invention are disclosed in the drawings, and for purposes of explanation, numerous practical details are set forth in the following description. However, it should be understood that these practical details are not to be taken as limiting the invention. That is, in some embodiments of the invention, these practical details are unnecessary.

As shown in Fig. 4, the invention provides a parallel adaptive system-level fault diagnosis scheme based on the PMC model, comprising the following steps:

Step 1: The multiprocessor system to be diagnosed is divided into subsystems, and a Hamiltonian cycle is constructed for each subsystem in parallel. Two rounds of tests are then carried out on each subsystem along its Hamiltonian cycle to obtain a unidirectional test symptom set.

The specific test process is as follows: in the clockwise direction of the Hamiltonian cycle, the first round uses the odd-numbered nodes to test the even-numbered nodes, and the second round uses the even-numbered nodes to test the odd-numbered nodes.

Step 2: According to the unidirectional test symptom set of step 1, the Hamiltonian cycles are divided into suspicious Hamiltonian cycles and correct Hamiltonian cycles. Two rounds of tests are then carried out on the suspicious Hamiltonian cycles only, to obtain their bidirectional symptom sets.

Step 2.1: The main reason for classifying the Hamiltonian cycles is to reduce the number of tests. If the test symptom set of a Hamiltonian cycle contains a 1 symptom, the cycle is called a suspicious Hamiltonian cycle, that is, a fault vertex most likely exists in the cycle. If the test symptom set contains only 0 symptoms, the cycle is called a correct Hamiltonian cycle: because the subsystem is assumed not to fail entirely, all nodes in the subsystem are fault-free in this case.

Step 2.2: For each suspicious Hamiltonian cycle, two rounds of tests are carried out in the anticlockwise direction. The first round uses the odd-numbered nodes to test the even-numbered nodes, and the second round uses the even-numbered nodes to test the odd-numbered nodes. A bidirectional symptom set is obtained.

Step 3: The cycles are then divided into sequences. From the characteristics of each sequence and the bidirectional symptom set, a fault-free node set, a fault node set and an unknown node set of each suspicious Hamiltonian cycle are preliminarily obtained; the fault-free node set consists of fault-free nodes, the fault node set consists of fault nodes, and the unknown node set consists of unknown nodes.

According to the symptom set obtained clockwise along the Hamiltonian cycle, each Hamiltonian cycle is divided into several sequences by a cycle division scheme. The specific process of cycle division is as follows:

Step 3.1.1: Select a 0 result that follows a 1 result, and denote it by b0.

Step 3.1.2: Denote the result after b0 by b. If b is a 0 result, let b0 point to b and repeat step 3.1.2; otherwise, execute step 3.1.3.

Step 3.1.3: If the 1 result pointed to by b has not been marked before, mark the result after b with M, let b0 point to the result marked M, and execute step 3.1.2; otherwise, the algorithm ends.

Step 3.1.4: The results marked M give the division of the Hamiltonian cycle into sequences.

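According to the rules of the PMC model, for any two adjacent nodes u and v, the bidirectional symptom set has the following features:

(1) If the result of u testing v is 1 and v is a fault-free node, then u must be a fault node. In particular, if the result of u testing v is 1 and the result of v testing u is 0, then v can only be a fault node.

(2) If the result of u testing v is 0 and v is a fault node, then u is also a fault node.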

Considering only the results obtained in the third and fourth rounds, for each sequence S_i taken along the anticlockwise direction of the Hamiltonian cycle, the first node is called the head node and the last node is called the tail node. The first position at which a result 1 appears is located; if such a result is found, all nodes after it, up to the tail node, are fault nodes.

Step 4: The unknown nodes between sequences are tested by nodes already diagnosed as fault-free; the testing is performed within each cluster.

Step 5: One round of tests, namely the fifth round, is carried out on the remaining unknown nodes to obtain their states. The fifth round proceeds as follows:

Step 5.1: Testing is first performed within a cluster, and the unknown nodes are tested by nodes already diagnosed as fault-free.

Step 5.2: Testing is then performed among clusters. If an unknown node in a suspicious Hamiltonian cycle is adjacent to a node in a correct Hamiltonian cycle, the node in the correct Hamiltonian cycle is preferentially selected to test it; otherwise, the unknown node is tested by a fault-free node in another suspicious Hamiltonian cycle adjacent to it.

After these five rounds of parallel adaptive diagnosis, the states of all nodes can be diagnosed.

In this embodiment of the invention, a processor system containing N nodes is given. To diagnose whether nodes in the system are faulty using the parallel adaptive diagnosis method provided by the invention, the multiprocessor system is first divided into subsystems, and the Hamiltonian cycle structure of each subsystem is then found; the specific structure is shown in Fig. 1.

To ensure the accuracy of the diagnosis method, the number of fault nodes in the system cannot exceed the number of nodes in a substructure, that is, the nodes in one substructure cannot all be fault nodes.

After the Hamiltonian cycle of each subsystem is obtained, all operations are carried out in parallel, that is, all cycles are processed simultaneously. First, two rounds of tests are run clockwise along each Hamiltonian cycle to obtain the test symptom set. The Hamiltonian cycles are then divided into suspicious Hamiltonian cycles and correct Hamiltonian cycles according to the test symptom sets obtained. As shown in Fig. 2, C₁ is the correct Hamiltonian cycle, and the remaining three cycles are suspicious Hamiltonian cycles.

The suspicious Hamiltonian cycles are then processed: each is divided into several sequences according to the cycle division rule. Taking the sub-Hamiltonian cycle C₀ as an example, the result of the division is shown in Fig. 3.

In this embodiment, in order to improve the diagnostic efficiency, the following constraints are set:

(1) If the testing node is faulty, the test result is 0 or 1, each with probability 0.5.

(2) The case in which all nodes in the system, or all nodes in a subsystem, are faulty does not occur, and the failure rate should be relatively low.

In this embodiment, a corresponding experiment is set up for verification, with the following steps:

1. Generating a network

The number of nodes of the network is n and the fault degree of the network is f. Each node in the network system is randomly given an initial state value of 0 or 1, where 0 indicates that the node is fault-free and 1 indicates that the node is faulty. Different fault degrees are set for the network system, but the number of fault nodes does not exceed the number of nodes in a sub-Hamiltonian cycle, and the proportion of fault nodes is relatively low.
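A network of this kind might be generated as in the following Python sketch. It is illustrative only: the function name generate_states, the seed parameter, the grouping of consecutive node indices into clusters of equal size, and the use of rejection sampling to avoid an entirely faulty cluster are all assumptions, since the embodiment does not state how the states are drawn.

```python
import random


def generate_states(n, f, cluster_size, seed=None):
    """Return a list of n states (0 = fault-free, 1 = faulty) with f faults.

    Nodes i*cluster_size .. (i+1)*cluster_size - 1 are assumed to form
    cluster i. A draw is rejected if any cluster is entirely faulty.
    """
    if n % cluster_size != 0:
        raise ValueError("this sketch assumes equally sized clusters")
    clusters = n // cluster_size
    if not 0 <= f <= n - clusters:
        raise ValueError("every cluster needs at least one fault-free node")
    rng = random.Random(seed)
    while True:
        faulty = set(rng.sample(range(n), f))
        states = [1 if i in faulty else 0 for i in range(n)]
        blocks = [states[i:i + cluster_size] for i in range(0, n, cluster_size)]
        if all(0 in block for block in blocks):
            return states


states = generate_states(100, 6, 25, seed=1)
print(sum(states))  # 6
```

With n = 100 and four clusters of 25 nodes, a fault count of 6 can never fill a whole cluster, so the first draw is always accepted; the rejection loop only matters when f reaches the cluster size.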

2. Fault diagnosis

Fault diagnosis is the core of the experiment. All nodes in the given network system are diagnosed and their states determined, and the method is evaluated by the accuracy, error rate and false alarm rate of the diagnosis.

The accuracy, error rate and false alarm rate are defined as follows:

Accuracy = (number of fault-free nodes diagnosed as fault-free + number of fault nodes diagnosed as faulty) / number of all nodes in the system

Error rate = (number of fault-free nodes diagnosed as faulty + number of fault nodes diagnosed as fault-free) / number of all nodes in the system

False alarm rate = number of fault nodes diagnosed as fault-free / (number of fault nodes diagnosed as faulty + number of fault-free nodes diagnosed as fault-free)
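The three definitions above translate directly into code. In the following Python sketch the function name diagnosis_metrics and the list encoding (0 = fault-free, 1 = faulty) are assumptions; the formulas follow the definitions exactly as stated in this embodiment, including the denominator of the false alarm rate.

```python
def diagnosis_metrics(actual, diagnosed):
    """Compute accuracy, error rate and false alarm rate.

    actual and diagnosed are equal-length lists of node states,
    with 0 = fault-free and 1 = faulty.
    """
    pairs = list(zip(actual, diagnosed))
    good_as_good = sum(1 for a, d in pairs if a == 0 and d == 0)
    faulty_as_faulty = sum(1 for a, d in pairs if a == 1 and d == 1)
    good_as_faulty = sum(1 for a, d in pairs if a == 0 and d == 1)
    faulty_as_good = sum(1 for a, d in pairs if a == 1 and d == 0)
    n = len(pairs)
    correct = good_as_good + faulty_as_faulty
    return {
        "accuracy": correct / n,
        "error_rate": (good_as_faulty + faulty_as_good) / n,
        # Undefined when no node is diagnosed correctly.
        "false_alarm_rate": faulty_as_good / correct if correct else float("nan"),
    }


print(diagnosis_metrics([0, 0, 0, 1, 1], [0, 0, 0, 1, 0]))
# {'accuracy': 0.8, 'error_rate': 0.2, 'false_alarm_rate': 0.25}
```

In the example one of the two fault nodes is missed, so four of five nodes are diagnosed correctly and the false alarm rate is 1/4.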

3. Verification result

This embodiment mainly verifies the influence of the number of fault nodes on the diagnosis method. The fault degrees are set to 0.02n, 0.04n, 0.06n, 0.08n and 0.1n respectively, where n = 100 nodes. Each experiment is repeated 50 times, the results obtained under the same conditions are averaged, and the influence of different numbers of fault nodes on the accuracy, error rate and false alarm rate of the diagnosis algorithm is shown in line charts.

As can be seen from Fig. 5 to Fig. 7, the diagnostic accuracy of the method is 1 until the number of fault nodes reaches 6, and remains close to 1 as the number of fault nodes increases further. The experimental results show that the diagnosis method provided by the invention performs well in system diagnosis.

Based on the characteristics of the PMC model, for any system structure that can be decomposed into several Hamiltonian cycle structures, the invention locates fault nodes well and quickly. The invention has good universality: the diagnosis method can be used as long as the multiprocessor system can be divided into several Hamiltonian cycle structures, and it maintains high accuracy even when there are many fault nodes in the system. The invention has broad market prospects in fault diagnosis of multiprocessor systems.

The foregoing description is only illustrative of the invention and is not to be construed as limiting the invention. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like, which is within the spirit and principles of the present invention, should be included in the scope of the claims of the present invention.

Claims

1. A parallel self-adaptive system level fault diagnosis method based on a PMC model is characterized in that: the parallel self-adaptive system level fault diagnosis method specifically comprises the following steps:

step 1, dividing a multiprocessor system to be diagnosed into a plurality of subsystems each containing a Hamiltonian cycle, wherein each subsystem is a cluster, constructing the Hamiltonian cycle of each cluster in parallel, and then carrying out two rounds of tests on each constructed Hamiltonian cycle to obtain a unidirectional test symptom set;

step 2, dividing the Hamiltonian cycles into suspicious Hamiltonian cycles and correct Hamiltonian cycles according to the unidirectional test symptom set obtained in step 1, wherein a correct Hamiltonian cycle indicates that the nodes in the Hamiltonian cycle are correct and a suspicious Hamiltonian cycle indicates that the Hamiltonian cycle may contain fault nodes, and then carrying out two rounds of tests on each suspicious Hamiltonian cycle to obtain a bidirectional symptom set of the suspicious Hamiltonian cycle, specifically comprising the following steps:

step 2-1, classifying the Hamiltonian cycles according to the unidirectional test symptom set obtained in step 1, wherein, for the unidirectional test symptom set of one Hamiltonian cycle, if the set contains a 1 symptom, the Hamiltonian cycle is called a suspicious Hamiltonian cycle, that is, a fault vertex most likely exists in the Hamiltonian cycle, and if the set contains only 0 symptoms, the Hamiltonian cycle is called a correct Hamiltonian cycle, because the subsystem cannot fail entirely, that is, all nodes in the subsystem are correct in this case;

step 2-2, carrying out two rounds of tests on each suspicious Hamiltonian cycle of step 2-1 in the anticlockwise direction of the suspicious Hamiltonian cycle, wherein the first round tests the even-numbered nodes with the odd-numbered nodes and the second round tests the odd-numbered nodes with the even-numbered nodes, so as to obtain a bidirectional symptom set;

step 3, dividing each suspicious Hamiltonian cycle obtained in step 2 into a plurality of sequences, and obtaining a fault-free node set, a fault node set and an unknown node set of each sequence according to the characteristics of the PMC test results and the PMC rules, wherein the fault-free node set consists of fault-free nodes, the fault node set consists of fault nodes and the unknown node set consists of unknown nodes;

step 4, testing the unknown nodes between the sequences with the nodes already diagnosed as fault-free, wherein the testing is carried out within each cluster;

step 5, carrying out one round of tests, namely a fifth round of tests, on the remaining unknown nodes, and testing the unknown nodes with the known fault-free nodes, wherein the testing is mutual testing among clusters, and after the five rounds of the parallel adaptive diagnosis scheme the states of all nodes can be diagnosed.

2. A parallel adaptive system level fault diagnosis method based on a PMC model according to claim 1, wherein: in step 1, the specific process of testing each constructed Hamiltonian cycle is as follows: in the clockwise direction of the Hamiltonian cycle, the first round tests the even-numbered nodes with the odd-numbered nodes, and the second round tests the odd-numbered nodes with the even-numbered nodes.

3. A parallel adaptive system level fault diagnosis method based on a PMC model according to claim 1, wherein: in step 3, each suspicious Hamiltonian cycle is divided into a plurality of sequences according to the unidirectional symptom set obtained clockwise, specifically comprising the following steps:

4. A parallel adaptive system level fault diagnosis method based on PMC model according to claim 1 or 3, characterised in that: in the step 3, according to the rule of PMC, for any two adjacent nodes u and v, the following features are provided for the bidirectional symptom set:

5. A parallel adaptive system level fault diagnosis method based on a PMC model according to claim 1, wherein: in the step 5, a round of testing is performed for the remaining unknown nodes, namely a fifth round of testing, which specifically includes the following steps:

step 5-2, adopting a testing method among clusters, preferentially selecting a node in a correct Hamiltonian cycle that is adjacent to an unknown node in a suspicious Hamiltonian cycle to test that unknown node, and otherwise selecting a fault-free node in another suspicious Hamiltonian cycle that is adjacent to the unknown node to test it.