US20100039944A1 - Distributed system - Google Patents

Distributed system

Info

Publication number
US20100039944A1
US20100039944A1 (application US12/457,329)
Authority
US
United States
Prior art keywords
fault
identification
node
monitoring
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/457,329
Inventor
Masahiro Matsubara
Kohei Sakurai
Kotaro Shimamura
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd filed Critical Hitachi Ltd
Assigned to HITACHI, LTD. Assignment of assignors' interest; assignors: SHIMAMURA, KOTARO; MATSUBARA, MASAHIRO; SAKURAI, KOHEI
Publication of US20100039944A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00: Data switching networks
    • H04L 12/28: Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L 12/40: Bus networks
    • H04L 12/407: Bus networks with decentralised control
    • H04L 12/417: Bus networks with decentralised control with deterministic access, e.g. token passing
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0677: Localisation of faults
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00: Data switching networks
    • H04L 12/28: Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L 12/40: Bus networks
    • H04L 2012/40208: Bus networks characterized by the use of a particular bus standard
    • H04L 2012/40241: Flexray
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00: Data switching networks
    • H04L 12/28: Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L 12/40: Bus networks
    • H04L 2012/4026: Bus for use in automation systems
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 12/00: Data switching networks
    • H04L 12/28: Data switching networks characterised by path configuration, e.g. LAN [Local Area Networks] or WAN [Wide Area Networks]
    • H04L 12/40: Bus networks
    • H04L 2012/40267: Bus for use in transportation systems
    • H04L 2012/40273: Bus for use in transportation systems, the transportation system being a vehicle
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0805: Monitoring or testing based on specific metrics, by checking availability
    • H04L 43/0811: Monitoring or testing based on specific metrics, by checking availability by checking connectivity
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 43/00: Arrangements for monitoring or testing data switching networks
    • H04L 43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L 43/0805: Monitoring or testing based on specific metrics, by checking availability
    • H04L 43/0817: Monitoring or testing based on specific metrics, by checking availability by checking functioning

Definitions

  • The fault-monitoring section 141-i performs fault monitoring (MON) on the other nodes.
  • The transmission/reception section 142-i transmits or receives data via the network 100 for detecting faults in the other nodes.
  • The fault-identification section 143-i performs fault identification (ID) to identify which node has a fault, based on the data for detecting faults in the other nodes.
  • The counter section 144-i counts the number of errors in a node identified as having a fault, with respect to the nodes, the error locations (error items), and the fault-identification conditions described later.
  • FIG. 2 is a flow chart showing a fault-identification process based on inter-node monitoring. The process is performed by the nodes communicating synchronously with each other through the network 100.
  • At Step 21, the fault-monitoring section 141-i monitors faults in the other nodes, performing a fault-monitoring process (MON) in which node i itself determines whether a fault has occurred in a transmission node according to the contents of the received data or the situation of the reception. It may be preferable to use multiple fault-monitoring items. For example, an item “reception malfunction” indicates a malfunction when the data reception has an error, such as detection of an unsuccessful reception or of a malfunction in the received data based on an error-detecting code. An item “sequence number malfunction” is used as follows. The transmission node supplies the transmission/reception data with a sequence number that an application increments at every communication cycle.
  • The reception node checks that the sequence number has been incremented and detects a malfunction when it has not (see the sketch below).
  • The sequence number is used to confirm an application malfunction in the transmission node.
  • An item “self-diagnosis malfunction” is used as follows. Each of the nodes performs self-diagnosis as to whether the node itself has a malfunction and transmits the result of the diagnosis (self-diagnosis result) to the other nodes.
  • The reception node detects a malfunction in the transmission node by using the self-diagnosis result. When any of the fault-monitoring items indicates a malfunction, a fault-monitoring item that unifies the other fault-monitoring items may be used to notify “malfunction detected.”
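As an illustration of the sequence-number check, the following is a minimal sketch in C. The names (seq_monitor_t, seq_malfunction) are hypothetical, and the 8-bit wrap-around behavior is an assumption not stated in the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-transmission-node receive state (names are illustrative). */
typedef struct {
    uint8_t last_seq;   /* sequence number seen in the previous cycle */
    bool    have_last;  /* false until the first frame is received    */
} seq_monitor_t;

/* Detect a "sequence number malfunction": the transmission node failed to
 * increment its sequence number since the last communication cycle.
 * The uint8_t cast makes 255 -> 0 wrap-around count as a valid increment. */
bool seq_malfunction(seq_monitor_t *m, uint8_t received_seq)
{
    bool bad = m->have_last && received_seq != (uint8_t)(m->last_seq + 1);
    m->last_seq  = received_seq;
    m->have_last = true;
    return bad;
}
```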
  • The fault-monitoring process is performed over a period of p communication cycles, where p = 1, 2, 3, . . . .
  • The p communication cycles are used as the unit of the monitoring period.
  • The nodes are synchronized with each other on the fault-monitoring period of p communication cycles.
  • For example, a node may declare initiation of the fault-monitoring process over the communication.
  • Alternatively, the number of communication cycles may be used to determine the monitoring period. For example, if the first fault monitoring is defined to begin at communication cycle 0, a communication cycle begins a fault-monitoring period exactly when its cycle number is divisible by p (see the sketch below).
  • The use of multiple communication cycles as the period of the fault-monitoring process decreases the frequency of the subsequent processes and reduces the communication band per communication cycle and the processing load on the CPU in each node.
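A minimal sketch of this divisibility test, assuming the first monitoring period starts at communication cycle 0 as in the example above; the function name is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* A monitoring period spans p communication cycles. With the first fault
 * monitoring defined to begin at communication cycle 0, a cycle starts a
 * new fault-monitoring period exactly when its number is divisible by p. */
static inline bool starts_monitoring_period(uint32_t cycle_number, uint32_t p)
{
    return (cycle_number % p) == 0u;
}
```

For p = 2, cycles 0, 2, 4, . . . begin monitoring periods, matching the two-cycle examples in FIGS. 3 and 4.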
  • At Step 22, the transmission/reception section 142-i performs a fault-monitoring result exchange process (EXD) that exchanges the fault-monitoring results acquired at Step 21 among the nodes.
  • Each node holds the fault-monitoring results from all the nodes including the result of the node itself.
  • The collected fault-monitoring results are stored in the monitoring result table of the fault-identification result 145-i.
  • The fault-monitoring result exchange process may be performed in one communication cycle or divided into multiple communication cycles. Performing the exchange over multiple communication cycles reduces the necessary communication band per communication cycle and the load on each node's CPU for processing the received data.
  • At Step 23, the fault-identification section 143-i performs a fault-identification process (ID) that determines, from the fault-monitoring results collected in the nodes at Step 22, whether each node and each fault-monitoring item has a malfunction.
  • The fault-identification result is stored in the fault-identification result table of the fault-identification result 145-i.
  • One fault-identification method is a majority rule, which decides whether a malfunction has occurred based on the number of nodes that report it. If the number of nodes having detected a fault in a given node or fault-monitoring item is greater than or equal to the threshold value of fault-identification condition 1, the node subject to the fault detection is assumed to be abnormal. If that number is smaller than the threshold value of fault-identification condition 2, the node having detected the fault is assumed to be abnormal instead. Normally, the threshold value is equivalent to half the number of the collected fault-monitoring results.
  • A node is assumed to be normal if it detects no fault under fault-identification condition 1, or if it is the node subject to the fault detection under fault-identification condition 2.
  • A fault satisfying fault-identification condition 1 is referred to as a majority malfunction.
  • A fault satisfying fault-identification condition 2 is referred to as a minority malfunction (see the sketch below).
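The following sketch shows one way the two fault-identification conditions could be evaluated, assuming a 4-node system; the names and the exact decision order are illustrative assumptions, not code prescribed by the patent.

```c
#include <stdbool.h>

#define N_NODES 4

typedef enum { NODE_OK, MAJORITY_MALFUNCTION, MINORITY_MALFUNCTION } verdict_t;

/* reports[i] is true if node i detected a fault in the node under test.
 * Condition 1: enough detectors  -> the monitored node is abnormal.
 * Condition 2: too few detectors -> the detecting node(s) are abnormal.
 * minority_suspect[] is written only for a minority malfunction. */
verdict_t identify(const bool reports[N_NODES], int threshold1, int threshold2,
                   bool minority_suspect[N_NODES])
{
    int detections = 0;
    for (int i = 0; i < N_NODES; i++)
        detections += reports[i] ? 1 : 0;

    if (detections >= threshold1)
        return MAJORITY_MALFUNCTION;       /* node under test is faulty */

    if (detections > 0 && detections < threshold2) {
        for (int i = 0; i < N_NODES; i++)  /* the reporters are suspect */
            minority_suspect[i] = reports[i];
        return MINORITY_MALFUNCTION;
    }
    return NODE_OK;
}
```

With four collected results, thresholds around 2 (half the results) would be typical per the text above.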
  • The fault-identification process may be performed in one communication cycle or divided into multiple communication cycles. Performing the fault-identification process over multiple communication cycles reduces the CPU processing load per communication cycle in each node.
  • At Step 24, each node performs a fault-identification result utilization process. If a malfunction is determined at Step 23, the counter section 144-i increments the error counter value that indicates the number of errors in the node or the monitoring item subject to the fault identification. If no malfunction is determined, the counter section 144-i decrements the counter value; it may instead reset the counter value or do nothing. Whether to decrement, to reset, or to do nothing is selected in advance.
  • The error counter may be provided for each of the fault-identification conditions. In this case, the error counter is decremented or reset only if none of the fault-identification conditions is satisfied (see the sketch below).
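A sketch of the error-counter update policy, with the on-normal behavior (decrement, reset, or keep) selected in advance as the text describes; all names are illustrative.

```c
#include <stdbool.h>

typedef enum { ON_NORMAL_DECREMENT, ON_NORMAL_RESET, ON_NORMAL_KEEP } policy_t;

typedef struct {
    int      value;   /* number of errors counted so far            */
    policy_t policy;  /* behavior when no malfunction is determined */
} error_counter_t;

/* Update one error counter after a fault-identification round. */
void update_error_counter(error_counter_t *c, bool malfunction_determined)
{
    if (malfunction_determined) {
        c->value++;
        return;
    }
    switch (c->policy) {
    case ON_NORMAL_DECREMENT: if (c->value > 0) c->value--; break;
    case ON_NORMAL_RESET:     c->value = 0;                 break;
    case ON_NORMAL_KEEP:      /* do nothing */              break;
    }
}
```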
  • The counter section 144-i notifies a control application of a fault occurrence.
  • One notification means is to turn on a node fault flag corresponding to the node or the monitoring item subject to the fault identification.
  • The application, by referring to the node fault flag, can identify the fault occurrence.
  • The fault occurrence may also be notified immediately by interrupting the control application or by invoking a callback function after the node fault flag is turned on.
  • The node fault flag is likewise provided for each of the fault-identification conditions (see the sketch below).
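A sketch of the flag-and-callback notification path, assuming a notification threshold on the error counter (the FIG. 5 example below uses 3); the callback type and the names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical application hook; a real system might instead raise an
 * interrupt toward the control application. */
typedef void (*fault_callback_t)(int node, int condition);

typedef struct {
    int              error_counter;
    bool             node_fault_flag;   /* application may poll this    */
    int              notify_threshold;  /* e.g. 3 in the FIG. 5 example */
    fault_callback_t callback;          /* NULL: polling only           */
} node_status_t;

void on_identified_fault(node_status_t *s, int node, int condition)
{
    s->error_counter++;
    if (s->error_counter >= s->notify_threshold && !s->node_fault_flag) {
        s->node_fault_flag = true;
        if (s->callback != NULL)
            s->callback(node, condition);  /* immediate notification */
    }
}
```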
  • The fault-identification result utilization process may be performed either when all the fault-identification processes are completed, or each time a part of the fault-identification processes is completed so as to make sequential use of the partial results.
  • The former should be employed if all the nodes need to maintain the same recognition of the fault occurrence, or the same state transition according to the fault occurrence.
  • The above-mentioned processes can locate a fault occurrence with high reliability and provide the nodes with the same recognition of the error occurrence. Distributing the processes over multiple communication cycles reduces the CPU processing load and the necessary communication band per communication cycle.
  • The processes in FIG. 2 may be performed in parallel when they are performed repeatedly. Defining one execution of the process flow in FIG. 2 as a fault-identification round, it may be preferable to perform multiple fault-identification rounds in parallel.
  • FIGS. 3 and 4 show examples of parallel processing for identifying a fault in a 4-node system by the inter-node monitoring based on the process flow in FIG. 2.
  • In FIG. 3, each node exchanges the monitoring results (EXD) for the nodes 1 and 2 at the communication cycle i+2 and for the nodes 3 and 4 at the communication cycle i+3, performing the fault identification (ID) from the monitoring results.
  • As shown in FIG. 3, the fault-monitoring result exchange process (EXD) and the fault-identification process (ID) are divided among the nodes and distributed over the communication cycles.
  • The nodes concurrently perform the fault-identification round 1 and the fault-identification round 2 and later.
  • The fault-monitoring result exchange (EXD) is performed for the fault-identification round 1.
  • The fault monitoring (MON) is performed for the fault-identification round 2 according to the content of the received data or the situation of the data reception resulting from the fault-monitoring result exchange (EXD).
  • The fault monitoring (MON) for a fault-identification round 3 is performed simultaneously with the fault-monitoring result exchange (EXD) for the fault-identification round 2.
  • The fault identification (ID) is performed in between. These processes are repeated subsequently. The results of the fault identification (ID) may be used from the nodes 1 and 2 first, or from all the nodes after acquisition of the results from the nodes 3 and 4.
  • In FIG. 4, the fault monitoring (MON) is performed at the communication cycles i and i+1, with the fault-monitoring result exchange (EXD) distributed into the communication cycles i+2 and i+3 and the fault identification (ID) distributed into the communication cycles i+3 and i+4.
  • The results of the fault monitoring (MON) are transmitted from the nodes 1 and 2 at the communication cycle i+2 and from the nodes 3 and 4 at the communication cycle i+3.
  • The fault identification (ID) is performed on the nodes 1 and 2 at the communication cycle i+3 and on the nodes 3 and 4 at the communication cycle i+4.
  • A difference from FIG. 3 is that the fault-monitoring result exchange (EXD) process is divided among the transmission nodes and distributed over the communication cycles.
  • The nodes concurrently perform the fault-identification round 1 and the fault-identification round 2 and later.
  • The fault-monitoring result exchange (EXD) is performed for the fault-identification round 1.
  • The fault monitoring (MON) is performed for the fault-identification round 2 according to the content of the received data or the situation of the data reception resulting from the fault-monitoring result exchange (EXD). The relation between the fault-identification rounds 2 and 3 is the same, and the above-mentioned processes are repeated subsequently.
  • In this manner, the fault-identification process according to the inter-node monitoring in FIG. 2 is performed in a pipelined fashion.
  • The fault monitoring (MON) covers all the time intervals (communication cycles).
  • The fault identification (ID) can be performed continuously at a specified interval.
  • In FIGS. 3 and 4, two communication cycles are used as the applicable period for the fault monitoring (MON), and the fault-monitoring result exchange (EXD) process and the fault-identification (ID) process are distributed into the two communication cycles.
  • The number of communication cycles may be one, or more than two. Decreasing the number of communication cycles for each process shortens the time (the number of communication cycles) needed for the fault identification (ID) but relatively increases the CPU processing load or the communication band consumed per cycle. Increasing the number of communication cycles for each process lengthens the time needed for the fault identification (ID) but relatively decreases the CPU processing load or the communication band consumed per cycle.
  • For example, in a six-node system, the fault-monitoring result exchange (EXD) and the fault identification (ID) may be performed on the nodes 1 to 3 at the communication cycle i+2 and on the nodes 4 to 6 at the communication cycle i+3 for the first fault-identification round.
  • Distribution of the fault-monitoring result exchange (EXD) and the fault identification (ID) over communication cycles is hereafter referred to as time-base process distribution.
  • The time-base process distribution is preferably arranged so that the CPU processing load and the quantity of communication are equal for each communication cycle, because the control application is then relatively less affected by resource limits such as the CPU throughput and the communication band.
  • FIGS. 3 and 4 show examples of such equal distribution.
  • Alternatively, each node may perform a part of the fault-identification (ID) process, such as the counting for a majority rule, at the communication cycle i+2 using the fault-monitoring results received from the nodes 1 and 2, and then perform the rest of the fault-identification (ID) process at the communication cycle i+3 using the fault-monitoring results received from the nodes 3 and 4 to complete the fault identification.
  • FIG. 5 shows an operation example of the fault-identification process based on the inter-node monitoring.
  • The process flow is based on FIG. 2.
  • The time-base process distribution and the process pipelining follow FIG. 3.
  • The number of nodes is assumed to be four. In this example, the various items are unified into one fault-monitoring item.
  • The fault-identification process (ID) is performed at the end of each communication cycle, after every node has finished its transmission or reception.
  • The transmission data includes bits for two nodes, each bit indicating the presence or absence of a malfunction in a node to be monitored.
  • The area corresponding to a given node stores the result of diagnosis about that node.
  • The presence or absence of a malfunction concerning the nodes 1 and 2 is stored at an even-numbered cycle.
  • The presence or absence of a malfunction concerning the nodes 3 and 4 is stored at an odd-numbered cycle.
  • The transmission data also includes the error counter value, held by each node, for one node.
  • At one cycle, the node 1 transmits the error counter value for the node 2; the node 2 transmits the value for the node 3; the node 3 transmits the value for the node 4; the node 4 transmits the value for the node 1.
  • At the next cycle, the node 1 transmits the error counter value for the node 4; the node 2 transmits the value for the node 1; the node 3 transmits the value for the node 2; the node 4 transmits the value for the node 3.
  • The transmission targets are thus rotated. The error counters are independent for a majority malfunction and a minority malfunction: the number of majority malfunctions (EC) is transmitted at an even-numbered cycle, and the number of minority malfunctions (FC) at an odd-numbered cycle (see the sketch below).
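An illustrative rendering of this frame layout and counter rotation in C; the struct and the rotation formula are assumptions matching the example above, not a wire format taken from the patent.

```c
#include <stdint.h>

#define N_NODES 4

/* Illustrative frame contents: two fault-monitoring bits (for the pair of
 * nodes covered at this cycle) plus one rotated error counter value. */
typedef struct {
    uint8_t mon_bits;  /* bit 0 / bit 1: malfunction in monitored node A/B  */
    uint8_t ec_target; /* node whose error counter value this frame carries */
    uint8_t ec_value;  /* that counter value (EC on even, FC on odd cycles) */
} monitor_frame_t;

/* Counter rotation from the example: on even cycles node i reports the
 * counter of node i+1; on odd cycles, that of node i-1 (both mod 4). */
uint8_t ec_target_for(uint8_t self /* 1..4 */, uint32_t cycle)
{
    if (cycle % 2 == 0)
        return (uint8_t)(self % N_NODES + 1);             /* 1->2 ... 4->1 */
    return (uint8_t)((self + N_NODES - 2) % N_NODES + 1); /* 1->4 ... 4->3 */
}
```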
  • When receiving an error counter value, a node uses the received value to synchronize the error counters between the nodes in the fault-identification result utilization process, before reflecting the result of the fault identification (ID) on the error counter. This is because the error counter values may differ from node to node even when the fault-identification process is performed in accordance with the inter-node monitoring. Possible causes of this difference include a reset based on a node's self-diagnosis or a temporary communication failure.
  • The error counters may be synchronized as follows.
  • If a node receives a counter value that differs from its own counter value and the difference between two successively received counter values is within a given value (e.g., ±1), the node adjusts its counter value to the later received value.
  • The transmission data shown in FIG. 5 indicates only part of the contents.
  • The transmission data may include a sequence number and control data as well as the above-mentioned data.
  • At the communication cycle i, the nodes 1 to 4 sequentially use the slots 1 to 4 to transmit the fault-monitoring results (EXD, 501-0 to 504-0) concerning the nodes 1 and 2 for a fault-identification round k-1, maintaining the results received from the other nodes and generated by the node itself (521-0 to 524-0, represented in binary). Since the results include no data indicating “abnormality” and are normally received by the nodes, no malfunction is found in the fault identification (ID) concerning the nodes 1 and 2 for the fault-identification round k-1, and none of the nodes turns on the node fault flag (551-0 to 554-0, represented in binary).
  • None of the nodes detects a fault during the fault monitoring (MON) for a fault-identification round k (511-0 to 514-0, represented in binary).
  • The error counter value in each node indicates 2 for the majority malfunction of the node 3 and 0 otherwise; no change is made from the communication cycle i-1 (541-0 to 544-0).
  • The node 3 suffers a CPU fault at the end of the communication cycle i. It is assumed that this fault prevents the node 3 from incrementing the sequence number to be transmitted at the next communication cycle i+1.
  • The sequence numbers are not shown in the data in FIG. 5.
  • At the communication cycle i+1, the nodes transmit the fault-monitoring results (501-1 to 504-1) concerning the nodes 3 and 4 for the fault-identification round k-1 and maintain the results (521-1 to 524-1).
  • No malfunction is found in the fault identification (ID) concerning the nodes 3 and 4 for the fault-identification round k-1.
  • During the fault monitoring (MON) concerning the nodes 3 and 4 for the fault-identification round k, the nodes 1, 2, and 4 detect a fault in the node 3 (511-1, 512-1, and 514-1) from the sequence number malfunction of the node 3.
  • The node 3 cannot detect the malfunction in itself (513-1).
  • The fault-monitoring result exchange (EXD) and the fault identification (ID) for the fault-identification round k, together with the fault monitoring (MON) for the fault-identification round k+1, are performed on the nodes 1 and 2 at the communication cycle i+2 and on the nodes 3 and 4 at the communication cycle i+3. No malfunction is detected at the communication cycle i+2, similarly to the communication cycle i.
  • At the communication cycle i+3, the fault-monitoring result exchange (EXD) for the fault-identification round k exchanges the fault detection results concerning the node 3 from the communication cycle i+1 (501-3 to 504-3 and 521-3 to 524-3), and the fault identification (ID) in each node identifies the majority malfunction of the node 3 (531-3 to 534-3).
  • The error counter value in each node for the majority malfunction of the node 3 is incremented to 3 (541-3 to 544-3).
  • In this system, the threshold value at which the application is notified of a fault is 3; therefore, the node fault flag in each node for the majority malfunction of the node 3 is turned on (551-3 to 554-3).
  • In this manner, the CPU fault of the node 3 is identified by each node, and the corresponding node fault flag notifies the application of the fault.
  • As described above, the fault-identification process according to the inter-node monitoring in FIG. 2 can be performed in a pipelined fashion in synchronization with the communication cycle.
  • The time-base process distribution reduces the CPU processing load and the quantity of communication per communication cycle compared to a case without it.
  • A minority malfunction is processed in the same way as in the above example, which describes a majority malfunction.
  • FIG. 6 is a flow chart of the fault-identification process according to the inter-node monitoring, with the fault identification distributed among the nodes.
  • The fault monitoring (MON) at Step 21 and the fault-monitoring result exchange process (EXD1 in FIG. 6) at Step 22 are the same as those in FIG. 2.
  • At Step 61, the fault-identification section 143-i performs the fault-identification process (ID1) on one of the nodes involved in the mutual monitoring, excluding the node itself.
  • The node itself is in charge of the fault identification for that one node.
  • The nodes in charge rotate at every communication cycle so that the assignments do not conflict with one another. In this manner, the load of the fault-identification process is distributed among the nodes and reduced (see the sketch below).
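One possible non-conflicting rotation of the nodes in charge, sketched in C; the patent requires only that the assignments rotate every communication cycle without conflicts, so this particular formula is an assumption.

```c
#define N_NODES 4

/* In each communication cycle, node `self` takes charge of the fault
 * identification (ID1) for exactly one other node. A cycle-dependent
 * offset in 1..N_NODES-1 makes the mapping a conflict-free permutation
 * that never assigns a node to itself. */
int node_in_charge_of(int self /* 0..N_NODES-1 */, unsigned int cycle)
{
    int offset = (int)(cycle % (N_NODES - 1)) + 1; /* 1..N_NODES-1 */
    return (self + offset) % N_NODES;
}
```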
  • The transmission/reception section 142-i then performs a fault-identification result exchange process (EXD2) that exchanges among the nodes the fault-identification result about the one node acquired at Step 61. Consequently, each node maintains the fault-identification results about all the nodes, including the result processed by the node itself.
  • A fault-identification process (ID2) then uses the collected fault-identification results to settle the final fault-identification result.
  • Step 24 is the same as the fault-identification result utilization process in FIG. 2.
  • The fault-identification process (ID1) may use fault-identification condition 1 for the determination about the one node.
  • The fault-identification process (ID2) may use fault-identification condition 2 for the determination about all the nodes.
  • Alternatively, the fault-identification process (ID2) may use fault-identification condition 2 for the determination about one node.
  • This result may be exchanged among the nodes using a fault-identification result exchange process (EXD3).
  • The fault-identification process (ID1) may be performed on two or more nodes, not being limited to one node.
  • FIGS. 7 and 8 show examples of parallel processing for identifying a fault in a 4-node system using the inter-node monitoring based on the process flow in FIG. 6.
  • In FIG. 7, the fault monitoring (MON) is performed at the communication cycles i and i+1.
  • The fault-monitoring result exchange (EXD1) and the fault identification (ID1) are distributed into the communication cycles i+2 and i+3.
  • The fault-identification result exchange (EXD2) and the fault identification (ID2) are distributed into the communication cycles i+4 and i+5.
  • The nodes perform the fault-monitoring result exchange (EXD1) and the fault identification (ID1) on the nodes 1 and 2 at the communication cycle i+2 and on the nodes 3 and 4 at the communication cycle i+3.
  • The nodes perform the fault-identification result exchange (EXD2) and the fault identification (ID2) on all the nodes at the communication cycle i+4. As shown in FIG. 7, the fault-monitoring result exchange process (EXD1) and the fault-identification process (ID1) are divided among the nodes and distributed over the communication cycles.
  • The nodes concurrently perform the fault-identification round 1 and the fault-identification round 2 and later.
  • The fault-monitoring result exchange (EXD1) is performed for the fault-identification round 1.
  • The fault monitoring (MON) is performed for the fault-identification round 2 according to the content of the received data or the situation of the data reception.
  • The fault-identification result exchange (EXD2) is performed for the fault-identification round 1.
  • The fault-monitoring result exchange (EXD1) is performed on the nodes 1 and 2 for the fault-identification round 2.
  • The fault monitoring (MON) is performed for the fault-identification round 3 according to the content of the received data or the situation of the data reception. The relation between the fault-identification rounds 2 and later is the same, and the above-mentioned processes are repeated subsequently.
  • In FIG. 8, the fault monitoring (MON) is performed at the communication cycles i and i+1.
  • The fault-monitoring result exchange (EXD1) and the fault identification (ID1) are distributed into the communication cycles i+2 and i+3.
  • The fault-identification result exchange (EXD2) and the fault identification (ID2) are distributed into the communication cycles i+4 and i+5.
  • The nodes perform half of the fault-monitoring result exchange (EXD1) and fault identification (ID1) processes at each of the communication cycles i+2 and i+3.
  • Here, “half” means that the fault-monitoring result exchange (EXD1) transmits half of the fault-monitoring results at the communication cycle i+2, and the fault identification (ID1) partially performs the process of collecting the fault-monitoring results for the fault identification, such as the counting for a majority rule, just for the data acquired by the fault-monitoring result exchange (EXD1).
  • The remaining process is performed at the communication cycle i+3.
  • The nodes perform the fault-identification result exchange (EXD2) and the fault identification (ID2) concerning a majority malfunction at the communication cycle i+4 and concerning a minority malfunction at the communication cycle i+5.
  • The processes of the fault-monitoring result exchange (EXD1), the fault-identification result exchange (EXD2), and the fault identifications (ID1 and ID2) are distributed over the communication cycles.
  • The nodes concurrently perform the fault-identification round 1 and the fault-identification round 2 and later.
  • The fault-monitoring result exchange (EXD1) is performed for the fault-identification round 1.
  • The fault monitoring (MON) is performed for the fault-identification round 2.
  • The fault-identification result exchange (EXD2) is performed for the fault-identification round 1.
  • The fault-monitoring result exchange (EXD1) is performed for the fault-identification round 2.
  • The fault monitoring (MON) is performed for the fault-identification round 3 as well. The relation between the fault-identification rounds 2 and later is the same, and the above-mentioned processes are repeated subsequently.
  • FIGS. 9A and 9B show operation examples of the fault-identification process according to the inter-node monitoring.
  • The process flow is based on FIG. 6.
  • The time-base process distribution and the process pipelining follow FIG. 8.
  • The conditions, such as the number of nodes and the fault-monitoring items, are the same as those shown in FIG. 5.
  • The fault-identification result exchange (EXD2) includes increasing or decreasing the error counter value in accordance with the result of the fault identification (ID1), transmitting the error counter value, and transmitting the counter value for error counter synchronization.
  • When receiving a counter value, the node synchronizes its error counter.
  • An example of the synchronization method is as follows: (1) if the difference between the received counter value and the node's own counter value is within a specified value (e.g., ±1), the node adjusts its counter value to the received counter value; (2) if condition (1) is not satisfied and the difference between two successively received counter values is within a specified value (e.g., ±1), the node adjusts its counter value to the later received counter value (see the sketch below).
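A sketch of these two rules in C, assuming a tolerance of 1 as in the example; the state layout and the names are illustrative.

```c
#include <stdlib.h>
#include <stdbool.h>

typedef struct {
    int  local;     /* this node's error counter value           */
    int  prev_rx;   /* previously received counter value         */
    bool have_prev; /* false until two values have been received */
} ec_sync_t;

/* Apply rules (1) and (2) to one received counter value. */
void ec_on_receive(ec_sync_t *s, int rx, int tolerance /* e.g. 1 */)
{
    if (abs(rx - s->local) <= tolerance) {
        s->local = rx;                 /* rule (1): adopt received value  */
    } else if (s->have_prev && abs(rx - s->prev_rx) <= tolerance) {
        s->local = rx;                 /* rule (2): adopt the later value */
    }
    s->prev_rx   = rx;
    s->have_prev = true;
}
```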
  • Alternatively, the transmission data may include an area exclusively used for the results of the fault identification (ID1), without reflecting the result of the fault identification (ID1) on the error counter value.
  • At the communication cycles i and i+1, the nodes 1 to 4 sequentially use the slots 1 to 4 to transmit the fault-monitoring results (EXD1, 901-0 to 904-0 and 901-1 to 904-1) for the fault-identification round k-1, maintaining the results received from the other nodes and generated by the node itself (921-0 to 924-0 and 921-1 to 924-1).
  • The nodes 1 and 2 transmit the fault-monitoring results concerning the nodes 1 and 2; the nodes 3 and 4 transmit the fault-monitoring results concerning the nodes 3 and 4.
  • At the next cycle, the nodes transmit the remaining data. Since the results include no data indicating “abnormality” and are normally received by the nodes, no malfunction is found in the fault identification for the fault-identification round k-1.
  • The fault identification (ID1) is divided between the communication cycles i and i+1 and produces a result at the communication cycle i+1 (931-1 to 934-1, which represent the node numbers in charge). None of the nodes turns on the node fault flag (951-0 to 954-0 and 951-1 to 954-1).
  • The fault-identification result exchange (EXD2) and the fault identification (ID2) for the fault-identification round k-2 are also performed.
  • The error counter value in each node indicates 2 for the majority malfunction of the node 3 and 0 otherwise; no change is made from the communication cycle i-1 (941-0 to 944-0 and 941-1 to 944-1).
  • The fault monitoring (MON) for the fault-identification round k is performed in parallel with the fault-monitoring result exchange (EXD1) for the fault-identification round k-1.
  • Each node detects no fault at the communication cycle i (911-0 to 914-0).
  • The node 3 is subject to a CPU fault at the end of the communication cycle i and causes a sequence number malfunction.
  • The nodes 1, 2, and 4 detect the fault of the node 3 (911-1 to 914-1).
  • The fault-monitoring result exchange (EXD1, 901-2 to 904-2 and 901-3 to 904-3) for the fault-identification round k is performed similarly to the fault-identification round k-1.
  • Each node acquires the fault-monitoring results, including the fault detection of the node 3 at the communication cycle i+1 (921-2 to 924-2 and 921-3 to 924-3).
  • The fault identification (ID1) for the fault-identification round k is also performed similarly to the fault-identification round k-1.
  • The node 1, in charge of the node 3, identifies the majority malfunction of the node 3 (931-3 to 934-3). All the nodes detect no fault during the concurrently performed fault monitoring (MON) for the fault-identification round k+1 (911-2 to 914-2 and 911-3 to 914-3).
  • The fault-identification result exchange (EXD2) and the fault identification (ID2) for the fault-identification round k are performed in parallel with the fault monitoring (MON) for the fault-identification round k+2 and the fault-monitoring result exchange (EXD1) for the fault-identification round k+1.
  • The majority malfunction of the node 3 detected by the node 1 is transmitted to the other nodes (901-4).
  • The nodes recognize the majority malfunction of the node 3 and increment the corresponding error counter value to 3 at the communication cycle i+5 (941-5 to 944-5). Consequently, the nodes turn on the node fault flag corresponding to the majority malfunction of the node 3 (951-5 to 954-5).
  • In this manner, the CPU fault of the node 3 is identified by each node, and the corresponding node fault flag notifies the application of the fault.
  • As described above, the fault-identification process according to the inter-node monitoring in FIG. 6 can be performed in a pipelined fashion in synchronization with the communication cycle.
  • The time-base process distribution reduces the CPU processing load and the quantity of communication per communication cycle compared to a case without it.
  • A minority malfunction is processed in the same way as in the above example, which describes a majority malfunction.
  • In the examples so far, the periods (communication cycles) are constant for the fault-monitoring process (MON) and for the fault-monitoring result exchanges (EXD and EXD1) and the fault identifications (ID, ID1, and ID2) that are divided and performed. These periods can also be changed while the system is in operation; in other words, the fault identification based on the mutual monitoring can be performed with a variable cycle.
  • FIGS. 10 and 11 illustrate examples of changing the cycles of the fault-monitoring process (MON), the fault-monitoring result exchange (EXD), and the fault-identification process (ID) while the system is in operation, for the parallel fault-identification processing according to the mutual monitoring in FIG. 3.
  • One method of changing the cycle of the fault identification is, when a fault occurs in a node, to shorten the cycles of the processes associated with the fault identification for that node.
  • This method is based on the principle that a node subject to a fault needs the fault identification to be performed in a short cycle.
  • The cycle may be changed when the error counter value becomes greater than or equal to a specified value. Because the error counter has synchronization means, the timing of the cycle change can be synchronized among the nodes (see the sketch below).
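A sketch of this cycle-change policy: when a node's synchronized error counter reaches a specified value, every node shortens the monitoring period for that node. The names and the one-cycle target are assumptions based on the FIGS. 10 and 11 examples.

```c
#include <stdint.h>

#define N_NODES 4

typedef struct {
    uint32_t period[N_NODES]; /* monitoring period per target, in cycles */
} mon_schedule_t;

/* Because the error counters are synchronized between nodes, all nodes
 * evaluate this condition identically and change the cycle in step. */
void maybe_shorten_period(mon_schedule_t *s, int target,
                          int error_counter, int change_threshold)
{
    if (error_counter >= change_threshold && s->period[target] > 1)
        s->period[target] = 1; /* e.g. from the normal two cycles to one */
}
```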
  • FIG. 10 shows an example of changing the fault-identification cycle of the node 1.
  • The communication cycles i to i+3 are the same as those shown in FIG. 3.
  • The error counter value for the node 1 becomes greater than or equal to the specified value during the fault identification (ID) for the node 1 at the communication cycle i+2, causing the cycle of the fault identification for the node 1 to be shortened from the normal two cycles to one cycle.
  • The period (communication cycle) of the fault monitoring (MON) for the node 1 is shortened at the communication cycle i+4 and later.
  • The fault-monitoring result exchange (EXD) and the fault identification (ID) for the node 1 are also performed in the one cycle next to the cycle of the fault monitoring (MON).
  • The fault-monitoring result exchange (EXD) for the node 1 is also performed in parallel with the fault monitoring (MON) for all the nodes. Consequently, the fault identification (ID) for the node 1 is performed in a pipelined fashion at every cycle.
  • FIG. 11 shows an example of changing the fault-identification cycle of the node 3.
  • The communication cycles i to i+3 are the same as those shown in FIG. 3.
  • The error counter value for the node 3 becomes greater than or equal to the specified value during the fault identification (ID) for the node 3 at the communication cycle i+3, causing the cycle of the fault identification for the node 3 to be shortened from the normal two cycles to one cycle.
  • The period (communication cycle) of the fault monitoring (MON) for the node 3 is shortened at the communication cycle i+4 and later.
  • The fault-monitoring result exchange (EXD) and the fault identification (ID) for the node 3 are also performed in the one cycle next to the cycle of the fault monitoring (MON).
  • The fault-monitoring result exchange (EXD) and the fault identification (ID) corresponding to the fault monitoring (MON) for the node 3 at the communication cycles i+2 and i+3 are performed at the communication cycle i+4, advanced from the scheduled communication cycle i+5. Instead, at the communication cycle i+5, the fault-monitoring result exchange (EXD) and the fault identification (ID) corresponding to the fault monitoring (MON) for the node 3 at the communication cycle i+4 are performed. Similarly, at the communication cycle i+6 and later, the fault-monitoring result exchange (EXD) and the fault identification (ID) corresponding to the fault monitoring (MON) at the communication cycle i+5 and subsequent cycles are performed, each advanced by one cycle from the schedule. As for the node 3, the fault identification (ID) is thus performed at every cycle.
  • FIGS. 12A and 12B show operation examples of the fault-identification process according to the inter-node monitoring with a variable fault-identification cycle.
  • The process flow is based on FIG. 2.
  • The time-base process distribution and the process pipelining follow FIG. 11.
  • The conditions, such as the fault-monitoring items, are the same as those shown in FIG. 5.
  • A difference is that the transmission data includes the bits of the fault-monitoring results for the nodes 1 to 4 at every cycle. Whether or not a fault-monitoring result is used depends on the cycle of the fault identification.
  • The fault identification (ID) does not necessarily use every fault-monitoring result.
  • The communication cycles i to i+3 are almost the same as those shown in FIG. 5.
  • All the nodes are assigned 0 as the initial error counter value concerning the majority malfunction of the node 3 (1241-0 to 1244-0, 1241-1 to 1244-1, and 1241-2 to 1244-2). Therefore, when the nodes identify the majority malfunction of the node 3 at the communication cycle i+3 (1231-3 to 1234-3), the corresponding error counter value is incremented to 1 (1241-3 to 1244-3).
  • The node 3 suffers a CPU malfunction at the communication cycles i+1 to i+3, causing a sequence number malfunction in the node 3 itself. Consequently, the nodes 1, 2, and 4 detect the fault of the node 3 by the fault monitoring (MON) at the communication cycles i+2 to i+4 (1211-2 to 1214-2, 1211-3 to 1214-3, and 1211-4 to 1214-4).
  • The error counter value concerning the majority malfunction of the node 3 is set to 1 at the communication cycle i+3.
  • The nodes then change the fault-identification cycle for the node 3 from two cycles to one.
  • A fault of the node 3 detected at the communication cycles i+2 and i+3 (1211-2 to 1214-2 and 1211-3 to 1214-3) is used for the fault-monitoring result exchange (EXD) at the communication cycle i+4 (an OR operation is carried out to regard the faults at the communication cycles i+2 and i+3 as one fault).
  • A fault of the node 3 detected at the communication cycle i+4 (1211-4 to 1214-4) is used for the fault-monitoring result exchange (EXD) at the communication cycle i+5.
  • In terms of the fault-identification rounds for the node 3, the round 2 corresponds to the communication cycles i+2 and i+3, and the round 3 corresponds to the communication cycle i+4.
  • The corresponding fault identifications (ID) are performed at the communication cycles i+3 (1231-3 to 1234-3), i+4 (1231-4 to 1234-4), and i+5 (1231-5 to 1234-5), respectively.
  • The error counter values in the nodes corresponding to the majority malfunctions of the node 3 are incremented (1241-3 to 1244-3, 1241-4 to 1244-4, and 1241-5 to 1244-5).
  • The counter value reaches 3 at the communication cycle i+5.
  • Consequently, the node fault flag corresponding to the majority malfunction of the node 3 is turned on (1245-1 to 1245-5).
  • In this manner, the CPU fault of the node 3 is identified by each node, and the corresponding node fault flag notifies the application of the fault.
  • As described above, the fault-identification process according to the inter-node monitoring in FIG. 2 can change the cycle of the fault identification while the system is in operation.
  • The flow chart in FIG. 6 and a minority malfunction are processed in the same way as in the above example, which describes the flow chart in FIG. 2 and a majority malfunction.
  • Control systems using the distributed system are applied to a wide range of industrial fields, such as vehicles, construction equipment, and factory automation (FA).
  • The present invention can ensure high system reliability and improve availability based on backup control for the distributed control systems.
  • The distributed systems can be controlled at low cost without additional special apparatus.

Abstract

A distributed system performs fault identification using inter-node monitoring so as to locate a fault with high reliability and ensure consistent recognition about the situation of the fault occurrence among nodes. The process is synchronized with a communication cycle. If a system does not need to perform the fault identification as often as every communication cycle, the frequency of the fault identification should be decreased so as to reduce a load for CPU processing and a consumption of a communication band per unit time.
A distributed system of the present invention includes plural nodes that are connected to each other via a network. Each of the nodes includes a fault-monitoring section, a transmission and reception section, and a fault-identification section. The fault-monitoring section monitors a fault in other nodes. The transmission and reception section transmits and receives data to detect the fault in other nodes via the network. The fault-identification section identifies which node has the fault based on the data. The fault-monitoring section uses one or more communication cycles as a monitoring period; the communication cycles are synchronized between the nodes.

Description

    CLAIM OF PRIORITY
  • The present application claims priority from Japanese Patent Application JP 2008-168052 filed on Jun. 27, 2008, the content of which is hereby incorporated by reference into this application.
  • FIELD OF THE INVENTION
  • The present invention relates to a distributed system for exercising highly reliable control based on cooperative operations of multiple networked devices.
  • BACKGROUND OF THE INVENTION
  • In order to improve driving comfort and safety of vehicles, vehicle control systems are being developed that reflect, using electronic control rather than mechanical coupling, a driver's operations on an accelerator, a steering wheel, a brake, and the like in the vehicle mechanisms for generating driving, steering, and braking forces. Similar electronic control is presently applied to other devices such as construction equipment. In these systems, multiple electronic control units (ECUs) are distributed on the devices and provide cooperative operations by exchanging data via a network. When a fault occurs in an ECU in a network, it is required, for fail-safe operation, that each of the other ECUs within the same network accurately locates the fault and provides backup control appropriate to the content of the fault. Japanese Published Unexamined Patent Application No. 2000-47894 discloses a technology allowing each of nodes (e.g., ECUs as processors) included in the system to monitor the other networked nodes.
  • The technology described in Japanese Published Unexamined Patent Application No. 2000-47894 requires a special node (shared disk) to share monitoring information about operating states of database applications and the like with respective nodes. A fault of the shared disk inhibits continuous monitoring of nodes in the system. Installation of the shared disk may increase system costs.
  • The following method may solve the problem. Each of the nodes independently monitors a specific item of a given node so as to detect a fault. The nodes exchange a fault-monitoring result through the network. A fault is finally located from the fault-monitoring results collected in the nodes by a majority rule, for example. These processes are synchronized with a communication cycle. The above-mentioned processes of monitoring a fault, exchanging the fault-monitoring results, and locating the fault are performed in a pipelined fashion, making it possible to locate the fault at every communication cycle.
  • However, locating a fault at every communication cycle may be too frequent for a system. An object of the present invention is to provide a distributed system that can configure cycles for fault monitoring and communication independently to reduce processing loads on a central processing unit (CPU) and communication bands for fault monitoring and increase the degree of freedom for configuring fault-monitoring cycles.
  • SUMMARY OF THE INVENTION
  • To achieve the above-mentioned object, an aspect of the present invention provides a distributed system that includes plural nodes being connected to each other via a network. Each of the nodes includes a fault-monitoring section for monitoring a fault in other nodes; a transmission and reception section for transmitting and receiving data to detect the fault in other nodes via the network; and a fault-identification section for identifying which node has the fault based on the data. The fault-monitoring section uses plural communication cycles as a monitoring period; the plural communication cycles are synchronized between the nodes.
  • The distributed system of the present invention may include the transmission and reception section that includes a monitoring result from the fault-monitoring section in transmission and reception data and distributes transmission and reception of the data into a next monitoring period. The next monitoring period is a next period of when the monitoring result is obtained.
  • The distributed system of the present invention may include the fault-identification section that distributes fault identification into next monitoring period. The next monitoring period is a next period of when the monitoring result is obtained by the fault-monitoring section. The monitoring result is included in the data.
  • The distributed system of the present invention may include the fault-monitoring section that changes, while the distributed system is in operation, the monitoring period for each node to be monitored.
  • The present invention can provide a distributed system that reduces processing loads on a central processing unit (CPU) and communication bands for fault monitoring and increases the degree of freedom for configuring fault-monitoring cycles.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of a distributed system;
  • FIG. 2 is a flow chart showing a fault-identification process based on inter-node monitoring;
  • FIG. 3 shows an example of the fault-identification process in a pipelined fashion;
  • FIG. 4 shows another example of the fault-identification process in a pipelined fashion;
  • FIG. 5 shows an operation example of the fault-identification process;
  • FIG. 6 is a flow chart showing the fault-identification process distributed among nodes;
  • FIG. 7 shows an example of the fault-identification process distributed among nodes in a pipelined fashion;
  • FIG. 8 shows another example of the fault-identification process distributed among nodes in a pipelined fashion;
  • FIG. 9A shows an operation example of the fault-identification process;
  • FIG. 9B shows another operation example of the fault-identification process;
  • FIG. 10 shows an example of the fault-identification process capable of varying an active cycle in a pipelined fashion;
  • FIG. 11 shows another example of the fault-identification process capable of varying an active cycle in a pipelined fashion;
  • FIG. 12A shows an operation example of the fault-identification process capable of varying an active cycle; and
  • FIG. 12B shows another operation example of the fault-identification process capable of varying an active cycle.
  • DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
  • Embodiments of the present invention will be described in detail with reference to the drawings.
  • First Embodiment
  • FIG. 1 is a block diagram of a distributed system.
  • The distributed system includes multiple nodes 10, such as 10-1, 10-2, . . . , and 10-n. The nodes are connected via a network 100. The nodes are processors capable of information communication via the network, including various electronic control devices equipped with a CPU, actuators with the necessary drivers, and sensors. The network 100 is capable of multiplex communication and of broadcast transmission, which simultaneously transmits the same content from a given node to all the other nodes connected to the network. The distributed system may use a communication protocol such as FlexRay (registered trademark of Daimler AG) or TTCAN (time-triggered CAN).
  • Each node is represented by i, where i is the node number ranging from 1 to n. Each node includes a CPU 11-i, a main memory 12-i, an interface (I/F) 13-i, and a storage device 14-i. These components are connected to each other through an internal communication line or the like. The interface 13-i is connected to the network 100.
  • The storage device 14-i includes programs, such as a fault-monitoring section 141-i, a transmission/reception section 142-i, a fault-identification section 143-i, and a counter section 144-i, and a fault-identification result 145-i. The fault-identification result 145-i includes a monitoring result table, a fault-identification result table, and an error counter, which are described later.
  • The CPU 11-i reads these programs into the main memory 12-i and executes the programs for processes. The programs or data described in this specification may be stored in the storage device in advance, may be supplied from storage media such as CD-ROM, or may be downloaded from other devices via the network. Special hardware may be used to realize functions implemented by the programs.
  • In the following explanation, the programs are described as the agents performing the processing, but the actual agent is the CPU, which performs the processing according to the programs.
  • The fault-monitoring section 141-i performs fault monitoring (MON) on the other nodes. The transmission/reception section 142-i transmits or receives data via the network 100 for detecting faults on the other nodes. The fault-identification section 143-i performs fault identification (ID) to identify which node has a fault based on data for detecting faults on the other nodes. The counter section 144-i counts the number of errors in a node identified as having a fault, with respect to nodes, error locations (error items), and fault-identification conditions to be described later.
  • FIG. 2 is a flow chart showing a fault-identification process based on inter-node monitoring. The process is performed by mutual synchronous communication of each node through the network 100.
  • At Step 21, the fault-monitoring section 141-i monitors a fault on the other nodes, performing a fault-monitoring process (MON) that determines, by node i itself, whether a fault occurs or not in a transmission node according to the contents of received data or the situation of reception. It may be preferable to use multiple fault-monitoring items. For example, an item “reception malfunction” indicates a malfunction when the data reception has an error, such as detection of unsuccessful reception or a malfunction in received data based on an error-detecting code. An item “sequence number malfunction” is used as follows. The transmission node supplies transmission/reception data with a sequence number that an application increments at every communication cycle. The reception node checks an increment of the sequence number and detects a malfunction when the sequence number is not incremented. The sequence number is used to confirm an application malfunction in the transmission node. An item “self-diagnosis malfunction” is used as follows. Each of the nodes performs self-diagnosis as to whether the node itself has a malfunction or not and transmits a result of the diagnosis (self-diagnosis result) to the other nodes. The reception node detects a malfunction in the transmission node by using the self-diagnosis result. When any of the fault-monitoring items indicates a malfunction, a fault-monitoring item that unifies the other fault-monitoring items may be used to notify “malfunction detected.”
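  • As an illustration of the sequence-number item above, the following C sketch shows how a reception node might check the increment at every communication cycle; the type and field names are assumptions for illustration, not taken from the patent.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-transmission-node state kept by the reception node; the type and
 * field names are assumptions for illustration. */
typedef struct {
    uint8_t last_seq;   /* sequence number seen in the previous cycle */
    bool    have_last;  /* false until the first frame is received    */
} seq_monitor_t;

/* Returns true (malfunction detected) when the received sequence number
 * is not the increment of the previously received one; wrap-around at
 * 255 is handled by the 8-bit arithmetic. */
static bool seq_number_malfunction(seq_monitor_t *m, uint8_t received_seq)
{
    bool bad = m->have_last && (uint8_t)(m->last_seq + 1u) != received_seq;
    m->last_seq  = received_seq;
    m->have_last = true;
    return bad;
}
```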
  • The fault-monitoring process is performed over a period of p communication cycles, where p is 1, 2, 3, . . . ; the communication cycle is the unit of the period. The nodes are synchronized with each other on the fault-monitoring period of p communication cycles. To establish the synchronization, a node may declare initiation of the fault-monitoring process using the communication. Alternatively, the number of communication cycles may be used to determine the monitoring period. For example, if the first fault monitoring is defined to begin with communication cycle 0, the beginning of each fault-monitoring period is found wherever the communication cycle number is divisible by p with no remainder, as sketched below. The use of multiple communication cycles for the period of the fault-monitoring process can decrease the frequency of the subsequent processes and reduce the communication band per communication cycle and the processing load on the CPU in each node.
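  • A minimal C sketch of this cycle-count method, assuming a cycle counter synchronized among the nodes and an illustrative period constant:

```c
#include <stdbool.h>
#include <stdint.h>

/* Monitoring period length in communication cycles (p); the value 2 is
 * an illustrative assumption matching the examples in FIGS. 3 and 4. */
#define MONITORING_PERIOD_P 2u

/* Returns true when the given communication cycle number begins a
 * fault-monitoring period, assuming the first period starts at cycle 0
 * and the cycle counter is synchronized among the nodes. */
static bool is_monitoring_period_start(uint32_t cycle_number)
{
    return (cycle_number % MONITORING_PERIOD_P) == 0u;
}
```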
  • At Step 22, the transmission/reception section 142-i performs a fault-monitoring result exchange process (EXD) that exchanges the fault-monitoring result acquired at Step 21 among the nodes. Each node holds the fault-monitoring results from all the nodes including the result of the node itself. The collected fault-monitoring results are stored in the monitoring result table of the fault-identification result 145-i.
  • The fault-monitoring result exchange process may be performed at one communication cycle or divided into multiple communication cycles. Performing the fault-monitoring result exchange process over multiple communication cycles reduces a necessary communication band per communication cycle and a load for processing received data on the CPU of each node.
  • At Step 23, the fault-identification section 143-i performs a fault-identification process (ID) that determines whether each node and each fault-monitoring item has a malfunction or not from the fault-monitoring results collected in the nodes at Step 22. A fault-identification result is stored in the fault-identification result table of the fault-identification result 145-i.
  • One of the fault-identification methods is a majority rule, which decides whether a malfunction has occurred based on the number of nodes. If the number of nodes having detected a fault in a given node or fault-monitoring item is greater than or equal to the threshold value of a fault-identification condition 1, the node subject to the fault detection is assumed to be abnormal. If the number of those nodes is smaller than the threshold value of a fault-identification condition 2, the node having detected the fault is assumed to be abnormal. Normally, the threshold value is equivalent to half the number of the collected fault-monitoring results.
  • A node is assumed to be normal if it detects no fault under the fault-identification condition 1 or it is a node subject to the fault detection under the fault-identification condition 2. In the following description, a fault satisfying the fault-identification condition 1 is referred to as a majority malfunction. A fault satisfying the fault-identification condition 2 is referred to as a minority malfunction.
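  • The following C sketch illustrates the majority rule with the two fault-identification conditions described above; the names and the separate threshold parameters are assumptions for illustration.

```c
#include <stdbool.h>

#define NUM_NODES 4  /* illustrative system size, as in FIGS. 3 to 5 */

/* Result of the majority-rule decision for one monitored node or
 * fault-monitoring item; the names are assumptions for illustration. */
typedef struct {
    bool majority_malfunction;  /* fault-identification condition 1 */
    bool minority_malfunction;  /* fault-identification condition 2 */
} id_result_t;

/* monitoring_results[j] is true when node j reported a fault in the
 * monitored node. Typically each threshold is about half the number of
 * collected fault-monitoring results. */
static id_result_t identify_fault(const bool monitoring_results[NUM_NODES],
                                  int threshold1, int threshold2)
{
    int detections = 0;
    for (int j = 0; j < NUM_NODES; j++) {
        if (monitoring_results[j])
            detections++;
    }

    id_result_t r;
    /* Condition 1: enough nodes saw the fault, so the monitored node is
     * assumed to be abnormal. */
    r.majority_malfunction = (detections >= threshold1);
    /* Condition 2: only a minority saw the fault, so the detecting
     * node(s) are assumed to be abnormal instead. */
    r.minority_malfunction = (detections > 0 && detections < threshold2);
    return r;
}
```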
  • There is another fault-identification method, which assumes a node subject to fault detection or a fault-monitoring item to be abnormal if at least one node detects a fault.
  • The fault-identification process may be performed at one communication cycle or divided into multiple communication cycles. Performing the fault-identification process over multiple communication cycles can reduce the load for CPU processing per communication cycle in each node.
  • At Step 24, each node performs a fault-identification result utilization process. If a malfunction is determined at Step 23, the counter section 144-i increments an error counter value that indicates the number of errors in the node or the monitoring item subject to the fault identification. If no malfunction is determined, the counter section 144-i decrements the counter value; alternatively, it may reset the counter value or do nothing. Whether to decrement, to reset, or to do nothing is configured in advance. The error counter may be provided for each of the fault-identification conditions. In this case, the error counter is decremented or reset only if none of the fault-identification conditions is satisfied.
  • If the number of errors reaches the specified threshold value or more, the counter section 144-i notifies a control application of a fault occurrence. One of the notification means is to turn on a node fault flag corresponding to the node or the monitoring item subject to the fault identification. The application, referring to the node fault flag, can identify the fault occurrence. The fault occurrence may be immediately notified by interrupting the control application or invoking a callback function after the node fault flag is turned on. When the error counter is provided for each of the fault-identification conditions, the node fault flag is also provided for each of the fault-identification conditions.
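  • A minimal C sketch of the error counter update and node fault flag described in the two preceding paragraphs, assuming the decrement policy and the notification threshold of 3 used in the example of FIG. 5:

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-node (or per-item, per-condition) error bookkeeping; the names are
 * assumptions for illustration. */
typedef struct {
    uint8_t error_count;
    bool    node_fault_flag;
} error_counter_t;

/* Notification threshold; the value 3 matches the example of FIG. 5 but
 * is otherwise an assumption. */
#define FAULT_NOTIFY_THRESHOLD 3u

/* Reflects one fault-identification result on the counter, using the
 * decrement policy; reset or do-nothing are the alternatives mentioned
 * above and would replace the else-branch. */
static void update_error_counter(error_counter_t *c, bool malfunction)
{
    if (malfunction) {
        c->error_count++;
    } else if (c->error_count > 0u) {
        c->error_count--;
    }
    if (c->error_count >= FAULT_NOTIFY_THRESHOLD) {
        c->node_fault_flag = true;  /* the application polls this flag or
                                       is notified via a callback */
    }
}
```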
  • When the fault-identification process is divided into multiple communication cycles, the fault-identification result utilization process may be performed when all the fault-identification processes are completed or when part of the fault-identification processes is completed to make sequential use of the results of the part of the fault-identification processes. The former should be employed if all nodes need to maintain the same recognition on the fault occurrence or the same state transition according to the fault occurrence.
  • The above-mentioned processes can locate the fault occurrence with high reliability and provide the nodes with the same recognition on the error occurrence. In this case, distributing the processes into multiple communication cycles can reduce a load for CPU processing or suppress necessary communication bands per communication cycle.
  • The processes in FIG. 2 may be performed in parallel when they are performed repeatedly. Defining one execution of the process flow in FIG. 2 as a fault-identification round, it may be preferable to perform multiple fault-identification rounds in parallel.
  • FIGS. 3 and 4 show examples of parallel processing of identifying a fault in a 4-node system by the inter-node monitoring based on the process flow in FIG. 2.
  • In FIG. 3, as a fault-identification round 1, the fault monitoring (MON) is performed at communication cycles i and i+1 (p=2), and the fault-monitoring result exchange (EXD) and the fault identification (ID) are distributed into communication cycles i+2 and i+3. Each node exchanges the monitoring results (EXD) for the nodes 1 and 2 at the communication cycle i+2 and for the nodes 3 and 4 at the communication cycle i+3, performing the fault identification (ID) from the monitoring results. As shown in FIG. 3, the fault-monitoring result exchange process (EXD) and the fault-identification process (ID) are divided among the nodes and distributed into the communication cycles.
  • The nodes concurrently perform the fault-identification round 1 and a fault-identification round 2 and later. At the communication cycles i+2 and i+3, the fault-monitoring result exchange (EXD) is performed for the fault-identification round 1. At the same time, the fault monitoring (MON) is performed for the fault-identification round 2 according to the content of the received data or the situation of the data reception resulting from the fault-monitoring result exchange (EXD). Similarly, the fault monitoring (MON) is performed for a fault-identification round 3 simultaneously with the fault-monitoring result exchange (EXD) for the fault-identification round 2. The fault identification (ID) is performed in the meantime. These processes are repeated subsequently. Results of the fault identification (ID) may be used as soon as those for the nodes 1 and 2 are available, or for all the nodes together after the results for the nodes 3 and 4 are acquired.
  • In FIG. 4, as the fault-identification round 1, the fault monitoring (MON) is performed at the communication cycles i and i+1, the fault-monitoring result exchange (EXD) being distributed into the communication cycles i+2 and i+3 and the fault identification (ID) being distributed into the communication cycles i+3 and i+4. The results of the fault monitoring (MON) are transmitted from the nodes 1 and 2 at the communication cycle i+2 and from the nodes 3 and 4 at the communication cycle i+3. The fault identification (ID) is performed on the nodes 1 and 2 at the communication cycle i+3 and on the nodes 3 and 4 at the communication cycle i+4. A difference from FIG. 3 is to divide the fault-monitoring result exchange (EXD) process into the transmission nodes and distribute the process into the communication cycles.
  • The nodes concurrently perform the fault-identification round 1 and the fault-identification round 2 and later. At the communication cycles i+2 and i+3, the fault-monitoring result exchange (EXD) is performed for the fault-identification round 1. At the same time, the fault monitoring (MON) is performed for the fault-identification round 2 according to the content of the received data or the situation of the data reception resulting from the fault-monitoring result exchange (EXD). Relation between the fault- identification rounds 2 and 3 is the same and the above-mentioned processes are repeated subsequently.
  • As shown in FIGS. 3 and 4, the fault-identification process according to the inter-node monitoring in FIG. 2 is performed in a pipelined fashion. In this manner, the fault monitoring (MON) is applicable to all the time intervals (communication cycles). In addition, the fault identification (ID) can be continuously performed at a specified interval.
  • While FIGS. 3 and 4 assume the number of nodes to be four (n=4), the number of nodes is not limited. In FIGS. 3 and 4, two communication cycles are used as an applicable period for the fault monitoring (MON), and the fault-monitoring result exchange (EXD) process and the fault-identification (ID) process are distributed into the two communication cycles. The number of communication cycles may be one or more than two. Decreasing the number of communication cycles for each process shortens the time (the number of communication cycles) needed for the fault identification (ID) but relatively increases a load for CPU processing or a communication band to be consumed. Increasing the number of communication cycles for each process lengthens the time (the number of communication cycles) needed for the fault identification (ID) but relatively decreases a load for CPU processing or a communication band to be consumed.
  • For example, in a case of six nodes in FIG. 3, the fault-monitoring result exchange (EXD) and the fault identification (ID) may be performed on the nodes 1 to 3 at the communication cycle i+2 and on the nodes 4 to 6 at the communication cycle i+3 for the first fault-identification round. Alternatively, it may be preferable to add the fault-monitoring result exchange (EXD) and the fault identification (ID) to be performed on nodes 5 and 6 at a communication cycle i+4.
  • Distribution of the fault-monitoring result exchange (EXD) and the fault identification (ID) over communication cycles is hereafter referred to as time-base process distribution. The time-base process distribution is preferably arranged so that the load for CPU processing and the quantity of communication are equal for each communication cycle, because the control application is then less affected in terms of resources such as the CPU throughput and the communication band. FIGS. 3 and 4 show examples of such equal distribution.
  • As for the time-base process distribution, the processes are distributed into the nodes for fault monitoring, the nodes for fault identification, and the nodes for transmission in FIGS. 3 and 4. However, the processes may be distributed in any manner as long as each node performs a part of the process at each communication cycle. In FIG. 4, for example, each node may perform a part of the fault-identification (ID) process, such as counting for a majority rule, at the communication cycle i+2 using fault-monitoring results received from the nodes 1 and 2 and then perform the rest of the fault-identification (ID) process at the communication cycle i+3 using fault-monitoring results received from the nodes 3 and 4 to complete the fault-identification process. This decreases by one the number of communication cycles in FIG. 4 needed to complete the fault-identification process.
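  • As a sketch of such partial processing, the tally for a majority rule can be accumulated batch by batch as the fault-monitoring results arrive, deferring only the threshold test to the last communication cycle; all names below are assumptions.

```c
#include <stdbool.h>

/* Incremental tally for time-base distributed fault identification: each
 * batch of fault-monitoring results is folded in during the cycle in
 * which it arrives, so only the threshold test remains for the last
 * cycle. All names are assumptions for illustration. */
typedef struct {
    int detections;  /* nodes that have reported a fault so far */
    int collected;   /* monitoring results folded in so far     */
} id_tally_t;

static void tally_batch(id_tally_t *t, const bool *results, int count)
{
    for (int i = 0; i < count; i++) {
        if (results[i])
            t->detections++;
    }
    t->collected += count;
}

static bool majority_malfunction(const id_tally_t *t, int threshold1)
{
    return t->detections >= threshold1;  /* run once all results are in */
}
```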
  • FIG. 5 shows an operation example of the fault-identification process based on the inter-node monitoring. The process flow is based on FIG. 2. The time-base process distribution and the process pipelining are compliant with FIG. 3. The number of nodes is assumed to be four. In this example, various items are unified as one fault-monitoring item. The fault-identification process (ID) is performed at the end of the communication cycles after termination of transmission or reception of each node.
  • Transmission data includes bits for two nodes, each of the bits indicating the presence or absence of a malfunction concerning a node to be monitored. An area corresponding to a given node stores a result of diagnosis about the given node. The presence or absence of a malfunction concerning the nodes 1 and 2 are stored at an even-numbered cycle. The presence or absence of a malfunction concerning the nodes 3 and 4 are stored at an odd-numbered cycle.
  • The transmission data also includes the error counter value, held by each node, for one node. At the communication cycles i and i+1, the node 1 transmits an error counter value for the node 2; the node 2 transmits an error counter value for the node 3; the node 3 transmits an error counter value for the node 4; the node 4 transmits an error counter value for the node 1. At the communication cycles i+2 and i+3, the node 1 transmits an error counter value for the node 4; the node 2 transmits an error counter value for the node 1; the node 3 transmits an error counter value for the node 2; the node 4 transmits an error counter value for the node 3. The assignments are thus rotated. Error counters are independent for a majority malfunction and a minority malfunction. The number of majority malfunctions (EC) is transmitted at an even-numbered cycle. The number of minority malfunctions (FC) is transmitted at an odd-numbered cycle.
  • When receiving an error counter value, the node uses the received error counter value to synchronize the error counters between nodes in the fault-identification result utilization process, before the node reflects the result of the fault identification (ID) on the error counter. This is because the error counter values may differ from node to node even when the fault-identification process is performed in accordance with the inter-node monitoring. Possible causes of this difference include a reset based on diagnosis on the node itself or a temporary communication failure. The error counters may be synchronized as follows. If a node receives a counter value that differs from the counter value that the node holds and the difference between two successively received counter values is within a given value (e.g., ±1), the node adjusts its counter value to the later received counter value.
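  • A minimal C sketch of this synchronization rule, assuming the node retains the previously received counter value; all names are illustrative.

```c
#include <stdint.h>
#include <stdlib.h>

/* Applies the synchronization rule described above: if the received
 * counter value differs from the local one and two successively received
 * values agree to within +/-1, adopt the later received value. The
 * function and parameter names are assumptions for illustration. */
static uint8_t sync_error_counter(uint8_t local,
                                  uint8_t prev_received,
                                  uint8_t received)
{
    if (received != local &&
        abs((int)received - (int)prev_received) <= 1) {
        return received;  /* adopt the later received counter value */
    }
    return local;         /* otherwise keep the local counter value */
}
```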
  • The transmission data shown here represents only part of the contents; it may also include a sequence number and control data in addition to the above-mentioned data.
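  • The following C sketch gives one conceivable layout of such a transmission frame; the field widths, names, and ordering are assumptions for illustration, not the patent's actual format.

```c
#include <stdint.h>

/* One conceivable layout of the per-slot transmission data described
 * above; all fields are assumptions for illustration only. */
typedef struct {
    uint8_t seq_number;       /* incremented by the application each cycle */
    uint8_t mon_result_bits;  /* bit j: malfunction observed in monitored
                                 node j (two nodes per cycle, alternating
                                 between even and odd cycles)              */
    uint8_t counter_node_id;  /* node whose error counter value follows    */
    uint8_t error_counter;    /* EC at even cycles, FC at odd cycles       */
    /* ... control data of the application would follow here ...           */
} tx_frame_t;
```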
  • At communication cycle i, where i is an even number, the nodes 1 to 4 sequentially use slots 1 to 4 to transmit fault-monitoring results (EXD, 501-0 to 504-0) concerning the nodes 1 and 2 for a fault-identification round k−1, maintaining results received from the other nodes and generated from the node itself (521-0 to 524-0 represented in binary). Since the results include no data indicating “abnormality” and are normally received by the nodes, no malfunction is found in the fault identification (ID) concerning the nodes 1 and 2 for the fault-identification round k−1 and none of the nodes turns on the node fault flag (551-0 to 554-0 represented in binary). None of the nodes detects a fault during the fault monitoring (MON) for a fault-identification round k (511-0 to 514-0 represented in binary). The error counter value for each node indicates 2 corresponding to the majority malfunction of the node 3 and indicates 0 otherwise. No change is made from a communication cycle i−1 (541-0 to 544-0).
  • However, the node 3 suffers a CPU fault at the end of the communication cycle i. It is assumed that this fault disables the node 3 from incrementing the sequence number to be transmitted at the next communication cycle i+1. The sequence numbers are not shown in the data in FIG. 5.
  • At the communication cycle i+1, the nodes transmit fault-monitoring results (501-1 to 504-1) concerning the nodes 3 and 4 for the fault-identification round k−1 and maintain the results (521-1 to 524-1). Similarly to the communication cycle i, no malfunction is found in the fault identification (ID) concerning the nodes 3 and 4 for the fault-identification round k−1 and the error counter (541-0 to 544-0) and the node fault flag (551-1 to 554-1) are the same as those for the communication cycle i. However, the nodes 1, 2, and 4 detect a fault in the node 3 (511-1, 512-1, and 514-1) according to sequence number malfunctions of the node 3 during the fault monitoring (MON) concerning the nodes 3 and 4 for the fault-identification round k. The node 3 cannot detect a malfunction in the node 3 itself (513-1).
  • The fault-monitoring result exchange (EXD) and the fault identification (ID) for the fault-identification round k and the fault monitoring (MON) for the fault-identification round k+1 are performed on the nodes 1 and 2 at the communication cycle i+2 and on the nodes 3 and 4 at the communication cycle i+3. No malfunction is detected at the communication cycle i+2, similarly to the communication cycle i. At the communication cycle i+3, the fault-monitoring result exchange (EXD) for the fault-identification round k exchanges the fault detection results of the node 3 at the communication cycle i+1 (501-3 to 504-3 and 521-3 to 524-3) and the fault identification (ID) in each node identifies a majority malfunction of the node 3 (531-3 to 534-3). As a result, the error counter value of each node concerning the majority malfunction of the node 3 is incremented to 3 (541-3 to 544-3). In this system, the threshold value for notifying the application of a fault is 3; therefore, the node fault flag of each node concerning the majority malfunction of the node 3 is turned on (551-3 to 554-3).
  • As mentioned above, the CPU fault of the node 3 is identified by each node and the corresponding node fault flag notifies the application of the fault. The fault-identification process according to the inter-node monitoring in FIG. 2 can be performed in a pipelined fashion in synchronization with the communication cycle. The time-base process distribution reduces the load for the CPU processing and the quantity of the communication per communication cycle compared to a case without the time-base process distribution. The minority malfunction is processed in the same way as the above example, which describes the majority malfunction.
  • Second Embodiment
  • FIG. 6 is a flow chart of the fault-identification process according to the inter-node monitoring.
  • The fault monitoring (MON) at Step 21 and the fault-monitoring result exchange process (EXD1 in FIG. 6) at Step 22 are the same as those in FIG. 2.
  • At Step 61, the fault-identification section 143-i performs the fault-identification process (ID1) on one of the nodes involved in the mutual monitoring, excluding the node itself. The node itself is in charge of the fault identification for that one node. The assignments rotate at every communication cycle so that the nodes do not conflict with each other, as sketched below. In this manner, the load of the fault-identification process is distributed among the nodes and reduced.
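  • A minimal C sketch of one possible conflict-free rotation of the ID1 assignments; the concrete formula is an assumption for illustration and is not specified by the patent.

```c
#include <stdint.h>

#define NUM_NODES 4  /* illustrative system size */

/* Returns the node number (1..NUM_NODES) that node my_id is in charge of
 * for the fault identification (ID1) at the given communication cycle.
 * The rotation below never assigns a node to itself and never lets two
 * nodes pick the same target in the same cycle. */
static int id1_target(int my_id, uint32_t cycle)
{
    int offset = (int)(cycle % (NUM_NODES - 1)) + 1;  /* 1..NUM_NODES-1 */
    return ((my_id - 1 + offset) % NUM_NODES) + 1;
}
```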
  • At Step 62, the transmission/reception section 142-i performs a fault-identification result exchange process (EXD2) that exchanges a fault-identification result about the one node acquired at Step 61 among the nodes. Consequently, each node maintains fault-identification results about all the nodes including the result processed by the node itself. At Step 63, a fault-identification process (ID2) uses the collected fault-identification results to settle a final fault-identification result.
  • Step 24 is the same as that of the fault-identification result utilization process in FIG. 2.
  • The fault-identification process (ID1) may use the fault-identification condition 1 for determination about one node. The fault-identification process (ID2) may use the fault-identification condition 2 for determination about all nodes. Alternatively, the fault-identification process (ID2) may use the fault-identification condition 2 for determination about one node. This result may be exchanged among the nodes using a fault-identification result exchange process (EXD3).
  • The fault-identification process (ID1) may be performed on two or more nodes, not limited to one node.
  • FIGS. 7 and 8 show examples of parallel processing for identifying a fault in a 4-node system using the inter-node monitoring based on the process flow in FIG. 6.
  • In FIG. 7, as the fault-identification round 1, the fault monitoring (MON) is performed at the communication cycles i and i+1. The fault-monitoring result exchange (EXD1) and the fault identification (ID1) are distributed into the communication cycles i+2 and i+3. The fault-identification result exchange (EXD2) and the fault identification (ID2) are distributed into the communication cycles i+4 and i+5. The nodes perform the fault-monitoring result exchange (EXD1) and the fault identification (ID1) on the nodes 1 and 2 at the communication cycle i+2 and on the nodes 3 and 4 at the communication cycle i+3. The nodes perform the fault-identification result exchange (EXD2) and the fault identification (ID2) on all the nodes at the communication cycle i+4. As shown in FIG. 7, the fault-monitoring result exchange process (EXD1) and the fault-identification process (ID1) are divided into the nodes and are distributed into the communication cycles.
  • The nodes concurrently perform the fault-identification round 1 and the fault-identification round 2 and later. At the communication cycles i+2 and i+3, the fault-monitoring result exchange (EXD1) is performed for the fault-identification round 1. At the same time, the fault monitoring (MON) is performed for the fault-identification round 2 according to the content of the received data or the situation of the data reception. At the communication cycle i+4, the fault-identification result exchange (EXD2) is performed for the fault-identification round 1. At the same time, the fault-monitoring result exchange (EXD1) is performed on the nodes 1 and 2 for the fault-identification round 2. The fault monitoring (MON) is performed for the fault-identification round 3 according to the content of the received data or the situation of the data reception. Relation between the fault-identification rounds 2 and later is the same and the above-mentioned processes are repeated subsequently.
  • In FIG. 8, as the fault-identification round 1, the fault monitoring (MON) is performed at the communication cycles i and i+1. The fault-monitoring result exchange (EXD1) and the fault identification (ID1) are distributed into the communication cycles i+2 and i+3. The fault-identification result exchange (EXD2) and the fault identification (ID2) are distributed into the communication cycles i+4 and i+5. The nodes perform half of the fault-monitoring result exchange (EXD1) and fault identification (ID1) processes at each of the communication cycles i+2 and i+3: the fault-monitoring result exchange (EXD1) transmits half of the fault-monitoring results at the communication cycle i+2, and the fault identification (ID1) performs the part of collecting fault-monitoring results for the fault identification, such as counting for a majority rule, only on the data acquired so far by the fault-monitoring result exchange (EXD1).
  • The remaining process is performed at the communication cycle i+3. The nodes perform the fault-identification result exchange (EXD2) and the fault identification (ID2) concerning a majority malfunction at the communication cycle i+4 and concerning a minority malfunction at the communication cycle i+5. As shown in FIG. 8, the processes of fault-monitoring result exchange (EXD1), fault-identification result exchange (EXD2), and fault identifications (ID1 and ID2) are distributed into the communication cycles.
  • The nodes concurrently perform the fault-identification round 1 and the fault-identification round 2 and later. At the communication cycles i+2 and i+3, the fault-monitoring result exchange (EXD1) is performed for the fault-identification round 1. At the same time, the fault monitoring (MON) is performed for the fault-identification round 2. At the communication cycles i+4 and i+5, the fault-identification result exchange (EXD2) is performed for the fault-identification round 1. At the same time, the fault-monitoring result exchange (EXD1) is performed for the fault-identification round 2. The fault monitoring (MON) is performed for the fault-identification round 3 as well. Relation between the fault-identification rounds 2 and later is the same and the above-mentioned processes are repeated subsequently.
  • FIGS. 9A and 9B show operation examples of the fault-identification process according to the inter-node monitoring. The process flow is based on FIG. 6. The time-base process distribution and the process pipelining are compliant with FIG. 8. The conditions, such as the number of nodes and the fault-monitoring items, are the same as those shown in FIG. 5.
  • A result of the fault identification (ID1) is reflected on the error counter value. The fault-identification result exchange (EXD2) includes increasing or decreasing the error counter value in accordance with the result of the fault identification (ID1), transmission of the error counter value, and transmission of the counter value for error counter synchronization. When receiving the error counter value, the node synchronizes the error counter. An example of the synchronization method is as follows. (1) If the difference between the received counter value and the counter value of the node itself is within a specified value (e.g., ±1), the node adjusts its counter value to the received counter value. (2) If the condition (1) is not satisfied and the difference between two successively received counter values is within a specified value (e.g., ±1), the node adjusts its counter value to the later received counter value.
  • The transmission data may include an area exclusively used for results of the fault identification (ID1) without reflecting the result of the fault identification (ID1) on the error counter value.
  • At the communication cycles i and i+1, where i is an even number, the nodes 1 to 4 sequentially use slots 1 to 4 to transmit fault-monitoring results (EXD1, 901-0 to 904-0 and 901-1 to 904-1) for the fault-identification round k−1, maintaining results received from the other nodes and generated from the node itself (921-0 to 924-0 and 921-1 to 924-1). At the communication cycle i, the nodes 1 and 2 transmit fault-monitoring results concerning the nodes 1 and 2; the nodes 3 and 4 transmit fault-monitoring results concerning the nodes 3 and 4. At the communication cycle i+1, the nodes transmit the remaining data. Since the results include no data indicating "abnormality" and are normally received by the nodes, no malfunction is found in the fault identification (ID1) for the fault-identification round k−1. The fault identification (ID1) is divided into the communication cycles i and i+1 and generates a result at the communication cycle i+1 (931-1 to 934-1, which represent the node numbers in charge). None of the nodes turns on the node fault flag (951-0 to 954-0 and 951-1 to 954-1). The fault-identification result exchange (EXD2) and the fault identification (ID2) for the fault-identification round k−2 are also performed. The error counter value for each node indicates 2 corresponding to the majority malfunction of the node 3 and indicates 0 otherwise; no change is made from the communication cycle i−1 (941-0 to 944-0 and 941-1 to 944-1).
  • The fault monitoring (MON) is performed for the fault-identification round k in parallel with the fault-monitoring result exchange (EXD1) for the fault-identification round k−1. During the fault monitoring (MON), each node detects no fault at the communication cycle i (911-0 to 914-0). The node 3 is subject to a CPU fault at the end of the communication cycle i and causes a sequence number malfunction. At the communication cycle i+1, the nodes 1, 2, and 4 detect the fault of the node 3 (911-1 to 914-1).
  • At the communication cycles i+2 and i+3, the fault-monitoring result exchange (EXD1, 901-2 to 904-2 and 901-3 to 904-3) for the fault-identification round k is performed similarly to the fault-identification round k−1. Each node acquires the fault-monitoring results including the fault detection from the node 3 at the communication cycle i+1 (921-2 to 924-2 and 921-3 to 924-3). The fault identification (ID1) for the fault-identification round k is also performed similarly to the fault-identification round k−1. At the communication cycle i+3, the node 1 in charge of the node 3 identifies the majority malfunction of the node 3 (931-3 to 934-3). All the nodes detect no fault during the concurrently performed fault monitoring (MON) for the fault-identification round k+1 (911-2 to 914-2 and 911-3 to 914-3). While the fault-identification result exchange (EXD2) and the fault identification (ID2) for the fault-identification round k−1 are also concurrently performed, no change is made to the error counters (941-2 to 944-2 and 941-3 to 944-3) and the node fault flags (951-2 to 954-2 and 951-3 to 954-3).
  • At the communication cycles i+4 and i+5, the fault-identification result exchange (EXD2) and the fault identification (ID2) for the fault-identification round k are performed in parallel with the fault monitoring (MON) for the fault-identification round k+2 and the fault-monitoring result exchange (EXD1) for the fault-identification round k+1. The majority malfunction of the node 3 detected by the node 1 is transmitted to the other nodes (901-4). The nodes recognize the majority malfunction of the node 3 and increment the corresponding error counter value to 3 at the communication cycle i+5 (941-5 to 944-5). Consequently, the nodes turn on the node fault flag corresponding to the majority malfunction of the node 3 (951-5 to 954-5).
  • As mentioned above, the CPU fault of the node 3 is identified by each node and the corresponding node fault flag notifies the application of the fault. The fault-identification process according to the inter-node monitoring in FIG. 6 can be performed in a pipelined fashion in synchronization with the communication cycle. The time-base process distribution reduces the load for the CPU processing and the quantity of the communication per communication cycle compared to a case without the time-base process distribution. The minority malfunction is processed in the same way as the above example, which describes the majority malfunction.
  • Third Embodiment
  • In the above-mentioned examples, the periods (communication cycles) are constant for the fault-monitoring process (MON) and for the fault-monitoring result exchanges (EXD and EXD1) and the fault identifications (ID, ID1, and ID2) that are divided and performed. These periods can be changed while the system is in operation. In other words, it is also possible to use a variable cycle for performing the fault identification based on the mutual monitoring.
  • FIGS. 10 and 11 illustrate examples of changing cycles for performing the fault-monitoring process (MON), the fault-monitoring result exchange (EXD), and the fault-identification process (ID) while the system is in operation for the parallel processing of the fault-identification according to the mutual monitoring in FIG. 3.
  • One method of changing the cycle of the fault identification is, when a fault occurs in a node, to shorten the cycles of the processes associated with the fault identification for that node. The method is based on the principle that a node subject to a fault needs the fault identification at a short cycle. The cycle may be changed when the error counter value becomes greater than or equal to a specified value, as sketched below; because the error counter has the synchronization means, the timing of changing the cycle can be synchronized among the nodes.
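  • A minimal C sketch of this trigger, assuming a normal period of two communication cycles that is shortened to one cycle when the synchronized error counter reaches the trigger value:

```c
#include <stdint.h>

/* Period values in communication cycles; the normal period of 2, the
 * shortened period of 1, and the trigger value of 1 match the examples
 * in FIGS. 10 to 12B but are otherwise assumptions. */
#define NORMAL_ID_PERIOD  2u
#define SHORT_ID_PERIOD   1u
#define TRIGGER_COUNT     1u

/* Returns the fault-identification period to use for a monitored node.
 * Because the error counters are synchronized among the nodes, all nodes
 * switch the period at (almost) the same time. */
static uint32_t id_period_for_node(uint8_t error_counter)
{
    return (error_counter >= TRIGGER_COUNT) ? SHORT_ID_PERIOD
                                            : NORMAL_ID_PERIOD;
}
```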
  • FIG. 10 shows an example of changing the fault-identification cycle of the node 1. The communication cycles i to i+3 are the same as those shown in FIG. 3. In the example, however, the error counter value for the node 1 becomes greater than or equal to the specified value during the fault identification (ID) for the node 1 at the communication cycle i+2, causing the cycle of the fault identification for the node 1 to be shortened from the normal two cycles to one cycle. As a result, the period (communication cycle) of the fault monitoring (MON) for the node 1 is shortened at the communication cycle i+4 and later. The fault-monitoring result exchange (EXD) and the fault identification (ID) for the node 1 are also performed at one cycle next to the cycle of the fault monitoring (MON). In this case, the fault-monitoring result exchange (EXD) for the node 1 is also performed in parallel with the fault monitoring (MON) for all the nodes. Consequently, the fault identification (ID) for the node 1 is performed in a pipelined fashion at each cycle.
  • FIG. 11 shows an example of changing the fault-identification cycle of the node 3. The communication cycles i to i+3 are the same as those shown in FIG. 3. In the example, however, the error counter value for the node 3 becomes greater than or equal to the specified value during the fault identification (ID) for the node 3 at the communication cycle i+3, causing the cycle of the fault identification for the node 3 to be shortened from the normal two cycles to one cycle. As a result, the period (communication cycle) of the fault monitoring (MON) for the node 3 is shortened at the communication cycle i+4 and later. The fault-monitoring result exchange (EXD) and the fault identification (ID) for the node 3 are also performed at one cycle next to the cycle of the fault monitoring (MON). The fault-monitoring result exchange (EXD) and the fault identification (ID) corresponding to the fault monitoring (MON) for the node 3 at communication cycles i+2 and i+3 are performed at the communication cycle i+4 advanced from the scheduled communication cycle i+5. Instead, at the communication cycle i+5, the fault-monitoring result exchange (EXD) and the fault identification (ID) corresponding to the fault monitoring (MON) for the node 3 at the communication cycle i+4 are performed. Similarly, at a communication cycle i+6 and later, the fault-monitoring result exchange (EXD) and the fault identification (ID) corresponding to the fault monitoring (MON) at the communication cycle i+5 and subsequent cycles advanced by one cycle from those scheduled are performed, respectively. As for the node 3, the fault identification (ID) is performed at every cycle.
  • Even if the fault-monitoring result exchange (EXD) is performed for three cycles or longer, the processes of the fault-monitoring result exchange (EXD) and the fault identification (ID) are moved up and performed as shown in FIG. 11 when changing the cycle for the fault identification.
  • In FIGS. 10 and 11, even if the error counter value is not synchronized among the nodes due to a communication failure or the like and the cycle of the fault identification is not changed for some of the nodes, processes associated with the fault identification have little difference in terms of effectiveness before and after the change of the cycles. The reasons are as follows. The above-mentioned fault-identification method detects no malfunction in a node that is subject to no change in the cycle and does not transmit a fault-monitoring result at a shorter cycle than the usual one. Even when an error counter value for the node differs from that for another node, the error counter synchronization means synchronizes the error counter values within several communication cycles.
  • FIGS. 12A and 12B show operation examples of the fault-identification process according to the inter-node monitoring. The process flow is based on FIG. 2. The time-base process distribution and the process pipelining are compliant with FIG. 11. The conditions, such as the fault-monitoring items, are the same as those shown in FIG. 5. A difference is that the transmission data includes the bits of the fault-monitoring result for the nodes 1 to 4 at every cycle. Whether or not to use the fault-monitoring result depends on the cycle of the fault identification. The fault identification (ID) does not necessarily use the fault-monitoring result.
  • The communication cycles i to i+3 are almost the same as those shown in FIG. 5. As a difference from FIG. 5, all nodes are assigned 0 as an initial error counter value concerning the majority malfunction of the node 3 (1241-0 to 1244-0, 1241-1 to 1244-1, and 1241-2 to 1244-2). Therefore, when the nodes identify the majority malfunction of the node 3 at the communication cycle i+3 (1231-3 to 1234-3), the corresponding error counter value is incremented to 1 (1241-3 to 1244-3). In addition, the node 3 suffers a CPU malfunction at the communication cycles i+1 to i+3, which causes a sequence number malfunction in the node 3. Consequently, the nodes 1, 2, and 4 detect the fault of the node 3 using the fault monitoring (MON) at the communication cycles i+2 to i+4 (1211-2 to 1214-2, 1211-3 to 1214-3, and 1211-4 to 1214-4).
  • The error counter value concerning the majority malfunction of the node 3 is set to 1 at the communication cycle i+3. The nodes then change the fault-identification cycle for the node 3 from 2 to 1. A fault of the node 3 detected at the communication cycles i+2 and i+3 (1211-2 to 1214-2 and 1211-3 to 1214-3) is used for the fault-monitoring result exchange (EXD) at the communication cycle i+4 (an OR operation is carried out to regard the faults at the communication cycles i+2 and i+3 as one fault). A fault of the node 3 detected at the communication cycle i+4 (1211-4 to 1214-4) is used for the fault-monitoring result exchange (EXD) at the communication cycle i+5. Assuming that the fault-identification round corresponding to the communication cycles i and i+1 is the fault-identification round 1, the round 2 corresponds to the communication cycles i+2 and i+3, and the round 3 corresponds to the communication cycle i+4. The corresponding fault identifications (ID) are performed at the communication cycles i+3 (1231-3 to 1234-3), i+4 (1231-4 to 1234-4), and i+5 (1231-5 to 1234-5), respectively. The error counter values in the nodes corresponding to majority malfunctions of the node 3 are incremented (1241-3 to 1244-3, 1241-4 to 1244-4, and 1241-5 to 1244-5). The counter value reaches 3 at the communication cycle i+5, and the node fault flag corresponding to the majority malfunction of the node 3 is turned on (1245-1 to 1245-5).
  • As mentioned above, the CPU fault of the node 3 is identified by each node and the corresponding node fault flag notifies the application of the fault. The fault-identification process according to the inter-node monitoring in FIG. 2 can change the cycle of the fault identification while the system is in operation. The flow chart in FIG. 6 and the minority malfunction are processed in the same way as the above example, which describes the flow chart in FIG. 2 and the majority malfunction.
  • Control systems using the distributed system are applied to a wide range of industrial fields, such as vehicles, construction equipment, and factory automation (FA). The present invention can ensure high system reliability and improve availability based on backup control for the distributed control systems.
  • According to the present invention, the distributed systems can be controlled at low cost without additional special apparatus.

Claims (4)

1. A distributed system comprising plural nodes being connected to each other via a network,
wherein each of the nodes comprises:
a fault-monitoring section for monitoring a fault in other nodes;
a transmission and reception section for transmitting and receiving data to detect the fault in other nodes via the network;
and
a fault-identification section for identifying which node has the fault based on the data, and
wherein the fault-monitoring section uses plural communication cycles as a monitoring period, the plural communication cycles being synchronized between the nodes.
2. The distributed system according to claim 1,
wherein the transmission and reception section includes a monitoring result from the fault-monitoring section in transmission and reception data and distributes transmission and reception of the data into a next monitoring period that is a next period of when the monitoring result is obtained.
3. The distributed system according to claim 1,
wherein the fault-identification section distributes fault identification into a next monitoring period that is a next period of when the monitoring result is obtained by the fault-monitoring section, the monitoring result being included in the data.
4. The distributed system according to claim 1,
wherein the fault-monitoring section, while the distributed system is in operation, changes the monitoring period for each node to be monitored.
US12/457,329 2008-06-27 2009-06-08 Distributed system Abandoned US20100039944A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2008168052A JP2010011093A (en) 2008-06-27 2008-06-27 Distributed system
JP2008-168052 2008-06-27

Publications (1)

Publication Number Publication Date
US20100039944A1 true US20100039944A1 (en) 2010-02-18

Family

ID=41591044

Family Applications (1)

Application Number Title Priority Date Filing Date
US12/457,329 Abandoned US20100039944A1 (en) 2008-06-27 2009-06-08 Distributed system

Country Status (2)

Country Link
US (1) US20100039944A1 (en)
JP (1) JP2010011093A (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6211842B2 (en) * 2013-07-26 2017-10-11 Necプラットフォームズ株式会社 COMMUNICATION SYSTEM, COMMUNICATION DEVICE, AND FILM OPERATION ABNORMALITY CONTROL METHOD
JP6634680B2 (en) * 2015-02-12 2020-01-22 いすゞ自動車株式会社 Vehicle control device and vehicle self-diagnosis method
JP6464052B2 (en) * 2015-07-29 2019-02-06 株式会社日立製作所 Distributed control device
JPWO2023281595A1 (en) * 2021-07-05 2023-01-12

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4466188B2 (en) * 2003-07-16 2010-05-26 株式会社デンソー Vehicle control device
JP2007158534A (en) * 2005-12-01 2007-06-21 Toyota Motor Corp Communication system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4330826A (en) * 1980-02-05 1982-05-18 The Bendix Corporation Synchronizer and synchronization system for a multiple computer system
US4342083A (en) * 1980-02-05 1982-07-27 The Bendix Corporation Communication system for a multiple-computer system
US5019799A (en) * 1981-08-06 1991-05-28 Nissan Motor Company, Limited Electronic device with self-monitor for an automotive vehicle
US4816989A (en) * 1987-04-15 1989-03-28 Allied-Signal Inc. Synchronizer for a fault tolerant multiple node processing system
US5959969A (en) * 1997-08-13 1999-09-28 Mci Communications Corporation Method for initiating a distributed restoration process
US20040255185A1 (en) * 2003-05-28 2004-12-16 Nec Corporation Fault tolerant multi-node computing system using periodically fetched configuration status data to detect an abnormal node
US20070076593A1 (en) * 2005-10-03 2007-04-05 Hitachi, Ltd. Vehicle control system
US20080141072A1 (en) * 2006-09-21 2008-06-12 Impact Technologies, Llc Systems and methods for predicting failure of electronic systems and assessing level of degradation and remaining useful life
US20090037573A1 (en) * 2007-08-03 2009-02-05 At&T Knowledge Ventures, Lp System and method of health monitoring and fault monitoring in a network system

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140081508A1 (en) * 2012-09-18 2014-03-20 Hitachi Automotive Systems, Ltd. Automotive Control Unit and Automotive Control System
CN103676925A (en) * 2012-09-18 2014-03-26 日立汽车系统株式会社 Automotive control unit and automotive control system
JP2014058210A (en) * 2012-09-18 2014-04-03 Hitachi Automotive Systems Ltd Vehicle control device and vehicle control system
US20140078889A1 (en) * 2012-09-20 2014-03-20 Broadcom Corporation Automotive neural network
US8953436B2 (en) * 2012-09-20 2015-02-10 Broadcom Corporation Automotive neural network
US20160302220A1 (en) * 2013-12-26 2016-10-13 Kabushiki Kaisha Toshiba Wireless communication device, wireless communication system, and wireless communication method
US20170031743A1 (en) * 2015-07-31 2017-02-02 AppDynamics, Inc. Quorum based distributed anomaly detection and repair
US9886337B2 (en) * 2015-07-31 2018-02-06 Cisco Technology, Inc. Quorum based distributed anomaly detection and repair using distributed computing by stateless processes
CN116449809A (en) * 2023-06-16 2023-07-18 成都瀚辰光翼生物工程有限公司 Fault processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
JP2010011093A (en) 2010-01-14

Similar Documents

Publication Publication Date Title
US20100039944A1 (en) Distributed system
US20090040934A1 (en) Distributed System
US7920587B2 (en) Method for establishing a global time base in a time-controlled communications system and communications system
US7729827B2 (en) Vehicle control system
US20090262649A1 (en) Bus guardian with improved channel monitoring
EP1490998B1 (en) Method and circuit arrangement for the monitoring and management of data traffic in a communication system with several communication nodes
JP5033199B2 (en) Node of distributed communication system, node coupled to distributed communication system, and monitoring apparatus
US7246186B2 (en) Mobius time-triggered communication
US20080195882A1 (en) Method and device for synchronizing cycle time of a plurality of TTCAN buses based on determined global time deviations and a corresponding bus system
US7873739B2 (en) Voting mechanism for transmission schedule enforcement
KR101519719B1 (en) Message process method of gateway
Kimm et al. Integrated fault tolerant system for automotive bus networks
US20080313426A1 (en) Information Processing Apparatus and Information Processing Method
US20090290485A1 (en) Distributed communication system and corresponding communication method
US8041993B2 (en) Distributed control system
US20090116388A1 (en) Vehicle Communication Method and Communication Device
US7237152B2 (en) Fail-operational global time reference in a redundant synchronous data bus system
US8369969B2 (en) Distributed significant control monitoring system and device with transmission synchronization
EP1271854A2 (en) Fault tolerant voting system and method
JP3884643B2 (en) Process control device
JP6492885B2 (en) Diagnostic equipment
JP2000040013A (en) Method for detecting line abnormality for duplex communication system
JP5077016B2 (en) Communications system
Bergenhem et al. A Process Membership Service for Active Safety Systems
Rumpler et al. High speed and high dependability communication for automotive electronics

Legal Events

Date Code Title Description
AS Assignment

Owner name: HITACHI, LTD.,JAPAN

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:MATSUBARA, MASAHIRO;SAKURAI, KOHEI;SHIMAMURA, KOTARO;SIGNING DATES FROM 20090804 TO 20090806;REEL/FRAME:023152/0975

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION