WO2008062511A1

WO2008062511A1 - Multiprocessor system

Info

Publication number: WO2008062511A1
Application number: PCT/JP2006/323168
Authority: WO
Inventors: Hiromasa Takahashi; Takashi Chiba; Shunsuke Kamijo
Original assignee: Fujitsu Limited
Priority date: 2006-11-21
Filing date: 2006-11-21
Publication date: 2008-05-29

Abstract

A processor element (PE0) transmits a command to a processor element (PE1) via an inter-PE communication path. If a status response received from the processor element (PE1) shows an abnormal state, the processor element (PE0) transmits a state check command to a processor element (PE2) via the inter-PE communication path. When the communication between the processor element (PE0) and the processor element (PE2) succeeds, it is determined that the receiving circuit of the processor element (PE1) is out of order. When the communication between the processor element (PE0) and the processor element (PE2) fails, it is determined that the transmitting circuit of the processor element (PE0) is out of order.

Description

Specification

Multiprocessor system

Technical field

The present invention relates to a multiprocessor system including a plurality of processor elements, and more particularly to a technique for improving the reliability of an embedded multiprocessor system.

Background art

[0002] Techniques for detecting a failure of a processor element built in a multiprocessor system have been proposed. For example, the following Patent Documents 1 to 3 describe techniques for detecting a failure of a processor element in a server system.

[0003] In the method described in Patent Document 1, first, the first processor checks the status of all other processors. Subsequently, the second processor checks the status of all other processors. In the same way, each remaining processor in turn checks the status of all other processors. These confirmation operations are executed periodically. According to this procedure, since the status of all the processors is regularly monitored, it is possible to reliably detect a processor failure.

[0004] In the method described in Patent Document 2, a plurality of processors coupled by a shared bus transfer operation confirmation signals in a predetermined order. When each processor receives the operation confirmation signal, it forwards the operation confirmation signal to the next processor and returns a response signal to the previous processor. When each processor fails to receive a response signal within a predetermined time, it determines that a failure has occurred and executes a recovery procedure. This detects a processor that is suspected of failing.

[0005] In the system described in Patent Document 3, a heartbeat path for transmitting a signal indicating that each processor is alive is provided in addition to a path for transmitting and receiving data. The failure of each processor is detected by monitoring this heartbeat path.

[0006] In addition, a multiprocessor system usually includes a path (hereinafter referred to as an inter-PE communication path) that transmits and receives commands and the like between processor elements without using a shared memory. If the inter-PE communication path fails, the failure may affect the entire system. Therefore, in a multiprocessor system, in addition to detecting the failure of each processor element itself, it is also important to detect a failure in the communication path between PEs. For example, the following two methods are known as conventional techniques for detecting a failure in a communication path between PEs.

[0007] In the first method, when communication performed via a communication path between PEs fails, it is determined that the processor element of the transmission source has failed. According to this method, the time required for failure determination is short and the processing amount is small. However, with this method, the processor element on the transmitting side may be determined to have failed even when it is actually the processor element on the receiving side of the communication path between PEs that has failed.

[0008] The second method is similar to the method described in Patent Document 1: each processor element periodically monitors all other processor elements. According to this method, a failed processor element can be reliably detected. However, this method increases the overhead of the multiprocessor system and increases the time required to detect a failure.

[0009] Failure detection in a server system that requires high reliability mainly focuses on how to improve the certainty of detection. For this reason, server systems often have dedicated hardware circuitry and/or software to monitor for failures. Here, when such software is installed, a processor element with high processing capability is required in order not to lower the original application processing capability, and the cost increases accordingly. Similarly, the cost increases when the hardware circuit is mounted.

[0010] However, in recent years multiprocessor systems have come to be widely used not only in server systems but also in embedded systems. Here, an embedded system is an information processing system that is built into a target device to be controlled and controls the operation and state of the device. Examples of devices to be controlled include automobiles, aircraft, and ships. For this reason, in an embedded system, it is generally necessary to control a device without delay, and a high-speed response is required. In addition, since embedded systems are required to be low cost, the processing capacity of the processor element is usually only a fraction of that of the processor element used in a server system. Therefore, it is not appropriate to introduce the failure detection method adopted in the server system directly into the embedded system. In particular, failures related to the communication path between PEs could not be detected easily in a short time when the processing capacity of each processor element was low.

Patent Document 1: Japanese Patent Laid-Open No. 63-4366

Patent Document 2: Japanese Patent Laid-Open No. 7-262042

Patent Document 3: Japanese Unexamined Patent Publication No. 2006-11992

Disclosure of the invention

An object of the present invention is to easily detect, in a short time, a failure related to a communication path between PEs in a multiprocessor system including a plurality of processor elements.

The multiprocessor system of the present invention has a configuration in which a plurality of processor elements are connected by an interprocessor communication path, and includes: detecting means for detecting whether or not communication from the first processor element to another processor element via the interprocessor communication path has succeeded; communication means for performing communication from the first processor element to the third processor element via the interprocessor communication path when communication from the first processor element to the second processor element fails; and determining means for determining that a failure has occurred in the second processor element when communication from the first processor element to the third processor element succeeds, and for determining that a failure has occurred in the first processor element when communication from the first processor element to the third processor element fails.

[0012] In the above multiprocessor system, when communication from the first processor element to the second processor element fails, either the first processor element or the second processor element has failed, but it is not yet possible to determine which one. The first processor element therefore attempts to communicate with the third processor element. If communication from the first processor element to the third processor element is successful, there is no problem with the first processor element, so it is determined that the second processor element has failed. On the other hand, if communication from the first processor element to the third processor element fails, it is determined that the first processor element has failed.

[0013] Thus, in the multiprocessor system of the present invention, the location of a failure related to the interprocessor communication path can be specified merely by executing communication between the processor elements twice. Therefore, even when no dedicated hardware or software is provided and the processing capacity of each processor element is low, a failure can be detected in a short time without reducing the processing capacity available to the original application. In particular, the first of the above two communications can be realized by a normal procedure that provides the original processing of the multiprocessor system, not by a special procedure for detecting a failure. Therefore, it is possible to further speed up the failure detection process.
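The two-communication decision described above can be summarized as a small truth table. The following is a minimal sketch in Python; the function name `diagnose` and the string labels are hypothetical and are chosen only for illustration.

```python
from typing import Optional


def diagnose(first_ok: bool, second_ok: Optional[bool] = None) -> str:
    """Locate a fault from the outcome of at most two communications.

    first_ok  -- result of the communication from the first to the second PE
    second_ok -- result of the communication from the first to the third PE
                 (only needed when the first communication failed)
    """
    if first_ok:
        return "no_fault"          # normal operation continues
    if second_ok is None:
        raise ValueError("a second communication is required after a failure")
    if second_ok:
        return "second_pe_failed"  # the sender works, so the receiver is at fault
    return "first_pe_failed"       # the sender fails toward two different PEs


print(diagnose(True))          # no_fault
print(diagnose(False, True))   # second_pe_failed
print(diagnose(False, False))  # first_pe_failed
```

Only the case in which the first communication fails costs an extra communication, which is why the added load on each processor element stays small.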

Brief Description of Drawings

FIG. 1 is a diagram showing a configuration of a multiprocessor system according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating an outline of a method for detecting a failure in a communication path between PEs in the multiprocessor system of the embodiment.

FIG. 3A is a diagram showing a format of a general instruction.

FIG. 3B is a diagram showing a format of a status check instruction.

FIG. 4 is a diagram showing a configuration of a processor element.

FIG. 5 is a flowchart showing a procedure for detecting a failure of a communication path between PEs.

BEST MODE FOR CARRYING OUT THE INVENTION

FIG. 1 is a diagram showing a configuration of a multiprocessor system according to an embodiment of the present invention. Here, the multiprocessor system of the embodiment includes four processor elements (PE0 to PE3). The processor elements PE0 to PE3 can each perform independent processing in parallel. In the example shown in FIG. 1, processes A to D are assigned to the processor elements PE0 to PE3, respectively.

The use of the multiprocessor system of the embodiment is not particularly limited, but in this embodiment it is used in an embedded system. In other words, this multiprocessor system is incorporated into a device (for example, an automobile, an aircraft, a ship, etc.) and controls the operation of the device. In this case, the multiprocessor system requires a high-speed response in order to control the operation of the device without delay. However, since the embedded system is required to be low cost, the processing capacity of each of the processor elements PE0 to PE3 is assumed to be lower than that of a processor used in a server system or the like.

[0017] The processor elements PE0 to PE3 are each connected to the memory bus 11. An SDRAM 12 is connected to the memory bus 11. The SDRAM 12 is shared by the processor elements PE0 to PE3. The processor elements PE0 to PE3 are also each connected to the I/O bus 13, and the non-volatile memory 14 is connected to the I/O bus 13. The non-volatile memory 14 stores a real-time OS for the processor elements PE0 to PE3, application programs to be executed by the processor elements PE0 to PE3, parameters that define the operations of the processor elements PE0 to PE3, and the like. The programs stored in the non-volatile memory 14 are loaded into the SDRAM 12 and used. The processor elements PE0 to PE3 are connected to the LAN via the I/O bus 13.

[0018] The processor elements PE0 to PE3 are connected by an inter-processor communication path (hereinafter referred to as an inter-PE communication path) 15. The inter-PE communication path 15 may be a serial signal path or a parallel signal path. Also, the communication path configuration may be a bus type or a crossbar switch type. Each processor element can transmit a command to a desired processor element via the inter-PE communication path 15. In addition, signals for establishing synchronization between the processor elements PE0 to PE3 and small amounts of data can be transferred using the inter-PE communication path 15.

Note that the multiprocessor system shown in FIG. 1 is configured to include four processor elements, but the number of processor elements is not particularly limited. Also, the configuration of the buses is not particularly limited.

In the multiprocessor system configured as described above, the memory bus 11 is mainly used for transferring data (in particular, large-capacity data). In contrast, the inter-PE communication path 15 is used for applications that require high speed, such as command transfer and signal transfer to establish synchronization between processor elements. Therefore, if a failure occurs in the inter-PE communication path 15, the effect on the entire multiprocessor system is large. For this reason, the multiprocessor system of the embodiment has a function for detecting a failure in the inter-PE communication path 15. The failure detection function will be described below.

FIG. 2 is a diagram illustrating an outline of a method for detecting a failure in the inter-PE communication path 15 in the multiprocessor system of the embodiment. Here, a failure of the communication path between PEs is not limited to a failure of the signal line (a conductor that propagates an electrical signal or an optical transmission line that propagates an optical signal) that connects the processor elements. It also includes a failure of the transmitting circuit of the transmitting-side processor element and a failure of the receiving circuit of the receiving-side processor element in communication using the communication path between PEs.

[0022] The procedure for detecting a failure in the inter-PE communication path 15 is executed when communication via the inter-PE communication path 15 fails between a pair of processor elements. Here, the communication via the inter-PE communication path 15 includes a procedure for transmitting a command from a certain processor element to another processor element via the inter-PE communication path 15 in this embodiment.

[0023] The general command transmitted via the inter-PE communication path 15 includes, for example, a PE-ID, a command code, and data as shown in FIG. 3A. Here, the “general instruction” is not particularly limited, but corresponds to an instruction for providing an original function of the multiprocessor system. The PE-ID identifies the processor element to which the instruction is sent. Note that both the identifier of the destination processor element and the identifier of the source processor element may be written in the PE-ID area. The command code indicates the operation (for example, task start/end, register value read/write, etc.) in the destination processor element. Data is a parameter used when executing an instruction, and is added as necessary.

The processor element that has received the instruction executes processing according to the instruction and returns a status response. The status response indicates whether the command has been received normally. In this embodiment, “0” is returned when the command is normally received, and “1” is returned when the command is not normally received.

In the multiprocessor system of the embodiment, a status check command is used in addition to the general instructions described above. The format of the status check command is the same as that of the general instruction. However, the command code of the status check command is “Status Check”, and the data of the status check command is “zero (empty)”. The processor element that receives the status check command returns a status response without executing any other processing.
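The instruction format described above (PE-ID, command code, data) and the status check command can be modeled as follows. This is only an illustrative sketch: the numeric command codes, the class names, and the helper `make_status_check` are assumptions, since the text does not specify field widths or encodings.

```python
from dataclasses import dataclass
from enum import Enum


class CommandCode(Enum):
    # Hypothetical codes; the text names task start/end and register
    # read/write only as examples of operations.
    TASK_START = 1
    TASK_END = 2
    REG_READ = 3
    REG_WRITE = 4
    STATUS_CHECK = 5


@dataclass(frozen=True)
class Instruction:
    pe_id: int            # identifies the destination processor element
    code: CommandCode     # operation requested of the destination
    data: bytes = b""     # optional parameters; empty for a status check


def make_status_check(dest_pe: int) -> Instruction:
    """Build a status check command: same format, empty data field."""
    return Instruction(pe_id=dest_pe, code=CommandCode.STATUS_CHECK)


STATUS_NORMAL = 0    # command received normally
STATUS_ABNORMAL = 1  # command not received normally

general = Instruction(pe_id=1, code=CommandCode.REG_WRITE, data=b"\x10\x2a")
check = make_status_check(dest_pe=2)
print(general)
print(check)
```

Because the status check command shares the general format, the receiving side needs no separate decoding path for it.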

Next, the failure detection procedure of the embodiment will be described with reference to FIG. 2. In the following, it is assumed that a failure in the inter-PE communication path 15 is detected between the processor element PE0 and the processor element PE1. In this case, the failure detection procedure is executed when communication between the processor elements PE0 and PE1 fails. Specifically, it is executed when one processor element (PE0) issues an instruction to the other processor element (PE1).

(1) The processor element PE0 issues a general command for inter-PE communication and transmits it to the processor element PE1 via the inter-PE communication path 15. The PE-ID of this instruction contains a value that identifies the processor element PE1. The command code indicates the processing to be executed by the processor element PE1.

[0028] When the processor element PE1 normally receives this instruction via the inter-PE communication path 15, the processor element PE1 executes a process corresponding to the instruction and also returns “status response = 0 (normal)” to the processor element PE0. On the other hand, when the processor element PE1 cannot properly receive this command (that is, when an abnormal signal is received), it returns “status response = 1 (abnormal)” to the processor element PE0. “Receiving a command normally” means, for example, that any one of the predefined command codes is detected.

[0029] (2) If the status response to the instruction issued in (1) above is “0 (normal)”, it is determined that the inter-PE communication path 15 between the processor elements PE0 and PE1 is normal, and operation continues.

[0030] (3) When the status response to the instruction issued in (1) above is “1 (abnormal)”, or when the corresponding status response cannot be received within a predetermined time after the instruction is issued, the processor element PE0 determines that the inter-PE communication path 15 between the processor elements PE0 and PE1 has failed. In this case, the processor element PE0 transmits a status check command to a processor element other than the processor element PE1 via the inter-PE communication path 15. Here, it is assumed that the status check command is sent to the processor element PE2. When the processor element PE2 receives the status check command normally, it returns “status response = 0 (normal)”, and when it cannot receive the status check command normally, it returns “status response = 1 (abnormal)”.

[0031] (4) When the status response to the status check command is “1 (abnormal)”, or when the corresponding status response cannot be received within a predetermined time after the command is issued, the processor element PE0 determines that the transmission circuit of the processor element PE0 has failed.

(5) If the status response to the status check command is “0 (normal)”, the processor element PE0 determines that the receiving circuit of the processor element PE1 has failed.

[0033] (6) The processor element in which the failure is detected is disconnected from the multiprocessor system. In this case, the processing executed by the processor element in which the failure is detected may be executed by another processor element thereafter.

Thus, according to the failure detection procedure of the embodiment, it is possible to specify the location of the failure relating to the inter-PE communication path 15 only by issuing two commands. In other words, it is possible to quickly detect a failure related to the inter-PE communication path 15 without adding dedicated hardware for failure detection or special monitoring software. In addition, the processing amount of the processor element required for this failure detection is small. In addition, the first of these two instructions is for normal processing of a multiprocessor system, so only one status check instruction is issued for fault detection. Therefore, the time required for failure detection is very short.
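Step (6) above, in which the failed processor element is disconnected and its processing may be taken over by another element, can be sketched as follows. The text does not say which element takes over, so the policy used here (hand each process to the remaining element with the fewest processes) and the name `isolate_and_reassign` are assumptions for illustration only.

```python
from typing import Dict, List


def isolate_and_reassign(
    assignment: Dict[str, List[str]], failed_pe: str
) -> Dict[str, List[str]]:
    """Return a new assignment with `failed_pe` removed.

    Each process of the failed element is handed to the remaining element
    that currently runs the fewest processes (an assumed policy).
    """
    remaining = {pe: list(procs) for pe, procs in assignment.items()
                 if pe != failed_pe}
    if not remaining:
        raise RuntimeError("no processor element left to take over")
    for proc in assignment.get(failed_pe, []):
        target = min(remaining, key=lambda pe: len(remaining[pe]))
        remaining[target].append(proc)
    return remaining


# Processes A to D on PE0 to PE3, as in FIG. 1.
before = {"PE0": ["A"], "PE1": ["B"], "PE2": ["C"], "PE3": ["D"]}
after = isolate_and_reassign(before, "PE1")
print(after)
```

The function returns a new mapping and leaves the original untouched, so the caller can compare the assignment before and after the disconnection.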

[0035] In the case of (4) above, there is also a possibility that the inter-PE communication path 15 itself is disconnected, rather than the transmission circuit of the processor element PE0 having failed. However, if the inter-PE communication path 15 is disconnected, basically all communication via the inter-PE communication path 15 stops.

FIG. 4 is a diagram showing a configuration of each processor element. Here, elements that are not directly related to the inter-PE communication function are omitted.

Each processor element is connected to the inter-PE communication path 15. The inter-PE communication path 15 is composed of a communication packet path 16 for transferring commands and data and a communication status path 17 for transferring status response signals.

Each processor element includes a processor core 21. The processor core 21 provides a corresponding function by executing a given program. The processor core 21 includes an instruction cache 22 and a data cache 23.

The transmission buffer 31 temporarily holds the instruction packet generated by the processor core 21. The instruction packet read from the transmission buffer 31 is output to the communication packet path 16. The instruction packet output to the communication packet path 16 is written to the reception buffer 32 of each processor element.

The decoder 33 takes out the instruction packet from the reception buffer 32 and decodes the PE-ID and the command code. At this time, if the PE-ID as the destination address indicates another processor element, the received instruction packet is discarded. Further, the decoder 33 checks whether or not the command code is normal. The checking method is not particularly limited. For example, the received command code is checked to determine whether it matches any one of a plurality of predefined command codes. In this case, if the received command code matches one of the predefined command codes, it is determined that the command has been received normally. On the other hand, if the received command code does not match any of the predefined command codes, it is determined that the command has not been received normally. As another method, it is possible to determine whether or not the command has been successfully received using a parity bit. Then, when the instruction is normally received, the decoder 33 gives the instruction to the processor core 21.
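The checks performed by the decoder 33 can be sketched as below. The set of valid codes, the use of even parity, and the function names are assumptions; the text only states that the command code is compared against predefined codes and that a parity bit may be used instead.

```python
from typing import Optional, Tuple

# Hypothetical set of predefined command codes.
VALID_CODES = {0x01, 0x02, 0x03, 0x04, 0x05}


def even_parity_ok(word: int, parity_bit: int) -> bool:
    """True if the number of 1 bits in word plus parity_bit is even."""
    return (bin(word).count("1") + parity_bit) % 2 == 0


def decode(my_pe_id: int, pe_id: int,
           code: int) -> Optional[Tuple[int, Optional[int]]]:
    """Model the decoder check.

    Returns None when the packet is addressed to another element (discarded).
    Otherwise returns (status, code_for_core): status 0 and the code when the
    command code is one of the predefined codes, status 1 and None otherwise.
    """
    if pe_id != my_pe_id:
        return None
    if code in VALID_CODES:
        return (0, code)
    return (1, None)


print(decode(1, 2, 0x03))         # None: addressed to another element
print(decode(1, 1, 0x03))         # (0, 3): passed on to the processor core
print(decode(1, 1, 0x7F))         # (1, None): unknown command code
print(even_parity_ok(0b1011, 1))  # True: four 1 bits in total
```

A packet whose PE-ID is corrupted is simply discarded in this model, so the sender would observe it as a missing status response rather than as status 1.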

The status response generator 34 generates a status response according to the decoding result by the decoder 33. Here, a status response notifying “0” is generated when the command is normally received, and a status response notifying “1” is generated when the command cannot be normally received. The generated status response is temporarily held in the status signal transmission buffer 35 with the PE-ID of a processor element added. There are two methods of adding the PE-ID: adding the PE-ID of the receiving side and adding the PE-ID of the transmitting side. Here, the PE-ID of the receiving side is added. The status response read from the status signal transmission buffer 35 is output to the communication status path 17.

The status response output to the communication status path 17 is written to the status signal reception buffer 36 of each processor element. The status check unit 37 checks the status response held in the status signal reception buffer 36 and notifies the processor core 21 of the result. In this case, if the PE-ID in the status response indicates something other than the destination processor element, the received status is discarded.

[0042] A procedure when an instruction is transmitted from the processor element PE0 to the processor element PE1 will be described. In this case, the instruction packet sent from the processor element PE0 is decoded by the decoder 33 in the processor element PE1. At this time, when the instruction is normally received, the status response generation unit 34 of the processor element PE1 generates “status response = 0”. On the other hand, if the command is not normally received, the status response generation unit 34 of the processor element PE1 generates “status response = 1”. In either case, the generated status response is returned to the processor element PE0 and checked by the status check unit 37. Then, the check result is notified to the processor core 21. As described above, when the processor core 21 issues an instruction and transmits it to the corresponding processor element, the processor core 21 can receive a status response indicating whether or not the instruction has been normally received.

FIG. 5 is a flowchart showing a procedure for detecting a failure in the inter-PE communication path 15. This process is executed by the processor core 21 of an arbitrary processor element (here, the processor element (a)).

[0044] In step S1, a general instruction is generated and transmitted to the processor element (b) via the inter-PE communication path 15. In step S2, it is checked whether the above instruction has been successfully received by the processor element (b). If the status response returned from the processor element (b) is “0”, it is determined that the instruction has been normally received by the processor element (b). On the other hand, if the status response returned from the processor element (b) is “1”, it is determined that the above command has not been received normally by the processor element (b). Note that if the status response cannot be received within the predetermined time after the command is issued in step S1, it is determined that the command transmission has failed.

In step S3, a state check command is generated and transmitted to the processor element (c) via the inter-PE communication path 15. Here, the processor element (c) is an arbitrary processor element other than the processor element (b). In step S4, the same check as in step S2 is performed. If the returned status response is “1”, the process proceeds to step S5. In this case, the processor element (a) has failed to transmit a command to both the processor element (b) and the processor element (c). Therefore, it is determined that the transmission circuit of the processor element (a) has failed. The transmission circuit of the processor element is, for example, the transmission buffer 31.

[0046] If the status response returned from the processor element (c) is “0”, the process proceeds to step S6. In this case, the processor element (a) failed to transmit the instruction to the processor element (b), but the instruction transmission to the processor element (c) was successful. Therefore, it is determined that the receiving circuit of the processor element (b) has failed. The reception circuit of the processor element is, for example, the reception buffer 32, the decoder 33, the status response generation unit 34, and the status signal transmission buffer 35. Thereafter, another processor element may be notified that the processor element (b) has failed.
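Steps S1 to S6 of the flowchart can be exercised with a small simulation. The fault model below is a deliberate simplification and an assumption: a broken transmitting circuit is modeled as a missing status response (timeout, represented by `None`), and a broken receiving circuit as status 1. The names `PE`, `send`, and `detect` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PE:
    name: str
    tx_ok: bool = True   # transmitting circuit healthy
    rx_ok: bool = True   # receiving circuit healthy


def send(src: PE, dst: PE) -> Optional[int]:
    """Return the status response seen by src: 0, 1, or None on timeout."""
    if not src.tx_ok:
        return None      # nothing usable is sent, so no response arrives
    if not dst.rx_ok:
        return 1         # receiver reports an abnormal reception
    return 0


def detect(a: PE, b: PE, c: PE) -> str:
    # S1/S2: general instruction from (a) to (b), then check the response.
    if send(a, b) == 0:
        return "normal"
    # S3/S4: state check command from (a) to (c), same check as in S2.
    if send(a, c) == 0:
        return f"receiving circuit of {b.name} failed"    # S6
    return f"transmitting circuit of {a.name} failed"     # S5


print(detect(PE("PE0"), PE("PE1"), PE("PE2")))
print(detect(PE("PE0"), PE("PE1", rx_ok=False), PE("PE2")))
print(detect(PE("PE0", tx_ok=False), PE("PE1"), PE("PE2")))
```

Both status 1 and a timeout are treated as a failed communication, matching the checks in steps S2 and S4.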

[0047] <Other embodiments>

In the above-described embodiment, the dedicated status check instruction is used as the second instruction after the failure of the first instruction. However, the present invention is not limited to this method. In other words, the same effect can be obtained by using a general instruction as the second instruction and performing dummy processing in the processor element to which the instruction is transmitted.

[0048] In addition, if communication relating to the first instruction fails, the processor element to which the second instruction is transmitted may be any processor element other than the transmission destination of the first instruction.

[0049] Furthermore, in the above embodiment, the first command and the second command are each issued only once to perform fault detection. However, in order to improve the detection accuracy, the first command and the second command may each be transmitted repeatedly several times.
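Repeated transmission as mentioned above can be sketched as a small retry wrapper. The number of attempts and the rule that communication counts as failed only when every attempt fails are assumptions; the text does not define how repeated results are combined.

```python
from typing import Callable, Optional


def communicate(send_once: Callable[[], Optional[int]],
                attempts: int = 3) -> bool:
    """Return True as soon as one attempt yields status 0 (normal).

    send_once returns 0 (normal), 1 (abnormal) or None (no response in time).
    Communication counts as failed only if every attempt fails
    (an assumed policy).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for _ in range(attempts):
        if send_once() == 0:
            return True
    return False


# Example: a transient error on the first try, then a normal response.
responses = iter([1, 0, 0])
print(communicate(lambda: next(responses)))  # True
```

With this policy a transient error does not cause a healthy processor element to be disconnected, at the cost of a longer detection time when the fault is real.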

Claims

The scope of the claims

[1] A multiprocessor system in which a plurality of processor elements are connected by an interprocessor communication path,

Detecting means for detecting whether or not the communication through the inter-processor communication path from the first processor element to another processor element is successful;

Communication means for executing communication via the inter-processor communication path from the first processor element to the third processor element when communication from the first processor element to the second processor element fails; and

Determining means for determining that a failure has occurred in the second processor element when communication from the first processor element to the third processor element is successful, and for determining that a failure has occurred in the first processor element when communication from the first processor element to the third processor element fails;

A multiprocessor system.

[2] A multiprocessor system according to claim 1,

Each processor element includes response means for returning a status signal corresponding to an instruction received via the inter-processor communication path,

The detecting means monitors the status signal after an instruction is transmitted from the first processor element to the second processor element via the inter-processor communication path, thereby detecting whether the communication from the first processor element to the second processor element is successful.

A multiprocessor system characterized by that.

[3] A multiprocessor system according to claim 2,

When the response means receives an abnormal signal via the inter-processor communication path, it returns a status signal indicating an abnormal state,

The detection means determines that communication from the first processor element to the second processor element has failed when a status signal indicating an abnormal state is received from the second processor element.

A multiprocessor system characterized by that.

[4] A multiprocessor system according to claim 2,

If the detection means is unable to receive a status signal from the second processor element within a predetermined time, it determines that communication from the first processor element to the second processor element has failed.

A multiprocessor system characterized by that.

[5] The multiprocessor system according to claim 1,

The detecting means monitors the status signal after an instruction is transmitted from the first processor element to the third processor element via the inter-processor communication path, thereby detecting whether the communication from the first processor element to the third processor element is successful.

A multiprocessor system characterized by that.

[6] The multiprocessor system according to claim 5,

The response means returns a status signal indicating a normal state when an instruction is normally received via the inter-processor communication path, and returns a status signal indicating an abnormal state when an abnormal signal is received via the inter-processor communication path.

When the detection means receives a status signal indicating a normal state from the third processor element, the detection means determines that communication from the first processor element to the third processor element is successful, and when a status signal indicating an abnormal state is received from the third processor element, it determines that communication from the first processor element to the third processor element has failed.

A multiprocessor system characterized by that.

[7] The multiprocessor system according to claim 5,

If the detection means is unable to receive a status signal from the third processor element within a predetermined time, it determines that communication from the first processor element to the third processor element has failed.

A multiprocessor system characterized by that.

[8] The multiprocessor system according to claim 5,

The communication means transmits, from the first processor element to the third processor element via the inter-processor communication path, a state check command that does not involve processing by the processor core included in the destination processor element.

The response means of the third processor element returns the status signal upon receiving the state check command.

A multiprocessor system characterized by that.

[9] In a multiprocessor system including a plurality of processor elements, a method for detecting a failure in an interprocessor communication path for performing communication between the plurality of processor elements,

Detecting whether or not the communication through the inter-processor communication path from the first processor element to the second processor element is successful;

When communication from the first processor element to the second processor element fails, communication is performed via the inter-processor communication path from the first processor element to the third processor element;

If communication from the first processor element to the third processor element is successful, it is determined that a failure has occurred in the second processor element, and if communication from the first processor element to the third processor element fails, it is determined that a failure has occurred in the first processor element.

A failure detection method for an interprocessor communication path.