WO2008062511A1 - Multiprocessor system - Google Patents
Multiprocessor system
- Publication number
- WO2008062511A1 (application PCT/JP2006/323168, JP2006323168W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- processor element
- processor
- communication
- multiprocessor system
- inter
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2205—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
- G06F11/2236—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
- G06F11/2242—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors in multi-processor systems, e.g. one processor becoming the test master
Definitions
- The present invention relates to a multiprocessor system including a plurality of processor elements, and more particularly to a technique for improving the reliability of an embedded multiprocessor system.
- Patent Documents 1 to 3 describe techniques for detecting a failure of a processor element in a server system.
- The first processor checks the states of all other processors.
- The second processor then checks the states of all other processors, and each remaining processor does the same in turn. These confirmation operations are executed periodically. Because the status of every processor is monitored regularly under this procedure, a processor failure can be detected reliably.
- A heartbeat path for transmitting a signal indicating that each processor is alive is provided in addition to a path for transmitting and receiving data. The failure of each processor is detected by monitoring this heartbeat path.
- A multiprocessor system usually includes a path (hereinafter referred to as an inter-PE communication path) that transmits and receives commands and the like between processor elements without using a shared memory.
- If the inter-PE communication path fails, the failure may affect the entire system. Therefore, in a multiprocessor system, in addition to detecting the failure of each processor element itself, it is also important to detect a failure in the inter-PE communication path. For example, the following two methods are known as conventional techniques for detecting a failure in the inter-PE communication path.
- Even when it is actually the processor element on the receiving side of the inter-PE communication path that has failed, the processor element on the transmitting side may be judged to have failed.
- The second method is similar to the method described in Patent Document 1 in that each processor element periodically monitors all other processor elements. According to this method, a failed processor element can be reliably detected. However, this method increases the overhead of the multiprocessor system and increases the time required to detect a failure.
- An embedded system is an information processing system that is built into a target device and controls the operation and state of that device. Examples of devices to be controlled include automobiles, aircraft, and ships. In an embedded system, it is therefore generally necessary to control the device without delay, and a high-speed response is required. In addition, since embedded systems are required to be low cost, the processing capacity of their processor elements is usually only a fraction of that of the processor elements used in a server system. Therefore, it is not appropriate to introduce the failure detection methods adopted in server systems directly into an embedded system. In particular, when the processing capacity of each processor element is low, failures related to the inter-PE communication path could not be detected easily and in a short time.
- Patent Document 1: Japanese Patent Laid-Open No. 63-4366
- Patent Document 2: Japanese Patent Laid-Open No. 7-262042
- Patent Document 3: Japanese Unexamined Patent Publication No. 2006-11992
- An object of the present invention is to easily detect a failure related to a communication path between PEs in a short time in a multiprocessor system including a plurality of processor elements.
- the multiprocessor system of the present invention has a configuration in which a plurality of processor elements are connected by an interprocessor communication path, and communication from the first processor element to another processor element via the interprocessor communication path has succeeded.
- According to the multiprocessor system of the present invention, the location of a failure related to the interprocessor communication path can be identified by executing communication between the processor elements only twice. Therefore, without providing dedicated hardware or software, and even when the processing capacity of each processor element is low, a failure can be detected in a short time without reducing the processing capacity available to the original application.
- The first of these two communications can be realized by a normal procedure that provides the original processing of the multiprocessor system, not by a special procedure for detecting a failure. This further speeds up the failure detection process.
- FIG. 1 is a diagram showing a configuration of a multiprocessor system according to an embodiment of the present invention.
- FIG. 2 is a diagram illustrating an outline of a method for detecting a failure in a communication path between PEs in the multiprocessor system of the embodiment.
- FIG. 3A is a diagram showing a format of a general instruction.
- FIG. 3B is a diagram showing a format of a status check instruction.
- FIG. 4 is a diagram showing a configuration of a processor element.
- FIG. 5 is a flowchart showing a procedure for detecting a failure of the inter-PE communication path.
- FIG. 1 is a diagram showing a configuration of a multiprocessor system according to an embodiment of the present invention.
- The multiprocessor system of the embodiment includes four processor elements (PE0 to PE3).
- Each of the processor elements PE0 to PE3 can perform independent processing in parallel.
- Processes A to D are assigned to the processor elements PE0 to PE3, respectively.
- The use of the multiprocessor system of the embodiment is not particularly limited, but in this embodiment it is used in an embedded system.
- This multiprocessor system is incorporated into a device (for example, an automobile, an aircraft, or a ship) and controls the operation of the device.
- The multiprocessor system requires a high-speed response in order to control the operation of the device without delay.
- The processing capacity of each of the processor elements PE0 to PE3 is assumed to be lower than that of a processor used in a server system or the like.
- The processor elements PE0 to PE3 are each connected to the memory bus 11.
- An SDRAM 12 is connected to the memory bus 11.
- The SDRAM 12 is shared by the processor elements PE0 to PE3.
- The processor elements PE0 to PE3 are also each connected to the I/O bus 13, and the nonvolatile memory 14 is connected to the I/O bus 13.
- The nonvolatile memory 14 stores a real-time OS for the processor elements PE0 to PE3, application programs to be executed by the processor elements PE0 to PE3, parameters that define the operations of the processor elements PE0 to PE3, and the like.
- The programs stored in the nonvolatile memory 14 are loaded into the SDRAM 12 and used.
- The processor elements PE0 to PE3 are connected to the LAN via the I/O bus 13.
- the processor elements PE0 to PE3 are connected by an inter-processor communication path (hereinafter referred to as an inter-PE communication path) 15.
- The inter-PE communication path 15 may be a serial signal path or a parallel signal path.
- the communication path configuration may be a bus type or a crossbar switch type.
- Each processor element can transmit a command to a desired processor element via the inter-PE communication path 15.
- signals for establishing synchronization between the processor elements PE0 to PE3 and data having a small capacity can be transferred using the inter-PE communication path 15.
- The multiprocessor system shown in FIG. 1 is configured to include four processor elements, but the number of processor elements is not particularly limited.
- The configuration of the buses is not particularly limited either.
- the memory bus 11 is mainly used for transferring data (in particular, large-capacity data).
- The inter-PE communication path 15 is used for applications that require high speed, such as command transfer and signal transfer to establish synchronization between processor elements. Therefore, if a failure occurs in the inter-PE communication path 15, the effect on the entire multiprocessor system is large. For this reason, the multiprocessor system of the embodiment has a function for detecting a failure in the inter-PE communication path 15. The failure detection function is described below.
- FIG. 2 is a diagram illustrating an outline of a method for detecting a failure in the inter-PE communication path 15 in the multiprocessor system of the embodiment.
- Here, a failure of the inter-PE communication path refers not only to a failure of the signal line (a conductor that propagates an electrical signal or an optical transmission line that propagates an optical signal) connecting the processor elements. It also includes a failure of the transmitting circuit of the transmitting-side processor element and a failure of the receiving circuit of the receiving-side processor element in communication using the inter-PE communication path.
- the procedure for detecting a failure in the inter-PE communication path 15 is executed when communication via the inter-PE communication path 15 fails between a pair of processor elements.
- the communication via the inter-PE communication path 15 includes a procedure for transmitting a command from a certain processor element to another processor element via the inter-PE communication path 15 in this embodiment.
- The general command transmitted via the inter-PE communication path 15 includes, for example, a PE-ID, a command code, and data, as shown in FIG. 3A.
- The “general instruction” is not particularly limited, but corresponds to an instruction for providing an original function of the multiprocessor system.
- The PE-ID identifies the processor element to which the instruction is sent. Note that both the identifier of the destination processor element and the identifier of the source processor element may be written in the PE-ID area.
- the command code indicates the operation (for example, task start / end, register value read / write, etc.) in the destination processor element.
- Data is a parameter used when executing an instruction, and is added as necessary.
- the processor element that has received the instruction executes processing according to the instruction and returns a status response.
- the status response indicates whether the command has been received normally. In this embodiment, “0” is returned when the command is normally received, and “1” is returned when the command is not normally received.
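- The instruction format and status response described above can be sketched as a small data structure. This is only an illustrative sketch: the numeric command codes, the field types, and the names `Instruction`, `STATUS_RECEIVED`, and `STATUS_NOT_RECEIVED` are assumptions, since the description gives the fields (PE-ID, command code, data) and the response values “0” and “1” but no concrete encoding.

```python
from dataclasses import dataclass

# Hypothetical numeric command codes. The description names operations such as
# task start/end and register read/write but does not assign values to them.
CMD_TASK_START = 0x01
CMD_TASK_END = 0x02
CMD_REG_READ = 0x03
CMD_REG_WRITE = 0x04

STATUS_RECEIVED = 0      # "0": the command was received normally
STATUS_NOT_RECEIVED = 1  # "1": the command was not received normally


@dataclass(frozen=True)
class Instruction:
    """General instruction with the three fields shown in FIG. 3A."""
    pe_id: int         # identifies the destination processor element
    command_code: int  # operation requested of the destination
    data: bytes = b""  # optional parameter, added only when needed


# PE0 asks PE1 to write a register; the payload layout is illustrative only.
write_reg = Instruction(pe_id=1, command_code=CMD_REG_WRITE, data=bytes([0x10, 0xFF]))
# A command that needs no parameter simply leaves the data field empty.
start_task = Instruction(pe_id=1, command_code=CMD_TASK_START)
```

- Making the data field optional mirrors the statement that data is added only as necessary.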
- A status check command is used for failure detection.
- The format of the status check instruction is the same as that of the general instruction.
- The command code of the status check command is “Status Check”.
- The data of the status check command is “zero (empty)”.
- The processor element that receives the status check command returns a status response without executing any other processing.
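- The behavior of the status check command on the receiving side can be sketched as follows. The dictionary representation, the string `"STATUS_CHECK"`, and the functions `make_status_check` and `handle_command` are hypothetical; the sketch only illustrates that a status check produces a status response and triggers no other processing.

```python
STATUS_CHECK = "STATUS_CHECK"  # assumed encoding of the "Status Check" command code
STATUS_RECEIVED = 0            # "0": the command was received normally


def make_status_check(dest_pe_id):
    """Build a status check command: same fields as a general command, empty data."""
    return {"pe_id": dest_pe_id, "command_code": STATUS_CHECK, "data": b""}


def handle_command(packet, executed_log):
    """Receiver-side handling of a command that has been received normally.

    A general command is processed (here: merely recorded in executed_log as a
    stand-in for real work); a status check command is not processed at all.
    In both cases the status response "0" is returned.
    """
    if packet["command_code"] != STATUS_CHECK:
        executed_log.append(packet["command_code"])
    return STATUS_RECEIVED
```

- Because the receiver does no work for a status check, the probe adds very little load to the probed processor element.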
- The failure detection procedure of the embodiment will be described with reference to FIG. In the following, it is assumed that a failure in the inter-PE communication path 15 is detected between the processor element PE0 and the processor element PE1. In this case, the failure detection procedure is executed when communication between the processor elements PE0 and PE1 fails. Specifically, it is executed when one processor element (PE0) issues an instruction to the other processor element (PE1).
- The processor element PE0 issues a general command for inter-PE communication and transmits it to the processor element PE1 via the inter-PE communication path 15.
- The PE-ID of this instruction contains a value that identifies the processor element PE1.
- The command code indicates the processing to be executed by the processor element PE1.
- According to the failure detection procedure of the embodiment, the location of a failure relating to the inter-PE communication path 15 can be identified by issuing only two commands. In other words, a failure related to the inter-PE communication path 15 can be detected quickly without adding dedicated hardware for failure detection or special monitoring software. The processing load that this failure detection places on the processor element is also small. Moreover, the first of these two instructions belongs to the normal processing of the multiprocessor system, so only one status check instruction is issued specifically for failure detection. Therefore, the time required for failure detection is very short.
- FIG. 4 is a diagram showing a configuration of each processor element. Elements not directly related to the inter-PE communication function are omitted here.
- Each processor element is connected to the inter-PE communication path 15.
- The inter-PE communication path 15 is composed of a communication packet path 16 for transferring commands and data and a communication status path 17 for transferring status response signals.
- Each processor element includes a processor core 21.
- the processor core 21 provides a corresponding function by executing a given program.
- the processor core 21 includes an instruction cache 22 and a data cache 23.
- the transmission buffer 31 temporarily holds the instruction packet generated by the processor core 21.
- the instruction packet read from the transmission buffer 31 is output to the communication packet path 16.
- the instruction packet output to the communication packet path 16 is written to the reception buffer 32 of each processor element.
- The decoder 33 takes out the instruction packet from the reception buffer 32 and decodes the PE-ID and the command code. If the PE-ID, as the destination address, indicates another processor element, the received instruction packet is discarded. The decoder 33 also checks whether or not the command code is normal. The checking method is not particularly limited. For example, the received command code is checked to determine whether it matches any one of a plurality of predefined command codes. If it matches one of the predefined command codes, it is determined that the command has been received normally; if it matches none of them, it is determined that the command has not been received normally. As another method, a parity bit may be used to determine whether or not the command has been successfully received. When the instruction is normally received, the decoder 33 gives the instruction to the processor core 21.
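- The decoder checks above can be sketched in software as follows. This is a minimal sketch under stated assumptions: the set `KNOWN_COMMAND_CODES`, the dictionary packet layout, and the function names are hypothetical, and even parity is assumed for the parity variant because the description does not say which parity scheme is used.

```python
# Hypothetical set of predefined command codes.
KNOWN_COMMAND_CODES = {"TASK_START", "TASK_END", "REG_READ", "REG_WRITE", "STATUS_CHECK"}

STATUS_RECEIVED = 0      # "0": the command was received normally
STATUS_NOT_RECEIVED = 1  # "1": the command was not received normally


def decode(packet, own_pe_id):
    """Mimic the decoder: filter by PE-ID, then validate the command code.

    Returns None when the packet is addressed to another processor element
    (it is discarded), otherwise a (status, packet_for_core) pair where
    packet_for_core is None if the command code is not a predefined one.
    """
    if packet["pe_id"] != own_pe_id:
        return None
    if packet["command_code"] in KNOWN_COMMAND_CODES:
        return (STATUS_RECEIVED, packet)
    return (STATUS_NOT_RECEIVED, None)


def even_parity_ok(payload: bytes, parity_bit: int) -> bool:
    """Alternative check: the payload bits plus the parity bit must hold an even number of ones."""
    ones = sum(bin(byte).count("1") for byte in payload)
    return (ones + parity_bit) % 2 == 0
```

- The membership test catches a command code corrupted into an undefined value, while a parity check can catch a single flipped bit even when the result is still a defined code.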
- The status response generator 34 generates a status response according to the decoding result by the decoder 33: a status response notifying “0” when the command is normally received, and a status response notifying “1” when the command cannot be normally received.
- The generated status response is temporarily held in the status signal transmission buffer 35 with the PE-ID of the processor element added.
- There are two methods of adding the PE-ID: adding the PE-ID of the receiving side and adding the PE-ID of the transmitting side. Here, the PE-ID of the receiving side is added.
- The status response read from the status signal transmission buffer 35 is output to the communication status path 17.
- the status response output to the communication status path 17 is written to the status signal reception buffer 36 of each processor element.
- the status check unit 37 checks the status response held in the status signal reception buffer 36 and notifies the processor core 21 of the result. In this case, if the PE-ID in the status response indicates something other than the destination processor element, the received status is discarded.
- FIG. 5 is a flowchart showing a procedure for detecting a failure in the inter-PE communication path 15. This process is executed by the processor core 21 of an arbitrary processor element (here, the processor element (a)).
- In step S1, a general instruction is generated and transmitted to the processor element (b) via the inter-PE communication path 15.
- In step S2, it is checked whether the above instruction has been successfully received by the processor element (b). If the status response returned from the processor element (b) is “0”, it is determined that the instruction has been normally received by the processor element (b). On the other hand, if the status response returned from the processor element (b) is “1”, it is determined that the above command has not been received normally by the processor element (b). Note that if no status response is received within a predetermined time after the command is issued in step S1, it is also determined that the command transmission has failed.
- In step S3, a status check command is generated and transmitted to the processor element (c) via the inter-PE communication path 15.
- The processor element (c) is an arbitrary processor element other than the processor element (b).
- In step S4, the same check as in step S2 is performed. If the returned status response is “1”, the process proceeds to step S5.
- In this case, the processor element (a) has failed to send a command to either the processor element (b) or the processor element (c). Therefore, it is determined that the transmission circuit of the processor element (a) has failed.
- The transmission circuit of the processor element is, for example, the transmission buffer 31.
- If the status response returned from the processor element (c) is “0”, the process proceeds to step S6. In this case, the processor element (a) failed to transmit the instruction to the processor element (b), but the instruction transmission to the processor element (c) was successful. Therefore, it is determined that the receiving circuit of the processor element (b) has failed.
- The reception circuit of the processor element includes, for example, the reception buffer 32, the decoder 33, the status response generation unit 34, and the status signal transmission buffer 35. Thereafter, the other processor elements may be notified that the processor element (b) has failed.
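- The decision logic of steps S1 to S6 can be sketched as follows. This is a sketch under stated assumptions rather than the implementation of the embodiment: the callback `send(dest, command)` is a hypothetical stand-in for one transmission over the inter-PE communication path that returns the status response (0 or 1) or `None` on timeout, and the `NO_FAULT` outcome for a successful first command is an assumption, since the procedure is described only for the failure case.

```python
from enum import Enum


class Diagnosis(Enum):
    NO_FAULT = "first command was received normally"
    SENDER_TX_FAULT = "transmission circuit of processor element (a) has failed"
    RECEIVER_RX_FAULT = "reception circuit of processor element (b) has failed"


def locate_fault(send, general_cmd, dest_b, dest_c):
    """Run on processor element (a); follows steps S1 to S6 of FIG. 5.

    send(dest, command) is a hypothetical callback returning 0 (received
    normally), 1 (not received normally), or None (no response in time).
    """
    if dest_c == dest_b:
        raise ValueError("processor element (c) must differ from processor element (b)")

    # S1/S2: issue the ordinary command to (b) and check the status response.
    if send(dest_b, general_cmd) == 0:
        return Diagnosis.NO_FAULT

    # S3/S4: probe a different processor element (c) with a status check command.
    if send(dest_c, "STATUS_CHECK") == 0:
        # S6: (a) can transmit, so the fault lies on the receiving side of (b).
        return Diagnosis.RECEIVER_RX_FAULT

    # S5: (a) reached neither (b) nor (c), so its own transmitter is suspected.
    return Diagnosis.SENDER_TX_FAULT
```

- A response of “1” and a timeout are treated alike, following the note in step S2 that a missing status response also counts as a failed transmission.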
- A dedicated status check instruction is used as the second instruction after the failure of the first instruction, but the present invention is not limited to this method. The same effect can be obtained by using a general instruction as the second instruction and performing dummy processing in the processor element to which the instruction is transmitted.
- The destination of the second instruction may be any processor element other than the destination of the first instruction.
- The first command and the second command are each issued only once to perform failure detection, but each command may be sent repeatedly several times in order to improve the detection accuracy.
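- Sending a command repeatedly to improve detection accuracy can be sketched as follows. The description does not say how repeated results are combined, so this sketch assumes that a command counts as failed only if every attempt fails; the function `send_with_retries`, the callback `send(dest, command)`, and the default of three attempts are all hypothetical.

```python
def send_with_retries(send, dest, command, attempts=3):
    """Send the same command up to `attempts` times.

    send(dest, command) is a hypothetical callback returning 0 (received
    normally), 1 (not received normally), or None (no response in time).
    Returns 0 as soon as one attempt succeeds, otherwise 1.
    """
    for _ in range(attempts):
        if send(dest, command) == 0:
            return 0
    return 1
```

- Under this assumed policy a transient error on a single transmission is not reported as a circuit failure, at the cost of extra transmissions when a real failure is present.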
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
- Multi Processors (AREA)
Abstract
A processor element (PE0) transmits a command to a processor element (PE1) via an inter-processor-element communication path. When the status response from the processor element (PE1) indicates an abnormal state, the processor element (PE0) transmits a status check command to a processor element (PE2) via the inter-processor-element communication path. When the communication between the processor element (PE0) and the processor element (PE2) succeeds, it is determined that the receiving circuit of the processor element (PE1) is out of order. When the communication between the processor element (PE0) and the processor element (PE2) fails, it is determined that the transmitting circuit of the processor element (PE0) is out of order.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2006/323168 WO2008062511A1 (fr) | 2006-11-21 | 2006-11-21 | Multiprocessor system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2006/323168 WO2008062511A1 (fr) | 2006-11-21 | 2006-11-21 | Multiprocessor system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008062511A1 (fr) | 2008-05-29 |
Family
ID=39429452
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2006/323168 WO2008062511A1 (fr) | Multiprocessor system | 2006-11-21 | 2006-11-21 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2008062511A1 (fr) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103294022A (zh) * | 2012-03-01 | 2013-09-11 | 德州仪器公司 | Multi-chip module and method for controlling industrial processes |
JP2014229208A (ja) * | 2013-05-24 | 2014-12-08 | 株式会社ケーヒン | Multi-core system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10133963A (ja) * | 1996-10-28 | 1998-05-22 | Mitsubishi Electric Corp | Computer failure detection and recovery method |
JP2001195377A (ja) * | 2000-01-17 | 2001-07-19 | Nec Software Kyushu Ltd | Isolation determination system, management method therefor, and recording medium |
JP2002118564A (ja) * | 2000-10-06 | 2002-04-19 | Shimadzu Corp | Communication abnormality diagnosis method |
- 2006-11-21 WO PCT/JP2006/323168 patent/WO2008062511A1/fr active Application Filing
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 06833018 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 06833018 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: JP |