WO2008062511A1 - Multiprocessor system - Google Patents

Multiprocessor system

Info

Publication number
WO2008062511A1
Authority
WO
WIPO (PCT)
Prior art keywords
processor element
processor
communication
multiprocessor system
inter
Prior art date
Application number
PCT/JP2006/323168
Other languages
English (en)
Japanese (ja)
Inventor
Hiromasa Takahashi
Takashi Chiba
Shunsuke Kamijo
Original Assignee
Fujitsu Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Limited
Priority to PCT/JP2006/323168
Publication of WO2008062511A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/22: Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205: Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2236: Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors
    • G06F11/2242: Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test CPU or processors in multi-processor systems, e.g. one processor becoming the test master

Definitions

  • The present invention relates to a multiprocessor system including a plurality of processor elements, and more particularly to a technique for improving the reliability of an embedded multiprocessor system.
  • Patent Documents 1 to 3 describe techniques for detecting a failure of a processor element in a server system.
  • In one such technique, the first processor checks the states of all other processors.
  • Next, the second processor checks the states of all other processors. Each remaining processor then checks the states of all other processors in the same way. These confirmation operations are executed periodically. Because the states of all processors are monitored regularly, a processor failure can be detected reliably.
  • In another technique, a heartbeat path for transmitting a signal indicating that each processor is alive is provided in addition to a path for transmitting and receiving data. The failure of each processor is detected by monitoring this heartbeat path.
  • A multiprocessor system usually includes a path (hereinafter referred to as an inter-PE communication path) that transmits and receives commands and the like between processor elements without using a shared memory.
  • If the inter-PE communication path fails, the failure may affect the entire system. Therefore, in a multiprocessor system, in addition to detecting the failure of each processor element itself, it is also important to detect a failure in the inter-PE communication path. For example, the following two methods are known as conventional techniques for detecting a failure in the inter-PE communication path.
  • With the first method, the processor element on the receiving side of the inter-PE communication path may have failed, and yet the processor element on the transmitting side may be judged to have failed.
  • The second method is similar to the method described in Patent Document 1: each processor element periodically monitors all other processor elements. According to this method, a failed processor element can be reliably detected. However, this method increases the overhead of the multiprocessor system and increases the time required to detect a failure.
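To make the overhead of all-to-all monitoring concrete, the following minimal sketch counts the status checks needed per monitoring cycle when every processor element checks every other one. The function name is illustrative and does not come from the source; the only assumption is that one check is one directed PE-to-PE query.

```python
def all_to_all_checks(num_pe: int) -> int:
    """Checks per monitoring cycle when each PE checks every other PE.

    Illustrative helper (not from the source): each of the num_pe
    processor elements queries the remaining num_pe - 1 elements.
    """
    if num_pe < 2:
        return 0
    return num_pe * (num_pe - 1)


# Four processor elements, the count used in the embodiment (PE0 to PE3).
print(all_to_all_checks(4))  # 12
```

The count grows quadratically with the number of processor elements, which is why this style of monitoring is costly on low-capacity processor elements.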
  • An embedded system is an information processing system that is built into a target device to be controlled and controls the operation and state of the device. Examples of devices to be controlled include automobiles, aircraft, and ships. For this reason, an embedded system generally needs to control the device without delay, and a high-speed response is required. In addition, since embedded systems are required to be low cost, the processing capacity of each processor element is usually only a fraction of that of a processor element used in a server system. Therefore, it is not appropriate to introduce the failure detection methods adopted in server systems directly into an embedded system. In particular, failures related to the inter-PE communication path could not be detected easily and in a short time when the processing capacity of each processor element was low.
  • Patent Document 1: Japanese Patent Laid-Open No. 63-4366
  • Patent Document 2: Japanese Patent Laid-Open No. 7-262042
  • Patent Document 3: Japanese Unexamined Patent Publication No. 2006-11992
  • An object of the present invention is to easily detect a failure related to a communication path between PEs in a short time in a multiprocessor system including a plurality of processor elements.
  • The multiprocessor system of the present invention has a configuration in which a plurality of processor elements are connected by an interprocessor communication path. When communication from a first processor element to a second processor element fails, the location of the failure is determined according to whether communication from the first processor element to another processor element via the interprocessor communication path has succeeded.
  • According to the multiprocessor system of the present invention, the location of a failure related to the interprocessor communication path can be identified merely by executing communication between processor elements twice. Therefore, even when the processing capacity of each processor element is low, a failure can be detected in a short time without providing dedicated hardware or software and without reducing the processing capacity available to the original application.
  • The first of the two communications can be realized by a normal procedure that provides the original processing of the multiprocessor system, not by a special procedure for detecting a failure. Therefore, the failure detection process can be made even faster.
  • FIG. 1 is a diagram showing a configuration of a multiprocessor system according to an embodiment of the present invention.
  • FIG. 2 is a diagram illustrating an outline of a method for detecting a failure in a communication path between PEs in the multiprocessor system of the embodiment.
  • FIG. 3A is a diagram showing a format of a general instruction.
  • FIG. 3B is a diagram showing a format of a status check instruction.
  • FIG. 4 is a diagram showing a configuration of a processor element.
  • FIG. 5 is a flowchart showing a procedure for detecting a failure of the inter-PE communication path.
  • FIG. 1 is a diagram showing a configuration of a multiprocessor system according to an embodiment of the present invention.
  • The multiprocessor system of the embodiment includes four processor elements (PE0 to PE3).
  • The processor elements PE0 to PE3 can perform independent processing in parallel.
  • Processes A to D are assigned to the processor elements PE0 to PE3, respectively.
  • The use of the multiprocessor system of the embodiment is not particularly limited, but in this embodiment it is used in an embedded system.
  • That is, this multiprocessor system is incorporated into a device (for example, an automobile, an aircraft, or a ship) and controls the operation of the device.
  • The multiprocessor system requires a high-speed response in order to control the operation of the device without delay.
  • The processing capacity of each of the processor elements PE0 to PE3 is assumed to be lower than that of a processor used in a server system or the like.
  • The processor elements PE0 to PE3 are each connected to the memory bus 11.
  • An SDRAM 12 is connected to the memory bus 11.
  • The SDRAM 12 is shared by the processor elements PE0 to PE3.
  • The processor elements PE0 to PE3 are also each connected to the I/O bus 13, and the nonvolatile memory 14 is connected to the I/O bus 13.
  • The nonvolatile memory 14 stores a real-time OS for the processor elements PE0 to PE3, application programs to be executed by the processor elements PE0 to PE3, parameters that define the operations of the processor elements PE0 to PE3, and the like.
  • The program stored in the nonvolatile memory 14 is loaded into the SDRAM 12 and used.
  • The processor elements PE0 to PE3 are connected to a LAN via the I/O bus 13.
  • The processor elements PE0 to PE3 are connected by an inter-processor communication path (hereinafter referred to as an inter-PE communication path) 15.
  • The inter-PE communication path 15 may be a serial signal path or a parallel signal path.
  • The communication path configuration may be a bus type or a crossbar switch type.
  • Each processor element can transmit a command to a desired processor element via the inter-PE communication path 15.
  • Signals for establishing synchronization between the processor elements PE0 to PE3 and small amounts of data can also be transferred using the inter-PE communication path 15.
  • The multiprocessor system shown in FIG. 1 is configured to include four processor elements.
  • However, the number of processor elements is not particularly limited.
  • The configuration of the buses is likewise not particularly limited.
  • The memory bus 11 is mainly used for transferring data (in particular, large-capacity data).
  • The inter-PE communication path 15 is used for applications that require high speed, such as command transfer and the transfer of signals that establish synchronization between processor elements. Therefore, if a failure occurs in the inter-PE communication path 15, the effect on the entire multiprocessor system is large. For this reason, the multiprocessor system of the embodiment has a function for detecting a failure in the inter-PE communication path 15. The failure detection function is described below.
  • FIG. 2 is a diagram illustrating an outline of a method for detecting a failure in the inter-PE communication path 15 in the multiprocessor system of the embodiment.
  • A failure of the inter-PE communication path refers not only to a failure of the signal line (a conductor that propagates an electrical signal or an optical transmission line that propagates an optical signal) connecting the processor elements. It also includes a failure of the transmitting circuit of the transmitting processor element and a failure of the receiving circuit of the receiving processor element in communication using the inter-PE communication path.
  • The procedure for detecting a failure in the inter-PE communication path 15 is executed when communication via the inter-PE communication path 15 fails between a pair of processor elements.
  • In this embodiment, communication via the inter-PE communication path 15 includes a procedure for transmitting a command from one processor element to another processor element via the inter-PE communication path 15.
  • A general command transmitted via the inter-PE communication path 15 includes, for example, a PE-ID, a command code, and data, as shown in FIG. 3A.
  • A “general instruction” is not particularly limited, but corresponds to an instruction for providing an original function of the multiprocessor system.
  • The PE-ID identifies the processor element to which the instruction is sent. Note that both the identifier of the destination processor element and the identifier of the source processor element may be written in the PE-ID area.
  • The command code indicates the operation (for example, task start/end or register value read/write) to be performed in the destination processor element.
  • Data is a parameter used when executing the instruction, and is added as necessary.
  • The processor element that has received the instruction executes processing according to the instruction and returns a status response.
  • The status response indicates whether the command has been received normally. In this embodiment, “0” is returned when the command is received normally, and “1” is returned when it is not.
  • In the failure detection procedure, a status check command is used.
  • The format of the status check instruction is the same as that of the general instruction.
  • The command code of the status check command is “Status Check”.
  • The data of the status check command is “zero (empty)”.
  • The processor element that receives the status check command returns a status response without executing any other processing.
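The instruction format and the status check command described above can be sketched as a small data structure. This is an illustration under stated assumptions, not the patent's encoding: the class name, the string command codes such as `TASK_START`, and the `STATUS_CHECK` constant are hypothetical, and real hardware would use fixed-width bit fields.

```python
from dataclasses import dataclass

STATUS_OK = 0     # status response: command received normally
STATUS_ERROR = 1  # status response: command not received normally

# Hypothetical encoding of the "Status Check" command code.
STATUS_CHECK = "STATUS_CHECK"


@dataclass(frozen=True)
class InstructionPacket:
    pe_id: int         # identifies the destination processor element
    command_code: str  # operation to perform, e.g. task start/end
    data: bytes = b""  # optional parameters, added as necessary


def make_status_check(dest_pe: int) -> InstructionPacket:
    """Build a status check command: same format, empty data field."""
    return InstructionPacket(pe_id=dest_pe, command_code=STATUS_CHECK)


general = InstructionPacket(pe_id=1, command_code="TASK_START", data=b"\x01")
probe = make_status_check(2)
print(general)
print(probe)
```

Because the status check command reuses the general format, the receiving side needs no separate decoding path for it.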
  • The failure detection procedure of the embodiment will be described with reference to FIG. In the following, it is assumed that a failure in the inter-PE communication path 15 is detected between the processor element PE0 and the processor element PE1. In this case, the failure detection procedure is executed when communication between the processor elements PE0 and PE1 fails. Specifically, it is executed when one processor element (PE0) issues an instruction to the other processor element (PE1).
  • The processor element PE0 issues a general command for inter-PE communication and transmits it to the processor element PE1 via the inter-PE communication path 15.
  • The PE-ID of this instruction contains a value that identifies the processor element PE1.
  • The command code indicates the processing to be executed by the processor element PE1.
  • According to the failure detection procedure of the embodiment, the location of a failure relating to the inter-PE communication path 15 can be identified by issuing only two commands. In other words, a failure related to the inter-PE communication path 15 can be detected quickly without adding dedicated hardware for failure detection or special monitoring software. In addition, the processing load that this failure detection places on the processor elements is small. Moreover, the first of the two instructions belongs to the normal processing of the multiprocessor system, so only one status check instruction is issued specifically for failure detection. Therefore, the time required for failure detection is very short.
  • FIG. 4 is a diagram showing a configuration of each processor element. Here, parts not directly related to the inter-PE communication function are omitted.
  • Each processor element is connected to the inter-PE communication path 15.
  • The inter-PE communication path 15 is composed of a communication packet path 16 for transferring commands and data and a communication status path 17 for transferring status response signals.
  • Each processor element includes a processor core 21.
  • The processor core 21 provides a corresponding function by executing a given program.
  • The processor core 21 includes an instruction cache 22 and a data cache 23.
  • The transmission buffer 31 temporarily holds the instruction packet generated by the processor core 21.
  • The instruction packet read from the transmission buffer 31 is output to the communication packet path 16.
  • The instruction packet output to the communication packet path 16 is written to the reception buffer 32 of each processor element.
  • The decoder 33 takes the instruction packet out of the reception buffer 32 and decodes the PE-ID and the command code. At this time, if the PE-ID as the destination address indicates another processor element, the received instruction packet is discarded. The decoder 33 also checks whether the command code is normal. The checking method is not particularly limited. For example, the received command code is checked to determine whether it matches any one of a plurality of predefined command codes. If it matches one of them, it is determined that the command has been received normally; if it matches none of them, it is determined that the command has not been received normally. As another method, a parity bit can be used to determine whether the command has been received successfully. When the instruction is received normally, the decoder 33 passes the instruction to the processor core 21.
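The decoder behavior described above (discard packets addressed to another processor element, then validate the command code against a predefined set) might be sketched as follows. The function name and the contents of `KNOWN_COMMANDS` are assumptions made for illustration; the source does not list the actual command codes.

```python
from typing import Optional

# Hypothetical set of predefined command codes (not given in the source).
KNOWN_COMMANDS = frozenset(
    {"TASK_START", "TASK_END", "REG_READ", "REG_WRITE", "STATUS_CHECK"}
)


def decode(own_pe_id: int, pe_id: int, command_code: str) -> Optional[int]:
    """Sketch of the decoder's check on one received instruction packet.

    Returns None when the packet is addressed to another PE (discarded),
    0 when the command code matches a predefined code (received normally),
    and 1 when it matches none of them (not received normally).
    """
    if pe_id != own_pe_id:
        return None  # destination is another processor element: discard
    return 0 if command_code in KNOWN_COMMANDS else 1


print(decode(own_pe_id=1, pe_id=1, command_code="TASK_START"))  # 0
print(decode(own_pe_id=1, pe_id=1, command_code="GARBLED"))     # 1
print(decode(own_pe_id=1, pe_id=2, command_code="TASK_START"))  # None
```

The returned 0 or 1 is the value the status response would carry back to the sender.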
  • The status response generator 34 generates a status response according to the decoding result by the decoder 33.
  • In this embodiment, a status response notifying “0” is generated when the command is received normally, and a status response notifying “1” is generated when the command cannot be received normally.
  • The generated status response is temporarily held in the status signal transmission buffer 35 with the PE-ID of the processor element added.
  • There are two ways to add the PE-ID: adding the PE-ID of the receiving side or adding the PE-ID of the transmitting side. Here, the PE-ID of the receiving side is added.
  • The status response read from the status signal transmission buffer 35 is output to the communication status path 17.
  • The status response output to the communication status path 17 is written to the status signal reception buffer 36 of each processor element.
  • The status check unit 37 checks the status response held in the status signal reception buffer 36 and notifies the processor core 21 of the result. In this case, if the PE-ID in the status response indicates something other than the destination processor element, the received status response is discarded.
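Since every processor element sees every status response on the communication status path, the sender has to filter responses by PE-ID. A minimal sketch of that filtering, assuming the receiving-side PE-ID is attached as described above; the type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class StatusResponse(NamedTuple):
    pe_id: int   # PE-ID of the receiving side that generated the response
    status: int  # 0 = received normally, 1 = not received normally


def check_status(dest_pe: int, response: StatusResponse) -> Optional[int]:
    """Sketch of the status check on the sending side.

    dest_pe is the processor element the sender addressed. A response
    whose PE-ID names a different element is discarded (None); otherwise
    the status value is passed on to the processor core.
    """
    if response.pe_id != dest_pe:
        return None
    return response.status


print(check_status(1, StatusResponse(pe_id=1, status=0)))  # 0
print(check_status(1, StatusResponse(pe_id=3, status=0)))  # None
```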
  • FIG. 5 is a flowchart showing a procedure for detecting a failure in the inter-PE communication path 15. This process is executed by the processor core 21 of an arbitrary processor element (here, the processor element (a)).
  • In step S1, a general instruction is generated and transmitted to the processor element (b) via the inter-PE communication path 15.
  • In step S2, it is checked whether the instruction has been successfully received by the processor element (b). If the status response returned from the processor element (b) is “0”, it is determined that the instruction has been received normally by the processor element (b). If the status response returned from the processor element (b) is “1”, it is determined that the instruction has not been received normally by the processor element (b). Note that if no status response is received within a predetermined time after the command is issued in step S1, it is determined that the command transmission has failed.
  • In step S3, a status check command is generated and transmitted to the processor element (c) via the inter-PE communication path 15.
  • The processor element (c) is an arbitrary processor element other than the processor element (b).
  • In step S4, the same check as in step S2 is performed. If the returned status response is “1”, the process proceeds to step S5.
  • In this case, the processor element (a) has failed to send a command to both the processor element (b) and the processor element (c). Therefore, it is determined that the transmission circuit of the processor element (a) has failed.
  • The transmission circuit of the processor element is, for example, the transmission buffer 31.
  • If the status response returned from the processor element (c) is “0”, the process proceeds to step S6. In this case, the processor element (a) failed to transmit the instruction to the processor element (b), but the instruction transmission to the processor element (c) was successful. Therefore, it is determined that the receiving circuit of the processor element (b) has failed.
  • The reception circuit of the processor element is, for example, the reception buffer 32, the decoder 33, the status response generation unit 34, and the status signal transmission buffer 35. Thereafter, the other processor elements may be notified that the processor element (b) has failed.
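The flow of steps S1 to S6 can be summarized in a short sketch, written from the point of view of the processor element (a). It is a simulation under assumptions, not the patent's implementation: the `send` callback, the `Diagnosis` enum, and the command strings are hypothetical, and a timeout is modeled by `send` returning `None`.

```python
from enum import Enum
from typing import Callable, Optional

STATUS_CHECK = "STATUS_CHECK"  # hypothetical encoding of the status check code


class Diagnosis(Enum):
    NO_FAILURE = "first communication succeeded"
    SENDER_TX_FAILED = "transmission circuit of (a) has failed"
    RECEIVER_RX_FAILED = "receiving circuit of (b) has failed"


# send(dest_pe, command_code) returns the status response (0 or 1),
# or None when no response arrives within the predetermined time.
SendFn = Callable[[int, str], Optional[int]]


def received_normally(status: Optional[int]) -> bool:
    """Steps S2/S4: only a status response of 0 counts as success."""
    return status == 0


def detect_failure(send: SendFn, pe_b: int, pe_c: int,
                   command: str = "TASK_START") -> Diagnosis:
    if pe_c == pe_b:
        raise ValueError("processor element (c) must differ from (b)")
    # S1, S2: ordinary instruction to (b), then check its status response.
    if received_normally(send(pe_b, command)):
        return Diagnosis.NO_FAILURE
    # S3, S4: status check command to some other element (c).
    if received_normally(send(pe_c, STATUS_CHECK)):
        return Diagnosis.RECEIVER_RX_FAILED  # S6
    return Diagnosis.SENDER_TX_FAILED        # S5


def broken_receiver_on_pe1(dest: int, code: str) -> Optional[int]:
    """Simulated path in which only PE1's receiving circuit is faulty."""
    return 1 if dest == 1 else 0


result = detect_failure(broken_receiver_on_pe1, pe_b=1, pe_c=2)
print(result.name)  # RECEIVER_RX_FAILED
```

Only the second `send` call exists purely for diagnosis; the first is the ordinary instruction the application wanted to issue anyway, which is what keeps the detection cost low.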
  • In the embodiment described above, a dedicated status check instruction is used as the second instruction after the failure of the first instruction.
  • However, the present invention is not limited to this method. The same effect can be obtained by using a general instruction as the second instruction and performing dummy processing in the processor element to which the instruction is transmitted.
  • The processor element to which the second instruction is transmitted may be any processor element other than the transmission destination of the first instruction.
  • In the embodiment described above, failure detection is performed by issuing the first command and the second command only once each, but each command may be transmitted repeatedly several times in order to improve detection accuracy.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Multi Processors (AREA)

Abstract

A processor element (PE0) transmits a command to a processor element (PE1) via an inter-processor-element communication path. When the status response from the processor element (PE1) indicates an abnormal state, the processor element (PE0) transmits a status check command to a processor element (PE2) via the inter-processor-element communication path. When communication between the processor element (PE0) and the processor element (PE2) succeeds, it is determined that the receiving circuit of the processor element (PE1) is out of order. When communication between the processor element (PE0) and the processor element (PE2) fails, it is determined that the transmission circuit of the processor element (PE0) is out of order.
PCT/JP2006/323168 2006-11-21 2006-11-21 Multiprocessor system WO2008062511A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2006/323168 WO2008062511A1 (fr) 2006-11-21 2006-11-21 Multiprocessor system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2006/323168 WO2008062511A1 (fr) 2006-11-21 2006-11-21 Multiprocessor system

Publications (1)

Publication Number Publication Date
WO2008062511A1 (fr) 2008-05-29

Family

ID=39429452

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2006/323168 WO2008062511A1 (fr) 2006-11-21 2006-11-21 Multiprocessor system

Country Status (1)

Country Link
WO (1) WO2008062511A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294022A (zh) * 2012-03-01 2013-09-11 Texas Instruments Incorporated Multichip module and method for controlling an industrial process
JP2014229208A (ja) * 2013-05-24 2014-12-08 Keihin Corporation Multi-core system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
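JPH10133963A (ja) * 1996-10-28 1998-05-22 Mitsubishi Electric Corp Computer failure detection and recovery system
JP2001195377A (ja) * 2000-01-17 2001-07-19 Nec Software Kyushu Ltd Isolation determination system, management method therefor, and recording medium
JP2002118564A (ja) * 2000-10-06 2002-04-19 Shimadzu Corp Communication abnormality diagnosis method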

Similar Documents

Publication Publication Date Title
TWI502376B (zh) Method and system for error detection in a multiprocessor data processing system
US7747897B2 (en) Method and apparatus for lockstep processing on a fixed-latency interconnect
EP2527985A1 (fr) System and method for automatic verification of 1553 bus operation
US7774638B1 (en) Uncorrectable data error containment systems and methods
KR20070116102A (ko) DMA controller, node, data transfer control method, and computer-readable recording medium storing a program
WO2001025924A1 (fr) Mechanism for improving fault isolation and diagnosis in computers
CN103678031A (zh) Double 2-out-of-2 redundancy system and method
US20060212749A1 (en) Failure communication method
JP2006178615A (ja) Fault-tolerant system, control device used therein, access control method, and control program
JP2020021313A (ja) Data processing device and diagnosis method
JPH0375834A (ja) Parity replacement apparatus and method
JP2009169854A (ja) Computer system, failure handling method, and failure handling program
WO2008062511A1 (fr) Multiprocessor system
JP5381109B2 (ja) Communication device and control program therefor
US20090177890A1 (en) Method and Device for Forming a Signature
JP2008152552A (ja) Computer system and failure information management method
JP2005215809A (ja) Computer system, bus controller, and bus failure handling method used therein
US7243257B2 (en) Computer system for preventing inter-node fault propagation
US8264948B2 (en) Interconnection device
KR20210116342A (ko) Data processing device and data processing method
JP2012235335A (ja) Method and device for detecting incorrect connection of inter-device cables
US20120331334A1 (en) Multi-cluster system and information processing system
JP2009223506A (ja) Data processing system
JP2001007893A (ja) Information processing system and failure handling method used therein
US20040153842A1 (en) Method for allowing distributed high performance coherent memory with full error containment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 06833018; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 06833018; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)