JP2021158563A - Device and method for data processing

Info

Publication number: JP2021158563A
Application number: JP2020057875A
Authority: JP
Inventors: Tsutomu Sasaki; Keisuke Kuroki
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2020-03-27
Filing date: 2020-03-27
Publication date: 2021-10-07
Anticipated expiration: 2040-03-27
Also published as: JP7463650B2

Abstract

To substitute the function of a faulty part using a redundant circuit and automatically recover data processing in an integrated circuit that combines various types of functional processors. SOLUTION: A data processing device that performs automatic recovery using a programmable redundant circuit includes: a plurality of types of processing units that perform data processing; a programmable redundant circuit; and a control unit that, when a fault is detected in one of the processing units, writes to the redundant circuit the code necessary to provide the function of the faulty processing unit and configures a packet route that bypasses the faulty processing unit and passes through the redundant circuit. The redundant circuit executes data processing based on the written code to substitute the function of the faulty processing unit. SELECTED DRAWING: Figure 1

Description

The present invention relates to a data processing device and a data processing method that perform automatic recovery using a programmable redundant circuit.

Techniques for providing an integrated circuit having a repairable logic region are conventionally known. For example, Patent Document 1 discloses a switching process (bypass process) for diverting to a spare row when a failure occurs in a programmable integrated circuit such as an FPGA. More specifically, the integrated circuit described in Patent Document 1 has a logic region with a plurality of rows of circuitry, including a spare row of circuitry. Each of a plurality of routing segments has a first end coupled to a respective one of the rows of circuitry and an opposite second end coupled to an adjacent one of the rows of circuitry. Each of a plurality of bypass circuits is coupled between a respective pair of the routing segments. When one of the rows of circuitry contains a defective circuit, the bypass circuits are operable to switch the spare row into use, and each routing segment is driven by its associated routing segment through a respective one of the bypass circuits.

Patent Document 2 discloses a technique for changing which accelerator is associated with each application. More specifically, the accelerator management device described in Patent Document 2 has an accelerator association DB that stores accelerator identifiers in association with application identifiers. The accelerator management device also has an accelerator mounting information DB that stores, in association with each slot identifier identifying a slot of an expansion I/O box, the accelerator identifier of the accelerator mounted in that slot. When the accelerator management device receives an application execution request from a host, it identifies the accelerator identifier corresponding to the application from the accelerator association DB, identifies the slot identifier corresponding to that accelerator identifier from the accelerator mounting information DB, and assigns the slot identified by that slot identifier to the host.

Patent Document 1: Japanese Unexamined Patent Publication No. 2014-093782
Patent Document 2: Japanese Unexamined Patent Publication No. 2013-196206

However, in the technique described in Patent Document 1, when a failure occurs, the rows in use are shifted down from the failure location, and switching/routing is performed so as to bypass the failure location. While this allows the function to be recovered by switching control alone, a circuit that has already been burned in must be burned in again in the bypass configuration. Furthermore, because the original circuit is written to the spare row as is, only the same integrated circuit can be used as a spare, and the technique cannot be applied to an integrated circuit composed of multiple circuits.

In the technique described in Patent Document 2, the accelerator and the server (CPU) that controls it are physically separated into different housings. Even if a server (CPU) fails, the CPU-accelerator combination corresponding to an application can be changed dynamically via the management server so that the application continues to run. That is, even if one CPU breaks, the accelerator can continue to be used by switching to a route through another CPU. However, this switching method is also basically a method of diverting to hardware of the same type as the failed hardware, so hardware of the same type must be prepared.

The present invention has been made in view of these circumstances, and its object is to provide a data processing device and a data processing method capable of changing the processing path of an internal circuit and dynamically writing to a programmable integrated circuit according to the failure location.

(1) To achieve the above object, the present invention takes the following measures. That is, the data processing device of the present invention is a data processing device that performs automatic recovery using a programmable redundant circuit, comprising: a plurality of types of processing units that perform data processing; a programmable redundant circuit; and a control unit that, when a failure of any of the processing units is detected, writes to the redundant circuit the code necessary to provide the function of the processing unit in which the failure was detected, and constructs a packet route that bypasses that processing unit and passes through the redundant circuit, wherein the redundant circuit substitutes the function of the processing unit in which the failure was detected by performing data processing based on the written code.

(2) Further, in the data processing device of the present invention, the control unit comprises: a write code storage unit that records a plurality of types of code necessary to provide the functions of the respective processing units; a failure detection unit that detects a failure of each processing unit; a failure location determination control unit that determines which processing unit has failed; a code selection unit that selects the code corresponding to the failed processing unit from the write code storage unit; a writing unit that writes the selected code to the redundant circuit; and a route management unit that constructs a packet route that bypasses the processing unit in which the failure was detected and passes through the redundant circuit.

(3) Further, in the data processing device of the present invention, each processing unit includes a switching unit that switches the output destination of packets based on the route constructed by the route management unit.

(4) Further, the data processing device of the present invention further comprises a switch circuit that is connected to each of the processing units and switches the input and output of packets based on the route constructed by the route management unit.

(5) Further, in the data processing device of the present invention, the control unit adapts the parameters of the code based on the performance of the redundant circuit.

(6) Further, the data processing method of the present invention is a data processing method for a data processing device that performs automatic recovery using a programmable redundant circuit, the method including at least: a step of detecting a failure of any of the processing units; a step of writing to the redundant circuit the code necessary to provide the function of the processing unit in which the failure was detected; and a step of constructing a packet route that bypasses the processing unit in which the failure was detected and passes through the redundant circuit, wherein the redundant circuit substitutes the function of the processing unit in which the failure was detected by performing data processing based on the written code.

According to the present invention, even if some of the processors are damaged, the device can recover automatically and continue to provide its functions. In addition, the scale of the hardware configuration can be reduced.

FIG. 1 is a diagram showing the schematic configuration of the data processing device 1 according to the present embodiment.
FIG. 2A is a diagram showing the schematic configuration of the integrated circuit 3 in the data processing device 1.
FIG. 2B is a diagram showing modification 1 of the integrated circuit 3 in the data processing device 1.
FIG. 2C is a diagram showing modification 2 of the integrated circuit 3 in the data processing device 1.
FIG. 3 is a flowchart showing the operation of the data processing device according to the present embodiment.
FIG. 4 is a diagram showing the processing flow of the data processing device in normal operation.
FIG. 5 is a diagram showing the processing flow when a failure occurs.
FIG. 6 is a diagram showing the operation at the time of failure.
FIG. 7 is a diagram showing the operation at the time of failure.
FIG. 8 is a diagram showing how failure detection is performed.
FIG. 9 is a diagram showing how failure detection is performed.
FIG. 10 is a diagram showing how the output determination is returned to the test circuit (redundant circuit 15).
FIG. 11 is a diagram showing code that allows the redundant circuit 15 to provide the function of the failed GPU.
FIG. 12 is a diagram showing, as a modification, a configuration provided with a switch 40.

The present inventors focused on the fact that an FPGA (Field Programmable Gate Array) is a programmable integrated circuit. They found that, in an integrated circuit combining various functional processors such as chiplets, automatic recovery can be achieved with a small circuit scale by changing the processing path of the internal circuit and dynamically writing to the FPGA according to the failure location of the chiplet, and thus arrived at the present invention.

That is, the data processing device of the present invention is a data processing device that performs automatic recovery using a programmable redundant circuit, comprising: a plurality of types of processing units that perform data processing; a programmable redundant circuit; and a control unit that, when a failure of any of the processing units is detected, writes to the redundant circuit the code necessary to provide the function of the processing unit in which the failure was detected, and constructs a packet route that bypasses that processing unit and passes through the redundant circuit, wherein the redundant circuit substitutes the function of the processing unit in which the failure was detected by performing data processing based on the written code.

As a result, the present inventors have made it possible for the device to recover automatically even if some of the processors are damaged and to continue to provide its functions. They have also made it possible to reduce the scale of the hardware configuration. Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a diagram showing the schematic configuration of the data processing device 1 according to the present embodiment, and FIG. 2A is a diagram showing the schematic configuration of the integrated circuit 3 of the data processing device 1. The data processing device 1 comprises an integrated circuit 3 having a plurality of types of chiplets, and a control unit 5. The integrated circuit 3 includes a chiplet 7 that is composed of an FPGA (Field Programmable Gate Array) and performs packet filtering, a chiplet 9 that is composed of an ASIC (Application Specific Integrated Circuit) and performs encryption, a chiplet 11 that is composed of a GPU (Graphics Processing Unit) and performs routing, and a chiplet 13 that is composed of an FPGA and performs tunneling. The chiplets 7, 9, 11, and 13 include switching units 7a, 9a, 11a, and 13a and memories 7b, 9b, 11b, and 13b, respectively. With this configuration, each chiplet can switch between the main path and a detour path.

The integrated circuit 3 further includes a redundant circuit 15 composed of an FPGA, a switching circuit 15a, and a memory 15b. The redundant circuit 15 is a programmable FPGA; by selecting a program according to the failed processor (chip) and burning it into the redundant circuit 15, the failure can be repaired automatically. As a result, an "N+1" redundant configuration is realized for a circuit having N functions, regardless of how many different kinds of processors it uses. In addition, as described later, multiple detour routes can be realized using the redundant circuit 15, and the location of the failed processor can be detected from their output results.

In the present embodiment, because circuits are burned dynamically into the redundant circuit 15 composed of an FPGA, the FPGA is the only redundant part that must be prepared for a configuration with multiple types of processors. As a result, spare parts such as a GPU or an ASIC are not required. In addition, multiple detour processing routes are realized by providing switching/routing and physically redundant connections. Furthermore, by using these detour routes together with a sample input, the failed circuit unit (processor) can be identified from the detour route that outputs an unexpected result.
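The saving in spare hardware can be illustrated with a small count. The following is only a sketch under stated assumptions: the function names are hypothetical, it assumes that at most one chiplet fails at a time, and it assumes that a conventional scheme needs one spare per distinct hardware type.

```python
def spares_same_type(chiplet_types):
    """Spares needed if a spare can only replace hardware of its own type."""
    return len(set(chiplet_types))


def spares_programmable(chiplet_types):
    """Spares needed if one programmable circuit can take over any single failed function."""
    return 1 if chiplet_types else 0


# Hardware types of chiplets 7, 9, 11, and 13 in the embodiment.
chiplets = ["FPGA", "ASIC", "GPU", "FPGA"]
print(spares_same_type(chiplets))     # 3
print(spares_programmable(chiplets))  # 1
```

With the four chiplets of the embodiment, the same-type scheme would keep three spares (FPGA, ASIC, GPU), whereas the programmable redundant circuit needs one.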

Note that, because functions originally processed by hardware suited to them (an ASIC or GPU) are processed by the FPGA serving as the redundant circuit 15, performance (throughput and latency) may be lower than before the failure. A detour from an FPGA to an FPGA, however, is likely to give equivalent performance.

In the control unit 5 shown in FIG. 1, the write code storage unit 17 records a plurality of types of code necessary to provide the function of each chiplet. The failure detection unit 19 detects a failure of each chiplet of the integrated circuit 3, and the failure location determination control unit 21 determines which chiplet has failed. The write code selection unit 23 selects the code corresponding to the failed chiplet from the write code storage unit 17. The writing unit 25 writes the code selected by the write code selection unit 23 to the redundant circuit 15. The integrated circuit route management unit 27 constructs a packet route that bypasses the chiplet in which the failure was detected and passes through the redundant circuit 15. The redundant FPGA memory copy control unit 29 temporarily records data when code is written to the redundant circuit 15, and the execution program control unit 31 controls the code to be executed on each chiplet and reads out the result from the chiplet. The integrated circuit 3 also sends packets to and receives packets from the adjacent processors 33 and 37 in order to perform data processing (for example, packet processing). In the configuration example above, the failure detection unit 19 detects the failure and the failure location determination control unit 21 determines the failed chiplet. However, the present invention is not limited to this: which chiplet has failed may instead be input from an external device or entered directly by an operator.

FIG. 2B is a diagram showing modification 1 of the integrated circuit 3 in the data processing device 1, and FIG. 2C is a diagram showing modification 2 of the integrated circuit 3 in the data processing device 1. As shown in FIGS. 2B and 2C, the integrated circuit 3 may have switching circuits 40 and 42 between itself and the adjacent processors 33 and 37. To keep the switching circuits 40 and 42 invisible to the adjacent processors 33 and 37, the configuration shown in FIG. 2B is adopted so that switching takes place inside the integrated circuit 3. On the other hand, when the adjacent processors 33 and 37 are to operate the switching circuits 40 and 42, the configuration shown in FIG. 2C is adopted. Either configuration allows packets to be routed through the redundant circuit 15.

Next, the processing performed when one of the chiplets fails is described. FIG. 3 is a flowchart showing the operation of the data processing device according to the present embodiment, FIG. 4 is a diagram showing the processing flow of the data processing device in normal operation, and FIG. 5 is a diagram showing the processing flow when a failure occurs. When a failure occurs in any of the chiplets, packets are diverted through the redundant circuit 15 so as to avoid the failed chiplet, and the FPGA code corresponding to the function that replaces the failed part is looked up and written to the redundant circuit 15. The example here is one in which the chiplet 11 composed of a GPU fails, a detour is constructed, code for providing the GPU's function is burned into the redundant circuit 15, and the redundant circuit 15 then functions as the GPU. After the detour has been constructed, the failed chiplet 11 may be repaired or replaced and the route restored to the original.

FIGS. 6 and 7 are diagrams showing the operation at the time of failure. When a circuit fails, it either outputs nothing or produces incorrect output. In either case, unless the individual processor raises an alarm, the integrated circuit itself has no way of knowing about the failure. In practice, therefore, detection by the system as a whole, in the processing downstream of the chiplet, is also needed. This may take place at the service level or in a filter at the node at the far end of the tunnel. In other words, failure detection basically relies on error logs from the hardware and on liveness monitoring within the chip. Depending on the failure pattern, however, it may be difficult to produce an error log, or the desired processing may not be performed even though liveness monitoring reports no problem (a silent failure), so detection by the system as a whole is also necessary. If such means establish at the system level that a failure exists, it can be located within the integrated circuit 3 using a sample input and the test circuit (redundant circuit 15).

In FIG. 3, the failure detection unit 19 of the control unit 5 first detects a failure (step S1), and the failure location determination control unit 21 determines which chiplet has failed (step S2). Here, it is assumed that the chiplet 11 serving as the GPU has failed. The write code selection unit 23 selects the code that makes the redundant circuit 15, which is an FPGA, function as the GPU (step S3), and the writing unit 25 writes the selected code to the redundant circuit 15 (step S4). Next, as shown in FIG. 5, the integrated circuit route management unit 27 selects a detour that bypasses the chiplet 11 (step S5), switches the route (step S6), and the process ends. This allows the device to recover from the failure automatically.
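The sequence of steps S1 to S6 can be sketched in software as follows. This is an illustrative model only, not the patented implementation: the `recover` function, the health table, the dictionary standing in for the write code storage unit 17, and the string code names are all hypothetical, and the sketch handles a single failure.

```python
REDUNDANT = 15  # reference sign of the redundant circuit


def recover(route, health, code_store, redundant_slot):
    """Run steps S1 to S6 once and return the route to use afterwards."""
    # S1: detect whether any chiplet on the route reports a failure.
    failed = [chiplet for chiplet in route if not health[chiplet]]
    if not failed:
        return list(route)
    # S2: determine which chiplet failed (this sketch handles one failure).
    target = failed[0]
    # S3: select the code that reproduces the failed chiplet's function.
    code = code_store[target]
    # S4: write the selected code to the redundant circuit.
    redundant_slot["code"] = code
    # S5: choose a detour that avoids the failed chiplet.
    detour = [REDUNDANT if chiplet == target else chiplet for chiplet in route]
    # S6: switch to the detour.
    return detour


route = [7, 9, 11, 13]
health = {7: True, 9: True, 11: False, 13: True}  # chiplet 11 (GPU) has failed
code_store = {7: "filter", 9: "crypto", 11: "routing", 13: "tunneling"}
slot = {}
print(recover(route, health, code_store, slot))  # [7, 9, 15, 13]
print(slot["code"])                              # routing
```

In this model the redundant circuit takes the position of chiplet 11 in the packet route, matching the GPU failure example in the text.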

FIGS. 8 and 9 are diagrams showing how failure detection is performed. When the chiplet 11 composed of a GPU is suspected of having failed, an external system writes test code to the redundant circuit 15, generates a sample input, and feeds it to the suspected circuit (chiplet 11). The output side then checks whether the output is the expected one. The test code and sample code are selected for each circuit assumed to have failed, and the internal detour connections are switched accordingly. In FIGS. 8 and 9, the failure is in the chiplet 11 serving as the GPU that performs routing. The check of the tunneling process in the chiplet 13 therefore shows no problem (expected output), whereas the routing process produces either no output or unexpected output, which reveals that the problem lies in the routing process.
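The sample-input check can be modelled as comparing a stage's actual output with the expected one. The sketch below is illustrative: `check_stage` and the two stage functions are hypothetical stand-ins for the test circuit and the chiplets, and a stage that raises an exception or returns nothing is treated as "no output".

```python
def check_stage(stage, sample_input, expected_output):
    """Return True if the suspected stage produces the expected output for the sample."""
    try:
        actual = stage(sample_input)
    except Exception:
        return False  # a crash is treated like "no output"
    return actual is not None and actual == expected_output


def healthy_tunnel(packet):
    """Hypothetical healthy tunneling stage: wraps the packet in an outer header."""
    return {"outer": "tun0", "inner": packet}


def broken_router(packet):
    """Hypothetical failed routing stage: outputs nothing."""
    return None


sample = {"dst": "10.0.0.1"}
print(check_stage(healthy_tunnel, sample, {"outer": "tun0", "inner": sample}))  # True
print(check_stage(broken_router, sample, {"dst": "10.0.0.1", "port": 2}))       # False
```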

FIG. 10 is a diagram showing how the output determination is returned to the test circuit (redundant circuit 15). This technique can be used to check any stage other than the first and last stages of the processing flow; checking the first or last stage requires a determination at an adjacent processor or at the service level. In this example the tests are run one by one starting from the circuit on the output side, but the number of tests can be reduced with a method such as binary search. Because the original function cannot be provided during a test, the service must either be stopped temporarily or, if it is to continue, be diverted to another circuit or chassis. Furthermore, although in the present embodiment the test is performed within the circuit because of the configuration, a test may also span separate chassis.
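The reduction in test count by binary search can be sketched as follows. This is an assumption-laden illustration, not the procedure of the embodiment: it assumes exactly one faulty stage, that the fault shows up as a wrong value for the chosen sample, and that known-good reference models of every stage are available (the role played here by the test circuit). It also ignores the restriction that the first and last stages need an external check. `locate_fault` and the arithmetic stages are hypothetical.

```python
def locate_fault(stages, reference, sample):
    """Return (index of the faulty stage, number of segment tests performed).

    stages:    the possibly faulty stage functions, in processing order
    reference: known-good models of the same stages
    """
    def run(funcs, lo, hi, value):
        for func in funcs[lo:hi]:
            value = func(value)
        return value

    lo, hi = 0, len(stages)
    value = sample  # known-good input to stage lo
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1
        expected = run(reference, lo, mid, value)
        if run(stages, lo, mid, value) == expected:
            lo, value = mid, expected  # first half is fine; fault lies in [mid, hi)
        else:
            hi = mid                   # fault lies in [lo, mid)
    return lo, tests


reference = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
stages = list(reference)
stages[2] = lambda x: x - 4  # stage 2 produces a wrong result
print(locate_fault(stages, reference, 5))  # (2, 2)
```

For four stages the fault is found with two segment tests, whereas testing one stage at a time from the output side could take up to four.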

FIG. 11 is a diagram showing code that allows the redundant circuit 15 to provide the function of the failed GPU. With a common language such as OpenCL, it is easy to implement the same function on the redundant circuit 15 composed of an FPGA. Differences in hardware are absorbed by the compiler and therefore pose no problem. However, given the performance difference between an FPGA and a GPU, parameters may need to be changed according to the number of hardware cores or threads. For example, the loop unroll count may be changed explicitly, that is, the degree of parallelism may be changed to speed up processing. In FIG. 11, the failed GPU has many cores, so its loop unroll count does not need to be limited. The FPGA serving as the redundant circuit 15, however, does not have that many look-up tables, so the code must be compiled with the loop unroll count limited to 4, and the parameter is changed accordingly.
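One way to picture this parameter adaptation is a helper that caps the unroll count by the FPGA's look-up table budget before the kernel is compiled. Everything in the sketch is hypothetical: the budget figures, the power-of-two policy, and the kernel loop are invented for illustration and are not taken from FIG. 11. The emitted `#pragma unroll N` form is one that several OpenCL compilers accept.

```python
def adapt_unroll(requested, lut_budget, luts_per_copy):
    """Pick the largest power-of-two unroll factor that fits the LUT budget."""
    if requested < 1 or luts_per_copy < 1:
        raise ValueError("requested and luts_per_copy must be positive")
    affordable = max(1, lut_budget // luts_per_copy)
    limit = min(requested, affordable)
    factor = 1
    while factor * 2 <= limit:
        factor *= 2
    return factor


def render_loop(unroll):
    """Emit an OpenCL C loop with an explicit unroll pragma (illustrative body)."""
    return (
        f"#pragma unroll {unroll}\n"
        "for (int i = 0; i < n; i++) { out[i] = in[i] * 2; }"
    )


# Hypothetical figures: the GPU version asked for 32, the FPGA can afford 4 copies.
factor = adapt_unroll(requested=32, lut_budget=4000, luts_per_copy=1000)
print(factor)  # 4
print(render_loop(factor))
```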

FIG. 12 is a diagram showing, as a modification, a configuration provided with a switch 40. The switch 40 makes it possible to switch the route so that packets flow through the constructed detour.
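The role of such a switch can be modelled as a forwarding element that applies the processing units in the order given by the configured route. The `Switch` class and its handlers below are hypothetical and exist only to show the route change taking effect; each handler simply records its unit number in the packet.

```python
class Switch:
    """Hypothetical model of a switch that forwards packets along a configured route."""

    def __init__(self, handlers):
        self.handlers = handlers  # unit id -> function applied to the packet
        self.route = []

    def set_route(self, route):
        missing = [unit for unit in route if unit not in self.handlers]
        if missing:
            raise ValueError(f"unknown units in route: {missing}")
        self.route = list(route)

    def forward(self, packet):
        for unit in self.route:
            packet = self.handlers[unit](packet)
        return packet


# Each handler appends its own id so the path taken is visible in the result.
handlers = {unit: (lambda trace, unit=unit: trace + [unit]) for unit in (7, 9, 11, 13, 15)}
switch = Switch(handlers)

switch.set_route([7, 9, 11, 13])   # normal route
print(switch.forward([]))          # [7, 9, 11, 13]
switch.set_route([7, 9, 15, 13])   # detour through the redundant circuit 15
print(switch.forward([]))          # [7, 9, 15, 13]
```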

As described above, according to the present embodiment, realizing "N+1" redundancy in the integrated circuit 3 composed of multiple circuits makes it possible to keep the circuit scale small. In other words, even if some of the processors are damaged, the functions can automatically continue to be provided.

1 Data processing device
3 Integrated circuit
5 Control unit
7, 9, 11, 13 Chiplet
15 Redundant circuit
15a Switching circuit
15b Memory
17 Write code storage unit
19 Failure detection unit
21 Failure location determination control unit
23 Write code selection unit
25 Writing unit
27 Integrated circuit route management unit
29 Redundant FPGA memory copy control unit
31 Execution program control unit
40, 42 Switching circuit

Claims

A data processing device that performs automatic recovery using a programmable redundant circuit, comprising:
a plurality of types of processing units that perform data processing;
a programmable redundant circuit; and
a control unit that, when a failure of any of the processing units is detected, writes to the redundant circuit the code necessary to provide the function of the processing unit in which the failure was detected, and constructs a packet route that bypasses the processing unit in which the failure was detected and passes through the redundant circuit,
wherein the redundant circuit substitutes the function of the processing unit in which the failure was detected by performing data processing based on the written code.

The data processing device according to claim 1, wherein the control unit comprises:
a write code storage unit that records a plurality of types of code necessary to provide the functions of the respective processing units;
a failure detection unit that detects a failure of each of the processing units;
a failure location determination control unit that determines which processing unit has failed;
a code selection unit that selects the code corresponding to the failed processing unit from the write code storage unit;
a writing unit that writes the selected code to the redundant circuit; and
a route management unit that constructs a packet route that bypasses the processing unit in which the failure was detected and passes through the redundant circuit.

The data processing device according to claim 2, wherein each processing unit includes a switching unit that switches the output destination of packets based on the route constructed by the route management unit.

The data processing device according to claim 2, further comprising a switch circuit that is connected to each of the processing units and switches the input and output of packets based on the route constructed by the route management unit.

The data processing device according to claim 1, wherein the control unit adapts parameters of the code based on the performance of the redundant circuit.

A data processing method for a data processing device that performs automatic recovery using a programmable redundant circuit, the method including at least:
a step of detecting a failure of any of the processing units;
a step of writing to the redundant circuit the code necessary to provide the function of the processing unit in which the failure was detected; and
a step of constructing a packet route that bypasses the processing unit in which the failure was detected and passes through the redundant circuit,
wherein the redundant circuit substitutes the function of the processing unit in which the failure was detected by performing data processing based on the written code.