JP6913312B2

JP6913312B2 - Data processing device and data transfer method

Info

Publication number: JP6913312B2
Application number: JP2018034512A
Authority: JP
Inventors: 貴大鈴木; サンヨプキム; 淳一可児; 敏博塙
Original assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Current assignee: Nippon Telegraph and Telephone Corp; University of Tokyo NUC
Priority date: 2018-02-28
Filing date: 2018-02-28
Publication date: 2021-08-04
Anticipated expiration: 2038-02-28
Also published as: JP2019149086A

Description

The present invention relates to a data processing device and a data transfer method.

In recent years, virtualization has attracted attention in the field of networking. Virtualization allows the devices that make up a network to be used logically, independent of the actual physical hardware configuration. To enable virtualization in optical access systems, configurations are being studied in which devices conventionally built from dedicated hardware are instead built from general-purpose hardware, with their functions implemented in software. Implementing functions in software makes the functions of a device replaceable and allows devices to be standardized and resources to be shared, so a reduction in CAPEX (Capital Expenditure) can be expected. Making function updates and configuration changes easier is also expected to reduce OPEX (Operating Expense). It is therefore conceivable to extend the software domain of the optical access system down to physical layer processing, and to implement the physical layer processing on an accelerator such as a GPU (Graphics Processing Unit) provided in a communication device constituting the optical access system.

However, because the physical layer computation of communication processing has conventionally been performed on dedicated chips, few prior studies perform this processing on a GPU. On the other hand, there are several studies that implement error correction on hardware such as an FPGA (see, for example, Non-Patent Documents 1 and 2). These studies are RTL (Register Transfer Level) designs that propose task-level parallelism between registers and an overall architecture, so their design philosophy differs from that of the present study, which makes use of a GPU.

One example of performing error correction on a GPU is its application to a RAID (Redundant Arrays of Inexpensive Disks) system (see, for example, Non-Patent Document 3). That work proposes only the system and does not describe a concrete implementation method. In addition, the throughput of the system is not high.

GPUs are mainly used to accelerate the processing of a CPU (Central Processing Unit). Therefore, as a technique for transferring data to a GPU, a configuration in which data is transferred from the CPU to the GPU is generally used, as shown in FIG. 7. One technique uses a high-speed DDP-DRAM (Dual-Data-Port Dynamic Random Access Memory) as the memory for DMA (Direct Memory Access) transfer (see, for example, Non-Patent Document 4). However, as far as the inventors have investigated, there is no method for transferring an external input signal of a non-generalized standard directly to the GPU without going through the CPU.

Non-Patent Document 1: Hanho Lee, Chang-Seok Choi, Jongyoon Shin, Je-Soo Ko, "100-Gb/s Three-Parallel Reed-Solomon based Forward Error Correction Architecture for Optical Communications", International SoC Design Conference 2008 (ISOCC '08), p. I-265-I-268, November 2008
Non-Patent Document 2: Hanho Lee, "A High-Speed Low-Complexity Reed-Solomon Decoder for Optical Communications", IEEE Transactions on Circuits and Systems II: Express Briefs, Vol.52, No.8, p.461-465, August 2005
Non-Patent Document 3: Matthew L. Curry, Anthony Skjellum, H. Lee Ward, Ron Brightwell, "Accelerating Reed-Solomon Coding in RAID systems with GPUs", IEEE International Symposium on Parallel and Distributed Processing 2008 (IPDPS 2008), April 2008
Non-Patent Document 4: Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, Onur Mutlu, "Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM", In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), October 2015

When considering applying these systems to communication, reducing the latency of data processing is important.

In view of the above circumstances, an object of the present invention is to provide a data processing device and a data transfer method capable of performing data processing at high speed using a general-purpose device.

One aspect of the present invention is a data processing device including an interface circuit that adds update information indicating a data update to data received from the outside and outputs the data, and an accelerator that performs arithmetic processing using the data, wherein the accelerator includes: a storage unit that stores the data output from the interface circuit; a polling unit that repeatedly monitors the update information added to the data stored in the storage unit and, when update information indicating a data update is detected, rewrites the detected update information to indicate that the update has been detected; a control unit that controls execution of arithmetic processing using the data to which the update information detected by the polling unit is added; and a calculation unit that executes arithmetic processing using the data stored in the storage unit under the control of the control unit.

One aspect of the present invention is the data processing device described above, wherein the interface circuit further adds length information indicating the length of the data to the data and outputs the data, and the control unit causes the calculation unit to execute arithmetic processing with a degree of parallelism based on the length information.

One aspect of the present invention is the data processing device described above, wherein the interface circuit further adds control information indicating a type of arithmetic processing to the data and outputs the data, and the control unit causes the calculation unit to execute the type of arithmetic processing indicated by the control information.

One aspect of the present invention is the data processing device described above, wherein the data processing device is a communication device.

One aspect of the present invention is a data transfer method including: an output step in which an interface circuit adds update information indicating a data update to data received from the outside and outputs the data; a storage step in which an accelerator stores the data output from the interface circuit in a storage unit; a monitoring step of repeatedly monitoring the update information added to the data stored in the storage unit; a rewriting step of rewriting, when update information indicating a data update is detected in the monitoring step, the detected update information to indicate that the update has been detected; and a calculation step of executing arithmetic processing using the data to which the update information detected in the monitoring step is added.

According to the present invention, data processing can be performed at high speed using a general-purpose device.

FIG. 1 is a diagram showing data transfer between devices used in a communication device according to a first embodiment of the present invention.
FIG. 2 is a functional block diagram of the communication device according to the first embodiment.
FIG. 3 is a diagram showing an example of a data format according to the first embodiment.
FIG. 4 is a diagram showing a processing flow of polling processing according to the first embodiment.
FIG. 5 is a diagram showing a processing flow of parallel arithmetic processing in a GPU according to a second embodiment.
FIG. 6 is a diagram showing a processing flow for switching the type of arithmetic processing in a GPU according to a third embodiment.
FIG. 7 is a diagram showing data transfer to a GPU according to the prior art.
FIG. 8 is a diagram showing data transfer to a GPU using interrupts.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
When a device built from general-purpose hardware is applied to communication, reducing latency is important. To reduce the latency of data transfer and processing, it is necessary to receive short pieces of data and perform arithmetic processing on them.

Inputting data to a GPU, which is general-purpose hardware, requires going through an interface such as PCIe (PCI Express), which in turn requires hardware such as an FPGA (field-programmable gate array). Interrupts are commonly used when a GPU receives data. However, no GPU supports interrupts from hardware such as an FPGA. Instead, the CPU uses interrupt control to transfer data from the FPGA to the GPU and to launch the GPU kernel that executes the GPU program.

In view of the above, the communication device could, for example, be implemented as shown in FIG. 8. In this implementation example, when an external input is transferred to the GPU, the FPGA issues an interrupt to the CPU, and the CPU specifies the data transfer destination address and issues a transfer instruction to the FPGA. The CPU also issues a kernel execution instruction to the GPU, specifying the degree of parallelism and the instructions to execute.

In such an implementation, when the amount of data transferred at one time is small, the number of communications between the CPU and the other processors increases, and the latency increases. A method of transferring an external signal directly to the GPU without going through the CPU is therefore desired.

Furthermore, when the amount of data transferred at one time (the frame data amount) is small and data is input continuously, the number of interrupts per unit time increases. As a result, communication among the CPU, GPU, and FPGA increases, and processing may not finish within the time constraint.

GPUs have conventionally been used as accelerators for CPUs, with the timing of data input and changes of function controlled by the CPU. The problem is therefore how to perform timing control for data input from another device such as an FPGA, and how to change the function applied to each piece of data, without going through the CPU. Because accelerators such as GPUs do not support interrupts, a method of implementing timing control for data input using polling is required.

In addition, in GPU arithmetic processing the degree of parallelism of the processing performed by the GPU is normally changed under CPU control, which requires CPU intervention. Changing the processing executed by the GPU likewise has to be done under CPU control. A transfer method that reduces communication among the CPU, GPU, and FPGA, and a corresponding arithmetic processing method, are therefore required.

Therefore, in the present embodiment, polling is implemented inside the GPU in order to reduce communication among the CPU, GPU, and FPGA. So that the GPU can learn the data input timing without going through the CPU, the FPGA adds a flag indicating a data update to every frame it transfers. The GPU buffers the transferred data in memory, reads the buffered data by polling, and starts arithmetic processing when it determines that the flag has been updated. The FPGA also adds information on the degree of parallelism and the type of arithmetic processing to each frame transferred to the GPU. This makes it possible to change the degree of parallelism with which the GPU executes arithmetic processing and the type of arithmetic processing the GPU executes.

[First Embodiment]
FIG. 1 is a diagram showing data transfer between devices used in the communication device 1 of the present embodiment. The communication device 1 is an example of a data transfer device. The communication device 1 can be used, for example, as an optical line terminal (OLT) or an optical network unit (ONU) in a PON (Passive Optical Network).

The communication device 1 includes an FPGA 2 used as an IF (interface) circuit and a GPU 3, which is an example of an accelerator. The FPGA 2 receives a signal from another device via a transmission line and transfers the frame data contained in the received signal to the GPU 3. By adding additional data, such as a flag indicating a data update, to the frame data transferred to the GPU 3, the FPGA 2 achieves data transfer that does not involve CPU control. The GPU 3 receives the frame data output by the FPGA 2 and performs computation on it. The GPU 3 may output the data resulting from the computation to the FPGA 2, and the FPGA 2 may transmit a signal carrying the data received from the GPU 3 to another device via the transmission line.

By using the additional data, the communication device 1 can reduce the communication between the FPGA and the CPU and between the CPU and the GPU compared with the case of FIG. 8. Preprocessing is required before data transfer from the FPGA 2 is executed. As the preprocessing, the FPGA 2 secures the transfer destination GPU memory, acquires the transfer destination GPU address, and sets the values of the additional data in advance. Furthermore, the kernel of the GPU 3 that performs the polling processing must also have been launched beforehand.

FIG. 2 is a functional block diagram of the communication device 1. The figure shows the functional units in each of the FPGA 2 and the GPU 3.
The FPGA 2 includes an IF unit 21, a memory 22, a flag addition unit 23, and a transfer unit 24. The IF unit 21 receives an external signal transmitted over the transmission line. When the communication device 1 is, for example, an OLT or an ONU, the IF unit 21 converts an optical signal into an electrical signal or an electrical signal into an optical signal. The memory 22 stores (buffers) the external signal input via the IF unit 21.

The flag addition unit 23 adds flags to frame data of a certain length buffered in the memory 22. The flag addition unit 23 includes an update flag addition unit 231, a length flag addition unit 232, and a control flag addition unit 233. The update flag addition unit 231 adds to the frame data an Update flag indicating a data update. The length flag addition unit 232 adds to the frame data a Length flag indicating the length of the data. The control flag addition unit 233 adds to the frame data a Control flag indicating the type of arithmetic processing. The transfer unit 24 transfers the frame data with the flags added to the memory 31 of the GPU 3.

The GPU 3 includes a memory 31, a polling unit 32, a control unit 33, and a calculation unit 34. The memory 31 is an example of a storage unit that stores data. The memory 31 buffers the data transferred from the transfer unit 24 of the FPGA 2. The polling unit 32 of the GPU 3 performs polling processing on the data buffered in the memory 31 to detect the input of a signal. The control unit 33 controls the execution of various processes by executing kernels. The control unit 33 also changes the degree of parallelism of computation in the calculation unit 34 and switches the type of arithmetic processing performed by the calculation unit 34. The calculation unit 34 performs arithmetic processing on the input signal under the control of the control unit 33.

FIG. 3 is a diagram showing an example of the data format generated by the flag addition in the FPGA 2. As shown in the figure, a 32-bit Update flag with the header name "Update", a 32-bit Length flag with the header name "Length", and a 448-bit Control flag with the header name "Control" are added to the data.
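The header layout above can be illustrated with a minimal Python sketch. Only the three field widths (32, 32, and 448 bits) come from the description; the little-endian byte order, the placement of the header in front of the payload, the use of bytes as the unit of Length, and the names `pack_frame` and `unpack_frame` are assumptions made for illustration.

```python
import struct

# Field widths taken from the data format described for FIG. 3.
UPDATE_BITS, LENGTH_BITS, CONTROL_BITS = 32, 32, 448
HEADER_BYTES = (UPDATE_BITS + LENGTH_BITS + CONTROL_BITS) // 8  # 64 bytes

# Assumption: little-endian fields, Control carried as 56 raw bytes.
_HEADER = struct.Struct("<II56s")


def pack_frame(payload: bytes, control: bytes = b"") -> bytes:
    """Prepend Update=1, Length=len(payload) and Control to a payload."""
    if len(control) > CONTROL_BITS // 8:
        raise ValueError("control field exceeds 448 bits")
    # struct pads a short 's' field with zero bytes up to 56 bytes.
    return _HEADER.pack(1, len(payload), control) + payload


def unpack_frame(buf: bytes):
    """Split a buffer into (update, length, control, payload)."""
    update, length, control = _HEADER.unpack_from(buf, 0)
    payload = buf[HEADER_BYTES:HEADER_BYTES + length]
    return update, length, control, payload
```

With these widths the header occupies 64 bytes in total, so a 3-byte payload would produce a 67-byte frame in this sketch.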

The update flag addition unit 231 always sets the Update flag to the value "1", which indicates a data update, and thereby notifies the GPU 3 of the data update. When the polling unit 32 of the GPU 3 detects an Update flag with the value "1" in the data stored in the memory 31, it rewrites that Update flag to the value "0", which indicates that the update has been detected. The Length flag indicates the length of the data. Setting the data length in the Length flag allows the GPU 3 to use it, for example, to recognize the range of data to be processed and to determine the degree of parallelism for parallel computation. The Control flag is used for processing control. The Control flag is used to change the type of arithmetic processing performed in the GPU 3. The values of these flags are changed during program execution, for example by rewriting the values of registers in the FPGA 2.

FIG. 4 is a diagram showing the processing flow of the polling processing executed by the GPU kernel. The GPU kernel is launched during the advance setup. When frame data of a signal input by the IF unit 21 is buffered in the memory 22, the flag addition unit 23 of the FPGA 2 adds the Update flag, the Length flag, and the Control flag to the frame data and outputs it to the transfer unit 24. The transfer unit 24 transfers the frame data with the flags added to the memory 31 of the GPU 3. The memory 31 of the GPU 3 buffers the data transferred from the FPGA 2.

Through the polling processing, the polling unit 32 of the GPU 3 continually checks the Update flag of the data stored in the memory 31 (step S110). For example, the polling unit 32 checks the Update flag at predetermined time intervals. When the polling unit 32 determines that the value of the Update flag is 0 (step S110: == 0), it considers that new frame data has not yet arrived from the FPGA 2. The GPU 3 does not perform arithmetic processing on frame data, returns to step S110, and checks the Update flag again.

On the other hand, when the polling unit 32 determines that the value of the Update flag is 1 (step S110: != 0), it considers that new frame data has been input and resets the Update flag to 0 (step S120). After the Update flag is reset, the control unit 33 causes the calculation unit 34 to execute arbitrary arithmetic processing on the frame data (step S130). The GPU 3 then repeats the processing from step S110.

The memory 31 is, for example, a ring buffer. Each time the Update flag is reset, the control unit 33 of the GPU 3 advances the address value indicating the buffer position to be checked next for an update.
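The flow of steps S110 to S130 over a ring buffer can be sketched as a single-threaded Python simulation. This is not GPU code and does not model memory visibility between devices; the class names `Slot` and `RingPoller`, the `write` method standing in for the FPGA transfer, and the `process` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    update: int = 0       # 1 = newly written, 0 = update already detected
    payload: bytes = b""


class RingPoller:
    """Simulated polling over a ring buffer of flagged frames."""

    def __init__(self, num_slots, process):
        self.slots = [Slot() for _ in range(num_slots)]
        self.read_pos = 0          # buffer position checked next
        self.process = process     # stand-in for the arithmetic processing

    def write(self, pos, payload):
        """Producer side: store a frame with its Update flag set to 1."""
        self.slots[pos % len(self.slots)] = Slot(update=1, payload=payload)

    def poll_once(self):
        """One pass of the loop; returns True if a frame was processed."""
        slot = self.slots[self.read_pos]
        if slot.update == 0:           # S110: no new frame yet
            return False
        slot.update = 0                # S120: mark the update as detected
        self.process(slot.payload)     # S130: run the processing
        self.read_pos = (self.read_pos + 1) % len(self.slots)
        return True
```

In a real deployment the loop would run continuously inside the already launched kernel; here `poll_once` is called explicitly so that each step can be observed.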

[Second Embodiment]
A GPU has a plurality of cores and can perform parallel computation. In the present embodiment, the GPU performs parallel arithmetic processing on frame data.

FIG. 5 is a diagram showing a processing flow in which the GPU 3 performs parallel arithmetic processing on frame data. With reference to the figure, a method by which the GPU 3 changes the degree of parallelism of arithmetic processing while a kernel is running will be described. The Length flag added to the frame data is used to change the degree of parallelism. When the GPU 3 performs arithmetic processing in units of n bits, a degree of parallelism of Length/n is specified and a kernel is launched. Here, Dynamic Parallelism, which is available on NVIDIA GPUs, is assumed, and a kernel is launched dynamically from within a kernel that is already running. For example, n differs between a 10G-EPON (10 Gigabit Ethernet (registered trademark) Passive Optical Network) frame and an NG-PON2 (Next Generation Passive Optical Network 2) frame, so the length of the data transferred successively from the transfer unit 24 of the FPGA 2 to the GPU 3 (the value set in the Length flag) must be changed to a multiple of n.

The processing of steps S210 to S220 in FIG. 5 is the same as the processing of steps S110 to S120 shown in FIG. 4. After the processing of step S220, the control unit 33 reads the Length flag of the frame data stored in the memory 31 and divides the data length set in the Length flag by the preset n to calculate the degree of parallelism (step S230). The control unit 33 launches, from within the kernel that is already running, a kernel with the calculated degree of parallelism, and the launched kernel causes the calculation unit 34 to execute the arithmetic processing in n-bit units in parallel (step S240). The GPU 3 then repeats the processing from step S210.
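The calculation in step S230 and the parallel execution in step S240 can be sketched in Python as follows. A thread pool stands in for the dynamically launched GPU kernel, and the unit of Length and n is taken to be bytes rather than bits for simplicity; both choices, along with the function names `parallelism` and `process_frame`, are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def parallelism(length: int, n: int) -> int:
    """Degree of parallelism Length / n; Length must be a multiple of n."""
    if n <= 0 or length <= 0 or length % n != 0:
        raise ValueError("Length must be a positive multiple of n")
    return length // n


def process_frame(data: bytes, n: int, op):
    """Apply op to each n-unit chunk of data, one worker per chunk."""
    degree = parallelism(len(data), n)                       # S230
    chunks = [data[i * n:(i + 1) * n] for i in range(degree)]
    # Stand-in for launching a child kernel with `degree` parallel units.
    with ThreadPoolExecutor(max_workers=degree) as pool:     # S240
        return list(pool.map(op, chunks))
```

The check in `parallelism` reflects the requirement stated above that the transferred data length be a multiple of n; a frame that violates it is rejected rather than silently truncated.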

[Third Embodiment]
In the present embodiment, the type of arithmetic processing executed by the GPU 3 is switched.
FIG. 6 is a diagram showing a processing flow for switching the type of arithmetic processing executed by the GPU 3. The Control flag added to the frame data is used to switch the type of arithmetic processing.

The processing of steps S310 to S320 in FIG. 6 is the same as the processing of steps S110 to S120 shown in FIG. 4. After the processing of step S320, the control unit 33 refers to the value of the Control flag in a branch instruction (step S330). The control unit 33 switches the kernel to be launched according to the value of the Control flag. Where the kernels are denoted kernel 0 to kernel k (k is an integer of 1 or more), the control unit 33 launches kernel i when the value of the Control flag is i (i is an integer from 0 to k) (steps S340-0 to S340-k). In this way, for example, the control unit 33 can switch the function applied to the input data, such as between Reed-Solomon (255, 223) and Reed-Solomon (255, 239), without going through the CPU. After executing one of steps S340-0 to S340-k, the GPU 3 repeats the processing from step S310.
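The branch in step S330 can be sketched as a table lookup in Python. The two functions below are placeholders that only report which code would be selected; they are not Reed-Solomon decoders, and the names `KERNELS` and `dispatch` are hypothetical.

```python
def rs_255_223(frame: bytes):
    """Placeholder for kernel 0; only records the selected code."""
    return ("RS(255,223)", len(frame))


def rs_255_239(frame: bytes):
    """Placeholder for kernel 1; only records the selected code."""
    return ("RS(255,239)", len(frame))


# Index i of this list plays the role of kernel i (kernel 0 .. kernel k).
KERNELS = [rs_255_223, rs_255_239]


def dispatch(control: int, frame: bytes):
    """Select and run kernel i for Control flag value i (S330, S340-i)."""
    if not 0 <= control < len(KERNELS):
        raise ValueError(f"no kernel registered for Control value {control}")
    return KERNELS[control](frame)
```

Because the selection depends only on a value carried with the frame, the sender can change the processing applied to each frame simply by changing the Control value it attaches.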

According to the embodiments described above, the data processing device includes an interface circuit such as an FPGA and an accelerator such as a GPU. The data processing device is, for example, a communication device. The interface circuit adds update information indicating a data update to the input frame data and transfers it to the accelerator. The update information is, for example, the Update flag. The accelerator stores the received frame data in a storage unit and repeatedly monitors the update information of the stored data. When the accelerator detects update information indicating a data update, it determines that frame data has been input, rewrites the detected update information to indicate that the update has been detected, and starts arithmetic processing using the frame data to which that update information was added. Because the accelerator can thus detect the input timing of frame data without going through the CPU, the latency of data transfer and arithmetic processing in the communication device can be reduced.

The interface circuit may further add length information indicating the length of the data and control information indicating the type of arithmetic processing to the data and transfer the data to the accelerator. The length information and the control information are, for example, the Length flag and the Control flag. The accelerator executes arithmetic processing with a degree of parallelism based on the length information. This further reduces the latency of arithmetic processing. The accelerator also executes the type of arithmetic processing indicated by the control information. This allows the arithmetic processing executed by the accelerator to be switched from the interface circuit.

Although an embodiment of the present invention has been described in detail above with reference to the drawings, the specific configuration is not limited to this embodiment and includes designs and the like within a scope that does not depart from the gist of the present invention.

1 ... communication device, 2 ... FPGA, 3 ... GPU, 21 ... IF unit, 22 ... memory, 23 ... flag assignment unit, 24 ... transfer unit, 31 ... memory, 32 ... polling unit, 33 ... control unit, 34 ... calculation unit, 231 ... update flag assignment unit, 232 ... length flag assignment unit, 233 ... control flag assignment unit

Claims

A data processing device comprising:
an interface circuit that adds, to data received from outside, update information indicating a data update and control information indicating a type of arithmetic processing, and outputs the data; and
an accelerator that performs arithmetic processing using the data,
wherein the accelerator includes:
a storage unit that stores the data output from the interface circuit;
a polling unit that repeatedly monitors the update information added to the data stored in the storage unit and, upon detecting update information indicating a data update, rewrites the detected update information to indicate that the update has been detected;
a control unit that controls execution of arithmetic processing using the data to which the update information detected by the polling unit is added; and
a calculation unit that executes arithmetic processing using the data stored in the storage unit under the control of the control unit, and
wherein the control unit causes the calculation unit to execute the type of arithmetic processing indicated by the control information.

The data processing device according to claim 1, wherein
the interface circuit further adds length information indicating the length of the data to the data and outputs the data, and
the control unit causes the calculation unit to execute the arithmetic processing with a degree of parallelism based on the length information.

The data processing device according to claim 1 or 2, wherein
the data processing device is a communication device.

A data transfer method comprising:
an output step in which an interface circuit adds, to data received from outside, update information indicating a data update and control information indicating a type of arithmetic processing, and outputs the data;
a storage step in which an accelerator stores the data output from the interface circuit in a storage unit;
a monitoring step in which the accelerator repeatedly monitors the update information added to the data stored in the storage unit;
a rewrite step in which, when update information indicating a data update is detected in the monitoring step, the accelerator rewrites the detected update information to indicate that the update has been detected; and
a calculation step in which the accelerator executes arithmetic processing using the data to which the update information detected in the monitoring step is added,
wherein, in the calculation step, the type of arithmetic processing indicated by the control information is executed.