JP7136343B2 - Data processing system, method and program

Info

Publication number: JP7136343B2
Application number: JP2021515247A
Authority: JP
Inventors: 鶴鳴孫
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2018-09-18
Filing date: 2018-09-18
Publication date: 2022-09-13
Anticipated expiration: 2038-09-18
Also published as: JP2022500782A; WO2020059156A1

Description

The present invention relates generally to data processing systems, methods, and programs based on a one-dimensional array architecture, and more particularly to those used for continuous matrix processing that includes matrix transposition.

Machine learning has become very popular in recent years because of its high performance in many research areas. As more and more machine learning applications are developed, computational complexity has increased significantly. Efficient data processing is therefore very important. Increasing the parallelism of vector processing is highly beneficial for computational efficiency and is therefore preferred.

In vector processing, the main concept is to process data in vector patterns. To process matrix data, vector lanes are used, each of which contains an arithmetic logic unit (ALU). Each vector lane must also contain local memory to store computation data. In addition, direct memory access (DMA) can be used to transfer data between on-chip local memory and off-chip external memory. Finally, a central control unit is preferably used in the system for the overall control logic. Several vector processor designs based on these concepts are described in Patent Literature 1, Non-Patent Literature 1, and Non-Patent Literature 2.

In machine learning, one of the most common computations is matrix computation. General matrix multiplication (GEMM) is used in both forward and backward propagation. When a GEMM function is executed, all source data must be fetched from memory, and a two-dimensional array of processing elements is used to make full use of the data fetched for each element. A drawback of a two-dimensional array, however, is that when the actual matrix size of a computation is much smaller than the supported array size, the computational resources of the unused processing elements are wasted. As an alternative, a one-dimensional array can be used because it is more flexible.

Patent Literature 1: U.S. Patent No. 5,600,843

Non-Patent Literature 1: Stephan Nolting et al., "Application-Specific Soft-Core Vector Processor for Advanced Driver Assistance Systems," International Conference on Field Programmable Logic and Applications, September 2017.

Non-Patent Literature 2: Yeyong Pang et al., "Fully Pipelined Soft Vector Processor as a CPU Accelerator," Chinese Journal of Electronics, vol. 26, no. 6, pp. 1198-1205, November 2017.

The first problem with the prior art is that matrix transposition is not generally supported in vector processing units. In machine learning, matrix transposition is used in several cases, such as backpropagation through fully connected layers. The cited references do not disclose how to transpose a matrix efficiently inside the processing unit. Therefore, if the vector processing unit does not support a specific instruction for matrix transposition, the transposition must be performed outside the vector processing unit. To do this, the matrix to be transposed must first be transferred from the local memory inside the vector processing unit to the external memory outside it. After the transposed result has been prepared in the external memory, the transposed matrix must be transferred from the external memory back to the vector processing unit, which wastes a great deal of time on data transfer.

The second problem is that there are many data transfers, which are mainly of two types. One is a transfer between the local memories themselves, and the other is a transfer between a local memory and the external memory. Because the external memory is not always available in current vector processing units, data transfers between a local memory and the external memory usually contain bubble cycles. Moreover, during such a transfer the requests are issued one after another in sequence, so other transfer requests cannot be processed even when there are multiple bubble cycles, and this degrades the throughput of the entire computing system.

One exemplary object of the present invention is to provide a matrix transposition device that can solve the first problem identified above, namely that matrix transposition is not supported in conventional vector processing engines.

Another exemplary object of the present invention is to provide a parallel system that can solve the second problem identified above, namely that multiple requests to a single local memory cannot be accepted simultaneously.

A first aspect of the present disclosure provides a data processing system comprising: a central processing unit; a vector processing unit electronically connected to the central processing unit and configured to perform operations based on instructions received from the central processing unit; an instruction memory unit electronically connected to the central processing unit and configured to store instructions; an external memory unit; a first local memory unit electronically connected to the central processing unit and configured to store one-dimensional systolic data; a second local memory unit electronically connected to the vector processing unit and configured to store matrix data; and a direct memory access unit electronically connected to the first local memory unit, the second local memory unit, the instruction memory unit, and the external memory unit and configured to access data in the external memory unit, wherein data is transferred through the direct memory access unit at a timing based on a predetermined selection priority.

A second aspect of the present disclosure provides a method for a data processing system. The data processing system comprises: a central processing unit; a vector processing unit electronically connected to the central processing unit; an instruction memory unit electronically connected to the central processing unit; an external memory unit; a first local memory unit electronically connected to the central processing unit; a second local memory unit electronically connected to the vector processing unit; and a direct memory access unit electronically connected to the first local memory unit, the second local memory unit, the instruction memory unit, and the external memory unit. The method includes: performing, by the vector processing unit, operations based on instructions received from the central processing unit; storing, by the instruction memory unit, instructions; storing, by the first local memory unit, one-dimensional systolic data; storing, by the second local memory unit, matrix data; and accessing, by the direct memory access unit, data in the external memory unit. Data is transferred through the direct memory access unit at a timing based on a predetermined selection priority.

A third aspect of the present disclosure provides a program for a data processing system. The data processing system comprises: a central processing unit; a vector processing unit electronically connected to the central processing unit; an instruction memory unit electronically connected to the central processing unit; an external memory unit; a first local memory unit electronically connected to the central processing unit; a second local memory unit electronically connected to the vector processing unit; and a direct memory access unit electronically connected to the first local memory unit, the second local memory unit, the instruction memory unit, and the external memory unit. The program causes the vector processing unit to perform operations based on instructions received from the central processing unit, causes the instruction memory unit to store instructions, causes the first local memory unit to store one-dimensional systolic data, causes the second local memory unit to store matrix data, and causes the direct memory access unit to access data in the external memory unit. Data is transferred through the direct memory access unit at a timing based on a predetermined selection priority.

One effect of the present invention is that matrix transposition can be performed within the local memories. The reason for this effect is that a specific instruction for matrix transposition can be created.

When the vector processing unit receives this instruction, it starts transferring data from one local memory to the other. During the transfer, mapped addresses for the source memory and the destination memory are calculated, so that each data item from the source matrix is written into the destination matrix in transposed form.

A second effect is that data can be transferred between the two local memories and the external memory within the same period.

The reason for this effect is that a priority-based request selection scheme is implemented, so that a low-priority data transfer can be executed during the bubble cycles of a high-priority data transfer.
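The priority-based selection described above can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the circuit of the embodiment: the function name `arbitrate`, the per-cycle readiness lists, and the two-request setup are hypothetical and chosen for illustration. In each cycle the shared port is granted to the highest-priority request that is ready, so a low-priority transfer proceeds whenever the high-priority transfer is in a bubble cycle.

```python
def arbitrate(ready_by_priority):
    """Simulate priority selection for one shared memory port.

    ready_by_priority[p][t] is True when the request with priority p
    (0 = highest) can transfer in cycle t; False models a bubble cycle.
    Returns, for each cycle, the priority index that is granted the port,
    or None when no request is ready.
    This model is hypothetical and only illustrates the idea in the text.
    """
    num_cycles = len(ready_by_priority[0])
    grants = []
    for t in range(num_cycles):
        granted = None
        for priority, ready in enumerate(ready_by_priority):
            if ready[t]:
                granted = priority  # first ready request wins this cycle
                break
        grants.append(granted)
    return grants


# The high-priority transfer has bubble cycles at t = 1 and t = 3.
high = [True, False, True, False, True]
# The low-priority transfer is always ready to send.
low = [True, True, True, True, True]

grants = arbitrate([high, low])
print(grants)  # [0, 1, 0, 1, 0]
```

In this toy run the low-priority request fills exactly the two bubble cycles, so the port is busy in every cycle.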

FIG. 1 is a block diagram showing the structure of a first exemplary embodiment of the present invention.
FIG. 2 is a block diagram showing the structure of a local memory of the first exemplary embodiment.
FIG. 3 is a block diagram showing the structure of a local memory of the first exemplary embodiment.
FIG. 4 is a block diagram showing the structure of the processing elements of the first exemplary embodiment.
FIG. 5 is a block diagram showing the structure of the matrix transfer of the first exemplary embodiment.
FIG. 6 is a block diagram showing the data mapping of a local memory of the first exemplary embodiment.
FIG. 7 is a block diagram showing the data mapping of a local memory of the first exemplary embodiment.
FIG. 8 is a block diagram showing the priorities of different memory access requests for a local memory of the first exemplary embodiment.
FIG. 9 is a block diagram showing the priorities of different memory access requests for a local memory of the first exemplary embodiment.
FIG. 10 is a flow diagram showing the procedure for two consecutive GEMM computations without matrix transposition according to a conventional method.
FIG. 11 is a flow diagram showing the procedure for two consecutive GEMM computations without matrix transposition according to the method of the first exemplary embodiment.
FIG. 12 is a flow diagram showing the procedure for two consecutive GEMM computations with matrix transposition according to a conventional method.
FIG. 13 is a flow diagram showing the procedure for two consecutive GEMM computations with matrix transposition according to the method of the first exemplary embodiment.
FIG. 14 is a block diagram showing the mechanism by which two requests operate in the same period using the priority selection of the first exemplary embodiment.
FIG. 15 is a block diagram showing the structure of another exemplary embodiment of the present invention.

(Description of configuration)
First, a first exemplary embodiment of the present invention is described in detail below with reference to the accompanying drawings.

Referring to FIG. 1, in the first exemplary embodiment of the present invention, a data processing system 100 includes a central processing unit (CP) 110, a vector processing (VP) engine 120, a DMA 130, a matrix transfer device 140, a first local memory 150 for data storage, a second local memory 160 for data storage, an instruction memory 180 for instruction storage, and an external memory 170.

The CP 110 may be a MIPS processor or a processor of a similar architecture. It supports basic instructions such as arithmetic calculations, and it stores to and loads from the external memory 170 using general-purpose registers in the central processing unit 110. The central processing unit 110 controls the first and second local memories 150 and 160 to fetch computation data from the external memory 170, and also prepares all the instructions stored in the instruction memory 180. The central processing unit 110 then sends vector instructions and matrix data to the vector processing unit 120. The vector processing unit 120 receives the instructions and starts the computation. The computation of the vector processing unit 120 is based on a one-dimensional systolic pattern. One input is fetched from the first local memory 150, and the other input is fetched from the second local memory 160. After the computation, the result is stored in the second local memory 160. There is a path between the first local memory 150 and the second local memory 160, and data can be transferred between them via the matrix transfer device 140. A transfer between the first and second local memories 150 and 160 can be a normal transfer or a transposed transfer. When the computation is finished, the result in the second local memory 160 may be transferred to the external memory 170 via the direct memory access 130.

The detailed architecture inside the vector processing unit 120 is shown in FIG. 4. The plurality of processing elements 121 are connected to one another. There is a connection between the central processing unit 110 and the first processing element 121, which is used to transfer data values and instruction information. Between adjacent processing elements 121 there are connection channels used to broadcast information to each processing element 121. Inside each processing element 121, a digital signal processor (DSP) can be used to compute multiplications and additions efficiently, achieving lower power and higher frequency. In addition, each processing element 121 has one dedicated register 125 for storing intermediate results.

The architecture of the first local memory 150 is shown in detail in FIG. 2. It has sixteen two-port RAM banks 151 and one first data selection unit 153. The two-port RAM banks 151 are used to store data. The first data selection unit 153 selects data transfer with either the external memory 170 or the second local memory 160. There is also an output that serves as the systolic data input for the vector processing unit 120. Note that although the first local memory 150 has 16 two-port RAM banks in this exemplary embodiment, the number may instead be 8, 32, or another value.

The architecture of the second local memory 160 is shown in detail in FIG. 3. It has two-port RAM banks 161, a second data selection unit 162, and a data ring 163. The number of RAM banks 161 is equal to the number of processing elements 121, and each processing element 121 corresponds to one RAM bank 161.

The architecture of the matrix transfer device 140 is shown in detail in FIG. 5. The matrix transfer device 140 contains an address generator 141 and a bank number generator 142. Because the first local memory 150 and the second local memory 160 are organized differently, the destination bank and the address within that bank may differ from the source bank and address, regardless of the type of data transferred from one local memory to the other. The address generator 141 and the bank number generator 142 are therefore used.

[Explanation of operation]
Next, the general operation of this exemplary embodiment is described in detail with reference to the flowcharts of FIGS. 10 to 13.

First, the direct memory access unit (DMA) transfers instructions from the external memory 170 to the instruction memory 180. Each application can be compiled into assembly code. Accordingly, assembly code is created for each application and stored in the instruction memory 180.

The central processing unit 110 then fetches instructions from the instruction memory 180 one by one. Initial data is stored in the first local memory 150 and the second local memory 160 before the computation starts. When a general matrix multiplication (GEMM) function is executed, the left matrix is stored in the first local memory 150 and the right matrix is stored in the second local memory 160. For each M*K matrix, the data mapping in the first local memory 150 is described with reference to FIG. 6. If 16 RAM banks are used for the first local memory 150, then D1[0,0] (where D1[m,k] denotes the element in the m-th row and k-th column of the matrix) is stored at address 0 of bank 0 of the first local memory 150. For example, D1[0,1] is stored at address 0 of bank 1 of the first local memory 150, and D1[0,15] is stored at address 0 of bank 15. Starting from D1[0,16], data is stored at address 1, beginning again with bank 0 of the first local memory 150.

Therefore, the address and the corresponding bank are calculated by the following equations, where "/" is integer division and "%" is the remainder.
ADDR = (m*K + k) / 16
BANK = (m*K + k) % 16
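The two equations for the first local memory can be checked with a few lines of Python. This is a sketch only: the function name `first_memory_location` and its parameter names are assumptions made for illustration, and the default of 16 banks follows the example in the text.

```python
def first_memory_location(m, k, num_cols, num_banks=16):
    """Return (address, bank) of element D1[m, k] in the first local memory.

    num_cols is K, the number of columns of the M x K matrix.
    num_banks defaults to 16, the bank count used in the text's example.
    """
    linear_index = m * num_cols + k      # row-major position of D1[m, k]
    address = linear_index // num_banks  # ADDR = (m*K + k) / 16
    bank = linear_index % num_banks      # BANK = (m*K + k) % 16
    return address, bank


# Elements named in the text, for a matrix with K = 32 columns.
print(first_memory_location(0, 0, 32))   # (0, 0)
print(first_memory_location(0, 15, 32))  # (0, 15)
print(first_memory_location(0, 16, 32))  # (1, 0)
```

Consecutive elements of a row land in consecutive banks, which is what allows 16 of them to be accessed in the same cycle.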

For the data mapping of the second local memory 160, the number of RAM banks 161 is equal to the number of processing elements 121, because each processing element 121 corresponds to one RAM bank 161. For an example with 256 processing elements 121, the data mapping method is given with reference to FIG. 7. D2[0,0] is stored at address 0 of bank 0 of the second local memory 160, D2[0,1] is stored at address 0 of bank 1, and so on. D2[1,0] is stored at address 1 of bank 0, and D2[1,1] is stored at address 1 of bank 1. For each K*N matrix, if N is smaller than the number of PEs 121, the unused columns are filled with zeros. If N is larger than the number of PEs 121, the matrix is cut column-wise according to the number of PEs 121 and then mapped to the second local memory 160. Therefore, for element D2[k,n] (where k is the row and n is the column), the address and the corresponding bank are given by the following equations.
ADDR = k
BANK = n
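The column-per-bank layout of the second local memory, including the zero filling of unused columns, can be sketched as follows. The function name `map_to_second_memory` and the list-of-lists representation of the banks are assumptions made for illustration; the case where N exceeds the number of PEs is left out because the text does not specify how the cut pieces are addressed.

```python
def map_to_second_memory(matrix, num_pes):
    """Lay out a K x N matrix (N <= num_pes) across num_pes RAM banks.

    Returns banks, where banks[n][k] holds D2[k, n] (BANK = n, ADDR = k).
    Columns beyond N are filled with zeros, as the text describes.
    """
    num_rows = len(matrix)
    num_cols = len(matrix[0])
    if num_cols > num_pes:
        raise ValueError("matrix must be cut column-wise before mapping")
    banks = [[0] * num_rows for _ in range(num_pes)]
    for k in range(num_rows):
        for n in range(num_cols):
            banks[n][k] = matrix[k][n]
    return banks


d2 = [[1, 2, 3],
      [4, 5, 6]]
banks = map_to_second_memory(d2, num_pes=4)
print(banks)  # [[1, 4], [2, 5], [3, 6], [0, 0]]
```

Each bank holds one column of the matrix, so the processing element attached to that bank always finds its operands locally.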

To store a matrix in the second local memory 160, the data ring 163 is used to pass data through all the two-port RAM banks and to write data to, or read data from, the corresponding two-port RAM bank. For example, if a cache line is 64 bytes and 32-bit single-precision floating point is used, one cache line holds 16 data items. Each data item must be stored in its corresponding two-port RAM bank 161.

After the left and right matrices have been stored in the first local memory 150 and the second local memory 160, the computation is performed in a one-dimensional systolic manner. For a GEMM computation, if the left matrix is D1(M,K) and the right matrix is D2(K,N), the multiplication result is RES(M,N). For each element RES[m,n], the following computation is performed.

D1[0,0] is read from the first local memory 150 and transferred to a processing element 121. D2[0,0] is read from the second local memory 160 and becomes the other input operand for that processing element 121. After the computation, the product of D1[0,0] and D2[0,0] is stored in the dedicated register 125. D1[0,0] is then transferred to the second processing element 121 in a systolic manner. D2[0,1] is read from the second local memory 160. D1[0,0] and D2[0,1] are the two operands of the multiplication, and the result of D1[0,0]*D2[0,1] is stored in the dedicated register 125. For the subsequent processing elements 121, D1[0,0] is likewise passed on in a systolic manner until it has been transferred through all the processing elements 121. In fact, the data flow of every processing element 121 is exactly the same. The data flow of one processing element 121 is therefore briefly described below.

After D1[0,0] has been transferred through all the PEs 121, the next data item is read from the first local memory 150. The left matrix is read in row-major order. Thus, for example, D1[0,1] is read from the first local memory 150 and sent to the first processing element 121, and D2[1,0] is read from the second local memory 160 in the first processing element 121. D1[0,1]*D2[1,0] is then calculated. The previous result, D1[0,0]*D2[0,0], is read from the dedicated register 125 and added to the result of D1[0,1]*D2[1,0]. The sum is stored in the dedicated register 125 again. By repeating this, the first element RES[0,0] is obtained in the first processing element 121. All the elements handled by the first processing element 121 can be computed in the same way. After an element has been computed, the result is sent to the two-port RAM bank 161 in the second local memory 160.
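The one-dimensional systolic data flow described above can be modeled functionally in Python. This is a sketch under simplifying assumptions: the name `systolic_gemm` is hypothetical, cycle timing and pipelining are not modeled, and each processing element is reduced to a single accumulator that stands in for its dedicated register 125.

```python
def systolic_gemm(d1, d2):
    """Multiply d1 (M x K) by d2 (K x N) following the described data flow.

    PE n keeps one accumulator (its dedicated register) and reads D2[k, n]
    from its own bank. Each D1[m, k] is read once, in row-major order, and
    is passed from PE to PE. Timing and pipelining are not modeled.
    """
    num_rows, inner, num_cols = len(d1), len(d2), len(d2[0])
    result = [[0] * num_cols for _ in range(num_rows)]
    for m in range(num_rows):
        registers = [0] * num_cols          # one dedicated register per PE
        for k in range(inner):
            value = d1[m][k]                # read once from the first memory
            for pe in range(num_cols):      # value travels through the PEs
                registers[pe] += value * d2[k][pe]
        for pe in range(num_cols):
            result[m][pe] = registers[pe]   # write back to the PE's bank
    return result


left = [[1, 2],
        [3, 4]]
right = [[5, 6],
         [7, 8]]
product = systolic_gemm(left, right)
print(product)  # [[19, 22], [43, 50]]
```

The inner loop over `pe` is sequential here only because this is software; in the described hardware each PE performs its multiply-accumulate as the value reaches it.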

Data transfer between the first local memory 150 and the second local memory 160 is described below. There are four cases. The first case is a normal transfer from the second local memory 160 to the first local memory 150. Assuming that the matrix stored in the second local memory 160 has size K×N, N processing elements 121 are used and the depth of each RAM bank is K. As described above, if N is smaller than the number of processing elements 121 (for example, 256), the unused columns are filled with zeros. D2[0,0] in the second local memory 160 is therefore mapped to the first element of the first RAM bank in the first local memory 150, and D2[0,1] in the second local memory 160 is mapped to the first element of the second RAM bank in the first local memory 150. However, for D2[0,n] in the second local memory 160, if n is equal to or greater than the number of banks supported by the first local memory 150, the address and bank in the first local memory 150 differ from those in the second local memory 160. The address and bank mapping of the other elements is calculated by the equations below, where DST_ADDR and DST_BANK are the destination address and bank, which in this case are in the first local memory 150. SRC_ADDR and SRC_BANK are the source address and source bank, which in this case are in the second local memory 160. SRC_NUM_BANK is the number of two-port RAM banks 161 in the second local memory 160, and DST_NUM_BANK is the number of two-port RAM banks 151 supported in the first local memory 150.
DST_ADDR = (SRC_ADDR*SRC_NUM_BANK + SRC_BANK) / DST_NUM_BANK
DST_BANK = (SRC_ADDR*SRC_NUM_BANK + SRC_BANK) % DST_NUM_BANK
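The address and bank remapping for a normal (non-transposed) transfer can be written out directly from the two equations. This is a sketch only: the function name `remap_normal` is an assumption, and the bank counts in the usage example (256 source banks, 16 destination banks) are taken from the examples in the text.

```python
def remap_normal(src_addr, src_bank, src_num_bank, dst_num_bank):
    """Map a source (address, bank) to a destination (address, bank).

    Implements the two equations for a normal transfer:
      DST_ADDR = (SRC_ADDR*SRC_NUM_BANK + SRC_BANK) / DST_NUM_BANK
      DST_BANK = (SRC_ADDR*SRC_NUM_BANK + SRC_BANK) % DST_NUM_BANK
    """
    linear_index = src_addr * src_num_bank + src_bank
    dst_addr = linear_index // dst_num_bank
    dst_bank = linear_index % dst_num_bank
    return dst_addr, dst_bank


# D2[k, n] sits at address k, bank n of the 256-bank second local memory.
print(remap_normal(0, 1, 256, 16))   # D2[0,1]  -> (0, 1)
print(remap_normal(0, 16, 256, 16))  # D2[0,16] -> (1, 0)
print(remap_normal(1, 0, 256, 16))   # D2[1,0]  -> (16, 0)
```

Swapping the bank counts gives the mapping in the opposite direction, since the same linear index is simply split by a different number of banks.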

In practice, 512 bits of data are transferred in each clock cycle, assuming a 512-bit cache line. With 32-bit single-precision floating point, one data item is 32 bits, so 16 data items are transferred per clock cycle. Assuming that there are 256 processing elements 121, a data transfer between the first local memory 150 and the second local memory 160 proceeds as follows. In the first clock cycle, D2[0,0], D2[0,1], ..., D2[0,15] are taken from the RAM banks 161 of the respective processing elements 121 and transferred to the first local memory 150 via the data ring 163. Because D2[0,0], D2[0,1], ..., D2[0,15] are stored in different banks of the first local memory 150, these data items can be written to the first local memory 150 in the same clock cycle. In the second clock cycle, D2[0,16], D2[0,17], D2[0,18], ..., D2[0,31] are taken from the RAM banks 161 of the corresponding processing elements 121 and transferred to the first local memory 150 via the data ring 163. Because D2[0,16], D2[0,17], ..., D2[0,31] are stored in different banks of the first local memory 150, these data items can also be written in the same clock cycle. In the same way, in each clock cycle 16 data items are read from the RAM banks 161 of the corresponding processing elements 121, transferred via the data ring 163, and finally stored in the first local memory 150.

The second case is a normal transfer from the first local memory 150 to the second local memory 160. The address and bank generators are the same as above; the only difference is that the source is the first local memory 150 and the destination is the second local memory 160.

The third case is the transfer of a transposed matrix from the first local memory 150 to the second local memory 160. Assuming that a matrix [N,M] is stored in the first local memory 150, the matrix is transposed to size [M,N] and stored in the second local memory 160. For example, D1[0,0] of the first local memory 150 is mapped to address 0 of bank 0 of the second local memory 160, and D1[0,1] of the first local memory 150 is mapped to address 1 of bank 0 of the second local memory 160. For the address and bank mappings of the other elements, the calculation is given by the following equations.
DST_ADDR=(SRC_ADDR*SRC_NUM_BANK+SRC_BANK)% DST_NUM_BANK
DST_BANK=(SRC_ADDR*SRC_NUM_BANK+SRC_BANK)/DST_NUM_BANK
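The following sketch transcribes the two third-case equations literally, to show how they differ from the normal transfer: the remainder now selects the address and the quotient selects the bank. The function name and the bank counts (16 source banks, 256 destination banks) are assumptions made for illustration.

```python
def map_transposed_1_to_2(src_addr, src_bank, src_num_bank, dst_num_bank):
    """Return (dst_addr, dst_bank) per the third-case equations as written."""
    linear = src_addr * src_num_bank + src_bank
    # Compared with the normal transfer, '%' and '/' swap roles.
    return linear % dst_num_bank, linear // dst_num_bank


# Assumed sizes: 16 banks in the first local memory, 256 in the second.
print(map_transposed_1_to_2(0, 0, 16, 256))  # D1[0,0]
print(map_transposed_1_to_2(0, 1, 16, 256))  # D1[0,1]
```

Under these assumptions, consecutive elements of a source row go to successive addresses of a single destination bank, matching the two examples given in the text.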

The transfer is again based on 16×16 blocks. However, the 16×16 block now resides in the second local memory 160. Therefore, to produce a 16×16 block on the second local memory 160 side every 16 clock cycles, the corresponding elements D2[0,0], D2[1,1], D2[2,2], ..., D2[15,15] must be located in the first local memory 150. D2[0,0] is at address 0 of bank 0, D2[1,1] is at address (N+1)/16 of bank 1, and D2[2,2] is at address (N×2+2)/16 of bank 2. Similarly, D2[15,15] is at address (N×15+15)/16 of bank 15. Because these elements lie in different banks, 16 data items can be fetched in one clock cycle. In the second clock cycle, the corresponding elements D2[1,0], D2[2,1], D2[3,2], ..., D2[0,15] are located in the first local memory 150. D2[1,0] is at address N/16 of bank 0, and D2[2,1] is at address (N×2+1)/16 of bank 1. Similarly, D2[0,15] is at address 0 of bank 15. With this mapping, the addresses in the first local memory 150 of the elements of a 16×16 block of the second local memory 160 are known.

In each clock cycle, the 16 elements fetched from the first local memory 150 are transposed and stored in the second local memory 160. For example, in the first cycle, D2[0,0], D2[1,1], D2[2,2], ..., D2[15,15] lie on the diagonal, so the transposed result is the same. In the second local memory 160, the address for bank 0 is 0, the address for bank 1 is 1, and so on. In the second clock cycle, D2[1,0], D2[2,1], D2[3,2], ..., D2[0,15] are fetched from the first local memory 150. For D2[1,0], the transposed result must be stored at address 0 of bank 1; D2[2,1] must be stored at address 1 of bank 2; and D2[3,2] must be stored at address 2 of bank 3. In this way, all the data can be stored in the 16 banks of the second local memory 160 in one clock cycle.

The address generator for the first local memory 150 can be expressed by the following equations, where CNT is the cycle count within the 16 clock cycles of each 16×16 block and SRC_BANK indicates the bank of the first local memory 150.
SRC_ADDR_IN_B16=(N*(SRC_BANK+CNT)+SRC_BANK)/16 if SRC_BANK+CNT<=15
SRC_ADDR_IN_B16=(N*(SRC_BANK+CNT-16)+SRC_BANK)/16 if SRC_BANK+CNT>15
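A minimal sketch of the in-block source address generator follows. It folds the two conditional equations into one function and assumes that CNT runs from 0 to 15 and that N is a multiple of 16. The value N = 64 and the function name are illustrative only.

```python
BLOCK = 16  # block edge and number of banks read per cycle


def src_addr_in_b16(n, src_bank, cnt):
    """In-block read address for `src_bank` at cycle `cnt` (assumed 0..15)."""
    row = src_bank + cnt
    if row > 15:
        row -= BLOCK  # wrap around inside the 16x16 block
    return (n * row + src_bank) // BLOCK


N = 64  # assumed row length, a multiple of 16
# First cycle (CNT = 0): the diagonal D2[0,0], D2[1,1], D2[2,2].
print([src_addr_in_b16(N, b, 0) for b in range(3)])
# Second cycle (CNT = 1): D2[1,0] in bank 0 and D2[0,15] in bank 15.
print(src_addr_in_b16(N, 0, 1), src_addr_in_b16(N, 15, 1))
```

Each bank is read exactly once per cycle, so the 16 fetches never collide; only the row read from each bank rotates with CNT.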

The base address of a 16×16 block consists of two parts: one part, denoted delta_v, corresponds to the vertical movement of the 16×16 block, and the other part, denoted delta_h, corresponds to its horizontal movement. B16_vertical_num represents the vertical address of the 16×16 block in the second local memory 160. Here, the 16×16 blocks in the second local memory 160 are scanned vertically.
B16_vertical_num=(B16_vertical_num<M/16)? B16_vertical_num+1:0
Delta_v=(B16_vertical_num<M/16)? (Delta_v+N): 0
Delta_h=(B16_vertical_num==0)? (Delta_h+1): Delta_h
SRC_B16_ADDR = Delta_v + Delta_h

Therefore, the final addresses for the 16 banks of the first local memory 150 are obtained by the following equation.
SRC_ADDR = SRC_ADDR_IN_B16 + SRC_B16_ADDR

The address generator for the first local memory 150 has been given above. Next, the address and bank generator for the second local memory 160 is described. Because the source matrix is scanned vertically in the second local memory 160, the transposed matrix is scanned horizontally in the second local memory 160. Therefore, the base address and base bank can be calculated as follows.
DST_ADDR_IN_B16=(DST_BANK%16-CNT) if DST_BANK%16-CNT>=0
DST_ADDR_IN_B16 = (DST_BANK%16-CNT+16) if DST_BANK%16-CNT<0
DST_B16_ADDR=(DST_B16_BANK<N/16)? DST_B16_ADDR: (DST_B16_ADDR+16)
DST_ADDR = DST_ADDR_IN_B16 + DST_B16_ADDR
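The destination side can be sketched in the same way. The example below moves a single 16×16 block through the cyclic schedule, with both block base addresses assumed to be zero, and checks that every destination slot is written exactly once. The function names and the `banks[bank][addr]` layout are illustrative assumptions.

```python
BLOCK = 16


def dst_addr_in_b16(dst_bank, cnt):
    """In-block write address for `dst_bank` at cycle `cnt` (assumed 0..15)."""
    addr = dst_bank % BLOCK - cnt
    return addr if addr >= 0 else addr + BLOCK


def transpose_block(block):
    """Move one 16x16 block through the cyclic schedule; returns banks[bank][addr]."""
    banks = [[None] * BLOCK for _ in range(BLOCK)]
    for cnt in range(BLOCK):
        for src_bank in range(BLOCK):        # source bank holds this column
            row = (src_bank + cnt) % BLOCK   # row read from that bank this cycle
            dst_bank = row                   # transposed: the row index selects the bank
            addr = dst_addr_in_b16(dst_bank, cnt)
            assert banks[dst_bank][addr] is None  # each slot is written once
            banks[dst_bank][addr] = block[row][src_bank]
    return banks


# Label every element with its (row, column) position in the source block.
block = [[(r, c) for c in range(BLOCK)] for r in range(BLOCK)]
out = transpose_block(block)
print(out[1][0], out[2][1], out[3][2])  # second-cycle examples from the text
```

In each cycle the 16 destination banks are all different, which is why the 16 writes fit in one clock cycle.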

The fourth case is the transfer of a transposed matrix from the second local memory 160 to the first local memory 150. Assuming that a matrix [M,N] is stored in the second local memory 160, the matrix must be transposed to [N,M] and stored in the first local memory 150. For example, D2[0,0] of the second local memory 160 is still mapped to address 0 of bank 0 of the first local memory 150, and D2[0,1] of the second local memory 160 is mapped to address 1 of bank 0 of the first local memory 150. For the address and bank mappings of the other elements, the calculation is given by the following equations.
DST_ADDR=(SRC_BANK*M+SRC_ADDR)/DST_NUM_BANK
DST_BANK=(SRC_BANK*M+SRC_ADDR) %DST_NUM_BANK

If D2[0,0], D2[0,1], ..., D2[0,15] were read from the RAM banks 161 of the processing elements 121, the 16 data items would be stored in the same RAM bank of the first local memory 150. These 16 data elements therefore could not be stored in the first local memory 150 in one clock cycle. To avoid this problem, a cyclic data mapping method is used. The transfer is based on 16×16 blocks, which means that after the data of one 16×16 block has been accessed, the 16×16 window shifts to the next 16×16 block. The top-left 16×16 block is transferred first, which takes 16 clock cycles. In the first clock cycle, D2[0,0], D2[1,1], D2[2,2], ..., D2[15,15] are read from the RAM banks 161 of the second local memory 160. These elements are stored in different banks of the first local memory 150: D2[0,0] is stored in the first RAM bank of the first local memory 150, D2[1,1] in the second RAM bank, and so on. Therefore, these 16 data items can be stored in the first local memory 150 in one clock cycle. In the second clock cycle, D2[0,1], D2[1,2], D2[2,3], ..., D2[14,15] and D2[15,0] are read from the RAM banks 161 of the second local memory 160. These elements are also stored in different banks of the first local memory 150, so these 16 data items can likewise be stored in the first local memory 150 in one clock cycle.

The address generator is the same as in the third case. The only difference is that in the third case data is transferred from the first local memory 150 to the second local memory 160, whereas in the fourth case it is transferred from the second local memory 160 to the first local memory 150. The source and destination addresses are therefore exchanged.
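The bank conflict and its cyclic remedy can be illustrated with the fourth-case DST_BANK equation. This is a sketch under assumed sizes: 16 destination banks and a bank depth M = 32, chosen as a multiple of 16. The helper names are hypothetical.

```python
DST_NUM_BANK = 16   # assumed bank count of the first local memory
M = 32              # assumed depth of each source bank, a multiple of 16


def dst_bank(src_bank, src_addr):
    """Destination bank per the fourth-case equation."""
    return (src_bank * M + src_addr) % DST_NUM_BANK


# Reading one row (address 0 of banks 0..15) targets a single destination bank.
row_read = {dst_bank(b, 0) for b in range(16)}


def cyclic_read(cnt):
    """Destination banks when cycle `cnt` reads D2[r, (r + cnt) % 16] for r = 0..15."""
    return [dst_bank((r + cnt) % 16, r) for r in range(16)]


print(len(row_read))  # distinct destination banks for a row-wise read
print(all(len(set(cyclic_read(c))) == 16 for c in range(16)))
```

Under these assumptions the row-wise read collapses onto one bank, while every cycle of the cyclic read touches 16 different banks.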

For the second local memory 160, transfers are possible both between the external memory 170 and the second local memory 160 and between the first local memory 150 and the second local memory 160. When multiple transfer requests are received in the same clock cycle, the priority is determined by the data selection unit 162. This operation is described below with reference to the flowchart of FIG. 8. The first priority is data transfer from the external memory 170 to the second local memory 160, because data from the external memory 170 must be taken whenever it is valid. The second priority is transfer from the second local memory 160 to the external memory 170. The third priority is transfer from the first local memory 150 to the second local memory 160, and the last priority is transfer from the second local memory 160 to the first local memory 150.

For the first local memory 150, the first data selection unit 153 is used. When multiple transfer requests are received in the same clock cycle, selection is performed based on priority, as shown in FIG. 9. The first priority is data transfer from the external memory 170 to the first local memory 150. The second priority is transfer from the first local memory 150 to the external memory 170. The third priority is transfer from the second local memory 160 to the first local memory 150, and the last priority is transfer from the first local memory 150 to the second local memory 160.
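Both data selection units follow the same four-level order when seen from the local memory they serve. The sketch below models that order as a fixed-priority selector. The enum names and the `select` function are hypothetical and do not reproduce the flowcharts of FIG. 8 or FIG. 9.

```python
from enum import IntEnum


class Request(IntEnum):
    """Transfer requests seen by one local memory; lower value = higher priority."""
    EXTERNAL_TO_LOCAL = 0
    LOCAL_TO_EXTERNAL = 1
    PEER_TO_LOCAL = 2   # from the other local memory into this one
    LOCAL_TO_PEER = 3   # from this local memory to the other one


def select(pending):
    """Return the request to serve this cycle, or None when nothing is pending."""
    return min(pending) if pending else None


print(select({Request.LOCAL_TO_PEER, Request.LOCAL_TO_EXTERNAL}).name)
```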

Consider consecutive GEMMs (for example, a first GEMM A*B=C and a second GEMM C*D=E), described below with reference to FIG. 10. For the first GEMM, A*B=C, matrix A is stored in the first local memory 150 and matrix B is stored in the second local memory 160. After matrix A has been stored in the first local memory 150 and matrix B in the second local memory 160, the computation is performed and the result is stored in the second local memory 160. If the second GEMM is C*D=E, matrix C must be stored in the first local memory 150 as the systolic one-dimensional input. To store matrix C in the first local memory 150, the result must first be sent to the external memory 170 and then transferred from the external memory 170 to the first local memory 150. Matrix D is then stored in the second local memory 160, and the computation begins. Finally, the computation result is obtained and transferred to the external memory 170.

In our method, however, matrix C can be transferred directly from the second local memory 160 to the first local memory 150, so the transfer of matrix C to the external memory 170 can be omitted. The processing procedure is shown in FIG. 11. The computation of the first GEMM can run in parallel with the storing of matrix D for the second GEMM.

Similarly, when the second GEMM is C.T*D=E, where C.T is the transpose of C, the conventional procedure is as shown in FIG. 12. The first GEMM is computed by the same procedure as shown in FIG. 10. After C is stored in the external memory 170, the central processing unit 110 transposes matrix C in the external memory 170. The transposed matrix of C is then transferred to the first local memory 150, matrix D is transferred to the second local memory 160, and the second GEMM can begin.

With the method of the present disclosure, however, not only can matrix C be transferred directly from the second local memory 160 to the first local memory 150, but the transposition can also be completed during the transfer. Therefore, the transfer of matrix C to the external memory 170 can be performed with fewer operations. Furthermore, the transfer time to the central processing unit can be shortened. The processing procedure is shown in FIG. 13.

In the examples above, there are no simultaneous requests to the same local memory (i.e., the first local memory 150 or the second local memory 160). In some implementations, however, such simultaneous requests can occur. For example, after the first GEMM, A*B=C, has been computed, the resulting matrix C needs to be transferred to the external memory 170. Meanwhile, the result also needs to be transferred to the first local memory 150 for the next GEMM, C*D=E. In this case, the transfer between the second local memory 160 and the external memory 170 has the higher priority. As shown in FIG. 14, data transfer to and from the external memory 170 has some bubble cycles, and these bubble cycles can be used to carry out the transfer between the first local memory 150 and the second local memory 160.
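The use of bubble cycles can be sketched as a per-cycle grant list. The busy pattern below is invented for illustration and does not reproduce the timing of FIG. 14. The function name `schedule` and the labels `ext`, `loc`, and `idle` are hypothetical.

```python
def schedule(external_busy, local_items):
    """Grant each cycle to the external transfer when active, else to a local one.

    external_busy: list of bools, True where the external-memory transfer
                   uses the cycle, False for a bubble cycle.
    local_items:   number of local-to-local transfers still waiting.
    Returns the per-cycle grants and the number of local transfers left over.
    """
    grants = []
    for busy in external_busy:
        if busy:
            grants.append("ext")   # higher-priority transfer always wins
        elif local_items > 0:
            grants.append("loc")   # bubble cycle reused for a local transfer
            local_items -= 1
        else:
            grants.append("idle")
    return grants, local_items


pattern = [True, True, False, True, False, False, True]  # assumed busy pattern
print(schedule(pattern, 2))
```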

[Description of Effects]
Next, the effects of this exemplary embodiment are described.

Because this exemplary embodiment is configured so that mapping addresses are assigned cyclically to the two local memories, data can be transferred in transposed form via the data ring 163 inside the accelerator, and the transposition therefore does not need to be performed on the host side.

Furthermore, the exemplary embodiment is configured so that data transfer and computation use two different channels, which allows consecutive matrix computations to be executed in parallel. In addition, the data transfer of the previous matrix computation can be performed at the same time as the computation of the next matrix computation.

Furthermore, because multiple write/read requests can be served for each local memory in the same period, the bubble cycles in the communication between the local memories and the external memory can be exploited.

Referring to FIG. 15, in another exemplary embodiment, the data processing system 100 includes a central processing unit (CP) 110, a vector processing (VP) engine 120, a DMA 130, a first local memory 150 for data storage, a second local memory 160 for data storage, an instruction memory 180 for instruction storage, and an external memory 170.

The above program may implement only part of the functions described above. The program may also be a so-called difference file (difference program) that realizes the above functions in combination with a program already recorded in the computer system.

All or some of the functions of the data processing system described above may be implemented using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array).

In addition, features of the exemplary embodiments described above may be replaced as appropriate with well-known features without departing from the scope of the present invention. Moreover, the technical scope of the present invention is not limited to the exemplary embodiments described above, and various modifications can be made without departing from the scope of the present invention.

The present invention is applicable to data processing apparatuses that involve large amounts of vector or matrix computation, such as image or video processing platforms and deep learning platforms.

100 Data processing system
110 Central processing unit (CP)
120 Vector processor (VP)
121 Processing element (PE)
125 Dedicated register
130 Direct memory access (DMA)
140 Matrix transfer device
141 Address generator
142 Bank number generator
150 First local memory
153 First data selection unit
160 Second local memory
161 RAM bank
162 Second data selection unit
163 Data ring
170 External memory
180 Instruction memory

Claims

1. A data processing system comprising:
a central processing unit;
a vector processing unit electronically connected to the central processing unit and configured to perform operations based on instructions received from the central processing unit;
an instruction memory unit electronically connected to the central processing unit and configured to store instructions;
an external memory unit;
a first local memory unit electronically connected to the central processing unit and configured to store one-dimensional systolic data;
a second local memory unit electronically connected to the vector processing unit and configured to store matrix data;
a direct memory access unit electronically connected to the first local memory unit, the second local memory unit, the instruction memory unit, and the external memory unit and configured to access data in the external memory unit; and
a matrix transfer unit electronically connected between the first and second local memory units and configured to transfer data between the first and second local memory units,
wherein data is transferred through the direct memory access unit with timing based on a predetermined selection priority.

2. The data processing system of claim 1, wherein the matrix transfer unit is capable of performing matrix transposition on the data when transferring the data between the first and second local memory units.

3. The data processing system of claim 2, wherein the matrix transfer unit comprises an address generator configured to generate memory addresses for a source memory and a destination memory, the source memory being one of the first local memory unit and the second local memory unit, and the destination memory being the other of the first local memory unit and the second local memory unit.

4. The data processing system according to any one of claims 1 to 3, wherein the second local memory unit has a data ring configured to ring-broadcast data, and, when multiple memory requests are received, the data ring transfers data to the first local memory unit and to the external memory unit via the direct memory access unit using the predetermined selection priority.

5. The data processing system according to any one of claims 1 to 4, wherein the first and second local memory units each include a plurality of 2-port RAM banks in which data is stored, and the number of 2-port RAM banks in the first local memory unit is equal to the number of 2-port RAM banks in the second local memory unit.

6. A method for a data processing system, the data processing system comprising: a central processing unit; a vector processing unit electronically connected to the central processing unit; an instruction memory unit; an external memory unit; a first local memory unit electronically connected to the central processing unit; a second local memory unit electronically connected to the vector processing unit; a direct memory access unit electronically connected to the first local memory unit, the second local memory unit, the instruction memory unit, and the external memory unit; and a matrix transfer unit electronically connected between the first and second local memory units and configured to transfer data between the first and second local memory units, the method comprising:
performing, by the vector processing unit, operations based on instructions received from the central processing unit;
storing instructions by the instruction memory unit;
storing one-dimensional systolic data by the first local memory unit;
storing matrix data by the second local memory unit;
accessing data in the external memory unit by the direct memory access unit; and
performing, by the matrix transfer unit, matrix transposition on the data when transferring the data between the first and second local memory units,
wherein data is transferred through the direct memory access unit with timing based on a predetermined selection priority.

7. A program for a data processing system, the data processing system comprising: a central processing unit; a vector processing unit electronically connected to the central processing unit; an instruction memory unit; an external memory unit; a first local memory unit electronically connected to the central processing unit; a second local memory unit electronically connected to the vector processing unit; a direct memory access unit electronically connected to the first local memory unit, the second local memory unit, the instruction memory unit, and the external memory unit; and a matrix transfer unit electronically connected between the first and second local memory units and configured to transfer data between the first and second local memory units, the program comprising:
causing the vector processing unit to perform operations based on instructions received from the central processing unit;
causing the instruction memory unit to store instructions;
causing the first local memory unit to store one-dimensional systolic data;
causing the second local memory unit to store matrix data;
causing the direct memory access unit to access data in the external memory unit; and
causing the matrix transfer unit to perform matrix transposition on the data when transferring the data between the first and second local memory units,
wherein data is transferred through the direct memory access unit with timing based on a predetermined selection priority.