JP7492511B2 - Streaming Platform Flow and Architecture

Info

Publication number: JP7492511B2
Application number: JP2021524028A
Authority: JP
Inventors: チャンドラセカールエス．スヤマゴンドル，; ラヴィエヌ．クーラグンダ，; ケネスケー．チャン，; ラヴィスンカバリ，; ヘムシー．ニーマ，; カレンシエ，; ソナルサンタン，; リジホウ，
Original assignee: Xilinx Inc
Current assignee: Xilinx Inc
Priority date: 2018-11-09
Filing date: 2019-11-05
Publication date: 2024-05-29
Anticipated expiration: 2039-11-05
Also published as: KR20210088653A; EP3877864A1; CN112970010A; WO2020097013A1; CN112970010B; JP2022506592A

Description

Reservation of Rights in Copyrighted Material
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

The present disclosure relates to integrated circuits (ICs) and, more particularly, to using data streams for communication between a host system and hardware accelerated circuitry and for communication between kernel circuits of the hardware accelerated circuitry.

Hardware acceleration refers to implementing the functionality of a portion of program code in hardware or circuitry. Hardware accelerated program code is functionally equivalent to the original program code. Rather than a processor executing a compiled version of the program code, such as an executable binary, the program code is implemented as circuitry configured to provide the same functionality as the executable binary. The hardware accelerated version of the program code usually provides improved performance compared to executing the program code using a processor. In some cases, the program code is compiled into a circuit design that is implemented within a programmable IC.

These and other implementations may each optionally include one or more, or all, of the following features, alone or in combination.

In one or more embodiments, a system includes a host system and an IC coupled to the host system through a communication interface. The IC is configured for hardware acceleration. The IC includes a direct memory access circuit coupled to the communication interface, a kernel circuit, and a stream traffic manager circuit coupled to the direct memory access circuit and the kernel circuit. The stream traffic manager circuit is configured to control data streams exchanged between the host system and the kernel circuit.

In one aspect, the host system and the IC communicate by exchanging packetized data.

In another aspect, the IC includes interconnect circuitry connecting the stream traffic manager circuit and the kernel circuit.

In another aspect, the kernel circuit is one of a plurality of kernel circuits, and the stream traffic manager circuit is configured to interleave the data streams provided to the plurality of kernel circuits.

In another aspect, the IC includes an input buffer coupled to the interconnect circuitry and the kernel circuit. The input buffer is configured to temporarily hold packetized data from the stream traffic manager circuit and convert the packetized data into a data stream provided to the kernel circuit. The IC further includes an output buffer coupled to the interconnect circuitry and the kernel circuit. The output buffer is configured to temporarily hold a data stream output from the kernel circuit and convert the data stream into packetized data.
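As an illustrative and non-limiting sketch, the conversions performed by the input buffer and the output buffer can be modeled in software as follows. The `Packet` fields, the `max_payload` parameter, and both function names are assumptions made for illustration only; the disclosure does not specify a packet format.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    stream_id: int   # hypothetical field identifying the target kernel circuit
    payload: bytes   # raw data carried by the packet


def depacketize(packets):
    """Input-buffer role: drop packet framing and yield a flat byte stream."""
    for pkt in packets:
        yield from pkt.payload  # iterating bytes yields one int per byte


def packetize(stream, stream_id, max_payload=4):
    """Output-buffer role: group a byte stream into packets of max_payload bytes."""
    chunk = bytearray()
    for byte in stream:
        chunk.append(byte)
        if len(chunk) == max_payload:
            yield Packet(stream_id, bytes(chunk))
            chunk.clear()
    if chunk:  # flush a final, shorter packet
        yield Packet(stream_id, bytes(chunk))
```

Applying `packetize` and then `depacketize` returns the original bytes, which mirrors the round trip of data through an output buffer on one side and an input buffer on the other.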

In another aspect, the host system includes a processor coupled to a memory, and the processor is configured to implement, in the memory, a write queue corresponding to the input buffer and a read queue corresponding to the output buffer. The write queue stores descriptors specifying data to be streamed to the input buffer. The read queue stores descriptors specifying data to be streamed from the output buffer to the memory of the host system.
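A minimal software sketch of such a descriptor queue is shown below, assuming a bounded first-in, first-out queue. The `Descriptor` fields (`address`, `length`) and the `post`/`fetch` method names are hypothetical; the disclosure states only that descriptors specify the data to be streamed.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Descriptor:
    address: int  # hypothetical: start of the data region in host memory
    length: int   # hypothetical: number of bytes to stream


class DescriptorQueue:
    """Bounded FIFO of descriptors kept in host memory (write or read queue)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = deque()

    def post(self, desc):
        """Host side: add a descriptor; returns False when the queue is full."""
        if len(self._entries) >= self.capacity:
            return False
        self._entries.append(desc)
        return True

    def fetch(self):
        """Consumer side: take the oldest descriptor, or None when empty."""
        return self._entries.popleft() if self._entries else None
```

One such queue per direction would correspond to the write queue (host to input buffer) and the read queue (output buffer to host).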

In another aspect, the host system is configured to send packetized data with in-band instructions to the kernel circuit.
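By way of a hedged example, in-band instructions can be pictured as packets of a distinct type that travel in the same byte stream as the data. The 3-byte header layout (1-byte type, 2-byte big-endian length) and the type constants below are assumptions for illustration and are not taken from the disclosure.

```python
import struct

# Hypothetical header: 1-byte packet type, 2-byte big-endian payload length.
HEADER = struct.Struct(">BH")
TYPE_DATA = 0
TYPE_INSTRUCTION = 1


def encode(ptype, payload):
    """Prefix a payload with the hypothetical header."""
    return HEADER.pack(ptype, len(payload)) + payload


def decode_stream(buf):
    """Yield (type, payload) pairs in the order they appear in the stream."""
    offset = 0
    while offset < len(buf):
        ptype, length = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        yield ptype, buf[offset:offset + length]
        offset += length
```

Because the instruction packet precedes the data packet in the same stream, the receiving kernel circuit sees the command and the data in order, without a separate control interface.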

In one or more embodiments, a method includes: selecting, using computer hardware, a container file that includes a configuration bitstream specifying a kernel circuit and metadata about the kernel circuit; extracting, using the computer hardware, the configuration bitstream from the container file and loading the configuration bitstream into an IC to implement the kernel circuit within the IC; and determining, using the computer hardware, pipe properties from the metadata, where the pipe properties specify settings for streaming data from a host system to the kernel circuit. The method can also include implementing a data transfer from the host system directly to the kernel circuit as packetized data that is converted into a data stream and provided to the kernel circuit using the settings specified by the pipe properties.
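The extraction and pipe-property steps can be sketched as follows. The container layout used here (a 4-byte big-endian metadata length, JSON metadata, then the raw bitstream) and the metadata keys `kernels` and `pipe` are purely hypothetical stand-ins; the disclosure does not define the container format, and loading the bitstream into the IC is outside the scope of the sketch.

```python
import json


def load_container(blob):
    """Split a hypothetical container into (bitstream, metadata).

    Assumed layout: 4-byte big-endian metadata length, UTF-8 JSON
    metadata, then the configuration bitstream.
    """
    meta_len = int.from_bytes(blob[:4], "big")
    metadata = json.loads(blob[4:4 + meta_len].decode("utf-8"))
    bitstream = blob[4 + meta_len:]
    return bitstream, metadata


def pipe_properties(metadata, kernel_name):
    """Look up the streaming settings recorded for one kernel circuit."""
    return metadata["kernels"][kernel_name]["pipe"]
```

In this sketch the host would pass the returned bitstream to whatever mechanism configures the IC, and use the returned pipe properties as the settings for the subsequent streaming transfer.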

In one aspect, the method includes implementing a further data transfer from the kernel circuit directly to the host system as a further data stream specifying results.

In another aspect, implementing the further data transfer includes determining whether a write queue located in the host system and corresponding to the kernel circuit has space to receive a complete packet of data and, in response to determining that the write queue has space, initiating the data transfer from the kernel circuit to the host system.

In another aspect, the method includes sending the settings to a stream traffic manager circuit within the IC, where the stream traffic manager circuit implements the settings to stream data between the host system and the kernel circuit.

In another aspect, the method includes including instructions for the kernel circuit in-band within the data stream.

In another aspect, the method includes determining that the data transfer is to be implemented as a data stream based on a data type used by a user application requesting the data transfer.

In another aspect, implementing the data transfer includes determining whether an input buffer coupled to the kernel circuit within the IC has space to receive the data and, in response to determining that the input buffer has space, initiating the data transfer to the kernel circuit.

In one or more embodiments, an IC includes a communication interface coupled to a host system, a direct memory access circuit coupled to the communication interface, a kernel circuit implemented using programmable circuitry, and a stream traffic manager circuit coupled to the direct memory access circuit and the kernel circuit. The stream traffic manager circuit is configured to control data streams exchanged between the host system and the kernel circuit.

In one aspect, the IC includes a first interconnect configured to receive packetized data from the stream traffic manager circuit and deliver the packetized data to the kernel circuit, and a second interconnect configured to receive data from the kernel circuit and provide the data to the stream traffic manager circuit.

In another aspect, the IC includes an input buffer coupled to an output port of the first interconnect and to an input port of the kernel circuit. The input buffer is configured to temporarily store packetized data, convert the packetized data into a data stream, and provide the data stream to the kernel circuit. The stream traffic manager circuit initiates a data transfer to the kernel circuit in response to determining that the input buffer has available space.

In another aspect, the IC includes an output buffer coupled to an output port of the kernel circuit and to an input port of the stream traffic manager circuit. The output buffer is configured to temporarily store a data stream output from the kernel circuit, convert the data stream into packetized data, and provide the packetized data to the second interconnect. The stream traffic manager circuit initiates a data transfer from the kernel circuit to the host system in response to determining that a buffer in the host system corresponding to the output buffer has available space and that the output buffer contains at least one complete packet.
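The two conditions under which the stream traffic manager circuit initiates a transfer can be expressed as simple predicates. This is a sketch only: the function names are hypothetical, and treating "available space" as "room for one packet of `packet_size` bytes" is an assumption rather than something the disclosure specifies.

```python
def can_start_host_to_kernel(input_buffer_free, packet_size):
    """Host-to-kernel: start only when the kernel's input buffer has room.

    Assumption: 'available space' means room for one whole packet.
    """
    return input_buffer_free >= packet_size


def can_start_kernel_to_host(host_buffer_free, complete_packets, packet_size):
    """Kernel-to-host: requires both conditions named in the text.

    1. The output buffer holds at least one complete packet.
    2. The corresponding host buffer has space (assumed: one packet).
    """
    return complete_packets >= 1 and host_buffer_free >= packet_size
```

Gating transfers on the receiver's free space in this way means a packet is never sent to a buffer that cannot accept it, so data does not need to be dropped or staged in off-chip memory.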

In another aspect, the kernel circuit is one of a plurality of kernel circuits implemented in the programmable circuitry. The stream traffic manager circuit is coupled to each of the plurality of kernel circuits and is configured to interleave data streams exchanged with the plurality of kernel circuits.

In another aspect, each kernel circuit of the plurality of kernel circuits is coupled to the stream traffic manager circuit through a buffer and an interconnect. The stream traffic manager circuit implements a round-robin arbitration scheme for streaming data to each of the plurality of kernel circuits based on the space availability of the buffer corresponding to each respective kernel circuit.
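One possible reading of round-robin arbitration based on buffer space availability is sketched below. The function name, the list-based bookkeeping, and the rule that a kernel is eligible only when its buffer can hold its next pending packet are assumptions for illustration; the disclosure does not give the arbitration logic in detail.

```python
def round_robin_pick(free_space, pending, start):
    """Return the index of the next kernel circuit to serve, or None.

    free_space[i]: free bytes in the buffer of kernel i
    pending[i]:    size of the next packet waiting for kernel i (0 if none)
    start:         index at which to begin the search (one past the
                   kernel served last, for round-robin fairness)
    """
    n = len(free_space)
    for offset in range(n):
        i = (start + offset) % n
        # Skip kernels with nothing pending or with a buffer that is too full.
        if pending[i] and free_space[i] >= pending[i]:
            return i
    return None
```

A caller would serve the returned kernel and pass `(index + 1) % n` as `start` on the next call, so that a kernel with a full buffer is skipped without stalling the streams destined for the other kernels.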

In one or more embodiments, an IC includes a first kernel circuit implemented in programmable circuitry, a second kernel circuit implemented in the programmable circuitry, and a stream traffic manager circuit coupled to the first kernel circuit and the second kernel circuit. The stream traffic manager circuit is configured to control data streams exchanged between the first kernel circuit and the second kernel circuit.

In one aspect, a selected data stream sent from the first kernel circuit to the second kernel circuit includes in-band instructions for the second kernel circuit.

In another aspect, the first kernel circuit is coupled to a first interconnect through a first input buffer and a first output buffer, the second kernel circuit is coupled to a second interconnect through a second input buffer and a second output buffer, and the first interconnect and the second interconnect are coupled to the stream traffic manager circuit.

In another aspect, the stream traffic manager circuit is configured to provide a selected data stream from a host system coupled to the integrated circuit directly to the first kernel circuit or to the second kernel circuit, and to provide a result data stream from the first kernel circuit or the second kernel circuit to the host system.

In another aspect, the selected data stream includes in-band instructions for the first kernel circuit or the second kernel circuit.

In another aspect, the first kernel circuit is located on a first die of the integrated circuit and the second kernel circuit is located on a second die of the integrated circuit.

In another aspect, the stream traffic manager circuit is located on the first die.

In another aspect, the IC includes an input buffer coupled to an input port of the second kernel circuit in the second die and configured to temporarily store data streamed to the second kernel circuit, and an output buffer coupled to an output port of the first kernel circuit in the first die and configured to temporarily store data output from the first kernel circuit. The stream traffic manager circuit is configured to initiate a data transfer from the first kernel circuit to the second kernel circuit in response to determining that the input buffer has available space and that the output buffer is storing data.

In another aspect, the IC includes an input buffer coupled to an input port of the first kernel circuit in the first die and configured to temporarily store data streamed to the first kernel circuit, and an output buffer coupled to an output port of the second kernel circuit in the second die and configured to temporarily store data output from the second kernel circuit. The stream traffic manager circuit is configured to initiate a data transfer from the second kernel circuit to the first kernel circuit in response to determining that the input buffer has available space and that the output buffer is storing data.

In one or more embodiments, a system includes a first IC and a second IC. The first IC has a first plurality of kernel circuits, a stream traffic manager circuit configured to control data streams exchanged between different kernel circuits of the first plurality of kernel circuits, and a first transceiver. The second IC has a second plurality of kernel circuits, a satellite stream traffic manager circuit configured to control data streams exchanged between different kernel circuits of the second plurality of kernel circuits, and a second transceiver coupled to the first transceiver. The stream traffic manager circuit and the satellite stream traffic manager circuit are configured to exchange data streams passed between a selected kernel circuit of the first plurality of kernel circuits and a selected kernel circuit of the second plurality of kernel circuits.

In one aspect, the first plurality of kernel circuits are located on different dies of the first IC and the second plurality of kernel circuits are located on different dies of the second IC.

In another aspect, the data stream exchanged between the selected kernel circuit of the first plurality of kernel circuits and the selected kernel circuit of the second plurality of kernel circuits includes in-band instructions for the second kernel circuit.

In another aspect, the stream traffic manager circuit is configured to provide a selected data stream from a host system coupled to the first IC directly to a selected kernel circuit of the first plurality of kernel circuits or to a selected kernel circuit of the second plurality of kernel circuits, and to provide a result data stream from the selected kernel circuit of the first plurality of kernel circuits or the selected kernel circuit of the second plurality of kernel circuits to the host system.

In another aspect, the first IC includes an interconnect coupled to the stream traffic manager circuit and the first plurality of kernel circuits, and the second IC includes an interconnect coupled to the satellite stream traffic manager circuit and the second plurality of kernel circuits.

In another aspect, the stream traffic manager circuit and the satellite stream traffic manager circuit are configured to exchange a data stream in response to determining that an input buffer of the receiving kernel circuit has available space.

In another aspect, the first IC includes a first plurality of dies, and the first plurality of kernel circuits are distributed across the first plurality of dies. Each die includes an interconnect coupled to the stream traffic manager circuit and to the particular kernel circuits of the first plurality of kernel circuits located in that die.

In one or more embodiments, a method includes: monitoring, by a stream traffic manager circuit, output buffers of kernel circuits for packets, where the kernel circuits are implemented in programmable circuitry of at least one IC; in response to detecting that the output buffer of a sending kernel circuit is storing a packet, determining, by the stream traffic manager circuit, a receiving kernel circuit for the packet; determining, by the stream traffic manager circuit, whether an input buffer of the receiving kernel circuit has space available to store the packet; and, in response to determining that the input buffer has space available to store the packet, initiating, by the stream traffic manager circuit, a stream data transfer from the output buffer of the sending kernel circuit to the input buffer of the receiving kernel circuit.
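The four steps of this method (monitor, determine the receiver, check space, initiate the transfer) can be modeled in software as a single monitoring pass. The `KernelPort` class, its attributes, and the representation of a packet as a `(destination_name, payload)` tuple are hypothetical simplifications; buffer capacity is counted in packets rather than bytes for brevity.

```python
from collections import deque


class KernelPort:
    """Software stand-in for one kernel circuit's input and output buffers."""

    def __init__(self, name, input_capacity):
        self.name = name
        self.input_capacity = input_capacity  # maximum packets held in input
        self.input = deque()    # payloads waiting to be consumed
        self.output = deque()   # (destination_name, payload) tuples


def service_once(kernels):
    """One pass of the stream traffic manager over all output buffers.

    Returns the list of (sender, receiver) transfers that were started.
    """
    by_name = {k.name: k for k in kernels}
    started = []
    for src in kernels:
        if not src.output:                       # step 1: monitor for a packet
            continue
        dest_name, payload = src.output[0]       # step 2: determine the receiver
        dst = by_name[dest_name]
        if len(dst.input) < dst.input_capacity:  # step 3: check for space
            src.output.popleft()                 # step 4: start the transfer
            dst.input.append(payload)
            started.append((src.name, dst.name))
    return started
```

When the receiver's input buffer is full, the packet simply stays in the sender's output buffer until a later pass, which is how the sketch reflects the space check described above.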

In one aspect, the stream data transfer is performed without involvement of a host system.

In another aspect, the stream data transfer includes in-band instructions that control operation of the receiving kernel circuit.

This Summary section is provided merely to introduce certain concepts and not to identify any key or essential features of the claimed subject matter. Other features of the inventive arrangements will be apparent from the accompanying drawings and from the following detailed description.

The inventive arrangements are illustrated by way of example in the accompanying drawings. The drawings, however, should not be construed as limiting the inventive arrangements to only the particular implementations shown. Various aspects and advantages will become apparent upon review of the following detailed description and upon reference to the drawings.

FIG. 1 illustrates an exemplary architecture for hardware acceleration.
FIG. 2 illustrates another exemplary implementation of the architecture of FIG. 1.
FIG. 3 illustrates an exemplary method of transferring data between a host system and kernel circuits of a hardware accelerator using data streams.
FIG. 4 illustrates an exemplary architecture for exchanging data between kernel circuits using data streams.
FIG. 5 illustrates an exemplary method of exchanging data between kernel circuits using data streams.
FIG. 6 illustrates an exemplary system for use with one or more embodiments described herein.
FIG. 7 illustrates an exemplary architecture for an IC.

While the disclosure concludes with claims defining novel features, it is believed that the various features described within this disclosure will be better understood from a consideration of the description in conjunction with the drawings. The process(es), machine(s), manufacture(s), and any variations thereof described herein are provided for purposes of illustration. Specific structural and functional details described within this disclosure are not to be interpreted as limiting, but merely as a basis for the claims and as a representative basis for teaching one skilled in the art to variously employ the features described in virtually any appropriately detailed structure. Further, the terms and phrases used within this disclosure are not intended to be limiting, but rather to provide an understandable description of the features described.

This disclosure relates to ICs and, more particularly, to using data streams for communication between a host system and hardware accelerated circuitry and for communication between kernel circuits of the hardware accelerated circuitry. An IC implements the hardware accelerated circuitry as one or more kernel circuits. For example, each kernel circuit represents hardware accelerated program code. The host system can offload one or more tasks to the kernel circuits implemented within the IC. In offloading tasks to the kernel circuits, the host system transfers the data to be operated on by the kernel circuits using an architecture that supports data streams. The kernel circuits can also exchange data with one another using the data stream-enabled architecture. The kernel circuits further transfer data, e.g., results, to the host system as data streams that are packetized before being sent to the host system.

In conventional systems, when offloading a task to a kernel circuit, the host system initiates a data transfer to the kernel circuit by way of a random access memory (RAM) coupled to the IC that implements the kernel circuit. The RAM is located on the same circuit board as the kernel circuit (e.g., an accelerator card), but is not in the same IC. Once the data has been transferred to the RAM, the host system notifies the kernel circuit that the data is ready for use. This means that the kernel circuit cannot begin operating on the data until the data transfer to the RAM is complete. Instructions provided from the host system to the kernel circuit are provided separately from the data, e.g., out-of-band. For example, commands are provided to the kernel circuit over a different physical interface than the one used to convey the data.

In a conventional system, once notified that the data is available, the kernel circuit reads the data from the RAM, processes the data, and writes the results to the RAM. When the kernel circuit has finished writing the results to the RAM, the kernel circuit notifies the host system that the results are available. The host system then retrieves the results from the RAM.

In accordance with the inventive arrangements described within this disclosure, data is exchanged between the host system and the kernel circuits using data streams and packetization. Data originating from the host system is sent directly to the kernel circuit. Similarly, data originating from the kernel circuit is sent directly to the host system. As an illustrative and non-limiting example, a data transfer from the host system to the kernel circuit flows directly from the host system to the kernel circuit. The data transferred from the host system is not first stored and accumulated in off-chip RAM and then read by the kernel circuit. Similarly, results transferred from the kernel circuit to the host system are not first stored and accumulated in off-chip RAM before being provided to the host system. Instead, the data flows directly from the kernel circuit to the host system. Streaming is performed over a data path within the IC that uses one or more smaller, internal memory buffers. The memory buffers are, for example, smaller in size than the amount of data exchanged between the host system and the kernel circuit.

The streaming architecture described within this disclosure facilitates faster data transfers, lower latency, and more efficient use of memory compared to conventional systems. For example, a kernel circuit can begin operating on data as soon as it has received less than the entirety of the data, rather than waiting for all of the data to first be transferred to off-chip RAM and then loaded into the kernel circuit. This improves the speed and latency of the overall system. Similar gains in speed and latency are obtained by streaming data from the kernel circuit to the host system. With the streaming architecture, commands from the host system to the kernel circuit are included within the data stream itself, e.g., in-band, which further reduces system latency. Because the internal memory of the IC is used more efficiently, less off-chip RAM is needed, which reduces the power requirements of the system and/or the hardware accelerator.
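The latency argument can be made concrete with a deliberately simplified model, offered here as an assumption-laden sketch rather than a measurement. It compares only the time before the kernel circuit can start working, assumes a constant link rate, and ignores protocol overhead, notification cost, and RAM read time. The function and parameter names are hypothetical.

```python
def time_to_first_work_buffered(total_bytes, bytes_per_second):
    """Conventional flow: the kernel starts only after all data is in RAM."""
    return total_bytes / bytes_per_second


def time_to_first_work_streamed(packet_bytes, bytes_per_second):
    """Streaming flow: the kernel starts once the first packet has arrived."""
    return packet_bytes / bytes_per_second
```

With an assumed 1,000,000-byte transfer over a 1,000,000 byte/s link and 4,000-byte packets, the buffered model gives 1 s before the kernel can begin, while the streamed model gives 4 ms. Real figures depend on factors this toy model leaves out.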

In particular embodiments, the kernel circuits can also communicate with one another using data streams. The benefits described in connection with communication between the host system and the kernel circuits are likewise achieved by using data streams for communication among the kernel circuits. Further, by including streaming infrastructure within the programmable IC(s), the kernel circuits can exchange data with one another using less complex infrastructure, e.g., infrastructure that does not require direct point-to-point communication links between the kernel circuits that are intended to communicate with each other.

Further aspects of the inventive arrangements are described below in greater detail with reference to the figures. For simplicity and clarity of illustration, elements shown in the figures are not necessarily drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numbers are repeated among the figures to indicate corresponding, analogous, or like features.

FIG. 1 illustrates an example architecture 100 for hardware acceleration. Architecture 100 includes a host system 102 and a hardware accelerator 103. Host system 102 is implemented as a computer system, such as a server or other data processing system. Hardware accelerator 103 is implemented as a circuit board having an IC 104 and a memory 106 attached to IC 104. For example, hardware accelerator 103 may be implemented as an accelerator card having an edge connector that can be inserted into an available peripheral slot of host system 102.

Although the example of FIG. 1 is described using memory (e.g., RAM) external to IC 104, the embodiments described herein relating to streaming data are also valid and applicable where IC 104 includes sufficient on-chip memory that memory 106 is not required. Where IC 104 includes sufficient on-chip or same-die memory, problems similar to those involving external memory arise when the entirety of the data must be transferred to memory before the kernel circuit is permitted to operate on the data. Using internal memory is faster than using external memory, but problems such as increased latency, the need for increased storage capacity (memory), and synchronization still arise. These problems are overcome by the streaming-enabled embodiments described herein.

In one or more embodiments, IC 104 is implemented as a programmable IC. In particular embodiments, IC 104 is implemented using an architecture that is the same as or similar to the architecture described in connection with FIG. 7. In the example of FIG. 1, IC 104 includes an endpoint 108, a direct memory access circuit (DMA) 110, a kernel circuit 112, and a memory controller 114. Endpoint 108 is an interface capable of communicating with host system 102 over a communication bus. As an illustrative and non-limiting example, the communication bus may be implemented as a Peripheral Component Interconnect Express (PCIe) bus, in which case endpoint 108 may be implemented as a PCIe endpoint. It should be appreciated, however, that other communication buses may be used and that the examples provided are not intended as limitations. Accordingly, endpoint 108 may be implemented as any of a variety of interfaces suitable for communicating over the communication bus.

Endpoint 108 is coupled to DMA 110. DMA 110 is also coupled to kernel circuit 112 and to memory controller 114 (abbreviated as "MC" in FIG. 1). In particular embodiments, DMA 110 includes two independent channels that support bidirectional communication with endpoint 108 and with kernel circuit 112. In the example of FIG. 1, DMA 110 is coupled to kernel circuit 112 through one or more interfaces 116. Host system 102 is therefore capable of transferring data to kernel circuit 112 via endpoint 108 and DMA 110 as packetized data that is converted into one or more data streams before being provided to kernel circuit 112. Similarly, kernel circuit 112 is capable of transferring data to host system 102 by outputting a data stream that is packetized before being provided to host system 102 via DMA 110 and endpoint 108. Further details relating to the transfer of data are described in connection with FIG. 2. In general, one data stream, whether it originates at host system 102 or at kernel circuit 112, is converted into a plurality of packets. Depending on the size of the data stream (e.g., where the data stream conveys a smaller amount of data), the data stream may instead be converted into a single packet.

One example of interface 116 is a stream-enabled on-chip interconnect such as an Advanced Microcontroller Bus Architecture (AMBA®) Advanced Extensible Interface (AXI) stream interconnect. An AXI stream interconnect enables the connection of heterogeneous master/slave circuit blocks that comply with the AMBA® AXI stream protocol. Interface 116 is capable of routing connections that convey packetized data from one or more masters to one or more slaves. AXI is provided for purposes of illustration and not limitation. It should be appreciated that interface 116 may be implemented as any of a variety of interconnects. For example, interface 116 may be implemented as a bus, a network-on-chip (NoC), a crossbar, a switch, or another type of interconnect.

In one or more embodiments, memory controller 114 is coupled to memory 106. Memory 106 is implemented as a RAM. Memory controller 114 may be multi-ported and is coupled to DMA 110 and to kernel circuit 112. Memory controller 114 is capable of accessing (e.g., reading and/or writing) memory 106 under control of DMA 110 and/or kernel circuit 112. For example, DMA 110 is coupled to memory controller 114 through a memory-mapped interface 118. Similarly, kernel circuit 112 is coupled to memory controller 114 through a memory-mapped interface 120. DMA 110 is coupled to kernel circuit 112 via a control interface 122. In one or more embodiments, control interface 122 is implemented as an AXI-Lite interface configured to provide point-to-point bidirectional communication with a circuit block. AXI-Lite may be used as the control interface for kernel circuit 112. As noted, AXI is provided for purposes of illustration and not limitation.

Using memory-mapped interfaces 118 and 120 and control interface 122, the architecture shown in FIG. 1 is also capable of supporting data transfers between host system 102 and kernel circuit 112 by way of memory 106. For example, host system 102 sends data to memory 106. The data may be provided to DMA 110, which stores the data within memory 106 using memory controller 114. As previously described, the data is accumulated and stored in memory 106 until the data transfer is complete. Host system 102 may notify kernel circuit 112 through control interface 122 that the data is available in memory 106. Kernel circuit 112 is capable of accessing memory controller 114 to read the data from memory 106. Kernel circuit 112 generates results and stores the results within memory 106. Kernel circuit 112 then notifies host system 102 through control interface 122 that the results are available in memory 106.

In examples where data is transferred using memory 106 to kernel circuit 112, or to a plurality of kernel circuits implemented in IC 104, host system 102 is responsible for allocating and sharing memory 106 among the various kernel circuits. Host system 102 configures and starts the kernel circuits through control interface 122. Control interface 122, however, tends to be a slower interface with significant latency. In addition to having to communicate with the kernel circuits through control interface 122, host system 102 must also manage and synchronize kernel circuit operation, which adds significant overhead to host system 102. For example, host system 102 must synchronize data transfers with control signals in order to start and/or stop the kernel circuits at the appropriate time(s).

As noted, in other embodiments, IC 104 includes sufficient memory resources that memory 106 is implemented as internal memory within IC 104. In that case, the circuit blocks implemented in IC 104 are capable of accessing the internal memory using interface circuitry within IC 104, and memory controller 114 may therefore be omitted.

In one or more embodiments, architecture 100 is implemented to support direct communication between host system 102 and kernel circuit 112 via packetized data and data streams. In that case, the memory-mapped communication capabilities may be omitted. For example, control interface 122, memory-mapped interfaces 118 and 120, and memory controller 114 may be omitted, such that memory 106 may be omitted as well. In one or more other embodiments, however, architecture 100 is implemented to support both memory-mapped communication involving memory 106 and direct communication using packetized data and data streams. For example, DMA 110 may support both types of data transfer. Further, although a single kernel circuit is shown in the example of FIG. 1, a plurality of kernel circuits may be implemented, where some kernel circuits use direct communication with host system 102 via data streams and other kernel circuits use memory 106 for data transfers with host system 102. In still other embodiments, a kernel circuit may be implemented to use either direct communication via data streams or memory 106 for data transfers, depending upon the particular application executed by host system 102 that invokes the kernel circuit or upon the particular function that the application calls to communicate with the kernel circuit.

Architecture 100 and the other streaming architectures described herein provide a more efficient way to configure and manage kernel circuits. In particular embodiments, instructions may be provided to a kernel circuit in-band, together with the data payload of a data stream. Including instructions with the data, e.g., placing the instructions "in-band," removes the need for control interface 122 when data streams are used and provides more efficient host system-to-kernel circuit communication.

Host system 102 is capable of executing a software framework that includes one or more user applications, such as a memory-mapped user application 124 and/or a stream user application 126. Memory-mapped user application 124 is an application executed by host system 102 that is configured to invoke a kernel circuit, such as kernel circuit 112, and to exchange data with kernel circuit 112 using memory-mapped interfaces 118 and 120, control interface 122, and memory 106. Stream user application 126 is an application executed by host system 102 that is configured to invoke a kernel circuit, such as kernel circuit 112, and to exchange data with kernel circuit 112 using streaming interface 116.

The software framework also includes a runtime 128. Runtime 128 provides functions, e.g., an application programming interface (API), for communicating with IC 104. For example, runtime 128 is capable of providing functions for implementing DMA transfers over PCIe. In one or more embodiments, runtime 128 is capable of providing support for streaming data between kernel circuit 112 and host system 102 using interface 116. In one or more other embodiments, runtime 128 is capable of providing support for transferring data between kernel circuit 112 and host system 102 using memory 106, memory-mapped interfaces 118 and 120, and control interface 122. As an illustrative example, runtime 128 is capable of supporting execution of memory-mapped user application 124 together with the transfer of data to and from kernel circuit 112 via memory 106, and/or execution of stream user application 126 together with the transfer of data to and from kernel circuit 112 via interface 116.

A driver 130 is capable of controlling an endpoint (not shown) within host system 102. In the case of a PCIe connection, for example, the endpoint within host system 102 is implemented as a root complex. Driver 130 is accordingly capable of implementing and managing a plurality of read queues and write queues that store descriptors controlling data transfers between host system 102 and IC 104.

In one or more embodiments, driver 130 is capable of dividing a request to transfer a large amount of data to a kernel circuit (e.g., a transfer of streamed data) into multiple stream transfers of smaller chunks of data called packets. This division of the data, or "packetization," as performed by driver 130 is largely hidden from kernel circuit 112. Packetization allows the interconnect fabric implemented in IC 104 to service multiple kernel circuits concurrently by interleaving packets destined for and/or originating from different kernel circuits. Driver 130 is capable of choosing a packet size that is large enough to amortize the packetization overhead efficiently, yet not so large that a packet causes a kernel circuit to stall while waiting its turn to send and/or receive streamed data while other kernel circuits are transferring streamed data.
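The packetization and interleaving described above can be sketched as follows. This is a minimal illustration only, not the driver's actual implementation: the function names `packetize` and `interleave`, the fixed packet size, and the one-packet-per-kernel-per-round schedule are all assumptions introduced for this example.

```python
from itertools import zip_longest


def packetize(data: bytes, packet_size: int) -> list[bytes]:
    """Split one large transfer into packets of at most packet_size bytes."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]


def interleave(per_kernel: dict[int, list[bytes]]) -> list[tuple[int, bytes]]:
    """Interleave packets of several kernel circuits, one per kernel per round.

    Hypothetical schedule: the source only states that packets for different
    kernel circuits are interleaved, not how.
    """
    schedule = []
    kernel_ids = list(per_kernel.keys())
    for round_packets in zip_longest(*per_kernel.values()):
        for kernel_id, packet in zip(kernel_ids, round_packets):
            if packet is not None:  # this kernel has no more packets
                schedule.append((kernel_id, packet))
    return schedule
```

With a small packet size, a kernel circuit with a short transfer is served after at most one packet of every other kernel circuit, which is the stall-avoidance property the paragraph describes. A larger packet size reduces the number of headers and descriptors per byte transferred.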

As generally described, control interface 122 tends to be a slow connection that requires synchronization between the control signals conveyed to kernel circuit 112 and the data delivered to kernel circuit 112. For example, where control interface 122 is used for out-of-band signaling alongside data streams, the speed and/or synchronization requirements often result in the data stream(s) to the kernel circuit being halted so that the control signals can be changed before the data stream(s) are resumed.

As an illustrative and non-limiting example, consider the case where kernel circuit 112 implements an encryption operation. Different data payloads provided to kernel circuit 112 generally require different keys for encryption. If control interface 122 were used, the data stream to kernel circuit 112 would be halted, the key would be updated via control interface 122, and the data stream(s) would then be resumed. Such an operation would be coordinated by host system 102, which increases the overhead of host system 102. In one or more embodiments described herein, one or more instructions for kernel circuit 112 are provided in-band. New and/or updated keys may therefore be included within the data stream provided to kernel circuit 112. The instructions may be included with the payload or immediately before the payload. In particular embodiments, the instructions may be specified in a custom-defined header for each packet. In this example, host system 102 is capable of sending the encryption key as part of the packet header for the plaintext payload(s) of the one or more packets on which kernel circuit 112 is to operate. Kernel circuit 112 thus operates efficiently. Because kernel circuit 112 need not be halted and/or synchronized with control interface 122, the encryption key can be switched for different payloads without host system 102 incurring synchronization overhead and with reduced latency compared to conventional techniques for data transfer.
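The in-band key delivery described above can be sketched in software. This is an illustration under stated assumptions, not the patented packet format: the three-byte header layout (`HEADER`), the helper names `make_packet` and `kernel_process`, and the use of XOR as a stand-in for a real cipher are all hypothetical.

```python
import struct

# Hypothetical custom header: 1-byte key length, 2-byte payload length.
HEADER = struct.Struct("<BH")


def make_packet(key: bytes, payload: bytes) -> bytes:
    """Host side: place the key in the packet header, in-band with the payload."""
    if not 1 <= len(key) <= 0xFF or len(payload) > 0xFFFF:
        raise ValueError("key or payload length does not fit the header")
    return HEADER.pack(len(key), len(payload)) + key + payload


def kernel_process(stream: bytes) -> list[bytes]:
    """Kernel side: read each header, then transform the payload with that key.

    XOR is only a placeholder for the encryption operation.
    """
    results = []
    offset = 0
    while offset < len(stream):
        key_len, payload_len = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        key = stream[offset:offset + key_len]
        offset += key_len
        payload = stream[offset:offset + payload_len]
        offset += payload_len
        results.append(bytes(b ^ key[i % key_len] for i, b in enumerate(payload)))
    return results
```

Two packets with different keys can be concatenated into one stream and processed back to back. The key changes between payloads without any pause or separate control transaction, which is the point of carrying the instruction in-band.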

FIG. 2 illustrates another example implementation of architecture 100 of FIG. 1. FIG. 2 shows further aspects of architecture 100 that are not shown in the higher-level view described in connection with FIG. 1. For purposes of illustration, however, some elements shown in FIG. 1 are not shown in FIG. 2, such as selected elements of the software framework executed by host system 102, endpoint 108 and memory controller 114 within IC 104, and memory 106.

The example of FIG. 2 shows driver 130 of the software framework executed by host system 102. Driver 130 is capable of implementing a plurality of queues 202-1 through 202-8. Driver 130 is capable of creating a read queue and a write queue for each kernel circuit implemented within IC 104. For purposes of illustration, queues 202 configured as write queues are shaded, and queues 202 configured as read queues are not shaded. Because IC 104 implements four kernel circuits 234-1, 234-2, 234-3, and 234-4, driver 130 implements four write queues (e.g., 202-1, 202-3, 202-5, and 202-7) and four read queues (e.g., 202-2, 202-4, 202-6, and 202-8). Each of queues 202 is capable of storing one or more descriptors, where each descriptor describes a data transfer to be performed. Each descriptor stored in a write queue describes a data transfer from host system 102 to a kernel circuit 234, and each descriptor stored in a read queue describes a data transfer from a kernel circuit 234 to host system 102.
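The per-kernel queue arrangement described above might be modeled as in the sketch below. The class names `Descriptor`, `QueuePair`, and `Driver`, and the descriptor fields `host_address` and `length`, are assumptions for illustration. The source states only that each descriptor describes a data transfer to be performed.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    """Describes one data transfer (fields are hypothetical)."""
    host_address: int
    length: int


@dataclass
class QueuePair:
    """One write queue (host to kernel) and one read queue (kernel to host)."""
    write: deque = field(default_factory=deque)
    read: deque = field(default_factory=deque)


class Driver:
    """Creates one queue pair for each kernel circuit implemented in the IC."""

    def __init__(self, num_kernels: int) -> None:
        self.queues = {k: QueuePair() for k in range(1, num_kernels + 1)}

    def post_write(self, kernel: int, descriptor: Descriptor) -> None:
        # Queue a host-to-kernel transfer.
        self.queues[kernel].write.append(descriptor)

    def post_read(self, kernel: int, descriptor: Descriptor) -> None:
        # Queue a kernel-to-host transfer.
        self.queues[kernel].read.append(descriptor)
```

For the four kernel circuits of FIG. 2, `Driver(4)` yields four write queues and four read queues, matching the eight queues 202-1 through 202-8.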

As noted, DMA 110 includes two channels. A write channel supports the transfer of data from host system 102 to kernel circuits 234. The write channel includes a write circuit 204 and an arbitration circuit 206. Write circuit 204 is capable of storing commands and/or data received from host system 102 prior to forwarding the commands and/or data to kernel circuits 234. A read channel supports the transfer of data from kernel circuits 234 to host system 102. The read channel includes a read circuit 208 and an arbitration circuit 210. Read circuit 208 is capable of storing data received from kernel circuits 234 prior to forwarding the data to host system 102.

DMA 110 moves data between host memory (not shown) of host system 102 and buffers 218, 220, 222, 224, 226, 228, 230, and 232. DMA 110 fetches and maintains a list of addresses, e.g., descriptors, for every packet to be transferred and forms the sequence of commands and addresses for endpoint 108. In one or more embodiments, DMA 110 is highly configurable. Traffic management and flow control for DMA 110 are accordingly performed through a stream traffic manager 212. Stream traffic manager 212 effectively ensures that all kernel circuits 234 have fair access to DMA 110 for data transfers to and from host system 102.
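The passage above does not say how fair access is enforced. As one possible policy, offered purely as an assumption, a round-robin arbiter grants the DMA to each kernel circuit in turn and skips kernel circuits with nothing pending. The class name `RoundRobinArbiter` and its `grant` method are hypothetical.

```python
from collections import deque
from typing import Iterable, Optional, Set


class RoundRobinArbiter:
    """Grants access in rotating order (an assumed policy, not from the source)."""

    def __init__(self, kernel_ids: Iterable[int]) -> None:
        self._order = deque(kernel_ids)

    def grant(self, pending: Set[int]) -> Optional[int]:
        """Return the next kernel id with a pending transfer, or None."""
        for _ in range(len(self._order)):
            candidate = self._order[0]
            self._order.rotate(-1)  # candidate moves to the back of the line
            if candidate in pending:
                return candidate
        return None
```

Because a granted kernel circuit moves to the back of the rotation, no kernel circuit with pending data waits for more than one grant to each of the others.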

Stream traffic manager 212 is coupled to DMA 110 and to interconnects 214 and 216. Stream traffic manager 212 is capable of regulating the flow of data streams/packets between host system 102 and kernel circuits 234. In the example of FIG. 2, stream traffic manager 212 includes a controller 236, one or more buffers 238, one or more data mover engines 240, a flow-to-pipe map (map) 242, and a pipe-to-route map (map) 244.

In particular embodiments, interconnect 214 and interconnect 216 implement interface 116 of FIG. 1. In the example of FIG. 2, interconnect 214 is configured to receive packetized data from stream traffic manager 212 and to route the packetized data to the appropriate kernel circuit 234. Interconnect 216 is configured to receive packetized data from kernel circuits 234 and to provide the packetized data to stream traffic manager 212.

In the example of FIG. 2, kernel circuits 234 are connected to interconnect 214 and to interconnect 216 through buffers. Each of kernel circuits 234 has an input port configured to receive a data stream through a corresponding input buffer and an output port configured to send a data stream through a corresponding output buffer. For purposes of illustration, the input buffers (e.g., buffers 218, 222, 226, and 230) are shaded, and the output buffers (e.g., buffers 220, 224, 228, and 232) are not shaded.

Kernel circuit 234-1 is connected to interconnect 214 through buffer 218 and to interconnect 216 through buffer 220. Kernel circuit 234-2 is connected to interconnect 214 through buffer 222 and to interconnect 216 through buffer 224. Kernel circuit 234-3 is connected to interconnect 214 through buffer 226 and to interconnect 216 through buffer 228. Kernel circuit 234-4 is connected to interconnect 214 through buffer 230 and to interconnect 216 through buffer 232.

As noted, interconnects 214 and 216 may be implemented as AXI stream interconnects, though the inventive arrangements are not so limited. Any of a variety of circuit architectures for delivering packetized data may be used. Other example circuit architectures that may be used to implement interconnects 214 and 216 include, but are not limited to, a crossbar, a multiplexed bus, a mesh network, and/or a network-on-chip (NoC).

Input buffers 218, 222, 226, and 230 are each coupled to interconnect 214 and to the input port of kernel circuits 234-1, 234-2, 234-3, and 234-4, respectively. Each input buffer is capable of temporarily storing packetized data from host system 102 that is directed to the corresponding kernel circuit 234 when that kernel circuit is unable to accept or process the received data immediately. Each input buffer is also capable of converting the packetized data received from host system 102 into a data stream that is provided to the corresponding kernel circuit 234. For example, each input buffer is capable of combining a sequence of one or more packets to generate a data stream that may be provided to the corresponding kernel circuit.
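The two roles of an input buffer described above, holding packets until the kernel circuit is ready and presenting them as one continuous stream, can be sketched as follows. The class name `InputBuffer` and its methods are assumptions for illustration, and capacity limits and back-pressure are omitted.

```python
from collections import deque
from typing import Iterator


class InputBuffer:
    """Holds packets until the kernel is ready, then yields them as a stream."""

    def __init__(self) -> None:
        self._packets = deque()

    def push(self, packet: bytes) -> None:
        """Store a packet arriving from the interconnect."""
        self._packets.append(packet)

    def stream(self) -> Iterator[int]:
        """Yield bytes in arrival order with the packet boundaries removed."""
        while self._packets:
            yield from self._packets.popleft()
```

The consumer sees only a byte stream. Where one packet ended and the next began is not visible, consistent with packetization being largely hidden from the kernel circuit.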

Output buffers 220, 224, 228, and 232 are each coupled to interconnect 216 and to the output port of kernel circuits 234-1, 234-2, 234-3, and 234-4, respectively. Each output buffer is capable of temporarily holding the data stream output from the corresponding kernel circuit 234, converting the data stream into packetized data, and sending the packetized data to host system 102 via interconnect 216. Each output buffer is capable of storing data when the kernel circuit is unable to keep pace with the streaming infrastructure. Each output buffer is capable, for example, of separating the data stream output from the corresponding kernel circuit into one or more packets.

In one or more embodiments, output buffers 220, 224, 228, and 232 are capable of providing kernel tagging information that identifies the source kernel circuit and/or the destination kernel circuit. For example, an output buffer is capable of adding the tagging information as a prepended header. The tagging performed by the output buffers allows the data within a packet to be placed in the correct location in host memory or routed to the appropriate kernel circuit. For example, the output buffer corresponding to each kernel circuit 234 is capable of tagging each packet with a source kernel identifier and sending the packets to interconnect 216. Interconnect 216 delivers the packets to stream traffic manager 212 and DMA 110. DMA 110 moves the packetized data to host memory.
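The tagging step can be illustrated with the sketch below, in which an output buffer prepends a header carrying a source kernel identifier and the receiving side uses that identifier to place each payload. The header layout `TAG` and the function names `tag_packets` and `deliver_to_host` are hypothetical. The source says only that the tag is a prepended header.

```python
import struct

# Hypothetical tag: 1-byte source kernel id, 2-byte payload length.
TAG = struct.Struct("<BH")


def tag_packets(kernel_id: int, stream: bytes, packet_size: int) -> list[bytes]:
    """Output buffer: split a kernel's stream and prepend the source tag."""
    packets = []
    for start in range(0, len(stream), packet_size):
        payload = stream[start:start + packet_size]
        packets.append(TAG.pack(kernel_id, len(payload)) + payload)
    return packets


def deliver_to_host(packets: list[bytes]) -> dict[int, bytes]:
    """Receiving side: use the tag to place each payload in the right region."""
    host_memory: dict[int, bytearray] = {}
    for packet in packets:
        kernel_id, length = TAG.unpack_from(packet)
        payload = packet[TAG.size:TAG.size + length]
        host_memory.setdefault(kernel_id, bytearray()).extend(payload)
    return {kernel: bytes(data) for kernel, data in host_memory.items()}
```

Because every packet names its source, packets from different kernel circuits can arrive interleaved and still be reassembled per kernel.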

For purposes of illustration, kernel circuit 234-1 is described. It should be appreciated that kernel circuits 234-2, 234-3, and 234-4 may operate in the same or a similar manner. In the example of FIG. 2, an input port of kernel circuit 234-1 is connected to the interconnect 214 through the buffer 218. An output port of kernel circuit 234-1 is connected to the interconnect 216 through the buffer 220. For purposes of illustration, write queue 202-1 is mapped to input buffer 218 and read queue 202-2 is mapped to output buffer 220. In general, each of the queues 202 is mapped to one of the buffers 218-232. However, a buffer 218-232 may be mapped to more than one of the queues 202. For purposes of illustration, queues 202-1 and 202-2 correspond to buffers 218 and 220, queues 202-3 and 202-4 correspond to buffers 222 and 224, queues 202-5 and 202-6 correspond to buffers 226 and 228, and queues 202-7 and 202-8 correspond to buffers 230 and 232.

In the example of FIG. 2, the host system 102 executes a user application configured for data streaming. To establish a connection to a kernel circuit 234, the host system 102 creates a pair of queues 202. As an illustrative example, the user application may call a function provided by the runtime 128 that causes the driver 130 to create the pair of queues 202-1 and 202-2, which correspond to the buffers 218 and 220, respectively. Once the pair of queues 202 is created, in this example, the host processor can call further functions to configure control registers (not shown) in the DMA 110 and the maps 242 and 244 of the stream traffic manager 212 so that data can be streamed between the host system 102 and the kernel circuit 234-1.

In executing the user application, the host system 102 places descriptors in the write queue 202-1 that specify instructions for sending (e.g., writing) data to the kernel circuit 234-1 and, as appropriate, places descriptors in the read queue 202-2 that specify instructions for receiving (e.g., reading) data from the kernel circuit 234-1. In particular embodiments, the driver 130 can packetize the data to be sent to the IC 104 and notify the DMA 110 of the number of descriptors available in the queues 202 to be fetched. The DMA 110 communicates that information to the stream traffic manager 212.

The stream traffic manager 212 maintains a mapping of the queues 202 to the buffers 218-232 using the maps 242 and 244. Using the stored mappings, the stream traffic manager 212 determines that queue 202-1 corresponds to buffer 218 and queue 202-2 corresponds to buffer 220. The controller 236, being aware of the descriptors available in queue 202-1, can access the buffer 218 for the input port of kernel circuit 234-1. The controller 236 determines whether the buffer 218 has space available to receive data and, if so, determines the amount of data that can be received and stored in the buffer 218.

In one or more embodiments, the DMA 110 can determine how full each of the queues 202 is and inform the controller 236. The write circuit 204 can, for example, determine the number of descriptors in each of the queues 202-1, 202-3, 202-5, and 202-7. The read circuit 208 can determine the number of descriptors in each of the queues 202-2, 202-4, 202-6, and 202-8. The write circuit 204 and the read circuit 208 can inform the stream traffic manager 212 of the number of descriptors in the respective queues 202. Further, the write circuit 204 and the read circuit 208 can retrieve descriptors from the queues 202 under the control of the stream traffic manager 212.

Within the stream traffic manager 212, the buffer(s) 238 store the descriptors retrieved from the queues 202 via the DMA 110. For example, the controller 236 can request that the DMA 110 retrieve a particular number of descriptors depending on the amount of space available in the buffer(s) 238. The DMA 110 provides the retrieved descriptors to the stream traffic manager 212. Thus, the stream traffic manager 212 can internally store, in the buffer(s) 238, a subset of the descriptors stored in each of the queues 202.

In one or more embodiments, the format or syntax of a descriptor indicates how many descriptors are needed to form a packet and the number of bytes in the packet. In response to determining that the buffer 218 has space available to receive data, the controller 236 evaluates the descriptors stored in the buffer(s) 238 that correspond to the kernel circuit 234-1 (e.g., descriptors retrieved from the queue 202-1). Based on the data within the descriptor(s) themselves, the controller 236 determines how many descriptors to execute in order to retrieve a sufficient amount of data (e.g., one or more packets) to store in the buffer 218 without exceeding the available space of the buffer 218.
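The selection of how many descriptors to execute can be sketched in software as follows. This is only an illustration under stated assumptions: the `Descriptor` fields `length` and `end_of_packet`, and the function name `descriptors_to_execute`, are hypothetical and are not taken from the disclosure, which says only that the descriptor format conveys the descriptor count and byte count of a packet.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    length: int          # bytes of host data covered by this descriptor (assumed field)
    end_of_packet: bool  # True on the last descriptor of a packet (assumed field)


def descriptors_to_execute(cached: list[Descriptor], free_bytes: int) -> int:
    """Return how many leading descriptors form whole packets that fit in free_bytes."""
    chosen = 0         # descriptors committed so far (always ends on a packet boundary)
    used = 0           # bytes committed so far
    pending_count = 0  # descriptors of the packet currently being examined
    pending_bytes = 0
    for d in cached:
        pending_count += 1
        pending_bytes += d.length
        if used + pending_bytes > free_bytes:
            break  # this packet would overflow the input buffer
        if d.end_of_packet:
            chosen += pending_count
            used += pending_bytes
            pending_count = pending_bytes = 0
    return chosen
```

Because only whole packets are counted, a packet that would overflow the buffer is left in the queue until space is freed.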

In one or more embodiments, each of the data mover engines 240 can retrieve data from the host system 102 and send data to the host system 102 via the DMA 110. The data mover engines 240 can operate concurrently. The controller 236 can assign descriptors to be executed from the buffer(s) 238 to available ones of the data mover engines 240. Each data mover engine 240 processes its assigned descriptors by fetching the data specified by each respective descriptor. For example, a data mover engine 240 can send the retrieved packetized data specified by the descriptor(s) to the buffer 218 via the interconnect 214. As noted, the input buffer 218 can store the packetized data, convert the packetized data into a data stream, and provide the data stream to the kernel circuit 234-1.

The packet handling capabilities of the stream traffic manager 212 allow packets that may correspond to different data streams to be retrieved in an interleaved manner. Packets may be retrieved from (or sent to) the host system 102 in an interleaved manner for N different data streams.

The stream traffic manager 212 can perform the operations described for each of the kernel circuits 234. Thus, the stream traffic manager 212 can continuously monitor the input buffer for each kernel circuit 234 and initiate a data transfer to a buffer only in response to first determining that the input buffer has space to receive and store the data. In other words, the controller 236 can continuously determine which descriptors in the queues 202 have corresponding buffers in the IC 104 with sufficient space available and then execute those descriptors.

At any given time, the communication bus connecting the IC 104 and the host system 102 can simultaneously carry multiple descriptors and/or data being fetched. Each of the interconnects 214 and 216 can carry a single packet at a time. In particular embodiments, the arbitration circuit 206 can implement a round-robin arbitration scheme that passes one packet at a time to the different kernel circuits. In other embodiments, the arbitration circuit 206 may use a different arbitration scheme. Because the stream traffic manager 212 executes descriptors (initiates read requests) only for kernel circuits 234 that have space available in their input buffers, packets received from the arbitration circuit 206 are guaranteed to be passed on to the intended input buffer of the target kernel circuit 234 without back pressure. Because the space in the input buffer is allocated in advance, space to receive the packetized data is guaranteed.

The stream traffic manager 212 can further instruct the DMA 110 to fetch data in an interleaved manner. As an illustrative example, based on which kernel circuits are busy and the space available in the input buffers, the controller 236 requests the DMA 110 to retrieve one or more packets for kernel circuit 234-1, then one or more packets for kernel circuit 234-2, and so on. Knowing how busy each of the kernel circuits 234 is and how much data storage is available in each respective input buffer of each kernel circuit 234, the stream traffic manager 212 arbitrates among the kernel circuits 234. In particular embodiments, the controller 236 stores the first "N" descriptors for each of the write queues 202 locally in the buffer(s) 238 and implements a round-robin arbitration scheme that checks each input buffer of each kernel circuit for available space.
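The space-aware round-robin arbitration described above can be modeled in a few lines of software. The sketch below is a simplification under stated assumptions: the function names, the use of plain dictionaries, and the choice not to model kernel circuits draining their buffers are all illustrative and are not taken from the disclosure.

```python
from typing import Optional


def pick_next_kernel(order, start, free_bytes, head_packet_bytes) -> Optional[int]:
    """Round-robin: starting at position `start`, return the first kernel whose
    input buffer can hold its next queued packet, or None if none qualify."""
    n = len(order)
    for step in range(n):
        kernel = order[(start + step) % n]
        needed = head_packet_bytes.get(kernel)  # None means no descriptor is cached
        if needed is not None and needed <= free_bytes[kernel]:
            return kernel
    return None


def schedule(order, free_bytes, queued_packets):
    """Return the interleaved order in which packets would be fetched."""
    free = dict(free_bytes)
    queues = {k: list(v) for k, v in queued_packets.items()}
    fetched = []
    start = 0
    while True:
        heads = {k: q[0] for k, q in queues.items() if q}
        kernel = pick_next_kernel(order, start, free, heads)
        if kernel is None:
            return fetched
        size = queues[kernel].pop(0)
        free[kernel] -= size  # space is reserved before the fetch starts
        fetched.append((kernel, size))
        start = (order.index(kernel) + 1) % len(order)
```

In this model a kernel whose input buffer is too full is simply skipped, so its pending packet never delays fetches for the other kernels.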

The architecture 100 can operate in a similar manner when transferring data from the kernel circuits 234 to the host system 102. For example, the stream traffic manager 212 can store the first "N" descriptors of each of the read queues 202-2, 202-4, 202-6, and 202-8. The stream traffic manager 212 can determine when result data is available in the output queue for a kernel circuit 234. In response to determining that a descriptor corresponding to an output buffer is available, i.e., that the output buffer contains stored results, the controller 236 initiates a data transfer from the output buffer to the host system 102 using an available data mover engine 240. The availability of a descriptor indicates that the host system 102 has space available to receive results from the kernel circuit.

For purposes of illustration, kernel circuit 234-1 can operate on data from the input buffer 218. Kernel circuit 234-1 outputs result data as a data stream to the output buffer 220. The stream traffic manager 212, e.g., the controller 236, can monitor the output buffer to determine when data is available, e.g., when at least a complete packet of data is available in the output buffer and the corresponding read queue has sufficient space available to store that data (e.g., at least a complete packet). In response to determining that the output buffer 220 has data available and that a descriptor (which may be retrieved and cached in the buffer 238 in the stream traffic manager 212) is available in the corresponding read queue 202-2, the controller 236 initiates a data transfer from the output buffer 220 through the interconnect 216 to the DMA 110 and the host system 102. The output buffer 220 converts the data stream into packetized data before sending the data onto the interconnect 216 and on to the host system 102. In one or more embodiments, the arbitration circuit 210 can implement round-robin arbitration. In other embodiments, the arbitration circuit 210 can implement other arbitration techniques. The arbitration technique, whether round robin or otherwise, implements interleaving or alternation of the data streams and/or packets from the kernel circuits 234.

In embodiments in which multiple streaming-enabled kernel circuits are implemented within the IC, each active kernel circuit receives a portion of the data transfer bandwidth of the IC. Concurrent operation of multiple streaming-enabled kernel circuits generally means that such kernel circuits are designed to operate on fragments of data as the fragments arrive at each respective kernel circuit, rather than waiting for an entire data transfer to complete before computation begins. This ability to operate on smaller fragments of data gives the streaming-enabled kernel circuits described herein quicker access to data, which facilitates lower latency, higher performance, lower data storage requirements, lower overall cost, and lower power consumption.

When interleaving (or alternating) between different kernel circuits that send data to and/or receive data from the DMA 110, the stream traffic manager 212 can ensure that the interconnect fabric, e.g., the interconnects 214 and 216, is not blocked by a slow kernel circuit. This is accomplished, at least in part, through the use of the buffers 218-232. In one embodiment, each of the buffers 218-232 is sized to store at least one complete packet of data. As described, data directed to a kernel circuit is not sent unless buffer space is available in the input buffer of the kernel circuit. When a burst of packets arrives at the input buffer, the kernel circuit can empty the buffer on its own timetable without adversely affecting traffic on the interconnect 214, thereby preventing the congestion condition known as "head-of-line blocking." Similarly, data directed from a kernel circuit to the host system 102 is not sent across the interconnect 216 until the entire packet has been transferred to the output buffer.

The output buffer of a kernel circuit has exclusive use of the interconnect 216 once a data transfer begins. If a kernel circuit were to fall behind or stop sending data in the middle of a packet, the interconnect 216 could not switch to servicing another kernel circuit until it had received the entire packet, thereby locking the interconnect 216 and preventing other kernel circuits from sending data to the host system. If the output buffers were omitted, one kernel circuit could adversely affect the performance of other kernel circuits. In accordance with the inventive arrangements described herein, each output buffer can receive and store at least an entire packet before attempting to send data to the interconnect 216. This feature ensures that once transmission of a packet begins, the transmission completes as quickly as the interconnect 216 and the upstream infrastructure can accept the transfer, independently of kernel circuit behavior or kernel circuit output data rate.
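The store-and-forward behavior of an output buffer, together with the condition for starting a host-bound transfer, can be sketched as follows. This is a minimal model under simplifying assumptions that are not from the disclosure: packets have a fixed size, and the class name `OutputBuffer` and function `try_transfer` are hypothetical.

```python
class OutputBuffer:
    """Store-and-forward buffer: releases data to the interconnect only as whole packets."""

    def __init__(self, packet_bytes: int, capacity_packets: int = 1):
        if packet_bytes <= 0 or capacity_packets < 1:
            raise ValueError("sizes must be positive")
        self.packet_bytes = packet_bytes
        self.capacity = packet_bytes * capacity_packets
        self._data = bytearray()

    def push(self, chunk: bytes) -> int:
        """Accept as much of chunk as fits; return the number of bytes accepted."""
        room = self.capacity - len(self._data)
        accepted = chunk[:room]
        self._data += accepted
        return len(accepted)

    def has_complete_packet(self) -> bool:
        return len(self._data) >= self.packet_bytes

    def pop_packet(self) -> bytes:
        if not self.has_complete_packet():
            raise RuntimeError("no complete packet buffered yet")
        packet = bytes(self._data[:self.packet_bytes])
        del self._data[:self.packet_bytes]
        return packet


def try_transfer(buf: OutputBuffer, read_descriptors_available: int):
    """Start a host-bound transfer only if a full packet and a read descriptor both exist."""
    if buf.has_complete_packet() and read_descriptors_available > 0:
        return buf.pop_packet()
    return None
```

Because a packet leaves the buffer only when it is complete, a kernel that stalls partway through producing a packet never holds the shared interconnect.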

In one or more embodiments, the kernel circuits and the buffers are implemented using programmable circuitry. Thus, buffers are created only for kernel circuits that are actually implemented in the IC 104. Circuit resources of the IC 104 are not wasted on input buffers and/or output buffers when only a small number of kernel circuits is deployed. Resource usage scales with the number of kernel circuits implemented in the IC 104. In particular embodiments, data transfers across the interconnects 214 and 216 are regulated through a system of buffer credits managed by the stream traffic manager 212.
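The disclosure does not detail the buffer credit scheme. As a rough illustration, generic credit-based flow control can be modeled as a counter of free packet slots per buffer; the class name `CreditPool` and the one-credit-per-packet convention are assumptions made for this sketch.

```python
class CreditPool:
    """Tracks free packet slots ("credits") for one input buffer."""

    def __init__(self, slots: int):
        if slots < 1:
            raise ValueError("a buffer needs at least one packet slot")
        self.slots = slots
        self.credits = slots  # every slot starts out free

    def try_consume(self) -> bool:
        """Reserve one slot before a packet is sent; False means hold the packet."""
        if self.credits == 0:
            return False
        self.credits -= 1
        return True

    def release(self) -> None:
        """Return one credit after the kernel circuit has drained a packet."""
        if self.credits == self.slots:
            raise RuntimeError("more credits returned than were consumed")
        self.credits += 1
```

A sender that checks `try_consume()` before each packet can never overrun the buffer, which matches the guarantee that packets arriving at an input buffer meet no back pressure.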

In one or more embodiments, the runtime 128 can provide various application programming interfaces (APIs) that can be called by the user application to support direct communication with the kernel circuits using data streams. The following is a list of example APIs provided by the runtime 128.
clCreateHostPipe--An OpenCL API that creates a read-type or write-type data buffer for streaming data, also referred to as a "streaming pipe."
clEnqueueWritePipeBuffer--Queues a packet directly onto a streaming pipe for writing (data transfer to the kernel circuit).
clEnqueueReadPipeBuffer--Queues a packet directly onto a streaming pipe for reading (data transfer from the kernel circuit).

The runtime 128 may further provide APIs for creating, destroying, starting, stopping, and modifying read queue pairs and/or write queue pairs.
struct xclQueueContext
xclCreateWriteQueue--Creates a write queue in the host system. Allocates resources in the host system and initializes the write queue for the DMA 110 to issue device requests. A queue handle for the created write queue is returned for future access.
xclCreateReadQueue--Creates a read queue in the host system. Allocates resources in the host system and initializes the read queue for the DMA 110 to issue "from device" requests. A queue handle for the created read queue is returned for future access.
xclDestroyQueue--Destroys the specified read queue/write queue and reclaims the resources used to implement the destroyed read queue/write queue.
xclModifyQueue--Modifies the parameters of the specified read queue/write queue.
xclStartQueue--Puts the specified read queue/write queue into a running state in which the queue can begin accepting and processing DMA requests.
xclStopQueue--Puts the specified read queue/write queue into the initialized state. All pending DMA requests are flushed.
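The queue life cycle implied by the descriptions above (created in an initialized state, started, stopped back to initialized, destroyed) can be modeled as a small state machine. The sketch below is a Python model for illustration only and is not the runtime's API: the class `QueueModel`, its method names, and the reading of "flushed" as "discarded" are assumptions.

```python
from enum import Enum


class State(Enum):
    INITIALIZED = "initialized"
    RUNNING = "running"
    DESTROYED = "destroyed"


class QueueModel:
    """Toy model of a read or write queue's life cycle (not the real runtime API)."""

    def __init__(self, direction: str):
        if direction not in ("read", "write"):
            raise ValueError("direction must be 'read' or 'write'")
        self.direction = direction
        self.state = State.INITIALIZED  # creation leaves the queue initialized
        self.pending = []

    def _require(self, expected: State) -> None:
        if self.state is not expected:
            raise RuntimeError(f"queue is {self.state.value}, expected {expected.value}")

    def start(self) -> None:
        self._require(State.INITIALIZED)
        self.state = State.RUNNING

    def submit(self, request) -> None:
        self._require(State.RUNNING)  # requests are accepted only while running
        self.pending.append(request)

    def stop(self) -> int:
        """Return to the initialized state; pending requests are discarded (assumption)."""
        self._require(State.RUNNING)
        flushed = len(self.pending)
        self.pending.clear()
        self.state = State.INITIALIZED
        return flushed

    def destroy(self) -> None:
        if self.state is State.DESTROYED:
            raise RuntimeError("queue already destroyed")
        self.pending.clear()
        self.state = State.DESTROYED
```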

The runtime 128 may further provide APIs for issuing writes to and reads from the kernel circuits, such as the following.
struct xclQueueRequest
struct xclWRBuffer
xclWriteQueue--Writes to the specified queue.
xclReadQueue--Reads from the specified queue.

The driver 130 may further provide APIs that support operation of the DMA 110, such as the following.
streamq_create()
streamq_destroy()
streamq_write()/streamq_read()

In one or more embodiments, the runtime 128 provides input/output control (IOCTL) system calls for input/output operations relating to the IC 104 that can be invoked to create, destroy, start, stop, and modify read requests and/or write requests. In particular embodiments, these system calls are not available to user space applications executing in the host system 102. The runtime 128 may further provide Portable Operating System Interface (POSIX) read/write functions and asynchronous I/O (AIO) read/write functions that are available to user space applications executing in the host system 102.

A system executing an electronic design automation (EDA) application that includes a hardware compiler/system linker can map kernel arguments to queues during the design flow that implements the kernel (e.g., high-level synthesis, synthesis, placement, routing, and/or configuration bitstream generation). The mapping information is generated and stored in a container file together with the configuration bitstream (e.g., a partial configuration bitstream) that specifies the kernel circuit. The container file is stored in the host system 102 for use and implementation within the IC 104.

When the host system 102 retrieves the container file and implements the configuration bitstream from the container file in the IC 104, the host system 102 can further extract the metadata, which includes the mapping information generated during compilation. The mapping information is provided to the runtime 128 for use in setting up communication paths that route data streams between the host system 102 and the kernel circuit once the kernel circuit is implemented in the IC 104.

Based on the use of a "pipe" data construct within the program code for the kernel, the EDA application can generate a kernel circuit (e.g., a configuration bitstream specifying the kernel circuit) that is configured to use data streams for data transfers instead of memory-mapped transactions involving either off-chip RAM or internal RAM. For example, in response to detecting the pipe data structure, the EDA application can generate the necessary hardware infrastructure and/or circuitry that supports data transfers using data streams, as described in connection with FIG. 1 and/or FIG. 2. An example of a kernel specified in OpenCL is provided below as Example 1.

When compiling the above example kernel, the EDA application generates mapping information for p1 and p2. The mapping information includes register settings for configuring the stream traffic manager 212 (e.g., by storing register settings in the maps 242 and 244) and for configuring the DMA 110 (by storing register settings in the control registers of the DMA 110) so that, once the kernel circuit is implemented in the IC 104, data streams are correctly routed between the host system 102 and a particular kernel circuit such as kernel circuit 234-1. In one example, the mapping information specifies the particular route_id and flow_id to which each pipe is bound and/or static information relating to pipe p1 and pipe p2. This mapping data is stored in the container file as metadata (e.g., program code) for the configuration bitstream that specifies the kernel circuit generated from the kernel.

For example, to send data from memory of the host system 102 to kernel circuit 234-1, the runtime 128 and/or the driver 130 assigns the operation to p1 and binds p1 to queue 202-1. The host system 102 looks up the route_id for kernel circuit 234-1 in an internal table. The route_id specifies the location of kernel circuit 234-1. The host system 102 configures the control registers of the DMA 110 with pipe p1 and the associated queue 202-1. The host system 102 creates an entry that correlates the route_id for kernel circuit 234-1 with queue 202-1 and pipe p1. In one or more embodiments, in response to receiving data corresponding to pipe p1, the stream traffic manager 212 can tag the kernel-circuit-bound data belonging to p1 with the correct route_id. Given data tagged with this route_id, the stream traffic manager 212 and the interconnect 214 can deliver the data to kernel circuit 234-1 via the buffer 218.

Similarly, to transfer data from kernel circuit 234-1 to memory in the host system 102, the runtime 128 and/or the driver 130 can assign the operation to p2 and bind p2 to queue 202-2. The host system 102 looks up the flow_id used to tag host-bound data from kernel circuit 234-1. In one or more embodiments, kernel circuit 234-1 can tag outbound data with the appropriate flow_id. In one or more other embodiments, the buffer 220 includes circuitry that can tag outbound data with the appropriate flow_id. The host system 102 configures the DMA 110 with pipe p2 and associates pipe p2 with queue 202-2. The host system 102 further creates an entry that correlates the flow_id for kernel circuit 234-1 (e.g., for the buffer 220) with queue 202-2 and pipe p2 for the data transfer. The stream traffic manager 212 can further bind host-bound traffic tagged with the flow_id to pipe p2 when forwarding the data to the DMA 110.
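The two bindings just described (a write queue bound to a route_id for host-to-kernel data, and a flow_id bound to a read queue for kernel-to-host data) can be pictured as a pair of lookup tables. The sketch below is a software analogy under stated assumptions: the class `StreamMaps`, its method names, and the example identifier values 7 and 3 are hypothetical; the disclosure does not give the layout of the maps 242 and 244.

```python
class StreamMaps:
    """Toy stand-in for the two lookup tables kept by the stream traffic manager."""

    def __init__(self):
        self.route_by_write_queue = {}  # write queue id -> route_id (host to kernel)
        self.read_queue_by_flow = {}    # flow_id -> read queue id (kernel to host)

    def bind_write(self, queue_id: str, route_id: int) -> None:
        self.route_by_write_queue[queue_id] = route_id

    def bind_read(self, queue_id: str, flow_id: int) -> None:
        self.read_queue_by_flow[flow_id] = queue_id

    def tag_inbound(self, queue_id: str, payload: bytes) -> tuple[int, bytes]:
        """Attach the route_id so the interconnect can deliver to the right kernel."""
        return self.route_by_write_queue[queue_id], payload

    def queue_for_outbound(self, flow_id: int) -> str:
        """Find the read queue that host-bound data with this flow_id belongs to."""
        return self.read_queue_by_flow[flow_id]


# Example bindings for pipe p1 (queue 202-1) and pipe p2 (queue 202-2);
# the identifier values are made up for illustration.
maps = StreamMaps()
maps.bind_write("202-1", route_id=7)
maps.bind_read("202-2", flow_id=3)
```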

Once both the DMA 110 and the stream traffic manager 212 are configured for the data transfers, the DMA 110 is instructed to begin operation in accordance with Example 1 above.

FIG. 3 illustrates an example method 300 of transferring data between a host system and kernel circuits of a hardware accelerator using data streams. Method 300 can begin in a state where the host system stores one or more container files in memory. Each container file includes one or more configuration bitstreams and corresponding metadata. Each of the configuration bitstreams, which may be partial configuration bitstreams, specifies one or more kernel circuits.

In block 305, the host system selects a container file. For purposes of illustration, the container file includes a configuration bitstream and metadata for the configuration bitstream. The configuration bitstream may be a partial configuration bitstream. In one or more embodiments, the host system selects the container file in response to a user application requesting hardware accelerated functionality implemented by a kernel circuit specified by the configuration bitstream in the container file. The user application may specify the particular container file to be selected or retrieved from memory and implemented in the hardware accelerator.

In block 310, the host system extracts the configuration bitstream from the container file. The host system loads the configuration bitstream into the IC of the hardware accelerator, e.g., IC 104. By loading the configuration bitstream into the IC of the hardware accelerator, the kernel circuit specified by the configuration bitstream is physically implemented within the IC and is available to perform tasks requested by the host system.

In block 315, the host system determines one or more pipe properties from the metadata. For example, the host system extracts the metadata for the configuration bitstream from the selected container file. The metadata includes mapping data generated when the kernel was compiled. The mapping data includes one or more pipe properties that can be used to configure DMA 110 and stream traffic manager 212. For example, the pipe properties may include settings, e.g., register settings such as a route_id and/or a flow_id, that can be loaded into DMA 110 and/or the stream traffic manager to establish a route for exchanging data between the host system and one or more kernel circuits implemented by the configuration bitstream extracted from the selected container file.

In one or more embodiments, the metadata for the configuration bitstream includes additional information, generated during the design flow, that allows the stream traffic manager to operate more efficiently. For example, the metadata can specify information, e.g., settings, that is specific to each kernel. Using the metadata, the stream traffic manager is therefore capable of adjusting, on a per-kernel-circuit basis, how data is streamed to the kernel circuit and/or from the kernel circuit to the host system. The metadata can specify, for example, the size of the working data set of the kernel circuit (which corresponds to the packet size), the computation time the kernel circuit requires per data set, the amount of prefetching desired for the kernel circuit, and so forth. During operation, the stream traffic manager can adjust the amount of data fetched for a kernel and the amount of prefetching in accordance with the metadata for that particular kernel circuit.
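As a rough illustration of how such per-kernel metadata might drive fetching, the sketch below computes how many bytes to keep in flight for each kernel. The field names, the example kernels, and the budgeting rule (the data set being processed plus the requested number of prefetched data sets) are assumptions made here and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelStreamHints:
    """Hypothetical per-kernel metadata fields; the names are assumptions."""
    working_set_bytes: int   # size of one working data set (one packet)
    compute_time_us: float   # time the kernel needs per data set
    prefetch_sets: int       # how many data sets to fetch ahead


def fetch_budget(hints: KernelStreamHints) -> int:
    """Bytes to keep in flight: the current data set plus the prefetched sets."""
    return hints.working_set_bytes * (1 + hints.prefetch_sets)


# Made-up example values for two kernels.
hints = {
    "compress": KernelStreamHints(4096, 12.0, 2),
    "encrypt": KernelStreamHints(1024, 3.5, 0),
}
budgets = {name: fetch_budget(h) for name, h in hints.items()}
```

A kernel with a large working set and deep prefetching receives a larger budget than a kernel that consumes small packets without prefetching.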

As part of block 315, the host system is capable of sending the settings (e.g., the pipe properties and/or other information described) to the stream traffic manager and/or the DMA to set up a data path for streaming data between the implemented kernel circuit and the host system. For example, the host system calls one or more functions available in the driver and/or runtime to set up the data path. The functions, for example, write the settings to the control registers of the DMA and to the maps of the stream traffic manager. The stream traffic manager may include additional control registers that may be written with the settings described herein.
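The flow of block 315 can be sketched as follows. The JSON layout of the metadata, the kernel names, and the register field names are hypothetical, since the disclosure does not define a metadata format or a register map. The sketch only shows pipe properties being extracted and turned into a list of writes.

```python
import json

# Hypothetical metadata layout; the disclosure does not define a file format.
METADATA = json.loads("""
{
  "kernels": [
    {"name": "k0", "route_id": 17, "flow_id": 33},
    {"name": "k1", "route_id": 18, "flow_id": 34}
  ]
}
""")


def pipe_properties(metadata):
    """Return {kernel name: (route_id, flow_id)} from the mapping data."""
    return {k["name"]: (k["route_id"], k["flow_id"]) for k in metadata["kernels"]}


def register_writes(props):
    """List the (target, field, value) writes a driver call might issue."""
    writes = []
    for name, (route_id, flow_id) in sorted(props.items()):
        writes.append(("dma", f"{name}.route_id", route_id))
        writes.append(("stream_traffic_manager", f"{name}.route_id", route_id))
        writes.append(("stream_traffic_manager", f"{name}.flow_id", flow_id))
    return writes


props = pipe_properties(METADATA)
writes = register_writes(props)
```

Keeping the extraction step separate from the write step means the same pipe properties could be applied to whichever blocks need them.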

In block 320, the host system uses the settings to implement a data transfer, as a data stream, from the host system directly to the kernel circuit. For example, the host system adds one or more descriptors to the write queue in the driver that corresponds to the input buffer of the target kernel circuit. The DMA is capable of retrieving one or more of the descriptors and providing the retrieved descriptors to the stream traffic manager. The stream traffic manager temporarily stores the descriptors in an internal buffer. As described, the stream traffic manager is capable of monitoring the state of the input buffer for the target kernel circuit and, when space is available in the input buffer, executing one or more of the descriptors corresponding to the input buffer of the target kernel circuit using an available data mover engine included in the stream traffic manager. Accordingly, DMA 110 retrieves the data from host memory in packetized form. The stream traffic manager streams the data to the input buffer of the target kernel circuit. As noted, the input buffer is capable of converting the packetized data into streamed data.
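The gating behavior of block 320, in which descriptors are executed only while the input buffer of the target kernel circuit has space, can be modeled in a few lines. The `InputBuffer` class, the `(address, length)` descriptor format, and the capacity expressed in packets are simplifying assumptions made for this sketch.

```python
from collections import deque


class InputBuffer:
    """Bounded buffer in front of a kernel circuit (capacity in packets)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = deque()

    def free_slots(self):
        return self.capacity - len(self.packets)


def run_descriptors(descriptors, host_memory, buf):
    """Execute pending descriptors only while the input buffer has space.

    Each descriptor is an (address, length) pair. Returns the descriptors
    that could not be executed yet.
    """
    pending = deque(descriptors)
    while pending and buf.free_slots() > 0:
        addr, length = pending.popleft()
        # Fetch the packet from host memory and place it in the input buffer.
        buf.packets.append(host_memory[addr:addr + length])
    return list(pending)


host_memory = bytes(range(32))
buf = InputBuffer(capacity=2)
left = run_descriptors([(0, 8), (8, 8), (16, 8)], host_memory, buf)
```

With a capacity of two packets, the third descriptor stays pending until the kernel circuit drains its input buffer.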

In one or more embodiments, the data transferred to the target kernel circuit includes one or more instructions for the kernel circuit embedded within the data. In this regard, the instructions are said to be "in-band" with the data. By including instructions for the kernel circuit within the data stream, no separate signaling needs to be provided to the kernel circuit to start and/or stop its operation. Such operations can be initiated by the in-band instructions included in the data stream(s). In particular embodiments, the kernel circuit and/or the host system are capable of exchanging continuous data streams or, optionally, data streams interspersed with instructions (e.g., commands or status information).
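A toy model of in-band instructions is shown below. The `("cmd", ...)` and `("data", ...)` item encoding, the command names, and the doubling operation standing in for the accelerated function are assumptions chosen for illustration. The disclosure does not specify how instructions are encoded within the stream.

```python
def run_kernel(stream):
    """Consume a stream of ("cmd", name) and ("data", value) items.

    The kernel starts and stops on in-band commands; no separate control
    signal is modelled.
    """
    running = False
    results = []
    for kind, value in stream:
        if kind == "cmd":
            running = (value == "start")
        elif running:
            results.append(value * 2)  # stand-in for the accelerated function
    return results


stream = [("data", 1), ("cmd", "start"), ("data", 2), ("data", 3),
          ("cmd", "stop"), ("data", 4)]
out = run_kernel(stream)
```

Only the data items that arrive between the start and stop commands are processed, which is the effect the in-band instructions are meant to achieve without side-band signaling.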

In one or more embodiments, the host system is capable of determining that a data transfer is to be implemented as a data stream based upon the data type used by the user application requesting the data transfer and/or the particular API called by the user application.

In block 325, the host system uses the pipe properties to implement a further data transfer, as a data stream, from the kernel circuit directly to the host system. For example, the host system adds one or more descriptors to the read queue of the driver that corresponds to the output buffer of the target kernel circuit. As noted, the DMA is capable of retrieving one or more of the descriptors and providing the retrieved descriptors to the stream traffic manager. The stream traffic manager temporarily stores the descriptors in an internal buffer. The stream traffic manager is capable of monitoring the state of the output buffer for the kernel circuit and, when data is available in the output buffer, executing one or more of the descriptors corresponding to the output buffer of the target kernel circuit using an available data mover engine included in the stream traffic manager. Accordingly, the data mover engine of the stream traffic manager retrieves packetized data from the output buffer of the target kernel circuit and provides the packetized data to the DMA. As noted, the output buffer converts the data stream into packetized data. The DMA provides the packetized data to host memory over the communication bus.

FIG. 4 illustrates an example architecture 400 for exchanging data between kernel circuits using data streams. Architecture 400 supports use cases in which an application requires multiple large and complex kernel circuits and additional ICs are used to augment the programmable circuitry provided by a primary IC. The primary IC is configured to support communication with the host system via an endpoint and a DMA. The primary IC also includes a stream traffic manager. In one or more embodiments, the stream traffic manager is capable of routing packetized data for the kernel circuits to one of several different ports, each connected to an independent interconnect. Partitioning the kernel circuits among different interconnects allows the kernel circuits to be located in different physical regions of the IC, e.g., in different dies in the case of a multi-die IC. Further, the different interconnects isolate the kernel circuits of the different regions so that they do not interfere with one another. This partitioning allows a multi-die IC to be used and also allows a secondary IC to be used.

Architecture 400 includes IC 104 and IC 402. In one or more embodiments, ICs 104 and 402 are coupled to the same circuit board, e.g., a hardware accelerator, which may also include RAM (not shown). In the example of FIG. 4, each of ICs 104 and 402 is implemented as a multi-die IC. IC 104 includes dies 404 and 406. IC 402 includes dies 408 and 410. Each of dies 404, 406, 408, and 410 is implemented to include programmable circuitry, which is described herein in greater detail in connection with FIG. 7. In particular embodiments, one or more of dies 404, 406, 408, and 410 includes one or more hardwired circuit blocks. In one example, each of dies 404, 406, 408, and 410 is implemented as a field programmable gate array (FPGA).

In the example of FIG. 4, dies 404 and 406 are included in one package, while dies 408 and 410 are included in a different package. IC 104 and IC 402 may be implemented using any of a variety of available multi-die technologies. In one or more embodiments, dies 404 and 406 are mounted on an interposer that includes wires capable of conveying signals between die 404 and die 406. Similarly, dies 408 and 410 are mounted on an interposer that includes wires capable of conveying signals between die 408 and die 410. The dies may be attached using a plurality of solder bumps or another connection technology. The interposer includes, for example, a plurality of through vias that allow selected signals to pass outside of the multi-die IC package to a substrate.

For purposes of illustration, dies 404 and 408 are shaded to better show the different circuit blocks included in each respective die. In the example of FIG. 4, dies 404 and 408 include additional circuit blocks that are not included in dies 406 and 410, respectively. For example, die 404 includes endpoint 108, DMA 110, stream traffic manager 212, and transceiver 442, while die 406 does not. In one or more embodiments, one or more of endpoint 108, DMA 110, and/or transceiver 442 are implemented as hardwired circuit blocks. In particular embodiments, endpoint 108, DMA 110, and/or transceiver 442 are implemented in programmable circuitry. These circuit structures are not repeated in die 406. Similarly, die 408 includes transceiver 444 and satellite stream traffic manager 412, while die 410 does not. These structures are not repeated in die 410.

In the example of FIG. 4, endpoint 108, DMA 110, and stream traffic manager 212 are implemented substantially as described in connection with FIGS. 1 and 2. In the example of FIG. 4, however, stream traffic manager 212 includes additional I/O ports. For example, stream traffic manager 212 includes an additional I/O port that connects to transceiver 442. Further, one or more I/O ports of stream traffic manager 212 couple to die 406 and, in particular, to interconnect 416. In one or more embodiments, interconnect 414 and interconnect 416 each represent an instance of interconnect 214 and an instance of interconnect 216. Accordingly, each of dies 404 and 406 includes instances of interconnect 214 and interconnect 216. As shown, kernel circuits 234 and the corresponding buffers are spread across dies 404 and 406.

Die 408 includes transceiver 444, satellite stream traffic manager 412, interconnect 418, buffers 422, 424, 426, and 428, and kernel circuits 440-1 and 440-2. In particular embodiments, interconnect 418 represents another instance of interconnect 214 and another instance of interconnect 216. Die 410 includes interconnect 420, buffers 432, 434, 436, and 438, and kernel circuits 440-3 and 440-4. Similarly, interconnect 420 represents another instance of interconnect 214 and another instance of interconnect 216. In one or more embodiments, transceiver 444 is implemented as a hardwired circuit block. In particular embodiments, transceiver 444 is implemented in programmable circuitry.

In the example of FIG. 4, IC 104 is capable of operating as a master in that die 404 includes endpoint 108 for communicating with host system 102. Further, stream traffic manager 212 is capable of communicating with satellite stream traffic manager 412 via transceivers 442 and 444. In one or more embodiments, transceivers 442 and 444 implement a high-speed point-to-point interconnect that includes a plurality of serial data lanes. The connection formed by transceivers 442 and 444 carries data between stream traffic manager 212 and satellite stream traffic manager 412. In addition, transceivers 442 and 444 are capable of providing an additional layer of buffering to hide the added latency of crossing the IC boundary. In the example of FIG. 4, stream traffic manager 212 and satellite stream traffic manager 412 send and receive packetized data. In one or more embodiments, transceivers 442 and 444 are capable of serializing the streaming packets exchanged between stream traffic manager 212 and satellite stream traffic manager 412 for transmission from one transceiver to the other, and of deserializing the transmitted data for sending and/or handling within ICs 104 and 402. Similarly, transceivers 442 and 444 are capable of serializing the credit messages exchanged between stream traffic manager 212 and satellite stream traffic manager 412 for transmission from one transceiver to the other, and of deserializing such messages for sending and/or handling within IC 104 and/or IC 402.

Using architecture 400, host system 102 is capable of configuring DMA 110, stream traffic manager 212, and satellite stream traffic manager 412 to route packetized data. For example, stream traffic manager 212 is capable of passing any necessary mapping data and/or settings on to satellite stream traffic manager 412. Once configured, host system 102 is capable of offloading tasks to IC 104 and/or IC 402. Further, host system 102 is capable of directing tasks to one or more of kernel circuits 234 and/or one or more of kernel circuits 440.

Whereas kernel circuits 234 were included in a single die in the example of FIG. 2, in the example of FIG. 4 kernel circuits 234 are distributed across dies 404 and 406. Similarly, kernel circuits 440 are distributed across dies 408 and 410. While stream traffic manager 212 allows data to be provided to multiple kernel circuits concurrently, stream traffic manager 212 is also capable of establishing connections between kernel circuits 234 (e.g., from 234-1 to 234-2 or vice versa, from 234-1 or 234-2 to 234-3 or 234-4, from 234-3 to 234-4 or vice versa, and from 234-3 or 234-4 to 234-2 or 234-1). For example, where kernel circuits are not initially configured to communicate directly with one another, stream traffic manager 212 is capable of enabling a kernel circuit to stream data to another kernel circuit, whether in the same die or in a different die of the same IC. Similarly, satellite stream traffic manager 412 is capable of enabling a kernel circuit to stream data to another kernel circuit, whether in the same die or in a different die of the same IC (e.g., from 440-1 or 440-2 to 440-3 or 440-4, from 440-1 to 440-2 or vice versa, from 440-3 to 440-4 or vice versa, and from 440-3 or 440-4 to 440-1 or 440-2). Data exchanged between kernel circuits located in different dies and/or different ICs must be controlled by, and flow through, stream traffic manager 212 and/or satellite stream traffic manager 412, as the case may be.

In another embodiment, when data is exchanged between kernel circuits located in the same die, the data flows from the sending kernel circuit to the interconnect and from the interconnect to the receiving kernel circuit, bypassing stream traffic manager 212 and/or satellite stream traffic manager 412, as the case may be, while still being under the control of stream traffic manager 212 and/or satellite stream traffic manager 412. In either case, the output buffer of the sending kernel circuit converts the data stream output from the sending kernel circuit into packetized data, and the input buffer of the receiving kernel circuit converts the packetized data into a data stream for consumption by the receiving kernel circuit.

Stream traffic manager 212 is also capable of communicating with satellite stream traffic manager 412. Satellite stream traffic manager 412 is implemented substantially the same as stream traffic manager 212. Communication between stream traffic manager 212 and satellite stream traffic manager 412 via transceivers 442 and 444 enables a kernel circuit in one IC to stream data to a kernel circuit in a different IC (e.g., from 234-1 or 234-2 to 440-1 or 440-2, from 234-1 or 234-2 to 440-3 or 440-4, from 234-3 or 234-4 to 440-1 or 440-2, from 234-3 or 234-4 to 440-3 or 440-4, from 440-1 or 440-2 to 234-1 or 234-2, from 440-1 or 440-2 to 234-3 or 234-4, from 440-3 or 440-4 to 234-1 or 234-2, and from 440-3 or 440-4 to 234-3 or 234-4).

Nevertheless, kernel circuits may be implemented to communicate directly with one another. In that case, the kernel circuits are created and implemented within the programmable circuitry with this capability built in. Such a connection is shown in FIG. 4, where kernel circuit 234-3 is capable of communicating directly with kernel circuit 234-4 to provide data results to kernel circuit 234-4 without using stream traffic manager 212. Where the kernel circuits are located in different dies and/or different ICs, stream traffic manager 212 and/or satellite stream traffic manager 412 is required.

In many cases, kernel circuits (e.g., the operations performed by such kernel circuits) are chained together in series. Data may be passed from one kernel circuit to another in stages, where each different kernel circuit is customized to perform a different operation. In other implementations that use memory-mapped interfaces, whether the memory is local within the IC or external to the IC, the progress of the upstream kernel circuit(s) must be tracked by the host system in order to start the downstream kernel circuit(s) at the appropriate time, e.g., when the operation of the upstream kernel circuit(s) is detected. In some cases, the host system must also copy data from the upstream kernel circuit to the downstream kernel circuit if the downstream kernel circuit does not have access to the same memory as the upstream kernel circuit. This type of architecture incurs considerable overhead in the software of host system 102 and often results in under-utilization of the hardware (the kernel circuits).

The streaming architecture described within this disclosure, which uses in-band instructions within the data streams passed from kernel circuit to kernel circuit, allows one kernel circuit to pass data directly to another kernel circuit together with the instructions included in the data stream, thereby implementing chained processing of data through multiple kernel circuits without the involvement of host system 102. The streaming architecture reduces the overhead imposed on the host system and uses hardware resources more efficiently.

It should be appreciated that the stream traffic manager circuitry is capable of providing data from host system 102 to any of the kernel circuits implemented in IC 104 or IC 402. Packetized data provided from host system 102 to a kernel circuit in IC 104 passes through endpoint 108, DMA 110, and stream traffic manager 212. A data stream output from a kernel circuit in IC 104 (e.g., a result data stream) passes to host system 102 via stream traffic manager 212, DMA 110, and endpoint 108. Packetized data provided from host system 102 to a kernel circuit in IC 402 passes through endpoint 108, DMA 110, stream traffic manager 212, transceivers 442 and 444, and satellite stream traffic manager 412. A data stream output from a kernel circuit in IC 402 (e.g., a result data stream) passes through satellite stream traffic manager 412, transceivers 444 and 442, stream traffic manager 212, DMA 110, and endpoint 108. In sending packetized data to and/or receiving packetized data from the kernel circuits in IC 402, host system 102 may operate substantially as described in connection with FIG. 2, with driver 130 creating a read queue and a write queue for each kernel circuit, whether implemented in IC 104 or in IC 402.
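The four paths enumerated above differ only in whether the transceivers and the satellite stream traffic manager are traversed. The sketch below records that observation as a lookup. The function names are illustrative assumptions, and the block labels are the reference numerals used in the description.

```python
PRIMARY_IC = "IC 104"
SECONDARY_IC = "IC 402"


def host_to_kernel_path(ic):
    """Blocks traversed by packetized data sent from the host to a kernel."""
    path = ["endpoint 108", "DMA 110", "stream traffic manager 212"]
    if ic == SECONDARY_IC:
        path += ["transceiver 442", "transceiver 444",
                 "satellite stream traffic manager 412"]
    elif ic != PRIMARY_IC:
        raise ValueError(f"unknown IC: {ic}")
    return path


def kernel_to_host_path(ic):
    """Result data retraces the same blocks in the opposite order."""
    return list(reversed(host_to_kernel_path(ic)))
```

For a kernel circuit in IC 402, `kernel_to_host_path("IC 402")` begins at satellite stream traffic manager 412 and ends at endpoint 108, matching the order given in the paragraph above.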

The architecture shown in FIGS. 1, 2, and 4 allows an upstream kernel circuit to stream data to any available downstream kernel circuit without requiring more complex interconnect circuitry that supports a direct connection between each possible pair of kernel circuits. The architecture of FIGS. 1, 2, and 4 implements this capability by having the upstream kernel circuit output data to the stream traffic manager circuit (for purposes of explanation, "stream traffic manager circuit" refers to the stream traffic manager, the satellite stream traffic manager, or both operating in a cooperative manner). The stream traffic manager circuit routes the data to the downstream kernel circuit. Because the stream traffic manager circuit regulates the data using credits, large store-and-forward buffers are not required. Furthermore, host system 102 is not involved in the data transfer. As an illustrative and non-limiting example, the upstream kernel circuit, e.g., the sending kernel circuit, performs compression and the downstream kernel circuit performs encryption. The upstream kernel circuit sends the resulting compressed data to the stream traffic manager circuit, which routes the data, as packetized by the output buffer of the sending kernel circuit, to the downstream kernel circuit, e.g., the receiving kernel circuit. The input buffer of the receiving kernel circuit converts the packetized data into a data stream. The downstream kernel circuit may provide the resulting encrypted data to the stream traffic manager circuit, which may then route the encrypted data to yet another kernel circuit or provide the encrypted data to host system 102.
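As an illustrative and non-limiting sketch, credit-based regulation of a single sender-to-receiver route can be modeled as follows. This description does not specify the credit scheme in detail, so the sketch assumes one credit per free slot in the input buffer of the receiving kernel circuit; the class and method names are hypothetical.

```python
from collections import deque


class CreditedLink:
    """One route regulated by credits (illustrative model only)."""

    def __init__(self, receiver_slots):
        # Assumption: one credit corresponds to one free input-buffer slot.
        self.credits = receiver_slots
        self.input_buffer = deque()

    def try_forward(self, packet):
        """Forward a packet only if a credit is available."""
        if self.credits == 0:
            # The packet stays in the sender's output buffer, so no large
            # intermediate store-and-forward buffer is needed.
            return False
        self.credits -= 1
        self.input_buffer.append(packet)
        return True

    def consume(self):
        """The receiving kernel takes a packet, which returns one credit."""
        packet = self.input_buffer.popleft()
        self.credits += 1
        return packet


if __name__ == "__main__":
    link = CreditedLink(receiver_slots=2)
    print(link.try_forward("a"), link.try_forward("b"), link.try_forward("c"))
    link.consume()
    print(link.try_forward("c"))
```

Because a packet is only moved when the receiver is known to have room, the sender holds back data rather than the routing circuit staging it.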

The streaming architecture described within this disclosure also allows the place and route functions of an EDA application (executed by a data processing system) to operate more efficiently (i.e., to require less time to complete), because the place and route tools need not consider the relative placement of upstream and downstream kernel circuits. This is particularly important when two or more kernel circuits that exchange data via data streams are located in different dies and/or different ICs.

Without the exemplary streaming architecture described within this disclosure, direct kernel-to-kernel routing would need to be implemented between each pair of kernel circuits intended to communicate. This type of connectivity imposes significant restrictions on the placement and routing of kernel circuits in order to meet timing requirements, and becomes even more difficult when crossing die boundaries and/or IC boundaries. Furthermore, using the architecture described herein also achieves higher clock speeds for the implemented kernel circuits while providing the described flexibility. An architecture using point-to-point connections between each possible pair of kernel circuits would require so many resources of the programmable circuitry that the resulting implementation would operate at a slower clock frequency than is achievable using the exemplary streaming architecture described herein.
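As an illustrative and non-limiting aid, the difference in interconnect resources can be seen with a simple counting model. The model is an assumption made for explanation only and is not part of this description: it counts one bidirectional link per pair of kernel circuits for the point-to-point case, and one link per kernel circuit to the stream traffic manager circuit for the streaming case.

```python
def point_to_point_links(n_kernels):
    """Links needed to directly connect every possible pair of kernels."""
    return n_kernels * (n_kernels - 1) // 2


def hub_links(n_kernels):
    """Links needed when each kernel connects only to the traffic manager."""
    return n_kernels


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, point_to_point_links(n), hub_links(n))
```

Under this model the point-to-point count grows quadratically with the number of kernel circuits while the streaming count grows linearly, for example 120 links versus 16 links for 16 kernel circuits.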

In the example of FIG. 4, IC 104 and IC 402 are both implemented as multi-die ICs. In one or more other embodiments, one or both of IC 104 and IC 402 are implemented as single-die ICs that include a transceiver.

FIG. 5 illustrates an exemplary method 500 of exchanging data between kernel circuits using data streams. Method 500 can begin in a state in which a host system has offloaded tasks to kernel circuits within a hardware accelerator. In one or more embodiments, method 500 begins in a state after blocks 305-320 of FIG. 3 have been performed and/or for each IC involved in the data transfer. In the example of FIG. 5, a kernel circuit, referred to herein as the sending kernel circuit, performs one operation in a sequence of operations, where each operation is performed by a different kernel circuit.

At block 505, the sending kernel circuit outputs, or stores, a data stream in the output buffer attached to its output port. At block 510, the stream traffic manager circuit detects the data stream stored in the output buffer of the sending kernel circuit. The stream traffic manager circuit is capable of monitoring the status of the buffers as described with respect to FIG. 2. In one or more embodiments, the data stream includes information specifying a destination for the data. The destination, in this example, is not the host system but rather another kernel circuit, referred to as the receiving kernel circuit. In one or more other embodiments, the stream traffic manager circuit is configured, for example using the mapping data previously described, to route data from the sending kernel circuit to another destination such as the receiving kernel circuit and/or the host system. At block 515, the stream traffic manager circuit determines the receiving kernel circuit. The stream traffic manager circuit is capable of, for example, reading the data stream stored in the output buffer of the sending kernel circuit and determining the receiving kernel circuit specified therein. In another example, the stream traffic manager determines the receiving kernel circuit based on mapping data (e.g., a mapping of a particular kernel circuit output to a destination) stored in the stream traffic manager.

At block 520, the stream traffic manager circuit determines whether the input buffer of the receiving kernel circuit has sufficient space available to store the data stream from the sending kernel circuit. At block 525, in response to determining that the input buffer of the receiving kernel circuit has sufficient space, the stream traffic manager circuit initiates a data transfer from the sending kernel circuit to the receiving kernel circuit. The stream traffic manager circuit transfers the data from the output buffer of the sending kernel circuit to the input buffer of the receiving kernel circuit through the interconnect(s) and/or, where a cross-IC data transfer is performed, through the transceivers. In one or more embodiments, when transferring data between kernel circuits in the same die, the data may be sent through the associated interconnect under control of the stream traffic manager circuit without passing through the stream traffic manager circuit itself.
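As an illustrative and non-limiting sketch, blocks 510 through 525 can be modeled in software as follows. The `Kernel` and `StreamTrafficManager` classes, the dictionary packet layout with an optional `"dest"` field, and the one-packet granularity of the space check are all assumptions made for explanation; they are not taken from this description.

```python
from collections import deque


class Kernel:
    """Software stand-in for a kernel circuit with its two buffers."""

    def __init__(self, name, input_capacity):
        self.name = name
        self.output_buffer = deque()
        self.input_buffer = deque()
        self.input_capacity = input_capacity

    def free_input_slots(self):
        return self.input_capacity - len(self.input_buffer)


class StreamTrafficManager:
    def __init__(self, kernels, mapping=None):
        self.kernels = {k.name: k for k in kernels}
        # Hypothetical mapping data: sender name -> destination name.
        self.mapping = mapping or {}

    def service(self, sender_name):
        """Try to move one packet; return True if a transfer occurred."""
        sender = self.kernels[sender_name]
        if not sender.output_buffer:            # block 510: nothing detected
            return False
        packet = sender.output_buffer[0]
        # Block 515: destination named in the data, else from mapping data.
        dest_name = packet.get("dest") or self.mapping.get(sender_name)
        if dest_name is None:
            raise LookupError(f"no destination for {sender_name}")
        receiver = self.kernels[dest_name]
        if receiver.free_input_slots() < 1:     # block 520: space check
            return False
        # Block 525: transfer from output buffer to input buffer.
        receiver.input_buffer.append(sender.output_buffer.popleft())
        return True


if __name__ == "__main__":
    compress = Kernel("compress", input_capacity=1)
    encrypt = Kernel("encrypt", input_capacity=1)
    manager = StreamTrafficManager([compress, encrypt], {"compress": "encrypt"})
    compress.output_buffer.append({"payload": b"block-0"})
    print(manager.service("compress"))
```

The sketch tries the in-data destination first and falls back to the mapping data, which covers both alternatives described for block 515.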

In particular embodiments, the data stream from the sending kernel circuit includes one or more instructions in-band within the data stream. In one example, the instructions are included in the payload portion of the data stream (or packetized data) sent from the sending kernel circuit to the receiving kernel circuit. As discussed, the output buffer of the sending kernel circuit converts the data stream into packetized data for sending to the receiving kernel circuit. The input buffer of the receiving kernel circuit converts the received packetized data into a data stream that is provided to the receiving kernel circuit.
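As an illustrative and non-limiting sketch, the conversion performed by the output buffer and the input buffer, together with an in-band instruction carried in the payload, can be modeled as follows. The packet fields (`dest`, `seq`, `payload`), the fixed payload size, and the one-byte opcode value are hypothetical; this description does not define a packet format.

```python
OPCODE_ENCRYPT = 0x01  # hypothetical in-band instruction value


def packetize(stream, dest, payload_size=4):
    """Output-buffer role: split a byte stream into addressed packets."""
    return [
        {"dest": dest, "seq": i // payload_size, "payload": stream[i:i + payload_size]}
        for i in range(0, len(stream), payload_size)
    ]


def depacketize(packets):
    """Input-buffer role: rebuild the byte stream from received packets."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    return b"".join(p["payload"] for p in ordered)


if __name__ == "__main__":
    # The instruction travels in-band as the first byte of the payload.
    stream = bytes([OPCODE_ENCRYPT]) + b"compressed"
    packets = packetize(stream, dest="encrypt")
    rebuilt = depacketize(packets)
    print(len(packets), rebuilt[0] == OPCODE_ENCRYPT, rebuilt[1:])
```

The receiving side recovers exactly the stream that the sending kernel circuit produced, so the instruction byte reaches the receiving kernel circuit without any separate control channel.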

In the example of FIG. 5, it should be appreciated that a data stream may be sent from a kernel circuit in die 404 to a kernel circuit in die 406, or from a kernel circuit in die 406 to a kernel circuit in die 404. Similarly, a data stream may be sent from a kernel circuit in die 408 to a kernel circuit in die 410, or from a kernel circuit in die 410 to a kernel circuit in die 408.

The example of FIG. 5 refers to a stream traffic manager circuit. In this regard, method 500 may be performed where the stream traffic manager performs the described operations (e.g., both the sending kernel circuit and the receiving kernel circuit are in IC 104), where the satellite stream traffic manager performs the described operations (e.g., both the sending kernel circuit and the receiving kernel circuit are in IC 402), or where both the stream traffic manager and the satellite stream traffic manager perform the operations (e.g., the sending kernel circuit and the receiving kernel circuit are in different ICs). In the latter case, it should be appreciated that each of the stream traffic manager and the satellite stream traffic manager interacts with the kernel circuits located in its own IC.

For example, where the sending kernel circuit and the receiving kernel circuit are located in different ICs, the stream traffic manager and the satellite stream traffic manager are capable of communicating via transceivers 442 and 444 to determine the status of the input buffers and output buffers of the kernel circuits. For example, the stream traffic manager is capable of determining the status of buffers in IC 104, while the satellite stream traffic manager is capable of determining the status of buffers in IC 402. The stream traffic manager is capable of requesting the status of any buffer in IC 402 from the satellite stream traffic manager, and the satellite stream traffic manager responds with the requested status(es). Similarly, the satellite stream traffic manager is capable of requesting the status of any buffer in IC 104 from the stream traffic manager, and the stream traffic manager responds with the requested status(es). Communication between the stream traffic manager and the satellite stream traffic manager supports sending kernel circuits and receiving kernel circuits that are located in the same die of IC 104 or different dies of IC 104, in the same die of IC 402 or different dies of IC 402, or in different ICs.
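As an illustrative and non-limiting sketch, the status exchange between the stream traffic manager and the satellite stream traffic manager can be modeled as a request and response between two peers. The `Manager` class, the buffer names, and the use of a free-slot count as the "status" are hypothetical; a direct method call stands in for the exchange over transceivers 442 and 444.

```python
class Manager:
    """Stand-in for a (satellite) stream traffic manager in one IC."""

    def __init__(self, ic, buffer_status):
        self.ic = ic
        # Hypothetical status: buffer name -> number of free slots.
        self.buffer_status = dict(buffer_status)
        self.peer = None

    def local_status(self, names):
        """Answer a status request for buffers located in this IC."""
        return {name: self.buffer_status[name] for name in names}

    def status(self, ic, names):
        """Resolve status locally, or by asking the peer manager."""
        if ic == self.ic:
            return self.local_status(names)
        if self.peer is None or self.peer.ic != ic:
            raise LookupError(f"no manager reachable for {ic}")
        # Stands in for a request/response over the transceivers.
        return self.peer.local_status(names)


if __name__ == "__main__":
    stm = Manager("IC104", {"k0_in": 2, "k0_out": 0})
    satellite = Manager("IC402", {"k5_in": 1, "k5_out": 3})
    stm.peer, satellite.peer = satellite, stm
    print(stm.status("IC402", ["k5_in"]))
    print(satellite.status("IC104", ["k0_in", "k0_out"]))
```

Each manager answers only for buffers in its own IC, and either manager can obtain the status of a remote buffer by asking the other.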

FIG. 6 illustrates an exemplary system 600 for use with one or more embodiments described herein. System 600 is an example of computer hardware that may be used to implement a computer, a server, a portable computer such as a laptop or tablet computer, or another data processing system. For example, system 600 is an exemplary implementation of host system 102 and/or of another system that executes an EDA application to generate the container files described herein.

In the example of FIG. 6, system 600 includes at least one processor 605. Processor 605 is coupled to memory 610 through interface circuitry 615. System 600 is capable of storing computer-readable instructions (also referred to as "program code") within memory 610. Memory 610 is an example of a computer-readable storage medium. Processor 605 is capable of executing the program code accessed from memory 610 via interface circuitry 615.

Memory 610 may include one or more physical memory devices such as, for example, a local memory and a bulk storage device. Local memory refers to non-persistent memory device(s) generally used during actual execution of the program code. Examples of local memory include RAM and/or any of the various types of RAM that are suitable for use by a processor during execution of program code (e.g., dynamic RAM or "DRAM," or static RAM or "SRAM"). A bulk storage device refers to a persistent data storage device. Examples of bulk storage devices include, but are not limited to, a hard disk drive (HDD), a solid-state drive (SSD), flash memory, read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or other suitable memory. System 600 may also include one or more cache memories (not shown) that provide temporary storage of at least some program code in order to reduce the number of times program code must be retrieved from the bulk storage device during execution.

Memory 610 is capable of storing program code and/or data. In one or more embodiments, when system 600 implements a system such as host system 102, memory 610 is capable of storing a framework that is the same as or similar to the framework described in connection with FIG. 1, for execution by system 600. The framework may also include an operating system. One or more containers may also be stored in memory 610 for implementation within a hardware accelerator 625 attached to system 600 through interface circuitry 615. Hardware accelerator 625 includes one or more ICs having an architecture the same as or similar to that described in connection with FIG. 7.

In one or more other embodiments, system 600 implements an EDA system that executes an EDA application. Accordingly, system 600 is capable of processing program code that specifies a kernel in order to generate the specified kernel circuit as a configuration bitstream or, in some cases, a partial configuration bitstream. System 600 includes the configuration bitstream(s) within a container file. Further, system 600 is capable of generating mapping information and including the mapping information within the container file as metadata. In embodiments where system 600 implements an EDA system, hardware accelerator 625 may or may not be included.
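As an illustrative and non-limiting sketch, packaging configuration bitstream(s) together with mapping information as metadata can be modeled as follows. The container layout shown here (a ZIP archive holding `bitstreams/<name>.bit` entries and a `metadata.json` entry) is purely hypothetical and is not the container file format used by any actual tool; the placeholder bytes are not real bitstreams.

```python
import io
import json
import zipfile


def build_container(bitstreams, mapping):
    """Bundle placeholder bitstreams and mapping metadata into one blob."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, bits in bitstreams.items():
            archive.writestr(f"bitstreams/{name}.bit", bits)
        # Mapping information is stored alongside the bitstreams as metadata.
        archive.writestr("metadata.json", json.dumps({"mapping": mapping}))
    return buffer.getvalue()


def read_mapping(container):
    """Recover the mapping metadata, as a host-side framework might."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        return json.loads(archive.read("metadata.json"))["mapping"]


if __name__ == "__main__":
    container = build_container(
        {"compress": b"\x00\x01", "encrypt": b"\x02\x03"},
        {"compress.out": "encrypt.in", "encrypt.out": "host"},
    )
    print(read_mapping(container))
```

Keeping the mapping inside the same file as the bitstream(s) means that whatever loads the kernel circuits can also obtain the routing information needed to configure the stream traffic manager circuit.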

System 600, e.g., processor 605, is capable of executing the operating system, application(s), and/or framework described herein to perform the operations described within this disclosure. As such, the instructions and/or data stored in memory 610 may be considered an integrated part of system 600. Further, it should be appreciated that any data used, generated, and/or operated upon by system 600 (e.g., processor 605) are functional data structures that impart functionality when employed as part of the system.

Examples of interface circuitry 615 include, but are not limited to, a system bus and an input/output (I/O) bus. Interface circuitry 615 may be implemented using any of a variety of bus architectures. Examples of bus architectures may include, but are not limited to, an Enhanced Industry Standard Architecture (EISA) bus, an Accelerated Graphics Port (AGP), a Video Electronics Standards Association (VESA) local bus, a Universal Serial Bus (USB), and a PCIe bus.

System 600 may further include one or more I/O devices 620 coupled to interface circuitry 615. I/O devices 620 may be coupled to system 600, e.g., to interface circuitry 615, either directly or through intervening I/O controllers. Examples of I/O devices 620 include, but are not limited to, a keyboard, a display device, a pointing device, one or more communication ports, and a network adapter. A network adapter refers to circuitry that enables system 600 to become coupled to other systems, computer systems, remote printers, and/or remote storage devices through intervening private or public networks. Modems, cable modems, Ethernet cards, and wireless transceivers are examples of different types of network adapters that may be used with system 600.

System 600 may include fewer components than shown, or additional components not illustrated in FIG. 6, depending upon the particular type of device and/or system that is implemented. In addition, the particular operating system, application(s), and/or I/O devices included may vary based upon system type. Further, one or more of the illustrative components may be incorporated into, or otherwise form a portion of, another component. For example, a processor may include at least some memory. System 600 may be used to implement a single computer or a plurality of networked or interconnected computers, each implemented using the architecture of FIG. 6 or an architecture similar thereto.

Some ICs, referred to as programmable ICs, can be programmed to perform specified functions. One example of an IC that can be programmed is an FPGA. An FPGA typically includes an array of programmable tiles. These programmable tiles may include, for example, input/output blocks (IOBs), configurable logic blocks (CLBs), dedicated random access memory blocks (BRAMs), multipliers, digital signal processing blocks (DSPs), processors, clock managers, delay-locked loops (DLLs), and so forth.

Each programmable tile typically includes both programmable interconnect circuitry and programmable logic circuitry. The programmable interconnect circuitry typically includes a large number of interconnect lines of varying lengths interconnected by programmable interconnect points (PIPs). The programmable logic circuitry implements the logic of a user design using programmable elements that may include, for example, function generators, registers, arithmetic logic, and so forth.

The programmable interconnect circuitry and the programmable logic circuitry are typically programmed by loading a stream of configuration data into internal configuration memory cells that define how the programmable elements are configured. The configuration data can be read from memory (e.g., from an external PROM) or written into the FPGA by an external device. The collective states of the individual memory cells then determine the function of the FPGA.

Another type of programmable IC is the complex programmable logic device, or CPLD. A CPLD includes two or more "function blocks" connected together and to input/output (I/O) resources by an interconnect switch matrix. Each function block of the CPLD includes a two-level AND/OR structure similar to those used in programmable logic arrays (PLAs) and programmable array logic (PAL) devices. In CPLDs, configuration data is typically stored on-chip in non-volatile memory. In some CPLDs, configuration data is stored on-chip in non-volatile memory and then downloaded to volatile memory as part of an initial configuration (programming) sequence.

For all of these programmable ICs, the functionality of the device is controlled by data bits provided to the device for that purpose. The data bits may be stored in volatile memory (e.g., static memory cells, as in FPGAs and some CPLDs), in non-volatile memory (e.g., flash memory, as in some CPLDs), or in any other type of memory cell.

Other programmable ICs are programmed by applying a processing layer, such as a metal layer, that programmably interconnects the various elements on the device. These programmable ICs are known as mask-programmable devices. Programmable ICs may also be implemented in other ways, e.g., using fuse or antifuse technology. The phrase "programmable IC" may include, but is not limited to, these devices and may further encompass devices that are only partially programmable. For example, one type of programmable IC includes a combination of hard-coded transistor logic and a programmable switch fabric that programmably interconnects the hard-coded transistor logic.

FIG. 7 illustrates an exemplary architecture 700 for an IC. In one aspect, architecture 700 may be implemented within a programmable IC. For example, architecture 700 may be used to implement an FPGA. Architecture 700 may also be representative of a system-on-chip (SoC) type of IC. An SoC is an IC that includes a processor that executes program code and one or more other circuits. The other circuits may be implemented as hardwired circuitry, programmable circuitry, and/or a combination thereof. The circuits may operate cooperatively with one another and/or with the processor.

As shown, architecture 700 includes several different types of programmable circuit, e.g., logic, blocks. For example, architecture 700 may include a large number of different programmable tiles including multi-gigabit transceivers (MGTs) 701, configurable logic blocks (CLBs) 702, BRAMs 703, input/output blocks (IOBs) 704, configuration and clocking logic (CONFIG/CLOCKS) 705, digital signal processing blocks (DSPs) 706, specialized I/O blocks 707 (e.g., configuration ports and clock ports), and other programmable logic 708 such as digital clock managers, analog-to-digital converters, system monitoring logic, and so forth.

In some ICs, each programmable tile includes a programmable interconnect element (INT) 711 having standardized connections to and from a corresponding INT 711 in each adjacent tile. Therefore, the INTs 711, taken together, implement the programmable interconnect structure for the illustrated IC. Each INT 711 also includes the connections to and from the programmable logic element within the same tile, as shown by the examples included at the top of FIG. 7.

For example, a CLB 702 may include a configurable logic element (CLE) 712 that may be programmed to implement user logic, plus a single INT 711. A BRAM 703 may include a BRAM logic element (BRL) 713 in addition to one or more INTs 711. Typically, the number of INTs 711 included in a tile depends on the height of the tile. As pictured, a BRAM tile has the same height as five CLBs, but other numbers (e.g., four) may also be used. A DSP tile 706 may include a DSP logic element (DSPL) 714 in addition to an appropriate number of INTs 711. An IOB 704 may include, for example, two instances of an I/O logic element (IOL) 715 in addition to one instance of an INT 711. The actual I/O pads connected to IOL 715 may not be confined to the area of IOL 715.

In the example pictured in FIG. 7, a columnar area near the center of the die, e.g., formed of regions 705, 707, and 708, may be used for configuration, clock, and other control logic. Horizontal areas 709 extending from this column may be used to distribute the clocks and configuration signals across the breadth of the programmable IC.

Some ICs utilizing the architecture illustrated in FIG. 7 include additional logic blocks that disrupt the regular columnar structure making up a large part of the IC. The additional logic blocks may be programmable blocks and/or dedicated circuitry. For example, a processor block depicted as PROC 710 spans several columns of CLBs and BRAMs.

In one aspect, PROC 710 may be implemented as dedicated circuitry, e.g., as a hardwired processor, that is fabricated as part of the die that implements the programmable circuitry of the IC. PROC 710 may represent any of a variety of different processor types and/or systems ranging in complexity from an individual processor, e.g., a single core capable of executing program code, to an entire processor system having one or more cores, modules, co-processors, interfaces, or the like.

In another aspect, PROC 710 may be omitted from architecture 700 and replaced with one or more of the other varieties of the programmable blocks described. Further, such blocks may be utilized to form a "soft processor" in that the various blocks of programmable circuitry may be used to form a processor that can execute program code, as is the case with PROC 710.

The phrase "programmable circuitry" refers to programmable circuit elements within an IC, e.g., the various programmable or configurable circuit blocks or tiles described herein, as well as the interconnect circuitry that selectively couples the various circuit blocks, tiles, and/or elements according to configuration data that is loaded into the IC. For example, the circuit blocks shown in FIG. 7 that are external to PROC 710, such as CLBs 702, are considered programmable circuitry of the IC.

In general, the functionality of programmable circuitry is not established until configuration data is loaded into the IC. A set of configuration bits may be used to program the programmable circuitry of an IC such as an FPGA. The configuration bit(s) are typically referred to as a "configuration bitstream." In general, programmable circuitry is not operational or functional without first loading a configuration bitstream into the IC. The configuration bitstream effectively implements a particular circuit design within the programmable circuitry. The circuit design specifies, for example, functional aspects of the programmable circuit blocks and the physical connectivity among the various programmable circuit blocks.

Circuitry that is "hardwired" or "hardened," i.e., not programmable, is manufactured as part of the IC. Unlike programmable circuitry, hardwired circuitry or circuit blocks are not implemented after the manufacture of the IC through the loading of a configuration bitstream. Hardwired circuitry is generally considered to have dedicated circuit blocks and interconnects, for example, that are functional without first loading a configuration bitstream into the IC, e.g., PROC 710.

In some instances, hardwired circuitry may have one or more operational modes that can be set or selected according to register settings or values stored in one or more memory elements within the IC. The operational modes may be set, for example, through the loading of a configuration bitstream into the IC. Despite this ability, hardwired circuitry is not considered programmable circuitry because the hardwired circuitry is operable and has a particular function when manufactured as part of the IC.

In the case of an SoC, the configuration bitstream may specify the circuitry that is to be implemented within the programmable circuitry and the program code that is to be executed by PROC 710 or a soft processor. In some cases, architecture 700 includes a dedicated configuration processor that loads the configuration bitstream into the appropriate configuration memory and/or processor memory. The dedicated configuration processor does not execute user-specified program code. In other cases, architecture 700 may utilize PROC 710 to receive the configuration bitstream, load the configuration bitstream into the appropriate configuration memory, and/or extract program code for execution.

FIG. 7 is intended to illustrate an example architecture that may be used to implement an IC that includes programmable circuitry, e.g., a programmable fabric. For example, the number of logic blocks in a column, the relative width of the columns, the number and order of columns, the types of logic blocks included in the columns, the relative sizes of the logic blocks, and the interconnect/logic implementations included at the top of FIG. 7 are purely illustrative. In an actual IC, for example, more than one adjacent column of CLBs is typically included wherever the CLBs appear, to facilitate the efficient implementation of a user circuit design. The number of adjacent CLB columns, however, may vary with the overall size of the IC. Further, the size and/or positioning of blocks such as PROC 710 within the IC are for purposes of illustration only and are not intended as limitations.

As described, an IC implemented using architecture 700, or an architecture similar to architecture 700, may be used to implement the streaming architectures described herein. In one or more embodiments, endpoint 108, DMA 110, stream traffic manager 212, satellite stream traffic manager 412, interconnects 214 and 216, buffers 218-232, and kernel circuits 234 may be implemented using programmable circuitry. In one or more other embodiments, selected ones of the circuit blocks, such as endpoint 108, DMA 110, and/or the interconnects, may be implemented as hardened or hardwired circuit blocks. In one or more embodiments, the input buffers and/or output buffers may be implemented as AXI4-Stream data FIFOs.
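The role of a fixed-depth input buffer between the stream traffic manager and a kernel circuit can be sketched in software. The following is a minimal behavioral model under stated assumptions: `BoundedFifo` and `transfer` are hypothetical names, the model tracks only occupancy and ordering, and it does not represent AXI4-Stream signaling or the packet-to-stream conversion performed by the buffers.

```python
from collections import deque


class BoundedFifo:
    """Behavioral model of a fixed-depth buffer (hypothetical, occupancy only)."""

    def __init__(self, depth):
        self.depth = depth
        self._items = deque()

    def has_space(self):
        return len(self._items) < self.depth

    def push(self, item):
        if not self.has_space():
            raise OverflowError("buffer full")
        self._items.append(item)

    def pop(self):
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


def transfer(pending, buffer):
    """Move packets from `pending` into `buffer` only while the buffer has space."""
    moved = 0
    while pending and buffer.has_space():
        buffer.push(pending.popleft())
        moved += 1
    return moved


pending = deque(["pkt0", "pkt1", "pkt2"])
input_buffer = BoundedFifo(depth=2)

first = transfer(pending, input_buffer)   # stops once the buffer is full
consumed = input_buffer.pop()             # the kernel drains one entry
second = transfer(pending, input_buffer)  # the freed entry admits the last packet
```

Gating each transfer on available space means the sender never overruns the buffer; the third packet simply waits until the consumer has drained an entry.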

In particular embodiments, any buffers or queues described as being located in IC 104 may be implemented using available memory resources (e.g., BRAMs) or other similar circuit blocks available within IC 104, as opposed to using slower off-chip RAM. For example, buffers 218-232, the queues in stream traffic manager 212, and/or the queues in DMA 110 may be implemented using memory resources available on the IC.

The architectures described herein are provided for purposes of illustration and not limitation. For example, an IC may include fewer or more kernel circuits than shown in the figures. Further, the number of queues in the driver and the number of queues in the buffers implemented within the IC vary with the number of kernel circuits implemented using the programmable circuitry of the IC.

For purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the various inventive concepts disclosed herein. The terminology used herein, however, is for the purpose of describing particular aspects of the inventive arrangements only and is not intended to be limiting.

As defined herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

As defined herein, the term "approximately" means nearly correct or exact, close in value or amount but not precise. For example, the term "approximately" may mean that the recited characteristic, parameter, or value is within a predetermined amount of the exact characteristic, parameter, or value.

As defined herein, the terms "at least one," "one or more," and "and/or" are open-ended expressions that are both conjunctive and disjunctive in operation unless explicitly stated otherwise. For example, each of the expressions "at least one of A, B, and C," "at least one of A, B, or C," "one or more of A, B, and C," "one or more of A, B, or C," and "A, B, and/or C" means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B, and C together.
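The seven cases listed for three items are exactly the non-empty combinations of A, B, and C. As an illustrative sketch only (the function name `covered_cases` is hypothetical), they can be enumerated as follows:

```python
from itertools import combinations


def covered_cases(items):
    """Return every non-empty combination of `items`, smallest combinations first."""
    return [
        combo
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


cases = covered_cases(["A", "B", "C"])
```

For three items this yields 2^3 - 1 = 7 combinations, in the same order as the definition lists them.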

As defined herein, the term "automatically" means without user intervention. As defined herein, the term "user" means a human being.

As defined herein, the term "computer-readable storage medium" means a storage medium that contains or stores program code for use by or in connection with an instruction execution system, apparatus, or device. As defined herein, a "computer-readable storage medium" is not a transitory, propagating signal per se. A computer-readable storage medium may be, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. The various forms of memory described herein are examples of computer-readable storage media. A non-exhaustive list of more specific examples of a computer-readable storage medium may include: a portable computer diskette, a hard disk, RAM, read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), electronically erasable programmable read-only memory (EEPROM), static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, or the like.

As defined herein, the term "if" means "when" or "upon" or "in response to" or "responsive to," depending on the context. Thus, the phrase "if it is determined" or "if [a stated condition or event] is detected" may be construed to mean "upon determining" or "in response to determining" or "upon detecting [the stated condition or event]" or "in response to detecting [the stated condition or event]" or "responsive to detecting [the stated condition or event]," depending on the context.

As defined herein, the term "responsive to" and similar language as described above, e.g., "if," "when," or "upon," means responding or reacting readily to an action or event. The response or reaction is performed automatically. Thus, if a second action is performed "responsive to" a first action, there is a causal relationship between an occurrence of the first action and an occurrence of the second action. The term "responsive to" indicates the causal relationship.

As defined herein, the terms "one embodiment," "an embodiment," "one or more embodiments," "particular embodiments," or similar language mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment described within this disclosure. Thus, appearances of the phrases "in one embodiment," "in an embodiment," "in one or more embodiments," "in particular embodiments," and similar language throughout this disclosure may, but do not necessarily, all refer to the same embodiment. The terms "embodiment" and "arrangement" are used interchangeably within this disclosure.

As defined herein, the term "processor" means at least one hardware circuit capable of carrying out instructions contained in program code. The hardware circuit may be an integrated circuit. Examples of a processor include, but are not limited to, a central processing unit (CPU), an array processor, a vector processor, a digital signal processor (DSP), and a controller.

As defined herein, the term "output" means storing in physical memory elements, e.g., devices, writing to a display or other peripheral output device, sending or transmitting to another system, exporting, or the like.

As defined herein, the term "substantially" means that the recited characteristic, parameter, or value need not be achieved exactly, but that deviations or variations, including, for example, tolerances, measurement error, measurement accuracy limitations, and other factors known to those of skill in the art, may occur in amounts that do not preclude the effect the characteristic was intended to provide.

The terms first, second, etc. may be used herein to describe various elements. These elements should not be limited by these terms, as these terms are only used to distinguish one element from another unless stated otherwise or the context clearly indicates otherwise.

A computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the inventive arrangements described herein. Within this disclosure, the term "program code" is used interchangeably with the term "computer-readable program instructions." Computer-readable program instructions described herein may be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a LAN, a WAN, and/or a wireless network. The network may include copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge devices including edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.

Computer-readable program instructions for carrying out operations for the inventive arrangements described herein may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language and/or procedural programming languages. Computer-readable program instructions may include state-setting data. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a LAN or a WAN, or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some cases, electronic circuitry including, for example, programmable logic circuitry, an FPGA, or a PLA may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the inventive arrangements described herein.

Certain aspects of the inventive arrangements are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, may be implemented by computer-readable program instructions, e.g., program code.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions that implement aspects of the operations specified in the flowchart and/or block diagram block or blocks.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operations to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions that execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various aspects of the inventive arrangements. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions that comprises one or more executable instructions for implementing the specified operations.

In some alternative implementations, the operations noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved. In other examples, blocks may be performed generally in increasing numeric order, while in still other examples, one or more blocks may be performed in varying order, with the results being stored and utilized in subsequent or other blocks that do not immediately follow. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, may be implemented by special-purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special-purpose hardware and computer instructions.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements that may be found in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed.

The description of the inventive arrangements provided herein is for purposes of illustration and is not intended to be exhaustive or limited to the forms and examples disclosed. The terminology used herein was chosen to explain the principles of the inventive arrangements, the practical application or technical improvement over technologies found in the marketplace, and/or to enable others of ordinary skill in the art to understand the inventive arrangements disclosed herein. Modifications and variations may be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described inventive arrangements. Accordingly, reference should be made to the following claims, rather than to the foregoing disclosure, as indicating the scope of such features and implementations.

Claims

1. An integrated circuit, comprising:
a communication interface configured to communicate with a host system;
a direct memory access circuit coupled to the communication interface;
a kernel circuit implemented using programmable circuitry; and
a stream traffic manager circuit coupled to the direct memory access circuit and to the kernel circuit, the stream traffic manager circuit configured to control data streams exchanged between the host system and the kernel circuit,
wherein the stream traffic manager circuit is configured to set, on a per-kernel-circuit basis, at least one of a size of a data set or an amount of prefetch data for the kernel circuit.

2. The integrated circuit of claim 1, further comprising:
a first interconnect configured to receive packetized data from the stream traffic manager circuit and to deliver the packetized data to the kernel circuit; and
a second interconnect configured to receive data from the kernel circuit and provide the data to the stream traffic manager circuit.

3. The integrated circuit of claim 2, further comprising:
an input buffer coupled to an output port of the first interconnect and to an input port of the kernel circuit, the input buffer configured to temporarily store the packetized data, convert the packetized data into a data stream, and provide the data stream to the kernel circuit,
wherein the stream traffic manager circuit initiates a data transfer to the kernel circuit in response to determining that the input buffer has available space.

4. The integrated circuit of claim 3, further comprising:
an output buffer coupled to an output port of the kernel circuit and to an input port of the stream traffic manager circuit, the output buffer configured to temporarily store a data stream output from the kernel circuit, convert the data stream into further packetized data, and provide the further packetized data to the second interconnect,
wherein the stream traffic manager circuit initiates a further data transfer from the kernel circuit to the host system in response to determining that a buffer in the host system corresponding to the output buffer has available space and that the output buffer contains at least one complete packet.

5. The integrated circuit of claim 4, wherein the output buffer is configured to add tagging information to the further packetized data to identify the kernel circuit as a source kernel circuit.

6. The integrated circuit of claim 1, wherein:
the kernel circuit is one of a plurality of kernel circuits implemented in the programmable circuitry;
the stream traffic manager circuit is configured to interleave data streams exchanged with the plurality of kernel circuits; and
each kernel circuit of the plurality of kernel circuits is coupled to the stream traffic manager circuit through a buffer and an interconnect, the stream traffic manager circuit implementing a round-robin arbitration scheme for streaming data to each kernel circuit of the plurality of kernel circuits based on space availability in the buffer corresponding to that kernel circuit.

7. An integrated circuit, comprising:
a first kernel circuit implemented in programmable circuitry;
a second kernel circuit implemented in the programmable circuitry; and
a stream traffic manager circuit coupled to the first kernel circuit and the second kernel circuit, the stream traffic manager circuit configured to control data streams exchanged between the first kernel circuit and the second kernel circuit,
wherein the stream traffic manager circuit is configured to set at least one of a size of a data set or an amount of prefetch data for each of the first kernel circuit and the second kernel circuit.

8. The integrated circuit of claim 7, wherein a selected data stream sent from the first kernel circuit to the second kernel circuit includes in-band instructions for the second kernel circuit.

9. The integrated circuit of claim 7, wherein:
the first kernel circuit is coupled to a first interconnect through a first input buffer and a first output buffer;
the second kernel circuit is coupled to a second interconnect through a second input buffer and a second output buffer; and
the first interconnect and the second interconnect are coupled to the stream traffic manager circuit.

10. The integrated circuit of claim 9, wherein the first output buffer and the second output buffer are configured to temporarily store data streams output from the first kernel circuit and the second kernel circuit, respectively, convert the data streams into packetized data, and add tagging information to the packetized data to identify at least one of a source kernel circuit or a destination kernel circuit.

11. The integrated circuit of claim 10, wherein a selected data stream provided to the first kernel circuit or the second kernel circuit includes in-band instructions for the first kernel circuit or the second kernel circuit.

12. The integrated circuit of claim 9, wherein the first kernel circuit and the first interconnect are located on a first die of the integrated circuit, and the second kernel circuit and the second interconnect are located on a second die of the integrated circuit.

13. The integrated circuit of claim 12, wherein the stream traffic manager circuit is located on the first die.

14. The integrated circuit of claim 12, wherein:
the second input buffer is coupled to an input port of the second kernel circuit in the second die and configured to temporarily store data to be streamed to the second kernel circuit;
the first output buffer is coupled to an output port of the first kernel circuit in the first die and configured to temporarily store data output from the first kernel circuit; and
the stream traffic manager circuit is configured to initiate a data transfer from the first kernel circuit to the second kernel circuit in response to determining that the second input buffer has available space and the first output buffer is storing data.

15. The integrated circuit of claim 7, further comprising:
a first transceiver coupled to an input/output port of the stream traffic manager circuit,
wherein the first transceiver is configured to communicate with a second transceiver implemented in a different integrated circuit, and
wherein the stream traffic manager circuit is configured to control data streams exchanged with a further kernel circuit implemented in the different integrated circuit through communication, via the first transceiver and the second transceiver, with a satellite stream traffic manager circuit that is implemented in the different integrated circuit and coupled to the further kernel circuit.
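
The round-robin arbitration based on buffer space availability that the claims recite can be sketched in software. The following is an illustrative model only, not the claimed circuit: the function name `round_robin_grants`, the one-packet grant size, and the assumption that kernels do not drain their buffers during arbitration are all hypothetical simplifications.

```python
def round_robin_grants(pending, free_slots, start=0):
    """Issue one-packet grants to kernels in round-robin order.

    pending[i]    -- packets waiting to be streamed to kernel i
    free_slots[i] -- free entries in kernel i's input buffer
    Returns the kernel indices in the order grants were issued.
    Assumption: buffers are not drained while arbitration runs.
    """
    pending = list(pending)
    free_slots = list(free_slots)
    count = len(pending)
    grants = []
    turn = start
    idle = 0  # consecutive kernels visited without issuing a grant
    while count and idle < count:
        if pending[turn] > 0 and free_slots[turn] > 0:
            # Kernel has data waiting and its buffer has space: grant one packet.
            pending[turn] -= 1
            free_slots[turn] -= 1
            grants.append(turn)
            idle = 0
        else:
            # Nothing to send, or no space in the buffer: skip this kernel.
            idle += 1
        turn = (turn + 1) % count
    return grants


order = round_robin_grants(pending=[2, 3, 1], free_slots=[2, 1, 0])
```

In this run, kernel 2 is never granted because its buffer has no free entries, and kernel 1 is granted only once; the arbiter keeps rotating and serves kernel 0 a second time, so a kernel with a full buffer does not stall the others.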