CN111666106A - Data offload acceleration from multiple remote chips


Info

Publication number
CN111666106A
CN111666106A
Authority
CN
China
Prior art keywords
data
processor
offload
accelerator
buffer
Legal status
Pending
Application number
CN202010137127.3A
Other languages
Chinese (zh)
Inventor
N·罗伯森
E·托马斯
D·马西奥洛斯基
E·安格拉达
Current Assignee
Hewlett Packard Development Co LP
Hewlett Packard Enterprise Development LP
Original Assignee
Hewlett Packard Development Co LP
Application filed by Hewlett Packard Development Co LP filed Critical Hewlett Packard Development Co LP
Publication of CN111666106A


Classifications

    • G06F 9/4403: Bootstrapping; Processor initialisation
    • G06F 11/3006: Monitoring arrangements specially adapted to the computing system or component being monitored, where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F 15/781: System on chip; On-chip cache; Off-chip memory
    • G06F 9/44594: Program loading or initiating; Unloading
    • G06F 2201/83: Indexing scheme relating to error detection, error correction, and monitoring; solutions involving signatures
    • G06F 9/4401: Bootstrapping


Abstract

Embodiments of the present disclosure relate to data offload acceleration from multiple remote chips. A data offload accelerator offloads data from a plurality of remote chips to a processor. A specification of a plurality of addresses for retrieving data from the plurality of remote chips is received into an address buffer bank of the data offload accelerator. A command to initiate capture of data from the plurality of remote chips is received into an offload control device of the data offload accelerator. Data from the plurality of remote chips is captured in parallel into a data buffer bank of the data offload accelerator, and the processor is interrupted via the offload control device to pass at least a portion of the data to the processor.

Description

Data offload acceleration from multiple remote chips
Background
Application Specific Integrated Circuit (ASIC) chips typically include sideband interface slave ports for device configuration, management, and runtime status and monitoring functions. The interface may be defined by a proprietary physical/logical protocol or by industry standards such as Peripheral Component Interconnect Express (PCIe). A processor, such as a microprocessor or Baseboard Management Controller (BMC), may initiate transactions through the sideband ASIC interface to access and manipulate Control and Status Registers (CSRs) and memory-mapped data structures in addressable devices, such as ASICs.
Drawings
The features of the present disclosure are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements and in which:
FIG. 1 depicts a system architecture within which a data offload accelerator for offloading data from multiple remote chips to a processor may be implemented in accordance with one or more examples of the present disclosure;
FIG. 2 depicts a system architecture within which a data offload accelerator for offloading data from multiple remote chips to a processor may be implemented in accordance with one or more examples of the present disclosure;
FIG. 3 depicts a system architecture within which a data offload accelerator for offloading data from multiple remote chips to a processor may be implemented in accordance with one or more examples of the present disclosure;
FIG. 4 depicts a system architecture within which a data offload accelerator for offloading data from multiple remote chips to a processor may be implemented in accordance with one or more examples of the present disclosure;
FIG. 5 depicts a system architecture within which a data offload accelerator for offloading data from multiple remote chips to a processor may be implemented in accordance with one or more examples of the present disclosure;
FIG. 6 depicts a system architecture within which an initialization accelerator for initializing multiple remote chips may be implemented in accordance with one or more examples of the present disclosure;
FIG. 7 depicts a system architecture within which a data offload accelerator for offloading data from multiple remote chips to a processor may be implemented in accordance with one or more examples of the present disclosure;
FIG. 8 depicts a flow diagram of a method for offloading data from multiple remote chips to a processor in accordance with one or more examples of the present disclosure; and
FIG. 9 depicts a flow diagram of another method for offloading data from multiple remote chips to a processor in accordance with one or more examples of the present disclosure.
Detailed Description
A data collection solution, such as one for collecting telemetry, implements a loop in the BMC firmware that reads one CSR or memory-mapped data element at a time across one to several BMC-to-ASIC management communication interfaces. For the BMC to retrieve a single CSR, the firmware is designed to set up the request, start the transaction, and retrieve the data from the hardware when notified. During setup, the firmware formats the transaction and performs multiple separate writes to the hardware to prepare it for execution. The firmware then initiates the request with another hardware write. Once the hardware completes the transaction, the firmware is notified of the completion status via an asynchronous hardware interrupt, and any response data becomes available. Finally, the firmware performs multiple separate reads of the hardware to retrieve the return data. To collect the entire data set, this process is performed in a loop that requests each CSR or memory-mapped data element separately. This serialized process is slow due to the overhead (software, operating system, firmware, drivers, storage resources, etc.) involved in sequentially initiating requests and servicing each response.
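As a concrete illustration, the per-CSR overhead described above might look like the following C sketch from the BMC firmware's point of view. All register offsets, helper routines, and the wait mechanism are hypothetical stand-ins for whatever the actual sideband-interface driver provides.

```c
/* A minimal sketch of the serialized per-CSR read loop described above.
 * All register offsets and helper names are hypothetical. */
#include <stdint.h>

extern void     hw_write32(uint32_t reg, uint32_t val); /* hypothetical MMIO write */
extern uint64_t hw_read64(uint32_t reg);                /* hypothetical MMIO read  */
extern void     wait_for_completion_irq(void);          /* blocks until the hardware interrupt */

#define REQ_ADDR_REG   0x00 /* hypothetical: target CSR address */
#define REQ_CTRL_REG   0x04 /* hypothetical: transaction format */
#define REQ_START_REG  0x08 /* hypothetical: start bit          */
#define RESP_DATA_REG  0x10 /* hypothetical: response data      */

/* Each CSR costs a full setup/start/interrupt/read round trip. */
void collect_serially(const uint32_t *csr_addrs, uint64_t *out, int n)
{
    for (int i = 0; i < n; i++) {
        hw_write32(REQ_ADDR_REG, csr_addrs[i]);  /* setup: format the request  */
        hw_write32(REQ_CTRL_REG, 0x1 /* read */);
        hw_write32(REQ_START_REG, 1);            /* start the transaction      */
        wait_for_completion_irq();               /* async completion interrupt */
        out[i] = hw_read64(RESP_DATA_REG);       /* retrieve the return data   */
    }
}
```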
This configuration is sufficient if the BMC can access the ASIC locally and can maintain the expected performance (bandwidth, latency, persistence, burstiness, etc.) dictated by the high-level requirements of the system. However, this configuration does not meet performance expectations for certain system topologies.
A system and method related to data offload acceleration, such as for a one-to-many BMC-to-ASIC interface topology, are disclosed. A hardware data offload accelerator, for example within an Integrated Circuit (IC) such as a Field Programmable Gate Array (FPGA) or ASIC, reads, stores, and transfers data, such as telemetry data, from multiple chips, for example multiple ASICs, each having multiple chiplets with similar addressable memory and CSR structures. The data offload accelerator offloads the data to a processor remote from the plurality of chips, such as a BMC. As used herein, "offloading" data means sending or transferring data between one or more processors and one or more chips. The data offload accelerator includes a register set that allows the processor to control the data offload, along with one or more address buffers and one or more data buffers that implement it. The processor may specify remote memory addresses, such as CSR addresses or other mapped memory address locations, for offloading. The processor may specify these addresses within one or more address buffers, and may specify non-contiguous addresses at any independent location within the predefined address space of the ASIC to which the accelerator is attached.
In one example, the data offload accelerator includes "fast" and "slow" address buffers (and corresponding "fast" and "slow" data buffers). Here, "fast" means that data is offloaded to the processor at a relatively fast refresh rate (e.g., 100 Hz), while "slow" means that data, which may be a larger data set than the former, is offloaded to the processor at a relatively slow refresh rate (e.g., 1 Hz). Having initialized one or more address buffers, the processor may initiate data capture with a single write to a register within the data offload accelerator. When the data offload accelerator observes that its "start" bit is set, it issues a chip read for each of the specified addresses. These reads may be performed in parallel across multiple chip interfaces. As each chip responds, the data offload accelerator fills the appropriate data buffer in parallel with the relevant response data. After storing some or all of the response data, the data offload accelerator interrupts the processor. The processor then retrieves the stored data from the data offload accelerator.
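By contrast, the accelerated flow just described can be sketched as follows. The register offsets, buffer bases, and helper names are invented for illustration; the point is that one start write and one interrupt replace n request/response round trips.

```c
/* A hedged sketch of the accelerated flow: load the address buffer once,
 * issue a single "start" write, and read back the whole data set after
 * one interrupt. Offsets and names are illustrative only. */
#include <stdint.h>

extern void     accel_write32(uint32_t off, uint32_t val);
extern uint64_t accel_read64(uint32_t off);
extern void     wait_for_offload_irq(void);

#define FAST_ADDR_BUF   0x1000 /* hypothetical fast address buffer base  */
#define FAST_DATA_BUF   0x2000 /* hypothetical fast response buffer base */
#define ADDR_CNT_REG    0x0010 /* hypothetical address-count register    */
#define START_REG       0x0014 /* hypothetical start register            */

void collect_accelerated(const uint32_t *csr_addrs, uint64_t *out, int n)
{
    /* One-time (or infrequent) setup: specify the addresses to capture. */
    for (int i = 0; i < n; i++)
        accel_write32(FAST_ADDR_BUF + 4u * i, csr_addrs[i]);
    accel_write32(ADDR_CNT_REG, (uint32_t)n);

    /* A single write replaces n separate request/response round trips. */
    accel_write32(START_REG, 1);
    wait_for_offload_irq();   /* accelerator captured all chips in parallel */

    for (int i = 0; i < n; i++)
        out[i] = accel_read64(FAST_DATA_BUF + 8u * i);
}
```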
Example benefits include improving the speed of collecting data from multiple chips, because reads and writes of multiple CSRs within a chip, and particularly across, for example, tens or hundreds of chips, can be performed in parallel. Furthermore, speed is increased by significantly reducing transaction overhead. That is, a single write by the processor to the data offload accelerator replaces the overhead (software, OS, firmware, drivers, storage resources, etc.) involved in the processor initiating a request to each CSR and servicing each response. This single write triggers the data offload accelerator to issue multiple read requests in hardware in parallel and to save the responses in parallel to the appropriate response data buffers.
In a particular example, a data offload accelerator offloads a large amount of telemetry data from multiple chips within multiple remote ICs. For example, each of the chips within each remote IC may contain hundreds or thousands of CSRs (or other memory-mapped locations) from which telemetry data may be offloaded. In this example, when the processor issues a single write command, the data offload accelerator may issue multiple read requests to multiple chips in parallel, retrieve a first plurality of data responses from a first plurality of specified addresses in parallel, and store the first plurality of data responses in one or more data buffers in parallel. The data offload accelerator may optimize the ordering and request size of these read requests to increase data transfer efficiency over the IC interface. For example, if the data offload accelerator identifies a block of contiguous memory in a processor-defined address buffer, it may group discrete addresses into a single read request, sized so that the transactional read response covers the multiple discrete addresses contained in the contiguous range. Further, the data offload accelerator may sort the address buffer contents to facilitate such spatial locality optimizations. The data offload accelerator may repeat the parallel capture of data (including issuing another plurality of read requests to the chips in parallel, receiving another plurality of data responses from another plurality of specified addresses in parallel, and storing those data responses to one or more data buffers in parallel) until data from all of the specified addresses has been captured, without receiving additional commands from the processor or incurring additional overhead. Thus, where the data offload accelerator offloads large amounts of remote data (e.g., telemetry), significant overhead may be saved.
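The grouping of contiguous addresses into larger reads might be modeled as in the following sketch, which assumes a fixed 8-byte CSR stride (matching the 64-bit registers described later) and an issue_read_request() primitive; both are assumptions, not part of the disclosed interface.

```c
/* A sketch of the spatial-locality optimization: sort the processor-defined
 * address list, then merge runs of contiguous CSR addresses into single
 * ranged read requests. */
#include <stdint.h>
#include <stdlib.h>

extern void issue_read_request(uint32_t start_addr, uint32_t len_bytes); /* hypothetical */

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

void issue_coalesced_reads(uint32_t *addrs, int n)
{
    qsort(addrs, (size_t)n, sizeof *addrs, cmp_u32); /* order for locality */
    int i = 0;
    while (i < n) {
        int j = i;
        /* Extend the run while the next address is exactly one CSR away. */
        while (j + 1 < n && addrs[j + 1] == addrs[j] + 8)
            j++;
        issue_read_request(addrs[i], (uint32_t)(j - i + 1) * 8);
        i = j + 1;
    }
}
```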
Turning now to the drawings, FIG. 1 depicts a system architecture 100 in which a data offload accelerator 112 may be implemented for offloading data from a plurality of remote chips 1 through M in a plurality of ICs 106-1 through 106-N of the system architecture 100 to a processor 102, according to one or more examples of the present disclosure. Here, M and N are integers whose values depend on the design of the system architecture 100; M and N may be the same or different.
The system architecture 100 also includes a data offload accelerator device 104, which includes a data offload accelerator 112 and is coupled to the processor 102 and the plurality of chips 1-M in the plurality of ICs 106-1 to 106-N. The chips 1 through M are "remote" from the processor 102 and the data offload accelerator device 104, meaning that the chips 1 through M are included on at least a different chipset or die than the processor 102 and the data offload accelerator device 104. In a particular example, the ICs 106-1 to 106-N, each containing "remote" chips 1 to M, are formed on a different semiconductor substrate than the one or more semiconductor substrates on which the processor 102 and the data offload accelerator device 104 are formed.
In one example, processor 102 is a Baseboard Management Controller (BMC). In another example, the processor 102 is a microprocessor. In another example, the system architecture 100 may include multiple processors. For example, the system architecture 100 may include a processor 130, and the data offload accelerator 112 may also offload data from multiple chips 1 through M within the multiple ICs 106-1 through 106-N to the processor 130. In yet another example, the system architecture 100 may include a plurality of data offload accelerator devices 104 mounted on one or more printed circuit components, which are coupled to one or more processors.
As shown, each of ICs 106-1 to 106-N is an ASIC; accordingly, ICs 106-1 through 106-N are also labeled in FIG. 1 (and FIGS. 2-6) and are referred to herein as ASICs 1 through N. The ASICs 1 to N may have Very Large Scale Integration (VLSI) designs. For example, ASICs 1 through N may each have multiple chiplets 1 through M (as shown) and may be fabricated using Silicon Stack Interconnect (SSI) or other three-dimensional IC design techniques, where multiple ASIC dies are embedded in a single substrate or IC package. As further shown, chiplets 1-M each include one or more addressable CSRs, each of which in one example is a 64-bit register. Although only two are shown, the N ASICs may number more than two, and likewise the M chiplets may number more than two. In another example, the system architecture 100 may include a single ASIC having chiplets 1-M, or ASICs 1-N each having a single chiplet.
In an example, the data offload accelerator 112 is used to offload telemetry data from the CSR to the processor 102. In a particular example, ASICs 1 through N form, at least in part, a switched fabric network, and processor 102 may monitor telemetry data from the CSRs for certain conditions. For example, the processor 102 may monitor telemetry data so that a response or corrective action may be taken to prevent unwanted problems from occurring within the switched fabric network.
The data offload accelerator device 104 is a hardware device having various hardware components, as described below. In an example, the data offload accelerator device 104 is an IC, for example an FPGA; alternatively, the data offload accelerator device 104 is an ASIC. An example benefit of implementing the data offload accelerator device 104 as an FPGA is improved design flexibility, as it can be suitably programmed, for example, in the field or in the factory. Another example benefit of the FPGA implementation is that the data offload accelerator algorithms used therein may be optimized and/or changed based on user demand or hardware changes within the system architecture 100.
The data offload accelerator device 104 also includes an interface bridge 110 that couples the processor 102 to a data offload accelerator 112. The data offload accelerator device 104 also includes interfaces 122, 124-1 to 124-N, and 126-1 to 126-N that couple the data offload accelerator 112 to the chiplets 1 to M within the ASICs 1 to N. Where there are multiple processors within the system architecture 100, such as processors 102 and 130, the interface bridge 110 may connect all of the processors to the data offload accelerator 112.
The interface bridge 110 may include any suitable interface, such as an interface capable of enabling serial transfer of data between the data offload accelerator 112 and the processor 102. The interfaces within the interface bridge 110 may further allow the transfer of commands (or requests) and interrupts, or any other suitable messaging or information, between the processor 102 and the data offload accelerator 112. One example interface bridge 110 includes a PCIe industry-standard interface. Where there are multiple processors (e.g., processors 102 and 130) within the system architecture 100, the interface bridge 110 may include multiple PCIe interfaces, each coupled to one of the processors and to the data offload accelerator 112. Alternatively, the interface bridge 110 may include a PCIe switch coupled to each processor and to the data offload accelerator 112.
In one example, the interfaces 122, 124-1 to 124-N, and 126-1 to 126-N allow transaction-based communication between the data offload accelerator 112 and the chiplets 1 to M of the ASICs 1 to N. Transaction-based communications or "transactions" may include, for example, requests for configuration (e.g., during initialization or at some subsequent time) and memory reads and writes, as well as responses to requests, e.g., data, such as telemetry data. As shown, interfaces 124-1 through 124-N are connected between interface 122 and chiplets 1 through M, respectively, of ASIC 1. Similarly, interfaces 126-1 through 126-N are connected between interface 122 and chiplets 1 through M, respectively, of ASIC N to access one or more CSRs within the chiplets.
As further shown, interfaces 124-1 through 124-N and 126-1 through 126-N each include physical, link, and protocol (higher) layers. Further, as shown, the interface 122 serves as a protocol layer interface between the data offload accelerator 112 and the interfaces 124-1 through 124-N and 126-1 through 126-N. In an example, the interface 122 includes a transaction buffer and a router. In a particular example, the interfaces 124-1 through 124-N and 126-1 through 126-N are each proprietary interfaces, and the interface 122 includes a plurality of transaction first-in-first-out (FIFO) routers. In this example, the data offload accelerator device 104 may serve as a protocol bridge between the processor 102 and the chiplets 1-M, for example if the interface bridge 110 includes PCIe. Alternatively, interfaces 124-1 through 124-N and 126-1 through 126-N are PCIe interfaces and interface 122 includes multiple PCIe transaction FIFO routers.
This example arrangement of interfaces 122, 124-1 through 124-N, and 126-1 through 126-N enables parallel transfer of transactions (e.g., requests) from the data offload accelerator 112 to the chiplets 1 through M of ASICs 1 through N, and likewise enables parallel transfer of transactions (e.g., responses to requests) from the chiplets 1 through M of ASICs 1 through N to the data offload accelerator 112. For example, after receiving a single write command from the processor 102, the data offload accelerator 112 may send read requests in parallel to the chiplets 1-M in the ASICs 1-N, one read request being sent on each of the interfaces 124-1 through 124-N and 126-1 through 126-N. In turn, in response to the read requests, the chiplets 1-M within the ASICs 1-N can send data responses in parallel to the data offload accelerator, one data response being sent on each of the interfaces 124-1 through 124-N and 126-1 through 126-N.
The data offload accelerator 112 includes an offload control device 114, an address buffer bank having a single address buffer 116, a response data buffer bank having a single response data buffer 120, and FSM transaction (TXN) processing logic 118 (referred to herein as transaction processing logic). In the example shown, address buffer 116 is implemented as a Block Random Access Memory (BRAM), but may be implemented using any suitable memory technology to store multiple addresses. Address buffer 116 is coupled to processor 102, e.g., via a BRAM interface (I/F), and is also coupled to the transaction processing logic 118, for example using a hardware connection. More specifically, the processor 102 may specify addresses, such as addresses of CSRs, in the address buffer 116 to access and retrieve data from one or more of the chips within ICs 106-1 through 106-N. In one example, the addresses are linear or sequential; in this case, the processor 102 may indicate a starting address and an address range in the address buffer 116. Alternatively, the addresses may be non-linear or heterogeneous, discrete addresses, which may provide greater flexibility than using linear addresses. In this alternative example, the address buffer 116 may implement a list feature that allows the processor 102 to write a list of non-linear addresses to the address buffer 116.
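The two addressing modes just described might be programmed as in the following hedged sketch; the mode, start, and range registers are invented names, since the disclosure does not specify an encoding.

```c
/* A sketch of the two addressing modes: a linear mode (start address plus
 * count) and a list mode (explicit, possibly non-contiguous addresses).
 * All register names and offsets are assumptions. */
#include <stdint.h>

extern void accel_write32(uint32_t off, uint32_t val);

#define ADDR_BUF_BASE  0x1000 /* hypothetical BRAM-backed address buffer */
#define ADDR_START_REG 0x0020 /* hypothetical: linear-mode start address */
#define ADDR_RANGE_REG 0x0024 /* hypothetical: address count             */
#define ADDR_MODE_REG  0x0028 /* hypothetical: 0 = linear, 1 = list      */

/* Linear mode: one start address and a range cover a contiguous block. */
void program_linear(uint32_t start, uint32_t count)
{
    accel_write32(ADDR_START_REG, start);
    accel_write32(ADDR_RANGE_REG, count);
    accel_write32(ADDR_MODE_REG, 0);
}

/* List mode: each entry may point anywhere in the ASIC's address space. */
void program_list(const uint32_t *addrs, int n)
{
    for (int i = 0; i < n; i++)
        accel_write32(ADDR_BUF_BASE + 4u * i, addrs[i]);
    accel_write32(ADDR_RANGE_REG, (uint32_t)n);
    accel_write32(ADDR_MODE_REG, 1);
}
```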
In the example shown, response data buffer 120 is implemented as a BRAM, but may be implemented using any suitable memory technology to store data. Response data buffer 120 is coupled to processor 102, e.g., via a BRAM I/F, and is also coupled to the transaction processing logic 118. The transaction processing logic 118 may forward the data it retrieves from the chips within ICs 106-1 through 106-N to the response data buffer 120 in parallel. The parallel lines between the transaction processing logic (e.g., 118) and the one or more data buffers (e.g., 120) represent multiple hardware connections that enable parallel forwarding of information (including data responses) from chiplets 1-M. In some examples, transactions are sent to chiplets 1-M to initialize the CSRs. As shown, response data buffer 120 is an on-chip buffer. Alternatively, and as described with respect to one or more example data offload accelerators disclosed herein, the response data buffer bank may include a plurality of data buffers, e.g., multiple sets of data buffers, each set having a plurality of data buffers.
Offload control device 114 is coupled to both processor 102 and the transaction processing logic 118. The offload control device 114 may include any suitable circuitry that enables: the processor 102 to provide a status indication (e.g., a command or other input) to control the function of the data offload accelerator 112 in retrieving data from chiplets 1-M; the transaction processing logic 118 to provide one or more status indications as a result of requesting, receiving, storing, and/or processing data from chiplets 1-M; and the processor 102 to read one or more status indications from the data offload accelerator 112, such as the one or more status indications provided by the transaction processing logic 118.
The offload control device 114 includes a plurality of status indicator circuits, such as registers of one or more bits, that enable different status indications. As shown, the offload control device 114 includes accelerator (Accel) ready, start, in progress (InProg), data available (DataAvail), error, address (Addr) control (Cnt), buffer offset/size, and performance counter (Cntrs) status indicator circuits. The offload control device 114 may have more or fewer status indicator circuits, for example, depending on the information to be indicated by the processor 102 and the data offload accelerator 112.
The accelerator ready status indicator circuit enables the data offload accelerator 112 to indicate its readiness to the processor 102, for example, that it has exited reset and that all links to chiplets 1-M have been initialized. The start status indicator circuit enables the processor 102 to indicate to the data offload accelerator 112 to start processing the addresses specified in the address buffer 116. For example, using the start status indicator circuit, the processor 102 may issue a single write command to initiate data capture from the chiplets 1-M of one or more of the ASICs 1-N. Data capture, or capturing data, includes receiving data from a remote chip and storing the data; in an example, it includes sending a data request, receiving a data response including the data, and forwarding the data to one or more response data buffers. The in-progress status indicator circuit enables the transaction processing logic 118 to indicate to the processor 102 that a data fetch from the addresses indicated in the address buffer 116 is in progress. The data available status indicator circuit enables the transaction processing logic 118 to indicate to the processor 102 that data is available for offloading; for example, it enables the data offload accelerator 112 to interrupt the processor 102 to offload at least a portion of the retrieved data to the processor 102. The error status indicator circuit enables the transaction processing logic 118 to indicate to the processor 102 various errors that may occur, such as a read timeout on a transaction or an interrupted connection with an ASIC. The address control status indicator circuit enables the processor 102 to indicate to the transaction processing logic 118 how many addresses are indicated in the address buffer 116. The buffer offset/size status indicator circuit enables the transaction processing logic 118 to indicate to the processor 102 where the retrieved data is located within the response data buffer 120, and also enables the data offload accelerator 112 to indicate the size of the address buffer 116 and/or the response data buffer 120 to the processor 102. The performance counter status indicator circuit enables the transaction processing logic 118 to indicate to the processor 102 various parameter data indicative of performance, such as how long it takes to retrieve and store data.
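One purely illustrative way to picture the offload control device's registers is the following C layout; the disclosure names these fields but not their widths, offsets, or access types, so everything here is an assumption.

```c
/* A possible (purely illustrative) register layout for the offload
 * control device 114. Widths, offsets, and access types are invented. */
#include <stdint.h>

typedef volatile struct {
    uint32_t accel_ready;      /* RO: reset done, all chiplet links up      */
    uint32_t start;            /* WO: single write here begins data capture */
    uint32_t in_progress;      /* RO: fetch from address buffer under way   */
    uint32_t data_avail;       /* RO: data ready; source of processor IRQ   */
    uint32_t error;            /* RO: e.g., read timeout, broken ASIC link  */
    uint32_t addr_count;       /* RW: number of addresses in address buffer */
    uint32_t buf_offset;       /* RO: where retrieved data sits in buffer   */
    uint32_t buf_size;         /* RO: address/data buffer sizes             */
    uint32_t perf_counters[4]; /* RO: e.g., capture latency                 */
} offload_ctrl_t;
```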
The transaction processing logic 118 may be coupled to the interface 122 to process transactions between the data offload accelerator 112 and the chiplets 1-M of the ASICs 1-N. In a particular example, the transaction processing logic 118 is implemented in hardware as a Finite State Machine (FSM) to perform various functions. For example, the transaction processing logic 118: reads the addresses from the address buffer 116; constructs read requests in a format suitable for the interfaces 122, 124-1 through 124-N, and 126-1 through 126-N and the chiplets 1 through M, and sends the read requests to those addresses; receives data in response to the read requests; and stores the data in an appropriate response data buffer, in this example the response data buffer 120. The transaction processing logic 118 may also handle errors and provide status indications, for example, as described above. In other examples, the transaction processing logic 118 may perform other functions, such as comparing the retrieved data to one or more criteria and performing an action based on the comparison. In another example, the transaction processing logic 118 may send read requests in parallel to at least some of the plurality of addresses specified in the address buffer 116, and may likewise receive data from multiple addresses in parallel.
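Conceptually, the FSM's states might be enumerated as below. The real logic is hardware, so this C enumeration is only an analogue of the functions just listed, with invented state names.

```c
/* A compact, conceptual model of the transaction-processing FSM's states,
 * paraphrasing the functions described above. State names are invented. */
typedef enum {
    TXN_IDLE,        /* wait for the start bit                          */
    TXN_READ_ADDRS,  /* read target addresses from the address buffer   */
    TXN_ISSUE_READS, /* format and send read requests, in parallel      */
    TXN_COLLECT,     /* receive responses, fill response data buffers   */
    TXN_NOTIFY,      /* set data-available status, interrupt processor  */
    TXN_ERROR        /* e.g., timeout: record status for the processor  */
} txn_state_t;
```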
FIG. 2 depicts a system architecture 200 in which a data offload accelerator 212 may be implemented for offloading data from a plurality of remote chiplets 1 through M of ASICs 1 through N of the system architecture 200 to the processor 102, according to one or more examples of the present disclosure. The system architecture 200 also includes a data offload accelerator device 204, the data offload accelerator device 204 including a data offload accelerator 212, and the data offload accelerator device 204 coupled to both the processor 102 and the plurality of chiplets 1-M of ASICs 1-N. The processor 102 and ASICs 1 through N may be implemented similarly as described above with reference to fig. 1. Further, the system architecture 200 may include multiple processors.
The data offload accelerator device 204 includes an interface bridge 110, which interface bridge 110 may be similarly implemented as described above with reference to fig. 1 and similarly coupled to the processor 102. As described above with reference to FIG. 1, the data offload accelerator device 204 also includes interfaces 122, 124-1 through 124-N and 126-1 through 126-N, which interfaces 122, 124-1 through 124-N and 126-1 through 126-N may be similarly implemented and similarly coupled to chiplets 1 through M of ASICs 1 through N.
The data offload accelerator 212 is shown coupled to both the interface bridge 110 and the interface 122. As shown, the data offload accelerator 212 includes: a plurality of accelerator registers 214; an address buffer bank comprising a slow address buffer 208 and a fast address buffer 202; a data buffer bank comprising a plurality of slow response data buffers 210, a plurality of FastA response data buffers 216, and a plurality of FastB response data buffers 206; and transaction processing logic 218. The transaction processing logic 218 may be implemented in hardware as a finite state machine and is coupled to the interface 122, the accelerator registers 214, the slow address buffer 208 and the fast address buffer 202, as well as the plurality of slow response data buffers 210 and the pluralities of FastA and FastB response data buffers 216 and 206.
Each of the slow address buffer 208, the fast address buffer 202, the plurality of slow response data buffers 210, and the pluralities of FastA and FastB response data buffers 216 and 206 is shown as a BRAM with a BRAM I/F coupling it to the processor 102, but may be implemented as any suitable memory device. The transaction processing logic 218 may function similarly to the transaction processing logic 118 described with reference to FIG. 1. However, rather than reading addresses from a single address buffer, the transaction processing logic 218 may read addresses from multiple address buffers, e.g., the "slow" and "fast" address buffers 208 and 202, respectively. In addition, rather than writing to a single response data buffer, the transaction processing logic 218 may write in parallel to multiple response data buffers, e.g., the "slow" response data buffers 210 and the "fast" response data buffers 206 and 216.
As shown, the accelerator registers 214 include accelerator ready, slow/fast start, slow/fast in-progress, slow/fast data available, slow/fast error, slow/fast address control, buffer offset/size, and performance counter status registers, which may operate similarly to the accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status indicator circuits of the offload control device 114 described above with reference to FIG. 1. However, the accelerator registers 214 enable indications for multiple address buffers, e.g., the "slow" and "fast" address buffers 208 and 202, rather than for a single address buffer, and likewise enable indications for multiple response data buffers, e.g., the "slow" response data buffers 210 and the "fast" response data buffers 206 and 216, rather than for a single response data buffer.
Using the data offload accelerator 212, the processor 102 may indicate addresses, such as addresses of CSRs, in both the slow address buffer 208 and the fast address buffer 202 to access and retrieve data from the chiplets 1-M of one or more of the ASICs 1-N. The transaction processing logic 218 reads the addresses from the address buffers 202 and 208; constructs read requests and sends them to those addresses; receives data in response to the read requests; and forwards the data it retrieves from the chiplets 1-M of ASICs 1-N in parallel to the response data buffers 206, 210, and 216.
As shown, the data offload accelerator 212 includes buffers of different performance classes or priorities, in this case "fast" buffers (e.g., 202, 206, and 216) and "slow" buffers (e.g., 208 and 210), which enable data to be offloaded to the processor 102 at different rates. In one example, a "fast" buffer enables certain data to be retrieved and offloaded at a faster rate (e.g., 100 times faster) than a "slow" buffer. In this example, there are two performance classes or priorities of buffers; however, there may be additional performance classes or priorities according to the data collection specification, e.g., based on expected quality of results or particular CSR segments. For example, the multiple performance classes or priorities may indicate different data refresh rates, different bandwidths of data captured by the data offload accelerator 212 and/or communicated to the processor 102, different rates at which data is captured by the data offload accelerator 212 and/or communicated to the processor 102, and so on. In a particular example, each performance class or priority is associated with its own address buffer and response data buffer.
In addition, the slow address buffer 208, the fast address buffer 202, the plurality of slow response data buffers 210, and the pluralities of FastA and FastB response data buffers 216 and 206 may have different sizes depending on the data collection specification. In an example, the amount of data captured at the faster rate is less than the amount of data captured at the slower rate. Thus, the fast address buffer 202 may be smaller than the slow address buffer 208 to store fewer addresses. Likewise, the FastA and FastB response data buffers 216 and 206 may be smaller than the slow response data buffers 210 to store less data. In a particular example, the transaction processing logic 218 may execute an arbitration scheme to determine how often to fill the fast response data buffers 206 and 216 relative to the slow response data buffers 210.
Another feature of the data offload accelerator 212 is the use of multiple sets of fast response data buffers (e.g., 206 and 216). The data offload accelerator 212 may send requests to the chips of ASICs 1-N and fill a given response data buffer in parallel; however, the processor 102 reads data from a filled response data buffer serially. Thus, offloading data to the processor 102 may take several times longer than capturing and writing the data. In this case, the use of multiple fast response data buffers supports higher speeds. For example, while the processor 102 reads data from the FastA response data buffer 216, the transaction processing logic 218 can collect data and fill the FastB response data buffer 206. Similarly, while the processor 102 reads data from the FastB response data buffer 206, the transaction processing logic 218 can collect data and fill the FastA response data buffer 216. In another example, the data offload accelerator 212 includes additional fast response data buffers, the number of which may depend at least in part on the speed at which the transaction processing logic 218 captures and fills data relative to the speed at which the processor 102 reads data.
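The ping-pong between the FastA and FastB buffers can be sketched from the processor's side as follows; the ACTIVE_BUF_REG indirection and the buffer bases are assumptions, not part of the disclosed register set.

```c
/* A sketch of the FastA/FastB ping-pong: the processor drains one fast
 * response buffer while the accelerator refills the other. Register and
 * buffer names are invented for illustration. */
#include <stdint.h>

extern uint64_t accel_read64(uint32_t off);
extern uint32_t accel_read32(uint32_t off);
extern void     wait_for_offload_irq(void);

#define FASTA_BUF      0x2000 /* hypothetical FastA response buffer base */
#define FASTB_BUF      0x3000 /* hypothetical FastB response buffer base */
#define ACTIVE_BUF_REG 0x0030 /* hypothetical: which buffer just filled  */

void drain_forever(void (*consume)(uint64_t), int n)
{
    for (;;) {
        wait_for_offload_irq();
        /* Read the buffer the accelerator just completed; it is already
         * filling the other one in parallel. */
        uint32_t base = accel_read32(ACTIVE_BUF_REG) ? FASTB_BUF : FASTA_BUF;
        for (int i = 0; i < n; i++)
            consume(accel_read64(base + 8u * i));
    }
}
```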
FIG. 3 depicts a system architecture 300 within which a data offload accelerator 312 may be implemented for offloading data from a plurality of remote chiplets 1-M of ASICs 1-N of the system architecture 300 to the processor 102, according to one or more examples of the present disclosure. The system architecture 300 also includes a data offload accelerator device 304, the data offload accelerator device 304 including a data offload accelerator 312, and the data offload accelerator device 304 coupled to both the processor 102 and the plurality of chiplets 1-M of ASICs 1-N. The processor 102 and ASICs 1 through N may be implemented similarly as described above with reference to fig. 1. Further, the system architecture 300 may include multiple processors.
As described above with reference to FIG. 1, the data offload accelerator device 304 includes the interface bridge 110, which interface bridge 110 may be similarly implemented and similarly coupled to the processor 102. As described above with reference to FIG. 1, the data offload accelerator device 304 also includes interfaces 122, 124-1 through 124-N and 126-1 through 126-N, which interfaces 122, 124-1 through 124-N and 126-1 through 126-N may be similarly implemented and similarly coupled to chiplets 1 through M of ASICs 1 through N.
The data offload accelerator 312 is shown coupled to both the interface bridge 110 and the interface 122. As shown, the data offload accelerator 312 includes: a plurality of accelerator registers 314; an address buffer bank including an address buffer 302; a data buffer bank including a plurality of response data buffers 310; transaction processing logic 318; a comparator circuit 306; and a tag circuit 308. The transaction processing logic 318 may be implemented in hardware as a finite state machine and is coupled to the interface 122, the accelerator registers 314, the address buffer 302, the plurality of response data buffers 310, the comparator circuit 306, and the tag circuit 308.
The address buffer 302 and each of the plurality of response data buffers 310 are shown as a BRAM with a BRAM I/F coupling them to the processor 102, but may be implemented using any suitable memory technology. The transaction processing logic 318 may function similarly to the transaction processing logic 118 described with reference to FIG. 1. However, rather than writing to a single response data buffer, the transaction processing logic 318 may write to the multiple response data buffers 310 in parallel. Further, in the example shown, the transaction processing logic 318 also provides inputs to the comparator circuit 306 and indications based on the results of the tag circuit 308.
As shown, the accelerator registers 314 include accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status registers that function similarly to the accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status indicator circuits of the offload control device 114 described above with reference to FIG. 1. The accelerator registers 314 also include address/data (A/D) first-in-first-out (FIFO) control (Cnt)/status registers that enable the transaction processing logic 318 to provide one or more status indications to the processor 102 based on the results of the tag circuit 308. Further, the accelerator registers 314 enable indications for the multiple response data buffers 310 rather than for a single response data buffer.
In an example case, data from the chiplets 1-M of the ASICs 1-N may not change frequently. Thus, to further improve efficiency, once the transaction processing logic 318 has first written data to the response data buffers 310 (e.g., data set n-1), the transaction processing logic 318 updates the data in the response data buffers 310 when there is a change, and the changed data may be tagged and passed to the processor 102 instead of passing the complete data set. The comparator circuit 306 and the tag circuit 308 implement this feature. In the example, the tag circuit 308 is implemented as a set of "tag address/data FIFO" registers.
As shown, the comparator circuit 306 implements a comparison function, for example, using combinational logic. In one example, when the transaction processing logic 318 retrieves data set n, it may provide each data point (value) in data set n to the comparator circuit 306. The comparison function then compares each data point in data set n with the corresponding data point in data set n-1 at the given address location of the response data buffers 310. When the comparison function outputs a difference to the tag circuit 308, the tag circuit 308 may store the difference and/or the actual data value as a "tag value" and store the corresponding address location of the response data buffers 310 as a "tag address" in the tag address/data FIFO registers. In an example, the contents of the tag address/data FIFO registers of the tag circuit 308 are accessible to the processor 102. In a particular example, the data offload accelerator 312 uses the A/D FIFO control/status registers to provide an indication to the processor 102 that content is available for transfer from the tag address/data FIFO registers.
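A conceptual C model of the comparator and tag circuits might look like the following; the FIFO depth, buffer-slot indexing, and overflow policy are all assumptions, since the actual circuits are hardware.

```c
/* A conceptual model of delta detection: each incoming data point is
 * compared against data set n-1 at the same buffer location, and only
 * changes are queued as (tag address, tag value) pairs. */
#include <stdint.h>

#define TAG_FIFO_DEPTH 256 /* hypothetical */

typedef struct { uint32_t tag_addr; uint64_t tag_value; } tag_entry_t;

static uint64_t    prev_set[4096];           /* data set n-1, by buffer slot */
static tag_entry_t tag_fifo[TAG_FIFO_DEPTH];
static unsigned    fifo_head, fifo_tail;

/* Called once per retrieved data point in data set n. */
void compare_and_tag(uint32_t buf_slot, uint64_t value)
{
    if (value != prev_set[buf_slot]) {
        unsigned next = (fifo_tail + 1) % TAG_FIFO_DEPTH;
        if (next != fifo_head) {             /* drop on overflow (a choice) */
            tag_fifo[fifo_tail] = (tag_entry_t){ buf_slot, value };
            fifo_tail = next;
        }
        prev_set[buf_slot] = value;          /* buffer now holds data set n */
    }
}
```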
Thus, only a portion of the data, such as the tagged data value for a given address location and/or the tagged difference between the data set n value and the data set n-1 value, is offloaded to the processor 102, rather than the entirety of data set n and data set n-1. This speeds up data reads by the processor 102, which may be important, for example, where there are thousands of CSRs from which to read data. In addition, the data offload accelerator 312 may monitor for significant changes in data within its own hardware. Such monitoring may be performed more efficiently than if the data were first passed to the processor 102 to perform the comparison function.
FIG. 4 depicts a system architecture 400 in which a data offload accelerator 412 may be implemented to offload data from a plurality of remote chiplets 1-M of ASICs 1-N of the system architecture 400 to the processor 102 according to one or more examples of the present disclosure. The system architecture 400 also includes a data offload accelerator device 404, the data offload accelerator device 404 including a data offload accelerator 412, and the data offload accelerator device 404 coupled to the processor 102 and the plurality of chiplets 1-M of ASICs 1-N. The processor 102 and ASICs 1 through N may be implemented similarly as described above with reference to fig. 1. Further, the system architecture 400 may include multiple processors.
The data offload accelerator device 404 includes an interface bridge 110, which interface bridge 110 may be similarly implemented as described above with reference to fig. 1 and similarly coupled to the processor 102. As described above with reference to FIG. 1, the data offload accelerator device 404 also includes interfaces 122, 124-1 through 124-N, and 126-1 through 126-N, which may be similarly implemented and similarly coupled to the chiplets 1 through M of ASICs 1 through N.
Data offload accelerator 412 is shown coupled to both interface bridge 110 and interface 122. As shown, the data offload accelerator 412 includes: a plurality of accelerator registers 414; an address buffer bank including an address buffer 402; a data buffer bank including a plurality of response data buffers 410; transaction processing logic 418; a comparator circuit 406; and a tag circuit 408. The transaction processing logic 418 may be implemented in hardware as a finite state machine and is coupled to the interface 122, the accelerator registers 414, the address buffer 402, the plurality of response data buffers 410, the comparator circuit 406, and the tag circuit 408.
The address buffer 402 and each data buffer within the plurality of response data buffers 410 are shown as a BRAM with a BRAM I/F coupling them to the processor 102, but may be implemented using any suitable memory technology. The tag circuit 408 may function similarly to the tag circuit 308 described with reference to FIG. 3. The transaction processing logic 418 may function similarly to the transaction processing logic 118 described with reference to FIG. 1. However, the transaction processing logic 418 may write to the multiple response data buffers 410 in parallel rather than to a single response data buffer. Further, in the illustrated example, the transaction processing logic 418 also provides indications based on the results of the tag circuit 408.
As shown, the accelerator registers 414 include accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status registers that function similarly to the accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status indicator circuits of the offload control device 114 described above with reference to FIG. 1. The accelerator registers 414 also include A/D FIFO control/status registers that enable the transaction processing logic 418 to provide one or more status indications to the processor 102 based on the results of the tag circuit 408. Further, the accelerator registers 414 enable indications for the multiple response data buffers 410 rather than for a single response data buffer.
In another example scenario, the comparison and monitoring functions of the example of FIG. 4 may be more complex than those described with reference to FIG. 3. For example, the comparator circuit 406 may include a plurality of data set criteria and pattern matching registers that allow data retrieved from the chiplets 1-M of one or more of the ASICs 1-N to be compared to one or more criteria. In one example, the processor 102 programs one or more criteria, against which the retrieved data are compared, into the data set criteria and pattern matching registers of the comparator circuit 406. As shown, the criteria the processor 102 may program include, but are not limited to, a bit range field, a pattern match type, and a pattern match value. The bit range field may indicate a bit range of interest, for example from a CSR, to compare. The pattern match type may indicate the type of comparison to be performed, e.g., equal to, greater than, or less than. The pattern match value may indicate one or more values against which the retrieved data are compared using the pattern match type. In one example, the comparator circuit 406 may monitor certain conditions of interest based on a defined telemetry reaction target.
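Evaluating one programmed criterion might reduce to the following sketch, with invented encodings for the pattern match type and the bit range field.

```c
/* A sketch of evaluating one criterion: extract the bit range of interest
 * from a retrieved CSR value, then apply the pattern match type against
 * the pattern match value. Encodings are assumptions. */
#include <stdint.h>
#include <stdbool.h>

typedef enum { MATCH_EQ, MATCH_GT, MATCH_LT } match_type_t;

typedef struct {
    uint8_t      bit_lo, bit_hi;  /* bit range field of interest */
    match_type_t type;            /* pattern match type          */
    uint64_t     value;           /* pattern match value         */
} criterion_t;

bool criterion_met(const criterion_t *c, uint64_t csr_value)
{
    unsigned width = c->bit_hi - c->bit_lo + 1;
    uint64_t mask  = (width >= 64) ? ~0ULL : ((1ULL << width) - 1);
    uint64_t field = (csr_value >> c->bit_lo) & mask;

    switch (c->type) {
    case MATCH_EQ: return field == c->value;
    case MATCH_GT: return field >  c->value;
    case MATCH_LT: return field <  c->value;
    }
    return false;
}
```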
If one or more (pattern matching) criteria are met, the tag circuit 408 may store a "tag value," such as the value of the data or some other value related to the data, and also store the corresponding address location of the response data buffers 410 as a "tag address" in the tag address/data FIFO registers. In an example, the contents of the tag address/data FIFO registers of the tag circuit 408 are accessible to the processor 102. In a particular example, the data offload accelerator 412 uses the A/D FIFO control/status registers to provide an indication to the processor 102 that content is available for offload from the tag address/data FIFO registers. Thus, only a portion of the data, such as the tagged data value and/or other values associated with the data value for a given address location, is offloaded to the processor 102, rather than the entirety of data set n and data set n-1.
FIG. 5 depicts a system architecture 500 in which a data offload accelerator 512 may be implemented for offloading data from a plurality of remote chiplets 1-M of ASICs 1-N of the system architecture 500 to the processor 102, according to one or more examples of the present disclosure. The system architecture 500 also includes a data offload accelerator device 504, which includes the data offload accelerator 512 and is coupled to the processor 102 and the plurality of chiplets 1-M of ASICs 1-N. The processor 102 and ASICs 1 through N may be implemented similarly as described above with reference to FIG. 1. Further, the system architecture 500 may include multiple processors.
The data offload accelerator device 504 includes an interface bridge 110, which may be similarly implemented as described above with reference to FIG. 1 and similarly coupled to the processor 102. The data offload accelerator device 504 also includes interfaces 122, 124-1 through 124-N, and 126-1 through 126-N, which may be similarly implemented as described above with reference to FIG. 1 and similarly coupled to the chiplets 1 through M of the ASICs 1 through N.
The data offload accelerator 512 is shown coupled to both the interface bridge 110 and the interface 122. As shown, the data offload accelerator 512 includes: a plurality of accelerator registers 514; an address buffer bank comprising an address buffer 502; a data buffer bank including a plurality of response data buffers 510; transaction processing logic 518; a comparator circuit 506; and an action circuit 508. The transaction processing logic 518 may be implemented in hardware as a finite state machine and is coupled to the interface 122, the accelerator registers 514, the address buffer 502, the plurality of response data buffers 510, the comparator circuit 506, and the action circuit 508.
The address buffer 502 and each data buffer within the plurality of response data buffers 510 are shown as a BRAM with a BRAM I/F coupling them to the processor 102, but may be implemented using any suitable memory technology. The transaction processing logic 518 may function similarly to the transaction processing logic 118 described with reference to FIG. 1. However, the transaction processing logic 518 may write to the multiple response data buffers 510 in parallel rather than to a single response data buffer. Further, in the illustrated example, the transaction processing logic 518 also provides indications based on the results of the action circuit 508.
As shown, the accelerator registers 514 include accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status registers that function similarly to the accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status indicator circuits of the offload control device 114 described above with reference to FIG. 1. The accelerator registers 514 also include criteria/action CSR status registers that enable the transaction processing logic 518 to provide one or more status indications to the processor 102 based on the results of the action circuit 508. Further, the accelerator registers 514 enable indications for the multiple response data buffers 510 rather than for a single response data buffer.
In another example scenario, the comparison and monitoring functions of the example of FIG. 5 may be more complex than those described with reference to FIG. 3. For example, the comparator circuit 506 may include a plurality of data set criteria and pattern matching registers that allow data retrieved from the chiplets 1-M of one or more of the ASICs 1-N to be compared to one or more criteria. In one example, the processor 102 programs one or more criteria, against which the retrieved data are compared, into the data set criteria and pattern matching registers of the comparator circuit 506. As shown, the criteria the processor 102 may program include, but are not limited to, a bit range field, a pattern match type, a pattern match value, and a criteria category. The bit range field, pattern match type, and pattern match value registers may function similarly to those described with reference to FIG. 4. The criteria category register, however, adds a further criterion against which the retrieved data are compared. In one example, the criteria category is used to determine whether a retrieved data value is from a CSR within a specified CSR category.
For example, where the ASICs 1 to N form or are included in a switched network, the collected telemetry data may be processed to find or monitor particular performance signatures. When an anomalous signature is found, the data offload accelerator 512 may take action to recover from the condition while the switched network continues to operate. In this context, as non-limiting examples, the data offload accelerator 512 may: find conditions indicating that a fabric path is no longer operable; characterize new silicon and new switching paradigms; search for "black holes," i.e., ports that receive data but never pass it on to a destination; and identify "brick walls," i.e., ports that never accept data.
If one or more criteria are met or a signature is found, in this example scenario the action circuit 508 may take some action, such as a corrective action. For example, the action circuit 508 may take one or more actions for a chiplet, and particularly for a CSR or the like within a chiplet, whose data resides in the response data buffers 510. As shown, the action circuit 508 includes a plurality of accelerator action response registers, including action category, action type, action scope, action status, and action performance (Perf.) counter registers, which provide flexibility in the actions the action circuit may take in response to one or more criteria being met. These actions may be delineated by the category, type, and scope in the corresponding registers. An indication of the action taken, and the results or other status related to the action taken, may be recorded in the action status register. Likewise, any performance metrics related to the action may be stored in the action performance counter register. Such actions may include, but are not limited to, writing a particular value to a CSR or to an address in the response data buffers 510, changing a routing table for a CSR to avoid a black hole or brick wall network routing condition, and the like.
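The action circuit's dispatch might be modeled as below; the action encodings and the two helper routines are hypothetical, chosen to mirror the CSR-write and rerouting examples given above.

```c
/* A conceptual sketch of action dispatch, with invented encodings for the
 * action type/scope registers. The two bodies mirror the examples above:
 * writing a CSR value, and rerouting around a black hole or brick wall. */
#include <stdint.h>

typedef enum { ACT_WRITE_CSR, ACT_REROUTE } action_type_t;

typedef struct {
    action_type_t type;    /* action type register                   */
    uint32_t      scope;   /* action scope, e.g., CSR addr or port id */
    uint64_t      operand; /* e.g., value to write, or new route      */
} action_t;

extern void chip_write_csr(uint32_t addr, uint64_t val);         /* hypothetical */
extern void update_routing_table(uint32_t port, uint64_t route); /* hypothetical */

static uint32_t action_status;   /* would land in the action status register  */
static uint32_t action_perf_cnt; /* would land in the perf counter register   */

void dispatch_action(const action_t *a)
{
    switch (a->type) {
    case ACT_WRITE_CSR:
        chip_write_csr(a->scope, a->operand);
        break;
    case ACT_REROUTE: /* steer traffic away from a bad port */
        update_routing_table(a->scope, a->operand);
        break;
    }
    action_status = 1;   /* record that an action was taken */
    action_perf_cnt++;
}
```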
In an example, the processor 102 may access the contents of the accelerator action response registers of the action circuitry 508. In a particular example, the data offload accelerator 512 uses the criteria/action CSR register to provide an indication to the processor 102 that content may be viewed in, and passed from, one or more action response registers of the action circuitry 508. In yet another example, data in the response data buffer 510 is also transferred to the processor 102. Because the action circuitry 508 can act directly, it can resolve or mitigate a condition in the network faster than if all data from the chiplets 1 through M of the ASICs 1 through N first had to be serially offloaded to the processor 102 for analysis before an action could be taken.
FIG. 6 depicts a system architecture 600 in which an initialization accelerator 612 may be implemented for initializing a plurality of remote chiplets 1 through M of the ASICs 1 through N of the system architecture 600, according to one or more examples of the present disclosure. The system architecture 600 includes an initialization accelerator device 604 that includes the initialization accelerator 612 and is coupled to the processor 102 and to the plurality of chiplets 1 through M of the ASICs 1 through N. The processor 102 and the ASICs 1 through N may be implemented similarly as described above with reference to FIG. 1. Further, the system architecture 600 may include multiple processors.
The initialization accelerator device 604 includes an interface bridge 110, which interface bridge 110 may be similarly implemented and similarly coupled to the processor 102 as described above with reference to FIG. 1. The initialization accelerator device 604 also includes interfaces 122, 124-1 through 124-N and 126-1 through 126-N, which may be similarly implemented and similarly coupled to the chiplets 1 through M of ASICs 1 through N as described above with reference to FIG. 1.
Initialization accelerator 612 is illustrated as being coupled to both interface bridge 110 and interface 122. As shown, the initialization accelerator 612 includes: a plurality of accelerator registers 614; an address buffer bank comprising an address buffer 602; a data buffer bank comprising a plurality of output data buffers 606; and transaction processing logic 618. The transaction logic 618 may be implemented in hardware as a finite state machine and is coupled to the interface 122, the accelerator registers 614, the address buffer 602, and the plurality of output data buffers 606.
As shown, the accelerator registers 614 include accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status registers that function similarly to the corresponding status indicator circuits of the offload control device 114 described above with reference to FIG. 1. Further, the accelerator registers 614 can enable indications for the multiple output data buffers 606. In another example, the accelerator registers 614 can enable an indication for a single output data buffer.
Each of the address buffer 602 and the plurality of output data buffers 606 is illustrated as a BRAM with a BRAM I/F for coupling the buffer to the processor 102, but may be implemented with any suitable memory technology. The transaction logic 618 may function similarly to the transaction logic 118 described with reference to FIG. 1 to retrieve and store data from the chips 106-1 through 106-N. In that case, the output data buffers 606 would serve as response data buffers to store the retrieved data.
However, in some examples, the transaction logic 618 writes initialization data from the output data buffers 606 to the chiplets 1 through M of one or more of the ASICs 1 through N, e.g., to CSRs within the chiplets 1 through M. The initialization data may be written in parallel over the interfaces 122, 124-1 through 124-N, and 126-1 through 126-N. In an example, as part of initializing the chiplets, the processor 102 can specify multiple addresses in the address buffer 602. The processor 102 may correspondingly fill the output data buffers 606 with initialization data to initialize the chiplets at those addresses. Then, upon receiving an indication from the processor 102, for example via a start register within the accelerator registers 614, the transaction processing logic 618 writes the initialization data from the output data buffers 606 to the chiplets at the stored addresses. Thus, the transaction logic 618 can initialize the chiplets simultaneously by using hardware to send write requests to the chiplets in parallel. The initialization time may be greatly reduced compared to the processor 102 initializing each chiplet one at a time.
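From the processor's side, programming the initialization accelerator as described above could be sketched in C as follows; the memory-mapped offsets and the single start register are assumptions made only for illustration.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical memory-mapped view of the initialization accelerator;
 * the offsets and register names are illustrative only. */
#define ADDR_BUF_BASE   0x0000u  /* address buffer 602 (BRAM I/F)    */
#define DATA_BUF_BASE   0x4000u  /* output data buffers 606          */
#define REG_START       0x8000u  /* start register in registers 614  */

static void init_chiplets(volatile uint32_t *acc,
                          const uint32_t *csr_addrs,
                          const uint32_t *csr_vals, size_t n)
{
    /* 1. The processor fills the address buffer with CSR addresses. */
    for (size_t i = 0; i < n; i++)
        acc[ADDR_BUF_BASE / 4 + i] = csr_addrs[i];

    /* 2. The processor fills the output data buffers with the
     *    corresponding initialization values. */
    for (size_t i = 0; i < n; i++)
        acc[DATA_BUF_BASE / 4 + i] = csr_vals[i];

    /* 3. Writing the start register hands off to the transaction
     *    processing logic, which issues the writes to all chiplets
     *    in parallel over interfaces 124-1..124-N / 126-1..126-N. */
    acc[REG_START / 4] = 1u;
}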
In another example, the transaction logic 618 can write other types of data to the chiplets 1 through M of one or more of the ASICs 1 through N. For example, the transaction logic can write configuration data to a chiplet at some time after initialization to change values within its CSRs.
FIG. 7 depicts a system architecture 700 in which a data offload accelerator 712 may be implemented for offloading data from a plurality of remote chips 1 through M of a plurality of ICs 730-1 through 730-N of the system architecture 700 to the processor 102, according to one or more examples of the present disclosure. The ICs 730-1 through 730-N are also referred to herein as ICs 1 through N. The system architecture 700 also includes a data offload accelerator device 704 that includes the data offload accelerator 712 and is coupled to both the processor 102 and the plurality of chips 1 through M of the ICs 1 through N. The chips 1 through M may be any type of chip that includes memory-mapped address locations containing data for offloading to the processor 102. Further, the system architecture 700 may include multiple processors.
The data offload accelerator device 704 includes an interface bridge 110, which interface bridge 110 may be similarly implemented as described above with reference to fig. 1 and similarly coupled to the processor 102. The data offload accelerator device 704 also includes interfaces 122, 124-1 through 124-N and 126-1 through 126-N, which may be similarly implemented as described above with reference to FIG. 1 and similarly coupled to chips 1 through M of the ICs 1 through N.
The data offload accelerator 712 is illustrated as being coupled to both the interface bridge 110 and the interface 122. As shown, the data offload accelerator 712 includes transaction processing logic 718 coupled to the remaining components of the data offload accelerator 712. The transaction logic 718 may include at least some of the functionality described above with reference to FIGS. 1, 2, 3, 4, and 6. The data offload accelerator 712 also includes a plurality of accelerator registers 714 that may function similarly to the registers of the offload control device 114 described with reference to FIG. 1. The data offload accelerator 712 also includes an address buffer bank having a slow address buffer 708 and a fast address buffer 702, which may function similarly to the slow address buffer 208 and the fast address buffer 202 described with reference to FIG. 2. The data offload accelerator 712 also includes a data buffer bank having a plurality of slow response data buffers 710, a plurality of fast response data buffers 722, and a plurality of fast response data buffers 706, which function similarly to the slow response data buffers 210, the fast response data buffers 216, and the fast response data buffers 206, respectively, described with reference to FIG. 2. The data offload accelerator 712 further includes a comparator circuit 716 and an action circuit 720, which may function similarly to the comparator circuit 406 and the action circuit 408 described with reference to FIG. 4. In a particular example, the processor 102 may program the data offload accelerator to operate in different modes to trigger various functions therein.
FIG. 8 depicts a flow diagram of a method 800 for offloading data from multiple remote chips to a processor, in accordance with one or more examples of the present disclosure. The example method 800, or portions thereof, may be performed by the example data offload accelerators 112, 212, 312, 412, 512, and 712 shown in FIGS. 1-5 and 7. However, the method 800 is illustratively described with reference to the data offload accelerator 112 of FIG. 1. According to the example method 800, an indication of a plurality of addresses for retrieving data from the plurality of remote chiplets 1 through M of one or more of the ASICs 1 through N is received (802) into an address buffer bank, which in this case is the address buffer 116. A command to initiate the offloading of the data is received (804) into the offload control device 114. The transaction logic 118 captures (806) the data in parallel into a data buffer bank, in this case the response data buffer 120. Once the response data buffer 120 is at least partially or completely filled with data, the transaction logic 118 interrupts (808) the processor 102 to transfer at least a portion of the data.
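A non-limiting C sketch of blocks 802 through 808, as seen from the processor, follows; the register offsets, the interrupt binding, and the buffer geometry are hypothetical.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical offsets into the accelerator's memory-mapped space. */
#define ADDR_BUF_OFF  0x0000u   /* address buffer 116             */
#define START_OFF     0x8000u   /* start register (offload ctrl)  */

static volatile int offload_done;     /* set by the IRQ handler */

void offload_isr(void)                /* bound to the accelerator IRQ */
{
    offload_done = 1;                 /* buffer 120 holds data (808)  */
}

void run_offload(volatile uint32_t *acc, const uint32_t *addrs, size_t n)
{
    /* (802) indicate the addresses to retrieve. */
    for (size_t i = 0; i < n; i++)
        acc[ADDR_BUF_OFF / 4 + i] = addrs[i];

    /* (804) command the offload control device to start. */
    offload_done = 0;
    acc[START_OFF / 4] = 1u;

    /* (806) the transaction logic now captures the data from all
     * chiplets in parallel; the processor is free to do other work. */

    /* (808) wait for the interrupt, then drain the response buffer. */
    while (!offload_done)
        ;   /* in practice, sleep until the interrupt fires */
}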
In one example, the transaction logic 118 retrieves the data based on one or more policies programmed within the transaction logic 118. In one example, the transaction logic 118 retrieves data from some, but not all, of the ASICs 1 through N. In another example, the transaction logic 118 retrieves data from some, but not all, of the chiplets 1 through M in one or more of the ASICs 1 through N. In another example, the transaction logic 118 retrieves data from some, but not all, of the CSRs within one or more of the chiplets 1 through M in one or more of the ASICs 1 through N. In another example, the transaction logic 118 retrieves data of different sizes, such as different numbers of data bits or bytes. In yet another example, each transaction may retrieve data of a different size.
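One way such a retrieval policy might be encoded, purely as an illustrative assumption, is the following C structure; the field names and bitmap scheme are hypothetical.

#include <stdint.h>

/* Hypothetical policy programmed into the transaction logic 118. */
typedef struct {
    uint32_t asic_mask;     /* bitmap: which of ASICs 1..N to read    */
    uint32_t chiplet_mask;  /* bitmap: which chiplets 1..M per ASIC   */
    uint32_t csr_first;     /* first CSR index to read per chiplet    */
    uint32_t csr_count;     /* number of CSRs to read per chiplet     */
    uint8_t  access_size;   /* bytes per transaction, e.g. 4 or 8     */
} offload_policy_t;

/* Example: read 16 CSRs starting at index 0 from chiplets 1 and 2
 * of ASICs 1 and 3 only, 8 bytes per transaction. */
static const offload_policy_t example_policy = {
    .asic_mask    = 0x5u,   /* ASICs 1 and 3 */
    .chiplet_mask = 0x3u,   /* chiplets 1 and 2 */
    .csr_first    = 0u,
    .csr_count    = 16u,
    .access_size  = 8u,
};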
FIG. 9 depicts a flow diagram of a method 900 for offloading data from multiple remote chips to a processor, in accordance with one or more examples of the present disclosure. The example method 900, or portions thereof, may be performed by the example data offload accelerators 112, 212, 312, 412, 512, and 712 and the example initialization accelerator 612 shown in FIGS. 1-7. Illustratively, however, the method 900 is described with reference to the data offload accelerator 712 of FIG. 7.
According to the example method 900, an indication of a plurality of addresses for providing data to, or retrieving data from, the plurality of remote chips 1 through M in one or more of the ICs 1 through N is received into the plurality of address buffers 702, 708 of the address buffer bank. In particular, a first portion of the plurality of addresses is received (902) into a first address buffer, such as the fast address buffer 702, and a second portion of the plurality of addresses is received (904) into a second address buffer, such as the slow address buffer 708. Depending on the configuration of the data offload accelerator (e.g., where the data offload accelerator 712 is configured similarly to the initialization accelerator 612 of FIG. 6), the transaction logic 718 may initialize (906) the plurality of remote chips 1 through M in one or more of the ICs 1 through N in parallel at the indicated addresses. In one example, in response to the data buffer bank containing initialization data, the transaction logic 718 may write the initialization data to the remote chips 1 through M of one or more of the ICs 1 through N to initialize or configure the chips, such as CSRs within the chips 1 through M of one or more of the ICs 1 through N. In another example, the processor uses an accelerator register to trigger the transaction logic 718 to initialize the CSRs.
A command to initiate the offloading of data is received (908) into an offload control device, in this case the accelerator registers 714. The transaction logic 718 captures (910) a first portion of the data into a first set of data buffers of the data buffer bank and captures (912) a second portion of the data into a second set of data buffers of the data buffer bank. For example, the transaction logic 718 requests and receives data from the remote chips 1 through M of the ICs 1 through N in parallel. The transaction logic 718 then writes the first portion of the data in parallel to the fast response data buffers 706 and 722 of the data buffer bank and writes the second portion of the data in parallel to the slow response data buffers 710 of the data buffer bank. In one example, once one or more of the response data buffers 706, 722, or 710 is at least partially or fully filled with data, the transaction logic 718 interrupts (918) the processor 102 to offload at least a portion of the data. For example, once one of the fast response data buffers, e.g., 722, is full, the transaction logic 718 interrupts the processor 102 to offload the data from that buffer while the transaction logic 718 fills another of the response data buffers, e.g., 706.
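The ping-pong use of the fast response data buffers could be sketched in C as follows; the buffer depth, the "full" status flag, and the two-buffer arrangement are illustrative assumptions.

#include <stdint.h>
#include <stddef.h>

#define FAST_BUF_WORDS 512            /* assumed buffer depth */

typedef struct {
    volatile uint32_t *data;          /* BRAM I/F of one fast buffer */
    volatile uint32_t *full;          /* "buffer full" status flag   */
} fast_buf_t;

/* Drain n_fills buffer-loads: while the processor empties one fast
 * buffer (e.g., 722), the transaction logic fills the other (e.g., 706). */
void drain_fast_buffers(fast_buf_t buf[2], uint32_t *dst, int n_fills)
{
    int active = 0;
    for (int f = 0; f < n_fills; f++) {
        while (!*buf[active].full)
            ;                               /* wait for interrupt/flag */
        for (size_t i = 0; i < FAST_BUF_WORDS; i++)
            *dst++ = buf[active].data[i];   /* offload to host memory  */
        *buf[active].full = 0;              /* return buffer to hardware */
        active = 1 - active;                /* ping-pong to the other  */
    }
}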
Where, as shown in FIG. 7, the data offload accelerator includes comparator circuitry (e.g., 716) and action circuitry (e.g., 720), the comparator circuitry 716 may compare (914) the captured data to subsequently retrieved data or to one or more pattern matching criteria. At block 916, depending on how it is configured, the circuitry 720 may mark subsequent data that differs from the captured data, or may take an action based on whether the data satisfies one or more criteria. In one example, only the portion of the data that satisfies the one or more criteria or is marked is passed (918) to the processor 102, rather than the entirety of the data. In another example, when at least some of the data satisfies the one or more criteria, the circuitry 720 (e.g., if configured similarly to the action circuitry 508) takes (916) a corrective action.
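A non-limiting C sketch of the compare-and-mark behavior follows; the flat arrays stand in for the captured and subsequent data sets, and only the marked entries would be passed to the processor.

#include <stdint.h>
#include <stddef.h>

/* Compare a subsequent capture against the previous one and mark
 * only the entries that changed (cf. comparator 716 / circuit 720). */
size_t mark_changed(const uint64_t *prev, const uint64_t *curr, size_t n,
                    size_t *marked_idx, uint64_t *marked_val)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (curr[i] != prev[i]) {      /* difference detected       */
            marked_idx[out] = i;       /* which address changed     */
            marked_val[out] = curr[i]; /* the new value to pass on  */
            out++;
        }
    }
    return out;   /* only this subset is offloaded to the processor */
}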
For simplicity and illustrative purposes, the present disclosure has been described primarily by reference to examples thereof. In the foregoing description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent, however, that the present disclosure may be practiced without limitation to these specific details. For example, the examples illustrate the practice of the present disclosure using different hardware configurations and combinations of hardware for data offload accelerators; however, the present disclosure may be practiced with other combinations and configurations of data offload accelerators, or with different configurations of initialization accelerators, not described herein. Additionally, some elements depicted may be removed and/or modified without departing from the scope of the described examples. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure. As used herein, the terms "includes" and "including" mean includes or including but not limited to, and the terms "has" and "having" mean has or having but not limited to. As used herein, the term "about," when applied to a value, generally means within the tolerance of the device used to generate the value, or, in some examples, plus or minus 10%, or plus or minus 5%, or plus or minus 1%, unless expressly specified otherwise.

Claims (20)

1. A method for offloading data from a plurality of remote chips to a processor, the method comprising:
receiving into an address buffer bank of a data offload accelerator a specification of a plurality of addresses for retrieving the data from the plurality of remote chips;
receiving a command to initiate capture of the data from the plurality of remote chips into an offload control device of the data offload accelerator;
capturing the data from the plurality of remote chips in parallel into a data buffer bank of the data offload accelerator; and
interrupting, via the offload control device, the processor to transfer at least a portion of the data to the processor.
2. The method of claim 1, wherein capturing the data from the plurality of remote chips comprises: capturing telemetry data from a plurality of control and status registers within each remote chip of the plurality of remote chips.
3. The method of claim 1, comprising:
receiving, into the address buffer bank, a specification of a first portion of the plurality of addresses for retrieving a first portion of the data from the plurality of remote chips;
receiving, into the address buffer bank, a specification of a second portion of the plurality of addresses for retrieving a second portion of the data from the plurality of remote chips;
capturing the first portion of the data into the data buffer bank;
capturing the second portion of the data into the data buffer bank; and
passing the first portion of the data to the processor based on at least one different criterion than passing the second portion of the data to the processor.
4. The method of claim 3, comprising:
receiving the specification of the first portion of the plurality of addresses into a first address buffer of the address buffer bank;
receiving the specification of the second portion of the plurality of addresses into a second address buffer of the address buffer bank;
capturing the first portion of the data into a first data buffer of the data buffer bank; and
capturing the second portion of the data into a second data buffer of the data buffer bank, wherein the at least one different criterion for passing the first portion of the data and the second portion of the data to the processor includes one or both of a different rate or a different bandwidth.
5. The method of claim 1, wherein capturing the data from the plurality of remote chips in parallel into the data buffer bank comprises:
sending, from transaction processing logic of the data offload accelerator, a plurality of requests for the data in parallel to the plurality of remote chips;
receiving, by the transaction logic, a plurality of responses including the data in parallel from the plurality of remote chips; and
forwarding the data from the transaction logic into the data buffer bank in parallel.
6. The method of claim 1, comprising writing data to the plurality of remote chips in parallel.
7. The method of claim 6, wherein writing the data to the plurality of remote chips in parallel comprises: initializing the plurality of remote chips prior to capturing the data.
8. The method of claim 1, comprising:
receiving subsequent data from the plurality of remote chips;
comparing the captured data with the subsequent data; and
marking the subsequent data that is different from the captured data.
9. The method of claim 8, comprising communicating the subsequent data different from the captured data to the processor.
10. The method of claim 1, comprising:
comparing the data to one or more criteria; and
performing an action based on a determination that at least some of the data satisfies the one or more criteria.
11. A data offload accelerator to offload data from a plurality of remote chips to a processor, the data offload accelerator comprising:
an address buffer bank to receive a specification of a plurality of addresses for retrieving the data from the plurality of remote chips;
an offload control device coupled to the processor, the offload control device to receive a command to initiate capture of the data from the plurality of remote chips and to interrupt the processor to pass at least a portion of the data to the processor;
transaction logic coupled to the address buffer bank and the offload control device, the transaction logic to retrieve the data from the plurality of remote chips; and
a data buffer bank coupled to the transactional logic via a plurality of physical couplings, the data buffer bank to receive the data from the transactional logic in parallel.
12. The data offload accelerator of claim 11, wherein the data buffer bank comprises:
a first data buffer to receive a first portion of the data for delivery to the processor; and
a second data buffer to receive a second portion of the data for delivery to the processor based on at least one different criterion than delivering the first portion of the data to the processor.
13. The data offload accelerator of claim 12, wherein the address buffer bank comprises:
a first address buffer to receive a specification of a first portion of the plurality of addresses for retrieving the first portion of the data; and
a second address buffer to receive a specification of a second portion of the plurality of addresses for use in retrieving the second portion of the data.
14. The data offload accelerator of claim 11, wherein the offload control device comprises: a first register to receive the command from the processor to initiate the capture of the data from the plurality of remote chips.
15. The data offload accelerator of claim 14, wherein the offload control device further comprises: a second register to interrupt the processor.
16. The data offload accelerator of claim 11, comprising: a comparator circuit coupled to the data buffer bank, the comparator circuit to perform a comparison and provide an output based on the comparison.
17. The data offload accelerator of claim 16, wherein the comparator circuit comprises logic to:
comparing the data received into the data buffer bank with subsequent data captured from the plurality of remote chips; and
providing an output indicative of a difference between the data received into the data buffer bank and the subsequent data captured from the plurality of remote chips.
18. The data offload accelerator of claim 17, comprising: at least one register coupled to the comparator circuit and the transaction logic, the at least one register to mark the difference between the data received into the data buffer bank and the subsequent data captured from the plurality of remote chips.
19. The data offload accelerator of claim 16, wherein the comparator circuit comprises at least one register to:
comparing the data to one or more criteria; and
providing an output indicating whether the data satisfies the one or more criteria.
20. The data offload accelerator of claim 19, comprising: an action circuit coupled to the comparator circuit and the transaction logic, the action circuit to perform an action when the data satisfies the one or more criteria.
CN202010137127.3A 2019-03-07 2020-03-02 Data offload acceleration from multiple remote chips Pending CN111666106A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201916295823A 2019-03-07 2019-03-07
US16/295,823 2019-03-07

Publications (1)

Publication Number Publication Date
CN111666106A true CN111666106A (en) 2020-09-15

Family

ID=72146714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010137127.3A Pending CN111666106A (en) 2019-03-07 2020-03-02 Data offload acceleration from multiple remote chips

Country Status (2)

Country Link
CN (1) CN111666106A (en)
DE (1) DE102020105896A1 (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080222383A1 (en) * 2007-03-09 2008-09-11 Spracklen Lawrence A Efficient On-Chip Accelerator Interfaces to Reduce Software Overhead
CN101373449A (en) * 2007-08-21 2009-02-25 三星电子株式会社 ECC control circuits, multi-channel memory systems and operation methods thereof
CN102934102A (en) * 2010-05-26 2013-02-13 日本电气株式会社 Multiprocessor system, execution control method and execution control program
CN105229481A (en) * 2013-02-21 2016-01-06 爱德万测试公司 There is the acceleration on storer and the tester for the acceleration of automatic mode generation in FPGA block
CN103970027A (en) * 2014-04-02 2014-08-06 北京控制工程研究所 Telemetry processing unit simulation method in integrated electronic simulation software environment
CN107430628A (en) * 2015-04-03 2017-12-01 华为技术有限公司 Acceleration framework with immediate data transmission mechanism
CN104866452A (en) * 2015-05-19 2015-08-26 哈尔滨工业大学(鞍山)工业技术研究院 Multi-serial port extension method based on FPGA and TL16C554A
CN105045763A (en) * 2015-07-14 2015-11-11 北京航空航天大学 FPGA (Field Programmable Gata Array) and multi-core DSP (Digital Signal Processor) based PD (Pulse Doppler) radar signal processing system and parallel realization method therefor
US20180095750A1 (en) * 2016-09-30 2018-04-05 Intel Corporation Hardware accelerators and methods for offload operations

Also Published As

Publication number Publication date
DE102020105896A1 (en) 2020-09-10

Similar Documents

Publication Publication Date Title
US8489792B2 (en) Transaction performance monitoring in a processor bus bridge
US7822908B2 (en) Discovery of a bridge device in a SAS communication system
US7552241B2 (en) Method and system for managing a plurality of I/O interfaces with an array of multicore processor resources in a semiconductor chip
US8588228B1 (en) Nonvolatile memory controller with host controller interface for retrieving and dispatching nonvolatile memory commands in a distributed manner
US9806904B2 (en) Ring controller for PCIe message handling
US8606976B2 (en) Data stream flow controller and computing system architecture comprising such a flow controller
US8312187B2 (en) Input/output device including a mechanism for transaction layer packet processing in multiple processor systems
US10802995B2 (en) Unified address space for multiple hardware accelerators using dedicated low latency links
EP1750202A1 (en) Combining packets for a packetized bus
CN101430652A (en) On-chip network and on-chip network software pipelining method
US9596186B2 (en) Multiple processes sharing a single infiniband connection
CN104094222A (en) External auxiliary execution unit interface to off-chip auxiliary execution unit
US11935600B2 (en) Programmable atomic operator resource locking
US11989556B2 (en) Detecting infinite loops in a programmable atomic transaction
US7962676B2 (en) Debugging multi-port bridge system conforming to serial advanced technology attachment (SATA) or serial attached small computer system interface (SCSI) (SAS) standards using idle/scrambled dwords
CN100401279C (en) Configurable multi-port multi-protocol network interface to support packet processing
US20060259648A1 (en) Concurrent read response acknowledge enhanced direct memory access unit
US7805551B2 (en) Multi-function queue to support data offload, protocol translation and pass-through FIFO
US8402320B2 (en) Input/output device including a mechanism for error handling in multiple processor and multi-function systems
CN107291641B (en) Direct memory access control device for a computing unit and method for operating the same
US11847464B2 (en) Variable pipeline length in a barrel-multithreaded processor
CN111666106A (en) Data offload acceleration from multiple remote chips
US20140207881A1 (en) Circuit arrangement for connection interface
Inoue et al. Low-latency and high bandwidth TCP/IP protocol processing through an integrated HW/SW approach
US20170357594A1 (en) Transactional memory that is programmable to output an alert if a predetermined memory write occurs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination