WO2024102916A1 - Root complex switching across inter-die data interface to multiple endpoints - Google Patents


Info

Publication number
WO2024102916A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
die
upstream
adaptation layer
circuit
Application number
PCT/US2023/079244
Other languages
French (fr)
Inventor
Alexander Koch
Peter Korger
Original Assignee
Kandou Labs SA
Kandou Us, Inc.
Application filed by Kandou Labs SA, Kandou Us, Inc. filed Critical Kandou Labs SA
Publication of WO2024102916A1 publication Critical patent/WO2024102916A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/42 Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
    • G06F13/4295 Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus using an embedded synchronisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4004 Coupling between buses
    • G06F13/4022 Coupling between buses using switching circuits, e.g. switching matrix, connection or expansion network
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2213/00 Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F2213/0026 PCI express

Definitions

  • Retimers break a link between a host (root complex, abbreviated RC) and a device (end point) into two separate segments.
  • Thus, a retimer re-establishes a new PCIe link going forward, which includes re-training and proper equalization, implementing the physical and link layers.
  • While redrivers are pure analog amplifiers that boost the signal to compensate for attenuation, they also boost noise and usually contribute to jitter.
  • Retimers instead comprise analog and digital logic. Retimers equalize the signal, retrieve their clocking, and output a signal with high amplitude and low noise and jitter. Furthermore, retimers maintain power states to keep system power low.
  • FIGs. 1 and 2 show typical applications for retimers, in accordance with some embodiments.
  • one retimer is employed.
  • the retimer is located on the motherboard, and logically the retimer is between the PCIe root complex (RC) and the PCIe endpoint.
  • FIG. 2 shows the usage of two retimers.
  • the first retimer is similarly located on the motherboard, while the second retimer is on a riser card which makes the connection between the motherboard and the add-in card containing the PCIe endpoint.
  • switch devices may be used to extend the number of PCIe ports. Switches allow for connecting several endpoints to one root complex, and for routing data packets to the specified destinations rather than simply mirroring data to all ports.
  • One important characteristic of switches is the sharing of bandwidth, as all endpoints share the bandwidth of the root port.
  • Methods and systems are described herein which include an apparatus having a plurality of sets of upstream pseudo-ports (PPs) of a first circuit die, each upstream PP having a connection to a respective one of at least two root complex devices, a plurality of sets of downstream PPs of a second circuit die, each set of downstream PPs having connections to a respective one of at least two endpoints, an inter-die data interface between the first and second circuit dies, the inter-die data interface configured to establish retimer physical coding sublayer (RPCS) data flows between the upstream PPs and downstream PPs of the first and second circuit dies via adaptation layer ports on each circuit die according to an adaptation layer protocol, lane routing logic in the first and second circuit dies configured to map at least one of the sets of upstream PPs and a corresponding set of downstream PPs to respective adaptation layer ports on the first and second circuit dies according to the adaptation layer protocol, and a processor on one of the first and second circuit dies for configuring the lane routing logic in both the first and second circuit dies.
  • FIGs. 1 and 2 illustrate two usages of retimers, in accordance with some embodiments.
  • FIG. 3 is a block diagram of a chip configuration of a multi-die integrated chip module (ICM) for providing multiple endpoint switching between multiple root complexes using a high-speed die-to-die (D2D) interconnect, in accordance with some embodiments.
  • FIG. 4 is a data flow diagram of a multi-die ICM operating in a retimer mode where data lanes are routed within the same die, in accordance with some embodiments.
  • FIG. 5 is a data flow diagram of a multi-die ICM operating in a retimer mode where data lanes are routed between circuit dies using a D2D interconnect, in accordance with some embodiments.
  • FIG. 6 is a block diagram of a crossbar multiplexing switch for performing data lane routing, in accordance with some embodiments.
  • FIG. 7 is a diagram of a D2D interconnect, in accordance with some embodiments.
  • FIG. 8 is a block diagram of an adaptation layer for a D2D interconnect, in accordance with some embodiments.
  • FIG. 9 is a block diagram illustrating the configuration of the tile-to-tile (T2T) Serial Peripheral Interface (SPI) bus in a four-tile embodiment.
  • FIG. 10 is a block diagram illustrating a complete signal path between central processing unit (CPU) core 900 and each PHY on the various tiles in the multi-chip module.
  • FIG. 11 is a flowchart of a method, in accordance with some embodiments.
  • example embodiments of at least some aspects of the invention herein described assume a systems environment of at least one point-to-point communications interface connecting two integrated circuit chips representing a root complex (i.e., a host) and an endpoint, wherein the communications interface is supported by several data lanes, each composed of four high-speed transmission line signal wires.
  • Retimers typically include PHYs and retimer core logic.
  • PHYs include a receiver portion and a transmitter portion.
  • a PHY receiver recovers and deserializes data and recovers the clock, while a PHY transmitter serializes data and provides amplification for output transmission.
  • the retimer core logic performs deskewing (in multi-lane links) and rate adaptation to accommodate for frequency differences between the ports on each side.
  • Since the retimer is located on the path between a root complex (e.g., a CPU) and an end point (e.g., a cache block), the retimer adds additional value.
  • An integrated processing unit, e.g., an accelerator, may be integrated into the retimer, performing data processing on the path from the root complex to the end point.
  • the PCIe retimer circuit is a chiplet, a die, with a four-lane retimer and the capability to connect to a DPU chiplet or another retimer chiplet via the high-speed die-to-die interconnect. One, two or four lanes can be bundled into a multi-lane link where data is spread across all of the links.
  • each lane employs two PHYs, one on each end (up- and downstream ports). Considering four lanes, eight PHYs are used in one PCIe retimer die.
  • the PCIe retimer die also contains communication lines which allow for exchanging control information between two or more PCIe retimer dies.
  • The following can be built using one (or more) PCIe retimer chiplet(s). These are discussed in more detail below:
  • FIG. 3 is a block diagram of a multi-die ICM 300, in accordance with embodiments.
  • the ICM 300 includes a set of serial data transceivers (SerDes, PHYs) for a plurality of upstream pseudo-ports (PPs) of a first circuit die 305, each upstream PP having a connection to a respective one of at least two root complex devices 302 and 304.
  • the apparatus further includes a second circuit die 310 having respective sets of PHYs of respective downstream PPs, each downstream PP having a connection to a respective one of at least two endpoints 315 and 320.
  • FIG. 3 also includes an inter-die data interface (D2D) between the first circuit die 305 and the second circuit die 310.
  • the D2D interface is configured to establish retimer data flows between the upstream PPs and downstream PPs of the first and second circuit dies via adaptation layer ports on each circuit die according to an adaptation layer protocol.
  • the adaptation layer protocol may be configured to format raw data received on the PHYs of a pseudo-port of one type (upstream/downstream) for transmission over the D2D interface to PHYs of the pseudo-ports of the opposite type (downstream/ upstream).
  • a D2D interface is described in more detail below with respect to FIGs. 7 and 8 that utilizes multiple flows of an orthogonal differential vector signaling code (ODVS).
  • Each of the first and second circuit dies further includes lane routing logic 600 configured to map at least one of the sets of upstream PPs and a corresponding set of downstream PPs to respective adaptation layer ports on the first and second circuit dies according to the adaptation layer protocol.
  • the apparatus further includes a processor, e.g., a CPU core, on one of the first and second circuit dies for configuring the lane routing logic in both the first and second circuit dies.
  • one of the first and second circuit dies is a leader circuit die, and while both circuit dies may include a processor on the circuit die, only the processor on the leader circuit die is active.
  • the processor on the leader circuit die may configure the lane routing logic in the follower circuit die via a tile-to-tile serial peripheral interface (SPI) described in more detail below in the descriptions of FIGs. 9 and 10.
  • FIG. 3 includes a Board Management Controller (BMC) 325.
  • BMCs may be included on, e.g., motherboards to monitor the state of components and hardware devices on the motherboard utilizing sensors, and to communicate the status of such devices, e.g., to the root complex.
  • BMCs may be employed in e.g., server room/data center applications and may be remotely managed by administrators to access information about the overall system. Some monitoring functions of a BMC include temperature, humidity, power-supply voltage, fan speeds, communications parameters, and operating system functions.
  • the BMC may notify the administrator if any of the parameters exceed a threshold and the administrator may take action.
  • the BMC may be preconfigured to take certain actions in the event that a parameter exceeds a threshold, such as (but not limited to) executing a sequence to switch to a redundant endpoint in the event of a failure in the primary endpoint.
  • the BMC 325 monitors the status of the PCIe links between the root complexes 302/304 and the endpoints 315/320. In such embodiments, monitoring the status of the PCIe link includes bit error rate measurements for the upstream and downstream data paths. Such measurements may be useful to monitor the overall status of the PCIe links and initiate link retraining sequences.
  • BMC 325 may be configured to manage the multiple root complexes in FIG. 3.
  • endpoints 315 and 320 may correspond to shareable resources for expensive functions such as artificial intelligence (AI), shareable computer-readable mediums such as hard-disk drives (HDDs) or solid state drives (SSDs), and network interface cards (NICs), amongst other endpoint devices.
  • the BMC 325 may coordinate usage of the endpoints by the root complex devices, i.e., so that both root complex devices do not establish connections with the same endpoint device at the same time.
  • the BMC may utilize credit-based techniques to share the multiple endpoints between the multiple root complex devices.
  • the BMC may be configured to provide instructions to the CPU core in the leader tile of the ICM 300. Such instructions may be provided, e.g., over an SMBus connection, or various other point-to-point connections.
  • the instructions may be associated with a root complex-to-endpoint mapping, and the CPU of the leader tile may configure the lane routing logic on the leader tile as well as the follower tile to map the upstream pseudo-ports to the downstream pseudo-ports associated with the mapping instruction issued by the BMC.
  • configuring the lane routing logic comprises modifying configuration register space in both circuit dies, where the configuration register space includes control signal values provided as selection signals to the multiplexing devices in the lane routing logic.
  • the logic lanes of the upstream pseudo-ports have static mapping configurations to the adaptation layer ports.
  • the upstream pseudo-port PHY1 in FIG. 6 may be statically mapped to adaptation layer port 1 in the Tx-portion of the adaptation layer on the leader tile.
  • the downstream pseudo-ports on the follower tile may be selectively connected to any one of the adaptation layer ports 0-7, depending on the root complex-to-endpoint mapping.
  • the downstream pseudo-ports may be statically mapped to the adaptation layer ports while the upstream pseudo-ports are configurable to be connected to any adaptation layer port, and vice versa.
  • FIG. 4 is a data flow diagram that may be used for the PCIe data link to the first endpoint 315, in accordance with some embodiments.
  • serial data is received at the PHY in the upstream pseudo-port, which includes a deserializer configured to convert the serial data stream into e.g., 32-bit deserialized lane-specific data words.
  • the data words are routed via the lane routing logic to retimer core logic.
  • the core logic includes a PCS decoding block configured to perform, e.g., 8b10b or 128b/130b decoding prior to the data being stored in a retimer FIFO.
  • the retimer FIFO includes lane deskewing and rate adaptation functionalities across multiple lanes within a given circuit die as well as between lanes across multiple different circuit dies.
  • the lane-specific data words are read from the retimer FIFO and transmitted on the downstream serial data transceivers via the transmitter in the PHY of the downstream pseudoport.
  • FIG. 5 is a data flow diagram that may be used for the PCIe data link to the second endpoint 320 using the D2D interface.
  • FIG. 5 illustrates data received over a set of serial data transceivers and provided to a second circuit die via the inter-die data interface (e.g., the adaptation layer and inter-die transmitter).
  • data is received at a PHY of a first circuit die.
  • the data is deserialized and routed using lane routing logic to the adaptation layer on the first circuit die, which formats the raw data for transmission using the D2D interface.
  • the data is received at the adaptation layer on the second circuit die, which performs the reciprocal formatting to provide the data to the destination lanes on the second circuit die.
  • the data is provided to the RPCS logic to perform rate adaptation and lane-to-lane deskew before being output on the serial data transceiver PHYs of the second circuit die to the endpoint.
  • a similar data path exists in the reverse direction from the endpoint to the root complex.
  • RPCS logic is shown which may include, e.g., the 8b10b encoding/decoding functions of PCIe generations 1 and 2 and the 128b/130b encoding/decoding functions of PCIe generations 3-5.
  • Embodiments described herein further contemplate PCIe generation 6, which utilizes a flow control unit (FLIT) scheme, and thus no 8b10b or 128b/130b coding is implemented.
  • the functionalities for encoding/decoding may be omitted, while additional functionalities specific to PCIe 6, such as FEC decoding (either partial or full) are included as logic in the data path.
  • Some functionalities of retimer core logic are shared, such as lane-to-lane deskewing and rate adaptation in the FIFO.
  • FIG. 6 is a block diagram of lane routing logic 600 in a retimer circuit die of an ICM, in accordance with some embodiments.
  • FIG. 6 includes a block diagram on the left and various lane routing configurations on the right.
  • data is fed in through a deserializer, passes into the PHY and through the core routing logic, and is output via the serializer of the same PHY at the bottom.
  • in the middle diagram 610, the data is fed into one port, processed in the core routing logic, and fed out at the opposite PHY on the bottom.
  • all data is fed into the PHYs at the top side of one PCIe retimer circuit and directly forwarded to the high-speed die-to-die interconnect.
  • the data is fed through the core lane routing logic to the PHYs on the other PCIe retimer die. In all such scenarios, there are data paths in the opposite direction as well.
  • On the left side of FIG. 6, a sketch of the lane routing logic is shown.
  • the serial data transceiver PHYs are numbered from 0 to 7 and include receiver deserializers (DES) and transmitter serializers (SER).
  • the top lane (PHY #0 and #4) illustrates the three different data paths matching the data paths shown on the right.
  • Data path 605 on the right corresponds to data coming in on PHY 0 of the PCIe retimer circuit leaving on the same PHY #0 on the left-hand side of FIG. 6.
  • Path 610 shows a feed-through path where data received on PHY 0 passes through to PHY #4 as shown on the left-hand side of FIG. 6.
  • path 615 indicates that all received data is directly forwarded to the adaptation layer to be transmitted over the inter-die data interface.
  • data from the inter-die data interface is forwarded to the core routing logic, where it is processed and output on the attached PHY.
  • the second lane (PHY #1 and #5) illustrates the multiplexing capabilities.
  • Each core-logic/transmitter path can receive data from any one of the eight lanes. Additionally, data can be obtained from the adaptation layer ports of the D2D data interface.
  • the other lanes (PHY #0 with #4, PHY #2 with #6, and PHY #3 with #7) have the same switching capabilities.
  • the multiplexing for the adaptation layer ports is shown. As shown, any input PHY may be selected as the input for a given adaptation layer port.
  • the adaptation layer ports may have fixed mappings to the D2D flows as described below with respect to FIG. 7. Thus, some embodiments may mirror data by selecting the same received PHY data for multiple adaptation layer physical ports.
  • Switching a data path in the routing logic includes the 32-bit received data bus carrying the deserialized lane-specific data words, the accompanying data-enable lines, the recovered clock, and the corresponding reset.
  • a clock domain crossing occurs by reading the deserialized data into FIFOs using the recovered clock signal.
  • the adaptation layer handles clocking, e.g., for outputting 150-bit D2D words to the transmitters in each D2D data flow, as well as the conversion to the 25 GBd clock domain for the 5b6w interface. It is important to note that only raw data is multiplexed; the received data is not processed in any way.
  • the Raw MUX logic is statically configured via configuration bits, and the switching itself happens asynchronously. If the Raw MUX settings are changed during mission mode, invalid data and glitches on the clock lines are likely. Thus, the multiplexing logic is set up during reset.
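By way of illustration only, the following Python sketch models this static raw multiplexing in software. The class and register names are invented for the sketch; real hardware switches 32-bit buses, data-enable lines, clocks, and resets rather than Python values.

```python
# Minimal behavioral sketch of a statically configured raw crossbar MUX.
# Hypothetical model: the actual logic steers buses, enables, recovered
# clocks and resets; here each "lane" is simply a Python value.

class RawMux:
    NUM_LANES = 8  # PHY deserializer outputs 0..7

    def __init__(self, sel_regs):
        # sel_regs[i] selects which input lane drives output i.
        # Latched once (e.g., during reset); changing it in mission
        # mode would glitch real hardware, so we freeze it here.
        assert len(sel_regs) == self.NUM_LANES
        self._sel = tuple(sel_regs)

    def route(self, rx_words):
        # rx_words[i] is the deserialized 32-bit word from PHY i.
        # Raw data is only steered, never modified.
        return [rx_words[self._sel[i]] for i in range(self.NUM_LANES)]

mux = RawMux(sel_regs=[4, 5, 6, 7, 0, 1, 2, 3])  # pair PHY i with PHY i+4
print(mux.route([0x10, 0x11, 0x12, 0x13, 0x20, 0x21, 0x22, 0x23]))
```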
  • each circuit die includes lane routing logic such as the Raw MUX for lane routing between and within circuit dies.
  • a primary circuit die, also referred to as a "leader," may perform the configuration of the Raw MUX in each circuit die, e.g., by writing to the configuration registers associated with the Raw MUX.
  • FIGs. 9 and 10 illustrate such tile-to-tile communications.
  • FIG. 9 provides a schematic of the configuration of the T2T SPI bus in the four-tile case. This specific number of tiles is not limiting as the principles described herein can be extended to an N-tile retimer having one leader tile and N-1 follower tiles, N > 2.
  • the T2T SPI leader 985 includes a serial clock line SCK that carries a serial clock signal generated by T2T SPI leader 985.
  • the SCK signal is received by all T2T SPI followers and is used to co-ordinate reading and writing of data over the T2T SPI bus.
  • T2T SPI leader 985 also includes a MOSI line (Leader Out Follower In) and MISO line (Leader In Follower Out).
  • MOSI line is used to transmit data from the leader to the follower, i.e. as part of a write operation.
  • MISO line is used to transmit data from the follower to the leader, i.e. as part of a read operation.
  • T2T SPI leader 985 further includes an FS line (Follower Select). This is used to signal which follower is to participate in the current operation of the bus, that is, which follower the data or a command on the bus is intended for.
  • T2T SPI followers 975a, 975b and 975c are each also coupled to all of the lines discussed above to enable two-way communication between the T2T leader and follower. In this manner, communication between tiles is achieved.
  • FIG. 10 shows the complete signal path between CPU core 900 and each PHY on the various tiles in the multi-chip module.
  • CPU core 900 is connected to PHYs 970 on the leader tile via leader tile APB interconnect 925 and can thus communicate with PHYs 970 via APB interconnect 925.
  • CPU core 900 is also connected to T2T SPI leader 985 via leader tile APB interconnect 925.
  • T2T SPI leader 985 is part of the T2T SPI bus that enables CPU core 900 to communicate with other tiles.
  • each follower tile includes a respective T2T SPI follower 975a, 975b, 975c.
  • Each of these SPI followers is coupled to T2T SPI leader 985 to enable signaling between tiles.
  • Each SPI follower 975a, 975b, 975c is coupled to respective PHYs 970a, 970b, 970c via respective follower tile APB interconnects 926, 927, 928.
  • Each SPI follower 975a, 975b, 975c is the leader on the respective APB interconnect 926, 927, 928. This enables each SPI follower to access all registers that are located on the same tile as the SPI follower.
  • Each PHY is assigned a unique APB address or APB address range so that it is possible for CPU core 900 to write to and/or read from one specific PHY on any tile. From the perspective of the CPU core 900, the entire multi-tile module has a single address space that includes separate regions for each PHY.
  • control information put onto the SPI bus can be of the following format. This is referred to herein as a ‘control packet’.
  • Bits 0-23 are address bits (‘a’)
  • bits 24, 25 and 26 are follower select bits
  • bits 27-31 are reserved bits (‘r’).
  • there are three follower select bits because there are three follower tiles (and hence three T2T SPI followers) in this example.
  • the reserved bits provide space for additional follower select bits - in this case, up to eight follower select bits can be provided, supporting up to eight follower tiles. The principles established here can be extended to any number of follower tiles by increasing the word size.
  • the address bits form an APB address.
  • the T2T-SPI followers are each configured as bus leader on their respective local APB interconnects, enabling each T2T-SPI follower to instruct its respective APB interconnect to perform a write or read operation to one of the respective PHYs the APB bus is coupled to.
  • the address data can be omitted because the T2T-SPI bus can auto-increment addresses such that it already knows which address to write data to or read data from.
  • the address data can be provided to the local APB interconnect after receipt of the control packet by the respective T2T SPI follower, enabling the local APB interconnect to route commands and data to the correct local PHY.
  • the follower select bits enable the control packet to specify which follower select line should be activated, i.e. which tile data is to be written to or read from.
  • the T2T SPI bus uses the follower select bits to control the follower select lines FS1, FS2, FS3, where, e.g., a 0 indicates the corresponding follower select line should be low and a 1 indicates a corresponding follower select line should be high.
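A minimal sketch of packing and unpacking this 32-bit control packet layout (address in bits 0-23, per-follower select bits in 24-26, reserved bits 27-31); the helper names are invented:

```python
# Hypothetical pack/unpack helpers for the 32-bit T2T SPI control packet:
# bits 0-23 APB address, bits 24-26 per-line follower select, 27-31 reserved.

def pack_control(addr: int, follower_sel: int) -> int:
    assert 0 <= addr < (1 << 24)
    assert 0 <= follower_sel < (1 << 3)
    return (follower_sel << 24) | addr  # reserved bits left as zero

def unpack_control(word: int):
    addr = word & 0xFFFFFF
    follower_sel = (word >> 24) & 0x7
    return addr, follower_sel

pkt = pack_control(addr=0x012345, follower_sel=0b010)  # drive FS2 high
print(hex(pkt), unpack_control(pkt))
```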
  • Follower select control information can alternatively be sent separately from the APB address data.
  • the follower select information could be sent in-band as illustrated above, or another channel could be used such as a System Management bus (SMBus).
  • the address data can be sent separately and before the data package is transmitted. In some cases the address data can be omitted because the T2T SPI bus can auto-increment addresses such that it already knows which address to write data to.
  • the T2T SPI leader 985 can keep the follower select line(s) asserted until it receives new instructions regarding follower select line configuration.
  • the relevant APB interconnect(s) can continue writing to the address(es) specified (possibly by auto-incrementing) until new addressing information is provided. In this way, data and commands can be transmitted to, and received from, any PHY on any tile.
  • the APB address space is a global address space across all tiles. This means it is possible to address any register on any tile via this global address space.
  • One particular configuration provides a base address for each tile that is given by a tile identifier multiplied by a constant.
  • the tile identifier can be a tile number and the constant can be a base address for the leader tile.
  • Other memory space constructions are possible.
  • Each register on each tile has a unique address or address range assigned to it within this global address space.
  • Each PHY of PHYs 970, 970a, 970b, 970c thus has a unique address or address range assigned to it.
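A small sketch of the global address construction described above, under the assumption of a fixed per-tile stride; the stride value and names are invented for illustration:

```python
# Sketch of the hypothetical global APB address construction: each
# tile's base address is tile_id * TILE_STRIDE, and every register or
# PHY has a unique offset within its tile's region.

TILE_STRIDE = 0x10_0000  # assumed per-tile address range (1 MiB)

def global_addr(tile_id: int, local_offset: int) -> int:
    assert 0 <= local_offset < TILE_STRIDE
    return tile_id * TILE_STRIDE + local_offset

# PHY #2 on follower tile 3, at an assumed local offset:
print(hex(global_addr(tile_id=3, local_offset=0x2_0040)))
```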
  • the CPU core on the leader tile may coordinate the lane switching circuits in both tiles.
  • the CPU core on the follower tile may be in a low power state.
  • a SPI communications bus between the two tiles may be used to configure the switching circuit in the follower tile to select between the first and second sets of downstream serial data transceiver ports.
  • a die-to-die (D2D) interface may be present and configured to route data lanes between the leader and follower tiles. I.e., serial data streams received on upstream ports of the leader tile may be routed to downstream ports of the follower tile and vice versa.
  • Such a D2D interface may also be configured to carry configuration information as sideband information from the leader tile to the follower tile, e.g., to configure the configuration registers of the follower tile.
  • the configuration of the raw crossbar MUX may be performed via a system management bus, which may be further connected to the root complex.
  • a virtual channel between the root complex and retimer chip may be used for configuration purposes.
  • vendor-defined messages (VDMs) may be present in particular vendor-defined packet fields of a PCIe data transmission. Such VDMs may be detected, extracted, and provided to the CPU of the leader circuit die using, e.g., an interrupt protocol.
  • each follower tile may have a specific tile ID, and configuration register write commands can be assigned to certain tile IDs.
  • the leader tile may initialize the configuration registers of the Raw MUX of the follower tile such that the RX adaptation layer ports are statically mapped to downstream ports to the redundant endpoint.
  • the leader tile can switch the routing of the deserialized lane-specific data words between (i) downstream ports on the same die to the primary endpoint and (ii) the adaptation layer to be routed via the D2D interface.
  • FIG. 7 is a block diagram of an inter-die data interface (also referred to herein as a high-speed die-to-die (D2D) interconnect, "D2D link" and the like), in accordance with some embodiments.
  • the D2D link utilizes eight high-speed die-to-die data flows, four in each direction, each data flow operating at a rate of 25 GBd, transmitting 5 bits over 6 wires for a total throughput of 125 Gbps.
  • the interface includes two differential clock lanes operating at 6.25 GHz. It should be noted that interconnects having alternative sizes, throughputs, and/or encoding methods may be utilized as well.
  • the PCIe retimer operates the high-speed die-to-die interconnect using low-latency FEC and scrambling.
  • 150 bits of data are transmitted each clock period for each data flow. The clock frequency may depend on the link speed.
  • the 150 bits of data sent at one end of the link are aligned at the receiving end, i.e., TX bit0 is received as RX bit0.
  • the 150 bits of data in a clock cycle are referred to as a 'word'.
  • the inter-die data interface is operated using the same 100 MHz reference clock as the PHYs.
  • the interface is configured through the APB interface with an 8-bit wide data bus.
  • the interface may be configured to operate at a lower speed to reduce power.
  • the number of enabled TX/RX data flows may be adjusted depending on the amount of bandwidth required for the communication.
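The quoted rates can be sanity-checked with simple arithmetic: 5 bits per unit interval at 25 GBd gives 125 Gbps per flow, and four flows per direction give 500 Gbps each way, before overheads such as FEC. A one-off check:

```python
# Back-of-envelope check of the quoted D2D link rates (figures from the
# text; FEC and other overheads are ignored in this sketch).

baud_rate = 25e9       # 25 GBd per data flow
bits_per_ui = 5        # 5 bits carried over 6 wires per unit interval
flows_per_dir = 4

flow_gbps = baud_rate * bits_per_ui / 1e9
print(f"per flow:      {flow_gbps:.0f} Gbps")                   # 125 Gbps
print(f"per direction: {flow_gbps * flows_per_dir:.0f} Gbps")   # 500 Gbps
```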
  • FIG. 8 is a block diagram of an adaptation layer (AL) for an inter-die data interface, in accordance with some embodiments.
  • the Adaptation Layer formats the payload sent and received over the high-speed die-to-die interconnect. As shown in FIG. 8, the Adaptation Layer supports several types of payload.
  • the retimers 305 and 310 may utilize the retimer data path shown in FIG. 5.
  • data is routed over the D2D interface using an adaptation layer.
  • raw encoded data is sent over the D2D interface using the raw interface to minimize latency.
  • the frame mode may be used when the inbound traffic is terminated using a link controller, described in more detail below.
  • the eight raw SERDES RX data interfaces are served in parallel.
  • the eight frame interfaces may be served Round-Robin or in parallel depending on the protocol.
  • the high-speed link is statically setup to either transmit raw SERDES RX data or frames of data.
  • the indirect register accesses may be interleaved in both above traffic types.
  • the raw SERDES RX data flow collects two 32-bit words of data from a SERDES over two consecutive receive clock cycles and writes the combined 64 bits of data into an asynchronous FIFO. It is in the asynchronous FIFOs that the clock domain changes from the recovered clock provided with the data by the lane routing logic to the adaptation layer clock domain.
  • the read data from the asynchronous FIFO is sent on a specific data flow of the high-speed die-to-die link.
  • the raw data from two RX SERDES asynchronous FIFOs are combined and sent on the same specific data flow of the high-speed link.
  • the adaptation layer provides a clock to read from the asynchronous FIFOs as well as a clock to funnel the 150-bit (or 160-bit, in the case of FEC) D2D words to the 25 GBd clock used for the transmitters in the D2D data flows.
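A behavioral sketch of the two-cycle collection step into the asynchronous FIFO; the clock-domain crossing itself is abstracted away and all names are invented:

```python
# Behavioral sketch of the raw SERDES RX collection step: two 32-bit
# words from consecutive recovered-clock cycles are combined into one
# 64-bit entry of an asynchronous FIFO (the actual clock-domain
# crossing is not modeled here).

from collections import deque

class RawRxCollector:
    def __init__(self):
        self._pending = None
        self.fifo = deque()  # stands in for the async FIFO

    def on_rx_clock(self, word32: int):
        if self._pending is None:
            self._pending = word32                   # first cycle: hold
        else:
            entry = (self._pending << 32) | word32   # second cycle: combine
            self.fifo.append(entry)
            self._pending = None

rx = RawRxCollector()
for w in (0x11111111, 0x22222222, 0x33333333, 0x44444444):
    rx.on_rx_clock(w)
print([hex(e) for e in rx.fifo])
```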
  • the raw data format, i.e., the non-frame-based protocol, may be utilized to transport raw data over the D2D interface while the multi-chip module is operating in a retimer mode of operation.
  • the non-frame-based protocol word is as follows:
  • SERDES payload is a high priority payload type, register commands are medium priority, and future messages are low priority.
  • the SERDES payload is always filled in a user data cycle starting with PAYLOAD0, followed by PAYLOAD1, etc.
  • a register command is only inserted in the case that there are fewer than four SERDES payload words ready in the data flow cycle.
  • a register command is only inserted in the PAYLOAD3 field.
  • the register write address command is followed by a register write data command before a new register write address command is sent.
  • a register read address command or register read data command may be inserted in between the register write address command and register write data command.
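A rough Python sketch of the slot-filling priority rule described in the preceding bullets; field names are notional, and the write-address/write-data ordering rule is not modeled:

```python
# Sketch of the slot-filling rule for a non-frame D2D word: SERDES
# payload has priority and fills PAYLOAD0..PAYLOAD3 in order; a register
# command may only occupy the PAYLOAD3 slot, and only when fewer than
# four payload words are ready this cycle.

def build_d2d_word(payload_ready, reg_cmd=None):
    slots = [None] * 4
    for i, p in enumerate(payload_ready[:4]):
        slots[i] = ("PAYLOAD", p)
    if reg_cmd is not None and len(payload_ready) < 4:
        slots[3] = ("REG_CMD", reg_cmd)  # medium priority, PAYLOAD3 only
    return slots

print(build_d2d_word([0xA, 0xB], reg_cmd=("WR_ADDR", 0x100)))
print(build_d2d_word([0xA, 0xB, 0xC, 0xD], reg_cmd=("WR_ADDR", 0x100)))
```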
  • D2D interface may refer to the Universal Chiplet Interconnect Express (UCIe) interface.
  • UCIe includes several modes of operations including a FLIT-aware mode of operation that includes a die-to-die adapter to implement e.g., CXL/PCIe protocols.
  • UCIe includes a streaming protocol that offers generic modes of a user defined protocol to transmit raw data. In the multiple-endpoint switching embodiment described with respect to FIG. 3, such a streaming protocol may be utilized to convey data between circuit dies in the retimer mode of operation.
  • a similar adaptation layer may be utilized that partitions traffic for two separate PCIe links over the UCIe connection, similar to the adaptation layer described above, which statically maps PHYs of each upstream pseudo-port to PHYs of a corresponding one of the downstream pseudo-ports using the particular D2D flows shown in FIG. 7.
  • Non-load balancing mode is used when the D2D link transmits PCS payload data (raw SERDES data) or non-PCS payload data in custom frame-based mode.
  • Load balancing mode is used when transmitting non-PCS payload data in frame-based mode. Load balancing mode is described in more detail below.
  • the payload data from a fixed set of lanes is statically setup for transmission over a specific D2D data flow.
  • the 'logic lanes' in this context correspond to the adaptation layer physical ports, i.e., the ports to which the PHYs are mapped via the raw MUX crossbar switch.
  • fixed mapping of logic lanes to data flows may be used. In one example, the mapping for eight lanes of traffic from the adaptation layer physical ports to the four die-to-die data flows is given below:
    Logic lanes 0-1 map to data flow 0
    Logic lanes 2-3 map to data flow 1
    Logic lanes 4-5 map to data flow 2
    Logic lanes 6-7 map to data flow 3
  • Such a mapping may also apply to non-SERDES payload data.
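The fixed pairing above reduces to a one-line rule, sketched here for clarity:

```python
# The logic-lane-to-data-flow pairing quoted above is simple integer
# division: lanes 0-1 -> flow 0, lanes 2-3 -> flow 1, and so on.

def data_flow_for_lane(logic_lane: int) -> int:
    assert 0 <= logic_lane < 8
    return logic_lane // 2

print({lane: data_flow_for_lane(lane) for lane in range(8)})
```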
  • the register commands and message payload are statically setup to use a specific data flow to minimize logic by only handling one command in one cycle.
  • the messages payload may be configured to use a different data flow than for the register commands.
  • the lanes may be configured statically to the same specific D2D data flows given above.
  • D2D link words are load distributed round-robin from the two frame interfaces per D2D data flow.
  • Some embodiments may implement a minimum spacing between D2D link words for the same frame interface/port on the same data flow. In some embodiments, the minimum spacing may be four cycles.
  • Some embodiments may have programmability to run fixed TDM slots. In fixed TDM mode the transmitter constantly sends words for the four supported ports, e.g., Port#0, Port#1, Port#2, Port#3, Port#0, Port#1, etc. If a port does not have payload to send in a slot it sends an IDLE cycle. Some embodiments may also implement programmability for the number of ports in the TDM calendar.
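A toy model of the fixed-TDM calendar just described, with an assumed programmable port count; queue handling is simplified to Python lists:

```python
# Sketch of the fixed-TDM transmit calendar: the transmitter walks the
# configured ports round-robin and emits an IDLE word whenever a port
# has nothing to send in its slot.

from itertools import cycle, islice

def tdm_schedule(port_queues, num_slots):
    calendar = cycle(range(len(port_queues)))  # port count is programmable
    out = []
    for port in islice(calendar, num_slots):
        q = port_queues[port]
        out.append((port, q.pop(0) if q else "IDLE"))
    return out

queues = [["w00", "w01"], [], ["w20"], ["w30"]]  # Port#0..Port#3
print(tdm_schedule(queues, num_slots=8))
```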
  • the register commands and message payload may also be statically setup to use a specific D2D data flow to minimize logic by only handling one command in one cycle, similar to the raw SERDES mode.
  • the messages payload may be configured to use a different data flow than the register commands.
  • the D2D interface includes an APB follower interface and an APB leader interface.
  • the APB follower interface is the interface to all the configuration registers of the adaptation layer including configuration registers to set up the tile-to-tile (T2T) read/write transactions.
  • the T2T transactions are indirect register read and write commands sent over the D2D link.
  • the source of the T2T transactions is the adaptation layer on the leader tile.
  • the destination of the T2T transactions is the adaptation layer on the follower tile which translates the received T2T read/write commands to an APB read/write transaction.
  • Both the APB follower and leader interfaces have command FIFOs, whereas only the APB leader interface has a read return FIFO.
  • the number of entries in the two types of FIFOs can be independent, however, at least one embodiment configures them to be equal size.
  • the APB leader interface executes the received T2T read/write commands on the APB in the follower tile. For read commands, the corresponding read return data is transmitted back to the leader tile on the D2D link.
  • the command FIFO in the APB leader interface allows for a number of outstanding writes that may take some time to execute on the follower tile.
  • Firmware guarantees that the command FIFO does not overrun.
  • the fill level of the FIFO may be read in a register, however firmware can guarantee no overrun occurs by adding delay between T2T write transactions, or by performing a read and waiting for the read data after having sent a maximum number of back-to-back T2T write transactions, where the maximum number is defined by the number of command FIFO entries minus one.
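A hypothetical firmware-side sketch of that overrun-avoidance rule; the FIFO depth, transport object, and method names are placeholders, not the actual firmware API:

```python
# Firmware-side sketch of the overrun rule quoted above: send at most
# (FIFO_ENTRIES - 1) back-to-back T2T writes, then issue a read and wait
# for its return data, which proves the follower has drained the FIFO.

FIFO_ENTRIES = 8  # assumed command FIFO depth

def safe_t2t_writes(t2t, writes, flush_addr):
    outstanding = 0
    for addr, data in writes:
        t2t.write(addr, data)
        outstanding += 1
        if outstanding == FIFO_ENTRIES - 1:
            t2t.read(flush_addr)  # blocking read flushes prior writes
            outstanding = 0

class DummyT2T:  # stand-in transport for demonstration only
    def write(self, addr, data): print(f"WR {hex(addr)} <= {hex(data)}")
    def read(self, addr):
        print(f"RD {hex(addr)} (flush)"); return 0

safe_t2t_writes(DummyT2T(), [(0x100 + i, i) for i in range(10)], flush_addr=0x0)
```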
  • FIG. 11 is a flowchart of a method 1100, in accordance with some embodiments.
  • method 1100 includes receiving 1105, at a plurality of upstream serial data transceivers of a first circuit die of a multi-die integrated circuit module (ICM), a plurality of serial data lanes associated with a PCIe data link, and responsively generating respective deserialized lane-specific data words.
  • the method further includes providing 1110 the deserialized lane-specific data words for transmission via a group of downstream serial data transceivers on the first circuit die of the multi-die ICM, the group of downstream serial data transceivers having a PCIe data link to a first endpoint.
  • the method further includes rerouting 1115 the deserialized lane-specific data words over an inter-die data interface using an inter-die adaptation layer protocol to a second circuit die of the multi-die ICM.
  • the method further includes recovering 1120 the deserialized lane-specific data words at the second circuit die from the inter-die data interface.
  • the method further includes transmitting 1125 the deserialized lane-specific data words via a second group of downstream serial data transceivers to a second endpoint via a second PCIe data link.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Information Transfer Systems (AREA)

Abstract

An apparatus includes a plurality of sets of upstream pseudo-ports (PPs) of a first circuit die, each upstream PP having a connection to a respective one of at least two root complex devices, a plurality of sets of downstream PPs of a second circuit die, each set of downstream PPs having connections to a respective one of at least two endpoints, an inter-die data interface between the first and second circuit dies having adaptation layer ports on each circuit die according to an adaptation layer protocol, lane routing logic in the first and second circuit dies configured to map at least one of the sets of upstream PPs and a corresponding set of downstream PPs to respective adaptation layer ports on the first and second circuit dies, and a processor on one of the first and second circuit dies for configuring the lane routing logic in both circuit dies.

Description

ROOT COMPLEX SWITCHING ACROSS INTER-DIE DATA INTERFACE TO MULTIPLE ENDPOINTS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of U.S. Provisional Application No. 63/382,901, entitled "ROOT COMPLEX SWITCHING ACROSS INTER-DIE DATA INTERFACE TO MULTIPLE ENDPOINTS", filed November 9, 2022, which is hereby incorporated by reference in its entirety for all purposes.
BACKGROUND
[0002] With the increased data rate of PCIe 5.0 (32 Gbps) compared to previous generations (e.g., PCIe 4.0, max. 16 Gbps), the channel reach becomes even shorter than before, and the need for retimers becomes more evident. Typical channels comprise system boards, backplanes, cables, riser-cards and add-in cards. Connections across these kinds of channels - often combinations of these channels and their sockets - usually have losses that exceed the specified target loss of 36 dB at 16 GHz. Retimers extend the channel reach beyond what is possible without a retimer.
[0003] Retimers break a link between a host (root complex, abbreviated RC) and a device (end point) into two separate segments. Thus, a retimer re-establishes a new PCIe link going forward, which includes re-training and proper equalization, implementing the physical and link layers.
[0004] While redrivers are pure analog amplifiers that boost the signal to compensate for attenuation, they also boost noise and usually contribute to jitter. Retimers instead comprise analog and digital logic. Retimers equalize the signal, retrieve their clocking, and output a signal with high amplitude and low noise and jitter. Furthermore, retimers maintain power states to keep system power low.
[0005] Retimers were first specified in PCIe 4.0. For PCIe 5.0, the usage of retimers is expected. FIGs. 1 and 2 show typical applications for retimers, in accordance with some embodiments. In FIG. 1, one retimer is employed. The retimer is located on the motherboard, and logically the retimer is between the PCIe root complex (RC) and the PCIe endpoint.
[0006] FIG. 2 shows the usage of two retimers. The first retimer is similarly located on the motherboard, while the second retimer is on a riser card which makes the connection between the motherboard and the add-in card containing the PCIe endpoint.
[0007] In complex PCIe systems, the number of PCIe endpoints can be significantly higher than the number of free PCIe ports. In such scenarios, switch devices may be used to extend the number of PCIe ports. Switches allow for connecting several endpoints to one root complex, and for routing data packets to the specified destinations rather than simply mirroring data to all ports. One important characteristic of switches is the sharing of bandwidth, as all endpoints share the bandwidth of the root port.
BRIEF DESCRIPTION
[0008] Methods and systems are described herein which include an apparatus having a plurality of sets of upstream pseudo-ports (PPs) of a first circuit die, each upstream PP having a connection to a respective one of at least two root complex devices, a plurality of sets of downstream PPs of a second circuit die, each set of downstream PPs having connections to a respective one of at least two endpoints, an inter-die data interface between the first and second circuit dies, the inter-die data interface configured to establish retimer physical coding sublayer (RPCS) data flows between the upstream PPs and downstream PPs of the first and second circuit dies via adaptation layer ports on each circuit die according to an adaptation layer protocol, lane routing logic in the first and second circuit dies configured to map at least one of the sets of upstream PPs and a corresponding set of downstream PPs to respective adaptation layer ports on the first and second circuit dies according to the adaptation layer protocol, and a processor on one of the first and second circuit dies for configuring the lane routing logic in both the first and second circuit dies.
[0009] This Brief Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Brief Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. Other objects and/or advantages of the present invention will be apparent to one of ordinary skill in the art upon review of the Detailed Description and the included drawings.
BRIEF DESCRIPTION OF FIGURES
[0010] FIGs. 1 and 2 illustrate two usages of retimers, in accordance with some embodiments.
[0011] FIG. 3 is a block diagram of a chip configuration of a multi-die integrated chip module (ICM) for providing multiple endpoint switching between multiple root complexes using a high-speed die-to-die (D2D) interconnect, in accordance with some embodiments.
[0012] FIG. 4 is a data flow diagram of a multi-die ICM operating in a retimer mode where data lanes are routed within the same die, in accordance with some embodiments.
[0013] FIG. 5 is a data flow diagram of a multi-die ICM operating in a retimer mode where data lanes are routed between circuit dies using a D2D interconnect, in accordance with some embodiments.
[0014] FIG. 6 is a block diagram of a crossbar multiplexing switch for performing data lane routing, in accordance with some embodiments.
[0015] FIG. 7 is a diagram of a D2D interconnect, in accordance with some embodiments.
[0016] FIG. 8 is a block diagram of an adaptation layer for a D2D interconnect, in accordance with some embodiments.
[0017] FIG. 9 is a block diagram illustrating the configuration of the tile-to-tile (T2T) Serial Peripheral Interface (SPI) bus in a four-tile embodiment.
[0018] FIG. 10 is a block diagram illustrating a complete signal path between central processing unit (CPU) core 900 and each PHY on the various tiles in the multi-chip module.
[0019] FIG. 11 is a flowchart of a method, in accordance with some embodiments.
DETAILED DESCRIPTION
[0020] Despite the increasing technological ability to integrate entire systems into a single integrated circuit, multiple chip systems and subsystems retain significant advantages. For purposes of description and without limitation, example embodiments of at least some aspects of the invention herein described assume a systems environment of at least one point-to-point communications interface connecting two integrated circuit chips representing a root complex (i.e., a host) and an endpoint, wherein the communications interface is supported by several data lanes, each composed of four high-speed transmission line signal wires.
[0021] Retimers typically include PHYs and retimer core logic. PHYs include a receiver portion and a transmitter portion. A PHY receiver recovers and deserializes data and recovers the clock, while a PHY transmitter serializes data and provides amplification for output transmission. The retimer core logic performs deskewing (in multi-lane links) and rate adaptation to accommodate for frequency differences between the ports on each side.
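As a purely conceptual illustration of rate adaptation, the Python sketch below models an elastic buffer that drops or inserts removable fill symbols (in the spirit of PCIe SKP symbols) depending on its fill level; the thresholds, names, and symbol model are invented for this sketch and do not describe the product's implementation.

```python
# Conceptual elastic-buffer model: removable fill symbols absorb the
# small frequency offset between the recovered RX clock and local TX clock.

from collections import deque

SKP = "SKP"  # stand-in for a removable fill symbol

class ElasticBuffer:
    def __init__(self, low=2, high=12):
        self.buf, self.low, self.high = deque(), low, high

    def push(self, symbol):
        # RX side (recovered clock): drop a fill symbol when too full,
        # i.e., the far-end clock is slightly faster than ours.
        if symbol == SKP and len(self.buf) > self.high:
            return
        self.buf.append(symbol)

    def pop(self):
        # TX side (local clock): insert a fill symbol when running dry,
        # i.e., the far-end clock is slightly slower than ours.
        if len(self.buf) < self.low:
            return SKP
        return self.buf.popleft()

eb = ElasticBuffer()
for s in ("D0", SKP, "D1", "D2"):
    eb.push(s)
print([eb.pop() for _ in range(3)])  # data flows through, SKPs pad gaps
```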
[0022] Since the retimer is located on the path between a root complex (e.g., a CPU) and an end point (e.g., a cache block), the retimer adds additional value. An integrated processing unit, e.g., an accelerator, may be integrated into the retimer, performing data processing on the path from the root complex to the end point.
[0023] The PCIe retimer circuit is a chiplet, a die, with a four-lane retimer and the capability to connect to a DPU chiplet or another retimer chiplet via the high-speed die-to-die interconnect. One, two or four lanes can be bundled into a multi-lane link where data is spread across all of the links. It is also possible to configure each lane individually to form a single-lane link. In the PCIe retimer, each lane employs two PHYs, one on each end (up- and downstream ports). Considering four lanes, eight PHYs are used in one PCIe retimer die. The PCIe retimer die also contains communication lines which allow for exchanging control information between two or more PCIe retimer dies.
[0024] The following can be built using one (or more) PCIe retimer chiplet(s). These are discussed in more detail below:
4-lane retimer: single die, with fully flexible 4x4 static lane routing.
4-lane retimer with accelerator (DPU): two dies in one package, a retimer die and a DPU die.
8-lane retimer: two dies in one package, limited static lane routing - flexible 4x4 routing on the same die but no data crossing die boundaries.
8-lane retimer with fully flexible lane routing: two dies in one package; data crossing chiplet boundaries is routed through the high-speed die-to-die interconnect at the cost of additional delay.
8-lane retimer with accelerator (DPU): three dies in one package, two retimer dies and a DPU die.
16-lane retimer: four dies in one package, limited static lane routing - flexible 4x4 routing on the same die but no data crossing die boundaries.
Multi-Die ICM with Multiple Endpoint Switching using D2D Interface
[0025] FIG. 3 is a block diagram of a multi-die ICM 300, in accordance with embodiments. As shown, the ICM 300 includes a set of serial data transceivers (SerDes, PHYs) for a plurality of upstream pseudo-ports (PPs) of a first circuit die 305, each upstream PP having a connection to a respective one of at least two root complex devices 302 and 304. The apparatus further includes a second circuit die 310 having respective sets of PHYs of respective downstream PPs, each downstream PP having a connection to a respective one of at least two endpoints 315 and 320.
[0026] FIG. 3 also includes an inter-die data interface (D2D) between the first circuit die 305 and the second circuit die 310. The D2D interface is configured to establish retimer data flows between the upstream PPs and downstream PPs of the first and second circuit dies via adaptation layer ports on each circuit die according to an adaptation layer protocol. The adaptation layer protocol may be configured to format raw data received on the PHYs of a pseudo-port of one type (upstream/downstream) for transmission over the D2D interface to PHYs of the pseudo-ports of the opposite type (downstream/upstream). A D2D interface is described in more detail below with respect to FIGs. 7 and 8 that utilizes multiple flows of an orthogonal differential vector signaling code (ODVS). It should be noted that other D2D interfaces, such as the Universal Chiplet Interconnect Express (UCIe) interface, may be utilized as well. Each of the first and second circuit dies further includes lane routing logic 600 configured to map at least one of the sets of upstream PPs and a corresponding set of downstream PPs to respective adaptation layer ports on the first and second circuit dies according to the adaptation layer protocol. The apparatus further includes a processor, e.g., a CPU core, on one of the first and second circuit dies for configuring the lane routing logic in both the first and second circuit dies. In the multi-die ICM 300, one of the first and second circuit dies is a leader circuit die, and while both circuit dies may include a processor on the circuit die, only the processor on the leader circuit die is active. The processor on the leader circuit die may configure the lane routing logic in the follower circuit die via a tile-to-tile serial peripheral interface (SPI) described in more detail below in the descriptions of FIGs. 9 and 10.
[0027] FIG. 3 includes a Board Management Controller (BMC) 325. BMCs may be included on, e.g., motherboards to monitor the state of components and hardware devices on the motherboard utilizing sensors, and to communicate the status of such devices, e.g., to the root complex. BMCs may be employed in, e.g., server room/data center applications and may be remotely managed by administrators to access information about the overall system. Some monitoring functions of a BMC include temperature, humidity, power-supply voltage, fan speeds, communications parameters, and operating system functions. The BMC may notify the administrator if any of the parameters exceed a threshold and the administrator may take action. In some embodiments, the BMC may be preconfigured to take certain actions in the event that a parameter exceeds a threshold, such as (but not limited to) executing a sequence to switch to a redundant endpoint in the event of a failure in the primary endpoint. In some embodiments, the BMC 325 monitors the status of the PCIe links between the root complexes 302/304 and the endpoints 315/320.
In such embodiments, monitoring the status of the PCIe link includes bit error rate measurements for the upstream and downstream data paths. Such measurements may be useful to monitor the overall status of the PCIe links and initiate link retraining sequences.
[0028] In addition to such monitoring, BMC 325 may be configured to manage the multiple root complexes in FIG. 3. Specifically, endpoints 315 and 320 may correspond to shareable resources for expensive functions such as artificial intelligence (AI), shareable computer-readable mediums such as hard-disk drives (HDDs) or solid state drives (SSDs), and network interface cards (NICs), amongst other endpoint devices. In such embodiments, the BMC 325 may coordinate usage of the endpoints by the root complex devices, i.e., so that both root complex devices do not establish connections with the same endpoint device at the same time. In some embodiments, the BMC may utilize credit-based techniques to share the multiple endpoints between the multiple root complex devices.
[0029] The BMC may be configured to provide instructions to the CPU core in the leader tile of the ICM 300. Such instructions may be provided, e.g., over an SMBus connection, or various other point-to-point connections. The instructions may be associated with a root complex-to-endpoint mapping, and the CPU of the leader tile may configure the lane routing logic on the leader tile as well as the follower tile to map the upstream pseudo-ports to the downstream pseudo-ports associated with the mapping instruction issued by the BMC. In some embodiments, configuring the lane routing logic comprises modifying configuration register space in both circuit dies, where the configuration register space includes control signal values provided as selection signals to the multiplexing devices in the lane routing logic. In some embodiments, as described below, the logic lanes of the upstream pseudo-ports have static mapping configurations to the adaptation layer ports. For example, the upstream pseudo-port PHY1 in FIG. 6 may be statically mapped to adaptation layer port 1 in the Tx-portion of the adaptation layer on the leader tile. The downstream pseudo-ports on the follower tile may be selectively connected to any one of the adaptation layer ports 0-7, depending on the root complex-to-endpoint mapping. In some embodiments, the downstream pseudo-ports may be statically mapped to the adaptation layer ports while the upstream pseudo-ports are configurable to be connected to any adaptation layer port, and vice versa.
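To make the register-level flow concrete, here is a hedged Python sketch of how such a mapping instruction might be applied, assuming the static upstream-to-adaptation-layer-port mapping just described; MUX_SEL_BASE, the register layout, and write_follower_reg are hypothetical names, not the actual register map.

```python
# Hypothetical sketch of turning a BMC-issued root-complex-to-endpoint
# mapping into follower-tile mux writes. Upstream pseudo-ports are
# assumed statically mapped to adaptation layer (AL) TX ports of the
# same index; only the follower's downstream selects are rewritten.

MUX_SEL_BASE = 0x1000  # assumed base of downstream mux select registers

def apply_mapping(write_follower_reg, rc_to_ep):
    # rc_to_ep: {upstream_pp: downstream_pp}, e.g. {0: 1} routes RC0 -> EP1
    for upstream_pp, downstream_pp in rc_to_ep.items():
        al_port = upstream_pp  # static upstream-PP-to-AL-port mapping
        # Point the downstream pseudo-port's lanes at that AL RX port.
        write_follower_reg(MUX_SEL_BASE + downstream_pp, al_port)

writes = []
apply_mapping(lambda addr, val: writes.append((hex(addr), val)),
              rc_to_ep={0: 1, 1: 0})
print(writes)  # [('0x1001', 0), ('0x1000', 1)]
```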
[0030] FIG. 4 is a data flow diagram that may be used for the PCIe data link to the first endpoint 315, in accordance with some embodiments. As shown, serial data is received at the PHY in the upstream pseudo-port, which includes a deserializer configured to convert the serial data stream into, e.g., 32-bit deserialized lane-specific data words. The data words are routed via the lane routing logic to retimer core logic. In some embodiments, the core logic includes a PCS decoding block configured to perform, e.g., 8b10b or 128b/130b decoding prior to the data being stored in a retimer FIFO. The retimer FIFO includes lane deskewing and rate adaptation functionalities across multiple lanes within a given circuit die as well as between lanes across multiple different circuit dies. The lane-specific data words are read from the retimer FIFO and transmitted on the downstream serial data transceivers via the transmitter in the PHY of the downstream pseudo-port.
[0031] FIG. 5 is a data flow diagram that may be used for the PCIe data link to the second endpoint 320 using the D2D interface. Specifically, FIG. 5 illustrates data received over a set of serial data transceivers and provided to a second circuit die via the inter-die data interface (e.g., the adaptation layer and inter-die transmitter). In FIG. 5, data is received at a PHY of a first circuit die. The data is deserialized and routed using lane routing logic to the adaptation layer on the first circuit die, which formats the raw data for transmission using the D2D interface. The data is received at the adaptation layer on the second circuit die, which performs the reciprocal formatting to provide the data to the destination lanes on the second circuit die. The data is provided to the RPCS logic to perform rate adaptation and lane-to-lane deskew before being output on the serial data transceiver PHYs of the second circuit die to the endpoint. A similar data path exists in the reverse direction from the endpoint to the root complex.
[0032] In FIGs. 4 and 5, RPCS logic is shown, which may include, e.g., the 8b/10b encoding/decoding functions of PCIe generations 1 and 2 and the 128b/130b encoding/decoding functions of PCIe generations 3-5. Embodiments described herein further contemplate PCIe generation 6, which utilizes a flow control unit (FLIT) scheme, and thus no 8b/10b or 128b/130b encoding is implemented. In such embodiments, the encoding/decoding functionalities may be omitted, while additional functionalities specific to PCIe 6, such as FEC decoding (either partial or full), are included as logic in the data path. Some functionalities of the retimer core logic are shared, such as lane-to-lane deskewing and rate adaptation in the FIFO.
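The generation-dependent selection reduces to a simple dispatch. The following sketch is illustrative only, with enum and function names of our choosing; the generation-to-coding assignments follow the paragraph above.

```c
#include <stdio.h>

typedef enum { PCS_8B10B, PCS_128B130B, PCS_FLIT_FEC } pcs_mode_t;

/* Select the data path coding per PCIe generation, per paragraph [0032]. */
static pcs_mode_t pcs_mode_for_gen(int pcie_gen)
{
    if (pcie_gen <= 2) return PCS_8B10B;     /* Gen 1-2: 8b/10b           */
    if (pcie_gen <= 5) return PCS_128B130B;  /* Gen 3-5: 128b/130b        */
    return PCS_FLIT_FEC;                     /* Gen 6: FLIT framing + FEC */
}

int main(void)
{
    for (int gen = 1; gen <= 6; gen++)
        printf("Gen %d -> mode %d\n", gen, pcs_mode_for_gen(gen));
    return 0;
}
```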
[0033] FIG. 6 is a block diagram of lane routing logic 600 in a retimer circuit die of an ICM, in accordance with some embodiments. FIG. 6 includes a block diagram on the left and various lane routing configurations on the right. In the top lane routing configuration 605, data is fed in through a deserializer, passes through the core routing logic, and is output at the bottom via the serializer of the same PHY. In the middle diagram 610, the data is fed into one port, processed in the core routing logic, and fed out at the opposite PHY on the bottom. Finally, in the bottom drawing 615, all data is fed into the PHYs at the top side of one PCIe retimer circuit and directly forwarded to the high-speed die-to-die interconnect; the data is fed through the core lane routing logic to the PHYs on the other PCIe retimer die. In all such scenarios, there are data paths in the opposite direction as well.
[0034] On the left side of FIG. 6, a sketch of the lane routing logic is shown. The serial data transceiver PHYs are numbered from 0 to 7 and include receiver deserializers (DES) and transmitter serializers (SER). The top lane (PHYs #0 and #4) illustrates the three different data paths matching the data paths shown on the right. Data path 605 on the right corresponds to data coming in on PHY #0 of the PCIe retimer circuit and leaving on the same PHY #0 on the left-hand side of FIG. 6. Path 610 shows a feed-through path where data received on PHY #0 passes through to PHY #4, as shown on the left-hand side of FIG. 6. Finally, path 615 indicates that all received data is directly forwarded to the adaptation layer to be transmitted over the inter-die data interface. On the second PCIe retimer, data from the inter-die data interface is forwarded to the core routing logic, where it is processed and output on the attached PHY.
[0035] The second lane (PHYs #1 and #5) illustrates the multiplexing capabilities. Each core-logic/transmitter path can receive data from any one of the eight lanes. Additionally, data can be obtained from the adaptation layer ports of the D2D data interface. The other lanes (PHY #0 with #4, PHY #2 with #6, and PHY #3 with #7) have the same switching capabilities. On the bottom, the multiplexing for the adaptation layer ports is shown. As shown, any input PHY may be selected as the input for a given adaptation layer port. The adaptation layer ports may have fixed mappings to the D2D flows, as described below with respect to FIG. 7. Thus, some embodiments may mirror data by selecting the same received PHY data for multiple adaptation layer physical ports.
[0036] Switching a data path in the routing logic includes switching the 32-bit received data bus carrying the deserialized lane-specific data words, the accompanying data enable lines, the recovered clock, and the corresponding reset. When a lane is routed to the adaptation layer, a clock domain crossing occurs by reading the deserialized data into FIFOs using the recovered clock signal. The adaptation layer handles clocking, e.g., for outputting 150-bit D2D words to the transmitters in each D2D data flow as well as the conversion to the 25 GBd clock domain for the 5b6w interface. It is important to note that only raw data is multiplexed; the received data is not processed in any way. The Raw MUX logic is statically configured via configuration bits, and the switching itself happens asynchronously. If the Raw MUX settings are changed during mission mode, invalid data and glitches on the clock lines are likely. Thus, the multiplexing logic is set up during reset.
[0037] In some embodiments, each circuit die includes lane routing logic such as the Raw MUX for lane routing between and within circuit dies. In such an embodiment, a primary circuit die, also referred to as a "leader", may perform the configuration of the Raw MUX in each circuit die, e.g., by writing to the configuration registers associated with the Raw MUX. FIGs. 9 and 10 illustrate such tile-to-tile communications. FIG. 9 provides a schematic of the configuration of the T2T SPI bus in the four-tile case. This specific number of tiles is not limiting, as the principles described herein can be extended to an N-tile retimer having one leader tile and N-1 follower tiles, N ≥ 2.
[0038] The T2T SPI leader 985 includes a serial clock line SCK that carries a serial clock signal generated by T2T SPI leader 985. The SCK signal is received by all T2T SPI followers and is used to coordinate reading and writing of data over the T2T SPI bus.
[0039] T2T SPI leader 985 also includes a MOSI line (Leader Out Follower In) and a MISO line (Leader In Follower Out). The MOSI line is used to transmit data from the leader to the follower, i.e., as part of a write operation. The MISO line is used to transmit data from the follower to the leader, i.e., as part of a read operation.
[0040] T2T SPI leader 985 further includes an FS line (Follower Select). This is used to signal which follower is to participate in the current operation of the bus, that is, which follower the data or a command on the bus is intended for. For convenience, a single wire is shown for the follower select line in FIG. 9, but in practice one wire can be present for each follower, i.e., three separate follower select wires in the case of FIG. 9.
[0041] T2T SPI followers 975a, 975b and 975c are each also coupled to all of the lines discussed above to enable two-way communication between the T2T leader and follower. In this manner, communication between tiles is achieved.
[0042] FIG. 10 shows the complete signal path between CPU core 900 and each PHY on the various tiles in the multi-chip module.
[0043] CPU core 900 is connected to PHYs 970 on the leader tile via leader tile APB interconnect 925 and can thus communicate with PHYs 970 via APB interconnect 925. CPU core 900 is also connected to T2T SPI leader 985 via leader tile APB interconnect 925. T2T SPI leader 985 is part of the T2T SPI bus that enables CPU core 900 to communicate with other tiles.
[0044] As shown in FIG. 10, each follower tile includes a respective T2T SPI follower 975a, 975b, 975c. Each of these SPI followers is coupled to T2T SPI leader 985 to enable signaling between tiles.
[0045] Each SPI follower 975a, 975b, 975c is coupled to respective PHYs 970a, 970b, 970c via respective follower tile APB interconnects 926, 927, 928. Each SPI follower 975a, 975b, 975c is leader on the respective APB interconnect 926, 927, 928. This enables each SPI follower to access all registers that are located on the tile that the SPI follower is also located on.
[0046] Communication between tiles thus makes use of two distinct buses and protocols. The SPI protocol does not support addressing, but the APB protocol does. Part of the data put onto the T2T SPI bus by CPU core 900 is APB address information, to enable the local APB interconnect on each follower tile to route messages to the intended recipient PHY.
[0047] Each PHY is assigned a unique APB address or APB address range so that it is possible for CPU core 900 to write to and/or read from one specific PHY on any tile. From the perspective of the CPU core 900, the entire multi-tile module has a single address space that includes separate regions for each PHY.
[0048] Assuming for the sake of illustration 24-bit APB addresses and a 32-bit data word size, control information put onto the SPI bus can be of the following format. This is referred to herein as a ‘control packet’.
bit:  31 30 29 28 27 | 26 25 24 | 23 ........................... 0
       r  r  r  r  r |  s  s  s |  a  (24 APB address bits)  ... a
[0049] Bits 0-23 are address bits ('a'), bits 24, 25 and 26 are follower select bits ('s'), and bits 27-31 are reserved bits ('r'). In this particular case there are three follower select bits because there are three follower tiles (and hence three T2T SPI followers) in this example. The reserved bits provide space for additional follower select bits; in this case, up to eight follower select bits can be provided, supporting up to eight follower tiles. The principles established here can be extended to any number of follower tiles by increasing the word size.
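The packing and unpacking of this layout can be expressed directly in C. The helper names below are ours; the field positions follow paragraph [0049] exactly.

```c
#include <stdint.h>
#include <stdio.h>

/* Control packet layout from paragraph [0049]: bits 0-23 APB address
 * ('a'), bits 24-26 follower select ('s'), bits 27-31 reserved ('r'). */
#define CP_ADDR_MASK 0x00FFFFFFu
#define CP_FS_SHIFT  24
#define CP_FS_MASK   0x07u

static uint32_t cp_pack(uint32_t apb_addr, uint32_t fs_bits)
{
    return (apb_addr & CP_ADDR_MASK) | ((fs_bits & CP_FS_MASK) << CP_FS_SHIFT);
}

static uint32_t cp_addr(uint32_t pkt) { return pkt & CP_ADDR_MASK; }
static uint32_t cp_fs(uint32_t pkt)   { return (pkt >> CP_FS_SHIFT) & CP_FS_MASK; }

int main(void)
{
    /* Select follower 2 (bit 1 of the FS field) and a sample address. */
    uint32_t pkt = cp_pack(0x004100u, 1u << 1);
    printf("packet=0x%08x addr=0x%06x fs=0x%x\n", pkt, cp_addr(pkt), cp_fs(pkt));
    return 0;
}
```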
[0050] The address bits form an APB address. The T2T-SPI followers are each configured as bus leader on their respective local APB interconnects, enabling each T2T-SPI follower to instruct its respective APB interconnect to perform a write or read operation to one of the respective PHYs the APB bus is coupled to. In some cases the address data can be omitted because the T2T-SPI bus can auto-increment addresses such that it already knows which address to write data to or read data from. The address data can be provided to the local APB interconnect after receipt of the control packet by the respective T2T SPI follower, enabling the local APB interconnect to route commands and data to the correct local PHY.
[0051] The follower select bits enable the control packet to specify which follower select line should be activated, i.e., which tile data is to be written to or read from. The T2T SPI bus uses the follower select bits to control the follower select lines FS1, FS2, FS3, where, e.g., a 0 indicates the corresponding follower select line should be low and a 1 indicates the corresponding follower select line should be high.
[0052] Follower select control information can alternatively be sent separately from the APB address data. The follower select information could be sent in-band as illustrated above, or another channel could be used, such as a System Management Bus (SMBus). The address data can be sent separately and before the data package is transmitted. In some cases the address data can be omitted because the T2T SPI bus can auto-increment addresses such that it already knows which address to write data to.
[0053] In either case, once the follower select and address information (if required) has been provided, data can be transmitted. The T2T SPI leader 985 can keep the follower select line(s) asserted until it receives new instructions regarding follower select line configuration. Similarly, the relevant APB interconnect(s) can continue writing to the address(es) specified (possibly by auto-incrementing) until new addressing information is provided. In this way, data and commands can be transmitted to, and received from, any PHY on any tile.
[0054] The APB address space is a global address space across all tiles. This means it is possible to address any register on any tile via this global address space. One particular configuration provides a base address for each tile that is given by a tile identifier multiplied by a constant. The tile identifier can be a tile number and the constant can be a base address for the leader tile. Other memory space constructions are possible. Each register on each tile has a unique address or address range assigned to it within this global address space. Each PHY of PHYs 970, 970a, 970b, 970c thus has a unique address or address range assigned to it.
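A minimal sketch of this address construction follows; the stride value is a hypothetical of ours (the application says only that the constant can be the leader tile's base address), chosen so the result stays within the 24-bit APB address space of the earlier example.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-tile stride: tile 0 (leader) occupies
 * [0, TILE_STRIDE), tile 1 occupies [TILE_STRIDE, 2*TILE_STRIDE), and
 * so on. With 24-bit APB addresses, a 0x100000 stride yields 16
 * non-overlapping tile regions. */
#define TILE_STRIDE 0x100000u

static uint32_t global_reg_addr(uint32_t tile_id, uint32_t local_offset)
{
    return tile_id * TILE_STRIDE + local_offset;
}

int main(void)
{
    /* Same local register, addressed on the leader (0) and follower 2. */
    printf("tile0: 0x%06x  tile2: 0x%06x\n",
           global_reg_addr(0, 0x40), global_reg_addr(2, 0x40));
    return 0;
}
```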
[0055] The CPU core on the leader tile may coordinate the lane switching circuits in both tiles. The CPU core on the follower tile may be in a low power state. As shown, an SPI communications bus between the two tiles may be used to configure the switching circuit in the follower tile to select between the first and second sets of downstream serial data transceiver ports. In some embodiments, a die-to-die (D2D) interface may be present and used to configure lane routing between the leader and follower tiles; i.e., serial data streams received on upstream ports of the leader tile may be routed to downstream ports of the follower tile and vice versa. Such a D2D interface may also be configured to carry configuration information as sideband information from the leader tile to the follower tile, e.g., to configure the configuration registers of the follower tile. In another embodiment, the configuration of the raw crossbar MUX may be performed via a system management bus, which may be further connected to the root complex. In some embodiments, a virtual channel between the root complex and the retimer chip may be used for configuration purposes. In such embodiments, vendor-defined messages (VDMs) may be present in particular vendor-defined packet fields of a PCIe data transmission. Such VDMs may be detected, extracted, and provided to the CPU of the leader circuit die using, e.g., an interrupt protocol. While FIG. 7 includes a single follower tile, it should be noted that additional follower tiles may be included, in some embodiments up to three follower tiles. In such a scenario, each follower tile may have a specific tile ID, and configuration register write commands can be assigned to certain tile IDs.
[0056] In some embodiments, the leader tile may initialize the configuration registers of the Raw MUX of the follower tile such that the RX adaptation layer ports are statically mapped to downstream ports to the redundant endpoint. In such an embodiment, the leader tile can switch the routing of the deserialized lane-specific data words between (i) downstream ports on the same die to the primary endpoint and (ii) the adaptation layer to be routed via the D2D interface.
[0057] FIG. 7 is a block diagram of an inter-die data interface (also referred to herein as a high-speed die-to-die (D2D) interconnect, "D2D link", and the like), in accordance with some embodiments. As shown, the D2D link utilizes eight high-speed die-to-die data flows, four in each direction, each data flow operating at a rate of 25 GBd and transmitting 5 bits over 6 wires for a total throughput of 125 Gbps. Furthermore, the interface includes two differential clock lanes operating at 6.25 GHz. It should be noted that interconnects having alternative sizes, throughputs, and/or encoding methods may be utilized as well.
[0058] Each data flow has a raw bandwidth of up to 125 Gbps without using forward error correction (FEC). With the FEC enabled, the bandwidth is 125 Gbps * 150/160 = 117.1875 Gbps. In some embodiments, the PCIe retimer operates the high-speed die-to-die interconnect using low-latency FEC and scrambling. In this configuration, 150 bits of data are transmitted each clock period for each data flow. The clock frequency may depend on the link speed. At 125 Gbps, the core clock is 125 Gbps/(5*32) = 781.25 MHz. The 150 bits of data sent at one end of the link are aligned at the receiving end, i.e., TX bit0 is received as RX bit0. The 150 bits of data in a clock cycle are referred to as a 'word'.
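These figures can be checked with a few lines of arithmetic; the program below merely reproduces the calculations stated in this paragraph.

```c
#include <stdio.h>

int main(void)
{
    /* 25 GBd per data flow, 5 bits per symbol period */
    double raw_gbps = 25.0 * 5.0;                 /* 125 Gbps per data flow */
    double fec_gbps = raw_gbps * 150.0 / 160.0;   /* 117.1875 Gbps with FEC */
    double core_mhz = 125e9 / (5.0 * 32.0) / 1e6; /* 781.25 MHz core clock  */

    printf("raw: %.4f Gbps, with FEC: %.4f Gbps, core clock: %.2f MHz\n",
           raw_gbps, fec_gbps, core_mhz);
    return 0;
}
```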
[0059] The inter-die data interface is operated using the same 100 MHz reference clock as the PHYs. In some embodiments, the interface is configured through the APB interface with an 8-bit wide data bus. In some embodiments, the interface may be configured to operate at a lower speed to reduce power. Furthermore, the number of enabled TX/RX data flows may be adjusted depending on the amount of bandwidth required for the communication.
[0060] FIG. 8 is a block diagram of an adaptation layer (AL) for an inter-die data interface, in accordance with some embodiments. The Adaptation Layer formats the payload sent and received over the high-speed die-to-die interconnect. As shown in FIG. 8, the Adaptation Layer supports the following types of payload:
1) Raw SERDES RX data (up to eight SERDES).
2) Frames/packets from link controllers (up to eight active interfaces) with support for flow control.
3) Indirect register-write and -read commands performed through the APB bus.
[0061] In the embodiment of FIG. 3, the retimers 305 and 310 may utilize the retimer data path shown in FIG. 5. In FIG. 5, data is routed over the D2D interface using an adaptation layer. In such an embodiment, raw encoded data is sent over the D2D interface using the raw interface to minimize latency. The frame mode may be used when the inbound traffic is terminated using a link controller, described in more detail below.
Raw Data Format
[0062] The eight raw SERDES RX data interfaces are served in parallel. The eight frame interfaces may be served round-robin or in parallel depending on the protocol. The high-speed link is statically set up to either transmit raw SERDES RX data or frames of data. The indirect register accesses may be interleaved in both of the above traffic types.
[0063] As shown in FIG. 8, the raw SERDES RX data flow collects two 32-bit words of data from a SERDES over two consecutive receive clock cycles and writes the combined 64 bits of data into an asynchronous FIFO. It is in the asynchronous FIFOs that the clock domain changes from the recovered clock provided with the data by the lane routing logic to the adaptation layer clock domain. The read data from the asynchronous FIFO is sent on a specific data flow of the high-speed die-to-die link. The raw data from two RX SERDES asynchronous FIFOs are combined and sent on the same specific data flow of the high-speed link. The adaptation layer provides a clock to read from the asynchronous FIFOs as well as a clock to funnel the 150-bit (or 160-bit in the case of FEC) D2D words to the 25 GBd clock used for the transmitters in the D2D data flows.
[0064] The raw data format (i.e., the non-frame based protocol) is a format used to transfer raw 32-bit sets of SERDES data within each data flow clock cycle. The non-frame based protocol may be utilized to transport raw data over the D2D interface while the multi-chip module is operating in a retimer mode of operation. The non-frame based protocol word is as follows:
bit:     149     | 148 ............................ 0
      protocol   | payload
[0065] where the protocol bit is asserted 1'b1 for the non-frame based protocol. As shown below in Table 1, bits 148:0 of the payload field have a format of:
TABLE 1 (format of payload bits 148:0; the table is reproduced as images in the source and its detailed contents are not recoverable)
[0066] SERDES payload is a high priority payload type, register commands are medium priority, and future messages are low priority. The SERDES payload is always filled in a user data cycle starting with PAYLOAD0, followed by PAYLOAD1, etc. A register command is only inserted in the case that there are fewer than four SERDES payload data ready in the data flow cycle. A register command is only inserted in the PAYLOAD3 field. The register write address command is followed by a register write data command before a new register write address command is sent. A register read address command or register read data command may be inserted in between the register write address command and register write data command.
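The slot-filling rule can be sketched as follows. The struct layout, the 32-bit slot representation, and the names are illustrative assumptions of ours; the priority behavior follows paragraph [0066].

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SLOTS 4  /* PAYLOAD0..PAYLOAD3 */

typedef struct {
    uint32_t slot[NUM_SLOTS];
    bool     valid[NUM_SLOTS];
    bool     slot3_is_reg_cmd;
} payload_word_t;

/* SERDES payloads (high priority) fill PAYLOAD0, PAYLOAD1, ... in
 * order. A register command (medium priority) may occupy PAYLOAD3
 * only, and only when fewer than four SERDES payloads are ready. */
static payload_word_t build_payload(const uint32_t *serdes, int serdes_ready,
                                    bool reg_cmd_ready, uint32_t reg_cmd)
{
    payload_word_t w = {0};
    int n = serdes_ready < NUM_SLOTS ? serdes_ready : NUM_SLOTS;

    for (int i = 0; i < n; i++) {
        w.slot[i]  = serdes[i];
        w.valid[i] = true;
    }
    if (reg_cmd_ready && n < NUM_SLOTS) {
        w.slot[NUM_SLOTS - 1]  = reg_cmd;
        w.valid[NUM_SLOTS - 1] = true;
        w.slot3_is_reg_cmd     = true;
    }
    return w;
}

int main(void)
{
    uint32_t serdes[2] = { 0xAAAAAAAAu, 0xBBBBBBBBu };
    payload_word_t w = build_payload(serdes, 2, true, 0x00C0FFEEu);
    return w.slot3_is_reg_cmd ? 0 : 1; /* reg cmd fits: only 2 of 4 used */
}
```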
[0067] While the above description details a particular D2D interface as shown in FIGs. 7 and 8, it should be noted that other D2D interfaces may be utilized to convey PCIe traffic between circuit dies. In some embodiments, the D2D interface may refer to the Universal Chiplet Interconnect Express (UCIe) interface. UCIe includes several modes of operation, including a FLIT-aware mode of operation that includes a die-to-die adapter to implement, e.g., CXL/PCIe protocols. Further, UCIe includes a streaming protocol that offers generic modes of a user-defined protocol to transmit raw data. In the multiple-endpoint switching embodiment described with respect to FIG. 3, such a streaming protocol may be utilized to convey data between circuit dies in the retimer mode of operation. In some embodiments, a similar adaptation layer may be utilized that partitions traffic for two separate PCIe links over the UCIe connection, similar to the adaptation layer described above, which statically maps PHYs of each upstream pseudo-port to PHYs of a corresponding one of the downstream pseudo-ports using the particular D2D flows shown in FIG. 7.
Load Distribution: Non-Load Balancing Mode
[0068] Transmitting payload over the D2D link in load balancing mode or non-load balancing mode is configurable and depends on the protocol. All data flows operate in one or the other mode. Non-load balancing mode is used when the D2D link transmits PCS payload data (raw SERDES data) or non-PCS payload data in custom frame-based mode. Load balancing mode is used when transmitting non-PCS payload data in frame-based mode. Load balancing mode is described in more detail below.
[0069] In raw SERDES mode, the payload data from a fixed set of lanes is statically set up for transmission over a specific D2D data flow. The 'logic lanes' in this context correspond to the adaptation layer physical ports, i.e., the ports to which the PHYs are mapped via the raw MUX crossbar switch. To minimize further multiplexing logic, a fixed mapping of logic lanes to data flows may be used. In one example, the mapping for eight lanes of traffic from the adaptation layer physical ports to the four die-to-die data flows is given below:
Logic lanes 0-1 map to data flow 0
Logic lanes 2-3 map to data flow 1
Logic lanes 4-5 map to data flow 2
Logic lanes 6-7 map to data flow 3
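This fixed pairing reduces to an integer division; a one-line helper captures it (the function name is ours).

```c
/* Logic lanes 2k and 2k+1 share D2D data flow k, per the list above. */
static inline int d2d_flow_for_lane(int logic_lane)
{
    return logic_lane / 2;
}
```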
[0070] Such a mapping may also apply to non-SERDES payload data. The register commands and message payload are statically setup to use a specific data flow to minimize logic by only handling one command in one cycle. The messages payload may be configured to use a different data flow than for the register commands.
[0071] In custom frame-based mode, similar to raw SERDES mode, the lanes may be configured statically to the same specific D2D data flows given above. D2D link words are load distributed round-robin from the two frame interfaces per D2D data flow. Some embodiments may implement a minimum spacing between D2D link words for the same frame interface/port on the same data flow. In some embodiments, the minimum spacing may be four cycles.
[0072] Some embodiments may have programmability to run fixed TDM slots. In fixed TDM mode the transmitter constantly sends words for the four supported ports, e.g., Port#0, Port#1, Port#2, Port#3, Port#0, Port#1, etc. If a port does not have payload to send in a slot it sends an IDLE cycle. Some embodiments may also implement programmability for the number of ports in the TDM calendar. The register commands and message payload may also be statically setup to use a specific D2D data flow to minimize logic by only handling one command in one cycle, similar to the raw SERDES mode. The messages payload may be configured to use a different data flow than the register commands.
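The fixed TDM calendar amounts to a round-robin counter with an IDLE fallback. The sketch below is illustrative only; the payload sources are stubs of ours, not disclosed interfaces.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stub payload sources; a real implementation would read frame FIFOs. */
static bool     port_has_payload(int port) { return port % 2 == 0; }
static uint64_t port_next_word(int port)   { return 0x1000u + (uint64_t)port; }

/* One fixed-TDM transmit slot per paragraph [0072]: ports take turns
 * in a fixed calendar; a port with no payload sends an IDLE cycle. */
static void tdm_slot(int *next_port, int num_ports)
{
    int port = *next_port;
    *next_port = (*next_port + 1) % num_ports;

    if (port_has_payload(port))
        printf("slot: port %d sends word 0x%llx\n", port,
               (unsigned long long)port_next_word(port));
    else
        printf("slot: port %d sends IDLE\n", port);
}

int main(void)
{
    int next_port = 0;
    for (int i = 0; i < 8; i++)  /* two passes over a 4-port calendar */
        tdm_slot(&next_port, 4);
    return 0;
}
```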
APB Leader/Follower Interface
[0073] The D2D interface includes an APB follower interface and an APB leader interface. The APB follower interface is the interface to all the configuration registers of the adaptation layer, including configuration registers to set up the tile-to-tile (T2T) read/write transactions. The T2T transactions are indirect register read and write commands sent over the D2D link. The source of the T2T transactions is the adaptation layer on the leader tile. The destination of the T2T transactions is the adaptation layer on the follower tile, which translates the received T2T read/write commands to an APB read/write transaction. Both the APB follower and leader interfaces have command FIFOs, whereas only the APB leader interface has a read return FIFO. The number of entries in the two types of FIFOs can be independent; however, at least one embodiment configures them to be of equal size.
[0074] The APB leader interface executes the received T2T read/write commands on the APB in the follower tile. For read commands, the corresponding read return data is transmitted back to the leader tile on the D2D link. The command FIFO in the APB leader interface allows for a number of outstanding writes that may take some time to execute on the follower tile. Firmware guarantees that the command FIFO does not overrun. The fill level of the FIFO may be read in a register; however, firmware can guarantee no overrun occurs by adding delay between T2T write transactions, or by performing a read and waiting for the read data after having sent a maximum number of back-to-back T2T write transactions, where the maximum number is defined by the number of command FIFO entries minus one. The T2T read transaction is used to flush the command FIFO since commands do not overtake each other. The APB leader interface is idle on the leader tile, i.e., it never receives T2T transactions from the follower tile. The APB follower interface on the follower tile is used to access the adaptation registers, yet no T2T transactions are initiated from the follower tile.
[0075] FIG. 11 is a flowchart of a method 1100, in accordance with some embodiments. As shown, method 1100 includes receiving 1105, at a plurality of upstream serial data transceivers of a first circuit die of a multi-die integrated circuit module (ICM), a plurality of serial data lanes associated with a PCIe data link, and responsively generating respective deserialized lane-specific data words. The method further includes providing 1110 the deserialized lane-specific data words for transmission via a group of downstream serial data transceivers on the first circuit die of the multi-die ICM, the group of downstream serial data transceivers having a PCIe data link to a first endpoint. The method further includes rerouting 1115, responsive to a failure in the PCIe data link to the first endpoint, the deserialized lane-specific data words over an inter-die data interface using an inter-die adaptation layer protocol to a second circuit die of the multi-die ICM. The method further includes recovering 1120 the deserialized lane-specific data words at the second circuit die from the inter-die data interface. The method further includes transmitting 1125 the deserialized lane-specific data words via a second group of downstream serial data transceivers to a second endpoint via a second PCIe data link.
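As a closing illustration of the command FIFO discipline in paragraph [0074], the firmware sketch below bounds back-to-back T2T writes and flushes via a read. The FIFO depth and transport stubs are assumptions for the example; only the entries-minus-one bound and the flush-by-read mechanism come from the paragraph above.

```c
#include <stdint.h>
#include <stdio.h>

#define T2T_CMD_FIFO_ENTRIES 8  /* hypothetical command FIFO depth */

/* Stub transport hooks; real ones would drive the T2T SPI/D2D link. */
static void t2t_send_write(uint32_t addr, uint32_t data)
{
    printf("T2T write [0x%06x] <= 0x%08x\n", addr, data);
}
static uint32_t t2t_send_read_blocking(uint32_t addr)
{
    printf("T2T read  [0x%06x] (flushes command FIFO)\n", addr);
    return 0;
}

static int outstanding_writes;

/* Bound back-to-back writes to (entries - 1), then flush with a read:
 * a completed read implies earlier commands have drained, since
 * commands do not overtake each other (paragraph [0074]). */
static void t2t_write_safe(uint32_t addr, uint32_t data)
{
    if (outstanding_writes >= T2T_CMD_FIFO_ENTRIES - 1) {
        (void)t2t_send_read_blocking(addr);
        outstanding_writes = 0;
    }
    t2t_send_write(addr, data);
    outstanding_writes++;
}

int main(void)
{
    for (uint32_t i = 0; i < 10; i++)
        t2t_write_safe(0x100000u + 4u * i, i);
    return 0;
}
```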

Claims

We Claim:
1. An apparatus comprising: a plurality of upstream pseudo-ports (PPs) of a first circuit die, each upstream PP having a connection to a respective one of at least two root complex devices; a plurality of downstream PPs of a second circuit die, each downstream PP having a connection to a respective one of at least two endpoints; an inter-die data interface between the first and second circuit dies, the inter-die data interface configured to exchange retimer data flows between the upstream PPs and downstream PPs of the first and second circuit dies via adaptation layer ports on each circuit die according to an adaptation layer protocol; lane routing logic in the first and second circuit dies configured to map the retimer data flow between at least one of the sets of upstream PPs and a corresponding set of downstream PPs to respective adaptation layer ports on the first and second circuit dies according to the adaptation layer protocol; and a processor on one of the first and second circuit dies for configuring the lane routing logic in both the first and second circuit dies.
2. The apparatus of claim 1, wherein the inter-die data interface is a universal chiplet interconnect express (UCIe) interface.
3. The apparatus of claim 1, wherein each retimer data flow comprises one or more lanes, and wherein for each retimer data flow, the lane routing logic is configured to route, for each of the one or more lanes, (i) deserialized data, (ii) a receive clock signal, (iii) a reset signal, and (iv) a data enable signal to the respective adaptation layer ports.
4. The apparatus of claim 1, wherein the inter-die data interface comprises a plurality of die-to-die (D2D) data flows, and wherein the retimer data flows are provided to the plurality of D2D data flows based on the respective adaptation layer ports.
5. The apparatus of claim 4, wherein each D2D data flow comprises a respective clock domain crossing (CDC) buffer interfaced to up to one or more of the adaptation layer ports.
6. The apparatus of claim 4, wherein each D2D data flow exchanges the retimer data flows using an orthogonal differential vector signaling (ODVS) code.
7. The apparatus of claim 1, further comprising a board management controller configured to provide a control signal to the processor via a system management bus to configure the lane routing logic.
8. The apparatus of claim 1, wherein the processor configures the lane routing logic on both the first and second circuit dies via configuration registers.
9. The apparatus of claim 8, wherein the processor configures the lane routing logic in both the first and second circuit dies during a reset period.
10. The apparatus of claim 1, wherein the processor configures the lane routing logic on both the first and second circuit dies to map the retimer data flow between a first upstream PP and a first downstream PP and to map the retimer data flow between a second upstream PP and a second downstream PP, and wherein each retimer data flow is exchanged concurrently over the inter-die data interface.
11. A method comprising: receiving data at a set of physical layer transceivers (PHYs) of a first upstream pseudo-port (PP) on a first circuit die comprising a plurality of upstream PPs each having a respective connection to a respective root complex device of a plurality of root complex devices; routing the received data from the set of PHYs of the first upstream PP over an inter-die data interface to a corresponding set of PHYs in a first downstream PP on a second circuit die comprising a plurality of downstream PPs each having a respective connection to a respective endpoint of a plurality of endpoints; and configuring, using a processor in one of the first and second circuit dies, lane routing logic in both the first and the second circuit dies to map the set of PHYs of the first upstream PP and the set of PHYs of the first downstream PP to respective adaptation layer ports as part of an adaptation layer protocol.
12. The method of claim 11, wherein the received data is routed over the inter-die data interface using a universal chiplet interconnect express (UCIe) interface.
13. The method of claim 11, wherein the received data comprises (i) deserialized data words, (ii) a receive clock signal, (iii) a reset signal, and (iv) a data enable signal.
14. The method of claim 11, wherein the adaptation layer ports are statically mapped to a plurality of die-to-die transceivers.
15. The method of claim 14, wherein the die-to-die transceivers route the received data over the die-to-die interface using an orthogonal differential vector signaling (ODVS) code.
16. The method of claim 11, further comprising: receiving data at a set of PHYs of a second upstream PP on the first circuit die; routing the received data from the set of PHYs of the second upstream PP over the inter-die data interface to a corresponding set of PHYs in a second downstream PP on the second circuit die; and configuring the lane routing logic in both the first and the second circuit dies to map the set of PHYs of the second upstream PP and the set of PHYs of the second downstream PP to the adaptation layer ports.
17. The method of claim 11, wherein configuring the lane routing logic on both the first and second circuit dies comprises writing to configuration registers on both the first and second circuit dies.
18. The method of claim 17, wherein writing to the configuration registers on one of the first and second circuit dies is performed via a tile-to-tile interface.
19. The method of claim 11, further comprising receiving a control signal at the processor, the control signal associated with a lane routing logic configuration, the control signal issued by a board management controller.
20. The method of claim 11, wherein the lane routing logic in both the first and second circuit dies is configured during a reset period.