US20120079155A1 - Interleaved Memory Access from Multiple Requesters

Info

Publication number: US20120079155A1
Application number: US13/239,065
Authority: US
Inventors: Raguram Damodaran; Naveen Bhoria
Original assignee: Texas Instruments Inc
Current assignee: Texas Instruments Inc
Priority date: 2010-09-28
Filing date: 2011-09-21
Publication date: 2012-03-29
Also published as: US8532247B2; US8732416B2; US8607000B2; US9195610B2; US10713180B2; US20120198166A1; US8732398B2; US20150269090A1; US20120198163A1; US20120290756A1; US20120191915A1; US20120079204A1; US20120198192A1; US8598932B2; US9189331B2; US9075744B2; US8904115B2; US8707127B2; US20120079203A1; US9268708B2

Abstract

A shared memory system having multiple banks is coupled to a set of requesters. Separate arbitration and control logic is provided for each bank, such that each bank can be accessed individually. The separate arbitration logics individually arbitrate transaction requests targeted to each bank of the memory. Access is granted to each bank on each access cycle to a highest priority request for each bank, such that more than one transaction request may be granted access to the memory on a same access cycle. A wide transaction request that has a transaction width that is wider than a width of one bank is divided into a plurality of divided requests.

Description

CLAIM OF PRIORITY UNDER 35 U.S.C. 119(e)

The present application claims priority to and incorporates by reference U.S. Provisional Application No. 61/387,283, (attorney docket TI-69952PS) filed Sep. 28, 2010, entitled “Cache Controller Architecture.”

FIELD OF THE INVENTION

This invention generally relates to management of memory access by multiple requesters, and in particular to access to a shared memory resource in a system on a chip with multiple cores.

BACKGROUND OF THE INVENTION

System on Chip (SoC) is a concept that strives to integrate more and more functionality into a given device. This integration can take the form of both hardware and solution software. Performance gains are traditionally achieved by increased clock rates and more advanced process nodes. Many SoC designs pair a digital signal processor (DSP) with a reduced instruction set computing (RISC) processor to target specific applications. A more recent approach to increasing performance has been to create multi-core devices. In this scenario, management of competition for processing resources is typically resolved using a priority scheme. Each pending request to a shared resource must wait until a prior access is completed.

BRIEF DESCRIPTION OF THE DRAWINGS

Particular embodiments in accordance with the invention will now be described, by way of example only, and with reference to the accompanying drawings:

FIG. 1 is a functional block diagram of a system on chip (SoC) that includes an embodiment of the invention;

FIG. 2 is a more detailed block diagram of one processing module used in the SoC of FIG. 1;

FIGS. 3 and 4 illustrate configuration of the L1 and L2 caches;

FIG. 5 is a more detailed block diagram of one processing module used in the SoC of FIG. 1;

FIG. 6 is a block diagram illustrating a portion of the processing module of FIG. 5 in more detail;

FIG. 7 illustrates a priority value register;

FIG. 8 is a block diagram illustrating concurrent bank arbitration for the L1 data cache in the processor module of FIG. 5;

FIG. 9 is a timing diagram illustrating operation of concurrent bank arbitration;

FIG. 10 is a flow diagram illustrating operation of concurrent bank arbitration; and

FIG. 11 is a block diagram of a system that includes the SoC of FIG. 1.

Other features of the present embodiments will be apparent from the accompanying drawings and from the detailed description that follows.

DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION

Specific embodiments of the invention will now be described in detail with reference to the accompanying figures. Like elements in the various figures are denoted by like reference numerals for consistency. In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to one of ordinary skill in the art that the invention may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
High performance computing has taken on even greater importance with the advent of the Internet and cloud computing. To ensure the responsiveness of networks, online processing nodes and storage systems must have extremely robust processing capabilities and exceedingly fast data-throughput rates. Robotics, medical imaging systems, visual inspection systems, electronic test equipment, and high-performance wireless and communication systems, for example, must be able to process an extremely large volume of data with a high degree of precision.
A multi-core architecture that embodies an aspect of the present invention will be described herein. In a typical embodiment, a multi-core system is implemented as a single system on chip (SoC). As used herein, the term “core” refers to a processing module that may contain an instruction processor, such as a digital signal processor (DSP) or other type of microprocessor, along with one or more levels of cache that are tightly coupled to the processor.
A multi-level cache controller within a core module may process different types of transfer requests from multiple requestors that may be targeted to different resources. In a multi-core scenario, these transfers may be divided into two categories: 1) local core central processing unit (CPU) generated, and 2) external master generated. In an embodiment that will be described in more detail below, external master generated transactions that target a local static random access memory (SRAM) within a core module may be generated by a direct memory access (DMA) module. The DMA transactions may come from an internal DMA (IDMA) engine, or from a slave DMA (SDMA) interface that is servicing requests from another core CPU module within the SoC. CPU transactions and DMA transactions may both be targeted for a same resource, such as SRAM that may also be configured as a level 1 (L1) cache.
In order to improve access and to avoid deadlock situations, embodiments of the present invention may provide distributed arbitration to control access to shared resources by separate pipeline stages for CPU, DMA, and other transactions. These parallel pipelines interact only at the point where they require access to the same shared resource.
In an embodiment of the invention, a Level 1 data cache (L1D) SRAM memory is shared by multiple initiators such as: CPU1, CPU2, direct memory access (DMA), cache coherence operations (SNOOP), read allocate writes (L2W), and cache eviction (VICTIM) writeback logic. The width of the L1D cache SRAM is 256 bits, which is organized as eight banks providing a maximum possible bandwidth of 256 bits/cycle. The transaction size for each of the initiators can vary from 8 bits to 256 bits. All of these initiators may have an independent data stream to access L1D memory. Each of these initiators can have bank-stalls which need to be resolved based on an L1D arbitration scheme that may include factors such as: priorities, maxwait counters, and transaction type, for example.
An efficient scheme is provided to ensure maximum utilization of the 256 bits/cycle bandwidth of the L1D SRAM, while at the same time ensuring minimum latency for completion of each transaction and honoring arbitration schemes. To efficiently achieve maximum throughput with minimum latency, each of the transaction types is split into a set of transactions that each have a size of 32 bits, which is equal to the size of each bank of the L1D SRAM. Access to each bank is controlled by an independent arbitration finite state machine (FSM) that considers the following inputs, where bank enable signals with each request define the width of the request and the targeted bank(s) within the SRAM:

- L2 Write bank enable signals
- DMA access bank enable signals
- CPU1 bank enable signals
- CPU2 bank enable signals
- priority between CPU1 and CPU2
- priority between CPU and DMA

Each FSM may output the following information: address, data and byte enable data-path control signals, and information related to which transactions are currently stalled or granted access. Completion of any particular transaction is achieved when access is granted to each of the requested banks.
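By way of illustration only, the division of a request into 32-bit, per-bank requests may be sketched as follows. The sketch assumes that banks are interleaved on 32-bit word boundaries, so that the bank index is the word address modulo eight; this mapping, and the names `bank_enables` and `slice_request`, are hypothetical and are not taken from the embodiment.

```python
BANK_BITS = 32   # width of one bank, per the description
NUM_BANKS = 8    # 256-bit memory organized as eight banks


def bank_enables(byte_addr: int, size_bits: int) -> int:
    """Return an 8-bit mask in which bit b is set when bank b is targeted.

    Assumption (not from the source): bank = (byte_addr // 4) % 8.
    """
    if size_bits % 8 or not 8 <= size_bits <= NUM_BANKS * BANK_BITS:
        raise ValueError("size must be a whole number of bytes, 8..256 bits")
    bytes_per_bank = BANK_BITS // 8
    first_word = byte_addr // bytes_per_bank
    last_word = (byte_addr + size_bits // 8 - 1) // bytes_per_bank
    mask = 0
    for word in range(first_word, last_word + 1):
        mask |= 1 << (word % NUM_BANKS)
    return mask


def slice_request(byte_addr: int, size_bits: int) -> list:
    """List the banks that each receive one divided (at most 32-bit) request."""
    mask = bank_enables(byte_addr, size_bits)
    return [b for b in range(NUM_BANKS) if (mask >> b) & 1]
```

Under these assumptions a full 256-bit request enables all eight banks, while an 8-bit request enables a single bank and leaves the other seven free for other requesters in the same cycle.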
Since any given requestor could potentially block a resource for extended periods of time, a bandwidth management scheme may be implemented in some embodiments to provide fairness for all requestors. The bandwidth management scheme may be summarized as weighted-priority-driven bandwidth allocation. Each requestor (SDMA, IDMA, CPU, etc.) may be assigned a priority level on a per-transfer basis. In this embodiment, the programmable priority level has a single meaning throughout the system; there is a total of eight priority levels, where priority 0 is highest and priority 7 is lowest priority. When requests for a single resource contend, access is granted to the highest priority requestor. When the contention occurs for multiple successive cycles, a contention counter may guarantee that the lower priority requestor gets access to the resource every 1 out of n cycles, where n is programmable as a maximum wait (maxwait) value. In this embodiment, a priority level of ‘−1’ may be used to represent a transfer whose priority has been increased due to expiration of the contention counter.
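The weighted-priority scheme may be modeled, purely as a sketch, by the following code. It assumes that a requestor is elevated to priority −1 once it has lost maxwait consecutive contests, and that ties are broken in favor of the first contender listed; the class and function names are hypothetical, and the exact counter behavior of the hardware is not specified here.

```python
ELEVATED = -1  # the description uses -1 for a priority raised by counter expiry


class Requester:
    def __init__(self, name, priority, maxwait):
        self.name = name
        self.priority = priority  # 0 (highest) .. 7 (lowest)
        self.maxwait = maxwait    # contests it may lose before elevation (assumed)
        self.losses = 0           # contention counter

    def effective_priority(self):
        # Elevate once the contention counter has reached maxwait.
        return ELEVATED if self.losses >= self.maxwait else self.priority


def arbitrate(contenders):
    """Grant the contender with the lowest effective priority value."""
    winner = min(contenders, key=lambda r: r.effective_priority())
    for r in contenders:
        if r is winner:
            r.losses = 0   # winning clears the contention counter
        else:
            r.losses += 1  # every lost contest is counted
    return winner


cpu = Requester("cpu", priority=0, maxwait=16)
dma = Requester("dma", priority=7, maxwait=3)
grants = [arbitrate([cpu, dma]).name for _ in range(8)]
```

With these example values the low-priority DMA requestor loses three contests and then wins the fourth, so it is never starved even though the CPU requests on every cycle.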
In some embodiments, a provision is made to allow an application program that is being executed within the SoC to dynamically control bandwidth allocation to the shared resource. This may be done to optimize different tasks at different times, for example. The priority of requestors may be changed on the fly, and bandwidth problems may be fine tuned using the contention counters. The arbitration may be distributed across the resource controllers, which provides flexibility.
Other resources, such as buffers, configuration registers or register files which hold parameters that are required for processing these transactions may be either duplicated or made concurrently readable from multiple sources. Examples of duplicated or concurrently accessible resources may include, but are not limited to, the following: a memory protection attributes table and a snoop tag status bits register file. This avoids any contention between CPU and DMA.
FIG. 1 is a functional block diagram of a system on chip (SoC) 100 that includes an embodiment of the invention. System 100 is a multi-core SoC that includes a set of processor modules 110 that each includes a processor core, level one (L1) data and instruction caches, and a level two (L2) cache. In this embodiment, there are eight processor modules 110; however other embodiments may have fewer or greater number of processor modules. In this embodiment, each processor core is a digital signal processor (DSP); however, in other embodiments other types of processor cores may be used. A packet-based fabric 120 provides high-speed non-blocking channels that deliver as much as 2 terabits per second of on-chip throughput. Fabric 120 interconnects with memory subsystem 130 to provide an extensive two-layer memory structure in which data flows freely and effectively between processor modules 110, as will be described in more detail below. An example of SoC 100 is embodied in an SoC from Texas Instruments, and is described in more detail in “TMS320C6678—Multi-core Fixed and Floating-Point Signal Processor Data Manual”, SPRS691, November 2010, which is incorporated by reference herein.
External link 122 provides direct chip-to-chip connectivity for local devices, and is also integral to the internal processing architecture of SoC 100. External link 122 is a fast and efficient interface with low protocol overhead and high throughput, running at an aggregate speed of 50 Gbps (four lanes at 12.5 Gbps each). Working in conjunction with a routing manager 140, link 122 transparently dispatches tasks to other local devices where they are executed as if they were being processed on local resources.
There are three levels of memory in the SoC 100. Each processor module 110 has its own level-1 program (L1P) and level-1 data (L1D) memory. Additionally, each module 110 has a local level-2 unified memory (L2). Each of the local memories can be independently configured as memory-mapped SRAM (static random access memory), cache or a combination of the two.
In addition, SoC 100 includes shared memory 130, comprising internal memory 133 and optional external memory 135 connected through the multi-core shared memory controller (MSMC) 132. MSMC 132 allows processor modules 110 to dynamically share the internal and external memories for both program and data. The MSMC internal RAM offers flexibility to programmers by allowing portions to be configured as shared level-2 RAM (SL2) or shared level-3 RAM (SL3). SL2 RAM is cacheable only within the local L1P and L1D caches, while SL3 is additionally cacheable in the local L2 caches.
External memory may be connected through the same memory controller 132 as the internal shared memory via external memory interface 134, rather than to chip system interconnect as has traditionally been done on embedded processor architectures, providing a fast path for software execution. In this embodiment, external memory may be treated as SL3 memory and therefore cacheable in L1 and L2.
SoC 100 may also include several co-processing accelerators that offload processing tasks from the processor cores in processor modules 110, thereby enabling sustained high application processing rates. SoC 100 may also contain an Ethernet media access controller (EMAC) network coprocessor block 150 that may include a packet accelerator 152 and a security accelerator 154 that work in tandem. The packet accelerator speeds the data flow throughout the core by transferring data to peripheral interfaces such as the Ethernet ports or Serial RapidIO (SRIO) without the involvement of any module 110's DSP processor. The security accelerator provides security processing for a number of popular encryption modes and algorithms, including IPSec, SCTP, SRTP, 3GPP, SSL/TLS and several others.
Multi-core manager 140 provides single-core simplicity to multi-core device SoC 100. Multi-core manager 140 provides hardware-assisted functional acceleration that utilizes a packet-based hardware subsystem. With an extensive series of more than 8,000 queues managed by queue manager 144 and a packet-aware DMA controller 142, it optimizes the packet-based communications of the on-chip cores by practically eliminating all copy operations.
The low latencies and zero interrupts ensured by multi-core manager 140, as well as its transparent operations, enable new and more effective programming models such as task dispatchers. Moreover, software development cycles may be shortened significantly by several features included in multi-core manager 140, such as dynamic software partitioning. Multi-core manager 140 provides “fire and forget” software tasking that may allow repetitive tasks to be defined only once, and thereafter be accessed automatically without additional coding efforts.
Two types of buses exist in SoC 100 as part of packet based switch fabric 120: data buses and configuration buses. Some peripherals have both a data bus and a configuration bus interface, while others only have one type of interface. Furthermore, the bus interface width and speed varies from peripheral to peripheral. Configuration buses are mainly used to access the register space of a peripheral and the data buses are used mainly for data transfers. However, in some cases, the configuration bus is also used to transfer data. Similarly, the data bus can also be used to access the register space of a peripheral. For example, DDR3 memory controller 134 registers are accessed through their data bus interface.
Processor modules 110, the enhanced direct memory access (EDMA) traffic controllers, and the various system peripherals can be classified into two categories: masters and slaves. Masters are capable of initiating read and write transfers in the system and do not rely on the EDMA for their data transfers. Slaves on the other hand rely on the EDMA to perform transfers to and from them. Examples of masters include the EDMA traffic controllers, serial rapid I/O (SRIO), and Ethernet media access controller 150. Examples of slaves include the serial peripheral interface (SPI), universal asynchronous receiver/transmitter (UART), and inter-integrated circuit (I2C) interface.
FIG. 2 is a more detailed block diagram of one processing module 110 used in the SoC of FIG. 1. As mentioned above, SoC 100 contains two switch fabrics that form the packet based fabric 120 through which masters and slaves communicate. A data switch fabric 224, known as the data switched central resource (SCR), is a high-throughput interconnect mainly used to move data across the system. The data SCR is further divided into two smaller SCRs. One connects very high speed masters to slaves via 256-bit data buses running at a DSP/2 frequency. The other connects masters to slaves via 128-bit data buses running at a DSP/3 frequency. Peripherals that match the native bus width of the SCR to which they are coupled can connect directly to the data SCR; other peripherals require a bridge.
A configuration switch fabric 225, also known as the configuration switched central resource (SCR), is mainly used to access peripheral registers. The configuration SCR connects each processor module 110 and masters on the data switch fabric to slaves via 32-bit configuration buses running at a DSP/3 frequency. As with the data SCR, some peripherals require the use of a bridge to interface to the configuration SCR.
Bridges perform a variety of functions:

- Conversion between configuration bus and data bus.
- Width conversion between peripheral bus width and SCR bus width.
- Frequency conversion between peripheral bus frequency and SCR bus frequency.

The priority level of all master peripheral traffic is defined at the boundary of switch fabric 120. User programmable priority registers are present to allow software configuration of the data traffic through the switch fabric. In this embodiment, a lower number means higher priority. For example: PRI=000b=urgent, PRI=111b=low.
All other masters provide their priority directly and do not need a default priority setting. Examples include the processor module 110, whose priorities are set through software in the control registers of unified memory controller (UMC) 216. All the Packet DMA based peripherals also have internal registers to define the priority level of their initiated transactions.
DSP processor core 112 includes eight functional units 214, two register files 215, and two data paths. The two general-purpose register files 215 (A and B) each contain 32 32-bit registers for a total of 64 registers. The general-purpose registers can be used for data or can be data address pointers. The data types supported include packed 8-bit data, packed 16-bit data, 32-bit data, 40-bit data, and 64-bit data. Multiplies also support 128-bit data. 40-bit-long or 64-bit-long values are stored in register pairs, with the 32 LSBs of data placed in an even register and the remaining 8 or 32 MSBs in the next upper register (which is always an odd-numbered register). 128-bit data values are stored in register quadruplets, with the 32 LSBs of data placed in a register that is a multiple of 4 and the remaining 96 MSBs in the next 3 upper registers.
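The register pair and register quadruplet storage described above may be illustrated by the following sketch. The register file is modeled as a plain list of 32 integers; the helper names are hypothetical, and the ordering of the upper three registers of a quadruplet (ascending significance) is an assumption, since the description states only that they hold the remaining 96 MSBs.

```python
MASK32 = 0xFFFFFFFF


def store_pair(regs, even_index, value):
    """Store a 40-bit or 64-bit value: 32 LSBs in the even register,
    remaining MSBs in the next (odd) register."""
    if even_index % 2:
        raise ValueError("a register pair must start at an even register")
    regs[even_index] = value & MASK32
    regs[even_index + 1] = (value >> 32) & MASK32


def load_pair(regs, even_index):
    """Reassemble the value held in an even/odd register pair."""
    return (regs[even_index + 1] << 32) | regs[even_index]


def store_quad(regs, base, value):
    """Store a 128-bit value in four registers starting at a multiple of 4.
    Assumed ordering: regs[base + i] holds bits 32*i .. 32*i + 31."""
    if base % 4:
        raise ValueError("a register quadruplet must start at a multiple of 4")
    for i in range(4):
        regs[base + i] = (value >> (32 * i)) & MASK32


regs_a = [0] * 32  # one general-purpose register file of 32 32-bit registers
store_pair(regs_a, 4, 0x1122334455667788)
```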
The eight functional units 214 (.M1, .L1, .D1, .S1, .M2, .L2, .D2, and .S2) are each capable of executing one instruction every clock cycle. The .M functional units perform all multiply operations. The .S and .L units perform a general set of arithmetic, logical, and branch functions. The .D units primarily load data from memory to the register file and store results from the register file into memory. Note that two CPU transaction requests to data cache controller DMC 218 may be active in parallel on parallel request buses 270, 271. Each .M unit can perform one of the following fixed-point operations each clock cycle: four 32×32 bit multiplies, sixteen 16×16 bit multiplies, four 16×32 bit multiplies, four 8×8 bit multiplies, four 8×8 bit multiplies with add operations, and four 16×16 multiplies with add/subtract capabilities. There is also support for Galois field multiplication for 8-bit and 32-bit data. Many communications algorithms such as FFTs and modems require complex multiplication. Each .M unit can perform one 16×16 bit complex multiply with or without rounding capabilities, two 16×16 bit complex multiplies with rounding capability, and a 32×32 bit complex multiply with rounding capability. The .M unit can also perform two 16×16 bit and one 32×32 bit complex multiply instructions that multiply a complex number with a complex conjugate of another number with rounding capability.
Communication signal processing also requires an extensive use of matrix operations. Each .M unit is capable of multiplying a [1×2] complex vector by a [2×2] complex matrix per cycle with or without rounding capability. A version may be embodied allowing multiplication of the conjugate of a [1×2] vector with a [2×2] complex matrix. Each .M unit also includes IEEE floating-point multiplication operations, which includes one single-precision multiply each cycle and one double-precision multiply every 4 cycles. There is also a mixed-precision multiply that allows multiplication of a single-precision value by a double-precision value and an operation allowing multiplication of two single-precision numbers resulting in a double-precision number. Each .M unit can also perform one of the following floating-point operations each clock cycle: one, two, or four single-precision multiplies or a complex single-precision multiply.
The .L and .S units support up to 64-bit operands. This allows for arithmetic, logical, and data packing instructions to allow parallel operations per cycle.
An MFENCE instruction is provided that will create a processor stall until the completion of all the processor-triggered memory transactions, including:

- Cache line fills
- Writes from L1D to L2 or from the processor module to MSMC and/or other system endpoints
- Victim write backs
- Block or global coherence operation
- Cache mode changes
- Outstanding XMC prefetch requests.

The MFENCE instruction is useful as a simple mechanism for programs to wait for these requests to reach their endpoint. It also provides ordering guarantees for writes arriving at a single endpoint via multiple paths, multiprocessor algorithms that depend on ordering, and manual coherence operations.
Each processor module 110 in this embodiment contains a 1024 KB level-2 memory (L2) 216, a 32 KB level-1 program memory (L1P) 217, and a 32 KB level-1 data memory (L1D) 218. The device also contains a 4096 KB multi-core shared memory (MSM) 132. All memory in SoC 100 has a unique location in the memory map.
The L1P and L1D cache can be reconfigured via software through the L1PMODE field of the L1P Configuration Register (L1PCFG) and the L1DMODE field of the L1D Configuration Register (L1DCFG) of each processor module 110 to be all SRAM, all cache memory, or various combinations as illustrated in FIG. 3, which illustrates an L1D configuration; L1P configuration is similar. L1D is a two-way set-associative cache, while L1P is a direct-mapped cache.
L2 memory can be configured as all SRAM, all 4-way set-associative cache, or a mix of the two, as illustrated in FIG. 4. The amount of L2 memory that is configured as cache is controlled through the L2MODE field of the L2 Configuration Register (L2CFG) of each processor module 110.
Global addresses are accessible to all masters in the system. In addition, local memory can be accessed directly by the associated processor through aliased addresses, where the eight MSBs are masked to zero. The aliasing is handled within each processor module 110 and allows for common code to be run unmodified on multiple cores. For example, address location 0x10800000 is the global base address for processor module 0's L2 memory. DSP Core 0 can access this location by either using 0x10800000 or 0x00800000. Any other master in SoC 100 must use 0x10800000 only. Conversely, 0x00800000 can be used by any of the cores as their own L2 base addresses.
Level 1 program (L1P) memory controller (PMC) 217 controls program cache memory 267 and includes memory protection and bandwidth management. Level 1 data (L1D) memory controller (DMC) 218 controls data cache memory 268 and includes memory protection and bandwidth management. Level 2 (L2) memory controller, unified memory controller (UMC) 216 controls L2 cache memory 266 and includes memory protection and bandwidth management. External memory controller (EMC) 219 includes Internal DMA (IDMA) and a slave DMA (SDMA) interface that is coupled to data switch fabric 224. The EMC is coupled to configuration switch fabric 225. Extended memory controller (XMC) is coupled to MSMC 132 and to double data rate 3 (DDR3) external memory controller 134. The XMC provides a lookahead prefetch engine for L2 cache 216/266.
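The aliasing arithmetic in the example above may be sketched as follows, assuming 32-bit addresses. The function names are hypothetical; only the masking of the eight MSBs and the two example addresses are taken from the description.

```python
ALIAS_MASK = 0x00FFFFFF  # clears the eight MSBs of a 32-bit address


def local_alias(global_addr: int) -> int:
    """Return the aliased (local) form of a global address."""
    return global_addr & ALIAS_MASK


def is_local_alias(addr: int) -> bool:
    """True when the eight MSBs of a 32-bit address are already zero."""
    return (addr >> 24) & 0xFF == 0
```

For instance, `local_alias(0x10800000)` yields `0x00800000`, the address that any core may use for its own L2 base.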
FIG. 5 is a more detailed block diagram of one processing module 110 used in the SoC of FIG. 1 that illustrates distributed bandwidth management. When multiple requestors contend for a single processor module 110 resource, the conflict is resolved by granting access to the highest priority requestor. The following four resources are managed by the bandwidth management control hardware 516-519:

- Level 1 Program (L1P) SRAM/Cache 217
- Level 1 Data (L1D) SRAM/Cache 218
- Level 2 (L2) SRAM/Cache 216
- EMC 219

The priority level for operations initiated within the processor module 110 is declared through registers within each processor module 110. These operations are:

- DSP-initiated transfers
- User-programmed cache coherency operations
- IDMA-initiated transfers

The priority level for operations initiated outside the processor modules 110 by system peripherals is declared through the Priority Allocation Register (PRI_ALLOC). System peripherals that are not associated with a field in PRI_ALLOC may have their own registers to program their priorities.

Distributed Arbitration

As described above, each core module 110 must control the dataflow between its internal resources, including L2 SRAM/Cache, L1P SRAM/Cache, L1D SRAM/Cache, and MMR (memory mapped register) Configuration Bus, and each of the potential requestors, which include external DMA initiated transfers received at the slave DMA (SDMA) interface, internal DMA (IDMA) initiated transfers, internal cache coherency operations, and CPU direct initiated transfers, which include: L1D initiated transfers such as load/store, and L1P initiated transfers such as program fetch.
FIG. 6 is a block diagram illustrating a portion of a processing module 110 in more detail. As illustrated in FIG. 5, there are various buses that interconnect UMC 216, PMC 217, DMC 218 and EMC 219. Each of these buses includes signal lines for a command portion and data portion of each transaction packet. Most of the buses also include signal lines to carry the priority value associated with each transaction command, such as: EMC to PMC priority signal 602; EMC to DMC priority signal 603; UMC to DMC priority signal 604; UMC to EMC priority signal 605; and EMC to UMC priority signal 606.
FIG. 7 illustrates an example of a set of programmable priority value registers 700 used in SoC 100. Most requesters in SoC 100 have a copy of a memory mapped programmable priority register similar to register 700 associated with them. Priority field 702 is a three-bit field that is used to specify a priority value of 0-7, where a value of 0 indicates highest priority and a value of 7 indicates lowest priority. Maxwait field 704 defines a maximum number of arbitration contests that a requester may lose before its priority value is elevated for one arbitration contest.
Referring again to FIG. 6, priority for PMC-UMC commands 610 and DMC to UMC commands 620 are each specified by priority registers 611, 621 that are similar to register 700, therefore a priority signal is not needed in the bus for those commands. Requests initiated by CPU 212 to program cache 217 and data cache 218 will cause transaction request commands 610, 620 when a respective cache miss occurs. UMC 216 will arbitrate between competing requests based on the priority value stored in the associated priority register 611, 621 using arbitration logic within bandwidth management logic 516. The winning request is then granted access to L2 cache RAM 266 if the requested data is present in L2 cache 266, as indicated by tags in UMC 216. The general operation of caches is known and does not need to be explained in further detail here.
If the requested data is not present in L2 cache 266, then another access request is generated and sent to shared L3 memory coupled to MSMC 132 via bus link 630(1). This request goes through XMC 570, as illustrated in FIG. 5. Each of the other core modules 110 also sends request commands to MSMC 132 via individual bus links 630(N). Arbitration logic within bandwidth management logic 632 uses a priority value for each request command sent on a priority signal with the request command, such as priority signal 631 that is part of link 630(1). However, the priority value that is provided on signal 631 may indicate an elevated priority if the winner of the arbitration contest in UMC 216 had to have its priority elevated in order to win the arbitration contest. In this manner, a requester that contends for access and has to wait until its assigned priority value is elevated in order to win an arbitration contest maintains its elevated priority when a cache miss, for example, forces it to contend in another arbitration contest.

Concurrent Bank Arbitration

FIG. 8 is a block diagram illustrating concurrent bank arbitration for the L1 data cache in data cache controller 218 of processor module 110. In this embodiment of the invention, a Level 1 data cache (L1D) SRAM memory 266 is shared by multiple initiators such as: CPU1 (.D1), CPU2 (.D2), direct memory access (DMA), cache coherence operations (SNOOP), read allocate writes (L2W), and cache eviction (VICTIM) writeback logic. The width of the L1D cache SRAM 266 is 256 bits, which is organized as eight banks 820-827 providing a maximum possible bandwidth of 256 bits/cycle. The transaction size for each of the initiators can vary from 8 bits to 256 bits. All of these initiators may have an independent data stream to access L1D memory. Each of these initiators can have bank-stalls which need to be resolved based on an L1D arbitration scheme that may include factors such as: priorities, maxwait counters, and transaction type, for example.
In order to improve access to the L1 data cache/SRAM and to avoid deadlock situations, separate pipelines 270, 271, 641 may be provided for CPU and DMA transactions. For the return data from L2 cache controller 216 and other acknowledgments back to the DMC requestor, separate return paths 640 are provided. Thus, each requestor essentially has a separate interface to the shared target resource 266. These parallel pipelines interact only at the point where they require access to SRAM 266. An arbitration scheme is provided that tries to maintain a fair bandwidth distribution between the various requesters trying to access SRAM 266.
Referring again to FIG. 5, slave DMA interface 560 receives transaction requests from external masters via the data switch fabric 224. Referring back to FIG. 1, these requests may be originated by another processor module 110, by packet DMA 142 or from a master that is external to the SoC via external link 122, for example. As explained above, L1P memory 267, L1D memory 268 and L2 memory 266 may each be configured as a cache, a shared memory or a combination. The address space of each memory is also mapped into the SoC global address space, therefore, transaction requests from masters external to processor module 110 may access any of these memory resources within processor module 110.
Referring again to FIG. 8, cache control logic 802 receives transaction requests from data units .D1 and .D2 via parallel buses 270, 271 respectively. Each request is checked against tags 805 to determine if the requested data item is available in L1 memory 266. If not, eviction/allocation logic 804 may initiate a request 872 to evict a victim location from cache memory 266 and send requests 875 to the L2 cache via link 640 to perform write back of dirty data and to request the missing data. Control logic within DMC 218 then initiates a request to L1 memory 266 when the requested missing data is returned from the L2 cache via link 640. Control logic within DMC 218 also initiates requests to L1 memory 266 in response to snoop requests or DMA requests received from the EMC via link 641.
An efficient scheme is provided to ensure maximum utilization of the 256 bits/cycle bandwidth of the L1D SRAM, while at the same time ensuring minimum latency for completion of each transaction and honoring arbitration schemes. To efficiently achieve maximum throughput with minimum latency, slicing and routing logic 806 splits each of the transaction types into a set of transactions that each have a size of 32 bits, which is equal to the size of each bank of the L1D SRAM. Access to each bank is controlled by arbitration logic 807 that includes independent finite state machines (FSM) 810-817. Each FSM considers the following inputs, where bank enable signals with each request define the width of the request and the target bank(s):

- L2 Write bank enable signals
- DMA access bank enable signals
- CPU1 bank enable signals
- CPU2 bank enable signals
- priority between CPU1 and CPU2
- priority between CPU and DMA
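
By way of illustration only, the following Python sketch suggests how bank enable signals might encode both the width and the target bank(s) of a request. The function name `bank_enables`, the constants, and the mapping of consecutive 32 bit words to consecutive banks with wrap-around are assumptions made for this sketch and are not taken from the described embodiment; byte enables for accesses narrower than one bank are not modeled.

```python
NUM_BANKS = 8         # eight banks, as in the embodiment of FIG. 8
BANK_WIDTH_BITS = 32  # each bank is 32 bits wide


def bank_enables(word_address, num_words):
    """Return an 8-bit mask with bit b set when bank b is targeted.

    Assumption: consecutive 32-bit words map to consecutive banks,
    with bank index = word address modulo NUM_BANKS.
    """
    if not 1 <= num_words <= NUM_BANKS:
        raise ValueError("a request may target between 1 and 8 banks")
    mask = 0
    for i in range(num_words):
        mask |= 1 << ((word_address + i) % NUM_BANKS)
    return mask


# A 256-bit aligned DMA transfer enables all eight banks.
print(format(bank_enables(0, 8), "08b"))  # 11111111
# Two double words (four words) starting at word address 4 enable banks 4-7.
print(format(bank_enables(4, 4), "08b"))  # 11110000
```

In this representation, the number of set bits conveys the width of the request in units of one bank, and the positions of the set bits identify which per-bank FSMs must each grant access before the request can complete.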

Each FSM may output the following information to request routing and control logic 808: address, data and byte enable data-path control signals, and information related to which transactions are currently stalled or granted access. Interconnect fabric 806 is implemented as a switch fabric with eight sets of multiplexers that allow data being routed to each 32 bit bank 820-827 to be selected individually in response to outputs from FSMs 810-817. In this manner, several different accesses may proceed in parallel, depending on the bank enable signals for each request. Completion of any particular transaction is achieved when access is granted to each of the requested banks.
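As a minimal sketch of the per-bank selection just described, the following Python fragment runs one independent arbitration contest for each of eight banks in a single access cycle. The function name `arbitrate_cycle`, the dictionary layout, and the convention that a lower priority value wins are assumptions made for illustration; tie-breaking between equal priorities and maxwait counters are deliberately left out.

```python
NUM_BANKS = 8


def arbitrate_cycle(pending, priority):
    """Run one independent arbitration contest per bank.

    pending:  dict mapping requester name -> bank enable mask still waiting
    priority: dict mapping requester name -> priority value
              (assumption: a lower value means a higher priority)
    Returns a list where entry b is the requester granted bank b, or None.
    """
    grants = [None] * NUM_BANKS
    for bank in range(NUM_BANKS):
        contenders = [r for r, mask in pending.items() if mask & (1 << bank)]
        if contenders:
            grants[bank] = min(contenders, key=lambda r: priority[r])
    return grants


pending = {"CPU1": 0b00001111, "DMA": 0b11111111}
priority = {"CPU1": 0, "DMA": 1}
print(arbitrate_cycle(pending, priority))
# ['CPU1', 'CPU1', 'CPU1', 'CPU1', 'DMA', 'DMA', 'DMA', 'DMA']
```

Each entry of the returned list plays the role of one multiplexer select: it names the requester whose data would be routed to that bank on this cycle, so two requesters can be served in the same cycle when their bank enables do not fully overlap.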
For example, assume the DMA requested a 256 bit aligned transfer to L1D while the CPU is also accessing L1D with 32 bit transfers. A portion of the 256 bit DMA transaction may be initiated to one or more banks on a first cycle while a CPU transaction is initiated on different banks, and then the rest of the DMA transaction may be initiated on a later cycle when the remaining banks are available to the DMA request. The DMA's 256 bit transfer is deemed complete when all eight banks have been accessed. This may require only one access cycle, or it may require two or more access cycles, depending on what other requests are pending.
FIG. 9 is a timing diagram illustrating operation of concurrent bank arbitration in L1D cache controller 218. In this example, the CPU and DMA priority registers have been programmed to give the CPU a higher priority than the DMA. The DMA maximum wait value has been programmed to be four cycles. As discussed above, these values may be changed dynamically under control of software being executed by SoC 100.
At a random point in time referred to as cycle 0, a 256 bit DMA transfer is pending that is divided into eight 32 bit requests (DMA 7-0) by bandwidth management logic 518. CPU data unit .D1 has also requested two double word (d-word) accesses (CPU1 DW 3-2, CPU1 DW 1-0) that are likewise divided into four 32 bit requests. CPU .D1 follows this with two more d-word accesses (CPU1 DW 7-6, CPU1 DW 5-4) on cycle 1. Since CPU priority is higher than DMA priority, .D1 wins the arbitration contest in FSMs 810-813 for banks 820-823 and CPU1 DW 3-2, CPU1 DW 1-0 are given access on cycle 0. However, banks 824-827 are not being requested by the CPU, so FSMs 814-817 award these banks to the DMA request; DMA requests 7-4 may therefore be awarded access on cycle 0.
For cycle 1, CPU1 DW 7-6 and CPU1 DW 5-4 are pending for banks 824-827 and are awarded access by FSMs 814-817. Meanwhile, banks 820-823 are now available and FSMs 810-813 award access to DMA request DMA 3-0. In this manner, the original 256 bit DMA access is completed in two cycles using banks that are not needed by the CPU.
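The two cycles described above can be reproduced with a small simulation. The following Python sketch is offered only as an illustration under stated assumptions: the function name `run`, the request tags, the mapping of word n to bank n, and the rule that a lower priority value wins are all hypothetical, and maxwait counters are not modeled.

```python
NUM_BANKS = 8


def run(arrivals, priority, cycles):
    """Simulate per-bank arbitration for a number of cycles.

    arrivals: dict cycle -> list of (requester, tag, bank_mask)
    priority: dict requester -> value (assumption: lower value wins)
    Returns dict tag -> cycle on which the whole request completed.
    """
    pending = []  # each entry is [requester, tag, remaining_bank_mask]
    done = {}
    for cycle in range(cycles):
        for requester, tag, mask in arrivals.get(cycle, []):
            pending.append([requester, tag, mask])
        for bank in range(NUM_BANKS):
            bit = 1 << bank
            contenders = [p for p in pending if p[2] & bit]
            if contenders:
                # Highest priority requester wins; the oldest entry wins a tie.
                winner = min(contenders, key=lambda p: priority[p[0]])
                winner[2] &= ~bit
        for p in pending:
            if p[2] == 0:
                done[p[1]] = cycle
        pending = [p for p in pending if p[2]]
    return done


arrivals = {
    0: [("DMA", "DMA 7-0", 0xFF), ("CPU1", "CPU1 DW 3-0", 0x0F)],
    1: [("CPU1", "CPU1 DW 7-4", 0xF0)],
}
print(run(arrivals, {"CPU1": 0, "DMA": 1}, cycles=2))
# {'CPU1 DW 3-0': 0, 'DMA 7-0': 1, 'CPU1 DW 7-4': 1}
```

Under these assumptions the DMA transfer finishes on cycle 1 even though it never outranks the CPU, because on each cycle it is granted exactly the banks the CPU is not using.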
Another 256 bit DMA access may be pending in cycle 2 along with a sequence of five single word requests to bank 820, word address 0 from CPU data unit .D1. Therefore, during cycle 2, CPU request CPU1 W 8 wins the arbitration contest with the DMA in lane 0. However, lanes 1-7 are not needed by CPU1 W 8, so FSMs 811-817 grant access to DMA 15-9.
On cycle 3, no other requests are pending; CPU1 W 9 is awarded access to bank 820 by FSM 810.
On cycle 4, CPU data unit .D2 requests access to four double words, beginning at word address 4. Therefore, FSM 810 awards bank 820 to CPU1 W 10, and FSMs 814-817 award banks 824-827 to CPU2 DW 3-2 and CPU2 DW 1-0.
On cycle 5, since data unit .D1 and .D2 both have the same CPU priority, the arbitration logic FSMs are designed to award access alternately. Therefore, FSM 810 awards bank 820 to CPU2 DW 4. FSMs 811-813 also award banks 821-823 to CPU2 DW 7-6 and 5.
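The alternating award between two requesters of equal priority can be sketched as a small state machine for a single bank. The class name `AlternatingArbiter`, its method names, and the choice to update the last-winner state on uncontested grants are assumptions made for this illustration and are not taken from the described embodiment.

```python
class AlternatingArbiter:
    """Arbiter for one bank shared by two equal-priority requesters."""

    def __init__(self, first="D1", second="D2"):
        self.names = (first, second)
        self.last_winner = second  # assumption: `first` wins the first tie

    def grant(self, requesting):
        """requesting: set of requester names that want the bank this cycle."""
        active = [n for n in self.names if n in requesting]
        if not active:
            return None
        if len(active) == 1:
            winner = active[0]
        elif self.last_winner == self.names[1]:
            # Both want the bank and the second requester won last time.
            winner = self.names[0]
        else:
            # Both want the bank and the first requester won last time.
            winner = self.names[1]
        self.last_winner = winner
        return winner


arb = AlternatingArbiter("D1", "D2")
cycles = [{"D1"}, {"D1", "D2"}, {"D1", "D2"}, {"D1", "D2"}]
print([arb.grant(c) for c in cycles])
# ['D1', 'D2', 'D1', 'D2']
```

With one such arbiter per bank, neither data unit can monopolize a contested bank; each loses at most one contest in a row to the other.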
On cycle 6, data unit .D2 initiates another pair of double word accesses. Since data unit .D2 won in cycle 5, FSM 810 awards bank 820 to CPU1 W 11. Since lane 1 is free, FSM 811 awards bank 821 to CPU2 DW 9, which is only half of the double word transaction. FSMs 812-813 also award banks 822-823 to CPU2 DW 11-10.
On cycle 7, the DMA maxwait counter in FSM 810 reaches the maximum wait value and therefore awards bank 820 to DMA 8. At the completion of this transaction, control logic 808 then indicates that 256 bit transaction DMA 15-8 is complete.
On cycle 8, another 256 bit DMA transaction request arrives. Since CPU priority is currently higher than DMA priority, FSM 810 awards bank 820 to CPU2 DW 8, which is the other half of the double word request, since CPU1 won the last time. At the completion of this transaction, control logic 808 will indicate that data unit .D2's request CPU2 DW 9-8 is complete.
On cycle 9, another d-word request is received from data unit .D2 for an access beginning at word address 2. FSM 810 awards bank 820 to CPU1 W 12, which is the last of the five word sequence from data unit .D1. Lanes 2-3 are available, so FSMs 812-813 award banks 822-823 to CPU2 DW 13-12.
On cycle 10, another double d-word request is received from data unit .D2 for an access beginning at word address 2. FSMs 812-815 award banks 822-825 to CPU2 DW 17-16 and 15-14. Bank 820 is free, so FSM 810 awards bank 820 to the last word of the DMA transfer, DMA 16. At the completion of this transaction, control logic 808 then indicates that 256 bit transaction DMA 23-16 is complete.
In this manner, access to multibank shared resource SRAM 266 continues by slicing each request into bank size pieces and awarding access to each available bank based on priority and bandwidth control mechanisms for each bank.
FIG. 10 is a flow diagram illustrating operation of concurrent bank arbitration for access to a shared resource with multiple banks in a system that has multiple requesters. As described in more detail above, a shared memory resource such as L1D cache/SRAM 266 may be organized 1002 as an N bit wide memory. In the case of SRAM 266, N is 256. Of course, in other embodiments a different value of N may be embodied.
The memory is configured 1004 as a set of banks, where each bank is organized as a portion of the N bits. In some embodiments, each bank may be the same width, while in other embodiments some of the banks may be of different widths. In the embodiment of FIG. 8, there are eight banks that each have a width of 32 bits.
A separate arbitration mechanism and control logic is provided 1006 for each bank, such that each bank can be accessed individually. Referring back to FIG. 8, FSMs 810-817 operate independently to provide an arbitration contest for each requester that has a pending access request on a particular bank of shared resource 266.
As each transaction request is received from the various requesters, each request that is for an access wider than one bank is divided 1008 into a plurality of transaction requests for a respective portion of the plurality of banks. If the banks have different widths, then the transaction is divided according to widths of the banks it is targeting. In this embodiment of SoC 100, DMA requests may have a width of 256 bits, while CPU requests may have a width of 8 to 128 bits.
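The dividing step can be sketched as follows, including the case where the banks have different widths. This is an illustrative Python fragment only; the function name `slice_request`, the bit-offset addressing of the N bit row, and the example bank layouts are assumptions and are not part of the described embodiment.

```python
def slice_request(start_bit, width_bits, bank_widths):
    """Split a request covering bits [start_bit, start_bit + width_bits)
    of the memory row into per-bank pieces.

    bank_widths: list of bank widths in bits, in bank order.
    Returns a list of (bank_index, bits_in_that_bank) tuples.
    """
    end_bit = start_bit + width_bits
    if end_bit > sum(bank_widths):
        raise ValueError("request extends past the end of the memory row")
    pieces = []
    bank_start = 0
    for bank, bank_width in enumerate(bank_widths):
        bank_end = bank_start + bank_width
        # Number of requested bits that fall inside this bank.
        overlap = min(end_bit, bank_end) - max(start_bit, bank_start)
        if overlap > 0:
            pieces.append((bank, overlap))
        bank_start = bank_end
    return pieces


uniform = [32] * 8                 # eight 32-bit banks, N = 256
mixed = [64, 64, 32, 32, 32, 32]   # hypothetical layout with unequal widths

print(slice_request(0, 256, uniform))  # eight pieces of 32 bits each
print(slice_request(64, 64, uniform))  # [(2, 32), (3, 32)]
print(slice_request(96, 64, mixed))    # [(1, 32), (2, 32)]
```

A request no wider than one bank yields a single piece, so narrow CPU accesses and wide DMA accesses can be handled by the same per-bank arbitration path.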
Each of the plurality of transaction requests for the wide transaction is arbitrated 1010 separately against competing requests for each corresponding bank. Access to each bank is granted 1012 individually, such that two or more of the plurality of transaction requests of the wide transaction may be permitted to access a respective bank at different cycles when there is conflict with another request, and such that all of the transaction requests may occur in parallel on a same cycle when there is not a conflict with another request.
Each of the plurality of transaction requests in a set from a divided wide transaction is monitored 1014 during arbitration and access to the plurality of banks. The wide transaction request is indicated 1014 as complete only when all of the transaction requests in the set are complete.
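One simple way to picture this monitoring is a table of outstanding banks per wide transaction. The Python sketch below is illustrative only; the class name `CompletionTracker` and its methods are hypothetical and do not describe the actual control logic.

```python
class CompletionTracker:
    """Tracks which divided requests of each wide transaction are still pending."""

    def __init__(self):
        self.outstanding = {}  # transaction id -> set of banks not yet accessed

    def start(self, txn_id, banks):
        """Register a wide transaction divided across the given banks."""
        self.outstanding[txn_id] = set(banks)

    def bank_granted(self, txn_id, bank):
        """Record one divided request finishing.

        Returns True only when every divided request of the wide
        transaction has completed.
        """
        remaining = self.outstanding[txn_id]
        remaining.discard(bank)
        if not remaining:
            del self.outstanding[txn_id]
            return True
        return False


tracker = CompletionTracker()
tracker.start("DMA 15-8", range(8))
# Banks 1-7 are granted first; the transaction is not yet complete.
print([tracker.bank_granted("DMA 15-8", b) for b in range(1, 8)])
# Bank 0 is granted later; only now is the transaction reported complete.
print(tracker.bank_granted("DMA 15-8", 0))  # True
```

The usage above mirrors the DMA 15-8 transfer of FIG. 9, where seven banks are accessed on one cycle and the remaining bank several cycles later.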
As described in more detail above, one scheme for managing bandwidth 1020 is to provide a contention counter at each arbitration FSM for at least one of the requesters having a lower priority. A sequence of arbitration contests is performed at a given arbitration point FSM for requests from the plurality of requesters for access to the associated memory bank. Access is granted 1012 to the memory bank for the winning requestor of each arbitration contest. The contention counter is incremented (or decremented) each time the lower priority requester loses an arbitration contest in a sequence of arbitration contests. The priority of the lower priority requester is elevated when the contention counter reaches a value N, such that the lower priority requester will win the next arbitration contest.
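The contention counter scheme can be sketched for a single arbitration point with one higher priority and one lower priority requester. The Python fragment below is an illustration under stated assumptions: the class name `BankArbiter`, the parameter `max_wait`, and the choice to reset the counter whenever the lower priority requester is granted access are hypothetical details not specified above.

```python
class BankArbiter:
    """One arbitration point with a contention counter for the
    lower priority requester."""

    def __init__(self, max_wait):
        self.max_wait = max_wait
        self.losses = 0  # contests lost in a row by the lower priority requester

    def contest(self, high_req, low_req):
        """Run one arbitration contest; return 'high', 'low' or None."""
        if low_req and (not high_req or self.losses >= self.max_wait):
            # Either the bank is uncontested, or the counter has reached
            # the programmed value and the priority is elevated.
            self.losses = 0
            return "low"
        if high_req:
            if low_req:
                self.losses += 1  # the lower priority requester lost again
            return "high"
        return None


arb = BankArbiter(max_wait=4)
print([arb.contest(True, True) for _ in range(6)])
# ['high', 'high', 'high', 'high', 'low', 'high']
```

With both requesters contending on every cycle, the lower priority requester in this sketch is guaranteed one access out of every `max_wait + 1` contests, which bounds its worst-case wait at each bank.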

System Example

FIG. 11 is a block diagram of a base station for use in a radio network, such as a cell phone network. SoC 1102 is similar to the SoC of FIG. 1 and is coupled to external memory 1104 that may be used, in addition to the internal memory within SoC 1102, to store application programs and data being processed by SoC 1102. Transmitter logic 1110 performs digital to analog conversion of digital data streams transferred by the external DMA (EDMA3) controller and then performs modulation of a carrier signal from a phase locked loop generator (PLL). The modulated carrier is then coupled to multiple output antenna array 1120. Receiver logic 1112 receives radio signals from multiple input antenna array 1121, amplifies them in a low noise amplifier and then converts them to a digital stream of data that is transferred to SoC 1102 under control of external DMA EDMA3. There may be multiple copies of transmitter logic 1110 and receiver logic 1112 to support multiple antennas.
The Ethernet media access controller (EMAC) module in SoC 1102 is coupled to a local area network port 1106 which supplies data for transmission and transports received data to other systems that may be coupled to the internet.
An application program executed on one or more of the processor modules within SoC 1102 encodes data received from the internet, interleaves it, modulates it and then filters and pre-distorts it to match the characteristics of the transmitter logic 1110. Another application program executed on one or more of the processor modules within SoC 1102 demodulates the digitized radio signal received from receiver logic 1112, deciphers burst formats, and decodes the resulting digital data stream and then directs the recovered digital data stream to the internet via the EMAC internet interface. The details of digital transmission and reception are well known.
By making use of an individual bank arbitration system to control accesses to shared resources by multiple requesters within processor modules of SoC 1102, data drops are avoided while transferring the time critical transmission data to and from the transmitter and receiver logic.
Input/output logic 1130 may be coupled to SoC 1102 via the inter-integrated circuit (I2C) interface to provide control, status, and display outputs to a user interface and to receive control inputs from the user interface. The user interface may include a human readable media such as a display screen, indicator lights, etc. It may include input devices such as a keyboard, pointing device, etc.

Other Embodiments

Although the invention finds particular application to Digital Signal Processors (DSPs), implemented, for example, in a System on a Chip (SoC), it also finds application to other forms of processors. A SoC may contain one or more megacells or modules which each include custom designed functional circuits combined with pre-designed functional circuits provided by a design library.
While the invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various other embodiments of the invention will be apparent to persons skilled in the art upon reference to this description. For example, in another embodiment, a greater or smaller number of banks with individual bank arbitration may be implemented. Some embodiments may include bandwidth management using maximum wait counters, while other embodiments may be implemented without such bandwidth management.
While the shared memory was described herein as part of a processor module, in other embodiments the shared resource may be internal to a processor module, external to a processor module, included within an SoC or not part of an SoC, etc.
While a three bit priority value was described herein, in another embodiment more or fewer priority levels may be implemented. In another embodiment, higher priority values may indicate higher priority, for example.
In another embodiment, the shared resource may be just a memory that is not part of a cache. In various embodiments, the banks of a shared memory may be divided into banks that each have the same width or the banks may have different widths, for example. When the bank width is different, then a wide transaction must be aligned to the correct portion of banks before being divided.
Certain terms are used throughout the description and the claims to refer to particular system components. As one skilled in the art will appreciate, components in digital systems may be referred to by different names and/or may be combined in ways not shown herein without departing from the described functionality. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . .” Also, the term “couple” and derivatives thereof are intended to mean an indirect, direct, optical, and/or wireless electrical connection. Thus, if a first device couples to a second device, that connection may be through a direct electrical connection, through an indirect electrical connection via other devices and connections, through an optical electrical connection, and/or through a wireless electrical connection.
Although method steps may be presented and described herein in a sequential fashion, one or more of the steps shown and described may be omitted, repeated, performed concurrently, and/or performed in a different order than the order shown in the figures and/or described herein. Accordingly, embodiments of the invention should not be considered limited to the specific ordering of steps shown in the figures and/or described herein.
It is therefore contemplated that the appended claims will cover any such modifications of the embodiments as fall within the true scope and spirit of the invention.

Claims

1. A method for operating a memory subsystem, the method comprising:

organizing the memory as N bits wide;

configuring the memory as a plurality of banks, wherein each bank is organized to have a width that is a portion of the N bits;

providing a separate arbitration and control logic for each bank, such that each bank can be accessed individually;

receiving a plurality of transaction requests to access the memory;

arbitrating transaction requests targeted to each bank of the memory individually; and

granting access to each bank on each access cycle to a highest priority request for each bank, such that more than one transaction request may be granted access to the memory on a same access cycle.

2. The method of claim 1, further comprising:

dividing a wide transaction request that has a transaction width that is wider than a width of one bank into a plurality of divided requests each having a width less than or equal to a respective target bank width; and

arbitrating individually each of the divided requests such that two or more of the divided requests are permitted to access a respective target bank on different access cycles when there is conflict with another request, and such that all of the divided requests occur in a same access cycle when there is not a conflict with another request.

3. The method of claim 2, further comprising:

monitoring each of the divided requests during arbitration and access to the plurality of banks; and

indicating that the wide transaction request is complete only when all of the divided requests are complete.

4. The method of claim 2, wherein N is 256 bits, the wide request has a width in the range of 8 to 256 bits, and there are eight banks each having a width of 32 bits.

5. The method of claim 1, wherein all of the banks have a same width.

6. The method of claim 1, further comprising elevating a priority value of a transaction request that loses more than a specified number of arbitration cycles.

7. A digital system comprising:

a shared resource having an N bit wide access interface, wherein the shared resource is configured to have M banks each having a width that is a portion of the N bits;

a plurality of requesters coupled to request access the shared resource via an interconnect fabric, wherein the interconnect fabric is configured to selectively route each requester to any one of the M banks; and

M separate arbitration points coupled respectively to the M banks; the M separate arbitration points configured to grant access to each bank on each access cycle to a highest priority request for each bank, such that more than one access request may be granted access to the memory on a same access cycle.

8. The digital system of claim 7, further comprising slicing logic coupled to the interconnect fabric, wherein the slicing logic is operable to divide a wide access request that has a transaction width that is wider than a width of one bank into a plurality of divided requests for a respective portion of the M banks; and

wherein the M separate arbitration points are configured such that two or more of the divided requests are permitted to access a respective bank on different access cycles when there is conflict with another request, and such that all of the divided requests occur in a same access cycle when there is not a conflict with another request.

9. The digital system of claim 8, further comprising control logic coupled to the M banks, the control logic operable to monitor each of the divided requests during arbitration and access to the plurality of banks and to indicate that the wide access request is complete only when all of the divided requests are complete.

10. The system of claim 9, further comprising weighting logic coupled to each of the M arbitration points, wherein the arbitration point is configured to grant access to the first shared resource in response to the weighting logic.

11. The system of claim 10, wherein a weighting value of the weighting logic is operable to be dynamically updated while the system is in operation.

12. The system of claim 10, wherein the weighting logic comprises a maximum wait counter.

13. The system of claim 7 being a system on a chip (SoC), wherein the shared resource and the plurality of requesters are comprised within a core module within the SoC.

14. The system of claim 13, further comprising a plurality of the core modules within the SoC.

15. A system comprising:

a target resource having M banks and a plurality of requesters coupled for access to the target resource;

means for associating a separate arbitration point with each of the M banks of the target resource;

means for receiving a plurality of transaction requests from the plurality of requesters to access the target resource;

means for dividing a wide transaction request that has a transaction width that is wider than a width of one bank into a plurality of divided requests for a respective portion of the M banks; and

means for granting access to each bank on each access cycle to a highest priority request for each bank, such that more than one of the plurality of requesters may be granted access to the target resource on a same access cycle.