WO2003088047A1

WO2003088047A1 - System and method for memory management within a network processor architecture

Info

Publication number: WO2003088047A1
Application number: PCT/US2002/011523
Authority: WO
Inventors: Ryszard Bleszynski; Man Dieu Trinh
Original assignee: Bay Microsystems, Inc.
Priority date: 2002-04-12
Filing date: 2002-04-12
Publication date: 2003-10-23
Also published as: AU2002307270A1

Abstract

The present invention discloses a system and method for balancing memory accesses to a low cost memory unit (shared central buffer) in order to sustain and guarantee a desired line rate regardless of the incoming traffic pattern. The memory unit may include, for example, a group of dynamic random access memory units. The memory unit is divided into memory channels and each of the memory channels is further divided into memory lines, each of the memory lines includes one or more buffers that correspond to the memory channels. The determination as to which of one or more buffers within a memory line an incoming information element is stored is based on factors such as the number of buffers pending to be read within each of the memory channels, the number of buffers pending to be written within each of the memory channels, and the number of buffers within each of the memory channels that has data written to it and is waiting to be read.

Description

System and Method for Memory Management Within a Network Processor Architecture

Field of the Invention

The present invention relates in general to the maximization of bottlenecks in a communication network and more specifically, to the maximization of memory bottlenecks in a processor.

Background Information

Almost all communications equipment uses one or more network processors. Communications equipment include, but is not limited to, high-speed routers, switches, intelligent optical devices, digital subscriber line access multiplexers, broadband access devices, and voice gateways. The equipment may deploy the network processor in a centralized or distributed manner. The distributed network processor is popular for high speed and intelligent communications equipment. For lower and mid-range equipment, the centralized network processor is attractive because this leads to a lower cost. In complex high-speed intelligent broadband equipment, the network processors (such as those manufactured by Intel Corporation or Agere Corporation) are distributed and each line card may include one or more network processors.

Figure 1 illustrates a prior art line card 110 and its components. In the line card 110, the fiber-optic line 109 is coupled to the optical module 100. The other end of the fiber-optic line 109 typically connects to an external router or another communications device. Among other functions, the optical module 100 converts the optical signal into an electrical signal. The optical module 100 presents the electrical signal to the framer 101. The framer 101 performs functions such as: framing, error checking and statistical gathering. The framer 101 provides the framed information to the classifier 102. The classifier 102 performs a flow classification function. The classifier 102 is an optional component that may or may not exist in the line card 110. Most equipment do not use a classification beyond layer three or four of the seven-layer Open Systems Interconnection ("OSI") Reference Model. Most network processors perform at least up to layer three or four. The network processor processes the information and forwards it into the appropriate line card 110 within the system's backplane 111 using the switch fabric 104. Logically, the optical module 100 and the framer 101 perform layer one of the OSI stack, whereas the network processor and the classifier 102 handle layers 2 through 7. Processing intelligence, power and bandwidth capacity are the biggest differentiation factors between network processors.

Among the single biggest limiting factor for network processors to scale and meet increasing Internet bandwidth demand is Moore's Law. Figure 2 illustrates Moore's Law versus the Internet bandwidth demand curve. This law predicts that chip capacity would double every 18 months. Moore's Law limits the advancement in semiconductor process technology to 18 months in order to achieve a 100% performance improvement. Doubling every 18 months is far below the Internet bandwidth demand, which doubles every four to six months. As of today, early generation network processors cannot scale by 4 or 16 within a two to three year time window. Overcoming Moore's Law is a non-trivial process.

The current techniques used in network processor architectures are bounded by Moore's Law. In general there are three approaches to network processor architectures: (1) using multiple reduced instruction set computing ("RISC") processors, (2) using configurable hardware and (3) using a mix of RISC processors and configurable hardware. The RISC processor and its instruction set were created decades ago for devices geared toward human-to- machine interaction. Network equipment, however, are machine-to-machine devices that demand a much larger bandwidth than that demanded by the human- to-machine interaction. Multiple RISC processors within the data path of networking equipment will not satisfy the large bandwidth demand by the equipment. Moore's Law is one limiting factor. Another severe limiting factor is the complexity of the software compiler, scheduler and kernel used to efficiently control and maximize the RISC processor's operation. Creating a mini operating system is not the solution to the explosive demand in bandwidth, especially when Moore's Law (i.e., the hardware) cannot even meet this demand.

Using configurable hardware results in the highest-performance processors. The simple software interface used with configurable hardware avoids any performance degradation. Eliminating any software within the information path and replacing it with configurable gates and transistors significantly boosts the performance of the network processor. However, at the gate level, without any creativity within the architecture itself, Moore's Law still bounds the performance advancement of the network processor architecture.

Using a mixture of multiple RISC processors and configurable hardware machines has two different flavors. The first flavor uses the RISC processor in the data path and the other one is to have the RISC processor in the control path. Traditionally, RISC processors in the control path have been limited to those external to the network processor. Even with configurable hardware, RISC processors should not be used in the data path because they cannot satisfy the large bandwidth demanded by network equipment.

In addition to the processing capability of the network processor, a more critical bottleneck in the network processor architecture is the memory throughput for the payload buffer. Memory technology advancement is also bounded by Moore's Law. Today's generation of store and forward network processors use a single hierarchy memory organization. Bandwidth may be increased by increasing the width of the memory bus. Increasing the information width of the packet memory bus, however, only decreases the actual memory throughput for packet sizes smaller than the bus width because of the additional processing overhead.

Figure 3 illustrates a prior art multilevel memory hierarchy within a processor. Due to the principle of locality, the linear multilevel memory hierarchical scheme of Figure 3 works well in a system utilizing RISC or complex instruction set computing ("CISC") processors. The RISC or CISC processor includes very high speed registers 300 that can be immediately accessed by those processors. These registers 300 are high-speed memory and are internal to the processor providing the processor with very high-speed single cycle access to the stored data. A cache 301 is a relatively small piece of memory that has a slightly slower access time compared to the registers 300. A main memory 302 is the main general-purpose storage region to which the processor has direct access. The main memory 302 has a slower access time than the cache 301. A hard disk 303 magnetically stores data on one or more platters. The hard disk 303 is slower than the main memory 302. As the memory hierarchy moves away from the processor, the storage capacity increases and the access time decreases.

The multilevel memory hierarchy and caching theory works well in the RISC and CISC architecture but due to the non-deterministic nature of network traffic, does not work well in network processors. The principle of locality on which caching theory relies does not apply in a networking environment.

Therefore, it is desirable to have a system and method to efficiently access a memory unit while processing network traffic.

Summary of the Invention

According to an embodiment of the present invention, a first method is described to optimally access a memory unit of a processor where the memory unit is logically partitioned to form multiple memory channels, the multiple memory channels are further logically partitioned to form multiple memory lines, each of the multiple memory lines includes multiple buffers and each of the multiple buffers corresponds to a separate one of the multiple memory channels. This method includes determining at least one load parameter of each of the multiple memory channels, and based on the determined at least one load parameter, selecting a particular one of the multiple memory channels.

According to the embodiment of the present invention, a second method is described to optimally access the memory unit of a processor where the memory unit is logically partitioned to form multiple memory channels, the multiple memory channels are further logically partitioned to form multiple memory lines, each of the multiple memory lines includes multiple buffers and each of the multiple buffers corresponds to a separate one of the multiple memory channels. This second method includes determining at least one load parameter of each of the multiple memory channels, and selecting a particular one of the multiple memory channels that has a particular one of the at least one load parameter that is the lowest.

According to the embodiment of the present invention, a third method is described to optimally access a single hierarchical level memory unit of a processor where the memory unit is logically partitioned to form multiple memory channels, the multiple memory channels are further logically partitioned to form multiple memory lines, each of the multiple memory lines includes multiple buffers and each of the multiple buffers corresponds to a separate one of the multiple memory channels. This third method includes determining, for each of the plurality of memory channels, the number of pending read requests, the number of pending write requests, and the number of active buffers which is the number of a particular one of the multiple buffers that is unavailable and corresponds to the particular one of the multiple memory channels in each of the multiple memory lines, and selecting a particular one of the multiple memory channels that has a lowest number of pending read requests, a lowest number of pending write requests, a lowest number of active buffers, or a corresponding channel identification number that is next in a round robin scheme.

According to the embodiment of the present invention, a first system is described to optimally access a memory unit, the first system includes the memory unit that is logically partitioned to form multiple memory channels, a traffic analyzer to determine one or more loads of each of the multiple memory channels, and a bandwidth balancer to select a particular one of the multiple memory channels based on the determined one or more loads.

According to the embodiment of the present invention, a second system is described to optimally access a memory unit, the second system includes the memory unit that is logically partitioned to form a plurality of memory channels, a bandwidth management unit that includes a traffic analyzer to determine one or more loads of each of the multiple memory channels, and a bandwidth balancer to select a particular one of the multiple memory channels based on the determined one or more loads, and a policy control unit to provide an information element or a particular one of a plurality of information element segments for writing to the selected one of the multiple memory channels.

Brief Description of the Drawings

Figure 1 illustrates a prior art line card and its components.

Figure 2 illustrates Moore's Law versus the Internet bandwidth demand curve.

Figure 3 illustrates a prior art multilevel memory hierarchy within a processor.

Figure 4 illustrates an embodiment of a memory management subsystem according to the present invention.

Figure 5 illustrates an embodiment of a network processor according to the present invention.

Figure 6 illustrates an embodiment of a payload channel sequence table according to the present invention.

Figure 7 illustrates an embodiment of a bandwidth balancing flowchart according to the present invention.

Description of the Specific Embodiments

The present invention provides novel systems and techniques for balancing memory accesses to sustain and guarantee a desired Internet bandwidth (e.g., line rate) demand under any traffic pattern using low cost memory such as dynamic random access memory ("DRAM"). Among other reasons, network processor advancements to meet the explosively-increasing demand requirements for Internet bandwidth cannot rely on traditional memory locality principles. In particular, the technique of the present invention provides a novel traffic analyzer and memory bandwidth balancer that will maximize aggregate memory bandwidth using low cost memory, such as commercially available DRAM memory, and enable true scalability for network processors and advancement independent of improvements in memory capabilities and Moore's Law.

According to the invention, systems and methods are provided for maximizing the memory throughput by dividing the memory into channels. The memory hierarchy is single level as opposed to the linear multilevel approach used in prior art computer systems as shown earlier in Figure 3. Each memory channel may include single or multiple banks of DRAM and have a 64-bit wide information path. The bandwidth balancing can be applied to channel granularity other than 64-bits. Due to the long latency of low cost DRAM, it can be mathematically proven that four individual 64-bit wide memory channels provide significantly better performance than a single 256-bit wide memory, especially for smaller packet sizes.

Figure 4 illustrates an embodiment of a memory management subsystem according to the present invention. A bandwidth management unit 410 in Figure 4 resides within the network processor. The framer 101 or the classifier 102 not shown in Figure 4 is located between the fiber-optic line 109 and the network processor. Each fiber-optic line 109 connects to one or more external routers or another communications device. As each packet or cell arrives, it is temporarily stored within the ingress FIFOs 415 of input/output unit ("IOU") 400. The bandwidth management unit 410 includes a traffic analyzer 401, a bandwidth balancer 402, a resource manager 403, and payload channel FIFOs 404 (to clarify Figure 4, note that packets do not actually pass through the traffic analyzer 401, the bandwidth balancer 402, or the resource manager 403). The traffic analyzer 401 analyzes the traffic by using counters to measure the depth level of the payload channel FIFOs 404. The count values are used by the bandwidth balancer to apply the balancing algorithm. The bandwidth balancer 402 balances the traffic load across the multiple channels (i.e., channel 1 to channel n). The resource manager 403 interfaces with a buffer management unit 406 for pointer allocation and recycling. The payload channel FIFOs 404 on the memory side provide additional temporary storage to compensate for latencies inherent within a memory unit 405. The memory unit 405 may include groups of DRAM units. As illustrated in Figure 4, the memory channels may include two or more channels. Each channel bus width used in this example is 64-bits wide. The bandwidth balancing may also, for example, be applied to channel bus widths of 2 to n where n is a positive integer.

In a store and forward architecture, network traffic arrives from the line side and the network processor temporarily buffers the information in the memory side. This buffering provides tolerance and this prevents network congestion. The buffering also allows the network processor to perform traffic engineering and forwarding functions to determine the next hop destination of the packet data. After the network processor determines the destination, the traffic leaves the processor from the memory into the line side. In this example the buffer granularity is 64-bytes. The present invention can be applied to buffer size other than 64-bytes.

The present invention guarantees and sustains a line rate of, for example, 10 Gbps using four memory channels at a memory bus frequency of 166 MHz. Increasing the frequency to 200 MHz, the algorithm guarantees and sustains up to 20 Gbps of line rate. With 6 channels at 266 MHz, 40 Gbps of usable memory bandwidth is achievable. These numbers apply to packet sizes of 40-bytes or greater.

For the example here, four channels and a 64-byte buffer size are used. In an ideal scenario and the simplest case, when a packet or cell arrives, each 64-byte chunk is stored in one memory channel in a sequential manner. In particular, the first 64-byte goes to channel one, and the next one goes to channel two and so forth. This simplest case works fine if the outgoing (egress) traffic pattern is deterministic. Due to the non-deterministic outgoing traffic pattern experienced in real-world networks, however, the memory channels may not be balanced and thus the aggregate memory bandwidth will fall below the line rate. With unbalanced memory accesses, one or two channel swap may occur and the line rate cannot be sustained or guaranteed. This is true since each channel provides significantly less bandwidth than the line rate requirement. In other words, because different packet streams are read out at different rates (e.g., DSL, TI) due to the demands of different-quality services, these varying demands will affect each channel differently and generally cause the rate of reading an information element from one network memory channel to be different from the reading rates of the other memory channels. The non-deterministic nature of the reading rate makes it particularly important to the determination of traffic flow.

According to an aspect of the invention, a method is provided to analyze the incoming and outgoing traffic patterns. The method uses the traffic analyzer 401 to analyze incoming and outgoing traffic by monitoring the depth level of the FIFOs. In one embodiment, the traffic analyzer 401 uses counters to measure the FIFO depth level.

In another aspect of the invention, the bandwidth balancer 402 intelligently determines the channel selection for storing the incoming traffic. The bandwidth balancer 402 balances the channels appropriately depending on the incoming and outgoing traffic patterns. The bandwidth balancer 402 allocates a memory line by fetching the corresponding pointers for the line (which, for example, may consist of four 64-byte buffers) from the buffer management unit 406. The memory line may include two to N buffers, each of which may be assigned to a channel. In the example, the bandwidth balancer 402 fetches four pointers, one for each buffer. Under severe traffic patterns, the bandwidth balancer 402 includes intelligence to sacrifice one or more 64-byte buffers without even using the buffer space in exchange for a new line. The new line has four 64-byte buffers and thus increases the channel selection choice.

In this embodiment, the memory unit 405 is divided into two or more memory channels. The memory unit 405 is addressed line by line. Each memory line is divided into buffers, each of which may be assigned to a corresponding channel (in some instances, a buffer is not assigned to a channel if it is sacrificed when channel skipping is enabled as described below). A buffer is considered to be assigned to a channel when it is permitted to store data in that memory channel. Each memory line is pointed to by a line pointer, and each buffer is pointed to by a channel pointer. The bandwidth balancer 402 selects into which memory channel an information element is to be stored. The information element may be, for example, a complete packet if the size of the buffer can accommodate the entire packet, or a portion of a packet, if the buffer is only large enough to accommodate a portion of the packet. The information element may also be, for example, an ATM cell. The memory unit 405 may be DRAM for storing data in a network processor. Each memory channel may comprise one or more banks of DRAM. While the data is stored in the memory unit 405, the network processor or other communications device determines, among other things, to which destination (e.g., a next hop router) the data should be sent.

The bandwidth balancer 402 determines channel selection based upon load parameters, such as, for example, parameters relating to incoming and outgoing traffic for a channel, or, more particularly, the number of currently pending read and write requests and the active buffer count for each of the channels. In one embodiment (with no buffer sacrifice option), the bandwidth balancer 402 select the channel for the incoming cell, packet or packet segment using the following order of criteria:

1. Of the unoccupied channels in a line of memory, the channel with the lowest number of read requests is selected because a read request has the highest priority.

2. If there is more than one available channel within the line with the lowest number of read requests, then, of those channels, the channel with the lowest number of write requests is selected.

3. If the number of write requests pending for those channels is the same, then of those channels, the channel with smallest number of active buffers is selected (a buffer is active if it stores data to be read out). This criterion is based on a statistical prediction related to the fact that the channel with the higher number of active buffers will eventually generate more reads from that channel. 4. If all the channels determined by 1, 2 and 3 above have the same number of active buffers, then a round robin scheme is used to select from among the available channels based upon an ascending or descending order of channel identification numbers.

In a store and forward architecture, such as that of the present invention, the information stored in the memory will eventually be read out for forwarding or discarding. Based on this fact, the method in the present invention uses counters for each channel to keep track of the number of buffers pending to be read, pending to be written, and active. Each channel uses three counters: pending read requests, pending write requests, and active buffers.

Figure 5 illustrates an embodiment of a network processor according to the present invention. In Figure 5, a Policy Control Unit ("PCU") 512 resides within the network processor. Among other functions, the PCU 512 performs functions such as usage parameter control ("UPC") on information elements (e.g., packets or packet segments) arriving on the ingress FIFOs 415. WT en the PCU 512 completes its operations on an incoming cell or packet segment, the PCU 512 initiates a write request to a Data Buffer Unit ("DBU") 514 to write data into the memory unit 405. As explained in detail below, the bandwidth balancer 402 selects the memory channel into which the information element is to be stored. Each channel in the memory unit 405 corresponds to write request (control) FIFOs 519a-d and associated incoming write payload (data) channel FIFOs 511a- d. The PCU 512 transmits the write request to the DBU 514 through the write request FIFOs 519a-d corresponding to the selected memory channel, and temporarily stores the information element in the incoming write payload channel FIFOs 51 la-d corresponding to the selected memory channel before the information element is written into the selected memory channel. An interface 503 interfaces with the payload channel FIFOs and the channels that transfer information between the DBU 514 and the memory unit 405. The reason request and payload channel FIFOs are used to store requests and data, respectively, is that accesses to the memory unit 405 (e.g., DRAM) are subject to nondeterministic latencies.

Among other actions, a Forwarding Processing Unit ("FPU") 513 performs a forwarding function, including calculation of the next destination (e.g., next hop). When the FPU 513 completes its operations for an outgoing cell or packet segment, the FPU 513 initiates a read request to the DBU 514 to read data from the memory unit 405. Each channel in the memory unit 405 corresponds to read request (control) FIFOs 520a-d and associated outgoing read payload (data) channel FIFOs 510a-d. The FPU 513 transmits the read request to the DBU 514 through a particular one of the read request FIFOs 520a-d corresponding to the channel storing the requested information element. After the DBU 514 reads the information element from a buffer in the memory unit 405, the information element is stored in a particular one of the outgoing read payload channel FIFOs 510a-d corresponding to the channel from which the information element came. The information element is then transferred to an egress line FIFO after which the information element is forwarded to its next destination.

When the PCU 512 generates a write request, a particular one of pending write request counters 500a-d corresponding to the channel that the write request is directed is incremented by one. When the DBU 514 services the write request (moves data from the incoming write payload channel FIFO to a buffer at a channel in the memory unit 405), a particular one of the pending write request counters 500a-d corresponding to the channel is decremented by one. When the FPU 513 generates a read request a particular one of the pending read request counters 504a-d corresponding to the channel at which the read request is directed is incremented by one. When the DBU 514 services the read request (moves data from a buffer at a channel in the memory unit 405 to an outgoing read payload channel FIFO), the a particular one of the pending read request counters 504a-d corresponding to the channel is decremented by one. When the PCU 512 generates a write request, a particular one of the active buffer counters 515a-d corresponding to the channel that the write request is directed is incremented by one. When the FPU 513 generates a read request, a particular one of the active buffer counters 515a-d corresponding to the channel to which the read request is directed is decremented by one.

One of ordinary skill in the art will recognize that the present invention is not limited to using the exemplary counters described herein, but can use any technique to measure the load on the memory channels.

For four channels, this embodiment uses 12 counters. One memory line includes four 64-byte buffers. When the resource manager 403 fetches a buffer pointer, the buffer management unit 406 provides a line pointer. The resource manager 403 fetches a new line pointer when the bandwidth balancer 402 requests that a new line be fetched.

The line pointer points to a memory line which includes four buffers, each of which may be assigned to a channel. Each time a 64-byte quantity of information is ready to be stored, the bandwidth balancer 402 selects from among the four channels. The PCU 512 maintains state information using a field called the payload channel occupancy ("PCO") to identify which of the four channels are occupied. For example, if the buffers in a line corresponding to channels 1 and 3 are occupied, the PCO vector for that line would be (1,0,1,0) where the element vectors correspond to channels (3,2,1,0) in that order. A channel is defined as "occupied" or "unavailable" with respect to a particular memory line if, within that line, the buffer that corresponds to the channel stores data. A channel is defined as "written" with respect to a particular memory line when data is written into a buffer corresponding to the channel within that line. The relationship between the buffers and the channels is maintained in a channel sequence table ("CST"), as explained below. The PCO is a four-bit field for each memory line that is maintained in a separate structure called a policy control state ("PCS") within the PCU 512.

Initially, when the resource manager 403 fetches a new line from the buffer management unit 406 (see Figure 4), the bandwidth balancer 402 can select any one of the four channels. The corresponding bit in the PCO field is set to logic one to indicate when a particular channel is already occupied. When the next cell or packet segment arrives after the first channel is selected to be written by the bandwidth balancer 402, the bandwidth balancer 402 can select any one of the remaining unoccupied three channels. When only one or two channels are left, the selection is constrained to those one or two channels.

When the PCO state indicates that there are only one or two channels left and the channel selection does not meet any of the above four balancing criteria, the bandwidth balancer 402 includes an option to sacrifice one or more buffers (e.g., 64-bytes of the 256 Mbytes of memory) for performance trade-off. Under appropriate load conditions, when the channel sacrifice (i.e., channel skipping) option is enabled the resource manager 403 fetches a new memory line pointer (i.e., allocates a new line) and this provides the bandwidth balancer 402 with four new buffers, one buffer per channel, to choose from instead of one or two. The channel skipping can be applied when one or two channels are available for selection and the available channels do not meet the channel selection criteria. One or two buffers can be sacrificed for performance. Instead of storing the data in one of the two remaining buffers if storing the data in either of those buffers would load the corresponding channel beyond a limit deemed acceptable according to the four balancing criteria, the bandwidth balancer 402 may fetch a new line and store the data in any of the buffers of that new line. In this embodiment, channel skipping is not limited to skipping one or two buffers, but rather, any number of buffers may be skipped in order to prevent the overloading of a channel.

Buffer Link List

In the architecture of the present invention, the next buffer pointer is stored in the header of the current buffer. The next buffer pointer is written in the header section of the buffer at the same time as the payload using a burst- write transaction. In order for the present invention to enable dynamic channel selection, the payload channel occupancy state information within one memory line cannot reside within the payload buffer header. The sequence of every four cells or packet segments is dynamic and it is determined by the bandwidth balancer 402. The payload channel occupancy state information is kept in a separate data structure.

This embodiment uses a separate data structure to maintain the sequence of channel usage. The data structure that maintains the sequence is called the channel sequence table ("CST") 600. Figure 6 illustrates an embodiment of the CST 600 according to the present invention. The CST 600 may be stored in SRAM or embedded DRAM.

The CST 600 includes information about the sequence of the channel occupancy within the memory line (e.g., line pointer 601, line pointer 602, and line pointer 603 represent a memory line). In this example, a memory line includes four 64-byte buffers, one buffer from each channel (e.g., the channels in Figure 6 are channel 0 which is referenced by the number "604", channel 1 which is referenced by the number "605", channel 2 which is referenced by the number "606", and channel 3 which is referenced by the number "607"). Since one packet may occupy one or more buffers, the buffer sequence within a packet has to be maintained. Initially in this example, buffer one contains the first segment of the packet (packet 1, cell 1), buffer two contains the second segment (packet 1, cell 2) and so forth.

The first buffer location field within the CST 600 in Figure 6 contains the channel number (represented in binary) to which the first buffer is assigned. The second buffer location contains the channel number where the second buffer resides. The third buffer location contains the channel number where the third buffer resides. Since there will be some occasions that one or two of the buffers within a line are not used because they fail to meet the balancing criteria (and are thus sacrificed), this embodiment uses a valid bit within the CST 600 data structure to indicate whether the buffer is occupied. As illustrated in the second line of the CST 600, a valid bit of zero for the fourth buffer location indicates that that buffer is being sacrificed (i.e., skipped).

According to the present invention, the CST 600 serves two purposes. It provides real-time dynamic channel assignment for the bandwidth balancer 402. In addition, the CST 600 enables a pre-fetch method for the FPU 513 in the unassigned bit rate ("UBR") or packet mode of operation. In UBR and packet mode, the FPU 513 forwards cells and packets one packet at a time. In conjunction with the first segment of the packet, the FPU 513 can fetch one entry from the channel sequence table ("CST") 600 and know exactly the sequence of channels in which to send the read requests to the DBU 514 to fetch the information from memory in advance. The pre-fetch method in the FPU 513 provides a tremendous increase in throughput especially for large packets spanning more than one memory line or four buffers in this particular example.

This embodiment also uses an end-of-packet ("EOP") field within the CST 600. The PCU 512 sets the value of the EOP bit to one to mark the end of packet. This information allows the FPU 513 to pre-fetch the sequence information until it encounters the buffer with the EOP field set to one.

In the present implementation of this invention, the CST 600 structure resides in a separate memory region. The memory used in this region may be a static random access memory ("SRAM") which provides data every cycle. The bandwidth occurring on this interface is 2 reads and 1 write. The PCU 512 does a read-modify- write to update the CST 600. The FPU 513 only reads and uses the information contained within the table.

Figure 7 illustrates an embodiment of a bandwidth balancing flowchart according to the present invention. After the information element (e.g., cell or packet segment) arrives (block 700), the PCU 512 provides the PCO state information to the bandwidth balancer 402 (block 701) and the bandwidth balancer 402 reads the values of all the current counters (block 702). The first test the bandwidth balancer 402 performs is whether the channel sacrifice (i.e., skipping) option is enabled (block 703). This option may be enabled by the user. Channel Skipping Not Enabled

If this option is not enabled, the bandwidth balancer 402 will determine from among the available (unoccupied) channels the channel with the lowest number of pending read requests (block 705).

If only one channel has a lowest read count (block 706), then the bandwidth balancer 402 will select this channel for storage of the information element (e.g., packet segment), and indicate in the PCO that the channel is occupied by setting to a logic one an indicator in the appropriate field of the PCO corresponding to the occupied buffer in the memory line (block 719). If the selected channel is not the last channel within the line (decision block 721), then the flowchart for this information element completes (block 727), and the bandwidth balancer 402 will wait for the next information element to arrive. If the selected channel is the last channel (decision block 721), then the bandwidth balancer 402 will fetch a new line pointer (thereby allocating a new line in memory) and initialize the PCO to zero for that line before exiting the algorithm (block 726).

If more than one channel has a lowest read count (block 706), then from among those channels, the bandwidth balancer 402 determines which channel has the lowest number of pending write requests (block 707). If only one channel has both the lowest read and write counts (block 708), then the bandwidth balancer 402 selects this channel for storage of the information element, and proceeds to marking the PCO (block 719) followed by the other actions performed above in the case of only one lowest-read count channel (i.e., the last channel test).

If, however, more than one channel has both the lowest read and write counts (block 708), then from among those channels, the bandwidth balancer 402 determines the channel that has the lowest number of active buffers. If more than one channel matches all three criteria, then the bandwidth balancer 402 uses a round robin selection from among those channels based upon an ascending or descending order of the channel identification numbers, which are arbitrarily assigned as is well known in the art (block 709). The bandwidth balancer 402 selects the channel that survives these tests, and marks the PCO accordingly (block 719).

Channel Skipping Enabled

If the channel skipping option is enabled, the bandwidth balancer 402 will determine the channel with the lowest number of pending read requests (block 704). If only one channel has a lowest read count (block 711), then the bandwidth balancer 402 checks using the state information in the PCO whether the channel is available (i.e., unoccupied) (block 714). If the channel is available, the bandwidth balancer 402 selects the channel for storage of the information element, marks the PCO (block 719) and performs the last channel test (block 721 and block 726 if appropriate).

If, however, the lowest-read-count channel is not available (block 714), then the bandwidth balancer 402 performs a last channel test and fetches a new line if the channel is the last channel (blocks 713 and 720). Then the bandwidth balancer 402 starts again at the first determination of the channel with the lowest read count (block 704) to ultimately determine in which channel of the new line the data should be stored.

If the lowest-read-count channel is not available (block 714) and not the last channel (block 713), the bandwidth balancer 402 finds the channel with the next lowest read count value (block 712). If (1) there is only one channel with this next lowest-read-count value (block 711), and (2) it is available (block 714), then the bandwidth balancer 402 selects this channel for storage, marks the PCO accordingly (block 719), and performs the last channel test (block 721 and block 726 if appropriate).

If more than one channel has the lowest read count (block 71 1), then from among those channels, the bandwidth balancer 402 determines which channel has the lowest number of pending write requests (block 710). If only one channel has both the lowest read and write counts (block 715), then the bandwidth balancer 402 determines whether this channel is available (block 718). If it is, then the bandwidth balancer 402 selects this channel for storage of the information element, marks the PCO (block 719) and performs the last channel test (block 721 and block 726 if appropriate).

If the channel is not available (block 718), then the bandwidth balancer 402 determines whether that channel is the last channel capable of being assigned in the line (block 722). If it is, then the bandwidth balancer 402 fetches a new buffer line (block 725). Then the bandwidth balancer 402 starts again at the first step of determining the channel with the lowest read count (block 704) to ultimately determine in which channel of the new line the data should be stored.

If the channel is not the last channel (block 722), then the bandwidth balancer 402 finds the channel having both the lowest read count and the next lowest write count (block 717). The bandwidth balancer 402 then again makes the determination whether there is more than one channel meeting these criteria (block 715), going through the loop again.

If, however, more than one channel has both the lowest read and write counts (block 715), then from among those channels, the bandwidth balancer 402 determines the channel that has the lowest number of active buffers, or, if more than one channel matches all three criteria, then the bandwidth balancer 402 uses a round robin selection from among those channels based upon an ascending or descending order of the channel identification numbers (block 716).

The bandwidth balancer 402 then determines whether the channel that survives all these tests is available (block 723). If it is, then the bandwidth balancer 402 selects it for storage, marks the PCO (block 719) and performs the last channel test (block 721 and block 726 if appropriate). If, however, the channel is not available (block 723), then the bandwidth balancer 402 determines whether the channel is the last channel in the line (block 724). If it is not the last channel, then the bandwidth balancer 402 determines the channel having both the lowest read and write counts as well as the next lowest active buffer count, or, if more than one channel matches all three criteria, then the bandwidth balancer 402 uses a round robin selection from among those channels based upon an ascending or descending order of the channel identification numbers (block 728). The bandwidth balancer 402 then performs the channel available test again (block 723).

If the channel is the last channel (block 724), then the bandwidth balancer 402 fetches a new buffer line (block 725). Then the bandwidth balancer 402 starts again at the first step of determining the channel with the lowest read count (block 704) to ultimately determine in which channel of the new line the data should be stored.

The selection algorithm represented by the above flowchart is only one example of the implementation of the bandwidth balancer 402, and should not be viewed as limiting the scope of the invention. The invention can, for example, employ other algorithms using other count mechanisms with a similar or different sequence of tests in order to allocate incoming information elements among memory channels.

In the preceding specification, the invention has been described with reference to specific embodiments. It will, however, be evident that various modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims that follow. The specification and drawings are accordingly to be regarded in an illustrative rather than restrictive sense.

Claims

ClaimsWhat Is Claimed:

1. A method to optimally access a memory unit where the memory unit is logically partitioned to form a plurality of memory channels, the plurality of memory channels are further logically partitioned to form a plurality of memory lines, each of the plurality of memory lines includes a plurality of buffers and each of the plurality of buffers corresponds to a separate one of the plurality of memory channels, comprising: determining at least one load value of each of the plurality of memory channels; and based on the determined at least one load value, selecting a particular one of the plurality of memory channels.

2. The method of claim 1 wherein the step of determining the at least one load value of each of the plurality of memory channels includes determining, for each of the plurality of memory channels, the number of pending read requests.

3. The method of claim 1 wherein the step of selecting the particular one of the plurality of memory channels includes selecting the particular one of the plurality of memory channels that has a lowest number of pending read requests.

4. The method of claim 1 wherein the step of determining the at least one load value of each of the plurality of memory channels includes determining, for each of the plurality of memory channels, at least one of the number of pending write requests, and the number of active buffers which is the number of a particular one of the plurality of buffers that is unavailable and corresponds to the particular one of the plurality of memory channels in each of the plurality of memory lines.

5. The method of claim 1 wherein the step of selecting the particular one of the plurality of memory channels includes selecting the particular one of the plurality of memory channels that has at least one of a lowest number of pending write requests, a lowest number of active buffers, and a corresponding channel identification number that is next in a round robin scheme.

6. The method of claim 1 wherein the memory unit is a plurality of dynamic random access memory units.

7. The method of claim 1 wherein each of the plurality of buffers has a fixed- size.

8. The method of claim 7 further comprising receiving an incoming information element; if the size of the information element is greater than the fixed-size of each of the plurality of buffers, dividing the information element into a plurality of information element segments, each of the plurality of information element segments having a size less than or equal to the fixed-size of each of the at least one buffer; and storing at least one of the information element and a particular one of the plurality of information element segments within a particular one of the plurality of buffers corresponding to the selected one of the plurality of memory channels at a particular one of the plurality of memory lines.

9. The method of claim 1 wherein each of the plurality of memory channels has a width equal to a width of the memory unit divided by the number of the plurality of memory channels.

10. A method to optimally access a memory unit where the memory unit is logically partitioned to form a plurality of memory channels, the plurality of memory channels are further logically partitioned to form a plurality of memory lines, each of the plurality of memory lines includes a plurality of buffers and each of the plurality of buffers corresponds to a separate one of the plurality of memory channels, comprising: determining at least one load value of each of the plurality of memory channels; and selecting a particular one of the plurality of memory channels that has a particular one of the at least one load value that is the lowest.

11. The method of claim 10 wherein the step of determining the at least one load value of each of the plurality of memory channels includes determining, for each of the plurality of memory channels, at least one of the number of pending read requests, the number of pending write requests, and the number of active buffers which is the number of a particular one of the plurality of buffers that is unavailable and corresponds to the particular one of the plurality of memory channels in each of the plurality of memory lines.

12. The method of claim 10 wherein the step of selecting the particular one of the plurality of memory channels that has the lowest determined load includes selecting the particular one of the plurality of memory channels that has at least one of a lowest number of pending read requests, a lowest number of pending write requests, a lowest number of active buffers, and a corresponding channel identification number that is next in a round robin scheme.

13. The method of claim 10 wherein the memory unit is a plurality of dynamic random access memory units.

14. The method of claim 10 wherein each of the plurality of buffers has a fixed-size.

15. The method of claim 14 further comprising receiving an incoming information element; if the size of the information element is greater than the fixed-size of each of the plurality of buffers, dividing the information element into a plurality of information element segments, each of the plurality of information element segments having a size less than or equal to the fixed-size of each of the at least one buffer; and storing at least one of the information element and a particular one of the plurality of information element segments within a particular one of the plurality of buffers corresponding to the particular one of the plurality of buffers to the selected one of the plurality of memory channels at a particular one of the plurality of memory lines.

16. The method of claim 14 wherein each of the plurality of memory channels has a width equal to a width of the memory unit divided by the number of the plurality of memory channels.

17. A method to optimally access a single hierarchical level memory unit, where the memory unit is logically partitioned to form a plurality of memory channels, the plurality of memory channels are further logically partitioned to form a plurality of memory lines, each of the plurality of memory lines includes a plurality of buffers and each of the plurality of buffers corresponds to a separate one of the plurality of memory channels, comprising: determining, for each of the plurality of memory channels, at least one of the number of pending read requests, the number of pending write requests, and the number of active buffers which is the number of a particular one of the plurality of buffers that is unavailable and corresponds to the particular one of the plurality of memory channels in each of the plurality of memory lines; and selecting a particular one of the plurality of memory channels that has at least one of a lowest number of pending read requests, a lowest number of pending write requests, a lowest number of active buffers, and a corresponding channel identification number that is next in a round robin scheme.

18. The method of claim 17 wherein each of the plurality of buffers has a fixed-size.

19. The method of claim 18 further comprising receiving an incoming information element; if the size of the information element is greater than the fixed-size of each of the plurality of buffers, dividing the information element into a plurality of information element segments, each of the plurality of information element segments having a size less than or equal to the fixed-size of each of the at least one buffer; and storing at least one of the information element and a particular one of the plurality of information element segments within a particular one of the plurality of buffers corresponding to the selected one of the plurality of memory channels at a particular one of the plurality of memory lines.

20. The method of claim 17 wherein the single hierarchical level memory unit is a plurality of dynamic random access memory units.

21. The method of claim 19 wherein the step of selecting the particular one of the plurality of memory channels includes finding a first subset of the plurality of memory channels that is available at the particular one of the plurality of memory lines and has a lowest number of the pending read requests; if the number of memory channels within the first subset of the plurality of memory channels equals one, setting the selected one of the plurality of memory channels to the first subset of the plurality of memory channels; if the number of memory channels within the first subset of the plurality of memory channels is greater than one, then finding a second subset of the plurality of memory channels within the first subset of the plurality of memory channels that has the lowest number of the pending write requests; if the number of memory channels within the second subset of the plurality of memory channels equals one, setting the selected one of the plurality of memory channels to the second subset of the plurality of memory channels; if the number of memory channels within the second subset of the plurality of memory channels is greater than one, then finding a third subset of the plurality of memory channels within the second subset of the plurality of memory channels that has the lowest number of active buffers; if the number of memory channels within the third subset of the plurality of memory channels equals one, setting the selected one of the plurality of memory channels to the third subset of the plurality of memory channels; and if the number of memory channels within the third subset of the plurality of memory channels is greater than one, setting the selected one of the plurality of memory channels to a particular one of the third subset of the plurality of memory channels that has a corresponding channel identification number that is next in a round robin scheme.

22. The method of claim 19 wherein the step of selecting the particular one of the plurality of memory channels includes finding a first subset of the plurality of memory channels that has a lowest number of the pending read requests; if the number of memory channels within the first subset of the plurality of memory channels equals one, determining if the first subset of the plurality of memory channels at the particular one of the plurality of memory lines is available; if the first subset of the plurality of memory channels at the particular one of the plurality of memory lines is available, setting the selected one of the plurality of memory channels to the first subset of the plurality of memory channels; and if the first subset of the plurality of memory channels at the particular one of the plurality of memory lines is not available, determining if at least one of the information element and the particular one of the plurality of information element segments can be stored within any remaining one of the plurality of memory channels at the particular one of the plurality of memory lines without overloading that memory channel; if at least one of the information element and the particular one of the plurality of information element segments can be stored within any remaining one of the plurality of memory channels, finding a second subset of the plurality of memory channels that has a next lowest number of the pending read requests; and if at least one of the information element and the particular one of the plurality of information element segments cannot be stored within any remaining one of the plurality of memory channels, fetching a new one of the plurality of memory lines; and if the number of memory channels within the first subset of the plurality of memory channels is greater than one, setting the selected one of the plurality of memory channels to a particular one of the first subset of the plurality of memory channels that has at least one of a lowest number of pending write requests, a lowest number of active buffers, and a corresponding channel identification number that is next in a round robin scheme.

23. The method of claim 19 further comprising, upon storing at least one of the information element and the particular one of the plurality of information element segments within the particular one of the plurality of buffers corresponding to the selected one of the plurality of memory channels, setting a particular one of a plurality of payload channel occupancy bits that corresponds to the selected one of the plurality of memory channels.

24. The method of claim 19 further comprising, reading the plurality of payload channel occupancy bits to determine if a corresponding one of the plurality of memory channels is available.

25. The method of claim 19 further comprising, upon storing at least one of the information element and the particular one of the plurality of information element segments within the particular one of the plurality of buffers corresponding to the selected one of the plurality of memory channels at the particular one of the plurality of memory lines, writing a channel identification number corresponding to the selected one of the plurality of memory channels to a buffer location field within a payload channel sequence table that corresponds to the particular one of the plurality of buffers.

26. The method of claim 25 further comprising, upon storing at least one of the information element and the particular one of the plurality of information element segments within the particular one of the plurality of buffers corresponding to the selected one of the plurality of memory channels at the particular one of the plurality of memory lines, setting a value field within the payload channel sequence table that corresponds to the particular one of the plurality of buffers.

27. The method of claim 26 further comprising, upon storing at least one of the information element and the particular one of the plurality of information element segments within the particular one of the plurality of buffers corresponding to the selected one of the plurality of memory channels at the particular one of the plurality of memory lines, if the data within the particular one of the plurality of buffers signals an end-of-packet, setting an end-of-packet field corresponding to the particular one of the plurality of buffers within the payload channel sequence table.

28. The method of claim 27 further comprising fetching at least one of the information element and a portion of the information element by determining at least one memory channel that stores at least one of the information element and the portion of the information element by reading the buffer location field corresponding to each of the plurality of buffers at a particular one of the plurality of memory lines until an end-of-packet field corresponding to that buffer signals the end-of-packet; and reading the contents of each of an at least one buffer of the plurality of buffers at a particular one of the plurality of memory lines corresponding to each of the at least one memory channel.

29. A system to optimally access a memory unit, comprising: the memory unit that is logically partitioned to form a plurality of memory channels; a traffic analyzer to determine at least one load of each of the plurality of memory channels; and a bandwidth balancer to select a particular one of the plurality of memory channels based on the determined at least one load.

30. The system of claim 29 wherein the plurality of memory channels of the memory unit are further logically partitioned to form a plurality of memory lines, each of the plurality of memory lines includes a plurality of buffers and each of the plurality of buffers corresponds to a separate one of the plurality of memory channels.

31. The system of claim 29 further comprising a plurality of write payload channel queues, each of the plurality of write payload channel queues corresponds to a separate one of the plurality of memory channels, each of the plurality of write payload channel queues stores at least one of an information element and a particular one of the information element segments to be written to a corresponding one of the plurality of memory channels; a plurality of write request queues, each of the plurality of write request queues corresponds to a separate one of the plurality of write payload channel queues, a particular one of the plurality of write request queues stores a request to write the data within a corresponding one of the plurality of write payload channel queues to a corresponding one of the plurality of memory channels of the memory unit; a plurality of read payload channel queues, each of the plurality of read payload channel queues corresponds to a separate one of the plurality of memory channels, each of the plurality of read payload channel queues stores at least one of an information element and a particular one of the information element segments that is retrieved from the memory unit; and a plurality of read request queues, each of the plurality of read request queues corresponds to a separate one of the plurality of read payload channel queues, a particular one of the plurality of read request queues stores a request to retrieve from a corresponding one of the plurality of memory channels of the memory unit at least one of the information element and the particular one of the information element segments and store it in a corresponding one of the plurality of read payload channel queues.

32. The system of claim 31 wherein the traffic analyzer includes a plurality of pending write request counters to measure write request loads on the plurality of channels, each of the plurality of pending write request counters corresponds to a separate one of the plurality of write request queues; a plurality of pending read request counters to measure read request loads on the plurality of channels, each of the plurality of pending read request counters corresponds to a separate one of the plurality of read request queues; and a plurality of active buffer counters to measure stored data loads on the plurality of channels, each of the plurality of active buffer counters corresponds to a separate one of the plurality of write request queues that in turn corresponds to a particular one of the plurality of memory channels and each of the plurality of active buffer counters also corresponds to a separate one of the plurality of write request queues that in turn corresponds to the particular one of the plurality of memory channels.

33. The system of claim 32 wherein a particular one of the plurality of pending write request counters is incremented upon a corresponding one of the plurality of write request queues receiving a write request and decremented upon extracting the write request from the corresponding one of the plurality of write request queues; a particular one of the plurality of pending read request counters is incremented upon a corresponding one of the plurality of read request queues receiving a read request and decremented upon extracting the read request from the corresponding one of the plurality of read request queues; and a particular one of the plurality of active buffer counters is incremented upon a corresponding one of the plurality of write request queues receiving the write request and decremented upon a corresponding one of the plurality of read request queues receiving the read request.

34. The system of claim 29 further comprising a payload channel occupancy vector, each element of the payload channel occupancy vector corresponds to a separate one of the plurality of buffers at a particular one of the plurality of memory lines and each element of the payload channel occupancy vector indicates if a corresponding one of the plurality of memory buffers is available.

35. The system of claim 29 further comprising a payload channel sequence table to specify an at least one memory channel of the plurality of memory channels at which at least one of the information element and a portion of the information element is stored.

36. The system of claim 35 wherein the payload channel sequence table is partitioned to form a plurality of columns, each of the plurality of columns corresponds to a separate one of the plurality of memory channels, the plurality of columns are further partitioned to form a plurality of rows, each of the plurality of rows includes a plurality of buffer information units and each of the plurality of buffer information units includes a buffer location field that specifies a particular one of the plurality of memory channels at which a particular one of the plurality of buffers at a particular one of the plurality of memory lines stores at least one of the information element and a particular one of the plurality of information element segments; a value field that indicates whether the particular one of the plurality of buffers corresponding to the particular one of the plurality of memory channels at the particular one of the plurality of memory lines stores any data within that buffer; and an end-of-packet field that indicates whether the particular one of the plurality of buffers corresponding to the particular one of the plurality of memory channels at the particular one of the plurality of memory lines stores data that signals an end-of-packet.

37. The system of claim 29 wherein the memory unit is a plurality of dynamic random access memory units.

38. The system of claim 30 further comprising a buffer management unit to provide a pointer to a new one of the plurality of memory lines.

39. The system of claim 29 wherein each of the plurality of buffers has a length that is a fixed-size.

40. The system of claim 39 wherein each of the plurality of memory channels has a width that is the fixed-size.

41. A system to optimally access a memory unit, comprising: the memory unit that is logically partitioned to form a plurality of memory channels; a bandwidth management unit that includes a traffic analyzer to determine at least one load of each of the plurality of memory channels; and a bandwidth balancer to select a particular one of the plurality of memory channels based on the determined at least one load; and a policy control unit to provide at least one of an information element and a particular one of a plurality of information element segments for writing to the selected one of the plurality of memory channels.

42. The system of claim 41 further comprising a data buffer unit to temporarily store at least one of the information element and the particular one of the plurality of information element segments within a particular one of a plurality of write payload channel queues that corresponds to the selected one of the plurality of memory channels and writes the temporarily stored data to the selected one of the plurality of memory channels within the memory unit; and a forward processing unit that fetches at least one buffer of the plurality of buffers within the memory unit.

43. The system of claim 42 wherein the forward processing unit includes a plurality of read payload channel queues, each of the plurality of read payload channel queues corresponds to a separate one of the plurality of memory channels, each of the plurality of read payload channel queues stores at least one of an information element and a particular one of the information element segments that is retrieved from the memory unit; and a plurality of read request queues, each of the plurality of read request queues corresponds to a separate one of the plurality of read payload channel queues, a particular one of the plurality of read request queues stores a request to retrieve from a corresponding one of the plurality of memory channels of the memory unit at least one of the information element and the particular one of the information element segments and store it in a corresponding one of the plurality of read payload channel queues.

44. The system of claim 43 further comprising a payload channel sequence table to specify an at least one memory channel of the plurality of memory channels at which at least one of the information element and a portion of the information element is stored.

45. The system of claim 43 wherein the forward processing unit fetches at least one of the information element and the portion of the information element by accessing the payload channel sequence table to determine at least one memory channel within which at least one of the information element and the portion of the information element is stored, and for each of the at least one memory channel, sending a read request to a particular one of the plurality of read request queues that corresponds to that memory channel.

46. The system of claim 44 wherein the payload channel sequence table is partitioned to form a plurality of columns, each of the plurality of columns corresponds to a separate one of the plurality of memory channels, the plurality of columns are further partitioned to form a plurality of rows, each of the plurality of rows includes a plurality of buffer information units and each of the plurality of buffer information units includes a buffer location field that specifies a particular one of the plurality of memory channels at which a particular one of the plurality of buffers at a particular one of the plurality of memory lines stores at least one of the information element and a particular one of the plurality of information element segments; a value field that indicates whether the particular one of the plurality of buffers corresponding to the particular one of the plurality of memory channels at the particular one of the plurality of memory lines stores any data within that buffer; and an end-of-packet field that indicates whether the particular one of the plurality of buffers corresponding to the particular one of the plurality of memory channels at the particular one of the plurality of memory lines stores data that signals an end-of-packet.

47. The system of claim 46 wherein the forward processing unit fetches at least one of the information element and the portion of the information element by determining at least one memory channel that stores the at least one of the information element and the portion of the information element by traversing each of the plurality of buffer information units within a particular one of the plurality of rows of the payload channel sequence table and retrieving the particular one of the plurality of memory channels specified within the buffer location field until the end-of-packet field of that buffer information unit signals the end-of-packet; and for each of the at least one memory channel, sending a read request to a particular one of the plurality of read request queues that corresponds to that memory channel.

48. A program storage device readable by a computer system, storing a plurality of instructions to optimally access a memory unit where the memory unit is logically partitioned to form a plurality of memory channels, the plurality of memory channels are further logically partitioned to form a plurality of memory lines, each of the plurality of memory lines includes a plurality of buffers and each of the plurality of buffers corresponds to a separate one of the plurality of memory channels, comprising: instructions for determining at least one load value of each of the plurality of memory channels; and instructions for selecting a particular one of the plurality of memory channels based on the determined at least one load value.

49. The device of claim 48 wherein the instructions for determining the at least one load value of each of the plurality of memory channels includes instructions for determining, for each of the plurality of memory channels, at least one of the number of pending read requests, the number of pending write requests, and the number of active buffers which is the number of a particular one of the plurality of buffers that is unavailable and corresponds to the particular one of the plurality of memory channels in each of the plurality of memory lines.

50. The device of claim 48 wherein the instructions for selecting the particular one of the plurality of memory channels includes instructions for selecting the particular one of the plurality of memory channels that has at least one of a lowest number of pending read requests, a lowest number of pending write requests, a lowest number of active buffers, and a corresponding channel identification number that is next in a round robin scheme.