WO2002011368A2 - Pre-fetching and caching data in a communication processor's register set - Google Patents

Publication number
WO2002011368A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
event
context
buffers
buffer
Prior art date
Application number
PCT/US2001/041485
Other languages
French (fr)
Other versions
WO2002011368A3 (en
Inventor
Duane E. Galbi
Wilson P. Ii Snyder
Daniel J. Lussier
Joseph B. Tompkins
Bruce G. Burns
Original Assignee
Conexant Systems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US09/640,231 external-priority patent/US6804239B1/en
Application filed by Conexant Systems, Inc. filed Critical Conexant Systems, Inc.
Priority to AU2001285384A priority Critical patent/AU2001285384A1/en
Publication of WO2002011368A2 publication Critical patent/WO2002011368A2/en
Publication of WO2002011368A3 publication Critical patent/WO2002011368A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/30123Organisation of register space, e.g. banked or distributed register file according to context, e.g. thread buffers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824Operand accessing
    • G06F9/383Operand prefetching

Definitions

  • the present invention is related to the field of communications, and more particularly to integrated circuits that process communication packets.
  • each packet contains a header and a payload.
  • the header contains control information, such as addressing or channel information, that indicates how the packet should be handled.
  • the payload contains the information that is being transferred.
  • Some examples of the types of packets used in communication systems include Asynchronous Transfer Mode (ATM) cells, Internet Protocol (IP) packets, frame relay packets, Ethernet packets, or some other packet-like information block.
  • packet is intended to include packet segments.
  • Integrated circuits termed "traffic stream processors" have been designed to apply robust functionality to high-speed packet streams. Robust functionality is critical with today's diverse but converging communication systems. Stream processors must handle multiple protocols and inter-work between streams of different protocols. Stream processors must also ensure that quality-of-service constraints, priority, and bandwidth requirements are met. This functionality must be applied differently to different streams, and there may be thousands
  • the integrated circuit includes a core processor.
  • the processor handles a series of tasks, termed "events". These events consist of tasks such as CPU processing steps as well as the scheduling of subsequent events. These subsequently scheduled events may consist of CAM lookups, DMA data transfers, or other generic events based on conditions in the current event. All events have an associated service address, "context information" and "data”.
  • Information about the event such as the resource that requested the event, how much data is associated with the event, and other key information from the event requestor is stored in "special state" information associated with the event. Most events have an associated service address, "context information" and "data”.
  • the external resource supplies the core processor with a memory pointer to "context” information and also supplies the data to be associated with the event.
  • the context pointer is used to fetch the context from external memory and to store this "context" information in memory located on the chip. If the required context data has already been fetched onto the chip, the hardware recognizes this fact and sets the on chip context pointer to point to this already pre-fetched context data.
  • In order to process an event, the core processor needs the service address of the event as well as the "context" and "data" associated with the event.
  • the service address is the starting address for the instructions used to service the event.
  • the core processor branches to the service address in order to start servicing the event.
  • the core processor needs to access a portion of the "context" associated with the event so the appropriate part of the "context” is read into the core processor's local registers. When this is done, the core processor can read, and if appropriate modify, the "context" values. However, when the core processor modifies a "context” value, the "context" values stored outside of the core processor register must be updated to reflect this change. This can happen under direct programmer control or using the method described in the above referenced patent (U.S. Patent 5,748,630). The "data" associated with an event is handled in a manner similar to that described for the "context".
  • the processing core performed a register read which returned a pointer to the context, data, and service address associated with the next event.
  • the processing core then needed to explicitly read the context and data into its internal register set.
  • data and context information for a number of events are stored in buffers in a coprocessor.
  • the present invention frees the core processor from performing the explicit read operation required to read data into the internal register set.
  • the present invention expands the processor's register set and provides a "shadow register" set. While the core processor is processing one event, the "context" and "data" and some other associated information for the next event is loaded into the shadow register set. When the core processor finishes processing an event, the core processor switches to the shadow register set and it can begin processing the next event immediately. With short service routines, there might not be time to fully pre-fetch the "context" and "data" associated with the next event before the current event ends. In this case, the core processor still starts processing the next event and the pre-fetch continues during the event processing.
  • If the core processor accesses a register that is associated with a part of the context for which the pre-fetch is still in progress, the core processor will automatically stall or delay until the pre-fetch has completed reading the appropriate data.
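The stall-on-access behavior described above can be illustrated with a minimal software sketch. This is only a toy model under stated assumptions: the class name `ShadowRegisterBank`, the per-register valid flags, and the one-register-per-cycle fill rate are all hypothetical and are not taken from the patent, which describes hardware.

```python
class ShadowRegisterBank:
    """Toy model of a shadow register bank filled by a pre-fetch engine.

    Assumption (not from the source): each register has a valid flag and
    the pre-fetch delivers one register per cycle.
    """

    def __init__(self, size):
        self.values = [0] * size
        self.valid = [False] * size   # True once the pre-fetch has written the register
        self.stall_cycles = 0         # cycles the core spent waiting on the pre-fetch

    def prefetch_step(self, pending):
        """Deliver one pending (index, value) pair, if any remain."""
        if pending:
            index, value = pending.pop(0)
            self.values[index] = value
            self.valid[index] = True

    def read(self, index, pending):
        """Core read: stall until the pre-fetch has delivered this register."""
        while not self.valid[index]:
            if not pending:
                raise RuntimeError("register was never scheduled for pre-fetch")
            self.stall_cycles += 1
            self.prefetch_step(pending)
        return self.values[index]


def demo_shadow_registers():
    bank = ShadowRegisterBank(4)
    pending = [(0, 0xA0), (1, 0xA1), (2, 0xA2), (3, 0xA3)]
    bank.prefetch_step(pending)      # only register 0 arrived before the switch
    first = bank.read(0, pending)    # already valid, so no stall
    third = bank.read(2, pending)    # stalls while registers 1 and 2 arrive
    return first, third, bank.stall_cycles
```

In this sketch the core starts the next event as soon as it switches banks and pays a delay only when it touches a register that has not yet arrived, which is the trade-off the text describes for short service routines.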
  • Logic has been provided to handle several special situations, which are created by the use of the shadow registers, and to provide the programmer with control over the pre-fetching and service address selection progress.
  • special state information is effectively stored together with associated data in data buffers.
  • the data buffers do not have associated in-use counters.
  • separate logical buffers are provided for special state information and for the associated data buffer.
  • each data buffer and each special state information buffer (hereinafter termed resources) has an associated in-use counter. Multiple events can share the same resource.
  • the counter associated with a resource is incremented when a resource becomes associated with a particular event.
  • the counter associated with a resource is decremented when an event completes the use of that particular resource.
  • When the in-use count for a resource becomes zero, the count indicates that the resource is unassigned and can be assigned to a new event.
  • two events can point to (i.e. utilize) the same data buffer and/or the same special state information buffer.
  • content of a data buffer or a special state information buffer can be passed directly from one event to another event without reading the data into and out of memory.
  • the in-use counter is particularly useful to facilitate the timing of DMA requests without need for explicit control by an external program.
  • two events can use the same data buffer. This is possible since the special state information is stored in a separate buffer. Furthermore, one can have one data buffer associated with multiple context buffers since the special state information is stored separately from the associated data.
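The in-use counting scheme described above can be sketched in software as follows. This is a minimal illustration, not the patented circuit: the names `Resource`, `ResourcePool`, `allocate`, `attach`, and `release` are hypothetical and chosen only for this example.

```python
class Resource:
    """A data buffer or special state information buffer with an in-use counter."""

    def __init__(self, name):
        self.name = name
        self.in_use = 0  # zero means the resource is unassigned


class ResourcePool:
    def __init__(self, names):
        self.resources = [Resource(name) for name in names]

    def allocate(self):
        """Assign the first unassigned resource to a new event."""
        for resource in self.resources:
            if resource.in_use == 0:
                resource.in_use = 1
                return resource
        return None  # every resource is still in use

    def attach(self, resource):
        """A further event starts using an already assigned resource."""
        if resource.in_use == 0:
            raise ValueError("resource is not assigned to any event")
        resource.in_use += 1

    def release(self, resource):
        """An event completes its use; returns True when the resource is free."""
        if resource.in_use == 0:
            raise ValueError("resource is already unassigned")
        resource.in_use -= 1
        return resource.in_use == 0
```

For example, a first event can call `allocate()`, hand the same buffer to a second event with `attach()`, and then call `release()`. The buffer only becomes available again after the second event also releases it, so its content passes between events without a round trip through memory.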
  • Some embodiments also add a communication mechanism which allows an event to pass a multi-bit message to subsequent events. This message passing mechanism does not require that the two events share any of the same context, data, or special state resources.
  • Figure 1 is an overall block diagram of a packet processing integrated circuit in an example of the invention.
  • Figure 2 is a block diagram that illustrates packet processing stages and the pipe-lining used by the circuit in an example of the invention.
  • Figure 3 is a diagram illustrating circuitry in the co-processing relating to context and data buffer processing in an example of the invention.
  • Figure 4 is a block program flow diagram illustrating buffer correlation and in-use counts in an example of the invention.
  • Figure 5 is a block diagram of the buffer management circuitry in an example of the invention.
  • Figure 6 is a block diagram showing the transfer queue and registers in the core processor in an example of the invention.
  • Figure 7 is a block program flow diagram illustrating an operation in an example of the invention.
  • Figure 8 is a block diagram showing the details of the data and special state information buffers in an example of the invention.
  • Figure 9 is a block program flow diagram illustrating how data buffers are passed between events in an example of the invention.
  • Figure 10 is a block program flow diagram illustrating how state information buffers are passed between events in an example of the invention.
  • Figures 11A and 11B are block program flow diagrams illustrating how DMA commands are handled in an example of the invention.
  • One embodiment of the present invention described herein is applied as an improvement to the type of integrated circuit described in co-pending patent applications 60/211,863 filed on June 14, 2000, 09/640,260 filed on August 16, 2000, 09/639,915 filed on August 16, 2000, 09/639,966 filed on August 16, 2000, 09/640,258 filed on August 16, 2000 and 09/640,231 filed on August 17, 2000, the content of which is hereby incorporated herein by reference in order to shorten and simplify the description of the present application.
  • FIG. 1 is a block diagram that illustrates a packet processing integrated circuit 100 in an example of the invention. It should be understood that the present invention can also be applied to other types of processors. The operation of the circuit 100 will first be described with reference to Figures 1 to 4 and then the operation of the present invention will be described with reference to Figures 5 to 11A.
  • Integrated circuit 100 includes a core processor 104, a scheduler 105, receive interface 106, co-processor circuitry 107, transmit interface 108, and memory interface 109. These components may be interconnected through a memory crossbar or some other type of internal interface. Receive interface 106 is coupled to communication system 101. Transmit interface 108 is coupled to communication system 102. Memory interface 109 is coupled to memory 103. Communication system 101 could be any device that supplies communication packets, with one example being the switching fabric in an Asynchronous Transfer Mode (ATM) switch. Communication system 102 could be any device that receives communication packets, with one example being the physical line interface in the ATM switch.
  • Memory 103 could be any memory device with one example being Random Access Memory (RAM) integrated circuits.
  • Receive interface 106 could be any circuitry configured to receive packets with some examples including UTOPIA interfaces or Peripheral Component Interconnect (PCI) interfaces.
  • Transmit interface 108 could be any circuitry configured to transfer packets with some examples including UTOPIA interfaces or PCI interfaces.
  • Core processor 104 is a micro-processor that executes networking application software. Core processor 104 supports an instruction set that has been tuned for networking operations, especially context switching. As described herein, core processor 104 has the following characteristics: 166 MHz, pipelined single-cycle operation, RISC-based design, 32-bit instruction and register set, K instruction cache, 8 KB zero-latency scratchpad memory, interrupt/trap/halt support, and C compiler readiness.
  • Scheduler 105 comprises circuitry configured to schedule and initiate packet processing that typically results in packet transmissions from integrated circuit 100, although scheduler 105 may also schedule and initiate other activities. Scheduler 105 schedules upcoming events, and as time passes, selects scheduled events for processing and re-schedules unprocessed events.
  • Scheduler 105 transfers processing requests for selected events to co-processor circuitry 107.
  • Scheduler 105 can handle multiple independent schedules to provide prioritized scheduling across multiple traffic streams.
  • scheduler 105 may execute a guaranteed cell rate algorithm to implement a leaky bucket or a token bucket scheduling system.
  • the guaranteed cell rate algorithm is implemented through a cache that holds algorithm parameters.
  • Scheduler 105 is described in detail in the above referenced co- pending patent applications.
  • Co-processor circuitry 107 receives communication packets from receive interface 106 and memory interface 109 and stores the packets in internal data buffers. Co-processor circuitry 107 correlates each packet to context information describing how the packet should be handled. Co-processor circuitry 107 stores the correlated context information in internal context buffers and associates individual data buffers with individual context buffers to maintain the correlation between individual packets and context information. Importantly, co-processor circuitry 107 ensures that only one copy of the correlated context information is present in the context buffers to maintain coherency. Multiple data buffers are associated with a single context buffer to maintain the correlation between the multiple packets and the single copy of the context information. Co-processor circuitry 107 also determines a prioritized processing order for core processor 104.
  • the prioritized processing order controls the sequence in which core processor 104 handles the communication packets.
  • the prioritized processing order is typically based on the availability of all of the resources and information that are required by core processor 104 to process a given communication packet. Resource state bits are set when resources become available, so co-processor circuitry 107 may determine when all of these resources are available by processing the resource state bits. If desired, the prioritized processing order may be based on information in packet handling requests. Co-processor circuitry 107 selects scheduling algorithms based on internal scheduling state bits and uses the selected scheduling algorithms to determine the prioritized processing order.
  • co-processor circuitry 107 is externally controllable. Co-processor circuitry 107 is described in more detail with respect to FIGS. 2-4.
  • Memory interface 109 comprises circuitry configured to exchange packets with external buffers in memory 103.
  • Memory interface 109 maintains a pointer cache that holds pointers to the external buffers.
  • Memory interface 109 allocates the external buffers when entities, such as core processor 104 or co-processor circuitry 107, read pointers from the pointer cache.
  • Memory interface 109 deallocates the external buffers when the entities write the pointers to the pointer cache.
  • external buffer allocation and de-allocation is available through an on-chip cache read/write.
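The read-to-allocate and write-to-deallocate convention of the pointer cache can be sketched as below. This is a hedged software analogy only: the class `PointerCache`, its method names, and the use of `None` to signal exhaustion are assumptions for illustration, not details given in the source.

```python
from collections import deque


class PointerCache:
    """Toy pointer cache: reading a pointer allocates an external buffer,
    writing a pointer back deallocates it."""

    def __init__(self, free_pointers):
        self._free = deque(free_pointers)

    def read(self):
        """Allocate: hand out a pointer to a free external buffer."""
        if not self._free:
            return None  # assumed way to signal external buffer exhaustion
        return self._free.popleft()

    def write(self, pointer):
        """Deallocate: return the pointer so the buffer can be reused."""
        self._free.append(pointer)

    def free_count(self):
        return len(self._free)
```

A caller such as the core processor would therefore never run a separate allocator routine: one read obtains a buffer and one write gives it back.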
  • Memory interface 109 also manages various external buffer classes, and handles conditions such as external buffer exhaustion. Memory interface 109 is described in detail in the above referenced patent applications.
  • receive interface 106 receives new packets from communication system 101, and scheduler 105 initiates transmissions of previously received packets that are typically stored in memory 103.
  • receive interface 106 and scheduler 105 transfer requests to co-processor circuitry 107.
  • core processor 104 may also request packet handling from co-processor circuitry 107.
  • Co-processor circuitry 107 fields the requests, correlates the packets with their respective context information, and creates a prioritized work queue for core processor 104.
  • Core processor 104 processes the packets and context information in order from the prioritized work queue.
  • co-processor circuitry 107 operates in parallel with core processor 104 to offload the context correlation and prioritization tasks to conserve important core processing capacity.
  • In response to packet handling, core processor 104 typically initiates packet transfers to either memory 103 or communication system 102. If the packet is transferred to memory 103, then core processor 104 instructs scheduler 105 to schedule and initiate future packet transmission or processing.
  • scheduler 105 operates in parallel with core processor 104 to offload scheduling tasks and conserve important core processing capacity.
  • Various data paths are used in response to core processor 104 packet transfer instructions.
  • Co-processor circuitry 107 transfers packets directly to communication system 102 through transmit interface 108.
  • Co-processor circuitry 107 transfers packets to memory 103 through memory interface 109 with an on-chip pointer cache.
  • Memory interface 109 transfers packets from memory 103 to communication system 102 through transmit interface 108.
  • Co-processor circuitry 107 transfers context information from a context buffer through memory interface 109 to memory 103 if there are no packets in the data buffers that are correlated with the context information in the context buffer.
  • memory interface 109 operates in parallel with core processor 104 to offload external memory management tasks and conserve important core processing capacity.
  • FIGS. 2-4 depict a specific example of co-processor circuitry. Those skilled in the art will understand that Figures 2-4 have been simplified for clarity.
  • FIG. 2 illustrates how co-processor circuitry 107 provides pipe-lined operation.
  • FIG. 2 is vertically separated by dashed lines that indicate five packet processing stages: 1) context resolution, 2) context fetching, 3) priority queuing, 4) software application, and 5) context flushing.
  • Co-processor circuitry 107 handles stages 1-3 to provide hardware acceleration.
  • Core processor 104 handles stage 4 to provide software control with optimized efficiency due to stages 1-3.
  • Co-processor circuitry 107 also handles stage 5.
  • Co-processor circuitry 107 has eight pipelines through stages 1-3 and 5 to concurrently process multiple packet streams.
  • requests to handle packets are resolved to a context for each packet in the internal data buffers.
  • the requests are generated by receive interface 106, scheduler 105, and core processor 104 in response to incoming packets, scheduled transmissions, and application software instructions.
  • the context information includes a channel descriptor that has information regarding how packets in one of 64,000 different channels are to be handled.
  • a channel descriptor may indicate service address information, traffic management parameters, channel status, stream queue information, and thread status.
  • 64,000 channels with different characteristics are available to support a wide array of service differentiation.
  • Channel descriptors are identified by channel identifiers.
  • Channel identifiers may be indicated by the request.
  • a map may be used to translate selected bits from the packet header to a channel identifier.
  • a hardware engine may also perform a sophisticated search for the channel identifier based on various information. Different algorithms that calculate the channel identifier from the various information may be selected by setting correlation state bits in co-processor circuitry 107. Thus, the technique used for context resolution is externally controllable.
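As a rough illustration of the map-based form of context resolution, the sketch below extracts selected header bits and looks them up in a table. The field position (`FIELD_SHIFT`, `FIELD_MASK`), the 32-bit header word, and the function names are all hypothetical; the source does not specify which header bits are selected or how the map is stored.

```python
# Hypothetical layout: the channel is selected by a 12-bit field at bits 4..15
# of a 32-bit header word. Real field positions depend on the protocol.
FIELD_SHIFT = 4
FIELD_MASK = 0xFFF


def select_bits(header_word):
    """Extract the header bits that are used as the map key."""
    return (header_word >> FIELD_SHIFT) & FIELD_MASK


def resolve_channel(header_word, channel_map):
    """Translate the selected header bits to a channel identifier.

    Returns None when the map has no entry, in which case a real device
    might fall back to a more elaborate search (an assumption here).
    """
    return channel_map.get(select_bits(header_word))
```

For instance, with `channel_map = {0x02A: 7}`, any header word whose bits 4..15 equal `0x02A` resolves to channel identifier 7 regardless of its other bits.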
  • context information is fetched, if necessary, by using the channel identifiers to transfer the channel descriptors to internal context buffers. Prior to the transfer, the context buffers are first checked for a matching channel identifier and validity bit. If a match is found, then the context buffer with the existing channel descriptor is associated with the corresponding internal data buffer holding the packet.
  • requests with available context are prioritized and arbitrated for core processor 104 handling.
  • the priority may be indicated by the request - and it may be the source of the request.
  • the priority queues 1-12 are 8 entries deep. Priority queues 1-12 are also ranked in a priority order by queue number.
  • the priority for each request is determined, and when the context and data buffers for the request are valid, an entry for the request is placed in one of the priority queues that corresponds to the determined priority.
  • the entries in the priority queues point to a pending request state RAM that contains state information for each data buffer.
  • the state information includes a data buffer pointer, a context pointer, a context validity bit, a requester indicator, port status, and a channel descriptor loaded indicator. This state information was referred to earlier in this document as the special state information associated with an event. These two terms may be used interchangeably.
  • the work queue indicates the selected priority queue entry that core processor 104 should handle next.
  • the requests in priority queues are arbitrated using one of various algorithms such as round robin, service-to-completion, weighted fair queuing, simple fairness, first-come first-serve, allocation through priority promotion, and software override.
  • the algorithms may be selected through scheduling state bits in co-processor circuitry 107.
  • Co-processor circuitry 107 loads core processor 104 registers with the channel descriptor information for the next entry in the work queue.
  • core processor 104 executes the software application to process the next entry in the work queue, which points to a portion of the pending request state RAM that identifies the data buffer and context buffer.
  • the context buffer indicates one or more service addresses that direct the core processor 104 to the proper functions within the software application.
  • One such function of the software application is traffic shaping to conform to service level agreements.
  • Other functions include header manipulation and translation, queuing algorithms, statistical accounting, buffer management, inter-working, header encapsulation or stripping, cyclic redundancy checking, segmentation and reassembly, frame relay formatting, multicasting, and routing. Any context information changes made by the core processor are linked back to the context buffer in real time. In stage 5, context is flushed.
  • core processor 104 instructs co-processor circuitry 107 to transfer packets to off-chip memory 103 or transmit interface 108. If no other data buffers are currently associated with the pertinent context information, then co-processor circuitry 107 transfers the context information to off-chip memory 103.
  • FIG. 3 is a block diagram that illustrates co-processor circuitry 107 in an example of the invention.
  • Co-processor circuitry 107 comprises a hardware engine that is firmware-programmable in that it operates in response to state bits and register content.
  • core processor 104 is a micro-processor that executes application software.
  • Co-processor circuitry 107 operates in parallel with core processor 104 to conserve core processor 104 capacity by off-loading numerous tasks from the core processor 104.
  • Co-processor circuitry 107 comprises context resolution 310, control 311, arbiter 312, priority queues 313, data buffers 314, context buffers 315, context DMA 316, and data DMA 317.
  • Data buffers 314 hold packets and context buffers 315 hold context information, such as a channel descriptor.
  • Data buffers 314 are relatively small and of a fixed size, such as 64 bytes, so if the packets are ATM cells, each data buffer holds only a single ATM cell and ATM cells do not cross data buffer boundaries.
  • Individual data buffers 314 are associated with individual context buffers 315.
  • Priority queues 313 hold entries that represent individual data buffers 314 as indicated by the upward arrows.
  • a packet in one of the data buffers is associated with its context information in an associated one of the context buffers 315 and with an entry in priority queues 313.
  • Arbiter 312 presents a next entry from priority queues 313 to core processor 104 which handles the associated packet in the order determined by arbiter 312.
  • Context DMA 316 exchanges context information between memory 103 and context buffers 315 through memory interface 109.
  • Context DMA automatically updates queue pointers in the context information.
  • Data DMA 317 exchanges packets between data buffers 314 and memory 103 through memory interface 109.
  • Data DMA 317 also transfers packets from memory 103 to transmit interface 108 through memory interface 109.
  • Data DMA 317 signals context DMA 316 when transferring packets off-chip, and context DMA 316 determines if the associated context should be transferred to off-chip memory 103.
  • Both DMAs 316-317 may be configured to perform CRC calculations.
  • control 311 receives the new packet and a request to handle the new packet from receive interface 106.
  • Control 311 receives and places the packet in one of the data buffers 314 and transfers the packet header to context resolution 310. Based on gap state bits, a gap in the packet may be created between the header and the payload in the data buffer, so core processor 104 can subsequently write encapsulation information to the gap without having to create the gap.
  • Context resolution 310 processes the packet header to correlate the packet with a channel descriptor, although in some cases, receive interface 106 may have already performed this context resolution.
  • the channel descriptor comprises information regarding packet transfer over a channel.
  • Control 311 determines if the channel descriptor that has been correlated with the packet is already in one of the context buffers 315 and is valid. If so, control 311 does not request the channel descriptor from off-chip memory 103. Instead, control 311 associates the particular data buffer 314 holding the new packet with the particular context buffer 315 that already holds the correlated channel descriptor. This prevents multiple copies of the channel descriptor from existing in context buffers 315. Control 311 then increments an in-use count for the channel descriptor to track the number of data buffers 314 that are associated with the same channel descriptor.
  • control 311 requests the channel descriptor from context DMA 316.
  • Context DMA 316 transfers the requested channel descriptor from off-chip memory 103 to one of the context buffers 315 using the channel descriptor identifier, which may be an address, that was determined during context resolution.
  • Control 311 associates the context buffer 315 holding the transferred channel descriptor with the data buffer 314 holding the new packet to maintain the correlation between the new packet and the channel descriptor.
  • Control 311 also sets the in-use count for the transferred channel descriptor to one and sets the validity bit to indicate context information validity.
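The lookup-or-fetch behavior of control 311 described above can be sketched in software. This is a simplified model under assumptions: `ContextBuffer`, `ContextBufferPool`, and `acquire` are hypothetical names, external memory is modeled as a dictionary keyed by channel identifier, and the fetch is instantaneous rather than a DMA transfer.

```python
class ContextBuffer:
    def __init__(self):
        self.channel_id = None
        self.descriptor = None
        self.valid = False   # validity bit
        self.in_use = 0      # number of data buffers associated with this context


class ContextBufferPool:
    def __init__(self, count, external_memory):
        self.buffers = [ContextBuffer() for _ in range(count)]
        self.external_memory = external_memory  # channel_id -> descriptor (assumed model)
        self.fetches = 0                        # how many off-chip fetches occurred

    def acquire(self, channel_id):
        """Return the context buffer for channel_id, fetching it only if needed."""
        # Reuse a valid on-chip copy so only one copy of the descriptor exists.
        for buf in self.buffers:
            if buf.valid and buf.channel_id == channel_id:
                buf.in_use += 1
                return buf
        # Otherwise fetch the descriptor into a buffer that holds no valid context.
        for buf in self.buffers:
            if not buf.valid:
                buf.channel_id = channel_id
                buf.descriptor = dict(self.external_memory[channel_id])
                buf.valid = True
                buf.in_use = 1
                self.fetches += 1
                return buf
        return None  # no context buffer available
```

Two packets on the same channel then share one context buffer: the second `acquire` call finds the valid copy, increments the in-use count to two, and causes no second fetch.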
  • Control 311 also determines a priority for the new packet.
  • the priority may be determined by the source of the new packet, header information, or channel descriptor.
  • Control 311 places an entry in one of priority queues 313 based on the priority. The entry indicates the data buffer 314 that has the new packet.
  • Arbiter 312 implements an arbitration scheme to select the next entry for core processor 104. Core processor 104 reads the next entry and processes the associated packet and channel descriptor in the particular data buffer 314 and context buffer 315 indicated in the next entry.
  • Each priority queue has a service-to-completion bit and a sleep bit.
  • When the service-to-completion bit is set, the priority queue has a higher priority than any priority queues without the service-to-completion bit set.
  • When the sleep bit is set, the priority queue is not processed until the sleep bit is cleared.
  • the ranking of the priority queue number breaks priority ties.
  • Each priority queue has a weight from 0-15 to ensure a certain percentage of core processor handling. After an entry from a priority queue is handled, its weight is decremented by one if the service-to-completion bit is not set.
  • the weights are re-initialized to a default value after 128 requests have been handled or if all weights are zero.
  • Each priority queue has a high and low watermark. When outstanding requests that are entered in a priority queue exceed its high watermark, the service-to-completion bit is set. When the outstanding requests fall to the low watermark, the service-to-completion bit is cleared.
  • the high watermark is typically set at the number of data buffers allocated to the priority queue.
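  • The arbitration rules above can be sketched in software as follows. This is only an illustrative model, not the circuitry of arbiter 312: the class and method names are hypothetical, the lower queue number is assumed to win ties, and a queue whose weight has reached zero is assumed to be skipped while another ready queue still has weight (the text does not spell out how weights gate selection).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PriorityQueue:
    number: int
    default_weight: int              # 0-15 per the text
    high_watermark: int
    low_watermark: int
    service_to_completion: bool = False
    sleep: bool = False
    entries: deque = field(default_factory=deque)
    weight: int = field(init=False, default=0)

    def __post_init__(self):
        self.weight = self.default_weight

    def post(self, entry):
        """Add an entry; exceeding the high watermark sets service-to-completion."""
        self.entries.append(entry)
        if len(self.entries) > self.high_watermark:
            self.service_to_completion = True


class Arbiter:
    REINIT_INTERVAL = 128            # weights reset after this many handled requests

    def __init__(self, queues):
        self.queues = sorted(queues, key=lambda q: q.number)
        self.handled = 0

    def select(self):
        """Return (queue number, entry) for the next entry, or None if nothing is ready."""
        ready = [q for q in self.queues if q.entries and not q.sleep]
        if not ready:
            return None
        urgent = [q for q in ready if q.service_to_completion]
        if urgent:
            chosen = urgent[0]       # assumption: lowest queue number breaks ties
        else:
            weighted = [q for q in ready if q.weight > 0]
            chosen = (weighted or ready)[0]
        entry = chosen.entries.popleft()
        if not chosen.service_to_completion:
            chosen.weight = max(0, chosen.weight - 1)
        elif len(chosen.entries) <= chosen.low_watermark:
            chosen.service_to_completion = False
        self.handled += 1
        if self.handled >= self.REINIT_INTERVAL or all(q.weight == 0 for q in self.queues):
            for q in self.queues:
                q.weight = q.default_weight
            self.handled = 0
        return chosen.number, entry
```

  • In this sketch a queue that crosses its high watermark is drained ahead of the others until it falls to its low watermark, which mirrors the service-to-completion behavior described above.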
  • The context buffers 315 each have an associated in-use counter.
  • The in-use counters associated with the context buffers are not shown in Figure 3, but they are shown in Figure 8.
  • Core processor 104 may instruct control 311 to transfer the packet to off-chip memory 103 through data DMA 317.
  • Control 311 decrements the context buffer in-use count, and if the in-use count is zero (no data buffers 314 are associated with the context buffer 315 holding the channel descriptor), then control 311 instructs context DMA 316 to transfer the channel descriptor to off-chip memory 103.
  • Control 311 also clears the validity bit. This same general procedure is followed when scheduler 105 requests packet transmission, except that in response to the request from scheduler 105, control 311 instructs data DMA 317 to transfer the packet from memory 103 to one of data buffers 314.
  • FIG. 4 is a flow diagram that illustrates the operation of co-processor circuitry 107 when correlating buffers in an example of the invention. Coprocessor circuitry 107 has eight pipelines to concurrently process multiple packet streams in accord with FIG. 3.
  • a packet is stored in a data buffer, and the packet is correlated to a channel descriptor as identified by a channel identifier.
  • the channel descriptor comprises the context information regarding how packets in one of 64,000 different channels are to be handled.
  • Context buffers 315 are checked for a valid version of the correlated channel descriptor. This entails matching the correlated channel identifier with a channel identifier in a context buffer that is valid. If the correlated channel descriptor is not in a context buffer that is valid, then the channel descriptor is retrieved from memory 103 and stored in a context buffer using the channel identifier. The data buffer holding the packet is associated with the context buffer holding the transferred channel descriptor. An in-use count for the context buffer holding the channel descriptor is set to one. A validity bit for the context buffer is set to indicate that the channel descriptor in the context buffer is valid. If the correlated channel descriptor is already in a context buffer that is valid, then the data buffer holding the packet is associated with the context buffer already holding the channel descriptor. The in-use count for the context buffer holding the channel descriptor is incremented.
  • core processor 104 instructs co-processor circuitry 107 to transfer packets to off-chip memory 103 or transmit interface 108.
  • Data DMA 317 transfers the packet and signals context DMA 316 when finished.
  • Context DMA 316 decrements the in-use count for the context buffer holding the channel descriptor, and if the decremented in-use count equals zero, then context DMA 316 transfers the channel descriptor to memory 103 and clears the validity bit for the context buffer.
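  • The correlation flow of FIG. 4 can be modeled with a minimal sketch such as the one below. It is not the hardware implementation: `ContextCache`, `acquire`, and `release` are hypothetical names, a Python dictionary stands in for off-chip memory 103, and the first invalid buffer is assumed to be the one chosen for a newly fetched channel descriptor.

```python
class ContextBuffer:
    def __init__(self):
        self.channel_id = None
        self.descriptor = None
        self.in_use = 0
        self.valid = False


class ContextCache:
    def __init__(self, num_buffers, external_memory):
        self.buffers = [ContextBuffer() for _ in range(num_buffers)]
        self.memory = external_memory     # channel identifier -> channel descriptor
        self.fetches = 0                  # counts transfers from external memory

    def acquire(self, channel_id):
        """Return the index of the context buffer correlated with channel_id."""
        for index, buf in enumerate(self.buffers):
            if buf.valid and buf.channel_id == channel_id:
                buf.in_use += 1           # descriptor already on chip: share it
                return index
        for index, buf in enumerate(self.buffers):
            if not buf.valid:
                buf.channel_id = channel_id
                buf.descriptor = dict(self.memory[channel_id])  # fetch a copy
                buf.in_use = 1
                buf.valid = True
                self.fetches += 1
                return index
        raise RuntimeError("no free context buffer")

    def release(self, index):
        """Called when a packet using this context has been transferred out."""
        buf = self.buffers[index]
        buf.in_use -= 1
        if buf.in_use == 0:
            self.memory[buf.channel_id] = buf.descriptor   # write the descriptor back
            buf.valid = False
```

  • Two packets on the same channel share one context buffer in this model, and the channel descriptor is written back only when the last of them has been released.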
  • The effect of DMA operations on the in-use counts of the special state buffers and the data buffers will be explained later. Figures 11A and 11B will be used to illustrate these operations.
  • FIGS. 5-6 depict a specific example of memory interface circuitry in accord with the present invention. Those skilled in the art will appreciate numerous variations from the circuitry shown in this example may be made. Furthermore, those skilled in the art will appreciate that some conventional aspects of FIGS. 5-6 have been simplified or omitted for clarity.
  • FIG. 5 is a block diagram that illustrates memory interface 109.
  • Memory interface 109 comprises a hardware circuitry engine that is firmware-programmable in that it operates in response to state bits and register content.
  • core processor 104 is a micro-processor that executes application software. Memory interface 109 operates in parallel with core processor 104 to conserve core processor 104 capacity by off-loading numerous tasks from the core processor 104.
  • FIG. 1 and FIG. 5 show memory 103, core processor 104, coprocessor circuitry 107, transmit interface 108, and memory interface 109.
  • Memory 103 comprises Static RAM (SRAM) 525 and Synchronous Dynamic RAM (SDRAM) 526, although other memory systems could also be used.
  • SDRAM 526 comprises pointer stack 527 and external buffers 528.
  • Memory interface 109 comprises buffer management engine 520, SRAM interface 521, and SDRAM interface 522.
  • Buffer management engine 520 comprises pointer cache 523 and control logic 524. Conventional components could be used for SRAM interface 521, SDRAM interface 522, SRAM 525, and SDRAM 526.
  • SRAM interface 521 exchanges context information between SRAM 525 and co-processor circuitry 107.
  • External buffers 528 use a linked list mechanism to store communication packets externally to integrated circuit 100.
  • Pointer stack 527 is a cache of pointers to free external buffers 528 that is initially built by core processor 104.
  • Pointer cache 523 stores pointers that were transferred from pointer stack 527 and correspond to external buffers 528. Sets of pointers may be periodically exchanged between pointer stack 527 and pointer cache 523. Typically, the exchange from stack 527 to cache 523 operates on a first-in/first-out basis.
  • core processor 104 writes pointers to free external buffers 528 to pointer stack 527 in SDRAM 526.
  • control logic 524 transfers a subset of these pointers to pointer cache 523.
  • When an entity, such as core processor 104, co-processor circuitry 107, or an external system, needs to store a packet in memory 103, the entity reads a pointer from pointer cache 523 and uses the pointer to transfer the packet to external buffers 528 through SDRAM interface 522.
  • Control logic 524 allocates the external buffer as the corresponding pointer is read from pointer cache 523.
  • SDRAM stores the packet in the external buffer indicated by the pointer. Allocation means to reserve the buffer, so other entities do not improperly write to it while it is allocated.
  • After the packet is transferred from memory 103 through SDRAM interface 522 to co-processor circuitry 107 or transmit interface 108, the entity writes the pointer to pointer cache 523.
  • Control logic 524 de-allocates the external buffer as the corresponding pointer is written to pointer cache 523. De-allocation means to release the buffer, so other entities may reserve it. The allocation and de-allocation process is repeated for other external buffers 528.
  • Control logic 524 tracks the number of the pointers in pointer cache 523 that point to de-allocated external buffers 528. If the number reaches a minimum threshold, then control logic 524 transfers additional pointers from pointer stack 527 to pointer cache 523. Control logic 524 may also transfer an exhaustion signal to core processor 104 in this situation. If the number reaches a maximum threshold, then control logic 524 transfers an excess portion of the pointers from pointer cache 523 to pointer stack 527.
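  • A minimal sketch of this pointer management follows. It is an illustration under stated assumptions rather than the design of buffer management engine 520: the class name, the fixed `batch` size for each exchange, and the choice to raise the exhaustion signal only when the stack has nothing left to give are all hypothetical.

```python
from collections import deque


class BufferManager:
    def __init__(self, free_pointers, min_threshold, max_threshold, batch):
        self.stack = deque(free_pointers)   # models pointer stack 527 in SDRAM
        self.cache = deque()                # models pointer cache 523
        self.allocated = set()
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.batch = batch                  # assumed pointers moved per exchange
        self.exhaustion_signals = 0
        self._rebalance()

    def _rebalance(self):
        if len(self.cache) <= self.min_threshold:
            moved = 0
            while self.stack and moved < self.batch:
                self.cache.append(self.stack.popleft())   # first-in/first-out refill
                moved += 1
            if moved == 0:
                self.exhaustion_signals += 1              # nothing left to refill with
        elif len(self.cache) >= self.max_threshold:
            for _ in range(min(self.batch, len(self.cache))):
                self.stack.append(self.cache.pop())       # return the excess portion

    def allocate(self):
        """Read a pointer from the cache; the buffer it names becomes allocated."""
        pointer = self.cache.popleft()
        self.allocated.add(pointer)
        self._rebalance()
        return pointer

    def free(self, pointer):
        """Write a pointer back to the cache; the buffer is de-allocated."""
        self.allocated.remove(pointer)
        self.cache.append(pointer)
        self._rebalance()
```

  • Keeping only a small working set of pointers in the cache means that most allocations and de-allocations never touch the pointer stack in SDRAM.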
  • Figure 6 illustrates in more detail the registers 603A, 603B and 603C in core processor 104 and the interface transfer queue 602 between core processor 104 and co-processor 107.
  • There are sixty-four registers, 0 to 63, available to a user of the system. Registers 0 to 29 are used to store general state information, and registers 30 to 63 are used to store "context information", "data information", and "event specific state information". There is also a shadow set of registers that corresponds to registers 30 to 63.
  • the core processor 104 when the core processor 104 is processing a series of events, the first event uses registers A & B, the next event uses registers A & C, the next event uses registers A & B, the next event uses registers A and C, etc.
  • one set of registers (either B or C) is the active set of registers and at the same time the other set of registers (either B or C) is a shadow set of registers that is being loaded for the next event, which will be processed.
  • register sets B and C alternate as the active and shadow register sets.
  • the registers 603A, 603B and 603C are low latency memory.
  • the data buffers in co-processor 107 are medium latency memory.
  • the off chip memory 103 is a high latency memory.
  • some embodiments of the invention make possible the increased use of the low latency memory available to the core processor 104.
  • the data buffers 314 and the context buffers 315 are part of the control of the co-processor 107.
  • the co-processor 107 can read data and context from the cache memory via memory interface 109 and provide the data and context to the core processor 104 over the data bus indicated by the arrow 601 A. While an event is being processed using registers A and B, registers C are loaded with data and context information needed to process the next event.
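  • The alternation between register sets B and C can be pictured with the small model below. It is a sketch only: `RegisterFile`, `prefetch`, and `swap` are hypothetical names, and dictionaries stand in for the hardware register sets 603A, 603B, and 603C.

```python
class RegisterFile:
    def __init__(self):
        self.general = {}                    # register set A (general state)
        self.banks = {"B": {}, "C": {}}      # event-specific register sets
        self.active = "B"

    @property
    def shadow(self):
        """Name of the register set that is currently not active."""
        return "C" if self.active == "B" else "B"

    def prefetch(self, context, data):
        """Load the next event's information into the shadow set only."""
        self.banks[self.shadow] = {"context": context, "data": data}

    def swap(self):
        """Make the shadow set active and return its contents."""
        self.active = self.shadow
        return self.banks[self.active]
```

  • Because `prefetch` touches only the shadow set, the event running on the active set is undisturbed while the next event's context and data are loaded.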
  • the registers shown in Figure 6 are not a cache memory.
  • the registers shown in Figure 6 are the on chip registers, which are part of the core processor 104.
  • the pre-fetch block 601 shown is responsible for controlling the coprocessor pre-fetch processing.
  • Based on signals from the core processor 104 and the state of the current pre-fetch, this unit indicates to the work queue selection logic (312) when to select the top element from the work queue and to return the identifying parameters back to the pre-fetch logic block. Based on these parameters, the pre-fetch block controls the reading of the appropriate
  • the pre-fetch logic 601 also indicates to the core processor 104 whether to swap to the shadow register set when the core processor 104 begins processing a new event. Typically, the core processor 104 swaps to the shadow register set; however, there are special conditions, as described later in this document, under which the pre-fetch logic 601 can determine that the core processor 104 should not swap to the shadow register set.
  • The program running on the core processor 104 can, in certain cases, determine in advance that it should always or never swap to the shadow "context" or "data" register set.
  • the core processor 104 can indicate this by setting the configuration bits in the prefetch logic 601 which force the logic to always, never, or when appropriate indicate to the core processor 104 that it should swap to the shadow register set. For instance, in the case where the pre-fetched "data" registers are never being used, the core processor 104 could configure the pre-fetch logic 601 to indicate that the core processor 104 should never swap to the "data" shadow register set. In this case, the core processor 104 would then be free to use the "data" registers for other purposes.
  • the configuration bits for this option are associated with each priority queue, and hence, the configuration bits used are determined by the priority queue which is selected.
  • Another function associated with the pre-fetch logic 601 is to determine the service address associated with the pre-fetched event.
  • The pre-fetch logic 601 can pick the service address from a set of fixed addresses or from the "context" data which is being fetched.
  • the location the pre-fetch logic 601 uses to pick the service address, the service address selection field, is configured on a per priority queue basis, and hence this field is determined by the priority queue selected.
  • the resource which initiates an event can also pass a field which is used to modify the service address selection field just for the selection of this particular event's service address.
  • Various functions could be used to combine the field the resource supplied with the field stored in the configuration registers. The function that has been implemented is exclusive-or. Other possible choices could have been addition, and or replacement.
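  • The selection of a service address, including the exclusive-or combination of the per-queue field with the field supplied by the requesting resource, might look like the sketch below. The field layout (one "take it from the context" bit plus a two-bit index), the address values, and all names are assumptions made for illustration; the text does not define the encoding.

```python
FIXED_SERVICE_ADDRESSES = (0x1000, 0x1040, 0x1080, 0x10C0)  # hypothetical addresses
FROM_CONTEXT_BIT = 0b100   # hypothetical flag: take the address from the context
INDEX_MASK = 0b011         # hypothetical index into the fixed address table


def select_service_address(queue_field, event_field, context):
    """Combine the per-queue selection field with the event's modifier by XOR."""
    selection = (queue_field ^ event_field) & (FROM_CONTEXT_BIT | INDEX_MASK)
    if selection & FROM_CONTEXT_BIT:
        return context["service_address"]
    return FIXED_SERVICE_ADDRESSES[selection & INDEX_MASK]
```

  • With exclusive-or, a modifier field of zero leaves the queue's configured selection unchanged, so a resource only needs to supply a non-zero field for events that require a different service routine.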
  • the overall operation of the pre-fetch system is illustrated in Figure 7.
  • The process begins at some point with the state indicated by block 701.
  • the context and data are stored in buffers 314 and 315 using the methods previously described and the core processor 104 is using an active register set.
  • The core processor 104 needs to pre-fetch the initial event's data into its shadow register set. This initial pre-fetch is performed using what is termed the BRSLPRE instruction. This instruction indicates to the co-processor 107 to pre-fetch data for the next event into the shadow register file, and to send the corresponding service address.
  • This core processor 104 instruction does not change the program flow of the core processor 104, but rather serves as a way to initialize or reinitialize the event information stored in the shadow register file.
  • the core processor 104 is now ready to begin event processing.
  • the core processor 104 sends a command to the coprocessor 107 to fetch the top entry on the work queue 313 into the shadow register and to send the next service address.
  • the core processor 104 prepares to branch to the previously pre-fetched service address. This is termed a BRSL instruction.
  • The core processor 104 determines if the service address for the shadow register has been fetched. If not, the core processor 104 stalls until the service address for the shadow register has been fetched in step 702B. It should be noted that the service address in question is not the service address determined by the preceding BRSL instruction (701C), but rather by the earlier BRSL instruction (701B), which initiated the pre-fetch of data into the shadow register set. As indicated by step 703, when the service address for the shadow register has been fetched, the core processor 104 switches to the shadow register set and branches to the appropriate service address.
  • the core processor 104 then performs event processing using the then active register set as indicated by step 704. It is noted that all the requested data does not need to be pre-fetched into the core processor shadow register set before the core processor 104 can switch to this register set.
  • The pre-fetching of data into a register set can happen concurrently with the processing of an event using this register set. If the data required by the event processing has not yet been pre-fetched, the core processor 104 operation is automatically stalled or suspended until the data becomes available.
  • the core processor 104 next sends a command to co-processor 107 to fetch the top entry in the work queue 313 into the shadow registers and to set the next service address.
  • the core processor 104 begins to branch to the previously pre-fetched service address. As indicated previously, this can be described as performing a BRSL instruction.
  • The core processor 104 cannot branch to a new service address until the active register pre-fetching operation has been finished.
  • The core processor 104 operation is stalled until this pre-fetching has been finished. Finishing the pre-fetch may consist of terminating the pre-fetch or allowing the pre-fetch to complete. The process then repeats steps 702 through 706 as described above.
  • the pre-fetch logic 601 handles two special situations. One situation is when back to back events are taken which use the same "context" and/or "data” information. Since the core processor 104 can be updating the "context" and "data” information while the next event "data” and "context” is being pre-fetched, if the next event context is the same as the current context, the pre-fetched context is not assured to reflect all the changes the processing core has made to the context, (i.e. the pre-fetched data can be stale).
  • Because the current registers do reflect all the changes the core processor 104 has made to the context, there is no need to swap to the shadow register set, and the BRSL instruction (blocks 701B and 705 in Figure 7) does not switch to the shadow register set in this situation. Determining the appropriate service address in this situation also requires some special handling. If the work queue is set to extract the service address from the "context" and the processing core changes this service address, then the service address determined by the pre-fetch logic 601 might be stale. In order to avoid this problem, in a first embodiment, a mode has been added to the work queue selection hardware 312 that does not allow back-to-back events from the same work queue. This allows the programmer to avoid the case described above.
  • In another embodiment, back-to-back events are allowed, but a write to a BRSL interlock address is issued after the service address has been changed. Writing to this address stalls the next BRSL instruction until the BRSL interlock address write has left the queue 602 shown in Figure 6. Since the BRSL interlock address write happened after the service address update, the service address update must have cleared this queue as well.
  • There is pre-fetch logic 601 which snoops the output of the queue 602 and checks for writes which will affect the service address of the currently pre-fetched event (indicated as 602A on Figure 6). If such a write is detected, the logic updates the next service address appropriately.
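  • The decision of whether a BRSL instruction should switch to the shadow register set, combining the per-queue configuration bits with the back-to-back same-context case, can be summarized by the following sketch. The enumeration and function names are hypothetical, and buffer indexes are assumed to be what identifies "the same context".

```python
from enum import Enum


class SwapMode(Enum):
    ALWAYS = "always"
    NEVER = "never"
    WHEN_APPROPRIATE = "when_appropriate"


def should_swap(mode, active_buffer_index, prefetched_buffer_index):
    """Decide whether the core processor should switch to the shadow register set."""
    if mode is SwapMode.ALWAYS:
        return True
    if mode is SwapMode.NEVER:
        return False
    # Same buffer back to back: the active registers already hold the newest
    # values, while the pre-fetched copy may be stale, so do not swap.
    return active_buffer_index != prefetched_buffer_index
```

  • The same function would be evaluated separately for the "context" and "data" register sets, since the text describes configuring them independently.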
  • Another special situation for the pre-fetch mechanism (hereinafter referred to as the ABA case) occurs when the pre-fetch is for a context that was used not in the previous event but in the event one before the previous event.
  • The case is further complicated by the fact that writes from core processor 104 to the on chip "context" storage go through the queue 602.
  • In the ABA case, when pre-fetching for the second "A" event, there could be writes in the queue which affect context "A", which could cause the pre-fetch logic to pre-fetch stale values of context "A".
  • The start of a pre-fetch is delayed until all the writes associated with the event one before the current event have cleared the queue 602.
  • The selection of the event to pre-fetch for is also delayed in the same manner. This allows the writes associated with the first event "A", in the "ABA" case, to affect the selection of the second "A" event.
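  • The ABA interlock can be expressed as a simple check on the contents of queue 602. The sketch below is an assumption-laden illustration: it supposes that each queued write is tagged with the sequence number of the event that issued it, which the text does not state, and the function name is hypothetical.

```python
from collections import deque


def can_start_prefetch(write_queue, current_event):
    """Allow a pre-fetch only when no write from the event before the current one remains queued.

    write_queue holds (event_number, address, value) tuples that have not yet
    reached the on-chip context storage.
    """
    return all(event != current_event - 1 for event, _address, _value in write_queue)
```

  • For example, while event 1 ("B") is running, a pre-fetch for the second "A" event is held off as long as a write issued by event 0 (the first "A") is still queued, and it may begin once that write has drained.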
  • Figure 8 shows the detailed logic added to the data buffer 314 shown in Figure 3 in an example of the invention.
  • the data buffer 314 includes two sections designated data only buffers 814 and special state information buffers 820.
  • the data buffers are assigned an index number from zero to the maximum number of data buffers in the co-processor 107.
  • the special state information buffers are also assigned an index from zero to the maximum number of special state information buffers in the co-processor 107.
  • the context buffers are also assigned an index from zero to the maximum number of context buffers in the co-processor 107.
  • indexes are used by the logic in the co-processor 107 and the core processor 104 to identify an individual context buffer, data buffer, or special state information buffer. In one embodiment, there are sixteen of each of these type of buffers in the co-processor 107. The exact number of each of these buffers is not significant to the general operation of the logic.
  • Each buffer has an associated in-use counter 814-0 to 814-5 and 820-0 to
  • the in-use counters keep track of the number of events, which are using the data in the particular buffers. Each in-use counter is incremented by one for each event, which is using the data or state information in a particular buffer. When an event finishes with a particular buffer, the in-use counter is decremented by one. When the count in an in-use counter reaches zero, no events are using the particular buffer and it can be reallocated.
  • Data buffer resolution logic 822 and PRSR special data resolution logic 821 operate similarly to context buffer resolution 310, which was previously described.
  • Data buffer resolution logic 822 keeps track of which data buffers 814 are in use and which are available to be assigned to new events. Data buffer resolution logic 822 also contains the logic for incrementing and decrementing the in-use counters associated with the data buffers 814. PRSR special data resolution logic 821 keeps track of which special state information buffers are in use and which are available to be assigned to new events. PRSR special data resolution logic 821 also contains the logic for incrementing and decrementing the in-use counters associated with the special state information buffers.
  • PRSR special data resolution logic 821 and data buffer resolution logic 822 select a buffer to be assigned to a new event by scanning the in use counts of all their associated buffers and picking the buffer with the lowest index which has an in-use count of zero. In other embodiments, there are numerous variations in selecting a buffer to be assigned to a new event and which has an in-use count of zero. Some examples of selecting a buffer are first-in-first-out selection and last-in- first-out selection.
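  • The lowest-index selection described above amounts to a linear scan of the in-use counts. A minimal sketch, with a hypothetical function name and a list standing in for the hardware counters:

```python
def select_lowest_free(in_use_counts):
    """Return the lowest buffer index whose in-use count is zero, or None if all are busy."""
    for index, count in enumerate(in_use_counts):
        if count == 0:
            return index
    return None
```

  • With in-use counts of `[1, 0, 2, 0]` this picks buffer 1, and it keeps picking buffer 1 each time that buffer is freed again.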
  • Context resolution 310 contains the logic used to select the context buffer to be assigned to a new event.
  • a global configuration bit is used to pick which of two mechanisms is used to select the next context buffer to be assigned to a new event.
  • One mechanism picks the context buffer in the same manner as the next data buffer is picked.
  • this method returns the context buffer with a zero in-use count which has the lowest index.
  • The problem with this selection mechanism for context buffers is that it tends to select the context buffers that have been most recently freed. For instance, when the context buffer with index zero is freed, it is always the next index to be selected. Because context information that is not already stored in a context buffer needs to be read in from off-chip memory, under certain conditions it is better not to reuse a context buffer as soon as its in-use count goes to zero.
  • This problem is addressed by the second context selection mechanism.
  • This mechanism uses a moving "finger" which determines at what index the logic will start searching for an in-use count of zero. The value of the finger is incremented after each new context selection. Hence, for the first new context selection, the logic will start searching forward from index zero. For the second new context selection, the logic will start searching forward from index 1, etc.
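  • The moving "finger" can be sketched as below. The class name is hypothetical, and wrapping the search and the finger around at the last index is an assumption, since the text does not say what happens at the end of the index range.

```python
class FingerSelector:
    def __init__(self, num_buffers):
        self.num_buffers = num_buffers
        self.finger = 0                      # index at which the next search starts

    def select(self, in_use_counts):
        """Search forward from the finger for a zero in-use count."""
        for offset in range(self.num_buffers):
            index = (self.finger + offset) % self.num_buffers
            if in_use_counts[index] == 0:
                # The finger advances by one after each new context selection.
                self.finger = (self.finger + 1) % self.num_buffers
                return index
        return None
```

  • If buffer 0 is selected, used, and freed, the next selection starts searching at index 1, so the channel descriptor left in buffer 0 stays on chip a little longer and may be reused without a fetch from off-chip memory.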
  • the special state information data buffer 820 contains a pointer to an associated data buffer 614 as well as an associated context buffer 315 (hereinafter these will also be referred to as resources). Because of these links, a special state data buffer can be used to identify the resources associated with an event. As shown by the arrows from the special state data buffers to the priority queues 313, a special state data buffer pointer is stored in the appropriate priority queue. This logic was described in more detail above in stage 3 of Figure 3. When the arbiter 312 picks the next entry to service from the priority queue, the arbiter 312 returns a special state data buffer pointer. This pointer is then used by logic associated with the core processor 104 and the co-processor circuitry 107 to identify the context and data buffer resources the event will be using.
  • the size of a data buffer 614 is 64-bytes
  • the size of a context buffer 315 is 64-bytes
  • the size of a special state data buffer 620 is 44 bits. As recognized by those skilled in the art, the size of these buffers could be changed without affecting the operation of the logic in Figure 8.
  • FIG. 9 is a block flow diagram showing how a data buffer 614 can be passed from one event to another event in an example of the invention.
  • a new event begins as indicated by steps 901 and 902
  • a check is made to determine if the particular event is using a passed data buffer. If the particular event would like to use a "passed" data buffer, the particular data buffer 814 is associated with the event and the in-use counter for the particular data.
  • In step 921, the event processing takes place, and at the end of the event, the in-use counter of the data buffer is decremented by one in step 922.
  • a check is made to determine if the in-use counter is zero. If the count is zero, the buffer is freed and can be assigned to a new event as indicated by step 925. If the count is not zero, as indicated by step 924, the buffer is not freed since the buffer is still in use by some other event.
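  • The flow of FIG. 9 can be modeled as simple reference counting. In the sketch below the names are hypothetical, and it is assumed (consistent with the in-use counter description given earlier) that associating a passed data buffer with a new event increments that buffer's in-use counter.

```python
class DataBufferPool:
    def __init__(self, num_buffers):
        self.in_use = [0] * num_buffers

    def begin_event(self, passed_index=None):
        """Associate a data buffer with a new event and return its index."""
        if passed_index is None:
            # No buffer was passed: take the lowest-index buffer that is free.
            index = self.in_use.index(0)     # raises ValueError if none is free
        else:
            index = passed_index             # reuse the buffer passed by another event
        self.in_use[index] += 1
        return index

    def end_event(self, index):
        """Decrement the counter; return True when the buffer has been freed."""
        self.in_use[index] -= 1
        return self.in_use[index] == 0
```

  • A buffer shared by two events is freed only when the second of them finishes, which corresponds to the two outcomes of the zero check described above.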
  • FIG. 10 is a block flow diagram showing how state information is passed between events in an example of the invention.
  • A determination is made as to whether or not an event is passing "state" information. If state information is not being passed, the operation proceeds as indicated by steps 1010 to 1015.
  • a new state information buffer is selected from the unused pool of buffers as indicated by step 1010.
  • the event is performed.
  • the in-use counter is decremented by one (step 1012) and a check is made to determine if the count is zero at step 1013. If the count is zero, the buffer is free to be assigned as indicated by step 1015. Otherwise, the buffer is not freed as indicated by block 1014.
  • The operations that occur when "state" information is passed from one event to another event are indicated by steps 1004 to 1008.
  • the information in the data only buffer 814 is also passed between the events. This is indicated by steps 1004 and 1005.
  • The event proceeds as indicated by step 1006, and at the end of the event, as indicated by steps 1007 and 1012, the in-use counters of the data only buffer 814 and the state information buffer 820 are each decreased by one.
  • As indicated by steps 1008-a, 1008-b, and 1013 to 1015, a check is then made to determine if each in-use counter has reached zero and therefore whether the buffers can be re-assigned.
  • An event can pass data or special state information associated with one event to a new event, which does not share the same context information. Such transfers are possible because the state information is stored in a buffer that is separate from the data buffer.
  • An event can also pass a multi-bit message from a current event to a subsequent event that is generated by the current event. This message is stored in the special state buffer of the subsequent event.
  • Figures 11A and 11B illustrate examples of how one embodiment of the invention operates.
  • The horizontal dimension in Figures 11A and 11B represents time.
  • Figure 11A illustrates how the in-use counts for a data buffer change for an event which submits a DMA command in an example of the invention.
  • the process begins at step 1101. It is assumed that at this point the in-use count of the data buffer is one. While the event posted as indicated by step 1101 is progressing, steps 1102 and 1103 indicate that two DMA transfers are submitted.
  • the data buffer count is incremented to two by the first DMA command and to three by the second DMA command. As indicated by step 1104, when the first DMA transfer finishes, the in-use count is reduced to two.
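  • The sequence of Figure 11A can be replayed with a small counter model. The class and method names are hypothetical; only the count values stated in the text (one, two, three, then two) are taken from the source.

```python
class DataBufferCounter:
    def __init__(self):
        self.in_use = 0

    def post_event(self):
        self.in_use += 1          # an event starts using the data buffer

    def submit_dma(self):
        self.in_use += 1          # each outstanding DMA command holds the buffer

    def finish_dma(self):
        self._release()

    def finish_event(self):
        self._release()

    def _release(self):
        if self.in_use == 0:
            raise ValueError("in-use count would go negative")
        self.in_use -= 1

    @property
    def free(self):
        return self.in_use == 0
```

  • Posting the event (step 1101), submitting two DMA commands (steps 1102 and 1103), and finishing the first transfer (step 1104) takes the count through 1, 2, 3, and back to 2, so the buffer cannot be reassigned while a transfer is still outstanding.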
  • Figure 11B indicates how the in-use count of a data buffer changes for an event which creates a shared data buffer in an example of the invention.
  • The horizontal dimension indicates time.
  • The illustrated process begins as indicated by step 1111 with an event being posted. In one embodiment, this event requested a new data buffer. This data buffer would have an initial in-use count of zero and when the event is posted, as indicated by step 1111, the in-use count is increased to one.
  • Step 1121 represents another event request, which is posted as indicated by step 1122. For the event request shown in 1121, the first event passes its data buffer to the second event so the second event starts with a data buffer in-use count of two. This initial in-use count of two is arrived at using multiple steps.
  • When the core processor 104 initiates a request for another event, the data buffer in-use count is immediately incremented by one in order to reserve this data buffer for the next event.
  • If the event request is for another core processor event, the co-processor circuitry 107 receives this event request and passes this request to the section of the co-processor logic which handles core processor event requests. This is the same logic which handled the initial event generation indicated in 1101 or 1111.
  • the in-use count of the data buffer is again incremented as this data buffer is assigned to the new event.
  • the section of the co-processor circuitry 107 that handles event requests signals back to the section of the co-processor circuitry 107, which received this event request from the core processor 104.
  • This section of the co-processor logic now requests that the in-use count of the data buffer be decremented by one. Hence, there is a total of two increments and one decrement, and the new event is posted with an effective initial data buffer in-use count of two.
  • The system is set up so that if step 1122 is delayed by stalls in the system such that this event request is really processed after 1112 happens, the data buffer is reserved using in-use counts by the 1121 operation until the 1122 operation can take place.
  • Step 1112 indicates that when the first event is finished, the data buffer count is reduced to one.
  • Steps 1131 and 1132 indicate a DMA request that is submitted and posted using the same data buffer. As indicated by steps 1131 and 1132, the count is increased to two and then reduced to one when the DMA request is finished.
  • When the event posted at block 1122 is finished, the in-use count is reduced to zero and the data buffer can be reassigned to a new event.
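  • The hand-off of a shared data buffer in Figure 11B, with its two increments and one decrement, can be traced with the sketch below. The function name is hypothetical, a list element stands in for the hardware in-use counter, and the ordering shown assumes that step 1122 is not delayed past step 1112.

```python
def pass_buffer_to_new_event(in_use, index):
    """Replay the three-step hand-off and return the count after each step."""
    trace = []
    in_use[index] += 1        # the core processor's request reserves the buffer
    trace.append(in_use[index])
    in_use[index] += 1        # the event-handling logic assigns it to the new event
    trace.append(in_use[index])
    in_use[index] -= 1        # the receiving section drops the reservation
    trace.append(in_use[index])
    return trace


def replay_figure_11b():
    """Return the in-use count after each step of the Figure 11B example."""
    in_use = [0]
    counts = []
    in_use[0] += 1                                   # step 1111: first event posted
    counts.append(in_use[0])
    counts.extend(pass_buffer_to_new_event(in_use, 0))  # steps 1121 and 1122
    for delta in (-1, +1, -1, -1):                   # 1112, 1131, 1132, second event ends
        in_use[0] += delta
        counts.append(in_use[0])
    return counts
```

  • The temporary extra increment is what keeps the count from reaching zero if the first event finishes before the new event has actually been posted.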

Abstract

Circuitry (100) to free the core processor (104) from performing the explicit read operation required to read data into the internal register set. The processor's register set (603B) is expanded and a 'shadow register' set (603C) is provided. While the core processor (104) is processing one event the 'context' and 'data' and other associated information for the next event is loaded into the shadow register set (603C). When the core processor (104) finishes processing an event, the core processor (104) switches to the shadow register set (603C) and it can begin processing the next event immediately. With short service routines, there might not be time to fully pre-fetch the 'context' and 'data' associated with the next event before the current event ends. In this case, the core processor (104) still starts processing the next event and the pre-fetch continues during the event processing.

Description

Enhancing Performance by Pre-Fetching and Caching Data Directly in a Communication Processor's Register Set
Related Applications: Priority is claimed for the following co-pending applications:
1) Application serial number 60/221,821 entitled "Traffic Stream Processor" filed on July 31, 2000.
2) Application serial number 09/639,915 entitled "Integrated Circuit that Processes Communication Packets with Scheduler Circuitry that Executes Scheduling Algorithms based on Cached Scheduling Parameters" filed on August 16, 2000.
3) Application serial number 09/640,258 entitled "Integrated Circuit that Processes Communication Packets with Co-Processor Circuitry to Determine a Prioritized Processing Order for a Core Processor" filed on August 16, 2000.
4) Application serial number 09/640,231 entitled "Integrated Circuit that Processes Communication Packets with Co-Processor Circuitry to Correlate a Packet Stream with Context Information" filed on August 16, 2000.
The content of the above applications is hereby incorporated herein by reference.
Field of the Invention: The present invention is related to the field of communications, and more particularly to integrated circuits that process communication packets.
Background of the Invention:
Many communication systems transfer information in streams of packets. In general, each packet contains a header and a payload. The header contains control information, such as addressing or channel information, that indicates how the packet should be handled. The payload contains the information that is being transferred. Some examples of the types of packets used in communication systems include Asynchronous Transfer Mode (ATM) cells, Internet Protocol (IP) packets, frame relay packets, Ethernet packets, or some other packet-like information block. As used herein, the term "packet" is intended to include packet segments. Integrated circuits termed "traffic stream processors" have been designed to apply robust functionality to high-speed packet streams. Robust functionality is critical with today's diverse but converging communication systems. Stream processors must handle multiple protocols and inter-work between streams of different protocols. Stream processors must also ensure that quality-of-service constraints, priority, and bandwidth requirements are met. This functionality must be applied differently to different streams, and there may be thousands of different streams.
Co-pending applications 09/639,966, 09/640,231 and 09/640,258, the content of which is hereby incorporated herein by reference, describe an integrated circuit for processing communication packets. As described in the above applications, the integrated circuit includes a core processor. The processor handles a series of tasks, termed "events". These events consist of tasks such as CPU processing steps as well as the scheduling of subsequent events. These subsequently scheduled events may consist of CAM lookups, DMA data transfers, or other generic events based on conditions in the current event. All events have an associated service address, "context information" and "data". Information about the event such as the resource that requested the event, how much data is associated with the event, and other key information from the event requestor is stored in "special state" information associated with the event. Most events have an associated service address, "context information" and "data". When an external resource initiates an event, the external resource supplies the core processor with a memory pointer to "context" information and also supplies the data to be associated with the event. The context pointer is used to fetch the context from external memory and to store this "context" information in memory located on the chip. If the required context data has already been fetched onto the chip, the hardware recognizes this fact and sets the on-chip context pointer to point to this already pre-fetched context data. Only a small number of the system "contexts" are cached on the chip at any one time, and their allocation needs to be managed and sometimes shared among multiple processing events. Each cached "context" has an in-use counter so that one context can be associated with multiple sets of data. The rest of the system "contexts" are stored in external memory. This context fetch mechanism is described in the above referenced co-pending applications.
In order to process an event, the core processor needs the service address of the event as well as the "context" and "data" associated with the event. The service address is the starting address for the instructions used to service the event. The core processor branches to the service address in order to start servicing the event.
Typically, the core processor needs to access a portion of the "context" associated with the event so the appropriate part of the "context" is read into the core processor's local registers. When this is done, the core processor can read, and if appropriate modify, the "context" values. However, when the core processor modifies a "context" value, the "context" values stored outside of the core processor register must be updated to reflect this change. This can happen under direct programmer control or using the method described in the above referenced patent (U.S. Patent 5,748,630). The "data" associated with an event is handled in a manner similar to that described for the "context".
In the circuit described in the above referenced co-pending applications, the processing core performed a register read which returned a pointer to the context, data, and service address associated with the next event. The processing core then needed to explicitly read the context and data into its internal register set. In the circuit described in the above referenced co-pending applications, data and context information for a number of events are stored in buffers in a coprocessor.
Summary of the Invention: The present invention frees the core processor from performing the explicit read operation required to read data into the internal register set. The present invention expands the processor's register set and provides a "shadow register" set. While the core processor is processing one event, the "context" and "data" and some other associated information for the next event is loaded into the shadow register set. When the core processor finishes processing an event, the core processor switches to the shadow register set and it can begin processing the next event immediately. With short service routines, there might not be time to fully pre-fetch the "context" and "data" associated with the next event before the current event ends. In this case, the core processor still starts processing the next event and the pre-fetch continues during the event processing. If the core processor accesses a register which is associated with part of the context for which the pre-fetch is still in progress, the core processor will automatically stall or delay until the pre-fetch has completed reading the appropriate data. Logic has been provided to handle several special situations, which are created by the use of the shadow registers, and to provide the programmer with control over the pre-fetching and service address selection process.
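The switch-and-stall behaviour summarized above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuit itself: the class names `RegisterBank` and `ShadowedCore`, the one-word-per-cycle pre-fetch rate, and the `stall_cycles` counter are all hypothetical and chosen for illustration.

```python
from collections import deque


class RegisterBank:
    """One bank of registers with a per-register 'loaded' flag (hypothetical model)."""

    def __init__(self, size):
        self.values = [None] * size
        self.loaded = [False] * size


class ShadowedCore:
    """Toy model of a core with an active register bank and a shadow bank."""

    def __init__(self, size=4):
        self.size = size
        self.active = RegisterBank(size)
        self.shadow = RegisterBank(size)
        self.pending = deque()   # (bank, index, value) words not yet loaded
        self.stall_cycles = 0    # cycles the core spent waiting on the pre-fetch

    def start_prefetch(self, words):
        """Queue the next event's context/data words for a fresh shadow bank."""
        self.shadow = RegisterBank(self.size)
        for index, word in enumerate(words):
            self.pending.append((self.shadow, index, word))

    def tick(self):
        """One pre-fetch cycle: load at most one queued word (assumed rate)."""
        if self.pending:
            bank, index, word = self.pending.popleft()
            bank.values[index] = word
            bank.loaded[index] = True

    def switch_event(self):
        """The current event ended: the shadow bank becomes the active bank."""
        self.active, self.shadow = self.shadow, self.active

    def read(self, index):
        """Read an active register, stalling while its word is still in flight."""
        while not self.active.loaded[index]:
            if not self.pending:
                raise RuntimeError("register was never pre-fetched")
            self.stall_cycles += 1
            self.tick()
        return self.active.values[index]


core = ShadowedCore(size=4)
core.start_prefetch([10, 20, 30, 40])
core.tick()            # only one word arrives before the current event ends
core.switch_event()    # the core starts the next event anyway
first = core.read(0)   # already loaded, no stall
last = core.read(3)    # stalls until words 1..3 have been loaded
```

In this model the core pays stall cycles only for registers it touches before the pre-fetch reaches them, which mirrors the short-service-routine case described in the summary.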
In the integrated circuit shown in the referenced co-pending applications, special state information is effectively stored together with associated data in data buffers. Furthermore, the data buffers do not have associated in-use counters. In some embodiments, separate logical buffers are provided for special state information and for the associated data buffer. Furthermore, each data buffer and each special state information buffer (hereinafter termed resources) has an associated in-use counter. Multiple events can share the same resource. The counter associated with a resource is incremented when a resource becomes associated with a particular event. The counter associated with a resource is decremented when an event completes the use of that particular resource. When the in-use count for a resource becomes zero, the in-use count indicates that the resource is unassigned and that the resource can be assigned to a new event. In some embodiments, two events can point to (i.e. utilize) the same data buffer and/or the same special state information buffer. Furthermore the content of a data buffer or a special state information buffer can be passed directly from one event to another event without reading the data into and out of memory. The in-use counter is particularly useful to facilitate the timing of DMA requests without need for explicit control by an external program. With the present invention two events can use the same data buffer. This is possible since the special state information is stored in a separate buffer. Furthermore, one can have one data buffer associated with multiple context buffers since the special state information is stored separately from the associated data. Some embodiments also add a communication mechanism which allows an event to pass a multi-bit message to subsequent events. This message passing mechanism does not require that the two events share any of the same context, data, or special state resources.
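The in-use counting rules described above (increment on association, decrement on completion, zero meaning unassigned) can be sketched in a few lines. The `ResourcePool` class and its method names are hypothetical; the sketch models only the counters and not the buffers' contents or the DMA timing.

```python
class ResourcePool:
    """Pool of buffers (data or special state), each with an in-use counter."""

    def __init__(self, count):
        self.in_use = [0] * count

    def allocate(self):
        """Assign an unassigned buffer (counter at zero) to a new event."""
        for buf, count in enumerate(self.in_use):
            if count == 0:
                self.in_use[buf] = 1
                return buf
        return None  # every buffer is still held by at least one event

    def share(self, buf):
        """A further event becomes associated with an already assigned buffer."""
        if self.in_use[buf] == 0:
            raise ValueError("buffer is unassigned")
        self.in_use[buf] += 1

    def release(self, buf):
        """An event finished with the buffer; return True if it is now free."""
        if self.in_use[buf] == 0:
            raise ValueError("buffer is unassigned")
        self.in_use[buf] -= 1
        return self.in_use[buf] == 0


pool = ResourcePool(2)
buf = pool.allocate()   # first event takes buffer 0
pool.share(buf)         # a second event points at the same buffer
```

Because the buffer is reclaimed only when the last event releases it, a buffer can be handed from one event to the next without copying its content through memory.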
Brief Description of the Figures:
Figure 1 is an overall block diagram of a packet processing integrated circuit in an example of the invention.
Figure 2 is a block diagram that illustrates packet processing stages and the pipe-lining used by the circuit in an example of the invention.
Figure 3 is a diagram illustrating circuitry in the co-processing relating to context and data buffer processing in an example of the invention.
Figure 4 is a block program flow diagram illustrating buffer correlation and in-use counts in an example of the invention.
Figure 5 is a block diagram of the buffer management circuitry in an example of the invention.
Figure 6 is a block diagram showing the transfer queue and registers in the core processor in an example of the invention.
Figure 7 is a block program flow diagram illustrating an operation in an example of the invention.
Figure 8 is a block diagram showing the details of the data and special state information buffers in an example of the invention.
Figure 9 is a block program flow diagram illustrating how data buffers are passed between events in an example of the invention.
Figure 10 is a block program flow diagram illustrating how state information buffers are passed between events in an example of the invention.
Figures 11A and 11B are block program flow diagrams illustrating examples of how DMA commands are handled in an example of the invention.
Detailed Description of the Invention:
Various aspects of packet processing integrated circuits are discussed in United States patent 5,748,630, entitled "ASYNCHRONOUS TRANSFER MODE CELL PROCESSING WITH LOAD MULTIPLE INSTRUCTION AND MEMORY WRITE-BACK", filed on May 9, 1996. The content of the above referenced patent is hereby incorporated by reference into this application in order to shorten and simplify the description in this application. One embodiment of the present invention described herein is applied as an improvement to the type of integrated circuit described in co-pending patent applications 60/211,863 filed on June 14, 2000, 09/640,260 filed on August 16, 2000, 09/639,915 filed on August 16, 2000, 09/639,966 filed on August 16, 2000, 09/640,258 filed on August 16, 2000 and 09/640,231 filed on August 17, 2000, the content of which is hereby incorporated herein by reference in order to shorten and simplify the description of the present application.
Figure 1 is a block diagram that illustrates a packet processing integrated circuit 100 in an example of the invention. It should be understood that the present invention can also be applied to other types of processors. The operation of the circuit 100 will first be described with reference to Figures 1 to 4 and then the operation of the present invention will be described with reference to Figures 5 to 11A.
Integrated circuit 100 includes a core processor 104, a scheduler 105, receive interface 106, co-processor circuitry 107, transmit interface 108, and memory interface 109. These components may be interconnected through a memory crossbar or some other type of internal interface. Receive interface 106 is coupled to communication system 101. Transmit interface 108 is coupled to communication system 102. Memory interface 109 is coupled to memory 103. Communication system 101 could be any device that supplies communication packets with one example being the switching fabric in an Asynchronous Transfer Mode (ATM) switch. Communication system 102 could be any device that receives communication packets with one example being the physical line interface in the ATM switch. Memory 103 could be any memory device with one example being Random Access Memory (RAM) integrated circuits. Receive interface 106 could be any circuitry configured to receive packets with some examples including UTOPIA interfaces or Peripheral Component Interconnect (PCI) interfaces. Transmit interface 108 could be any circuitry configured to transfer packets with some examples including UTOPIA interfaces or PCI interfaces.
Core processor 104 is a micro-processor that executes networking application software. Core processor 104 supports an instruction set that has been tuned for networking operations, especially context switching. As described herein, core processor 104 has the following characteristics: 166 MHz, pipelined single-cycle operation, RISC-based design, 32-bit instruction and register set, K instruction cache, 8 KB zero-latency scratchpad memory, interrupt/trap/halt support, and C compiler readiness.
Scheduler 105 comprises circuitry configured to schedule and initiate packet processing that typically results in packet transmissions from integrated circuit 100, although scheduler 105 may also schedule and initiate other activities. Scheduler 105 schedules upcoming events, and as time passes, selects scheduled events for processing and re-schedules unprocessed events. Scheduler 105 transfers processing requests for selected events to co-processor circuitry 107. Scheduler 105 can handle multiple independent schedules to provide prioritized scheduling across multiple traffic streams. To provide scheduling, scheduler 105 may execute a guaranteed cell rate algorithm to implement a leaky bucket or a token bucket scheduling system. The guaranteed cell rate algorithm is implemented through a cache that holds algorithm parameters. Scheduler 105 is described in detail in the above referenced co-pending patent applications.
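For readers unfamiliar with token bucket scheduling, the following is a generic textbook-style sketch. It is not the guaranteed cell rate algorithm of scheduler 105, whose details are given only in the referenced applications; the class name `TokenBucket`, the integer time units, and the `rate` and `depth` parameters are assumptions made for illustration.

```python
class TokenBucket:
    """Generic token bucket: tokens accrue at a fixed rate up to a burst depth."""

    def __init__(self, rate, depth):
        self.rate = rate      # tokens added per time unit (assumed integer)
        self.depth = depth    # maximum number of stored tokens (burst size)
        self.tokens = depth   # start with a full bucket
        self.last = 0         # time of the previous update

    def conforms(self, now, cost=1):
        """Return True and spend tokens if a cell may be sent at time 'now'."""
        elapsed = now - self.last
        self.tokens = min(self.depth, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=1, depth=2)
burst = [bucket.conforms(0) for _ in range(3)]   # two cells pass, the third waits
later = bucket.conforms(1)                       # one token has accrued by time 1
```

A per-stream instance of such a state (rate, depth, current tokens, last update time) is the kind of parameter set that a cache of algorithm parameters could hold, although the actual cached parameters are not specified here.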
Co-processor circuitry 107 receives communication packets from receive interface 106 and memory interface 109 and stores the packets in internal data buffers. Co-processor circuitry 107 correlates each packet to context information describing how the packet should be handled. Co-processor circuitry 107 stores the correlated context information in internal context buffers and associates individual data buffers with individual context buffers to maintain the correlation between individual packets and context information. Importantly, co-processor circuitry 107 ensures that only one copy of the correlated context information is present in the context buffers to maintain coherency. Multiple data buffers are associated with a single context buffer to maintain the correlation between the multiple packets and the single copy of the context information. Co-processor circuitry 107 also determines a prioritized processing order for core processor 104. The prioritized processing order controls the sequence in which core processor 104 handles the communication packets. The prioritized processing order is typically based on the availability of all of the resources and information that are required by core processor 104 to process a given communication packet. Resource state bits are set when resources become available, so co-processor circuitry 107 may determine when all of these resources are available by processing the resource state bits. If desired, the prioritized processing order may be based on information in packet handling requests. Co-processor circuitry 107 selects scheduling algorithms based on internal scheduling state bits and uses the selected scheduling algorithms to determine the prioritized processing order. The algorithms could be round robin, service-to-completion, weighted fair queuing, simple fairness, first-come first-serve, allocation through priority promotion, software override, or some other arbitration scheme. Thus, the prioritization technique used by co-processor circuitry 107 is externally controllable. Co-processor circuitry 107 is described in more detail with respect to FIGS. 2-4.
Memory interface 109 comprises circuitry configured to exchange packets with external buffers in memory 103. Memory interface 109 maintains a pointer cache that holds pointers to the external buffers. Memory interface 109 allocates the external buffers when entities, such as core processor 104 or co-processor circuitry 107, read pointers from the pointer cache. Memory interface 109 deallocates the external buffers when the entities write the pointers to the pointer cache. Advantageously, external buffer allocation and de-allocation is available through an on-chip cache read/write. Memory interface 109 also manages various external buffer classes, and handles conditions such as external buffer exhaustion. Memory interface 109 is described in detail in the above referenced patent applications.
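The read-to-allocate and write-to-deallocate convention of the pointer cache can be modelled as a simple free list. This is a minimal sketch under assumptions: the `PointerCache` class is hypothetical, buffer classes and cache refill from external memory are omitted, and exhaustion is signalled by returning `None`.

```python
from collections import deque


class PointerCache:
    """Free list of external buffer pointers: read allocates, write frees."""

    def __init__(self, pointers):
        self.free = deque(pointers)

    def read(self):
        """Reading a pointer allocates the external buffer it addresses."""
        if not self.free:
            return None  # external buffer exhaustion
        return self.free.popleft()

    def write(self, pointer):
        """Writing a pointer back de-allocates its external buffer."""
        self.free.append(pointer)


cache = PointerCache([0x1000, 0x2000])
first = cache.read()    # allocates the buffer at 0x1000
second = cache.read()   # allocates the buffer at 0x2000
empty = cache.read()    # nothing left to allocate
cache.write(first)      # returns 0x1000 to the free list
```

The appeal of this convention is that allocation needs no separate command: an ordinary cache read or write by the core processor or the co-processor circuitry is the allocation request.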
In operation, receive interface 106 receives new packets from communication system 101, and scheduler 105 initiates transmissions of previously received packets that are typically stored in memory 103. To initiate packet handling, receive interface 106 and scheduler 105 transfer requests to co-processor circuitry 107. Under software control, core processor 104 may also request packet handling from co-processor circuitry 107. Co-processor circuitry 107 fields the requests, correlates the packets with their respective context information, and creates a prioritized work queue for core processor 104. Core processor 104 processes the packets and context information in order from the prioritized work queue. Advantageously, co-processor circuitry 107 operates in parallel with core processor 104 to offload the context correlation and prioritization tasks to conserve important core processing capacity. In response to packet handling, core processor 104 typically initiates packet transfers to either memory 103 or communication system 102. If the packet is transferred to memory 103, then core processor 104 instructs scheduler 105 to schedule and initiate future packet transmission or processing. Advantageously, scheduler 105 operates in parallel with core processor 104 to offload scheduling tasks and conserve important core processing capacity.
Various data paths are used in response to core processor 104 packet transfer instructions. Co-processor circuitry 107 transfers packets directly to communication system 102 through transmit interface 108. Co-processor circuitry 107 transfers packets to memory 103 through memory interface 109 with an on-chip pointer cache. Memory interface 109 transfers packets from memory 103 to communication system 102 through transmit interface 108. Co-processor circuitry 107 transfers context information from a context buffer through memory interface 109 to memory 103 if there are no packets in the data buffers that are correlated with the context information in the context buffer. Advantageously, memory interface 109 operates in parallel with core processor 104 to offload external memory management tasks and conserve important core processing capacity.
Co-processor Circuitry -- FIGS. 2-4:
FIGS. 2-4 depict a specific example of co-processor circuitry. Those skilled in the art will understand that Figures 2-4 have been simplified for clarity. FIG. 2 illustrates how co-processor circuitry 107 provides pipe-lined operation. FIG. 2 is vertically separated by dashed lines that indicate five packet processing stages: 1) context resolution, 2) context fetching, 3) priority queuing, 4) software application, and 5) context flushing. Co-processor circuitry 107 handles stages 1-3 to provide hardware acceleration. Core processor 104 handles stage 4 to provide software control with optimized efficiency due to stages 1-3. Co-processor circuitry 107 also handles stage 5. Co-processor circuitry 107 has eight pipelines through stages 1-3 and 5 to concurrently process multiple packet streams.
In stage 1, requests to handle packets are resolved to a context for each packet in the internal data buffers. The requests are generated by receive interface 106, scheduler 105, and core processor 104 in response to incoming packets, scheduled transmissions, and application software instructions. The context information includes a channel descriptor that has information regarding how packets in one of 64,000 different channels are to be handled. For example, a channel descriptor may indicate service address information, traffic management parameters, channel status, stream queue information, and thread status. Thus, 64,000 channels with different characteristics are available to support a wide array of service differentiation. Channel descriptors are identified by channel identifiers. Channel identifiers may be indicated by the request. A map may be used to translate selected bits from the packet header to a channel identifier. A hardware engine may also perform a sophisticated search for the channel identifier based on various information. Different algorithms that calculate the channel identifier from the various information may be selected by setting correlation state bits in co-processor circuitry 107. Thus, the technique used for context resolution is externally controllable.
In stage 2, context information is fetched, if necessary, by using the channel identifiers to transfer the channel descriptors to internal context buffers. Prior to the transfer, the context buffers are first checked for a matching channel identifier and validity bit. If a match is found, then the context buffer with the existing channel descriptor is associated with the corresponding internal data buffer holding the packet.
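The map-based form of context resolution mentioned for stage 1 can be illustrated as a mask-and-shift followed by a table lookup. The field position used here (`mask` and `shift`), the 32-bit header value, and the function names are hypothetical; the actual bit selection is configured through the correlation state bits and is not specified in this description.

```python
def select_bits(header, mask, shift):
    """Extract the selected header bits used as the lookup key."""
    return (header & mask) >> shift


def resolve_channel(header, mask, shift, channel_map):
    """Translate selected header bits to a channel identifier, or None."""
    key = select_bits(header, mask, shift)
    return channel_map.get(key)


# Hypothetical configuration: bits 8..23 of a 32-bit header select the channel.
channel_map = {0x3456: 7}
channel = resolve_channel(0x12345678, 0x00FFFF00, 8, channel_map)
unknown = resolve_channel(0x12000078, 0x00FFFF00, 8, channel_map)
```

A miss in the map (`None` here) corresponds to the case where some other resolution method, such as the hardware search engine, would be needed.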
In stage 3, requests with available context are prioritized and arbitrated for core processor 104 handling. The priority may be indicated by the request, or it may be determined by the source of the request. The priority queues 1-12 are 8 entries deep. Priority queues 1-12 are also ranked in a priority order by queue number. The priority for each request is determined, and when the context and data buffers for the request are valid, an entry for the request is placed in one of the priority queues that corresponds to the determined priority. The entries in the priority queues point to a pending request state RAM that contains state information for each data buffer. The state information includes a data buffer pointer, a context pointer, a context validity bit, a requester indicator, port status, and a channel descriptor loaded indicator. This state information was referred to earlier in this document as the special state information associated with an event. These two terms may be used interchangeably.
The work queue indicates the selected priority queue entry that core processor 104 should handle next. To get to the work queue, the requests in priority queues are arbitrated using one of various algorithms such as round robin, service-to-completion, weighted fair queuing, simple fairness, first-come first-serve, allocation through priority promotion, and software override. The algorithms may be selected through scheduling state bits in co-processor circuitry 107. Thus, the technique used for prioritization is externally controllable. Co-processor circuitry 107 loads core processor 104 registers with the channel descriptor information for the next entry in the work queue.
In stage 4, core processor 104 executes the software application to process the next entry in the work queue which points to a portion of the pending request state RAM that identifies the data buffer and context buffer. The context buffer indicates one or more service addresses that direct the core processor 104 to the proper functions within the software application. One such function of the software application is traffic shaping to conform to service level agreements. Other functions include header manipulation and translation, queuing algorithms, statistical accounting, buffer management, inter-working, header encapsulation or stripping, cyclic redundancy checking, segmentation and reassembly, frame relay formatting, multicasting, and routing. Any context information changes made by the core processor are linked back to the context buffer in real time.
In stage 5, context is flushed. Typically, core processor 104 instructs co-processor circuitry 107 to transfer packets to off-chip memory 103 or transmit interface 108. If no other data buffers are currently associated with the pertinent context information, then co-processor circuitry 107 transfers the context information to off-chip memory 103.
FIG. 3 is a block diagram that illustrates co-processor circuitry 107 in an example of the invention. Co-processor circuitry 107 comprises a hardware engine that is firmware-programmable in that it operates in response to state bits and register content. In contrast, core processor 104 is a micro-processor that executes application software. Co-processor circuitry 107 operates in parallel with core processor 104 to conserve core processor 104 capacity by off-loading numerous tasks from the core processor 104. Co-processor circuitry 107 comprises context resolution 310, control 311, arbiter 312, priority queues 313, data buffers 314, context buffers 315, context DMA 316, and data DMA 317. Data buffers 314 hold packets and context buffers 315 hold context information, such as a channel descriptor. Data buffers 314 are relatively small and of a fixed size, such as 64 bytes, so if the packets are ATM cells, each data buffer holds only a single ATM cell and ATM cells do not cross data buffer boundaries.
Individual data buffers 314 are associated with individual context buffers 315 as indicated by the downward arrows. Priority queues 313 hold entries that represent individual data buffers 314 as indicated by the upward arrows. Thus, a packet in one of the data buffers is associated with its context information in an associated one of the context buffers 315 and with an entry in priority queues 313. Arbiter 312 presents a next entry from priority queues 313 to core processor 104 which handles the associated packet in the order determined by arbiter 312. Context DMA 316 exchanges context information between memory 103 and context buffers 315 through memory interface 109. Context DMA 316 automatically updates queue pointers in the context information. Data DMA 317 exchanges packets between data buffers 314 and memory 103 through memory interface 109. Data DMA 317 also transfers packets from memory 103 to transmit interface 108 through memory interface 109. Data DMA 317 signals context DMA 316 when transferring packets off-chip, and context DMA 316 determines if the associated context should be transferred to off-chip memory 103. Both DMAs 316-317 may be configured to perform CRC calculations.
For a new packet from communication system 101, control 311 receives the new packet and a request to handle the new packet from receive interface 106. Control 311 receives and places the packet in one of the data buffers 314 and transfers the packet header to context resolution 310. Based on gap state bits, a gap in the packet may be created between the header and the payload in the data buffer, so core processor 104 can subsequently write encapsulation information to the gap without having to create the gap. Context resolution 310 processes the packet header to correlate the packet with a channel descriptor, although in some cases, receive interface 106 may have already performed this context resolution. The channel descriptor comprises information regarding packet transfer over a channel.
Control 311 determines if the channel descriptor that has been correlated with the packet is already in one of the context buffers 315 and is valid. If so, control 311 does not request the channel descriptor from off-chip memory 103. Instead, control 311 associates the particular data buffer 314 holding the new packet with the particular context buffer 315 that already holds the correlated channel descriptor. This prevents multiple copies of the channel descriptor from existing in context buffers 315. Control 311 then increments an in-use count for the channel descriptor to track the number of data buffers 314 that are associated with the same channel descriptor.
If the correlated channel descriptor is not in context buffers 315, then control 311 requests the channel descriptor from context DMA 316. Context DMA 316 transfers the requested channel descriptor from off-chip memory 103 to one of the context buffers 315 using the channel descriptor identifier, which may be an address, that was determined during context resolution. Control 311 associates the context buffer 315 holding the transferred channel descriptor with the data buffer 314 holding the new packet to maintain the correlation between the new packet and the channel descriptor. Control 311 also sets the in-use count for the transferred channel descriptor to one and sets the validity bit to indicate context information validity.
Control 311 also determines a priority for the new packet. The priority may be determined by the source of the new packet, header information, or channel descriptor. Control 311 places an entry in one of priority queues 313 based on the priority. The entry indicates the data buffer 314 that has the new packet. Arbiter 312 implements an arbitration scheme to select the next entry for core processor 104. Core processor 104 reads the next entry and processes the associated packet and channel descriptor in the particular data buffer 314 and context buffer 315 indicated in the next entry.
Each priority queue has a service-to-completion bit and a sleep bit. When the service-to-completion bit is set, the priority queue has a higher priority than any priority queues without the service-to-completion bit set. When the sleep bit is set, the priority queue is not processed until the sleep bit is cleared. The ranking of the priority queue number breaks priority ties. Each priority queue has a weight from 0-15 to ensure a certain percentage of core processor handling. After an entry from a priority queue is handled, its weight is decremented by one if the service-to-completion bit is not set.
The weights are re-initialized to a default value after 128 requests have been handled or if all weights are zero. Each priority queue has a high and low watermark. When outstanding requests that are entered in a priority queue exceed its high watermark, the service-to-completion bit is set. When the outstanding requests fall to the low watermark, the service-to-completion bit is cleared. The high watermark is typically set at the number of data buffers allocated to the priority queue.
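The priority queue rules above (service-to-completion and sleep bits, weights, watermarks, and the re-initialization after 128 requests) can be combined into one illustrative arbiter. Several points are assumptions not stated in the description: that a lower queue number ranks higher, that queues with a non-zero weight are preferred over queues whose weight has reached zero, and the class and attribute names themselves.

```python
from collections import deque


class PriorityQueue:
    """One priority queue with its control bits, weight, and watermarks."""

    def __init__(self, number, weight, high, low):
        self.number = number
        self.default_weight = weight
        self.weight = weight
        self.high, self.low = high, low
        self.entries = deque()
        self.stc = False     # service-to-completion bit
        self.sleep = False   # sleep bit

    def push(self, entry):
        self.entries.append(entry)
        if len(self.entries) > self.high:   # above the high watermark
            self.stc = True


class Arbiter:
    RESET_INTERVAL = 128  # requests handled before weights are re-initialized

    def __init__(self, queues):
        # Assumption: a lower queue number ranks higher when breaking ties.
        self.queues = sorted(queues, key=lambda q: q.number)
        self.handled = 0

    def _reset_weights(self):
        for q in self.queues:
            q.weight = q.default_weight
        self.handled = 0

    def next_entry(self):
        ready = [q for q in self.queues if q.entries and not q.sleep]
        if not ready:
            return None
        if all(q.weight == 0 for q in self.queues):
            self._reset_weights()
        stc = [q for q in ready if q.stc]
        weighted = [q for q in ready if q.weight > 0]
        queue = (stc or weighted or ready)[0]
        entry = queue.entries.popleft()
        if not queue.stc:
            queue.weight = max(0, queue.weight - 1)
        elif len(queue.entries) <= queue.low:   # fell to the low watermark
            queue.stc = False
        self.handled += 1
        if self.handled >= self.RESET_INTERVAL:
            self._reset_weights()
        return entry


q1 = PriorityQueue(1, weight=1, high=5, low=0)
q2 = PriorityQueue(2, weight=1, high=1, low=0)
q1.push("low")
q2.push("x")
q2.push("y")          # exceeds q2's high watermark, so its bit is set
arbiter = Arbiter([q1, q2])
order = [arbiter.next_entry() for _ in range(3)]
```

In the usage example, queue 2 is drained first despite its lower rank because its service-to-completion bit was set by the high watermark, and the bit clears once the queue falls to its low watermark.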
The context buffers 315 each have an associated in-use counter. The in-use counters associated with the context buffers are not shown in Figure 3, but they are shown in Figure 8.
Core processor 104 may instruct control 311 to transfer the packet to off-chip memory 103 through data DMA 317. Control 311 decrements the context buffer in-use count, and if the in-use count is zero (no data buffers 314 are associated with the context buffer 315 holding the channel descriptor), then control 311 instructs context DMA 316 to transfer the channel descriptor to off-chip memory 103. Control 311 also clears the validity bit. This same general procedure is followed when scheduler 105 requests packet transmission, except that in response to the request from scheduler 105, control 311 instructs data DMA 317 to transfer the packet from memory 103 to one of data buffers 314.
FIG. 4 is a flow diagram that illustrates the operation of co-processor circuitry 107 when correlating buffers in an example of the invention. Co-processor circuitry 107 has eight pipelines to concurrently process multiple packet streams in accord with FIG. 3.
First, a packet is stored in a data buffer, and the packet is correlated to a channel descriptor as identified by a channel identifier. The channel descriptor comprises the context information regarding how packets in one of 64,000 different channels are to be handled.
Next, context buffers 315 are checked for a valid version of the correlated channel descriptor. This entails matching the correlated channel identifier with a channel identifier in a context buffer that is valid. If the correlated channel descriptor is not in a context buffer that is valid, then the channel descriptor is retrieved from memory 103 and stored in a context buffer using the channel identifier. The data buffer holding the packet is associated with the context buffer holding the transferred channel descriptor. An in-use count for the context buffer holding the channel descriptor is set to one. A validity bit for the context buffer is set to indicate that the channel descriptor in the context buffer is valid. If the correlated channel descriptor is already in a context buffer that is valid, then the data buffer holding the packet is associated with the context buffer already holding the channel descriptor. The in-use count for the context buffer holding the channel descriptor is incremented.
Typically, core processor 104 instructs co-processor circuitry 107 to transfer packets to off-chip memory 103 or transmit interface 108. Data DMA 317 transfers the packet and signals context DMA 316 when finished. Context DMA 316 decrements the in-use count for the context buffer holding the channel descriptor, and if the decremented in-use count equals zero, then context DMA 316 transfers the channel descriptor to memory 103 and clears the validity bit for the context buffer. The effect of DMA operations on the in-use counts of the special state buffers and the data buffers in some embodiments is explained later with reference to Figures 11A and 11B.
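The correlation procedure of FIG. 4 can be modelled roughly as follows. This is a minimal sketch under stated assumptions: the class names ContextBuffer and ContextCache and the method names acquire and release are hypothetical, a Python dictionary stands in for off-chip memory 103, and a buffer is treated as free whenever its validity bit is clear.

```python
class ContextBuffer:
    def __init__(self):
        self.channel_id = None
        self.descriptor = None
        self.valid = False      # validity bit
        self.in_use = 0         # in-use count


class ContextCache:
    def __init__(self, num_buffers, memory):
        self.buffers = [ContextBuffer() for _ in range(num_buffers)]
        self.memory = memory    # dict: channel_id -> descriptor (memory 103)

    def acquire(self, channel_id):
        """Correlate a newly stored packet with its channel descriptor."""
        # Hit: a valid context buffer already holds this channel descriptor.
        for index, buf in enumerate(self.buffers):
            if buf.valid and buf.channel_id == channel_id:
                buf.in_use += 1
                return index
        # Miss: fetch the descriptor from memory into a free context buffer.
        for index, buf in enumerate(self.buffers):
            if not buf.valid:
                buf.channel_id = channel_id
                buf.descriptor = self.memory[channel_id]
                buf.valid = True
                buf.in_use = 1
                return index
        raise RuntimeError("no free context buffer")

    def release(self, index):
        """Called when the packet tied to this context buffer has been moved."""
        buf = self.buffers[index]
        buf.in_use -= 1
        if buf.in_use == 0:
            # Write the descriptor back and clear the validity bit.
            self.memory[buf.channel_id] = buf.descriptor
            buf.valid = False
```

Two packets on the same channel share one context buffer in this model, and the descriptor is written back only when the second of them has been released.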
Memory Interface 109 -- FIGS. 5-6
FIGS. 5-6 depict a specific example of memory interface circuitry in accord with the present invention. Those skilled in the art will appreciate that numerous variations from the circuitry shown in this example may be made. Furthermore, those skilled in the art will appreciate that some conventional aspects of FIGS. 5-6 have been simplified or omitted for clarity. FIG. 5 is a block diagram that illustrates memory interface 109. Memory interface 109 comprises a hardware circuitry engine that is firmware-programmable in that it operates in response to state bits and register content. In contrast, core processor 104 is a micro-processor that executes application software. Memory interface 109 operates in parallel with core processor 104 to conserve core processor 104 capacity by off-loading numerous tasks from the core processor 104.
Both FIG. 1 and FIG. 5 show memory 103, core processor 104, co-processor circuitry 107, transmit interface 108, and memory interface 109. Memory 103 comprises Static RAM (SRAM) 525 and Synchronous Dynamic RAM (SDRAM) 526, although other memory systems could also be used. SDRAM 526 comprises pointer stack 527 and external buffers 528. Memory interface 109 comprises buffer management engine 520, SRAM interface 521, and SDRAM interface 522. Buffer management engine 520 comprises pointer cache 523 and control logic 524. Conventional components could be used for SRAM interface 521, SDRAM interface 522, SRAM 525, and SDRAM 526. SRAM interface 521 exchanges context information between SRAM 525 and co-processor circuitry 107. External buffers 528 use a linked list mechanism to store communication packets externally to integrated circuit 100. Pointer stack 527 is a cache of pointers to free external buffers 528 that is initially built by core processor 104. Pointer cache 523 stores pointers that were transferred from pointer stack 527 and correspond to external buffers 528. Sets of pointers may be periodically exchanged between pointer stack 527 and pointer cache 523. Typically, the exchange from stack 527 to cache 523 operates on a first-in/first-out basis.
In operation, core processor 104 writes pointers to free external buffers 528 to pointer stack 527 in SDRAM 526. Through SDRAM interface 522, control logic 524 transfers a subset of these pointers to pointer cache 523. When an entity, such as core processor 104, co-processor circuitry 107, or an external system, needs to store a packet in memory 103, the entity reads a pointer from pointer cache 523 and uses the pointer to transfer the packet to external buffers 528 through SDRAM interface 522. Control logic 524 allocates the external buffer as the corresponding pointer is read from pointer cache 523. SDRAM stores the packet in the external buffer indicated by the pointer. Allocation means to reserve the buffer, so other entities do not improperly write to it while it is allocated.
When the entity no longer needs the external buffer (for example, because the packet has been transferred from memory 103 through SDRAM interface 522 to co-processor circuitry 107 or transmit interface 108), the entity writes the pointer to pointer cache 523. Control logic 524 de-allocates the external buffer as the corresponding pointer is written to pointer cache 523. De-allocation means to release the buffer, so other entities may reserve it. The allocation and de-allocation process is repeated for other external buffers 528.
Control logic 524 tracks the number of the pointers in pointer cache 523 that point to de-allocated external buffers 528. If the number reaches a minimum threshold, then control logic 524 transfers additional pointers from pointer stack 527 to pointer cache 523. Control logic 524 may also transfer an exhaustion signal to core processor 104 in this situation. If the number reaches a maximum threshold, then control logic 524 transfers an excess portion of the pointers from pointer cache 523 to pointer stack 527.
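The interplay between pointer stack 527 and pointer cache 523 can be illustrated with a small Python model. It is a sketch only: the class name BufferManagementEngine, the batch parameter (how many pointers move per exchange), and the exhaustion_signals counter are hypothetical, and the model assumes that batch is smaller than the maximum threshold.

```python
from collections import deque


class BufferManagementEngine:
    def __init__(self, free_pointers, min_threshold, max_threshold, batch):
        self.pointer_stack = deque(free_pointers)  # stands in for stack 527
        self.pointer_cache = deque()               # stands in for cache 523
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.batch = batch                 # pointers moved per exchange (assumed)
        self.exhaustion_signals = 0        # signals sent to the core processor
        self._refill()                     # initial transfer of a subset

    def _refill(self):
        # Stack-to-cache exchange on a first-in/first-out basis.
        for _ in range(min(self.batch, len(self.pointer_stack))):
            self.pointer_cache.append(self.pointer_stack.popleft())

    def allocate(self):
        """Reading a pointer from the cache allocates its external buffer."""
        if not self.pointer_cache:
            raise RuntimeError("no free external buffers")
        pointer = self.pointer_cache.popleft()
        if len(self.pointer_cache) <= self.min_threshold:
            self.exhaustion_signals += 1
            self._refill()
        return pointer

    def deallocate(self, pointer):
        """Writing a pointer back to the cache releases its external buffer."""
        self.pointer_cache.append(pointer)
        if len(self.pointer_cache) >= self.max_threshold:
            # Return an excess portion of the pointers to the stack.
            for _ in range(self.batch):
                self.pointer_stack.append(self.pointer_cache.pop())
```

Callers only ever touch the small on-chip cache in this model; traffic to the stack in SDRAM happens in batches when a threshold is reached.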
Figure 6 illustrates in more detail the registers 603A, 603B and 603C in core processor 104 and the interface transfer queue 602 between core processor 104 and co-processor 107. In the embodiment of the invention described herein, there are sixty four registers 0 to 63 available to a user of the system. Registers 0 to 29 are used to store general state information and registers 30 to 63 are used to store "context information", "data information", and "event specific state information". There is also a shadow set of registers that corresponds to registers 30 to 63. Thus, with reference to Figure 6, in general, when the core processor 104 is processing a series of events, the first event uses registers A & B, the next event uses registers A & C, the next event uses registers A & B, the next event uses registers A and C, etc. Thus, at any one particular time, one set of registers (either B or C) is the active set of registers and at the same time the other set of registers (either B or C) is a shadow set of registers that is being loaded for the next event, which will be processed. In general, register sets B and C alternate as the active and shadow register sets. In some embodiments, the registers 603A, 603B and 603C are low latency memory. In some embodiments, the data buffers in co-processor 107 are medium latency memory. In some embodiments, the off chip memory 103 is a high latency memory. Thus, some embodiments of the invention make possible the increased use of the low latency memory available to the core processor 104.
The data buffers 314 and the context buffers 315 are part of the control of the co-processor 107. The co-processor 107 can read data and context from the cache memory via memory interface 109 and provide the data and context to the core processor 104 over the data bus indicated by the arrow 601A. While an event is being processed using registers A and B, registers C are loaded with data and context information needed to process the next event. In some embodiments, the registers shown in Figure 6 are not a cache memory. The registers shown in Figure 6 are the on-chip registers, which are part of the core processor 104. The pre-fetch block 601 shown is responsible for controlling the co-processor pre-fetch processing. Based on signals from the core processor 104 and the state of the current pre-fetch, this unit indicates to the work queue selection logic (312) when to select the top element from the work queue and to return the identifying parameters back to the pre-fetch logic block. Based on these parameters, the pre-fetch block controls the reading of the appropriate "context" and "data" buffers and the sending of the data to the core processor 104. Event processing does not always require that the full "context" and "data" buffers be pre-fetched to the core processor 104, so the pre-fetch unit allows the core processor 104 to configure the amount of the "context" and "data" buffers that is sent by the pre-fetch logic to the core processor 104. In the current implementation, a different configuration can be attached to each of the priority queues (313), and the priority queue picked by the selection logic determines which configuration is used. However, it will be appreciated by those skilled in the art that this configuration information could be supplied in a different manner, such as having a global register or allowing each service address to indicate to the pre-fetch unit the maximum amount of pre-fetched information it could need. The pre-fetch logic 601 also indicates to the core processor 104 whether to swap to the shadow register set when the core processor 104 begins processing a new event. Typically, the core processor 104 swaps to the shadow register set; however, there are special conditions, as described later in this document, under which the pre-fetch logic 601 can determine that the core processor 104 should not swap to the shadow register set. The program running on the core processor 104 can, in certain cases, determine in advance that it should always or never swap to the shadow "context" or "data" register set. The core processor 104 can indicate this by setting the configuration bits in the pre-fetch logic 601, which force the logic to indicate always, never, or only when appropriate that the core processor 104 should swap to the shadow register set. For instance, in the case where the pre-fetched "data" registers are never being used, the core processor 104 could configure the pre-fetch logic 601 to indicate that the core processor 104 should never swap to the "data" shadow register set.
In this case, the core processor 104 would then be free to use the "data" registers for other purposes. As with the above described case, the configuration bits for this option are associated with each priority queue, and hence, the configuration bits used are determined by the priority queue which is selected.
Another function associated with the pre-fetch logic 601 is to determine the service address associated with the pre-fetched event. In the current implementation, the pre-fetch logic 601 can pick the service address from a set of fixed addresses or from the "context" data which is being fetched. The location the pre-fetch logic 601 uses to pick the service address, the service address selection field, is configured on a per priority queue basis, and hence this field is determined by the priority queue selected. In addition, the resource which initiates an event can also pass a field which is used to modify the service address selection field just for the selection of this particular event's service address. Various functions could be used to combine the field the resource supplied with the field stored in the configuration registers. The function which has been implemented was exclusive-or. Other possible choices could have been addition, and or replacement.
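The combination of the per-queue selection field with the field supplied by the event's resource can be sketched as below. Only the exclusive-or combination comes from the text. The encoding of the selection field is not given, so the layout used here (a hypothetical FROM_CONTEXT flag bit plus a three-bit index) and all function and parameter names are assumptions made purely for illustration.

```python
FROM_CONTEXT = 0x8   # hypothetical flag bit: take the address from the context
INDEX_MASK = 0x7     # hypothetical index into the chosen address source


def combine(selection_field, event_field):
    # The text states that the implemented combining function is exclusive-or.
    return selection_field ^ event_field


def service_address(selection_field, event_field, fixed_addresses, context_words):
    """Pick the service address for one event.

    selection_field -- per-priority-queue configuration value
    event_field     -- modifier passed by the resource that initiated the event
    fixed_addresses -- table of fixed service addresses
    context_words   -- words of the "context" data being pre-fetched
    """
    field = combine(selection_field, event_field)
    index = field & INDEX_MASK
    if field & FROM_CONTEXT:
        return context_words[index]
    return fixed_addresses[index]
```

With exclusive-or, an event field of zero leaves the queue's configured selection unchanged, so a resource that has nothing to say does not need special handling.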
The overall operation of the pre-fetch system is illustrated in Figure 7. The process begins at some point with the state indicated by block 701. As indicated by block 701A, the context and data are stored in buffers 314 and 315 using the methods previously described and the core processor 104 is using an active register set. As indicated by block 701B, the core processor 104 needs to pre-fetch the initial event's data into its shadow register set. This initial pre-fetch is performed using what is termed the BRSLPRE instruction. This instruction indicates to the co-processor 107 to pre-fetch data for the next event into the shadow register file, and to send the corresponding service address. This core processor 104 instruction does not change the program flow of the core processor 104, but rather serves as a way to initialize or reinitialize the event information stored in the shadow register file. As indicated by block 701C, after issuing the BRSLPRE instruction, the core processor 104 is now ready to begin event processing. The core processor 104 sends a command to the co-processor 107 to fetch the top entry on the work queue 313 into the shadow register and to send the next service address. In addition, the core processor 104 prepares to branch to the previously pre-fetched service address. This is termed a BRSL instruction.
As indicated by step 702A, the core processor 104 determines if the service address for the shadow register has been fetched. If not, the core processor 104 stalls until the service address for the shadow register has been fetched, as indicated by step 702B. It should be noted that the service address in question is not the service address determined by the immediately preceding BRSL instruction (701C), but rather by the earlier BRSL instruction (701B), which initiated the pre-fetch of data into the shadow register set. As indicated by step 703, when the service address for the shadow register has been fetched, the core processor 104 switches to the shadow register set and branches to the appropriate service address.
The core processor 104 then performs event processing using the then active register set as indicated by step 704. It is noted that all the requested data does not need to be pre-fetched into the core processor shadow register set before the core processor 104 can switch to this register set. The pre-fetching of data into a register set can happen concurrently with the processing of an event using this register set. If the data required by the event processing has not yet been pre-fetched, the core processor 104 operation is automatically stalled or suspended until the data becomes available.
As indicated by step 705, after performing the processing required by an event, the core processor 104 next sends a command to co-processor 107 to fetch the top entry in the work queue 313 into the shadow registers and to set the next service address. In addition, the core processor 104 begins to branch to the previously pre-fetched service address. As indicated previously, this can be described as performing a BRSL instruction.
As indicated by steps 706 and 706A, the core processor 104 cannot branch to a new service address until the active register pre-fetching operation has been finished. The core processor 104 operation is stalled until this pre-fetching has been finished. Finishing the pre-fetch may consist of terminating the pre-fetch or allowing the pre-fetch to complete. The process then repeats using steps 702 through 706 described above.
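The alternation between the active and shadow register sets in the Figure 7 loop can be illustrated with a sequential Python model. It is a deliberately simplified sketch: the function name run_events and its trace output are hypothetical, pre-fetches complete instantly (so the stalls of steps 702B and 706A never occur), and service addresses and the special cases described below are not modelled.

```python
from collections import deque


def run_events(events):
    """Return (register set, event) pairs in the order events are processed."""
    pending = deque(events)                # stands in for the work queue
    registers = {"B": None, "C": None}     # the two alternating register sets
    active, shadow = "C", "B"              # chosen so the first event uses B
    trace = []

    def prefetch(target):
        # Load the next event's data and context into the given register set.
        registers[target] = pending.popleft() if pending else None

    prefetch(shadow)                       # BRSLPRE: initial pre-fetch
    while registers[shadow] is not None:
        # BRSL: swap to the shadow set and branch to its service address ...
        active, shadow = shadow, active
        # ... while the co-processor starts filling the new shadow set.
        prefetch(shadow)
        trace.append((active, registers[active]))   # event processing
    return trace
```

For three events the model processes them on register sets B, C, and B in turn, matching the alternation described for Figure 6.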
In some embodiments, the pre-fetch logic 601 handles two special situations. One situation is when back-to-back events are taken which use the same "context" and/or "data" information. Since the core processor 104 can be updating the "context" and "data" information while the next event's "data" and "context" are being pre-fetched, if the next event context is the same as the current context, the pre-fetched context is not assured to reflect all the changes the processing core has made to the context (i.e., the pre-fetched data can be stale). Since in this situation the current registers do reflect all the changes the core processor 104 has made to the context, there is no need to swap to the shadow register set, and the BRSL instruction (blocks 701B and 705 in Figure 7) does not switch to the shadow register set in this situation. Determining the appropriate service address in this situation also requires some special handling. If the work queue is set to extract the service address from the "context" and the processing core changes this service address, then the service address determined by the pre-fetch logic 601 might be stale. In order to avoid this problem, in a first embodiment a mode has been added to the work queue selection hardware 312 which does not allow back-to-back events from the same work queue. This allows the programmer to avoid the case described above. In a second embodiment, back-to-back events are allowed, but a write to a BRSL interlock address is issued after the service address has been changed. Writing to this address stalls the next BRSL instruction until the BRSL interlock address write has left the queue 602 shown in Figure 6. Since the BRSL interlock address write happened after the service address update, the service address update must have cleared this queue as well.
There is pre-fetch logic 601 which snoops the output of the queue 602 and checks for writes which will affect the service address of the currently pre-fetched event (indicated as 602A on Figure 6). If such a write is detected, the logic updates the next service address appropriately. Writing to the BRSL interlock address after the service address has changed assures that this snooping logic will be able to update the service address before the BRSL instruction uses this service address.
Another special situation for the pre-fetch mechanism (hereinafter referred to as the ABA case) occurs when the pre-fetch is for a context that was used not in the previous event but in the event before the previous event. The case is further complicated by the fact that writes from core processor 104 to the on-chip "context" storage go through the queue 602. Hence for the ABA case, when pre-fetching for the second "A" event, there could be writes in the queue which affect context "A", which could cause the pre-fetch logic to pre-fetch stale values of context "A". In order to avoid this case, the start of a pre-fetch is delayed until all the writes associated with the event one before the current event have cleared the queue 602. The selection of the event to pre-fetch for is also delayed in the same manner. This allows the writes associated with the first event "A", in the "ABA" case, to affect the selection of the second "A" event.
Figure 8 shows the detailed logic added to the data buffer 314 shown in Figure 3 in an example of the invention. The data buffer 314 includes two sections designated data only buffers 814 and special state information buffers 820. For this embodiment, there are six buffers for data only and six buffers for special state information, shown in the diagram. For other embodiments, there are numerous data buffers and special state information buffers. The data buffers are assigned an index number from zero to the maximum number of data buffers in the co-processor 107. The special state information buffers are also assigned an index from zero to the maximum number of special state information buffers in the co-processor 107. Furthermore, the context buffers are also assigned an index from zero to the maximum number of context buffers in the co-processor 107. These indexes are used by the logic in the co-processor 107 and the core processor 104 to identify an individual context buffer, data buffer, or special state information buffer. In one embodiment, there are sixteen of each of these types of buffers in the co-processor 107. The exact number of each of these buffers is not significant to the general operation of the logic. Each buffer has an associated in-use counter 814-0 to 814-5 and 820-0 to 820-5. The in-use counters keep track of the number of events which are using the data in the particular buffers. Each in-use counter is incremented by one for each event which is using the data or state information in a particular buffer. When an event finishes with a particular buffer, the in-use counter is decremented by one. When the count in an in-use counter reaches zero, no events are using the particular buffer and it can be reallocated. Data buffer resolution logic 822 and PRSR special data resolution logic 821 operate similarly to context buffer resolution 310, which was previously described.
Data buffer resolution logic 822 keeps track of which data buffers 814 are in use and which are available to be assigned to new events. Data buffer resolution logic 822 also contains the logic for incrementing and decrementing the in-use counters associated with the data buffers 814. PRSR special data resolution logic 821 keeps track of which special state information buffers are in use and which are available to be assigned to new events. PRSR special data resolution logic 821 also contains the logic for incrementing and decrementing the in-use counters associated with the special state information buffers.
PRSR special data resolution logic 821 and data buffer resolution logic 822 select a buffer to be assigned to a new event by scanning the in-use counts of all their associated buffers and picking the buffer with the lowest index which has an in-use count of zero. In other embodiments, there are numerous variations in selecting a buffer which has an in-use count of zero to be assigned to a new event. Some examples of selecting a buffer are first-in-first-out selection and last-in-first-out selection.
Context resolution 310 contains the logic used to select the context buffer to be assigned to a new event. A global configuration bit is used to pick which of two mechanisms is used to select the next context buffer to be assigned to a new event. One mechanism picks the context buffer in the same manner as the next data buffer is picked. As previously described, this method returns the context buffer with a zero in-use count which has the lowest index. The problem with this selection mechanism for context buffers is that the selection mechanism tends to select the context buffers that have been most recently freed. For instance, when the context buffer with index zero is freed, it is always the next new index to be selected. Because context information which is not already stored in a context buffer needs to be read in from off-chip memory, under certain conditions it is better not to reuse a context buffer as soon as its in-use count goes to zero.
This problem is addressed by the second context selection mechanism. This mechanism uses a moving "finger" which determines at what index the logic will start searching for an in-use count of zero. The value of the finger is incremented after each new context selection. Hence, for the first new context selection the logic will start searching forward from index zero. For the second new context selection, the logic will start searching forward from index 1, etc.
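The two selection mechanisms can be compared with a short Python sketch. The function pick_lowest_index and the class FingerSelector are hypothetical names, each buffer is represented only by its in-use count, and the wrap-around of the finger and of the search at the end of the index range is an assumption, since the text does not say what happens there.

```python
def pick_lowest_index(in_use):
    """First mechanism: lowest index whose in-use count is zero, else None."""
    for index, count in enumerate(in_use):
        if count == 0:
            return index
    return None


class FingerSelector:
    """Second mechanism: start the search at a moving "finger"."""

    def __init__(self):
        self.finger = 0

    def pick(self, in_use):
        size = len(in_use)
        start = self.finger % size      # wrap-around is an assumption
        self.finger += 1                # incremented after each new selection
        for offset in range(size):
            index = (start + offset) % size
            if in_use[index] == 0:
                return index
        return None
```

If every buffer is free and each selected buffer is released again immediately, the first mechanism returns index zero every time, while the finger spreads successive selections over indexes 0, 1, 2, and so on, which leaves recently freed context buffers intact for longer.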
As is shown by the arrows in Figure 8, the special state information data buffer 820 contains a pointer to an associated data buffer 814 as well as an associated context buffer 315 (hereinafter these will also be referred to as resources). Because of these links, a special state data buffer can be used to identify the resources associated with an event. As shown by the arrows from the special state data buffers to the priority queues 313, a special state data buffer pointer is stored in the appropriate priority queue. This logic was described in more detail above in stage 3 of Figure 3. When the arbiter 312 picks the next entry to service from the priority queue, the arbiter 312 returns a special state data buffer pointer. This pointer is then used by logic associated with the core processor 104 and the co-processor circuitry 107 to identify the context and data buffer resources the event will be using.
In one embodiment, the size of a data buffer 814 is 64 bytes, the size of a context buffer 315 is 64 bytes, and the size of a special state data buffer 820 is 44 bits. As recognized by those skilled in the art, the size of these buffers could be changed without affecting the operation of the logic in Figure 8.
Figure 9 is a block flow diagram showing how a data buffer 814 can be passed from one event to another event in an example of the invention. When a new event begins as indicated by steps 901 and 902, a check is made to determine if the particular event is using a passed data buffer. If the particular event would like to use a "passed" data buffer, the particular data buffer 814 is associated with the event and the in-use counter for the particular data buffer is incremented. Next, as indicated by step 921, the event processing takes place and at the end of the event, the in-use counter of the data buffer is decremented by one in step 922. Next, as indicated by step 923, a check is made to determine if the in-use counter is zero. If the count is zero, the buffer is freed and can be assigned to a new event as indicated by step 925. If the count is not zero, as indicated by step 924, the buffer is not freed since the buffer is still in use by some other event.
Figure 10 is a block flow diagram showing how state information is passed between events in an example of the invention. As indicated by step 1002, a determination is made as to whether or not an event is passing "state" information. If state information is not being passed, the operation proceeds as indicated by steps 1010 to 1015. A new state information buffer is selected from the unused pool of buffers as indicated by step 1010. Next, as indicated by step 1011, the event is performed. At the end of the event, the in-use counter is decremented by one (step 1012) and a check is made to determine if the count is zero at step 1013. If the count is zero, the buffer is free to be assigned as indicated by step 1015. Otherwise, the buffer is not freed as indicated by block 1014. The operations that occur when "state" information is passed from one event to another event are indicated by steps 1004 to 1008. When "state" information is passed from one event to another event, the information in the data only buffer 814 is also passed between the events. This is indicated by steps 1004 and 1005. The event proceeds as indicated by step 1006, and at the end of the event, as indicated by steps 1007 and 1012, the in-use counters of the data only buffer 814 and the state information buffer 820 are each decreased by one. As indicated by steps 1008, 1008-a, 1008-b and 1013 to 1015, a check is then made to determine if the in-use counters have reached zero to determine if the buffers can be re-assigned.
An event can pass data or special state information associated with one event to a new event, which does not share the same context information. Such transfers are possible because the state information is stored in a buffer that is separate from the data buffer. An event can also pass a multi-bit message from a current event to a subsequent event that is generated by the current event. This message is stored in the special state buffer of the subsequent event.
Figures 11A and 11B illustrate examples of how one embodiment of the invention operates. The horizontal dimension in Figures 11A and 11B represents time. Figure 11A illustrates how the in-use counts for a data buffer change for an event which submits a DMA command in an example of the invention. The process begins at step 1101. It is assumed that at this point the in-use count of the data buffer is one. While the event posted as indicated by step 1101 is progressing, steps 1102 and 1103 indicate that two DMA transfers are submitted. The data buffer count is incremented to two by the first DMA command and to three by the second DMA command. As indicated by step 1104, when the first DMA transfer finishes, the in-use count is reduced to two. When the event posted as indicated by block 1101 is complete, the in-use count is reduced to one as indicated by block 1105. Finally, when the second DMA transfer is complete, the in-use count is reduced to zero as indicated by step 1106. Conventional logic is provided in co-processor circuitry 107 to handle the changes to the in-use counts as described.
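The Figure 11A sequence can be replayed as simple arithmetic on the in-use count. The following Python sketch is illustrative only; the function name replay and the step labels are hypothetical, while the increments and decrements follow the sequence described above.

```python
def replay(steps, initial=0):
    """Apply (label, delta) steps to an in-use count; return each new value."""
    count = initial
    history = []
    for _label, delta in steps:
        count += delta
        history.append(count)
    return history


FIGURE_11A = [
    ("1101 event posted", +1),
    ("1102 first DMA transfer submitted", +1),
    ("1103 second DMA transfer submitted", +1),
    ("1104 first DMA transfer finished", -1),
    ("1105 event complete", -1),
    ("1106 second DMA transfer finished", -1),
]

counts_11a = replay(FIGURE_11A)   # [1, 2, 3, 2, 1, 0]
```

The count reaches zero only at the last step, so the data buffer stays reserved for the second DMA transfer even though the event that submitted it has already completed.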
Figure 11B indicates how the in-use count of a data buffer changes for an event which creates a shared data buffer in an example of the invention. As in Figure 11A, the horizontal dimension indicates time. The illustrated process begins as indicated by step 1111 with an event being posted. In one embodiment, this event requested a new data buffer. This data buffer would have an initial in-use count of zero and when the event is posted, as indicated by step 1111, the in-use count is increased to one. Step 1121 represents another event request, which is posted as indicated by step 1122. For the event request shown in 1121, the first event passes its data buffer to the second event so the second event starts with a data buffer in-use count of two. This initial in-use count of two is arrived at using multiple steps. When the core processor 104 initiates a request for another event, the data buffer in-use count is immediately incremented by one in order to reserve this data buffer for the next event. In step 1122, the event request is for another core processor event; the co-processor circuitry 107 receives this event request and passes this request to the section of the co-processor logic which handles core processor event requests. This is the same logic which handled the initial event generation indicated in 1101 or 1111. When the event is processed by this section of the co-processor logic, the in-use count of the data buffer is again incremented as this data buffer is assigned to the new event. When this new event is created, the section of the co-processor circuitry 107 that handles event requests signals back to the section of the co-processor circuitry 107 which received this event request from the core processor 104. This section of the co-processor logic now requests that the in-use count of the data buffer be decremented by one.
Hence, there is a total of two increments and one decrement, and the new event is posted with an effective initial data buffer in-use count of two. The system is set up so that if step 1122 is delayed by stalls in the system such that this event request is actually processed after 1112 happens, the data buffer is reserved using in-use counts by the 1121 operation until the 1122 operation can take place. This assures that, independent of the relative timing of 1122 and 1112, there is no time between 1112 and 1122 at which the value of the data buffer's in-use count allows this passed data buffer to be viewed as an unassigned data buffer. The effective reservation of this data buffer by incrementing the in-use count when the event request 1121 is posted assures that no intervening event request can mistakenly view this data buffer as unassigned and reallocate this data buffer.
Step 1112 indicates that when the first event is finished, the data buffer count is reduced to one. Steps 1131 and 1132 indicate a DMA request that is submitted and posted using the same data buffer. As indicated by steps 1131 and 1132, the count is increased to two and then reduced to one when the DMA request is finished. Finally, as indicated by block 1123, when the event posted at block 1122 is finished, the in-use count is reduced to zero and the data buffer can be reassigned to a new event.
It should be noted that the descriptions for the examples given in Figures 11A and 11B explain only the change in the data buffer in-use count. The in-use counts of the context and special state information buffers change in a similar manner.
It should also be noted that the examples given in Figures 11A and 11B are meant to be illustrative examples only. Many other sequences can occur. The point of Figures 11A and 11B is to illustrate that with the present invention, there can be a composition of multiple processing tasks in situations where the subsequent tasks have no idea that any of their resources (data buffer/context buffer/special state buffer) had been processed by a previous service task. The in-use counters keep track of this automatically.
While the invention has been shown and described with respect to preferred embodiments thereof, it will be appreciated by those skilled in the art that various changes in form and detail can be made without departing from the spirit and scope of the invention. Applicant's invention is limited only by the scope of the appended claims.
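The reservation argument for Figure 11B can be checked with a small replay of the in-use count in both orderings of steps 1112 and 1122. This Python sketch is illustrative only: the function name replay, the step labels, and the split of step 1122 into an "assigned" and a "reservation released" entry are modelling choices, not figure elements.

```python
def replay(steps):
    """Apply (label, delta) steps to an in-use count starting at zero."""
    count = 0
    history = []
    for _label, delta in steps:
        count += delta
        history.append(count)
    return history


POSTED = ("1111 first event posted", +1)
RESERVE = ("1121 event request reserves the buffer", +1)
ASSIGN = ("1122 buffer assigned to the new event", +1)
UNRESERVE = ("1122 requesting section releases its reservation", -1)
FIRST_DONE = ("1112 first event finished", -1)
DMA_START = ("1131 DMA request submitted", +1)
DMA_DONE = ("1132 DMA request finished", -1)
SECOND_DONE = ("1123 second event finished", -1)

# Ordering described in the text: 1122 is processed before 1112.
in_order = replay([POSTED, RESERVE, ASSIGN, UNRESERVE,
                   FIRST_DONE, DMA_START, DMA_DONE, SECOND_DONE])

# Stalled ordering: 1122 is processed only after 1112 has happened.
delayed = replay([POSTED, RESERVE, FIRST_DONE, ASSIGN, UNRESERVE,
                  DMA_START, DMA_DONE, SECOND_DONE])
```

In both orderings the count first reaches zero at the final step, which is the property the reservation made at step 1121 is meant to guarantee.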

Claims

We claim:
1. An integrated circuit (100) for processing communication packets, said integrated circuit (100) comprising a core processor (104) and the integrated circuit (100) characterized by: the core processor (104) configured to execute software to process a series of communication packets, the processing of each packet being an event and having associated data and context information, said core processor (104) having two sets of data registers (603B, 603C), each set of data registers being capable of storing the context and data information required to process one event, said core processor (104) using said sets of registers alternatively; and a co-processor (107) comprising a plurality of data buffers (314, 315) configured to store data and context information associated with a plurality of packets, data and context from one event being transferred to one of said sets of data registers (603B, 603C) in said core processor (104) while said core processor (104) is utilizing data and context information stored in a different set of data registers (603B, 603C) in said core processor (104), whereby said processor (104) need not wait between event processing in order to load data in said registers (603B, 603C).
2. The integrated circuit (100) of claim 1 wherein said data registers (603B, 603C) comprise a low latency memory.
3. The integrated circuit (100) of claim 1 wherein said data registers (603B, 603C) comprise a low latency memory and said data buffers (314, 315) in said co-processor (107) comprise a medium latency memory.
4. The integrated circuit (100) of claim 1 wherein said data registers (603B, 603C) comprise a low latency memory, said data buffers (314, 315) in said co-processor (107) comprise a medium latency memory, and said integrated circuit further comprises off chip high latency memory.
5. The integrated circuit (100) of claim 1 wherein said core processor (104) is configured to pre-fetch data into one set of registers while said core processor processes an event using data in another set of registers.
6. The integrated circuit (100) of claim 1 further comprising means for preventing the occurrence of two back to back events which use the same context data.
7. The integrated circuit (100) of claim 1 comprising a queue of events.
8. The integrated circuit (100) of claim 7 comprising logic configured to detect when an event being processed uses the same context as that used by an event immediately preceding the preceding event (the ABA case), and which, in the ABA case, delays the selection of the data for the second A event until data which affects the selection mechanism has been emptied from the queue.
9. The integrated circuit (100) of claim 1 further comprising a work queue configured to prioritize communication packets for processing, said work queue comprising a detector configured to determine if sequential communications packets queued for transmission to said core processor utilize the same context data, and delay transmission of the second such communication packet until processing of the first such communication packet is complete.
10. The integrated circuit (100) of claim 1 wherein the co-processor (107) further comprises a plurality of state information buffers (820) for storing state information associated with events, wherein each of said state information buffers (820) has an in-use counter (820-0) indicating the number of events associated with the contents of said buffer (820).
11. The integrated circuit (100) of claim 10 wherein said co-processor (107) comprises a plurality of context buffers (315) for storing context information associated with a plurality of events.
12. The integrated circuit (100) of claim 11 wherein said co-processor (107) comprises an in-use counter associated with each of said context buffers.
13. The integrated circuit (100) of claim 10 wherein said co-processor (107) comprises a plurality of data buffers (814) for storing data.
14. The integrated circuit (100) of claim 13 wherein said co-processor (107) comprises an in-use counter (814-0) associated with each of said data buffers (814).
15. The integrated circuit (100) of claim 10 wherein said integrated circuit (100) comprises a plurality of data buffers (814) each having an in-use counter (814-0) whereby data can be transferred from one event to another event by changing information in a data buffer (814).
16. The integrated circuit (100) of claim 10 wherein said integrated circuit (100) comprises a plurality of buffers (814) for data associated with events and a plurality of buffers (315) for context associated with events.
17. The integrated circuit (100) of claim 16 wherein said integrated circuit (100) comprises an in-use counter associated with each of said buffers.
18. The integrated circuit (100) of claim 10 wherein said co-processor (107) comprises a plurality of data only information buffers (814), a plurality of context information buffers (315), an in-use counter (814-0) for each of said data only buffers and an in-use counter for each of said context buffers.
19. The integrated circuit (100) of claim 18 where data can be passed from one event to another event by changing the data in one of said state information buffers.
20. A method of processing communication packets in a system which comprises a core processor (104), the method characterized by the core processor (104) comprising a first set of registers (603B) and a shadow set of registers (603C) and the method comprising: in said first set of registers (603B) and said shadow set of registers (603C), storing the context and data necessary to process one communication packet and a co-processor (107) with a plurality of buffers (814, 820, 315) configured to store data and context information necessary to process a plurality of packets; and transferring data and context information associated with a second communication packet from said co-processor (107) to said shadow set of registers (603C) while said core processor (104) is using the data and context information in said first set of registers (603B) to process a first communication packet.
21. The method of claim 20 wherein said registers (603B, 603C) are a low latency memory.
22. The method of claim 20 wherein said data registers (603B, 603C) are a low latency memory and said data buffers (814, 820, 315) in said co-processor (107) are a medium latency memory.
23. The method of claim 20 further comprising: in said core processor (104), pre-fetching data into one set of registers (603B) while said core processor (104) processes an event using data in another set of registers (603C).
24. The method of claim 20 further comprising: preventing the occurrence of two back to back events which use the same context data.
25. The method recited in claim 20 wherein said co-processor (107) includes a queue of events.
26. The method of claim 25 wherein when an event being processed uses the same context as that used by an event immediately preceding the preceding event (the ABA case), the selection of the data for the second A event is delayed until data which affects the selection mechanism has been emptied from the queue.
27. The method of claim 20 wherein the co-processor (107) further comprises a state information buffer (820) for storing state information for an event separate from the data associated with said event, said state information buffer (820) having an associated in-use counter (820-0) and the method further comprises: incrementing the in-use counter (820-0) associated with said state information buffer (820) when an event is associated with said state information buffer (820); and decrementing the in-use counter (820-0) of said state information buffer (820) when said event associated with said buffer (820) is finished.
28. The method of claim 27 wherein said integrated circuit (100) comprises a plurality of state information buffers (820).
29. The method of claim 27 wherein said integrated circuit (100) comprises a context buffer (315) and an in-use counter for said context buffer (315) and the method further comprises: incrementing the in-use counter associated with said context buffer (315) when an event is associated with said context buffer (315); and decrementing the in-use counter of said context buffer (315) when said event associated with said context buffer (315) is finished.
30. The method of claim 27 wherein said integrated circuit (100) comprises a data only buffer (814) to store data associated with an event.
31. The method of claim 27 wherein said integrated circuit (100) comprises a data only buffer (814) to store data associated with an event and an in-use counter (814-0) associated with said data only buffer (814) and the method further comprises: incrementing the in-use counter (814-0) associated with said data buffer (814) when an event is associated with said data buffer (814); and decrementing the in-use counter (814-0) of said data buffer (814) when said event associated with said data buffer (814) is finished.
32. A system for processing communication packets comprising a core processor (104), the system characterized by a co-processor (107), said core processor (104) including two sets of registers (603B, 603C), means in said core processor (104) for initiating pre-fetch of data into one set of said registers (603B, 603C) from said co-processor (107) and for at the same time initiating processing of an event using data in the other set of registers, whereby transfer of data from the co-processor (107) to the core processor (104) can take place at the same time as said core processor (104) is processing data for a different communication packet.
33. The system of claim 32 wherein said registers (603B, 603C) are a low latency memory.
34. The system of claim 32 wherein said data registers (603B, 603C) are a low latency memory and said co-processor (107) includes medium latency memory.
35. The system of claim 32 including means for preventing the occurrence of two back to back events which use the same context data.
36. The system of claim 32 including a queue of events.
37. The system of claim 36 including logic configured to detect when an event being processed uses the same context as that used by an event immediately preceding the preceding event (the ABA case), and which, in the ABA case, delays the selection of the data for the second A event until data which affects the selection mechanism has been emptied from the queue.
38. The system of claim 32 wherein the co-processor (107) further comprises separate buffers (814, 820) for data and state information and in-use counters for all of said buffers (814, 820), whereby the contents of a data buffer can be passed from one event to another event, each of said events having state information in a separate state information buffer (820).
39. The system of claim 38 which includes context information buffers (315).
40. The system of claim 39 which includes in-use counters for said context information buffers (315).
41. The system of claim 37 including a plurality of data buffers (814) and a plurality of state information buffers (820).
42. The system of claim 37 which includes a plurality of data buffers (814), a plurality of state information buffers (820) and a plurality of context information buffers (315), each of said buffers having an in-use counter which is incremented when an event is associated with the buffer and decremented when an event is finished utilizing the buffer.
43. An integrated circuit (100) for processing communication packets, said integrated circuit (100) including a core processor (104), the integrated circuit characterized by: the core processor (104) configured to execute software to process a series of communication packets, the processing of each packet requiring data and context information, said core processor (104) comprising two sets of low latency data registers (603B, 603C), each set of data registers (603B, 603C) configured to store the context and data information required to process one communication packet, said core processor (104) using said sets of registers (603B, 603C) alternatively; a co-processor (107) comprising a plurality of medium latency data buffers (814) configured to store data and context information associated with a plurality of communication packets; and pre-fetch circuitry (601) configured to transfer data and context information required for processing one communication packet from said co-processor (107) to one of said sets of data registers (603B, 603C) in said core processor (104) while said core processor (104) is utilizing data and context information stored in a different set of data registers (603B, 603C) to process a different communication packet.
44. The integrated circuit (100) of claim 43 wherein the co-processor (107) further comprises a plurality of buffers (814, 820, 315) which separately store data, state and context information associated with events, wherein each of said data, state and context buffers (814, 820, 315) has an in-use counter indicating the number of events associated with said buffer.
PCT/US2001/041485 2000-07-31 2001-07-31 Pre-fetching and caching data in a communication processor's register set WO2002011368A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2001285384A AU2001285384A1 (en) 2000-07-31 2001-07-31 Enhancing performance by pre-fetching and caching data directly in a communication processor's register set

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US22182100P 2000-07-31 2000-07-31
US60/221,821 2000-07-31
US09/640,231 2000-08-16
US09/640,231 US6804239B1 (en) 1999-08-17 2000-08-16 Integrated circuit that processes communication packets with co-processor circuitry to correlate a packet stream with context information
US09/639,915 US6888830B1 (en) 1999-08-17 2000-08-16 Integrated circuit that processes communication packets with scheduler circuitry that executes scheduling algorithms based on cached scheduling parameters
US09/640,258 2000-08-16
US09/640,258 US6754223B1 (en) 1999-08-17 2000-08-16 Integrated circuit that processes communication packets with co-processor circuitry to determine a prioritized processing order for a core processor
US09/639,915 2000-08-16

Publications (2)

Publication Number Publication Date
WO2002011368A2 true WO2002011368A2 (en) 2002-02-07
WO2002011368A3 WO2002011368A3 (en) 2002-06-06

Family

ID=27499249

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2001/041485 WO2002011368A2 (en) 2000-07-31 2001-07-31 Pre-fetching and caching data in a communication processor's register set

Country Status (2)

Country Link
AU (1) AU2001285384A1 (en)
WO (1) WO2002011368A2 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100348003C (en) * 2002-06-26 2007-11-07 诺基亚公司 Programmable scheduling for IP routers
GB2466651A (en) * 2008-12-31 2010-07-07 St Microelectronics Security co-processor architecture for decrypting packet streams
US9026790B2 (en) 2008-12-31 2015-05-05 Stmicroelectronics (Research & Development) Limited Processing packet streams
CN109300217A (en) * 2018-09-03 2019-02-01 深圳怡化电脑股份有限公司 Queuing management method, computer storage medium, queuing server and system
CN114185513A (en) * 2022-02-17 2022-03-15 沐曦集成电路(上海)有限公司 Data caching device and chip

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5805927A (en) * 1994-01-28 1998-09-08 Apple Computer, Inc. Direct memory access channel architecture and method for reception of network information
US5920561A (en) * 1996-03-07 1999-07-06 Lsi Logic Corporation ATM communication system interconnect/termination unit

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEE T A ET AL: "Low power data management architecture for wireless communications signal processing" VEHICULAR TECHNOLOGY CONFERENCE, 1998. VTC 98. 48TH IEEE OTTAWA, ONT., CANADA 18-21 MAY 1998, NEW YORK, NY, USA,IEEE, US, 18 May 1998 (1998-05-18), pages 625-629, XP010287765 ISBN: 0-7803-4320-4 *

Also Published As

Publication number Publication date
WO2002011368A3 (en) 2002-06-06
AU2001285384A1 (en) 2002-02-13

Similar Documents

Publication Publication Date Title
US7099328B2 (en) Method for automatic resource reservation and communication that facilitates using multiple processing events for a single processing task
US6822959B2 (en) Enhancing performance by pre-fetching and caching data directly in a communication processor's register set
JP3801919B2 (en) A queuing system for processors in packet routing operations.
US8935483B2 (en) Concurrent, coherent cache access for multiple threads in a multi-core, multi-thread network processor
US6996639B2 (en) Configurably prefetching head-of-queue from ring buffers
US7676588B2 (en) Programmable network protocol handler architecture
US8505013B2 (en) Reducing data read latency in a network communications processor architecture
US8537832B2 (en) Exception detection and thread rescheduling in a multi-core, multi-thread network processor
US6952824B1 (en) Multi-threaded sequenced receive for fast network port stream of packets
US7269179B2 (en) Control mechanisms for enqueue and dequeue operations in a pipelined network processor
US8514874B2 (en) Thread synchronization in a multi-thread network communications processor architecture
US7853951B2 (en) Lock sequencing to reorder and grant lock requests from multiple program threads
US7113985B2 (en) Allocating singles and bursts from a freelist
JP2002505535A (en) Data flow processor with two or more dimensional programmable cell structure and method for configuring components without deadlock
US8087024B2 (en) Multiple multi-threaded processors having an L1 instruction cache and a shared L2 instruction cache
US8910171B2 (en) Thread synchronization in a multi-thread network communications processor architecture
GB2395308A (en) Allocation of network interface memory to a user process
EP1604493A2 (en) Free list and ring data structure management
WO2000013091A1 (en) Intelligent network interface device and system for accelerating communication
US8868889B2 (en) Instruction breakpoints in a multi-core, multi-thread network communications processor architecture
US20120131283A1 (en) Memory manager for a network communications processor architecture
US7039054B2 (en) Method and apparatus for header splitting/splicing and automating recovery of transmit resources on a per-transmit granularity
EP1680743B1 (en) Dynamically caching engine instructions for on demand program execution
WO2002011368A2 (en) Pre-fetching and caching data in a communication processor's register set
US9804959B2 (en) In-flight packet processing

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
121 Ep: the epo has been informed by wipo that ep was designated in this application
AK Designated states

Kind code of ref document: A3

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A3

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase in:

Ref country code: JP