CN100533410C - Software controlled content addressable memory in a general purpose execution datapath - Google Patents

Software controlled content addressable memory in a general purpose execution datapath

Info

Publication number
CN100533410C
CN100533410C, CNB028167740A, CN02816774A
Authority
CN
China
Prior art keywords
content addressable memory
entries
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB028167740A
Other languages
Chinese (zh)
Other versions
CN101137966A (en)
Inventor
M. Rosenbluth
G. Wolrich
D. Bernstein
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Intel Corp
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp
Publication of CN101137966A
Application granted
Publication of CN100533410C

Abstract

A mechanism in a multithreaded processor to allocate resources based on configuration information indicating how many threads are in use.

Description

Software-Controlled Content Addressable Memory in a General-Purpose Execution Datapath
Cross-Reference to Related Applications
This application claims priority to U.S. Provisional Patent Application No. 60/315,144 (Attorney Docket No. 10559-579P01), filed August 27, 2001.
Background of the Invention
To maximize efficiency, a network processor should be able to process packets at least at line rate. Packet processing generally involves writing to and reading from external memory. Because of slow memory access rates, current network processors lack the speed necessary to process packets at line rate.
Brief Description of the Drawings
Fig. 1 is a block diagram of a communication system employing a processor having multithreaded microengines to support multiple threads of execution.
Fig. 2 is a block diagram of a programmable processor datapath (of the microengines of Fig. 1) that includes a CAM.
Fig. 3 is a diagram depicting the microengines as a multi-stage packet processing pipeline.
Fig. 4 is a block diagram of the CAM of Fig. 2.
Fig. 5A is a depiction of queues and queue descriptors in SRAM memory.
Fig. 5B depicts a cache of queue descriptors, and the corresponding tag store implemented using the CAM (of Fig. 4).
Fig. 6 is a flow diagram illustrating an example of CAM use during a queue operation performed by one of the microengines programmed to perform queue management.
Fig. 7 is a flow diagram illustrating an example of CAM use in support of Cyclic Redundancy Check (CRC) processing by a microengine programmed to perform CRC processing in a pipeline.
Detailed Description
Referring to Fig. 1, a communication system 10 includes a processor 12 coupled to one or more I/O devices, e.g., network devices 14 and 16, as well as a memory system 18. The processor 12 is a multithreaded processor and, as such, is especially useful for tasks that can be broken into parallel subtasks or functions. In one embodiment, as shown in the figure, the processor 12 includes multiple microengines 20, each with multiple hardware-controlled program threads that can be simultaneously active and independently work on a task. In the example shown, there are sixteen microengines 20, microengines 20a-20p (corresponding to microengines 0 through 15), and each of the microengines 20 is capable of processing multiple program threads, as will be described more fully below. The maximum number of context threads supported in the illustrated embodiment is eight, but other maximums could be provided. As shown, each microengine 20 is connected to, and can communicate with, an adjacent microengine via next-neighbor lines 21. In the illustrated embodiment, microengines 0-7 are organized into a first cluster (ME Cluster 0) 22a, and microengines 8-15 are organized into a second cluster (ME Cluster 1) 22b.
The processor 12 also includes a processor 24 that assists in loading microcode control for other resources of the processor 12 and performs other general-purpose computer-type functions, such as handling protocols and exceptions, as well as providing support for higher-layer network processing tasks that cannot be handled by the microengines. In one embodiment, the processor 24 is a StrongARM (ARM is a trademark of ARM Limited of the United Kingdom) core-based architecture. The processor (or core) 24 has an operating system through which the processor 24 can call functions to operate on the microengines 20. The processor 24 can use any supported operating system, preferably a real-time operating system. Other processor architectures may be used.
The microengines 20 each operate with shared resources, including the memory system 18, a PCI bus interface 26, an I/O interface 28, a hash unit 30 and scratchpad memory 32. The PCI bus interface 26 provides an interface to a PCI bus (not shown). The I/O interface 28 is responsible for controlling and interfacing the processor 12 to the network devices 14, 16. The memory system 18 includes a dynamic random access memory (DRAM) 34, which is accessed using a DRAM controller 36, and a static random access memory (SRAM) 38, which is accessed using an SRAM controller 40. Although not shown, the processor 12 also would include a nonvolatile memory to support boot operations. The DRAM 34 and DRAM controller 36 are typically used for processing large volumes of data, e.g., processing of payloads from network packets. The SRAM 38 and SRAM controller 40 are used in networking implementations for low-latency, fast-access tasks, e.g., accessing lookup tables, memory for the processor 24, and so forth. The SRAM controller 40 includes a data structure (a queue descriptor cache) and associated control logic to support efficient queue operations, as will be described in further detail below. The microengines 20a-20p can execute memory reference instructions to either the DRAM controller 36 or the SRAM controller 40.
The devices 14 and 16 can be any network devices capable of transmitting and/or receiving network traffic data, such as framing/MAC devices, e.g., for connecting to 10/100BaseT Ethernet, Gigabit Ethernet, ATM or other types of networks, or devices for connecting to a switch fabric. For example, in one arrangement, the network device 14 could be an Ethernet MAC device (connected to an Ethernet network, not shown) that transmits packet data to the processor 12, and the device 16 could be a switch fabric device that receives processed packet data from the processor 12 for transmission onto the switch fabric. In such an implementation, that is, when handling traffic to be sent to a switch fabric, the processor 12 would act as an ingress network processor. Alternatively, the processor 12 could operate as an egress network processor, handling traffic that is received from the switch fabric (via the device 16) and destined for another network device such as the network device 14, or a network coupled to such a device. Although the processor 12 could operate in a standalone mode supporting both traffic directions, it will be understood that, to achieve higher performance, it may be desirable to use two dedicated processors, one as an ingress processor and the other as an egress processor. The two dedicated processors would each be coupled to the devices 14 and 16. In addition, each network device 14, 16 can include a plurality of ports to be serviced by the processor 12. The I/O interface 28 therefore supports one or more types of interfaces, such as an interface for packet and cell transfer between a PHY device and a higher protocol layer (e.g., link layer), or an interface between a traffic manager and a switch fabric for Asynchronous Transfer Mode (ATM), Internet Protocol (IP), Ethernet and similar data communications applications. The I/O interface 28 includes separate receive and transmit blocks, each of which can be separately configured for a particular interface supported by the processor 12.
Other devices, such as a host computer and/or PCI peripherals (not shown) coupled to a PCI bus controlled by the PCI interface 26, can also be serviced by the processor 12.
In general, as a network processor, the processor 12 can interface to any type of communication device or interface that receives or sends large amounts of data. The processor 12, functioning as a network processor, could receive units of packet data from a network device like the network device 14 and process those units of packet data in a parallel manner, as will be described. A unit of packet data could include an entire network packet (e.g., an Ethernet packet) or a portion of such a packet, e.g., a cell or packet segment.
Each of the functional units of the processor 12 is coupled to an internal bus structure 42. Memory busses 44a, 44b couple the memory controllers 36 and 40, respectively, to the respective memory units DRAM 34 and SRAM 38 of the memory system 18. The I/O interface 28 is coupled to the devices 14 and 16 via separate I/O bus lines 46a and 46b, respectively.
Referring to Fig. 2, an exemplary one of the microengines, microengine 20a, is shown. The microengine (ME) 20a includes a control store 50 for storing a microprogram. The microprogram is loadable by the processor 24.
The microengine 20a also includes an execution datapath 54 and at least one general-purpose register (GPR) file 56 that are coupled to the control store 50. The datapath 54 includes several datapath elements, including an ALU 58, a multiplier 59 and a Content Addressable Memory (CAM) 60. The GPR file 56 provides operands to the various datapath processing elements, including the CAM 60. Opcode bits in an instruction select which datapath element is to perform the operation defined by the instruction.
The microengine 20a further includes a write transfer register file 62 and a read transfer register file 64. The write transfer register file 62 stores data to be written to a resource external to the microengine (for example, the DRAM memory or the SRAM memory). The read transfer register file 64 is used for storing return data from a resource external to the microengine 20a. Subsequent to or concurrent with the arrival of data, an event signal from the respective shared resource, e.g., the memory controllers 36, 40, or the core 24, can be provided to alert the thread that the data is available or has been sent. Both of the transfer register files 62, 64 are connected to the datapath 54, as well as to the control store 50.
Also included in the microengine 20a is a local memory 66. The local memory 66 is addressed by registers 68a, 68b, which supply operands to the datapath 54. The local memory 66 also receives results from the datapath 54 as a destination. The microengine 20a further includes local control and status registers (CSRs) 70, coupled to the transfer registers, for storing local inter-thread and global event signaling information and other information, and a CRC unit 72, coupled to the transfer registers, which operates in parallel with the execution datapath 54 and performs CRC computations for ATM cells. The microengine 20a also includes next-neighbor registers 74, coupled to the control store 50 and the execution datapath 54, for storing information received from a previous neighbor ME via the next-neighbor input signal 21a, or from the same ME, as controlled by information in the local CSRs 70.
In addition to providing an output to the write transfer register file 62, the datapath 54 can also provide an output to the GPR file 56 over line 80. Thus, each of the datapath elements, including the CAM 60, can return result values from an executed operation. A next-neighbor output signal 21b to the next-neighbor ME can be provided under the control of the local CSRs 70 for pipelined processing.
Other details of the microengine have been omitted for simplicity. However, it will be appreciated that the microengine would include (and the control store 50 would be coupled to) appropriate control hardware, such as program counters, instruction decode logic and context arbiter/event logic, needed to support multiple threads of execution.
Referring to Fig. 3, an exemplary ME task assignment for a software pipeline model of the processor 12 is shown at 90. The processor 12 supports two pipelines: a receive pipeline and a transmit pipeline. The receive pipeline includes the following stages: re-assembly pointer search ("RPTR") 92, re-assembly information update ("RUPD") 94, receive packet processing (six stages) 96a-96f, metering stages ME1 98 and ME2 100, congestion avoidance ("CA") 102, statistics processing 104 and a queue manager ("QM") 106. The receive pipeline begins with data arriving in a receive block of the I/O interface 28 and ends with transmit queues 107 (stored in SRAM). The transmit pipeline stages include: a TX scheduler 108, the QM 106, a transmit data stage 110 and the statistics processing 104.
The RPTR and RUPD stages work together with the packet processing pipe stages to re-assemble segmented frames into complete packets. The RPTR stage 92 finds the pointer to the re-assembly state information in the SRAM 38 and passes that pointer to the RUPD stage 94. The RUPD stage 94 manages the re-assembly state, which involves allocating DRAM buffers and computing offsets, byte counts and other variables, and provides to the packet processing stage 96 a pointer to the location in DRAM where the network data should be assembled.
The threads of the packet processing stages 96 complete the re-assembly process by writing the data (payload) to the allocated DRAM buffer and also look at the L2 through L7 packet headers to process the packet. These stages are application dependent and can therefore differ from one application to another. For example, one application may support an IP destination lookup to identify the flow, and an access list based on a 7-tuple to determine the destination port.
To support ATM re-assembly, the RX pipeline requires a cyclic redundancy code (CRC) stage in addition to the pipe stages already described. CRC support can be provided by replacing the first packet processing stage (stage 96a, as shown) and including additional information in the re-assembly state table. The CRC stage 96a reads the re-assembly state to obtain the AAL type and CRC residue, validates that the virtual circuit (VC) is configured for AAL5, performs the CRC computation over the cell, and updates the CRC residue in the re-assembly state.
The metering stages 98, 100 are used to monitor the bandwidth of a flow. They check whether each incoming packet is in profile. When a connection is established, a set of parameters is negotiated, e.g., the Committed Information Rate (CIR) and the Committed Burst Size (CBS), which determine the bandwidth used by the flow. The metering function can be implemented according to any one of a number of well-known schemes, such as a token bucket.
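As a rough illustration of the token-bucket scheme mentioned above, the following sketch checks whether a packet is in profile given CIR and CBS parameters. The names, units and API are illustrative assumptions, not the patent's implementation:

```python
class TokenBucket:
    """Token-bucket meter sketch: a packet is 'in profile' when enough
    tokens have accumulated to cover its length. Units are bytes and
    seconds for illustration only."""

    def __init__(self, cir_bytes_per_sec, cbs_bytes):
        self.cir = cir_bytes_per_sec   # Committed Information Rate (refill rate)
        self.cbs = cbs_bytes           # Committed Burst Size (bucket depth)
        self.tokens = cbs_bytes        # bucket starts full
        self.last = 0.0                # time of last update, in seconds

    def conforms(self, pkt_len, now):
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.cbs, self.tokens + (now - self.last) * self.cir)
        self.last = now
        if pkt_len <= self.tokens:     # in profile: spend the tokens
            self.tokens -= pkt_len
            return True
        return False                   # out of profile
```

A flow that bursts up to CBS bytes is admitted immediately; sustained traffic is limited to the CIR.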
The congestion avoidance stage 102 monitors network traffic load in an effort to anticipate and avoid congestion at common network bottlenecks.
The QM 106 is responsible for performing enqueue and dequeue operations on the transmit queues 107 for all packets, as will be described in further detail below.
The receive pipeline threads parse packet headers and perform lookups based on the packet header information. Once the packet has been processed, it is either sent as an exception to be further processed by the core 24, or stored in the DRAM 34 and queued for transmit by placing a link descriptor for it in a transmit queue associated with the transmit (forwarding) port indicated by the header/lookup. The transmit queues are stored in the SRAM 38. The transmit pipeline schedules packets for transmit data processing, which then sends the packet out onto the forwarding port indicated by the header/lookup information obtained during receive processing.
Collectively, the stages 92, 94 and 96a-96f form a functional pipeline. The functional pipeline uses eight microengines (MEs) in parallel, and each of the eight threads (threads 0 through 7) in each ME is assigned a single packet for processing. Consequently, at any one time there are 64 packets in the pipeline. Each stage executes within a period equal to eight packet arrival times, the execution period of its eight threads.
Stages 98, 100, 102, 104, 106, 108 and 110 are context pipe stages and, as such, are each handled by a single (different) ME. Each of the eight threads in each of these stages handles a different packet.
Some of the pipe stages, such as CRC 96a, RUPD 94 and QM 106, operate on a "critical section" of code, that is, a code section for which only one ME thread has exclusive modification privileges for a global resource at any one time. These privileges protect coherency during read-modify-write operations. Exclusive modification privileges between MEs are handled by allowing only one ME (one stage) to modify the section. Thus, the architecture is designed to ensure that an ME does not transition into a critical section stage until a previous ME has completed its processing in the critical section. For example, RUPD 94 is a critical section that requires mutual exclusivity to shared tables in external memory. Thus, when transitioning from RPTR 92 to RUPD 94, thread 0 of the ME executing RUPD 94 will not begin until all threads on the previous ME have completed the previous RUPD pipe stage. In addition, strict thread order execution techniques are employed in the pipeline at critical section code points to ensure sequence management of packets being handled by the different threads.
The processor 12 also supports the use of caching mechanisms to reduce packet processing times and improve the speed at which the processor 12 operates with respect to incoming traffic. For example, the SRAM controller 40 (Fig. 1) maintains a cache of most recently used queue descriptors (stored in the SRAM 38), as will be described further. Also, the local memory 66 (Fig. 2) caches CRC information, such as the CRC residue (also stored in the SRAM 38), used by the CRC stage 96a. If more than one thread in a pipe stage, such as QM 106, needs to modify the same critical data, a latency penalty is incurred if each thread reads the data from external memory (i.e., SRAM), modifies it and writes the data back to external memory. To reduce the latency penalty associated with the read and write, the ME threads can use the ME CAM 60 (Fig. 2) to fold these operations into a single read, multiple modifications and, depending on the cache eviction policy, one or more write operations, as will be described below.
Fig. 4 shows an exemplary embodiment of the CAM 60. The CAM 60 includes a plurality of entries 120. In the illustrated embodiment, there are 16 entries. Each entry 120 has an identifier value (or tag) 122, e.g., a queue number or memory address, that can be compared against an input lookup value. Each entry also includes an entry number 124 and state information 126 associated with the identifier 122 in that same entry. Comparison results 128 are provided to a Status and LRU logic unit 130, which produces a lookup result 132. The lookup result 132 includes a hit/miss indicator 134, state information 136 and an entry number 138. Collectively, the fields 134 and 136 provide a status 140.
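The entry structure and lookup result just described can be modeled roughly in software. The following sketch is an illustrative stand-in, not the hardware interface; the dictionary-free layout (parallel lists plus an LRU order list) is an assumption for clarity:

```python
class SoftwareCAM:
    """Minimal model of a 16-entry CAM: each entry holds a tag and state
    bits; a lookup compares all tags in parallel and yields a
    (hit, state, entry_number) result. On a miss, the entry number
    returned is the least-recently-used (LRU) entry, as an eviction hint."""

    def __init__(self, num_entries=16):
        self.tags = [None] * num_entries
        self.state = [0] * num_entries
        self.lru = list(range(num_entries))  # front = LRU, back = MRU

    def _touch(self, entry):
        # Move an entry to the most-recently-used position.
        self.lru.remove(entry)
        self.lru.append(entry)

    def load(self, entry, tag, state=0):
        # CAM load: write a tag (and state) into an entry, making it MRU.
        self.tags[entry] = tag
        self.state[entry] = state
        self._touch(entry)

    def lookup(self, value):
        for entry, tag in enumerate(self.tags):
            if tag == value:        # hit: state is valid, entry is the match
                self._touch(entry)
                return (True, self.state[entry], entry)
        # Miss: state is not meaningful; the entry number is the LRU entry.
        # A miss does not modify the LRU list.
        return (False, 0, self.lru[0])
```

A caller can branch on the hit flag and, on a miss, use the returned entry number to decide which entry to replace.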
The width of the identifiers 122 is the same as that of the source registers used to provide load values for CAM entries or lookup values, e.g., the registers of the GPR file 56 (Fig. 3). In the illustrated embodiment, the state information 126 is implemented as a state bit. The width and format of the state information, as well as the number of identifiers, are based on design considerations.
During a CAM lookup operation, a value presented from a source such as the GPR file 56 is compared, in parallel, against each identifier 122, with each identifier having a resulting match signal 142. The value of each identifier was previously loaded by a CAM load operation. During that load operation, values from the register file 56 specify which identifier to load and the value to load into it. State information is also loaded into the CAM during the CAM load operation.
A lookup, which compares the identifiers 122 against a lookup value held in a source operand, is issued with an instruction, for example,

Lookup[dest_reg, src_reg].

The source operand specified by the "src_reg" parameter holds the lookup value to be applied to the CAM 60 for the lookup. The destination operand specified by the "dest_reg" parameter is the register that receives the result of the CAM 60 lookup.
All entries 120 are compared in parallel. In one embodiment, the lookup result 132 is a 6-bit value that is written into the specified destination register in bits 8:3, with the other bits of the register set to zero. The destination register can be a register in the GPR file 56. Optionally, the lookup result 132 can also be written into either of the LM_ADDR registers 68a, 68b (Fig. 2) of the ME 22.
For a hit (that is, when the hit/miss indicator 134 of the result 132 indicates a hit), the entry number 138 is the entry number of the entry that matched. When a miss occurs, and the hit/miss indicator 134 thus indicates a miss, the entry number 138 is the entry number of the least-recently-used (LRU) entry in the CAM array. The state information 136 is only useful for a hit, and includes the value in the state field 126 of the entry that hit.
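The placement of the 6-bit result into bits 8:3 of the destination register can be sketched as follows. The split of the 6 bits into a hit flag, a state bit and a 4-bit entry number is an assumption made for illustration; the text only specifies the field widths of the register, not the internal layout:

```python
def pack_lookup_result(hit, state_bit, entry_number):
    """Illustrative packing of a 6-bit CAM lookup result into bits 8:3
    of a destination register, with all other bits zero. The internal
    bit assignment (hit | state | 4-bit entry) is hypothetical."""
    result6 = (hit << 5) | (state_bit << 4) | (entry_number & 0xF)
    return result6 << 3          # place the 6-bit result into bits 8:3

def unpack_entry_number(reg):
    # Recover the 4-bit entry number from the packed register value.
    return (reg >> 3) & 0xF
```

Writing the result at a fixed shift lets software use the register value directly, for example as a scaled offset into a block of local memory.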
The LRU logic 130 maintains a time-ordered list of CAM entry usage. When an entry is loaded, or matches on a lookup, it is moved to the most-recently-used (MRU) position. A lookup that misses does not modify the LRU list.
All application can be used and hit/miss and indicate 134.Entry number 138 and status information 136 provide and can use the out of Memory that uses by some.When missing, for example, the LRU entry number can be as the prompting of cache expulsion (eviction).Do not need software to use this prompting.Status information 136 is the information that is only produced and used by software.It can distinguish the different implications of hitting, such as unmodified data to revising.In other used, software can enter tables of data as side-play amount with being used for branch's information judged.
Other instructions that use and manage the CAM can include:

Write[entry, src_reg], opt_tok;
Write_State(state_value, entry);
Read_Tag(dest_reg, entry);
Read_State(dest_reg, entry); and
Clear.
The Write instruction writes the identifier value in src_reg to the specified CAM entry. An optional token can be used to specify state information. The Read_Tag and Read_State instructions are used for diagnostics, but can also be used in normal functions. The tag value and state for the specified entry are written into the destination register. Reading the tag is useful in the case where an entry needs to be evicted to make room for a new value, that is, when a lookup of the new value results in a miss and the LRU entry number is returned as part of the miss result. The read instruction can then be used to find the value that is stored in that entry. The Read_Tag instruction eliminates the need to keep, in another register, the identifier value corresponding to the LRU entry number. The Clear instruction is used to flush all information out of the CAM.
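The eviction sequence that Read_Tag supports can be sketched as follows. The data structures here are plain Python stand-ins for the tag store, local cache and external memory, and the function name is hypothetical:

```python
def replace_lru_entry(cam_tags, local_cache, external_mem, lru_entry, new_tag):
    """Illustrative eviction after a CAM miss: the tag of the LRU entry is
    read back (as Read_Tag would do) so the cached data can be written to
    external memory, then the entry is re-loaded with the new tag and the
    new data is fetched."""
    old_tag = cam_tags[lru_entry]      # Read_Tag: no shadow copy needed
    if old_tag is not None:
        external_mem[old_tag] = local_cache[lru_entry]   # write back evicted data
    cam_tags[lru_entry] = new_tag      # Write: install the new identifier
    local_cache[lru_entry] = external_mem[new_tag]       # fetch new critical data
    return old_tag
```

Because the tag itself names where the data belongs in external memory, software does not need to mirror the CAM contents in general-purpose registers.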
When the CAM is used as a cache tag store, with each entry associated with a block of data in the local memory 66, the result of a lookup can be used to branch on the hit/miss indicator 134 and to use the entry number 138 as a base pointer into the block in the local memory 66.
In another embodiment, the state 126 can be implemented as a single lock bit, and the result 132 can be implemented to include a status code (instead of the separate indicator and state fields) along with the entry number 138. For example, the code could be defined as a two-bit code, with possible results including "miss" (code '01'), "hit" (code '10') and "lock" (code '11'). A miss code indicates that the lookup value is not in the CAM, and the entry number of the result is the entry number of the least-recently-used (LRU) entry. As discussed above, this value could be used as a suggested entry to be replaced with the lookup value. A hit code indicates that the lookup value is in the CAM and the lock bit is clear, with the entry number in the result being the entry number of the entry that matched the lookup value. A lock code indicates that the lookup value is in the CAM and the lock bit 126 is set, with the entry number provided in the result again being the entry number of the entry that matched the lookup value.
The lock bit 126 is a bit of data associated with the entry. The lock bit can be set or cleared by software, e.g., by using a LOCK or UNLOCK instruction, at the time the entry is loaded, or changed in an already-loaded entry. The lock bit 126 can be used to differentiate cases where the data associated with the CAM entry is in flight, or pending a change, as will be discussed further.
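The miss/hit/lock status codes described above can be illustrated with the following sketch. The two-bit codes come from the text; the surrounding data structures are illustrative stand-ins:

```python
MISS, HIT, LOCK = 0b01, 0b10, 0b11   # two-bit status codes from the text

def lookup_with_lock(tags, lock_bits, lru_order, value):
    """Lock-bit variant of the CAM lookup: the result is a
    (status_code, entry_number) pair. HIT means the value matched with
    the lock bit clear; LOCK means it matched with the lock bit set
    (data in flight); MISS returns the LRU entry as a replacement hint."""
    for entry, tag in enumerate(tags):
        if tag == value:
            return (LOCK if lock_bits[entry] else HIT, entry)
    return (MISS, lru_order[0])      # front of the LRU list = LRU entry
```

A thread that receives LOCK would back off and retest the CAM rather than issuing its own external read.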
As mentioned earlier, a context pipe stage that uses critical data is the only ME that uses that critical data. Therefore, the replacement policy for the CAM entries is simply to replace the LRU entry on a CAM miss. In a functional pipeline (like the pipeline 114 of Fig. 3), on the other hand, the same function is performed on multiple MEs. A given ME in a functional pipeline is therefore required to evict all critical data to external memory before it exits a stage that uses critical data, and it must be ensured that the CAM is cleared prior to any thread using the CAM.
Before a thread uses critical data, it searches the CAM using a critical data identifier, such as a memory address, as the lookup value. As described earlier, the search results in one of three possibilities: a "miss", a "hit" or a "lock". If a miss is returned, the data is not held locally. The thread reads the data from external memory (that is, from the SRAM 38) to replace the LRU data. It evicts the LRU data from local memory (the SRAM controller cache, or the local memory 66) back to external memory, optionally locks the CAM entry and issues a read to fetch the new critical data from external memory. In certain applications, as will be described below, the lock is asserted to indicate to other threads that the data is in the process of being read into local memory, or to indicate to the same thread (the one that initiated the read) that the memory read is still in progress. Once the critical data is returned, the thread awaiting the data processes it, makes any modifications, writes the modified data to local memory, updates the entry from which the LRU data was evicted with the new data, and unlocks the CAM entry.
If the result is a lock, the thread assumes that another ME thread is in the process of reading the critical data and that it should not attempt to read the data itself. Instead, it tests the CAM again later and uses the data once the lock has been removed. When the result is a hit, the critical data already resides in local memory. Particular examples of CAM use will now be described with reference to Figs. 5 through 7.
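The miss/hit flow above, which folds a read-modify-write into a single external read followed by local modifications, can be sketched as follows. The structures are dict-based stand-ins (the LRU order is kept static here for brevity), and the function name is hypothetical:

```python
def access_critical_data(cam, local_mem, sram, addr, modify):
    """Sketch of the critical-data access flow: on a hit, modify the
    locally cached copy; on a miss, write the LRU entry's data back to
    external memory (SRAM), then fetch and modify the new data. `cam`
    holds parallel 'tags' and 'locks' lists plus an 'lru' order list."""
    for entry, tag in enumerate(cam['tags']):
        if tag == addr:
            if cam['locks'][entry]:
                return 'lock'        # data in flight: retest the CAM later
            local_mem[entry] = modify(local_mem[entry])   # hit: modify locally
            return 'hit'
    # Miss: evict the LRU entry, then read-modify the new data.
    lru = cam['lru'][0]
    old_tag = cam['tags'][lru]
    if old_tag is not None:
        sram[old_tag] = local_mem[lru]   # evict LRU data to external memory
    cam['tags'][lru] = addr              # update the CAM entry
    local_mem[lru] = modify(sram[addr])  # single external read, then modify
    return 'miss'
```

Repeated modifications of the same data by successive threads then cost only local-memory accesses until the entry is eventually evicted.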
As discussed above, and as shown in Fig. 3, the processor 12 can be programmed to use one of the microengines 20 as the QM 106. The CAM 60 in the QM 106 serves as the tag store holding the tags of the queue descriptors that are cached by the SRAM controller 40.
The QM 106 receives enqueue requests from the set of microengines functioning as the receive functional pipeline 114. The receive pipeline 114 is programmed to process and classify data packets received by one of the network devices 14, 16 (Fig. 1), e.g., the physical layer device 14. The enqueue requests specify which output queue an arriving packet should be sent to. The transmit scheduler 108 sends dequeue requests to the QM 106. The dequeue requests specify the output queue from which a packet is to be removed for transmittal to a destination, e.g., a switch fabric, via one of the network devices 14, 16, e.g., device 16.
An enqueue operation adds information that arrived in a data packet to one of the output queues and updates the corresponding queue descriptor. A dequeue operation removes information from one of the output queues and updates the corresponding queue descriptor, thereby allowing the network device 16 to transmit the information to the appropriate destination.
Referring to Fig. 5A, an example of "n" transmit queues 150 and their corresponding queue descriptors 152, residing in external memory (SRAM 38), is shown. Each output queue 150 includes a linked list of elements 154, each of which has a pointer containing the address of the next element in the queue. Each element 154 also includes a pointer that points to information represented by the element, which is stored elsewhere. The pointer of the last element in the queue 150 typically contains a null value. The queue descriptor 152 includes an end-of-packet (EOP) indicator 156, a segment count 158, a head pointer 160, a tail pointer 162 and a frame count 164. The descriptor 152 can also include other queue parameters (not shown). The head pointer 160 points to the first element of the transmit queue 150, and the tail pointer 162 points to the last element of the transmit queue 150. The segment count 158 identifies the number of elements in the transmit queue 150.
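The linked-list queue and its descriptor fields can be sketched as follows; the EOP indicator and frame count are omitted for brevity, and the class names are illustrative:

```python
class Element:
    """Transmit-queue element: links to the next element and points to
    the packet information it represents (stored elsewhere)."""
    def __init__(self, data_ptr):
        self.data_ptr = data_ptr
        self.next = None             # the last element's pointer stays null

class QueueDescriptor:
    """Sketch of a queue descriptor holding the head pointer, tail
    pointer and segment count named in the text."""
    def __init__(self):
        self.head = None             # first element of the queue
        self.tail = None             # last element of the queue
        self.segment_count = 0       # number of elements in the queue

    def enqueue(self, elem):
        if self.tail is None:
            self.head = elem         # first element: head and tail coincide
        else:
            self.tail.next = elem    # link from the previous tail
        self.tail = elem
        self.segment_count += 1

    def dequeue(self):
        elem = self.head
        self.head = elem.next
        if self.head is None:
            self.tail = None         # queue became empty
        self.segment_count -= 1
        return elem
```

Both operations touch only the descriptor and one list element, which is what makes caching the descriptor so effective.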
With reference now to Fig. 5 B,, joins the team and go out team's operation by some queue identifier 152 being stored in as realizing in the cache in the SRAM controller 40 170 a large amount of transmit queues in the SRAM storer 38 150 being carried out with high-bandwidth connections speed.The ME 20 that carries out as queue management device 106 uses the identifier 122 of the clauses and subclauses 120 among its CAM 60 to discern the storage address of joining the team or going out 16 queue identifier 152 using at most recently in team's operation, i.e. the queue descriptor of hypervelocity storage.The corresponding queue descriptor 152 at the place, address that cache 170 store storage are discerned in mark memory (CAM60) (EOP value 156, fragment counting 158, stem pointer 160, tail pointer 162 and frame count 164).
The queue manager 106 issues commands to return queue descriptors 152 to memory 38 and to fetch new queue descriptors 152 from memory, such that the queue descriptors stored in the cache 170 remain coherent with the addresses in the tag store 60. The queue manager 106 also issues commands to the SRAM controller 40 indicating which queue descriptor 152 in the cache 170 should be used to execute a command. The commands that reference the head pointer 160 or tail pointer 162 of a queue descriptor 152 in the cache 170 are executed in the order in which they arrive at the SRAM controller 40.
Locating the cache 170 of queue descriptors 152 at the memory controller 40 allows low-latency access to the cache 170 and memory 38. Also, having the control structures for queue operations in the programmable engine allows flexible, high-performance operation while using existing microengine hardware.
The threads associated with the QM 106 execute in strict order. The threads use local inter-thread signaling to maintain that strict order. To ensure that the QM 106 keeps up with the incoming line rate, each thread performs one enqueue and one dequeue operation within a minimum frame time slot equal to the frame arrival time.
FIG. 6 illustrates an exemplary queue operation 180 (representing either an enqueue or a dequeue operation) performed by the QM 106. The QM 106 receives a request 182 for a queue operation. The request is received from a CA context pipe-stage ME in the case of an enqueue operation, and from the TX scheduler context pipe-stage ME in the case of a dequeue operation. The QM 106 reads the queue number 184 from the request.
Next, the QM 106 uses its CAM to detect a temporal dependency between the queue specified in the request and the last 16 queues on which the QM 106 performed such an operation. Accordingly, the QM 106 performs a CAM lookup 186 based on the queue number read from the request. If there is a dependency, that is, the QM thread detects a CAM hit 188, the latency of reading the queue descriptor is eliminated, because the CAM hit indicates that the descriptor corresponding to that queue number currently resides in the queue descriptor cache 170 (FIG. 5B). In the case of a hit, the QM 106 proceeds to execute a command 190 directing the SRAM controller 40 to perform the requested operation.
If, at 188, the result of the CAM lookup is determined to be a miss, the entry number of the least recently used CAM entry is returned to the QM 106. There is a direct mapping between CAM entries and cache entries (queue descriptors). In other words, LRU CAM entry "n" indicates that cache entry "n" should be evicted. Accordingly, the QM 106 evicts from the cache the queue descriptor 192 corresponding to the queue number stored in the LRU CAM entry. Once the cache entry has been evicted, the QM 106 reads the "new" queue descriptor (that is, the descriptor for the queue number in the request) from memory into the cache 194. The new queue descriptor includes the linked-list head pointer (used for dequeue) and tail pointer (used for enqueue), as well as a count indicating the number of frames or buffers on the queue (as shown in FIGS. 5A-5B). The QM 106 also stores the queue number of the new queue descriptor in the CAM entry 196 identified as the LRU entry, replacing the evicted queue descriptor's number. The QM 106 then executes a command 190 directing the SRAM controller 40 to perform the requested operation.
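The hit/miss flow of operation 180 can be sketched in software as follows. This is a simplified single-threaded model with illustrative class and method names; a real CAM compares all 16 tags in parallel in hardware, and the descriptor fetch and write-back are asynchronous SRAM commands.

```python
class LruCam:
    """Software model of the 16-entry CAM 60 used as a tag store
    with a directly-mapped descriptor cache (cache 170)."""
    def __init__(self, size=16):
        self.tags = [None] * size    # queue number held by each entry; None = free
        self.stamp = [0] * size      # last-use time per entry, for LRU tracking
        self.time = 0

    def lookup(self, queue_number):
        """On a hit return (True, entry_number); on a miss return
        (False, lru_entry_number), the entry that should be evicted."""
        self.time += 1
        if queue_number in self.tags:
            entry = self.tags.index(queue_number)
            self.stamp[entry] = self.time            # refresh LRU order
            return True, entry
        lru = min(range(len(self.tags)), key=lambda i: self.stamp[i])
        return False, lru

    def load(self, entry, queue_number):
        """Replace the evicted tag with the new queue number (step 196)."""
        self.tags[entry] = queue_number
        self.stamp[entry] = self.time


def queue_op(cam, cache, sram, queue_number):
    """Sketch of operation 180: a hit avoids the SRAM descriptor read."""
    hit, entry = cam.lookup(queue_number)
    if not hit:
        if cam.tags[entry] is not None:
            sram[cam.tags[entry]] = cache[entry]   # write back evicted descriptor (192)
        cache[entry] = sram[queue_number]          # fetch the new descriptor (194)
        cam.load(entry, queue_number)              # update the tag store (196)
    return cache[entry]                            # descriptor used by command 190
```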
The SRAM controller 40 performs the actual linked-list operations for enqueue and dequeue.
When an operation of either type (enqueue or dequeue) is performed, the QM 106 sends a message to the TX scheduler 108. After a dequeue operation, the QM 106 passes a transmit request to the TX data context pipe-stage 110.
Another stage that uses the CAM 60 is the CRC processing pipe-stage 96a. The ME 20 executing this stage of the receive functional pipeline 114 uses its internal CAM 60 to maintain coherency of the CRC residues among the eight threads that perform the CRC processing of pipe-stage 96a.
Referring now to FIG. 7, a CRC pipe-stage program flow 200 is shown, including the use of the CAM 60 in support of that function. The ME enters the CRC stage 96a only when it is signaled (over the next-neighbor lines 21a, FIG. 2) that the previous ME has exited the stage. This ensures that the ME accesses the most recent critical data (the CRC residue). It is also important that, throughout this pipe-stage, all threads execute in strict order to ensure that the CRC is computed correctly. Because the CRC stage 96a uses the CAM 60, it first clears 202 the CAM of any data remaining from the previous pipe-stage. It reads the port type 204 and determines whether an ATM cell is indicated 206. If the cell is not an ATM cell (that is, it is of some other type, such as Ethernet or POS), the ME executing the CRC stage lets the cell pass through without performing any processing 208. If the cell is an ATM cell, the ME 20 performs the CRC processing.
The processing includes the following steps: reading the CRC residue, ATM type and SOP/EOP state from SRAM; determining whether the cell contains an SOP, body or EOP; verifying that the VC is carrying AAL5 cells and, if so, performing the CRC computation; and updating the CRC residue and the SOP/EOP state in SRAM.
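The per-cell steps can be sketched as follows. This is an illustrative model: the dictionary fields and function names are assumptions, `zlib.crc32` merely stands in for the hardware CRC unit 72 (it uses the same CRC-32 polynomial as AAL5, but the real residue conventions — initial value, bit ordering, final complement — are not modeled here).

```python
import zlib

def process_atm_cell(state, vc, payload, is_eop):
    """Sketch of the CRC pipe-stage per-cell processing (FIG. 7)."""
    entry = state[vc]              # residue, AAL type and SOP bit read for this VC
    if entry["aal_type"] != "AAL5":
        return None                # pass to an exception handler, or drop
    # Accumulate the CRC over this cell, seeded with the residue
    # left by the previous cell of the same VC (strict order matters).
    entry["residue"] = zlib.crc32(payload, entry["residue"])
    if is_eop:
        entry["residue"] = 0       # end of AAL5 frame: reset residue to all zeros
        entry["sop"] = 1           # next cell on this VC starts a new packet
    else:
        entry["sop"] = 0           # following cells are body cells
    return entry
```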
The CRC computation is performed using the CRC unit 72 (FIG. 2) in the ME 20. The CRC computations must be performed in strict order to ensure that the CRC for a cell belonging to a given VC is computed with the correct CRC residue.
The CRC processing is divided into a read phase and a modify/write phase. The CAM 60 is used in both phases. In the first phase, the CAM 60 is used to determine whether a thread should read the residue/type field from SRAM 38 or use the result of a previous thread stored in the local memory 66 (FIG. 2). The first phase begins with the thread searching the CAM 210 using its reassembly state pointer. If the thread detects a CAM miss 212, the thread writes a CAM entry with the reassembly pointer and state information so as to lock the entry, and issues a read to the SRAM memory 38 to obtain the CRC residue and AAL type 214. If, at 212, the thread detects a hit, no read is issued.
When the thread receives the appropriate event signaling, that is, an event signal indicating that the previous thread has completed its processing, the thread wakes and begins phase-2 processing. It searches the CAM 220 with the same reassembly pointer. If the thread had issued a read and determines that the state of the matching CAM entry is locked 220, the thread moves the read result from the transfer registers into local memory 222. The thread that moves the result also unlocks the entry, thereby guaranteeing that future CAM lookups for this particular pointer will hit. Otherwise, if the CAM entry is not locked, a hit occurs and the thread simply reads the corresponding information 224, that is, the residue and type, from local memory.
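The two-phase use of the CAM as a lock can be sketched as follows (an illustrative model; the entry states and method names are assumptions, and in the real device the SRAM read completes asynchronously between the two phases):

```python
LOCKED, VALID = "locked", "valid"

class CrcCam:
    """Model of CAM 60 entries carrying a lock/valid state field."""
    def __init__(self):
        self.entries = {}   # reassembly pointer -> state

    def phase1(self, ptr, sram):
        """Read phase: on a miss, lock the entry and issue the SRAM read (214)."""
        if ptr in self.entries:
            return None                 # hit: no read issued
        self.entries[ptr] = LOCKED      # write the entry, locked
        return sram[ptr]                # models the issued SRAM read

    def phase2(self, ptr, read_result, local_mem):
        """Modify/write phase: a locked entry means this thread issued the
        read, so move the result to local memory and unlock (222);
        otherwise just read local memory (224)."""
        if self.entries[ptr] is LOCKED:
            local_mem[ptr] = read_result
            self.entries[ptr] = VALID   # unlock: future lookups for ptr will hit
        return local_mem[ptr]
```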
After the phase-2 CAM search, each thread verifies that the VC is carrying AAL5 by checking the type field from the VC table. For the AAL5 type, the thread computes the CRC over the cell 226. If the type is not AAL5, the cell is either passed to an exception handler or dropped, depending on the implementation.
If the thread determines that the PTI bits in the ATM header indicate that the cell is an EOP cell, the thread updates the reassembly state 230 by setting the CRC residue to all zeros and setting the SOP bit to one. If the cell is not an EOP cell, the thread updates the state with the new residue and sets SOP to zero 232. The thread saves the updated CRC residue and SOP in local memory for use 235 by the other threads and, in accordance with its write-back caching policy, also writes the CRC residue and SOP back to the reassembly state in SRAM 38. The thread passes the SOP, EOP and body status to the next (packet processing) stage 236.
It is important for the other stages in the RX pipeline to know whether an ATM cell contains an EOP, an SOP or a body. For ATM, the settings of the SOP and EOP bits indicate whether an entire cell (as opposed to an entire packet) has been received, so the CRC thread must use the EOP bit state provided in the header's PTI field. The PTI bits only support EOP, so when an EOP is detected, the CRC thread sets the SOP bit in its portion of the reassembly state table to indicate to the next thread that it has an SOP. When each CRC thread reads the reassembly state, it reads the SOP bit; if that bit is set but the PTI bits in the ATM header do not indicate an EOP, the thread clears the SOP bit.
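A minimal sketch of this SOP/EOP bookkeeping (illustrative names throughout; the `PTI_EOP_MASK` value follows the ATM cell format, in which the low bit of the 3-bit PTI field marks the last cell of an AAL5 frame):

```python
PTI_EOP_MASK = 0x1   # low PTI bit set => last cell of the AAL5 frame (EOP)

def classify_cell(reassembly, vc, pti):
    """Return ('SOP' or 'body', eop_flag) and maintain the per-VC SOP bit."""
    is_eop = bool(pti & PTI_EOP_MASK)
    is_sop = reassembly.get(vc, 1)   # first cell seen on a VC starts a packet
    if is_eop:
        reassembly[vc] = 1           # next cell on this VC will be an SOP
    elif is_sop:
        reassembly[vc] = 0           # PTI shows no EOP: clear SOP for body cells
    return ("SOP" if is_sop else "body"), is_eop
```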
Because the other stages do not read the CRC threads' reassembly state area, the CRC threads pass the EOP/SOP status down the pipeline. Once a CRC thread has completed its CRC computation and updated the reassembly state table, the thread is ready to move on to the next pipe-stage.
When a thread has completed its CRC computation and issued the SRAM write of its residue/type, it also signals the corresponding thread of the next ME that it may begin its CRC pipe-stage. Importantly, this signaling ensures that the next ME is not signaled until any pending residue writes have been issued, so that the next ME can be assured that it does not read a stale residue.
It will be appreciated that, although the described implementation uses the CAM 60 to reduce the number of read accesses (through "folding", as discussed earlier), the strict sequential ordering of thread execution in a given context pipe-stage is maintained not by the CAM but instead through the use of local inter-thread signaling, and by ensuring that a thread completes its reference and modify activity on data before a successive thread needs that same data.
It will be appreciated, however, that the CAM 60 can be used to maintain coherency and to preserve correct packet processing order. For example, suppose that two consecutive packets belonging to the same flow (or associated with the same queue number) are being handled by different threads, and that both accesses target the same SRAM location. Because the packet arrival rate is faster than the SRAM access speed, the thread processing the second packet will be ready to read the data before the thread processing the first (earlier) packet has completed its read/modify/write activity. In this situation, the software-controlled CAM cache implementation is used to recognize the dependency and to ensure that current information is always used. Accordingly, each thread uses the CAM 60 to perform multiple comparisons in parallel with a single CAM lookup instruction, a source register supplying the flow number or queue number as the lookup value, as described earlier.
If a miss occurs, the thread starts an SRAM read and allocates a CAM entry, in which the thread places the flow number. If the flow number is already in the CAM, a hit indication is returned along with a unique pointer value (for example, the number of the matching CAM entry). A thread that gets a CAM hit can obtain the most recent copy of the data from local storage (the cache in the SRAM controller 40, or the ME local memory 66) without having to perform an SRAM read.
When a thread loads a flow number into a CAM entry, it also stores state information in the entry indicating that the entry is not yet complete (that it is "in flight"), so that a later thread's lookup can determine either that a) the SRAM read has been started but has not completed; or b) the SRAM read has completed and the data is valid. If the "in flight" state is detected, the later thread knows that it should not start a read, but also that it cannot yet use the read data. It can continue to test the entry's state until it determines that the state has been changed to reflect valid data.
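A sketch of this in-flight protocol (illustrative names; in this single-threaded model the read completes synchronously, whereas on the real microengine a later thread would poll the entry state or wait on an event signal):

```python
IN_FLIGHT, VALID = "in_flight", "valid"

class FlowCam:
    """Model of per-flow CAM entries carrying an in-flight/valid state."""
    def __init__(self):
        self.entries = {}   # flow number -> [state, data]

    def access(self, flow, sram):
        """The first thread on a flow starts the SRAM read; later threads
        reuse the cached copy instead of re-reading SRAM."""
        if flow not in self.entries:
            self.entries[flow] = [IN_FLIGHT, None]   # allocate entry, mark in flight
            data = sram[flow]                        # models the SRAM read
            self.entries[flow] = [VALID, data]       # read complete: data now valid
            return data
        state, data = self.entries[flow]
        while state is IN_FLIGHT:                    # later thread tests the state
            state, data = self.entries[flow]
        return data
```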
Other embodiments are within the scope of the following claims.

Claims (30)

1. A method of performing a data lookup, the method comprising:

providing an input to a first datapath element of a plurality of datapath elements, the plurality of datapath elements being located in an execution datapath within a processor, the input comprising select information and an input value, wherein providing the input comprises executing a lookup instruction, the lookup instruction specifying a source register and a destination register, the source register providing an operand that carries the input value;

determining, based on the select information, that the first datapath element is selected to perform at least one operation;

using the first datapath element to perform the at least one operation, wherein the at least one operation comprises causing the first datapath element to compare the input value with stored identifier values; and

receiving a result from the first datapath element, the result being generated based on the comparison of the input value with the stored identifier values, the result being stored in the destination register.
2. The method of claim 1, wherein the first datapath element comprises:

a content addressable memory.
3. The method of claim 2, wherein the content addressable memory comprises a plurality of entries storing identifier values and corresponding entry numbers.
4. The method of claim 3, wherein the result comprises status information indicating whether a match was found.
5. The method of claim 4, wherein the result further comprises an entry number.
6. The method of claim 5, wherein, if a match is indicated, the entry number in the result corresponds to the entry number of the matched entry.
7. The method of claim 6, wherein the content addressable memory maintains a list of least recently used entries.
8. The method of claim 7, wherein, if the status information indicates that no match occurred as a result of the comparison, the entry number in the result is that of the least recently used entry according to the list of least recently used entries.
9. The method of claim 5, wherein the plurality of entries further store state information, and the status information in the result further comprises the value of the state information of the matched entry.
10. The method of claim 9, wherein the identifier values in the entries are associated with data stored in a memory, and the state information in an entry comprises a lock status, the lock status indicating that the data associated with the matched entry is in the process of being modified.
11. The method of claim 1, wherein providing the input comprises executing an instruction having an operand that carries the input value.
12. The method of claim 1, further comprising providing different select information to the first datapath element, and determining, based on the different select information, that a different datapath element of the plurality of datapath elements is selected to perform one or more operations.
13. The method of claim 1, wherein the source register and the destination register are registers in a general purpose register file, the general purpose register file being in communication with the execution datapath.
14. A method for use in a processor having a plurality of multithreaded programmable engines integrated therein, the method comprising:

on a first thread of one of the plurality of multithreaded engines:

performing, in the one of the plurality of multithreaded engines, a first-thread lookup operation for a tag in a content addressable memory;

if the first-thread lookup operation results in a miss for the tag in the content addressable memory:

issuing a read of data associated with the tag from a memory external to the plurality of multithreaded engines into a plurality of locations of a storage internal to the one of the plurality of multithreaded engines; and

writing an entry for the tag in the content addressable memory; and

if the first-thread lookup operation results in a hit for the tag in the content addressable memory:

modifying the data in at least one of the plurality of locations of the storage internal to the one of the plurality of multithreaded engines without issuing a read of the data associated with the tag from the memory external to the plurality of multithreaded engines; and

on a second thread of the one of the plurality of multithreaded engines:

performing, in the one of the plurality of multithreaded engines, a second-thread lookup operation for the tag in the content addressable memory;

if the second-thread lookup operation results in a miss for the tag in the content addressable memory:

issuing a read of data associated with the tag from the memory external to the plurality of multithreaded engines into the plurality of locations of the storage internal to the one of the plurality of multithreaded engines; and

writing an entry for the tag in the content addressable memory; and

if the second-thread lookup operation results in a hit for the tag in the content addressable memory:

modifying the data in at least one of the plurality of locations of the storage internal to the one of the plurality of multithreaded engines without issuing a read of the data associated with the tag from the memory external to the plurality of multithreaded engines.
15. The method of claim 14, further comprising writing the data in the plurality of locations of the storage internal to the one of the plurality of multithreaded engines to the memory external to the plurality of multithreaded engines.
16. The method of claim 14, wherein issuing the read on the first thread and issuing the read on the second thread each further comprise: determining the plurality of locations in the storage for the data associated with the tag.
17. The method of claim 16, wherein the determining comprises: determining based on an index number of an entry in the content addressable memory.
18. The method of claim 14, further comprising:

on the first thread, locking the entry; and

on the second thread, waiting for the entry to be unlocked before modifying the data.
19. The method of claim 18, wherein the locking comprises: writing status data associated with the entry to the content addressable memory.
20. The method of claim 14, wherein:

the first thread comprises a first thread that processes a packet received from a network; and

the second thread comprises a second thread that processes a packet received from the network.
21. An apparatus comprising a processor, the processor comprising:

a plurality of programmable engines integrated within the processor, each engine comprising:

a plurality of registers; and

a plurality of execution units operationally coupled to the plurality of registers to receive inputs from, and write outputs to, the plurality of registers, the plurality of execution units performing operations in response to instructions and based on operands provided by the plurality of registers, the execution units comprising an arithmetic logic unit and a content addressable memory, at least one of the instructions including at least one opcode bit, the opcode bit selecting the content addressable memory from among the plurality of execution units to perform a lookup instruction, the lookup instruction controlling internal caching and modification of data from an external memory.
22. The apparatus of claim 21, wherein the plurality of programmable engines comprises a plurality of multithreaded programmable engines, each engine comprising a plurality of program counters, one corresponding to each respective thread provided by that engine.
23. The apparatus of claim 22, wherein the arithmetic logic unit and the content addressable memory are configured in parallel with respect to the plurality of registers.
24. The apparatus of claim 21, wherein:

the content addressable memory is configured to output a hit signal if a content addressable memory entry matches a tag being looked up; and

the content addressable memory is configured to output a miss signal and the entry number of the least recently used content addressable memory entry if no content addressable memory entry matches the tag being looked up.
25. The apparatus of claim 21, wherein the plurality of registers comprises a register file.
26. The apparatus of claim 21, wherein the plurality of registers comprises registers that buffer data being transferred to a memory external to the engine, and registers that buffer data being transferred directly between the plurality of engines.
27. The apparatus of claim 21, wherein the instructions comprise a content addressable memory lookup instruction and a content addressable memory write instruction.
28. A system comprising:

at least one MAC controller;

at least one processor coupled to the at least one MAC controller, the processor comprising:

a plurality of multithreaded programmable engines integrated within the processor, a plurality of the engines comprising:

a plurality of registers, and

a plurality of execution units operationally coupled to the plurality of registers to receive inputs from, and write outputs to, the plurality of registers, the plurality of execution units performing operations in response to instructions and based on operands provided by the plurality of registers, the execution units comprising an arithmetic logic unit and a content addressable memory, at least one of the instructions including at least one opcode bit, the opcode bit selecting the content addressable memory from among the plurality of execution units to perform a lookup instruction, the lookup instruction controlling internal caching and modification of data from an external memory.
29. The system of claim 28, wherein:

the arithmetic logic unit and the content addressable memory are configured in parallel with respect to the plurality of registers;

wherein the content addressable memory is configured to output a hit signal if a content addressable memory entry matches a tag being looked up, and to output a miss signal and the entry number of the least recently used content addressable memory entry if no content addressable memory entry matches the tag being looked up; and

wherein the instructions comprise a content addressable memory lookup instruction and a content addressable memory write instruction.
30. The system of claim 28, wherein the at least one MAC controller comprises an Ethernet MAC controller.
CNB028167740A 2001-08-27 2002-08-27 Software controlled content addressable memory in a general purpose execution datapath Expired - Fee Related CN100533410C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US31514401P 2001-08-27 2001-08-27
US60/315,144 2001-08-27
US10/212,943 2002-08-05

Publications (2)

Publication Number Publication Date
CN101137966A CN101137966A (en) 2008-03-05
CN100533410C true CN100533410C (en) 2009-08-26

Family

ID=39161106

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB028167740A Expired - Fee Related CN100533410C (en) 2001-08-27 2002-08-27 Software controlled content addressable memory in a general purpose execution datapath

Country Status (1)

Country Link
CN (1) CN100533410C (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102790667B (en) * 2011-05-18 2016-08-03 中兴通讯股份有限公司 A kind of method processing re-transmission data and base station
KR20180089140A (en) * 2017-01-31 2018-08-08 에스케이하이닉스 주식회사 Data storage device
US10497438B2 (en) * 2017-04-14 2019-12-03 Sandisk Technologies Llc Cross-point memory array addressing
CN110045986B (en) * 2018-01-16 2021-07-27 龙芯中科(北京)信息技术有限公司 Instruction processing method, device and storage medium
US11188480B1 (en) * 2020-05-12 2021-11-30 Hewlett Packard Enterprise Development Lp System and method for cache directory TCAM error detection and correction
CN116701246A (en) * 2023-05-23 2023-09-05 合芯科技有限公司 Method, device, equipment and storage medium for improving cache bandwidth

Also Published As

Publication number Publication date
CN101137966A (en) 2008-03-05

Similar Documents

Publication Publication Date Title
EP1586037B1 (en) A software controlled content addressable memory in a general purpose execution datapath
US7216204B2 (en) Mechanism for providing early coherency detection to enable high performance memory updates in a latency sensitive multithreaded environment
US7742405B2 (en) Network processor architecture
US6996639B2 (en) Configurably prefetching head-of-queue from ring buffers
US7376952B2 (en) Optimizing critical section microblocks by controlling thread execution
US8861344B2 (en) Network processor architecture
KR100932038B1 (en) Message Queuing System for Parallel Integrated Circuit Architecture and Its Operation Method
US7113985B2 (en) Allocating singles and bursts from a freelist
US7240164B2 (en) Folding for a multi-threaded network processor
US20100205608A1 (en) Mechanism for Managing Resource Locking in a Multi-Threaded Environment
US20050243734A1 (en) Multi-threaded packet processing engine for stateful packet processing
EP2273378B1 (en) Data stream flow controller and computing system architecture comprising such a flow controller
KR20060132538A (en) Advanced processor
CN101878475A (en) Delegating network processor operations to star topology serial bus interfaces
JP2002149424A (en) A plurality of logical interfaces to shared coprocessor resource
Melvin et al. A massively multithreaded packet processor
CN100533410C (en) Software controlled content addressable memory in a general purpose execution datapath
CN1997973B (en) Processor for dynamically caching engine instructions, method, device and equipment
US7039054B2 (en) Method and apparatus for header splitting/splicing and automating recovery of transmit resources on a per-transmit granularity
US20060161647A1 (en) Method and apparatus providing measurement of packet latency in a processor
WO2003090018A2 (en) Network processor architecture
US9588928B1 (en) Unique packet multicast packet ready command
US7500239B2 (en) Packet processing system
US8799909B1 (en) System and method for independent synchronous and asynchronous transaction requests

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090826

Termination date: 20180827

CF01 Termination of patent right due to non-payment of annual fee