US20170351555A1 - Network on chip with task queues - Google Patents
- Publication number
- US20170351555A1 (U.S. application Ser. No. 15/173,017)
- Authority
- US
- United States
- Prior art keywords
- task
- queue
- address
- data
- processing element
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/548—Queue
Definitions
- Multi-processor computer architectures capable of parallel computing operations were originally developed for supercomputers.
- FIG. 1 is a block diagram conceptually illustrating an example of a multiprocessor chip with a hierarchical on-chip network architecture that includes task-assignable hardware queues.
- FIGS. 2A to 2G illustrate examples of task distributors that assign tasks to hardware queues, and how the task distributors distribute task requests.
- FIG. 3 illustrates an example of a packet header used to communicate within the architecture.
- FIGS. 4A to 4D illustrate examples of packet payloads containing task descriptors and/or an address where a task descriptor is stored, as used within the architecture to delegate tasks.
- FIG. 5 illustrates task descriptors being enqueued and dequeued from the memory/register stack of a hardware task queue.
- FIG. 6 is an abstract representation of how slots within a queue stack are accessed and recycled in a first-in-first-out (FIFO) manner.
- FIG. 7 is an example circuit overview of a task-assignable hardware queue.
- FIG. 8 is a block diagram conceptually illustrating example components of a processing element of the chip in FIG. 1 .
- FIG. 9 illustrates a plurality of the multiprocessor chips connected together, with the task-assignable queues of several of the chips assigned to receive tasks.
- FIG. 10 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and the result is deposited into an output queue for the processor to retrieve.
- FIG. 11 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and the result is written back directly to the processor.
- FIG. 12 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and execution chains across queues, with the end-result being deposited into an output queue for the processor to retrieve.
- FIGS. 13A to 13F illustrate examples of the content of several of the data transactions in FIG. 12 .
- FIG. 14 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and a task-assigned processor deposits a sub-task into another input queue as a subroutine, with the end-result being deposited into an output queue for the processor to retrieve.
- FIG. 15 is a hybrid process-flow transaction-flow diagram illustrating execution of a scheduler program by a task-assigned processor, enabling the processor to autonomously subscribe and unsubscribe from task queues.
- Semiconductor chips that include multiple computer processors have increased in complexity and scope to the point that on-chip communications may benefit from a routed packet network within the semiconductor chip.
- a seamless fabric is created for high data throughput computation that does not require data to be re-packed and re-transmitted between devices.
- a multi-core chip may include a top level (L 1 ) packet router for moving data inside the chip and between chips. All data packets are preceded by a header containing routing data. Routing to internal parts of the chip may be done by fixed addressing rules. Routing to external ports may be done by comparing the packet header against a set of programmable tables and/or registers. The same hardware can route internal-to-internal packets (loopback), internal-to-external packets (outbound), external-to-internal packets (inbound) and external-to-external packets (pass through).
- the routing framework supports a wide variety of geometries of chip connections, and allows execution-time optimization of the fabric to adapt to changing data flows.
- FIG. 1 is a block diagram conceptually illustrating an example of a multiprocessor chip 100 with a hierarchical on-chip network architecture that includes task-assignable hardware queues 118 .
- the processor chip 100 may be composed of a large number of processing elements 134 (e.g., 256 ), connected together on the chip via a switched or routed fabric similar to what is typically seen in a computer network.
- FIFO input and output hardware queues 118 are provided on the chip 100 , each of which is assignable to serve as an input queue or an output queue. When configured as an input queue, the queue 118 is associated with a single “task.”
- a task comprises multiple executable instructions, such as the instructions for routine, subroutine, or other complex operation.
- Defined tasks are each assigned a task identifier or “tag.”
- a task descriptor is sent to a task distributor 114 .
- the task descriptor includes the task identifier, any needed operands or data, and an address where the task result should be returned.
- the task distributor 114 identifies a nearby queue associated with one or more processing elements 134 configured to perform the task.
- the assigned queue may be on a same chip 100 as the processing element 134 running the software that invoked the task, or may be on another chip. Since the processing elements subscribed to input queues repeatedly perform the same tasks, they can locally store and execute the same code over-and-over, substantially reducing the communication bottlenecks created when a processing element must go and fetch code (or be sent code) for execution.
- Each input queue is affiliated with at least one subscribed processing element 134 .
- the processing elements 134 affiliated with the input queues may each be loaded with a small scheduler program that is invoked after the processing element is idle for (or longer than) a specified/preset/predetermined duration, which may vary in length in accordance with the complexity of the task of the queue to which the processing element is currently affiliated/subscribed.
- with the scheduler program invoked, the processing element 134 may unsubscribe from the input queue it was servicing and subscribe to a different input queue. In this way, processing elements can self-load balance independent of any central dispatcher.
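The self-balancing behavior described above can be sketched as a small decision function. This is illustrative only: the patent does not specify how the next queue is chosen, so the "move to the deepest queue" heuristic, and all names here (`pick_queue`, `queue_depths`, etc.), are assumptions.

```python
def pick_queue(current_queue, idle_cycles, idle_limit, queue_depths):
    """Sketch of the scheduler decision: once a processing element has been
    idle for at least idle_limit cycles, it unsubscribes from its current
    input queue and subscribes to another. The choice of the deepest
    (most backlogged) queue is an assumed policy for illustration."""
    if idle_cycles < idle_limit:
        return current_queue  # still within the idle window; keep servicing
    # Self-load-balancing heuristic: move to the queue with the most work.
    return max(queue_depths, key=queue_depths.get)
```

Because every idle core runs the same local rule, backlogged queues attract more subscribers without any central dispatcher, matching the passage above.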
- the chip 100 has some queues at a top level (in the network hierarchy), with each queue supporting one type of task at any time.
- a program deposits a descriptor of the task that needs to be done with a task distributor 114 , which deposits the descriptor into the appropriate queue 118 .
- the processing elements affiliated with the queue do the work, and typically produce output to some other queue (e.g., a queue 118 configured as an output queue).
- Each hardware queue 118 has at least one event flag attached, so a processor core can sleep while waiting for a task to be placed in the queue, powering down and/or de-clocking operations. After a task descriptor is enqueued, at least one of the cores affiliated with that queue is awakened by the change in state of the event flag, causing the processor core to retrieve (dequeue) the descriptor and to start processing the operands and/or data it contains, using the locally-stored executable task code.
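The enqueue-raises-flag, dequeue-clears-flag behavior can be modeled as follows. This is a behavioral toy, not the hardware design: the real queue is a register/memory stack whose event flag wakes sleeping cores electrically, and the class and method names here are illustrative.

```python
from collections import deque

class TaskQueue:
    """Toy model of a hardware task queue with an attached event flag.
    Enqueuing a descriptor raises the flag; dequeuing the last pending
    descriptor clears it, so a core can sleep until the flag asserts."""
    def __init__(self):
        self._slots = deque()
        self.event_flag = False  # asserted while descriptors are pending

    def enqueue(self, descriptor):
        self._slots.append(descriptor)
        self.event_flag = True   # would wake a core sleeping on this flag

    def dequeue(self):
        descriptor = self._slots.popleft()
        if not self._slots:
            self.event_flag = False
        return descriptor
```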
- the hardware queues 118 may be configured as input queues or output queues. Dedicated input queues and dedicated output queues may also/instead be provided. When a task is finished, the last processing element to execute a portion of the assigned task or chain of tasks may deposit the results in an output queue. These output queues can generate event flags that produce externally visible (e.g., electrical) signals, so a host processor or other hardware (e.g., logic in an FPGA) can retrieve the finished result.
- each chip 100 includes four superclusters 122 a - 122 d , each supercluster 122 comprises eight clusters 128 a - 128 h , and each cluster 128 comprises eight processing elements 134 a - 134 h .
- each processing element 134 includes two-hundred-fifty-six externally exposed registers, then within the chip 100 , each of the registers may be individually addressed with a sixteen bit address: two bits to identify the supercluster, three bits to identify the cluster, three bits to identify the processing element, and eight bits to identify the register.
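The 2+3+3+8-bit address layout above can be made concrete with a pack/unpack pair. The bit widths come from the passage; the ordering of fields (supercluster in the top bits) is an assumption for illustration.

```python
def pack_register_address(supercluster, cluster, pe, register):
    """Pack the sixteen-bit on-chip register address described above:
    2 bits supercluster | 3 bits cluster | 3 bits processing element |
    8 bits register. High-to-low field order is assumed."""
    assert supercluster < 4 and cluster < 8 and pe < 8 and register < 256
    return (supercluster << 14) | (cluster << 11) | (pe << 8) | register

def unpack_register_address(addr):
    """Recover the four fields from a packed sixteen-bit address."""
    return ((addr >> 14) & 0x3, (addr >> 11) & 0x7,
            (addr >> 8) & 0x7, addr & 0xFF)
```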
- Memory within a system including the processor chip 100 may also be hierarchical, and memory of different tiers may be physically different types of memory.
- Each processing element 134 may have a local program memory containing instructions that will be fetched by the core's micro-sequencer and loaded into the instruction registers for execution in accordance with a program counter.
- Processing elements 134 within a cluster 128 may also share a cluster memory 136 , such as a shared memory serving a cluster 128 including eight processor cores 134 a - 134 h .
- While a processor core may experience no latency (or a latency of one-or-two cycles of the clock controlling timing of the instruction pipeline) when accessing its own operand registers, accessing global addresses external to a processing element 134 may experience a larger latency due to (among other things) the physical distance between the addressed component and the processing element 134 . As a result of this additional latency, the time needed for a processor core to access an external main memory, a shared cluster memory 136 , and the registers of other processing elements may be greater than the time needed for a core to access its own execution registers.
- Each tier in the architecture hierarchy may include a router.
- the top-level (L 1 ) router 102 may have its own clock domain and be connected by a plurality of asynchronous data busses to multiple clusters of processor cores on the chip.
- the L 1 router may also be connected to one or more external-facing ports that connect the chip to other chips, devices, components, and networks.
- the chip-level router (L 1 ) 102 routes packets destined for other chips or destinations through the external ports 103 over one or more high-speed serial busses 104 a , 104 b .
- Each serial bus 104 comprises at least one media access control (MAC) port 105 a , 105 b and a physical layer hardware transceiver 106 a , 106 b.
- the L 1 router 102 routes packets to and from a primary general-purpose memory for the chip through a supervisor port 107 to a memory supervisor 108 that manages the general-purpose memory. Packets to-and-from lower-tier components are routed through internal ports 121 .
- Each of the superclusters 122 a - 122 d may be interconnected via an inter-supercluster router (L 2 ) 120 which routes transactions between superclusters and between a supercluster 122 and the chip-level router (L 1 ) 102 .
- Each supercluster 122 may include an inter-cluster router (L 3 ) 126 which routes transactions between each cluster 128 in the supercluster 122 , and between a cluster 128 and the inter-supercluster router (L 2 ) 120 .
- Each cluster 128 may include an intra-cluster router (L 4 ) 132 which routes transactions between each processing element 134 in the cluster 128 , and between a processing element 134 and the inter-cluster router (L 3 ) 126 .
- the level 4 (L 4 ) intra-cluster router 132 may also direct packets between processing elements 134 of the cluster and a cluster memory 136 . Tiers may also include cross-connects (not illustrated) to route packets between elements in a same tier in the hierarchy.
- the router When data packets arrive in one of the routers, the router examines the header at the front of each packet to determine the destination of the packet's data payload. Each chip 100 is assigned a unique device identifier (“device ID”). Packet headers received via the external ports 103 each identify a destination chip by including the device ID in an address contained in the packet header. Packets that are received by the L 1 router 102 that have a device ID matching that of the chip containing the L 1 router are routed within the chip using a fixed pipeline to the supervisor 108 or through one of the internal ports 121 linked to a cluster of processor cores within the chip. When packets are received with a non-matching device ID by the L 1 router 102 , the L 1 router 102 uses programmable routing to select an external port and relay the packet back off the chip.
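The L1 routing decision above (matching device ID → fixed internal pipeline; non-matching → programmable external port selection) can be sketched as follows. The dictionary fields and table shape are illustrative assumptions, not the actual header or register layout.

```python
def route(packet, my_device_id, external_route_table):
    """Sketch of the L1 router decision: packets whose header device ID
    matches this chip are routed internally; otherwise a programmable
    table selects an external port to relay the packet off-chip."""
    if packet["device_id"] == my_device_id:
        # Fixed pipeline: to the supervisor or an internal cluster port.
        return ("internal", packet["internal_port"])
    # Programmable routing: look up which external port reaches the chip.
    return ("external", external_route_table[packet["device_id"]])
```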
- the invoking processing element 134 sends a packet comprising a task descriptor to the local task distributor 114 .
- the L 1 router 102 and/or the L 2 router 120 include a task port 113 a / 113 b and a queue port 115 a / 115 b .
- the routers route the packet containing the task descriptor via the task port 113 to the task distributor 114 , which examines the task identifier included in the descriptor and determines the queue 118 to which to assign the task.
- the assigned queue 118 may be on the chip 100 , or may be a queue on another chip.
- if the assigned queue 118 is on the chip 100 , the task distributor 114 transfers the descriptor to the queue router 116 , which deposits the descriptor in the assigned queue. Otherwise, the task descriptor is routed to the other chip which contains the assigned queue 118 .
- the queue port 115 a is used by the L 1 router 102 to route descriptors that have been assigned by the task distributor 114 on another chip to the designated input queue 118 via the queue router 116 .
- the processing elements 134 may retrieve task results from the output queue via the queue port 115 a / 115 b using read/get requests, routed via the L 1 and/or L 2 routers.
- Cross-connects may be provided to signal when there is data enqueued in the I/O queues, which the processing elements 134 can monitor.
- an eight-bit bus may be provided, where each bit of the bus corresponds to one of the I/O queues 118 a - 118 f .
- a processing element 134 may monitor the bit line corresponding to the queue while awaiting task results, and retrieve the results after the bit line is asserted.
- subscribed processing elements 134 may monitor the bit line corresponding to the queue for tasks for the availability of tasks awaiting processing.
- FIGS. 2A and 2E are examples of task distributors 114 / 114 ′ that assign tasks to a hardware queue.
- the task distributors 114 / 114 ′ receive a task request 240 via a task port 113 , select a task input queue 118 associated with the task based on a task identifier 232 included in the task request 240 , obtain an address or other queue identifier of the task input queue 118 , and enqueue the task request 240 in the queue 118 using the address or other identifier.
- Selecting the task input queue and obtaining its address may be performed as plural steps or may be combined as a single step.
- a task input queue may be selected and then an address/identifier may be obtained for the selected task input queue, or the addresses/identifiers of one or more task input queues may be obtained and then the task input queue may be selected.
- Process combinations may also be used to select a queue and obtain a queue address/identifier, such as selecting several candidate input task queues, obtaining their addresses/identifiers, and then selecting an input task queue based on its address/identifier.
- the task distributor 114 receives a task request 240 and the controller 214 of the task distributor 114 uses a content-addressable memory (CAM) 252 to select the task input queue 118 and obtain the address/identifier 210 of the input queue 118 based on the extracted task identifier 232 .
- An advantage of using a CAM 252 over using hash tables or table look-up techniques is that a CAM can return a result typically within one or two clock cycles, which will typically be faster than hashing or searching a table.
- a disadvantage of CAM is that each CAM 252 takes up more physical space on the chip 100 , with the amount of space needed increasing as the number of queues 118 increases.
- CAM is practical if there is a limited number of task queues (e.g., 8 input queues). Thus, there is a speed versus space trade-off between CAM and other address resolution approaches.
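Functionally, the CAM 252 is an associative lookup from task identifier to queue address. A dictionary is only a behavioral stand-in (a real CAM answers in one or two clock cycles in hardware), and the class name and example values below are illustrative.

```python
class TaskCam:
    """Behavioral model of the CAM-based queue selection: an associative
    array from task identifier (search tag) to the address/identifier of
    the input queue configured for that task type."""
    def __init__(self, entries):
        self._table = dict(entries)  # tag -> queue address/identifier

    def lookup(self, task_id):
        # Returning None models a CAM miss (no queue configured for the tag).
        return self._table.get(task_id)
```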
- FIG. 2B illustrates an example structure of the task request packet 240
- FIG. 2C illustrates an example structure of the queue assignment packet 242
- the structures of these packets will be discussed in more detail in connection with FIG. 3 and FIGS. 4A-4D below, but are introduced here to explain the operation of the task distributors 114 and 114 ′.
- the task request packet 240 includes a header 202 a and a payload comprising a task descriptor 230 a .
- the header 202 a includes the address of the task distributor 114 / 114 ′.
- the task descriptor 230 a comprises a task identifier 232 and various task parameters and data 233 .
- the queue assignment packet 242 includes a header 202 b and a payload comprising a task descriptor 230 b .
- the task descriptor 230 b comprises the task parameters and data 233 .
- FIG. 2D illustrates the components of the controller 214 , and the principles of operation of the task distributor 114 .
- the task request packet 240 has a particular format, such that a parser 270 can read/extract the specific range of bits from the packet that correspond to the task identifier 232 , with the bits that follow the task identifier 232 being the task parameters and data 233 .
- the task identifier 232 begins at a pre-defined offset (e.g., an offset of zero as illustrated in FIG. 2B ).
- the parser 270 outputs the bits corresponding to the task identifier 232 to the CAM 252 .
- the bits corresponding to the task parameters and data 233 are directed to an assembler 274 .
- the CAM 252 contains an associative array table that links search tags (i.e., the task identifiers) to input queue addresses/identifiers.
- the CAM 252 receives the task identifier 232 and outputs a queue address/identifier 210 of a selected input queue that is configured to receive the specified task.
- the parser 270 may optionally include additional functionality. For example, it is possible to compress the task descriptor 230 a (e.g., using Huffman encoding). In such a case, the parser 270 may be responsible for de-compressing any data that precedes the task identifier 232 to find the offset at which the task identifier 232 starts, then transmitting the task identifier 232 to the CAM 252 . In such a design, the CAM 252 might use either the compressed or un-compressed form of the task identifier 232 as its key. In the latter case, the parser 270 would also be responsible for de-compressing the task identifier 232 prior to transmitting it to the CAM 252 .
- the assembler 274 is roughly a mirror image of the parser 270 . Where the parser 270 extracts a task identifier 232 that indirectly refers to a task queue, the assembler 274 re-assembles an output packet (queue assignment 242 ) that describes the task with a header 202 b that includes a physical or virtual address of a selected queue based on the address/identifier 210 , where the header address is for the selected input queue that can carry out the type of task denoted by the task identifier 232 .
- the payload of the output packet comprises the parameters and data 233 .
- the assembler 274 receives the address/identifier 210 of the selected input queue from the CAM 252 and the task parameters and data 233 from the parser 270 .
- Various approaches may be used by the assembler 274 to assemble the output packet 242 .
- the parser 270 may send the task descriptor 230 a to the assembler, and the assembler 274 may overwrite the bits corresponding to the task identifier 232 with the header address, or the assembler 274 may concatenate the header address with the task parameters and data 233 .
- the assembler 274 may also include additional functionality. For example, if a compressed format is being used, the assembler 274 may re-compress some or all the task parameters and data 233 contained in the routable task descriptor 230 b . The assembler 274 could also rearrange the data, or carry out other transformations such as converting a compressed format to an uncompressed format or vice versa.
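The overwrite variant of the assembler described above (replacing the task-identifier bits with the selected queue's address) can be sketched in a few lines. The byte widths and function name are illustrative assumptions; the passage only fixes that the tag sits at offset zero and is replaced by the queue address.

```python
def assemble_queue_assignment(descriptor: bytes, queue_address: bytes,
                              tag_len: int = 2) -> bytes:
    """Sketch of the assembler 274: the first tag_len bytes of the task
    descriptor hold the task identifier; overwrite them with the selected
    input queue's address so the packet routes directly to that queue,
    keeping the task parameters and data as the payload."""
    assert len(queue_address) == tag_len
    params_and_data = descriptor[tag_len:]  # everything after the tag
    return queue_address + params_and_data  # new header + same payload
```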
- the task distributor 114 ′ receives the task request 240 via the task port 113 .
- a hash table 220 or sorted table 221 may be stored in a memory or a plurality of registers 250 associated with the task distributor 114 ′.
- types of tasks are identified by a system-wide address of a kernel used to process that type of task.
- a controller 216 extracts the task identifier 232 from the descriptor 230 a of the task request 240 , and applies a hash function or search function to select a task input queue 118 , and to obtain the address 210 or other queue identifier of the task input queue 118 .
- a hash function may be used to select the queue and obtain the queue's address/identifier 210 with or without a hash table 220 .
- a search function may be used to select the queue and obtain the queue's address/identifier 210 based on data in a sorted table 221 .
- the hash table 220 may be a distributed hash table, so one type of task has queues distributed throughout the system.
- a task request 240 causes the controller 216 to apply a distributed hash function to produce a hash that would find a “nearby” queue for that task, where the nearby queue should be reachable from the task distributor with a latency that is less than (or tied for smallest) the latency to reach other queues associated with the same task.
- Expected latency may be determined, among other ways, based on the minimum number of “hops” (e.g., intervening routers) to reach each queue from the task distributor 114 ′.
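The hop-count selection just described reduces to a minimum over the candidate queues for a task type. The function and argument names below are illustrative; the patent only requires the chosen queue's hop count to be smallest or tied for smallest.

```python
def nearest_queue(candidates, hop_counts):
    """Pick the 'nearby' queue: among the candidate queues that handle
    this task type, choose the one reachable through the fewest
    intervening routers (hops) from the task distributor. Python's min()
    resolves ties to the first candidate, which satisfies 'tied for
    smallest'."""
    return min(candidates, key=lambda q: hop_counts[q])
```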
- the controller 216 outputs a packet containing the queue assignment 242 , replacing the destination address in the header with the address of the assigned queue, as discussed in connection with FIGS. 2A-2D .
- the packet is then routed to the assigned queue where it is enqueued, either via the queue router 116 or the L 1 router 102 (if the queue is on another chip).
- Hop information may be determined, among other ways, from a routing table.
- the routing table may, for example, be used to allocate addresses that indicate locality to at least some degree.
- Distributed hashing frequently uses a very large (and very sparse) address space “overlaid” on top of a more typical addressing scheme like Internet Protocol (IP). For example, a hash might produce a 160-bit “address” that's later transformed to a 32-bit IPv4 address.
- the allocation of addresses may be tailored to the system topology, such that the address itself provides an indication of that node's locality (e.g., assuming a backbone plus branches, the address could directly indicate a node's position on the backbone and distance from the backbone on its branch).
- Hop information can be used with the CAM 252 as well. However, given the expense of storage in a CAM and the advantage of keeping that data to a minimum, each CAM 252 will ordinarily store just one “best” result for a given tag lookup.
- FIG. 2F illustrates the components of the controller 216 , and the principles of operation of the task distributor 114 ′.
- the parser 270 and the assembler 274 are the same as those discussed in connection with FIG. 2D . However, in controller 216 , the parser 270 outputs the task identifier 232 to an address resolver 272 .
- the address resolver 272 applies a hash or search function to select the queue and obtain the queue's address/identifier 210 , outputting the address/identifier 210 to the assembler 274 .
- FIG. 2G illustrates examples of different process flows that may be used by the controller 214 / 216 for address resolution ( 290 a - 290 e ).
- Resolution process 290 a corresponds to that used by the task distributor 114 in FIGS. 2A and 2D , with a task identifier (tag 232 ) input into the CAM 252 , producing the queue address/identifier 210 .
- Resolution process 290 b may be used by an address resolver 272 a (an example of address resolver 272 in FIG. 2F ) without a table 220 / 221 .
- the address resolver 272 a inputs the tag 232 into a hash function 280 as the function's “key,” where the hash function 280 hashes the key to produce the queue address/identifier 210 .
- resolution process 290 c adds an address lookup to resolve the hash into an address or other identifier.
- An address resolver 272 b (an example of address resolver 272 in FIG. 2F ) uses a hash table 220 to lookup the address/identifier 210 .
- the tag 232 is input into the hash function 281 as the function's “key,” where the hash function 281 hashes the key to produce one or more index values 208 .
- the address resolver 272 b resolves the index value 208 into the address/identifier 210 using the hash table 220 . If there is more than one tag 232 that hashes to the same table location, the result is a hash “collision.” Such collisions can be resolved in any of several ways, such as linear probing, collision chaining, secondary hashing, etc.
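Of the collision-resolution options listed above, collision chaining is the simplest to sketch: each bucket holds every (tag, address) pair that hashed to its index, and lookup scans the chain. The class shape and bucket count are illustrative assumptions.

```python
class ChainedHashTable:
    """Hash-table address lookup with collision chaining: tags that hash
    to the same index 208 share a bucket, and lookup scans that bucket's
    chain for an exact tag match."""
    def __init__(self, size=8):
        self._buckets = [[] for _ in range(size)]

    def _index(self, tag):
        return hash(tag) % len(self._buckets)  # hash function 281

    def insert(self, tag, queue_address):
        self._buckets[self._index(tag)].append((tag, queue_address))

    def lookup(self, tag):
        for t, addr in self._buckets[self._index(tag)]:
            if t == tag:
                return addr
        return None
```

With `size=1` every insertion collides, exercising the chain scan directly.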
- a distributed hash function (e.g., 280 , 281 ) may be recomputed and redistributed to all the task distributors 114 ′.
- Other options include leaving the function 280 / 281 itself unchanged but modifying data that it uses internally (not illustrated, but may also be stored in registers/memory 250 ), or leaving the function 280 / 281 alone but modifying the address lookup data (e.g., hash table 220 ). Choosing between modifying the hash function's data and modifying the lookup data is often a fairly open choice, and depends in part on how the hash function is structured and implemented (e.g., implemented in hardware, implemented as processor-executed software, etc.).
- the hash functions 280 / 281 used by the task distributors 114 ′ may be the same throughout the system, or may be localized, depending upon whether localization data is updated by updating the hash function 280 / 281 , its internal data, or its lookup table 220 .
- the distributed hash tables 220 , sorted tables 221 , and/or data used by the functions stored in one or more registers may be updated each time a device/node 100 is added or removed from the system.
- a lookup table may be used to store a tag 232 , and with it an address/queue identifier 210 .
- Sorting the table by tag 232 , an interpolating search 282 may be used to search a small table, or a binary search 283 may be used to search a large table.
- Resolution process 290 d may be used by an address resolver 272 c (an example of address resolver 272 in FIG. 2F ) with a sorted table 221 .
- the address resolver 272 c performs an interpolating search 282 on the sorted table 221 , using an index 208 based on the tag 232 .
- the search 282 produces the address/identifier 210 .
- Resolution process 290 e may be used by an address resolver 272 d (an example of address resolver 272 in FIG. 2F ) with the sorted table 221 .
- the address resolver 272 d performs a binary search 283 on the sorted table 221 , using the index 208 based on the tag 232 .
- the search 283 produces the address/identifier 210 .
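The binary-search resolution (290 e) over a table sorted by tag can be sketched with the standard bisection algorithm. The table layout (a list of (tag, address) pairs in ascending tag order) is an illustrative assumption.

```python
import bisect

def lookup_sorted(table, tag):
    """Binary search 283 over a sorted table 221: 'table' is a list of
    (tag, queue_address) pairs in ascending tag order, giving the
    O(log N) lookup described below."""
    keys = [t for t, _ in table]       # extracted for bisect; a hardware
    i = bisect.bisect_left(keys, tag)  # resolver would search in place
    if i < len(table) and table[i][0] == tag:
        return table[i][1]
    return None  # no queue configured for this tag
```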
- Other search methods may be used.
- while the table 221 is sorted for efficiency, a non-sorted table may instead be used, depending upon the search method employed.
- the logic providing the function 280 - 283 may be fixed, with updates being made to table values (e.g., 220 / 221 ) and/or to other registers storing values used by the function, separate from the logic.
- if the function is implemented as processor-executed software, the software (as stored in memory), the table values (e.g., 220 / 221 ), and/or registers storing values used by the function may be updated.
- the type of function and nature of the tables may be changed as the system scales, selecting a function 280 - 283 optimized for the scale of the topology.
- Hash tables 220 typically have O(1) expected lookup complexity, but O(N) worst case; deletion is often more expensive, and sometimes completely unsupported.
- Sorted tables 221 with binary search 283 offer O(log N) lookup, and O(N) insertion or deletion.
- Sorted tables 221 with interpolating search 282 improve search complexity to O(log log N), but insertion or deletion is still typically O(N).
- a self-balanced binary search tree may be used for O(log N) insertion, deletion or lookup. In a small system, all of the table-based address resolution approaches should be adequate, as the tables involved are relatively small.
- one-or-more processing elements 134 on the chip 100 may load and launch a queue update program.
- the queue update program may determine the input queue address/identifier 210 for each possible task ID 232 , and determine whether any of those addresses/identifiers are for I/O queues 118 on the device 100 containing the task distributor 114 / 114 ′.
- the queue update program then configures each queue for the assigned task (if not already configured), and configures at least one processing element 134 to subscribe to each input queue.
- FIG. 3 illustrates an example of a packet header 302 used to communicate within the architecture.
- a processing element 134 may access its own registers directly without a global address or use of packets. For example, if each processor core has 256 operand registers, the core may access each register via the register's 8-bit unique identifier. Likewise, a processing element can directly access its own program memory. In comparison, a global address may be (for example) 64 bits. Shared memory and the externally accessible locations in the memory and registers of other processing elements may be addressed using a global address of the location, which may include that address' local identifier and the identifier of the tier (e.g., device ID 312 , cluster ID 314 ).
- a packet header 302 may include a global address.
- a payload size 304 may indicate a size of the payload associated with the header. If no payload is included, the payload size 304 may be zero.
- a packet opcode 306 may indicate the type of transaction conveyed by the header 302 , such as indicating a write instruction, a read instruction, or a task assignment.
- a memory tier “M” 308 may indicate what tier of device memory is being addressed, such as main memory (connected to memory supervisor 108 ), cluster memory 136 , or a program memory or registers within a processing element 134 .
- a processing-element-level address 310 c may include the device identifier 312 , the cluster identifier 314 , a processing element identifier 316 , an event flag mask 318 , and an address 320 c of the specific location in the processing element's operand registers, program memory, etc.
- Global addressing may accommodate both physical and virtual addresses.
- the event flag mask 318 may be used by a packet to set an “event” flag upon arrival at its destination.
- Special purpose registers within the execution registers of each processing element may include one or more event flag registers, which may be used to indicate when specific data transactions have occurred. So, for example, a packet header designating an operand register of a processing element 134 may indicate to set an event flag upon arrival at the destination processing element. A single event flag bit may be associated with all the registers, or with a group of registers. Each processing element 134 may have multiple event flag bits that may be altered in such a manner. Which flag is triggered may be configured by software, with the flag to be triggered designated within the arriving packet. A packet may also write to an operand register without setting an event flag, if the packet event flag mask 318 does not indicate to change an event flag bit.
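- A software sketch of how the tiered global address fields might be packed into a single word follows. The field widths are illustrative assumptions, not taken from the patent, and the functions are hypothetical.

```python
# Hypothetical packing of a processing-element-level global address 310 c:
# device ID 312, cluster ID 314, processing element ID 316, event flag
# mask 318, and local address 320 c. Field widths are assumed for
# illustration only.
DEVICE_BITS, CLUSTER_BITS, PE_BITS, EVENT_BITS, ADDR_BITS = 16, 8, 8, 8, 24

def pack_address(device_id, cluster_id, pe_id, event_mask, local_addr):
    """Pack the tiered identifiers into a single 64-bit global address."""
    word = device_id
    word = (word << CLUSTER_BITS) | cluster_id
    word = (word << PE_BITS) | pe_id
    word = (word << EVENT_BITS) | event_mask
    word = (word << ADDR_BITS) | local_addr
    return word

def unpack_event_mask(word):
    """Extract the event flag mask 318, used to set flags on arrival."""
    return (word >> ADDR_BITS) & ((1 << EVENT_BITS) - 1)
```

A router or destination interface would extract the event flag mask from an arriving packet header in this manner to decide which (if any) event flag bit to set.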
- FIGS. 4A to 4D illustrate examples of packet payloads containing task descriptors, used within the architecture to delegate tasks.
- a packet payload contains a task descriptor 430 a .
- the task descriptor 430 a includes the task identifier 432 , a normal return indicator 434 indicating where to deposit (i.e., write/save/store/enqueue) a normal response, an address 436 where to report an error, and any task operands and data 438 (or an address of where operands and data are stored).
- the task descriptor 430 a may also include a bit 433 that indicates whether the task descriptor 430 a includes additional task identifiers 432 .
- the additional task bit 433 may be appended onto the task identifier 432 , or indicated elsewhere in the task descriptor.
- the normal return indicator 434 and error reporting address 436 may indicate a memory or register address, the address of an output queue, or the address of any reachable component within the system. “Returning” results data to a location specified by the normal return indicator 434 includes causing the results data to be written, saved, stored, and/or enqueued to the location.
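- The descriptor fields above can be modeled in software. This is a hypothetical sketch; the field names follow the figure, but the types and defaults are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical software model of a task descriptor 430 a.
@dataclass
class TaskDescriptor:
    task_id: int          # task identifier 432
    normal_return: int    # normal return indicator 434
    error_address: int    # error reporting address 436
    operands: list = field(default_factory=list)  # operands and data 438
    more_tasks: bool = False  # additional task bit 433
```

A task request would populate these fields and enqueue the descriptor (or, per FIG. 4C, its memory address) into an input queue.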
- FIG. 4B illustrates an example of a packet payload 422 b including a task descriptor 430 b that contains multiple task assignments.
- the descriptor includes a first task identifier 432 a , a second task identifier 432 b , a third task identifier 432 c , the normal return indicator 434 , the error reporting address 436 , and the task operands and data 438 .
- An additional task bit 433 a is appended onto the first task identifier 432 a , and indicates that there are additional tasks after the first task.
- An additional task bit 433 b is appended onto the second task identifier 432 b , and indicates that there are additional tasks after the second task.
- An additional task bit 433 c is appended onto the third task identifier 432 c , and indicates that there are no further tasks after the third task.
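- The chained task identifiers of FIG. 4B can be sketched as follows. This is a hypothetical encoding in which each identifier carries an additional-task bit (placed here in the low bit) that is set while more tasks follow; the bit position and identifier width are illustrative assumptions.

```python
# Hypothetical encoding of a task-identifier chain: the additional-task
# bit (low bit here) is set on every identifier except the last.
def encode_task_chain(task_ids):
    words = []
    for i, tid in enumerate(task_ids):
        more = 1 if i < len(task_ids) - 1 else 0
        words.append((tid << 1) | more)
    return words

def decode_task_chain(words):
    task_ids = []
    for word in words:
        task_ids.append(word >> 1)
        if not (word & 1):  # additional-task bit clear: last task
            break
    return task_ids
```

A task distributor need only examine the first identifier and its additional-task bit; a task-executing processing element can strip the completed identifier and forward the remainder of the chain.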
- FIG. 4C illustrates a packet payload 422 c that comprises an address 440 in memory from which the task descriptor 430 may be fetched.
- the stored task descriptor may be, for example, the task descriptors 430 a or 430 b .
- the originating processor stores the task descriptor prior to sending the packet carrying the memory address 440 of the task descriptor in its payload 422 c .
- the sizes of task requests 240 and queue assignments 242 are reduced, allowing the capacity of each slot in the queues 118 to be smaller. For example, using the payload 422 c , the size of each slot in the queues 118 may be a single word.
- a “word” is a fixed-sized piece of data, such as a quantity of data handled as a unit by the instruction set and/or the processor core of a processing element 134 , and can vary from core to core. For example, a “word” might be 64 bits in one architecture, whereas a “word” might be 128 bits on another architecture.
- a trade-off is that the task distributor 114 and a processing element 134 subscribed to an input queue must access memory to retrieve some or all of the descriptor. For example, the task distributor 114 may read the first word of the stored descriptor to determine the task identifier 432 , whereas a subscribed processing element 134 may retrieve the entire stored descriptor.
- each processing element 134 that works with the descriptor 430 b may adjust an offset of the address 440 or otherwise crop the task descriptor 430 b so that the identifiers of tasks that have already been completed are not retrieved again in subsequent operations.
- FIG. 4D illustrates a packet payload 422 d that comprises a task identifier 432 a and an address 450 in memory from which a remainder of the task descriptor 430 may be fetched. While the packet payload 422 d doubles the size of the payload relative to payload 422 c , including the next task identifier within the packet itself simplifies the processing to be performed by the task distributor 114 , since the task distributor can issue the queue assignment 242 without having to access memory to determine the next task identifier 432 a .
- the task-executing processing element 134 can extract any subsequent task identifier (e.g., 432 b ) and expose the subsequent task identifier in the same manner as illustrated in FIG. 4D , when sending the subsequent task to another task distributor 114 .
- FIG. 5 illustrates task descriptors being enqueued and dequeued from the memory/register stack of a hardware task queue 118 .
- Each queue 118 comprises a stack of storage slots 572 a to 572 h , where each “slot” comprises a plurality of registers or memory locations.
- the size of each slot may correspond, for example, to a maximum allowed size for a descriptor 430 (e.g., the maximum number of words).
- the descriptor 430 b is dequeued from the front of the queue in accordance with a front pointer 532 .
- the front pointer 532 and the back pointer 533 may be equal.
- FIG. 6 is an abstract representation of how slots within a queue stack are accessed and recycled in a first-in-first-out (FIFO) manner. Enqueued descriptors remain in their assigned slot 572 , with the back pointer 533 and front pointer 532 changing as descriptors 430 / 530 are enqueued and dequeued.
- FIG. 7 is an example circuit overview of a task-assignable hardware queue 118 .
- the queue 118 includes several general registers 760 that are used for both input queue and output queue operations. Also included are input queue-specific registers 767 that are used specifically for input queue operations.
- the general purpose registers 760 include a front pointer register 762 containing the front pointer 532 , a back pointer register 763 containing the back pointer 533 , a depth register 764 containing the current depth of the queue, and several event flag registers 764 .
- Among the event flag registers is an empty flag 765 , indicating that the queue is empty.
- a data-enqueued interrupt signal may be sent to subscribed processors (input queue) or a processor awaiting results (output queue), signaling them to wake and dequeue a descriptor or result.
- the data-enqueued interrupt signal can be generated by an inverter (not illustrated) that has its input tied to the output of the AND gate 755 or to the empty flag 765 .
- Another event flag 764 is the full flag 766 .
- the data transaction interface 720 can output a back-pressure signal to the queue router 116 . Assertion of a back-pressure signal may result in error reporting (in accordance with the error reporting address 436 ) if a task arrives for a full queue.
- the queue router 116 may also include an arbiter to reassign the descriptor received for the full queue to another input queue attached to the queue router 116 that is configured to perform a same task (if such a queue exists).
- the event flags 764 may be masked so that when results data is enqueued, an interrupt is generated indicating to a waiting (or sleeping) processing element 134 that a result has arrived.
- processing elements subscribed to an input queue can set a mask so that a data-enqueued signal from the subscribed queue causes an interrupt, but data-enqueued signals from other queues are ignored.
- a “data available” flag register may be used, replacing the AND gate 755 with a NAND gate. In that case, the data-enqueued interrupt signal can be generated in accordance with the output of the NAND gate, or the state of the data available flag register.
- the input queue registers 767 are used by processing elements to subscribe and unsubscribe to the queue.
- a register 768 indicates how many processing elements 134 are subscribed to the queue. Each queue always has at least one subscribed processing element, so if an idle processing element goes to unsubscribe, but it is the only subscribed processing element, then the processing element remains subscribed. When new processing elements subscribe to the queue, the number in the register 768 is incremented. Also, when a new processing element subscribes to a queue, it determines the start address where the executable instructions for the task are in memory (e.g., 780 ) from a program memory address register 769 . The newly subscribed processing element then loads the task program into its own program memory.
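- The subscriber-count rule can be sketched in software. This is a hypothetical model of the register 768 behavior, with the minimum-one-subscriber rule enforced on unsubscribe; the class and method names are illustrative.

```python
# Hypothetical model of the subscriber-count register 768 and program
# memory address register 769: a processing element may not unsubscribe
# if it is the queue's only subscriber.
class SubscriptionRegister:
    def __init__(self, program_address):
        self.count = 0                          # register 768
        self.program_address = program_address  # register 769

    def subscribe(self):
        self.count += 1
        return self.program_address  # new subscriber loads this program

    def unsubscribe(self):
        if self.count <= 1:
            return False  # last subscriber must remain subscribed
        self.count -= 1
        return True
```

The address returned by subscribe() stands in for the start address of the task's executable instructions, which the newly subscribed element would load into its own program memory.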
- a data transaction interface 720 asserts a put signal 731 , causing a write circuit 733 to save/store the descriptor 430 or address 440 / 450 into the stack 570 at a write address 734 determined based on the back pointer 533 .
- the back pointer 533 may specify the most significant bits corresponding to the slot 572 where the descriptor 430 is to be stored.
- the write circuit 733 may write (i.e., save/store) an entirety of a descriptor 430 as a parallel write operation, or may write the descriptor in a series of operations (e.g., one word at a time), toggling a write strobe 735 and incrementing the least significant bits of the write address 734 until an entirety of the descriptor 430 is stored.
- the data transaction interface 720 de-asserts the put signal 731 , causing a counter 737 to increment the back pointer on the falling edge of the put signal 731 and causing a counter 757 to increase the depth value.
- the counter 737 counts up in a loop, with the maximum count equaling the number of slots 572 in the stack 570 . When the count exceeds the maximum count, a carry signal may be used to reset the counter 737 , such that the counter 737 operates in a continual loop.
- the data transaction interface 720 asserts a get signal 741 , causing a read circuit 743 to read the descriptor 430 or descriptor address 530 from the stack at a read address 744 determined based on the front pointer 532 .
- the front pointer 532 may specify the most significant bits corresponding to the slot 572 where the descriptor 430 b is stored.
- the read circuit 743 may read an entirety of the descriptor 430 as a parallel read operation, or may read the descriptor 430 as a series of reads (e.g., one word at a time).
- the data transaction interface 720 de-asserts the get signal 741 , causing a counter 747 to increment the front pointer 532 on the falling edge of the get signal 741 and causing a counter 757 to decrease the depth value.
- the counter 747 counts up in a loop, with the maximum count equaling the number of slots 572 . When the count exceeds the maximum count, a carry signal may be used to reset the counter 747 , such that the counter 747 operates in a continual loop.
- the empty flag 765 may be set by a circuit composed of a comparator 753 , an inverter 754 , and an AND gate 755 .
- the comparator 753 determines when the front pointer 532 equals the back pointer 533 .
- the inverter 754 receives the queue-full signal as input.
- the AND gate 755 receives the outputs of the comparator 753 and the inverter 754 . When the front and back pointers are equal and the full signal is not asserted, the output of the AND gate 755 is asserted, indicating that the queue is empty.
- Depending upon how the counters 737 , 747 , 757 manage their output while asserting their “carry” signals, it may be possible for the front and back pointers to be equal when the queue is full.
- the inverter 754 and AND gate 755 provide for that eventuality, so that when the front and back pointers are equal and the full signal is also asserted, the output of the AND gate 755 is de-asserted, indicating that the queue is not empty.
- a comparator may compare the depth 764 to zero to determine when the depth equals zero.
- the full flag 766 may be set by the carry output of the counter 757 , or a comparator may compare the depth 764 to the depth value corresponding to full.
- As illustrated, the queue 118 uses an arrangement in which a write is followed by incrementing the back pointer, and a read is followed by incrementing the front pointer.
- the queue may instead use an increment-and-then-write and increment-and-then-read arrangement.
- the counter 737 increments on the leading edge of the put signal 731
- the counter 747 increments on the leading edge of the get signal 741 .
- the front pointer 532 may be incremented on the falling edge of the get signal 741 , such that the front pointer 532 points to the slot that is currently at the front of the queue, whereas the back pointer 533 may be incremented on the leading edge of the put signal 731 , such that the back pointer 533 points to one slot behind the slot that will be used for the next write.
- the front pointer and back pointer will not be equal.
- a comparison of the front and back pointers by comparator 753 will not indicate whether the stack 570 is empty. In that case, whether the stack 570 is or is not empty may be determined from the depth 764 (e.g., comparing the depth value to zero).
- Whether the counter 757 increments and decrements the depth on the falling or leading edges may be independent of the arrangement used by the counters 737 and 747 . If the counter 757 increments and decrements on the leading put/get signal edges, subscribed or monitoring processing elements 134 may begin to dequeue a descriptor or descriptor address while it is being enqueued, since the data-enqueued interrupt signal may be generated before enqueuing is complete, thereby accelerating the enqueuing and dequeuing process. To accommodate simultaneous enqueuing and dequeuing from a same slot of the stack 570 , the memory/registers used for the stack 570 may be dual-ported.
- Dual-ported memory cells/registers can be read via one port and written to via another port at a same time. In comparison, if the counter 757 increments and decrements on the falling put/get signal edges (as illustrated in FIG. 7 ), then the descriptor or descriptor address will be fully loaded into the slot 572 before the data-enqueued interrupt signal is asserted.
- the front pointer 532 , the back pointer 533 , the depth value, empty flag, and full flag are illustrated in FIG. 7 as being stored in general registers 760 .
- looping increment and decrement circuits may be used to update the front pointer 532 , back pointer 533 , and depth value as stored in their registers instead of using dedicated counters.
- the general registers 760 used to store the front pointer 532 , back pointer 533 , depth value, empty flag, and full flag may be omitted, with the values read from the counters and logic (e.g., logic 753 , 754 , 755 ).
- Given any two of the front pointer, the back pointer, and the depth, the third value can be determined. So, for example, the depth can be determined based on the difference between the front pointer and the back pointer, or the depth can be used to determine the value of the front or back pointer, based on the value of the other pointer.
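- The circular queue of FIGS. 5 through 7 can be sketched in software. This is a hypothetical model, with the put/get signals, write/read circuits, and looping counters collapsed into method calls; the empty test mirrors the comparator/inverter/AND-gate logic (front equals back with the full flag clear).

```python
# Hypothetical software model of the circular FIFO queue of FIG. 7: slots
# are recycled, the back pointer advances on enqueue (put) and the front
# pointer on dequeue (get), with the counters looping modulo the slot count.
class CircularQueue:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.front = 0   # front pointer 532
        self.back = 0    # back pointer 533
        self.depth = 0   # depth register

    @property
    def full(self):
        return self.depth == len(self.slots)

    @property
    def empty(self):
        # Comparator 753 + inverter 754 + AND gate 755: pointers equal
        # AND not full means empty.
        return self.front == self.back and not self.full

    def put(self, descriptor):
        if self.full:
            raise OverflowError("back-pressure: queue full")
        self.slots[self.back] = descriptor
        self.back = (self.back + 1) % len(self.slots)  # looping counter
        self.depth += 1

    def get(self):
        if self.empty:
            raise IndexError("queue empty")
        descriptor = self.slots[self.front]
        self.front = (self.front + 1) % len(self.slots)
        self.depth -= 1
        return descriptor
```

Note that after filling the queue the pointers are again equal, so the full flag is what disambiguates full from empty, exactly as the inverter and AND gate provide for in the hardware description.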
- While FIGS. 5 through 7 illustrate the FIFO queues as circular queues, a shift register queue may instead be used.
- a shift register queue comprises a series of registers, where each time a slot is dequeued, all of the contents are copied forward. With shift register queues, the slot constituting the “front” is always the same, with only the back pointer changing.
- circular queues have advantages over shift register queues, such as lower power consumption, since copying multiple descriptors or descriptor addresses from slot-to-slot each time a descriptor 430 or descriptor address 440 / 450 is dequeued increases power consumption relative to the operations of a circular queue.
- FIG. 8 is a block diagram conceptually illustrating example components of a processing element of the chip in FIG. 1 .
- the structure of the processing elements 134 that are executing the main software program and that are subscribed to individual task queues may be identical, with the difference being that a processing element that is subscribed to a task queue 118 is loaded/configured with the scheduler 883 and idle counter 887 .
- a data transaction interface 872 sends and receives packets and connects the processor core 890 to its associated program memory 874 .
- the processor core 890 may be of a conventional “pipelined” design, and may be coupled to sub-processors such as an arithmetic logic unit 894 and a floating point unit 896 .
- the processor core 890 includes a plurality of execution registers 880 that are used by the core 890 to perform operations.
- the registers 880 may include, for example, instruction registers 882 , operand registers 884 , and various special purpose registers 886 . These registers 880 are ordinarily for the exclusive use of the core 890 for the execution of operations. Instructions and data are loaded into the execution registers 880 to “feed” an instruction pipeline 892 .
- a processor core 890 may experience no latency (or a latency of one-or-two cycles of the clock controlling timing of a micro-sequencer 891 ) when accessing its own execution registers 880 , accessing memory that is external to the core 890 may produce a larger latency due to (among other things) the physical distance between the core 890 and the memory.
- the instruction registers 882 store instructions loaded into the core that are being/will be executed by an instruction pipeline 892 .
- the operand registers 884 store data that has been loaded into the core 890 that is to be processed by an executed instruction.
- the operand registers 884 also receive the results of operations executed by the core 890 via an operand write-back unit 898 .
- the special purpose registers 886 may be used for various “administrative” functions, such as being set to indicate divide-by-zero errors, to increment or decrement transaction counters, to indicate core interrupt “events,” etc.
- the instruction fetch circuitry of a micro-sequencer 891 fetches a stream of instructions for execution by the instruction pipeline 892 in accordance with an address generated by a program counter 893 .
- the micro-sequencer 891 may, for example, fetch an instruction every “clock” cycle, where the clock is a signal that controls the timing of operations by the micro-sequencer 891 and the instruction pipeline 892 .
- the instruction pipeline 892 comprises a plurality of “stages,” such as an instruction decode stage, an operand fetch stage, an instruction execute stage, and an operand write-back stage. Each stage corresponds to circuitry.
- the chip's firmware may include a small scheduler program 883 .
- If a core 890 waits too long (an exact duration may be specified in a register, e.g., based on a number of clock cycles) for a task to show up in its queue, the core 890 wakes up and runs the scheduler 883 to find some other queue with tasks for it to execute, and thereafter begins executing those tasks.
- the scheduler program 883 may be loaded into the instruction registers 882 of processing elements 134 subscribed to a task queue when the processing element's idle counter 887 indicates that the threshold duration of time has transpired (e.g., that the requisite number of clock cycles have elapsed).
- the scheduler program 883 may either be preloaded into the processing element 134 , or loaded upon expiration of the idle counter 887 .
- the idle counter 887 causes generation of an interrupt resulting in the micro-sequencer 891 executing the scheduler 883 , causing the processing element 134 to search through the (currently in-use) queues, and find a queue with tasks that need execution. Once it finds a new queue, it unsubscribes from the old queue (decrementing the number in register 768 ), subscribes to the new queue (incrementing the number in register 768 ), fetches the program address from register 769 of the new queue, and loads the task program code into its own program memory 874 .
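- The resubscription step can be sketched in software. This is a hypothetical model of the scheduler's queue switch: decrement register 768 of the old queue, increment register 768 of the new queue, and load the task program from the address in register 769. The Queue and Element classes are illustrative stand-ins for the hardware.

```python
# Hypothetical sketch of the resubscription performed by the scheduler 883.
class Queue:
    def __init__(self, subscribers, program_address):
        self.subscribers = subscribers          # register 768
        self.program_address = program_address  # register 769

class Element:
    def __init__(self):
        self.program = None

    def load_program(self, address):
        self.program = address  # stands in for loading program memory 874

def switch_queue(element, old_queue, new_queue):
    """Move a processing element from old_queue to new_queue."""
    if old_queue.subscribers <= 1:
        return False  # a queue must always keep at least one subscriber
    old_queue.subscribers -= 1  # decrement register 768 of the old queue
    new_queue.subscribers += 1  # increment register 768 of the new queue
    element.load_program(new_queue.program_address)
    return True
```

The minimum-one-subscriber check reflects the rule that a processing element may not abandon a queue it is the sole subscriber of.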
- FIG. 9 illustrates a plurality of the multiprocessor chips connected together, with the task-assignable queues of several of the chips assigned to receive tasks.
- a processor chip 100 a includes a Task 1 queue 118 . 1 a , a Task 2 queue 118 . 2 a , and a Task 5 queue 118 . 5 a .
- a processor chip 100 d includes a Task 2 queue 118 . 2 d , a Task 3 queue 118 . 3 d , and a Task 4 queue 118 . 4 d .
- a processor chip 100 h includes a Task 1 queue 118 . 1 h , a Task 3 queue 118 . 3 h , and a Task 5 queue 118 . 5 h .
- Processor chips 100 b , 100 c , 100 e , 100 f , 100 g , and 100 i have no active task input queues, although some or all of their queues 118 may be arranged as output queues, receiving results when a task is completed.
- the arrangement of chips in FIG. 9 will be used as the basis for specific execution examples discussed in connection with FIGS. 10-14 .
- FIG. 10 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and the result is deposited into an output queue for the processor to retrieve.
- Task execution 1000 begins when a program executed by processor 134 a on processor chip 100 b results in issuance of a task 3 request 1002 to the task distributor 114 b on the processor chip 100 b .
- the task distributor 114 b , using a hash table 220 or CAM 252 , assigns 1004 the task to the task 3 queue 118 . 3 d on processor chip 100 d , which is closer (in terms of network hops) than the task 3 queue 118 . 3 h on processor chip 100 h.
- After a processor 134 c subscribed to the task 3 input queue 118 . 3 d becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 c retrieves 1006 the descriptor from the queue 118 . 3 d . Upon completion of the task, the processor 134 c writes 1010 (by packet) the result to an output queue 118 h on the processor chip 100 b in accordance with the normal return indicator 434 . The output queue 118 h generates an event signal 1012 , waking the processor 134 a (if in a low power mode), and causing the processor 134 a to retrieve 1014 the results from output queue 118 h.
- FIG. 11 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and the result is written back directly to the processor.
- Task execution 1100 begins when a program executed by processor 134 a on processor chip 100 b results in issuance of a task 3 request 1102 to the task distributor 114 b on the processor chip 100 b .
- the task distributor 114 b , using a hash table 220 or CAM 252 , assigns 1104 the task to the task 3 queue 118 . 3 d on processor chip 100 d , which is closer (in terms of network hops) than the task 3 queue 118 . 3 h on processor chip 100 h.
- After a processor 134 c subscribed to the task 3 input queue 118 . 3 d becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 c retrieves 1106 the descriptor from the queue 118 . 3 d . Upon completion of the task, the processor 134 c writes 1110 (by packet) the result directly to operand registers 884 or program memory 874 of the processing element 134 a in accordance with the normal return indicator 434 .
- FIG. 12 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and execution chains across queues, with the end-result being deposited into an output queue for the processor to retrieve.
- Chaining may be based on there being multiple task identifiers in the original task descriptor (e.g., FIG. 4B ), and/or based on one or more tasks initiating a chain when that task is invoked.
- the discussion of task execution 1200 in connection with FIGS. 12 and 13A to 13F is based on the former, where the original task descriptor includes multiple task identifiers.
- Task execution 1200 begins when a program executed by processor 134 a on processor chip 100 b results in issuance of a task 4 request 1202 to the task distributor 114 b on the processor chip 100 b .
- the task distributor 114 b , using a hash table 220 or CAM 252 , assigns 1204 the task to the task 4 queue 118 . 4 d on processor chip 100 d.
- After a processor 134 e subscribed to the task 4 input queue 118 . 4 d becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 e retrieves 1206 the descriptor from the queue 118 . 4 d . Upon completion of the task, the processor 134 e writes 1210 (by packet) the result to a task distributor 114 d on the processor chip 100 d as a Task 1 request as part of a chained task request. The task distributor 114 d sends 1212 the Task 1 assignment to the Task 1 input queue 118 . 1 a on processor chip 100 a.
- Upon completion of task 1 , the processor 134 a writes 1220 (by packet) the result to an output queue 118 h on processor chip 100 b , in accordance with the normal return indicator 434 .
- the output queue 118 h generates an event signal 1230 , waking the processor 134 a (if in a low power mode), and causing the processor 134 a to retrieve 1234 the results from output queue 118 h.
- FIGS. 13A to 13F illustrate examples of the content of several of the data transactions in FIG. 12 , based on the packet structure discussed in connection with FIGS. 3, 4A and 4B . If a packet payload only contains the address 440 of the task descriptor in memory (as discussed in connection with FIG. 4C ) or a packet payload contains a task identifier 432 and the address 450 of a remainder of the task descriptor in memory (as discussed in connection with FIG. 4D ), then the descriptors in the transactions illustrated in FIGS. 13A to 13F would reflect the state of the descriptors as stored at the addresses 440 or 450 .
- FIG. 13A illustrates a packet 1300 a used for the task 3 request 1202 , as issued by the processing element 134 a .
- the header 1302 a contains the address of the task distributor 114 b .
- the packet payload comprises a task descriptor 1330 a .
- the task descriptor 1330 a includes a task 4 task identifier 1332 a , a task 1 task identifier 1332 b , a normal return indicator 1334 corresponding to the address of the output queue 118 h , an error reporting address 1336 , and the task operands and/or data 1338 a .
- the additional task bit 1333 a appended on to the task 4 identifier 1332 a is set to indicate there is another task to be performed after task 4 .
- the additional task bit 1333 b appended on to the task 1 identifier 1332 b is set to indicate there is no other task to be performed after task 1 .
- FIG. 13B illustrates a packet 1300 b used for the queue assignment 1204 , as issued by the task distributor 114 b .
- the packet header 1302 b contains the address of the task 4 input queue 118 . 4 d .
- the packet payload comprises a task descriptor 1330 b .
- the descriptor 1330 b omits the task 4 identifier 1332 a .
- FIG. 13C illustrates the task descriptor 1330 b as pulled 1206 from the task 4 input queue 118 . 4 d by the task 4 processor 134 e.
- FIG. 13D illustrates a packet 1300 c used for the task 1 request 1210 , as issued by the task 4 processor 134 e .
- the packet header 1302 c contains the address of the task distributor 114 d .
- the packet payload comprises a task descriptor 1330 c .
- the descriptor 1330 c includes the results 1338 b from task 4 .
- the task 4 results may be appended onto the original task operands and data 1338 a (as illustrated), mixed with the original operands and data 1338 a , or the original operands and data 1338 a may be omitted.
- FIG. 13E illustrates a packet 1300 d used for the queue assignment 1212 , as issued by the task distributor 114 d .
- the packet header 1302 d contains the address of the task 1 input queue 118 . 1 a .
- the packet payload comprises a task descriptor 1330 d .
- the descriptor 1330 d omits the task 1 identifier 1332 b.
- FIG. 13F illustrates a packet 1300 e sent by the task 1 processor 134 a to the output queue 118 h in accordance with the normal return indicator 1334 .
- the packet header 1302 e contains the address of the output queue 118 h .
- the packet payload may comprise the error reporting address 1336 , the task 4 results data 1338 b , and the task 1 results data 1338 c .
- the task 1 and the task 4 results may be separate or mixed, or the task 4 results 1338 b may be omitted. If the original task operands and data 1338 a did carry through the chain to the last processor in the chain (task 1 processor 134 a in FIG. 12 , determining that it is last based on the additional tasks bit 1333 b ), that last processor may omit the original operands and data 1338 a in the final results.
- FIG. 14 is a transaction flow diagram illustrating an example where an originating processor 134 a deposits a task descriptor into an input queue, and a task-assigned processor deposits a sub-task into another input queue as a subroutine, with the end-result being deposited into an output queue 118 h for the originating processor 134 a to retrieve.
- Task execution 1400 begins when a program executed by processor 134 a on processor chip 100 b results in issuance of a task 4 request 1402 to the task distributor 114 b on the processor chip 100 b .
- the task distributor 114 b , using a hash table 220 or CAM 252 , assigns 1404 the task to the task 4 queue 118 . 4 d on processor chip 100 d.
- the task 4 processor 134 e retrieves 1406 the descriptor from the queue 118 . 4 d .
- task 4 itself uses task 1 as a subroutine, resulting in the task 4 processor 134 e sending 1410 a task 1 request to the task distributor 114 d on the processor chip 100 d .
- the task distributor 114 d sends 1412 the task 1 assignment to the task 1 input queue 118 . 1 a on processor chip 100 a.
- After a processor 134 a subscribed to the task 1 input queue 118 . 1 a becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 a retrieves 1414 the descriptor from the queue 118 . 1 a . Upon completion of the task, the task 1 processor 134 a writes 1420 (by packet) the result directly to the task 4 processor 134 e that issued the task 1 request. The task 4 processor 134 e thereafter completes task 4 , using the task 1 data. Upon completion, the task 4 processor 134 e writes 1422 (by packet) the result to an output queue 118 h on processor chip 100 b , in accordance with the normal return indicator 434 . The output queue 118 h generates an event signal 1430 , waking the originating processor 134 a (if in a low power mode), and causing the originating processor 134 a to retrieve 1434 the results from the output queue 118 h .
- FIG. 15 is a hybrid process-flow transaction-flow diagram illustrating execution of the scheduler program 883 by a task-assigned processor, enabling the processor to autonomously subscribe and unsubscribe from task queues.
- a task processor 134 a is subscribed to a task 1 input queue 118 . 1 a , which has two subscribed cores (as specified in register 768 ).
- the queue depth (from register 764 ) is initially zero.
- the task processor 134 a dequeues 1522 the task descriptor and executes 1524 the task, returning the results in accordance with the normal return indicator 434 .
- the task processor 134 a starts 1526 its idle counter 887 and may enter into a low power mode, waiting for an interrupt from the subscribed task queue indicating that a descriptor is ready to be dequeued.
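This wait-then-reschedule behavior can be modeled in ordinary software. The sketch below is purely illustrative (the patent describes a hardware idle counter 887 and queue interrupts, not Python): a `threading.Event` stands in for the queue's event signal, and the timeout stands in for the idle counter.

```python
import threading

descriptor_ready = threading.Event()  # stands in for the queue's event/interrupt signal

def wait_or_schedule(idle_timeout_s: float) -> str:
    """Sketch of the idle wait: block until the subscribed queue signals that
    a descriptor is ready, or until the idle duration elapses."""
    if descriptor_ready.wait(timeout=idle_timeout_s):
        return "dequeue"       # a descriptor arrived; wake and execute the task
    return "run_scheduler"     # idle duration elapsed; invoke the scheduler program
```

`Event.wait` returns False on timeout and True once the flag is set, which maps directly onto the two outcomes described above.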
- the task processor 134 a runs the scheduler program 883 , which determines 1530 whether there is more than one core subscribed to the task 1 queue 118 . 1 a (from register 768 ), such that the scheduler program 883 is permitted to choose a new input queue. If there is not ( 1530 “No”) more than one processor subscribed to the task 1 queue 118 . 1 a , the processor 134 a continues to wait 1534 for a new task 1 descriptor to appear in the input queue 118 . 1 a . Otherwise, the scheduler program 883 checks other input queues on the device to determine 1532 whether the depth of any of the other queues exceeds a minimum threshold depth “R”. The threshold depth is used to reduce the frequency with which processors unsubscribe and subscribe from and to input queues, since each new subscription results in memory being accessed to retrieve the task program executable code.
- the processor remains subscribed to the task 1 queue 118 . 1 a . Otherwise, the scheduler 883 selects 1536 a new input queue. For example, the scheduler 883 may select the input queue with the greatest depth, or select from among input queues tied for the greatest depth. The scheduler 883 unsubscribes 1538 from the task 1 queue 118 . 1 a , decrementing register 768 . The scheduler then subscribes 1540 to the task 2 input queue 118 . 2 a which had the largest depth of the task input queues on the device.
- the scheduler 883 then loads 1542 the task 2 program to the program memory 874 of the processing element 134 a , based on the program address in the register 769 of the task 2 queue 118 . 2 a .
- the task processor 134 a resumes normal operations, retrieving 1544 a task 2 descriptor from the task 2 queue 118 . 2 a , and executing that task 1546 .
- the task processor 134 a will continue executing that same retrieved program until such time that its idle counter expires again without a task becoming available.
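The subscribe/unsubscribe decision described above can be sketched as follows. This is a simplified software model, not the scheduler program 883 itself: the dictionary layout, queue names, and the value of the threshold "R" are all invented for illustration.

```python
R = 4  # minimum threshold depth "R" (value is illustrative)

def reschedule(current: str, queues: dict) -> str:
    """Sketch of the scheduler's decision logic. Each queue is modeled as a
    dict with 'depth' (cf. register 764) and 'subscribers' (cf. register 768).
    Returns the name of the queue the core should service next."""
    if queues[current]["subscribers"] <= 1:
        return current                    # sole subscriber: keep waiting on this queue
    others = [name for name in queues if name != current]
    deepest = max(others, key=lambda name: queues[name]["depth"])
    if queues[deepest]["depth"] <= R:
        return current                    # no other queue is backed up enough to switch
    queues[current]["subscribers"] -= 1   # unsubscribe from the current queue
    queues[deepest]["subscribers"] += 1   # subscribe to the deepest other queue
    return deepest                        # caller then loads that queue's task program
```

The threshold check mirrors the rationale in the text: switching queues is only worthwhile when another queue is deep enough to justify re-fetching a task program.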
- the scheduler program 883 may comprise executable software and/or firmware instructions, may be integrated into each task processor 134 as a sequential logic circuit, or may be a combination of sequential logic with executable instructions.
- sequential logic included in the task processor 134 may set and start ( 1526 ) the idle counter, and determine ( 1528 ) that the task processor 134 has been idle for (or longer than) a specified/preset/predetermined duration (e.g., based on the counter expiring or based on a comparison of the count on the counter equaling or exceeding the duration value).
- the sequential logic may load a remainder of the scheduler program 883 into the instruction registers 882 from the program memory 874 or another memory, based on an address stored in a specified register such as a special purpose register 886 .
- the disclosed system allows for a simple, relatively easy to understand interface that accommodates chips with a large number of cores and that improves scaling of a system by decoupling logical tasks from the arrangement of physical cores.
- a programmer writing the main program does not need to know (or care much) about how many cores will be executing assigned tasks.
- the number of cores can simply increase or decrease, depending on the number of tasks needing execution. Combined with the ability of cores to sleep while waiting for input, this flexible distribution of tasks also helps to reduce power consumption.
- bus-based networks may comprise address lines and data lines, conveying addresses via the address lines and data via the data lines.
- packet-based network connections may comprise a single serial data-line, or plural data lines, conveying addresses in packet headers and data in packet bodies via the data line(s).
- aspects of the disclosed system may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium.
- the computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure.
- the computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk and/or other media.
- a digital comparator that determines whether the depth 764 is equal to zero is functionally identical to a NOR gate: the data lines conveying the depth value are input into the NOR gate, the output of which will be asserted when the binary value across the data lines equals zero.
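As a software illustration of that equivalence, the sketch below ORs the individual bit lines of the depth value and inverts the result, which matches a direct `depth == 0` comparison (the eight-bit width is an assumed example):

```python
def depth_is_zero(depth: int, width: int = 8) -> bool:
    """Emulate the NOR-gate empty test: OR together every bit line carrying
    the depth value, then invert. Asserted only when the depth equals zero."""
    any_bit = 0
    for i in range(width):            # one iteration per data line
        any_bit |= (depth >> i) & 1
    return not any_bit
```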
- it is envisioned that the FIFO queues 118 will be hardware queues, as discussed in connection with FIG. 7 . However, software-controlled queues could be substituted. A mix of hardware queues and software-controlled queues may also be used.
- “Writing,” “storing,” and “saving” are used interchangeably. “Enqueuing” includes writing/storing/saving to a queue.
- when data is described as being written or enqueued by a component (e.g., by a processing element, a task distributor, etc.), the operation may be directed by the component or the component may send the data to be written/enqueued (e.g., sending the data by packet, together with a write instruction).
- "writing" and "enqueuing" should be understood to encompass "causing" data to be written or enqueued.
- similarly, when data is described as being read or dequeued by a component, the operation may be directed by the component or the component may send a request (e.g., by packet, by asserting a signal line, etc.) that causes the data to be provided to the component.
- queue management (e.g., the updating of the depth, the front pointer, and the back pointer) may be handled by the queue itself.
- the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.
Abstract
Description
- Multi-processor computer architectures capable of parallel computing operations were originally developed for supercomputers. Today, with modern microprocessors containing multiple processor "cores," the principles of parallel computing have become relevant to both on-chip and distributed computing environments.
- For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.
-
FIG. 1 is a block diagram conceptually illustrating an example of a multiprocessor chip with a hierarchical on-chip network architecture that includes task-assignable hardware queues. -
FIGS. 2A to 2G illustrate examples of task distributors that assign tasks to hardware queues, and how the task distributors distribute task requests. -
FIG. 3 illustrates an example of a packet header used to communicate within the architecture. -
FIGS. 4A to 4D illustrate examples of packet payloads containing task descriptors and/or an address where a task descriptor is stored, as used within the architecture to delegate tasks. -
FIG. 5 illustrates task descriptors being enqueued and dequeued from the memory/register stack of a hardware task queue. -
FIG. 6 is an abstract representation of how slots within a queue stack are accessed and recycled in a first-in-first-out (FIFO) manner. -
FIG. 7 is an example circuit overview of a task-assignable hardware queue. -
FIG. 8 is a block diagram conceptually illustrating example components of a processing element of the chip in FIG. 1 . -
FIG. 9 illustrates a plurality of the multiprocessor chips connected together, with the task-assignable queues of several of the chips assigned to receive tasks. -
FIG. 10 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and the result is deposited into an output queue for the processor to retrieve. -
FIG. 11 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and the result is written back directly to the processor. -
FIG. 12 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and execution chains across queues, with the end-result being deposited into an output queue for the processor to retrieve. -
FIGS. 13A to 13F illustrate examples of the content of several of the data transactions in FIG. 12 . -
FIG. 14 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and a task-assigned processor deposits a sub-task into another input queue as a subroutine, with the end-result being deposited into an output queue for the processor to retrieve. -
FIG. 15 is a hybrid process-flow transaction-flow diagram illustrating execution of a scheduler program by a task-assigned processor, enabling the processor to autonomously subscribe and unsubscribe from task queues. - Semiconductor chips that include multiple computer processors have increased in complexity and scope to the point that on-chip communications may benefit from a routed packet network within the semiconductor chip. By using a same packet format on-chip as well as off-chip, a seamless fabric is created for high data throughput computation that does not require data to be re-packed and re-transmitted between devices.
- To facilitate such an architecture, a multi-core chip may include a top level (L1) packet router for moving data inside the chip and between chips. All data packets are preceded by a header containing routing data. Routing to internal parts of the chip may be done by fixed addressing rules. Routing to external ports may be done by comparing the packet header against a set of programmable tables and/or registers. The same hardware can route internal-to-internal packets (loopback), internal-to-external packets (outbound), external-to-internal packets (inbound) and external-to-external packets (pass through). The routing framework supports a wide variety of geometries of chip connections, and allows execution-time optimization of the fabric to adapt to changing data flows.
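The four routing cases named above can be sketched abstractly as follows. This toy model is an assumption-laden illustration, not the chip's actual pipeline: the device ID value and the two-argument interface are invented, and real routing additionally selects among external ports via the programmable tables and registers.

```python
MY_DEVICE_ID = 0x2A  # this chip's device ID (value is illustrative)

def route(header_device_id: int, source_internal: bool) -> str:
    """Sketch of the top-level routing rule: packets addressed to this chip's
    device ID are routed inward by fixed addressing rules; all others are
    relayed toward an external port. Covers the four cases the text lists:
    loopback, outbound, inbound, and pass-through."""
    if header_device_id == MY_DEVICE_ID:
        return "loopback" if source_internal else "inbound"
    return "outbound" if source_internal else "pass-through"
```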
- However, as the number of processing elements within a system increases, there are several engineering challenges that need to be addressed. Two of the challenges are minimizing the processing bottlenecks and latency delays caused by multiple processors accessing memory at a same time, and the assigning of processing threads to processing elements. Early solutions placed the burden of assigning threads to processors on the software compiler. However, as the number of processing cores in a system may vary, compiler solutions are somewhat less flexible at run-time. Runtime solutions typically use one or more processors as dispatchers, keeping track of which processing elements are busy and which are free, and sending tasks to free processors for execution. Using a runtime solution, the burden on the compiler is reduced, since the compiler need only designate which threads can be run in parallel and which threads must be run sequentially.
- While runtime solutions provide better utilization of processing elements, implementation can actually exacerbate the bottlenecks created by multiple processors overloading the memory bus with read requests. Specifically, each time a processing element is assigned to a new thread by a dispatcher, the processing element must fetch (or be sent) the executable code necessary to execute the thread specified by the dispatcher. The end result is a performance trade-off between maximizing the load balance between processors and the bus and memory bottlenecks that occur as a result.
-
FIG. 1 is a block diagram conceptually illustrating an example of a multiprocessor chip 100 with a hierarchical on-chip network architecture that includes task-assignable hardware queues 118 . The processor chip 100 may be composed of a large number of processing elements 134 (e.g., 256), connected together on the chip via a switched or routed fabric similar to what is typically seen in a computer network. - Multiple first-in-first-out (FIFO) input and
output hardware queues 118 are provided on the chip 100 , each of which is assignable to serve as an input queue or an output queue. When configured as an input queue, the queue 118 is associated with a single "task." A task comprises multiple executable instructions, such as the instructions for a routine, subroutine, or other complex operation. - Defined tasks are each assigned a task identifier or "tag." When a task identifier is invoked during execution of a program by a
processing element 134 , a task descriptor is sent to a task distributor 114 . The task descriptor includes the task identifier, any needed operands or data, and an address where the task result should be returned. The task distributor 114 identifies a nearby queue associated with one or more processing elements 134 configured to perform the task. The assigned queue may be on a same chip 100 as the processing element 134 running the software that invoked the task, or may be on another chip. Since the processing elements subscribed to input queues repeatedly perform the same tasks, they can locally store and execute the same code over-and-over, substantially reducing the communication bottlenecks created when a processing element must go and fetch code (or be sent code) for execution. - Each input queue is affiliated with at least one subscribed
processing element 134 . The processing elements 134 affiliated with the input queues may each be loaded with a small scheduler program that is invoked after the processing element is idle for (or longer than) a specified/preset/predetermined duration (which may vary in length in accordance with the complexity of the task of the queue to which the processing element is currently affiliated/subscribed). When the scheduler program is invoked, the processing element 134 may unsubscribe from the input queue it was servicing and subscribe to a different input queue. In this way, processing elements can self-load balance independent of any central dispatcher. - In other words, it is not up to the main software program or a central dispatcher to assign work to a particular core (or possibly even to a particular chip). Instead, the
chip 100 has some queues at a top level (in the network hierarchy), with each queue supporting one type of task at any time. To get a task done, a program deposits a descriptor of the task that needs to be done with a task distributor 114 , which deposits the descriptor into the appropriate queue 118 . The processing elements affiliated with the queue do the work, and typically produce output to some other queue (e.g., a queue 118 configured as an output queue). - Each
hardware queue 118 has at least one event flag attached, so a processor core can sleep while waiting for a task to be placed in the queue, powering down and/or de-clocking operations. After a task descriptor is enqueued, at least one of the cores affiliated with that queue is awakened by the change in state of the event flag, causing the processor core to retrieve (dequeue) the descriptor and to start processing the operands and/or data it contains, using the locally-stored executable task code. - As noted, the
hardware queues 118 may be configured as input queues or output queues. Dedicated input queues and dedicated output queues may also/instead be provided. When a task is finished, the last processing element to execute a portion of the assigned task or chain of tasks may deposit the results in an output queue. These output queues can generate event flags that produce externally visible (e.g., electrical) signals, so a host processor or other hardware (e.g., logic in an FPGA) can retrieve the finished result. - In the example in
FIG. 1 , the processing elements 134 are arranged in a hierarchical architecture, although other arrangements may be used. In the hierarchy, each chip 100 includes four superclusters 122 a-122 d , each supercluster 122 comprises eight clusters 128 a-128 h , and each cluster 128 comprises eight processing elements 134 a-134 h . If each processing element 134 includes two-hundred-fifty-six externally exposed registers, then within the chip 100 , each of the registers may be individually addressed with a sixteen bit address: two bits to identify the supercluster, three bits to identify the cluster, three bits to identify the processing element, and eight bits to identify the register. - Memory within a system including the
processor chip 100 may also be hierarchical, and memory of different tiers may be physically different types of memory. Each processing element 134 may have a local program memory containing instructions that will be fetched by the core's micro-sequencer and loaded into the instruction registers for execution in accordance with a program counter. Processing elements 134 within a cluster 128 may also share a cluster memory 136 , such as a shared memory serving a cluster 128 including eight processor cores 134 a-134 h . While a processor core may experience no latency (or a latency of one-or-two cycles of the clock controlling timing of the instruction pipeline) when accessing its own operand registers, accessing global addresses external to a processing element 134 may experience a larger latency due to (among other things) the physical distance between the addressed component and the processing element 134 . As a result of this additional latency, the time needed for a processor core to access an external main memory, a shared cluster memory 136 , and the registers of other processing elements may be greater than the time needed for a core to access its own execution registers. - Each tier in the architecture hierarchy may include a router. The top-level (L1)
router 102 may have its own clock domain and be connected by a plurality of asynchronous data busses to multiple clusters of processor cores on the chip. The L1 router may also be connected to one or more external-facing ports that connect the chip to other chips, devices, components, and networks. The chip-level router (L1) 102 routes packets destined for other chips or destinations through the external ports 103 over one or more high-speed serial busses 104 a, 104 b. Each serial bus 104 comprises at least one media access control (MAC) port layer hardware transceiver - The
L1 router 102 routes packets to and from a primary general-purpose memory for the chip through a supervisor port 107 to a memory supervisor 108 that manages the general-purpose memory. Packets to-and-from lower-tier components are routed through internal ports 121 . - Each of the superclusters 122 a-122 d may be interconnected via an inter-supercluster router (L2) 120 which routes transactions between superclusters and between a supercluster 122 and the chip-level router (L1) 102 . Each supercluster 122 may include an inter-cluster router (L3) 126 which routes transactions between each cluster 128 in the supercluster 122 , and between a cluster 128 and the inter-supercluster router (L2) 120 . Each cluster 128 may include an intra-cluster router (L4) 132 which routes transactions between each
processing element 134 in the cluster 128 , and between a processing element 134 and the inter-cluster router (L3) 126 . The level 4 (L4) intra-cluster router 132 may also direct packets between processing elements 134 of the cluster and a cluster memory 136 . Tiers may also include cross-connects (not illustrated) to route packets between elements in a same tier in the hierarchy. - When data packets arrive in one of the routers, the router examines the header at the front of each packet to determine the destination of the packet's data payload. Each
chip 100 is assigned a unique device identifier ("device ID"). Packet headers received via the external ports 103 each identify a destination chip by including the device ID in an address contained in the packet header. Packets that are received by the L1 router 102 that have a device ID matching that of the chip containing the L1 router are routed within the chip using a fixed pipeline to the supervisor 108 or through one of the internal ports 121 linked to a cluster of processor cores within the chip. When packets are received with a non-matching device ID by the L1 router 102 , the L1 router 102 uses programmable routing to select an external port and relay the packet back off the chip. - When a program invokes a task, the invoking
processing element 134 sends a packet comprising a task descriptor to the local task distributor 114 . The L1 router 102 and/or the L2 router 120 include a task port 113 a /113 b and a queue port 115 a /115 b . The routers route the packet containing the task descriptor via the task port 113 to the task distributor 114 , which examines the task identifier included in the descriptor and determines which queue 118 to assign the task to. The assigned queue 118 may be on the chip 100 , or may be a queue on another chip. If the task is to be deposited in a queue on the chip, the task distributor 114 transfers the descriptor to the queue router 116 , which deposits the descriptor in the assigned queue. Otherwise, the task descriptor is routed to the other chip which contains the assigned queue 118 . - The
queue port 115 a is used by the L1 router 102 to route descriptors that have been assigned by the task distributor 114 on another chip to the designated input queue 118 via the queue router 116 . When queues 118 are configured as output queues, the processing elements 134 may retrieve task results from the output queue via the queue port 115 a /115 b using read/get requests, routed via the L1 and/or L2 routers. - Cross-connects (not illustrated) may be provided to signal when there is data enqueued in the I/O queues, which the
processing elements 134 can monitor. For example, an eight-bit bus may be provided, where each bit of the bus corresponds to one of the I/O queues 118 a-118 f . When a queue is configured as an output queue, a processing element 134 may monitor the bit line corresponding to the queue while awaiting task results, and retrieve the results after the bit line is asserted. When a queue is configured as an input queue, subscribed processing elements 134 may monitor the bit line corresponding to the queue for the availability of tasks awaiting processing. -
FIGS. 2A and 2E are examples of task distributors 114 /114 ′ that assign tasks to a hardware queue. In each of the examples, the task distributor 114 /114 ′ receives a task request 240 via a task port 113 , selects a task input queue 118 associated with the task based on a task identifier 232 included in the task request 240 , obtains an address or other queue identifier of the task input queue 118 , and enqueues the task request 240 in the queue 118 using the address or other identifier. - Selecting the task input queue and obtaining its address may be performed as plural steps or may be combined as a single step. Depending upon how the
task distributor 114/114′ is implemented, a task input queue may be selected and then an address/identifier may be obtained for the selected task input queue, or the addresses/identifiers of one or more task input queues may be obtained and then the task input queue may be selected. Process combinations may also be used to select a queue and obtain a queue address/identifier, such as selecting several candidate input task queues, obtaining their addresses/identifiers, and then selecting an input task queue based on its address/identifier. - In the example in
FIG. 2A , the task distributor 114 receives a task request 240 and the controller 214 of the task distributor 114 uses a content-addressable memory (CAM) 252 to select the task input queue 118 and obtain the address/identifier 210 of the input queue 118 based on the extracted task identifier 232 . An advantage of using a CAM 252 over using hash tables or table look-up techniques is that a CAM can return a result typically within one or two clock cycles, which will typically be faster than hashing or searching a table. A disadvantage of CAM is that each CAM 252 takes up more physical space on the chip 100 , with the amount of space needed increasing as the number of queues 118 increases. However, CAM is practical if there is a limited number of task queues (e.g., 8 input queues). Thus, there is a speed versus space trade-off between CAM and other address resolution approaches. -
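In software terms, the CAM behaves like a single-step associative lookup from task identifier to queue address. A toy Python stand-in (the task IDs and queue addresses below are invented for illustration):

```python
# Software stand-in for a CAM-based queue selection: an associative lookup
# from task identifier (the search tag) to the address of an input queue
# configured for that task type. IDs and addresses are illustrative only.
CAM_TABLE = {
    0x01: 0x8100,  # task 1 -> address of a task 1 input queue
    0x04: 0x8400,  # task 4 -> address of a task 4 input queue
}

def select_input_queue(task_id: int) -> int:
    """Return the input-queue address for a task identifier, as a single
    CAM lookup would."""
    if task_id not in CAM_TABLE:
        raise LookupError(f"no input queue configured for task {task_id:#x}")
    return CAM_TABLE[task_id]
```

A hardware CAM compares the tag against all entries in parallel, which is why it answers in one or two cycles; the dictionary here only models the input/output behavior, not the timing.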
FIG. 2B illustrates an example structure of the task request packet 240 , and FIG. 2C illustrates an example structure of the queue assignment packet 242 . The structures of these packets will be discussed in more detail in connection with FIG. 3 and FIGS. 4A-4D below, but are introduced here to explain the operation of the task distributors. Referring to FIG. 2B , the task request packet 240 includes a header 202 a and a payload comprising a task descriptor 230 a . The header 202 a includes the address of the task distributor 114 /114 ′. The task descriptor 230 a comprises a task identifier 232 and various task parameters and data 233 . Referring to FIG. 2C , the queue assignment packet 242 includes a header 202 b and a payload comprising a task descriptor 230 b . The task descriptor 230 b comprises the task parameters and data 233 . -
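The distributor's repackaging of a task request 240 into a queue assignment 242 can be sketched on these structures. The byte widths below (a 2-byte header and a 1-byte task identifier) are assumptions for illustration; the patent does not fix field sizes here.

```python
def to_queue_assignment(task_request: bytes, queue_address: int) -> bytes:
    """Rebuild a task request 240 as a queue assignment 242: drop the old
    header and the leading task identifier, then prepend a new header that
    carries the assigned queue's address. Field widths are assumed."""
    payload = task_request[2:]                    # strip the 2-byte header 202a
    params_and_data = payload[1:]                 # strip the 1-byte task identifier
    new_header = queue_address.to_bytes(2, "big") # header 202b: assigned queue address
    return new_header + params_and_data
```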
FIG. 2D illustrates the components of the controller 214 , and the principles of operation of the task distributor 114 . As illustrated in FIG. 2B , the task request packet 240 has a particular format, such that a parser 270 can read/extract the specific range of bits from the packet that correspond to the task identifier 232 , with the bits that follow the task identifier 232 being the task parameters and data 233 . Relative to the packet payload containing the task descriptor 230 a , the task identifier 232 begins at a pre-defined offset (e.g., an offset of zero as illustrated in FIG. 2B ). The parser 270 outputs the bits corresponding to the task identifier 232 to the CAM 252 . The bits corresponding to the task parameters and data 233 are directed to an assembler 274 . The CAM 252 contains an associative array table that links search tags (i.e., the task identifiers) to input queue addresses/identifiers. The CAM 252 receives the task identifier 232 and outputs a queue address/identifier 210 of a selected input queue that is configured to receive the specified task. - The
parser 270 may optionally include additional functionality. For example, it is possible to compress the task descriptor 230 a (e.g., using Huffman encoding). In such a case, the parser 270 may be responsible for de-compressing any data that precedes the task identifier 232 to find the offset at which the task identifier 232 starts, then transmitting the task identifier 232 to the CAM 252 . In such a design, the CAM 252 might use either the compressed or un-compressed form of the task identifier 232 as its key. In the latter case, the parser 270 would also be responsible for de-compressing the task identifier 232 prior to transmitting it to the CAM 252 . - The
assembler 274 is roughly a mirror image of the parser 270 . Where the parser 270 extracts a task identifier 232 that indirectly refers to a task queue, the assembler 274 re-assembles an output packet (queue assignment 242 ) that describes the task with a header 202 b that includes a physical or virtual address of a selected queue based on the address/identifier 210 , where the header address is for the selected input queue that can carry out the type of task denoted by the task identifier 232 . The payload of the output packet comprises the parameters and data 233 . The assembler 274 receives the address/identifier 210 of the selected input queue from the CAM 252 and the task parameters and data 233 from the parser 270 . Various approaches may be used by the assembler 274 to assemble the output packet 242 . For example, the parser 270 may send the task descriptor 230 a to the assembler, and the assembler 274 may overwrite the bits corresponding to the task identifier 232 with the header address, or the assembler 274 may concatenate the header address with the task parameters and data 233 . - The
assembler 274 may also include additional functionality. For example, if a compressed format is being used, the assembler 274 may re-compress some or all of the task parameters and data 233 contained in the routable task descriptor 230 b . The assembler 274 could also rearrange the data, or carry out other transformations such as converting a compressed format to an uncompressed format or vice versa. - In
FIG. 2E , the task distributor 114 ′ receives the task request 240 via the task port 113 . A hash table 220 or sorted table 221 may be stored in a memory or a plurality of registers 250 associated with the task distributor 114 ′. In the tables 220 /221, types of tasks are identified by a system-wide address of a kernel used to process that type of task. A controller 216 extracts the task identifier 232 from the descriptor 230 a of the task request 240 , and applies a hash function or search function to select a task input queue 118 , and to obtain the address 210 or other queue identifier of the task input queue 118 . A hash function may be used to select the queue and obtain the queue's address/identifier 210 with or without a hash table 220 . A search function may be used to select the queue and obtain the queue's address/identifier 210 based on data in a sorted table 221 . - In the case of a large system, the hash table 220 may be a distributed hash table, so one type of task has queues distributed throughout the system. A
task request 240 causes the controller 216 to apply a distributed hash function to produce a hash that would find a "nearby" queue for that task, where the nearby queue should be reachable from the task distributor with a latency that is less than (or tied for smallest) that needed to reach other queues associated with the same task. Expected latency may be determined, among other ways, based on the minimum number of "hops" (e.g., intervening routers) to reach each queue from the task distributor 114 ′. The controller 216 outputs a packet containing the queue assignment 242 , replacing the destination address in the header with the address of the assigned queue, as discussed in connection with FIGS. 2A-2D . The packet is then routed to the assigned queue where it is enqueued, either via the queue router 116 or the L1 router 102 (if the queue is on another chip). - "Hop" information may be determined, among other ways, from a routing table. The routing table may, for example, be used to allocate addresses that indicate locality to at least some degree. Distributed hashing frequently uses a very large (and very sparse) address space "overlaid" on top of a more typical addressing scheme like Internet Protocol (IP). For example, a hash might produce a 160-bit "address" that's later transformed to a 32-bit IPv4 address. With a logical address space like this, the allocation of addresses may be tailored to the system topology, such that the address itself provides an indication of that node's locality (e.g., assuming a backbone plus branches, the address could directly indicate a node's position on the backbone and distance from the backbone on its branch).
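Choosing a "nearby" queue by hop count reduces to a simple minimization. The candidate-list format below is invented for illustration; a real implementation would derive hop counts from the routing table rather than receive them as input.

```python
def pick_nearby_queue(candidates):
    """Choose the queue reachable with the fewest router 'hops' from the
    task distributor. `candidates` is a list of (queue_address, hop_count)
    pairs; a tie goes to the first candidate listed."""
    best_address, _ = min(candidates, key=lambda pair: pair[1])
    return best_address
```

Python's `min` is stable for ties, which models "less than (or tied for smallest)" latency with a deterministic tie-break.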
- Hop information can be used with the
CAM 252 as well. However, given the expense of storage in a CAM and the advantage of keeping that data to a minimum, each CAM 252 will ordinarily store just one "best" result for a given tag lookup. -
FIG. 2F illustrates the components of the controller 216, and the principles of operation of the task distributor 114′. The parser 270 and the assembler 274 are the same as those discussed in connection with FIG. 2D. However, in controller 216, the parser 270 outputs the task identifier 232 to an address resolver 272. The address resolver 272 applies a hash or search function to select the queue and obtain the queue's address/identifier 210, outputting the address/identifier 210 to the assembler 274. -
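The parser-resolver-assembler flow just described can be modeled behaviorally. This is a minimal sketch under invented data structures (dict-based packets and a dict standing in for the resolver's lookup state); none of these names are from the patent.

```python
# Behavioral sketch of the controller flow: parser -> resolver -> assembler.
def parse_task_id(descriptor):        # parser 270 (simplified)
    return descriptor["task_id"]

def resolve(task_id, table):          # address resolver 272 (table path)
    return table[task_id]             # a real resolver may hash or search

def assemble(packet, queue_addr):     # assembler 274: rewrite destination
    packet["dest"] = queue_addr
    return packet

table = {7: "0xD3"}                   # invented task-id -> queue address
packet = {"dest": "distributor", "payload": {"task_id": 7}}
out = assemble(packet, resolve(parse_task_id(packet["payload"]), table))
assert out["dest"] == "0xD3"          # packet now routes to the queue
```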
FIG. 2G illustrates examples of different process flows that may be used by the controller 214/216 for address resolution (290 a-290 e). Resolution process 290 a corresponds to that used by the task distributor 114 in FIGS. 2A and 2D, with a task identifier (tag 232) input into the CAM 252, producing the queue address/identifier 210. -
Resolution process 290 b may be used by an address resolver 272 a (an example of address resolver 272 in FIG. 2F) without a table 220/221. The address resolver 272 a inputs the tag 232 into a hash function 280 as the function's "key," where the hash function 280 hashes the key to produce the queue address/identifier 210. In comparison to resolution process 290 b, resolution process 290 c adds an address lookup to resolve the hash into an address or other identifier. An address resolver 272 b (an example of address resolver 272 in FIG. 2F) uses a hash table 220 to look up the address/identifier 210. The tag 232 is input into the hash function 281 as the function's "key," where the hash function 281 hashes the key to produce one or more index values 208. The address resolver 272 b resolves the index value 208 into the address/identifier 210 using the hash table 220. If more than one tag 232 hashes to the same table location, the result is a hash "collision." Such collisions can be resolved in any of several ways, such as linear probing, collision chaining, secondary hashing, etc. - Since the number of nodes/
chips 100 in a system may vary dynamically, when a node is added or removed, a distributed hash function (e.g., 280, 281) may be recomputed and redistributed to all the task distributors 114′. Other options include leaving the function 280/281 itself unchanged but modifying data that it uses internally (not illustrated, but which may also be stored in registers/memory 250), or leaving the function 280/281 alone but modifying the address lookup data (e.g., hash table 220). Choosing between modifying the hash function's data and modifying the lookup data is often a fairly open choice, and depends in part on how the hash function is structured and implemented (e.g., implemented in hardware, implemented as processor-executed software, etc.). - To optimize results for locality within the system, it is advantageous to produce a final address result that is based on location (relative to the topology of interconnected devices 100). The hash functions 280/281 used by the
task distributors 114′ may be the same throughout the system, or may be localized, depending upon whether localization data is updated by updating the hash function 280/281, its internal data, or its lookup table 220. For example, the distributed hash tables 220, sorted tables 221, and/or data used by the functions stored in one or more registers may be updated each time a device/node 100 is added to or removed from the system. - As an alternative to a hash function, a lookup table may be used to store a
tag 232, and with it an address/queue identifier 210. With the table sorted by tag 232, an interpolating search 282 may be used to search a small table, or a binary search 283 may be used to search a large table. Resolution process 290 d may be used by an address resolver 272 c (an example of address resolver 272 in FIG. 2F) with a sorted table 221. The address resolver 272 c performs an interpolating search 282 on the sorted table 221, using an index 208 based on the tag 232. The search 282 produces the address/identifier 210. Resolution process 290 e may be used by an address resolver 272 d (an example of address resolver 272 in FIG. 2F) with the sorted table 221. The address resolver 272 d performs a binary search 283 on the sorted table 221, using the index 208 based on the tag 232. The search 283 produces the address/identifier 210. Other search methods may be used. Also, while the table 221 is sorted for efficiency, a non-sorted table may instead be used, depending upon the search method employed. - If the
hash function 280/281 or search function 282/283 is implemented in hardware, the logic providing the function 280-283 may be fixed, with updates being made to table values (e.g., 220/221) and/or to other registers storing values used by the function, separate from the logic. If the function is implemented as processor-executed software, the software (as stored in memory) may be updated, table values (e.g., 220/221) may be updated, and/or registers storing values used by the function may be updated. Also, the type of function and nature of the tables may be changed as the system scales, selecting a function 280-283 optimized for the scale of the topology. - Choosing between address resolution techniques depends on factors that are not relevant to the
task queues 118 themselves, and are fairly well known in the art. Hash tables 220 typically have O(1) expected complexity, but O(N) worst case (deletion is often more expensive, and sometimes completely unsupported). Sorted tables 221 with binary search 283 offer O(log N) lookup, and O(N) insertion or deletion. Sorted tables 221 with interpolating search 282 improve search complexity to O(log log N), but insertion or deletion is still typically O(N). A self-balanced binary search tree may be used for O(log N) insertion, deletion, or lookup. In a small system, all of the table-based address resolution approaches should be adequate, as the tables involved are relatively small. - Each time the data and/or functions used by the
controllers 214/216 are updated, one or more processing elements 134 on the chip 100 may load and launch a queue update program. In conjunction with the task distributor 114/114′, the queue update program may determine the input queue address/identifier 210 for each possible task ID 232, and determine whether any of those addresses/identifiers are for I/O queues 118 on the device 100 containing the task distributor 114/114′. The queue update program then configures each queue for the assigned task (if not already configured), and configures at least one processing element 134 to subscribe to each input queue. -
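The queue update flow above, paired with a sorted-table resolver (as in resolution process 290 e, which binary-searches a table sorted by tag), might be sketched as follows. Table contents, device labels, and helper names are invented for the example.

```python
# Sketch: resolve every known task ID via binary search on a sorted
# (tag, queue-address) table, then keep only queues on the local device.
import bisect

sorted_table = [(0x10, "dev0.q1"), (0x20, "dev1.q3"), (0x30, "dev0.q5")]
tags = [row[0] for row in sorted_table]      # kept sorted by tag

def resolve(tag):
    i = bisect.bisect_left(tags, tag)        # binary search (cf. 283)
    assert i < len(tags) and tags[i] == tag, "unknown task tag"
    return sorted_table[i][1]

def local_queues(task_ids, device):
    """Map each task ID whose assigned queue lives on `device`."""
    return {tid: resolve(tid) for tid in task_ids
            if resolve(tid).startswith(device)}

# dev0 would then configure these queues and subscribe processing elements.
assert local_queues([0x10, 0x20, 0x30], "dev0") == \
    {0x10: "dev0.q1", 0x30: "dev0.q5"}
```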
FIG. 3 illustrates an example of a packet header 302 used to communicate within the architecture. A processing element 134 may access its own registers directly without a global address or use of packets. For example, if each processor core has 256 operand registers, the core may access each register via the register's 8-bit unique identifier. Likewise, a processing element can directly access its own program memory. In comparison, a global address may be (for example) 64 bits. Shared memory and the externally accessible locations in the memory and registers of other processing elements may be addressed using a global address of the location, which may include that address's local identifier and the identifier of the tier (e.g., device ID 312, cluster ID 314). - As illustrated in
FIG. 3, a packet header 302 may include a global address. A payload size 304 may indicate a size of the payload associated with the header. If no payload is included, the payload size 304 may be zero. A packet opcode 306 may indicate the type of transaction conveyed by the header 302, such as indicating a write instruction, a read instruction, or a task assignment. A memory tier "M" 308 may indicate what tier of device memory is being addressed, such as main memory (connected to memory supervisor 108), cluster memory 136, or a program memory or registers within a processing element 134. - The structure of the
physical address 310 in the packet header 302 may vary based on the tier of memory being addressed. For example, at a top tier (e.g., M=1), a device-level address 310 a may include a unique device identifier 312 identifying the processor chip 100 and an address 320 a corresponding to a location in main memory. At a next tier (e.g., M=2), a cluster-level address 310 b may include the device identifier 312, a cluster identifier 314 (identifying both the supercluster 122 and cluster 128), and an address 320 b corresponding to a location in cluster memory 136. At the processing element level (e.g., M=3), a processing-element-level address 310 c may include the device identifier 312, the cluster identifier 314, a processing element identifier 316, an event flag mask 318, and an address 320 c of the specific location in the processing element's operand registers, program memory, etc. Global addressing may accommodate both physical and virtual addresses. - The
event flag mask 318 may be used by a packet to set an "event" flag upon arrival at its destination. Special purpose registers within the execution registers of each processing element may include one or more event flag registers, which may be used to indicate when specific data transactions have occurred. So, for example, a packet header designating an operand register of a processing element 134 may indicate to set an event flag upon arrival at the destination processing element. A single event flag bit may be associated with all the registers, or with a group of registers. Each processing element 134 may have multiple event flag bits that may be altered in such a manner. Which flag is triggered may be configured by software, with the flag to be triggered designated within the arriving packet. A packet may also write to an operand register without setting an event flag, if the packet's event flag mask 318 does not indicate to change an event flag bit. -
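The tiered global address can be pictured as packed bit fields. The patent does not fix field widths, so the widths below (summing to a 64-bit address) are purely illustrative assumptions.

```python
# Illustrative packing of a processing-element-level global address
# (cf. FIG. 3, M=3): [device ID][cluster ID][PE ID][event flags][address].
# Field widths are invented; they are NOT specified by the patent.
DEV_BITS, CLU_BITS, PE_BITS, FLAG_BITS, ADDR_BITS = 8, 10, 5, 9, 32

def pack(dev, clu, pe, flags, addr):
    word = dev
    for value, bits in ((clu, CLU_BITS), (pe, PE_BITS),
                        (flags, FLAG_BITS), (addr, ADDR_BITS)):
        word = (word << bits) | value       # append each field
    return word

def unpack(word):
    fields = []
    for bits in (ADDR_BITS, FLAG_BITS, PE_BITS, CLU_BITS):
        fields.append(word & ((1 << bits) - 1))
        word >>= bits
    fields.append(word)                     # remaining bits: device ID
    return tuple(reversed(fields))          # (dev, clu, pe, flags, addr)

assert unpack(pack(3, 20, 7, 0b1, 0x1000)) == (3, 20, 7, 0b1, 0x1000)
```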
FIGS. 4A to 4D illustrate examples of packet payloads containing task descriptors, used within the architecture to delegate tasks. In FIG. 4A, a packet payload contains a task descriptor 430 a. The task descriptor 430 a includes the task identifier 432, a normal return indicator 434 indicating where to deposit (i.e., write/save/store/enqueue) a normal response, an address 436 where to report an error, and any task operands and data 438 (or an address of where operands and data are stored). The task descriptor 430 a may also include a bit 433 that indicates whether the task descriptor 430 a includes additional task identifiers 432. The additional task bit 433 may be appended onto the task identifier 432, or indicated elsewhere in the task descriptor. - The
normal return indicator 434 and error reporting address 436 may indicate a memory or register address, the address of an output queue, or the address of any reachable component within the system. "Returning" results data to a location specified by the normal return indicator 434 includes causing the results data to be written, saved, stored, and/or enqueued to the location. -
FIG. 4B illustrates an example of a packet payload 422 b including a task descriptor 430 b that contains multiple task assignments. The descriptor includes a first task identifier 432 a, a second task identifier 432 b, a third task identifier 432 c, the normal return indicator 434, the error reporting address 436, and the task operands and data 438. - An additional task bit 433 a is appended onto the
first task identifier 432 a, and indicates that there are additional tasks after the first task. An additional task bit 433 b is appended onto the second task identifier 432 b, and indicates that there are additional tasks after the second task. An additional task bit 433 c is appended onto the third task identifier 432 c, and indicates that there are no further tasks after the third task. The use of task chaining using the task descriptor format 430 b will be discussed further below in connection with FIGS. 12 and 13A to 13F. -
FIG. 4C illustrates a packet payload 422 c that comprises an address 440 in memory from which the task descriptor 430 may be fetched. The stored task descriptor may be, for example, the task descriptor 430 a or 430 b, with the packet carrying only the memory address 440 of the task descriptor in its payload 422 c. By sending only the memory address of the task descriptor 440, the sizes of task requests 240 and queue assignments 242 are reduced, allowing the capacity of each slot in the queues 118 to be smaller. For example, using the payload 422 c, the size of each slot in the queues 118 may be a single word. A "word" is a fixed-sized piece of data, such as a quantity of data handled as a unit by the instruction set and/or the processor core of a processing element 134, and can vary from core to core. For example, a "word" might be 64 bits in one architecture, whereas a "word" might be 128 bits in another architecture. A trade-off is that the task distributor 114 and a processing element 134 subscribed to an input queue must access memory to retrieve some or all of the descriptor. For example, the task distributor 114 may read the first word of the stored descriptor to determine the task identifier 432, whereas a subscribed processing element 134 may retrieve the entire stored descriptor. In an arrangement where a chained-task descriptor 430 b is stored, each processing element 134 that works with the descriptor 430 b (as stored in memory at address 440) may adjust an offset of the address 440 or otherwise crop the task descriptor 430 b so that the identifiers of tasks that have already been completed are not retrieved again in subsequent operations. -
FIG. 4D illustrates a packet payload 422 d that comprises a task identifier 432 a and an address 450 in memory from which a remainder of the task descriptor 430 may be fetched. While the packet payload 422 d doubles the size of the payload relative to payload 422 c, including the next task identifier within the packet itself simplifies the processing to be performed by the task distributor 114, since the task distributor can issue the queue assignment 242 without having to access memory to determine the next task identifier 432 a. After a task-executing processing element 134 dequeues the packet and accesses the remainder of the task descriptor 450 in memory, the task-executing processing element 134 can extract any subsequent task identifier (e.g., 432 b) and expose the subsequent task identifier in the same manner as illustrated in FIG. 4D when sending the subsequent task to another task distributor 114. -
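The "expose the next identifier, leave the rest in memory" pattern of payload 422 d can be sketched behaviorally. A plain dict stands in for cluster/main memory, and all field names and addresses are invented for the example.

```python
# Sketch of payload 422 d handling: the packet carries the next task id
# plus the memory address of the full descriptor; executing elements
# re-expose the following id when forwarding the chain.
memory = {0x9000: {"task_ids": [7, 8], "return": "outq.5", "data": [1, 2]}}

def make_payload(addr):
    """Expose the first task id so a distributor need not touch memory."""
    return {"task_id": memory[addr]["task_ids"][0], "descr_addr": addr}

def after_execution(payload):
    """Executing element fetches the remainder and exposes the next id."""
    descr = memory[payload["descr_addr"]]
    remaining = descr["task_ids"][1:]
    if remaining:
        descr["task_ids"] = remaining   # crop the completed task
        return make_payload(payload["descr_addr"])
    return None                         # chain finished

p = make_payload(0x9000)
assert p["task_id"] == 7        # distributor reads this without a fetch
p2 = after_execution(p)
assert p2["task_id"] == 8       # next task exposed for the next distributor
assert after_execution(p2) is None
```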
FIG. 5 illustrates task descriptors being enqueued into and dequeued from the memory/register stack of a hardware task queue 118. Each queue 118 comprises a stack of storage slots 572 a to 572 h, where each "slot" comprises a plurality of registers or memory locations. The size of each slot may correspond, for example, to a maximum allowed size for a descriptor 430 (e.g., the maximum number of words). When an input queue receives a new descriptor 530 a, it is enqueued to the back of the queue in accordance with a back pointer 533. When a subscribed processing element dequeues a descriptor 530 b, the descriptor 530 b is dequeued from the front of the queue in accordance with a front pointer 532. When the queue is empty, the front pointer 532 and the back pointer 533 may be equal. -
FIG. 6 is an abstract representation of how slots within a queue stack are accessed and recycled in a first-in-first-out (FIFO) manner. Enqueued descriptors remain in their assigned slot 572, with the back pointer 533 and front pointer 532 changing as descriptors 430/530 are enqueued and dequeued. -
FIG. 7 is an example circuit overview of a task-assignable hardware queue 118. The queue 118 includes several general registers 760 that are used for both input queue and output queue operations. Also included are input queue-specific registers 767 that are used specifically for input queue operations. - The general purpose registers 760 include a
front pointer register 762 containing the front pointer 532, a back pointer register 763 containing the back pointer 533, a depth register 764 containing the current depth of the queue, and several event flag registers. Among the event flag registers is an empty flag 765, indicating that the queue is empty. When the empty flag 765 is de-asserted, indicating that there is at least one descriptor enqueued in the queue 118, a data-enqueued interrupt signal may be sent to subscribed processors (input queue) or a processor awaiting results (output queue), signaling them to wake and dequeue a descriptor or result. The data-enqueued interrupt signal can be generated by an inverter (not illustrated) that has its input tied to the output of the AND gate 755 or to the empty flag 765. Another event flag is the full flag 766. When the full flag 766 is asserted, the data transaction interface 720 can output a back-pressure signal to the queue router 116. Assertion of a back-pressure signal may result in error reporting (in accordance with the error reporting address 436) if a task arrives for a full queue. The queue router 116 may also include an arbiter to reassign the descriptor received for the full queue to another input queue attached to the queue router 116 that is configured to perform the same task (if such a queue exists). - If configured as an output queue, the event flags may be masked so that when results data is enqueued, an interrupt is generated indicating to a waiting (or sleeping)
processing element 134 that a result has arrived. Likewise, processing elements subscribed to an input queue can set a mask so that a data-enqueued signal from the subscribed queue causes an interrupt, but data-enqueued signals from other queues are ignored. Instead of an "empty" flag register 765, a "data available" flag register may be used, replacing the AND gate 755 with a NAND gate. In that case, the data-enqueued interrupt signal can be generated in accordance with the output of the NAND gate, or the state of the data available flag register. - The input queue registers 767 are used by processing elements to subscribe and unsubscribe to the queue. A
register 768 indicates how many processing elements 134 are subscribed to the queue. Each queue always has at least one subscribed processing element, so if an idle processing element goes to unsubscribe but it is the only subscribed processing element, the processing element remains subscribed. When new processing elements subscribe to the queue, the number in the register 768 is incremented. Also, when a new processing element subscribes to a queue, it determines the start address where the executable instructions for the task are in memory (e.g., 780) from a program memory address register 769. The newly subscribed processing element then loads the task program into its own program memory. - When a descriptor 430 or the
address 440/450 of a descriptor is received by the queue 118 for enqueuing, a data transaction interface 720 asserts a put signal 731, causing a write circuit 733 to save/store the descriptor 430 or address 440/450 into the stack 570 at a write address 734 determined based on the back pointer 533. For example, the back pointer 533 may specify the most significant bits corresponding to the slot 572 where the descriptor 430 is to be stored. The write circuit 733 may write (i.e., save/store) an entirety of a descriptor 430 as a parallel write operation, or may write the descriptor in a series of operations (e.g., one word at a time), toggling a write strobe 735 and incrementing the least significant bits of the write address 734 until an entirety of the descriptor 430 is stored. - After the descriptor 430 or
descriptor address 440/450 is stored, the data transaction interface 720 de-asserts the put signal 731, causing a counter 737 to increment the back pointer on the falling edge of the put signal 731 and causing a counter 757 to increase the depth value. The counter 737 counts up in a loop, with the maximum count equaling the number of slots 572 in the stack 570. When the count exceeds the maximum count, a carry signal may be used to reset the counter 737, such that the counter 737 operates in a continual loop. - When a descriptor 430 is to be dequeued by a subscribing
processing element 134, thedata transaction interface 720 asserts aget signal 741, causing aread circuit 743 to read the descriptor 430 or descriptor address 530 from the stack at aread address 744 determined based on thefront pointer 532. For example, thefront point 532 may specify the most significant bits corresponding to the slot 572 where thedescriptor 430 b is to be stored. Theread circuit 743 may read an entirety of the descriptor 430 as a parallel read operation, or may read the descriptor 430 as a series of reads (e.g., one word at a time). - After the descriptor 430 or
descriptor address 440/450 is dequeued, the data transaction interface 720 de-asserts the get signal 741, causing a counter 747 to increment the front pointer 532 on the falling edge of the get signal 741 and causing the counter 757 to decrease the depth value. The counter 747 counts up in a loop, with the maximum count equaling the number of slots 572. When the count exceeds the maximum count, a carry signal may be used to reset the counter 747, such that the counter 747 operates in a continual loop. - The
empty flag 765 may be set by a circuit composed of a comparator 753, an inverter 754, and an AND gate 755. The comparator 753 determines when the front pointer 532 equals the back pointer 533. The inverter 754 receives the queue-full signal as input. The AND gate 755 receives the outputs of the comparator 753 and the inverter 754. When the front and back pointers are equal and the full signal is not asserted, the output of the AND gate 755 is asserted, indicating that the queue is empty. Depending upon how the counters are configured, the front and back pointers may also be equal when the queue is full; the inverter 754 and AND gate 755 provide for that eventuality, so that when the front and back pointers are equal and the full signal is also asserted, the output of the AND gate 755 is de-asserted, indicating that the queue is not empty. As an alternative way to determine when the queue is empty, a comparator may compare the depth 764 to zero. The full flag 766 may be set by the carry output of the counter 757, or a comparator may compare the depth 764 to the depth value corresponding to full. - Although the
queue 118 uses a write-then-increment arrangement for the back pointer and a read-then-increment arrangement for the front pointer, the queue may instead use increment-then-write and increment-then-read arrangements. In that case, the counter 737 increments on the leading edge of the put signal 731, and the counter 747 increments on the leading edge of the get signal 741. - Also, instead of having both the
counters 737 and 747 increment on corresponding signal edges, the front pointer 532 may be incremented on the falling edge of the get signal 741, such that the front pointer 532 points to the slot that is currently at the front of the queue, whereas the back pointer 533 may be incremented on the leading edge of the put signal 731, such that the back pointer 533 points to one slot behind the slot that will be used for the next write. In such an arrangement, when the stack 570 is empty, the front pointer and back pointer will not be equal. As a consequence, a comparison of the front and back pointers by the comparator 753 will not indicate whether the stack 570 is empty. In that case, whether the stack 570 is or is not empty may be determined from the depth 764 (e.g., by comparing the depth value to zero). - Whether the
counter 757 increments and decrements the depth on the falling or leading edges may be independent of the arrangement used by the counters 737 and 747. If the counter 757 increments and decrements on the leading put/get signal edges, subscribed or monitoring processing elements 134 may begin to dequeue a descriptor or descriptor address while it is being enqueued, since the data-enqueued interrupt signal may be generated before enqueuing is complete, thereby accelerating the enqueuing and dequeuing process. To accommodate simultaneous enqueuing and dequeuing from a same slot of the stack 570, the memory/registers used for the stack 570 may be dual-ported. Dual-ported memory cells/registers can be read via one port and written to via another port at the same time. In comparison, if the counter 757 increments and decrements on the falling put/get signal edges (as illustrated in FIG. 7), then the descriptor or descriptor address will be fully loaded into the slot 572 before the data-enqueued interrupt signal is asserted. - The
front pointer 532, the back pointer 533, the depth value, the empty flag, and the full flag are illustrated in FIG. 7 as being stored in general registers 760. Using such registers, looping increment and decrement circuits may be used to update the front pointer 532, back pointer 533, and depth value as stored in their registers, instead of using dedicated counters. In the alternative, using the counters 737/747/757, the general registers 760 used to store the front pointer 532, back pointer 533, depth value, empty flag, and full flag may be omitted, with the values read from the counters and associated logic. If two of the front pointer 532, the back pointer 533, and the depth value are known, the third value can be determined. So, for example, the depth can be determined based on the difference between the front pointer and the back pointer, or the depth can be used to determine the value of the front or back pointer, based on the value of the other pointer. - Although
FIGS. 5 through 7 illustrate the FIFO queues as circular queues, other queue styles may be used, such as FIFO shift register queues. A shift register queue comprises a series of registers, where each time a slot is dequeued, all of the contents are copied forward. With shift register queues, the slot constituting the "front" is always the same, with only the back pointer changing. However, circular queues have advantages over shift register queues, such as lower power consumption, since copying multiple descriptors or descriptor addresses from slot to slot each time a descriptor 430 or descriptor address 440/450 is dequeued increases power consumption relative to the operations of a circular queue. -
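The circular-queue behavior of FIGS. 5 through 7 can be summarized in a small behavioral model: looping front/back pointers, a depth count, and empty/full flags derived from it. The class, slot count, and method names are illustrative stand-ins for the hardware, not the patent's own nomenclature.

```python
# Behavioral model of the circular task queue: fixed slots, looping
# pointers (cf. counters 737/747), depth (cf. counter 757), and
# empty/full flags (cf. flags 765/766).
class CircularQueue:
    def __init__(self, slots=8):
        self.stack = [None] * slots
        self.front = self.back = self.depth = 0

    def empty(self):                 # empty when depth == 0
        return self.depth == 0

    def full(self):                  # full when every slot is occupied
        return self.depth == len(self.stack)

    def enqueue(self, descriptor):   # write, then increment back pointer
        assert not self.full(), "back-pressure: queue full"
        self.stack[self.back] = descriptor
        self.back = (self.back + 1) % len(self.stack)   # pointer loops
        self.depth += 1

    def dequeue(self):               # read, then increment front pointer
        assert not self.empty()
        descriptor = self.stack[self.front]
        self.front = (self.front + 1) % len(self.stack)  # pointer loops
        self.depth -= 1
        return descriptor

q = CircularQueue(slots=2)
q.enqueue("descr A"); q.enqueue("descr B")
assert q.full() and q.front == q.back   # pointers equal, yet queue is full
assert q.dequeue() == "descr A"         # FIFO order preserved
assert q.dequeue() == "descr B" and q.empty()
```

Note how the assertion in the middle reproduces the ambiguity discussed with the comparator 753: equal pointers can mean either empty or full, which is why the full signal (or the depth) must disambiguate.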
FIG. 8 is a block diagram conceptually illustrating example components of a processing element of the chip in FIG. 1. In terms of hardware, the structure of the processing elements 134 that are executing the main software program and those that are subscribed to individual task queues may be identical, with the difference being that a processing element that is subscribed to a task queue 118 is loaded/configured with the scheduler 883 and idle counter 887. - A
data transaction interface 872 sends and receives packets and connects the processor core 890 to its associated program memory 874. The processor core 890 may be of a conventional "pipelined" design, and may be coupled to sub-processors such as an arithmetic logic unit 894 and a floating point unit 896. The processor core 890 includes a plurality of execution registers 880 that are used by the core 890 to perform operations. The registers 880 may include, for example, instruction registers 882, operand registers 884, and various special purpose registers 886. These registers 880 are ordinarily for the exclusive use of the core 890 for the execution of operations. Instructions and data are loaded into the execution registers 880 to "feed" an instruction pipeline 892. While a processor core 890 may experience no latency (or a latency of one or two cycles of the clock controlling timing of a micro-sequencer 891) when accessing its own execution registers 880, accessing memory that is external to the core 890 may produce a larger latency due to (among other things) the physical distance between the core 890 and the memory. - The instruction registers 882 store instructions loaded into the core that are being/will be executed by an
instruction pipeline 892. The operand registers 884 store data that has been loaded into the core 890 that is to be processed by an executed instruction. The operand registers 884 also receive the results of operations executed by the core 890 via an operand write-back unit 898. The special purpose registers 886 may be used for various "administrative" functions, such as being set to indicate divide-by-zero errors, to increment or decrement transaction counters, to indicate core interrupt "events," etc. - The instruction fetch circuitry of a micro-sequencer 891 fetches a stream of instructions for execution by the
instruction pipeline 892 in accordance with an address generated by a program counter 893. The micro-sequencer 891 may, for example, fetch an instruction every "clock" cycle, where the clock is a signal that controls the timing of operations by the micro-sequencer 891 and the instruction pipeline 892. The instruction pipeline 892 comprises a plurality of "stages," such as an instruction decode stage, an operand fetch stage, an instruction execute stage, and an operand write-back stage. Each stage corresponds to circuitry. - The chips' firmware may include a
small scheduler program 883 in firmware. When a core 890 waits too long (an exact duration may be specified in a register, e.g., based on a number of clock cycles) for a task to show up in its queue, the core 890 wakes up and runs the scheduler 883 to find some other queue with tasks for it to execute, and thereafter begins executing those tasks. The scheduler program 883 may be loaded into the instruction registers 882 of processing elements 134 subscribed to a task queue when the processing element's idle counter 887 indicates that the threshold duration of time has transpired (e.g., that the requisite number of clock cycles have elapsed). The scheduler program 883 may either be preloaded into the processing element 134, or loaded upon expiration of the idle counter 887. The idle counter 887 causes generation of an interrupt resulting in the micro-sequencer 891 executing the scheduler 883, causing the processing element 134 to search through the (currently in-use) queues and find a queue with tasks that need execution. Once it finds a new queue, it unsubscribes from the old queue (decrementing the number in register 768), subscribes to the new queue (incrementing the number in register 768), fetches the program address from register 769 of the new queue, and loads the task program code into its own program memory 874. -
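The scheduler's queue-migration step might be sketched as follows. The dict fields stand in for the subscriber-count register 768 and program memory address register 769; the "pick the busiest queue" policy is an invented assumption, since the patent only requires finding some queue with pending tasks.

```python
# Hypothetical sketch of the scheduler 883's migration step: an idle
# processing element unsubscribes from its current queue (unless it is
# the only subscriber) and subscribes to a queue with pending tasks.
def migrate(element, queues):
    old = element["queue"]
    busy = [q for q in queues if q["depth"] > 0 and q is not old]
    if not busy or old["subscribers"] <= 1:
        return old                  # every queue keeps >= 1 subscriber
    new = max(busy, key=lambda q: q["depth"])   # assumed policy: busiest
    old["subscribers"] -= 1         # cf. decrementing register 768
    new["subscribers"] += 1         # cf. incrementing register 768
    element["queue"] = new
    element["program_addr"] = new["program_addr"]   # cf. register 769,
    return new                      # then load the task program code

q1 = {"depth": 0, "subscribers": 2, "program_addr": 0x100}
q2 = {"depth": 5, "subscribers": 1, "program_addr": 0x200}
pe = {"queue": q1, "program_addr": 0x100}
assert migrate(pe, [q1, q2]) is q2
assert q1["subscribers"] == 1 and q2["subscribers"] == 2
```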
FIG. 9 illustrates a plurality of the multiprocessor chips connected together, with the task-assignable queues of several of the chips assigned to receive tasks. A processor chip 100 a includes a Task 1 queue 118.1 a, a Task 2 queue 118.2 a, and a Task 3 queue 118.5 a. A processor chip 100 d includes a Task 2 queue 118.2 d, a Task 3 queue 118.3 d, and a Task 4 queue 118.4 d. A processor chip 100 h includes a Task 1 queue 118.1 h, a Task 3 queue 118.3 h, and a Task 5 queue 118.5 h. Processor chips' queues 118 may be arranged as output queues, receiving results when a task is completed. The arrangement of chips in FIG. 9 will be used as the basis for specific execution examples discussed in connection with FIGS. 10-14. -
FIG. 10 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and the result is deposited into an output queue for the processor to retrieve. Task execution 1000 begins when a program executed by processor 134 a on processor chip 100 b results in issuance of a task 3 request 1002 to the task distributor 114 b on the processor chip 100 b. The task distributor 114 b, using a hash table 220 or CAM 252, assigns 1004 the task to the task 3 queue 118.3 d on processor chip 100 d, which is closer (in terms of network hops) than the task 3 queue 118.3 h on processor chip 100 h. - After a
processor 134 c subscribed to the task 3 input queue 118.3 d becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 c retrieves 1006 the descriptor from the queue 118.3 d. Upon completion of the task, the processor 134 c writes 1010 (by packet) the result to an output queue 118 h on the processor chip 100 b in accordance with the normal return indicator 434. The output queue 118 h generates an event signal 1012, waking the processor 134 a (if in a low power mode) and causing the processor 134 a to retrieve 1014 the results from output queue 118 h.
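The distributor's choice between the two task 3 queues above can be sketched as a nearest-queue selection. The table shapes, chip labels, and hop counts below are assumptions for illustration, not values from the figures.

```python
# Hypothetical sketch of the distributor's assignment: among the chips
# hosting an input queue for the requested task type, pick the one with
# the fewest network hops from the requesting chip.

def assign_task(task_id, queue_locations, hop_count):
    """queue_locations: task_id -> chips hosting that task's input queue.
    hop_count: chip -> hops from the requesting chip.
    Returns the nearest hosting chip."""
    return min(queue_locations[task_id], key=lambda chip: hop_count[chip])
```

With assumed hop counts of 1 to chip 100 d and 3 to chip 100 h, a task 3 request would be assigned to the queue on chip 100 d, matching the flow above.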
FIG. 11 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and the result is written back directly to the processor. Task execution 1100 begins when a program executed by processor 134 a on processor chip 100 b results in issuance of a task 3 request 1102 to the task distributor 114 b on the processor chip 100 b. The task distributor 114 b, using a hash table 220 or CAM 252, assigns 1104 the task to the task 3 queue 118.3 d on processor chip 100 d, which is closer (in terms of network hops) than the task 3 queue 118.3 h on processor chip 100 h. After a
processor 134 c subscribed to the task 3 input queue 118.3 d becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 c retrieves 1106 the descriptor from the queue 118.3 d. Upon completion of the task, the processor 134 c writes 1110 (by packet) the result directly to the operand registers 884 or program memory 874 of the processing element 134 a in accordance with the normal return indicator 434.
FIG. 12 is a transaction flow diagram illustrating an example where a processor deposits a task descriptor into an input queue, and execution chains across queues, with the end result being deposited into an output queue for the processor to retrieve. Chaining may be based on there being multiple task identifiers in the original task descriptor (e.g., FIG. 4B), and/or based on one or more tasks initiating a chain when that task is invoked. The discussion of task execution 1200 in connection with FIGS. 12 and 13A to 13F is based on the former, where the original task descriptor includes multiple task identifiers. Task execution 1200 begins when a program executed by processor 134 a on processor chip 100 b results in issuance of a task 4 request 1202 to the task distributor 114 b on the processor chip 100 b. The task distributor 114 b, using a hash table 220 or CAM 252, assigns 1204 the task to the task 4 queue 118.4 d on processor chip 100 d. After a
processor 134 e subscribed to the task 4 input queue 118.4 d becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 e retrieves 1206 the descriptor from the queue 118.4 d. Upon completion of the task, the processor 134 e writes 1210 (by packet) the result to a task distributor 114 d on the processor chip 100 d as a Task 1 request, as part of a chained task request. The task distributor 114 d sends 1212 the Task 1 assignment to the Task 1 input queue 118.1 a on processor chip 100 a. After a
processor 134 a subscribed to the task 1 input queue 118.1 a becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 a retrieves 1214 the descriptor from the queue 118.1 a. Upon completion of the task, the processor 134 a writes 1220 (by packet) the result to an output queue 118 h on processor chip 100 b, in accordance with the normal return indicator 434. The output queue 118 h generates an event signal 1230, waking the processor 134 a (if in a low power mode) and causing the processor 134 a to retrieve 1234 the results from output queue 118 h.
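The chained execution above can be sketched as consuming the descriptor's task list one stage at a time, with the final stage (the one whose additional-task bit is clear) delivering its result to the output queue named by the return indicator. The function and parameter names are hypothetical.

```python
# Rough sketch of chained task execution: each stage's result becomes the
# next stage's input; the last stage performs the normal return to the
# output queue.

def run_chain(task_ids, operands, execute, output_queue):
    data = operands
    for i, task_id in enumerate(task_ids):
        data = execute(task_id, data)    # run this stage's task program
        last = (i == len(task_ids) - 1)  # additional-task bit is clear
        if last:
            output_queue.append(data)    # normal return to output queue
    return data
```

For the task 4 then task 1 chain in FIG. 12, only the task 1 stage writes to the output queue; the task 4 stage forwards its result onward.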
FIGS. 13A to 13F illustrate examples of the content of several of the data transactions in FIG. 12, based on the packet structure discussed in connection with FIGS. 3, 4A, and 4B. If a packet payload only contains the address 440 of the task descriptor in memory (as discussed in connection with FIG. 4C), or a packet payload contains a task identifier 432 and the address 450 of a remainder of the task descriptor in memory (as discussed in connection with FIG. 4D), then the descriptors in the transactions illustrated in FIGS. 13A to 13F would reflect the state of the descriptors as stored at the addresses 440 and 450.
FIG. 13A illustrates a packet 1300 a used for the task 4 request 1202, as issued by the processing element 134 a. The header 1302 a contains the address of the task distributor 114 b. The packet payload comprises a task descriptor 1330 a. The task descriptor 1330 a includes a task 4 task identifier 1332 a, a task 1 task identifier 1332 b, a normal return indicator 1334 corresponding to the address of the output queue 118 h, an error reporting address 1336, and the task operands and/or data 1338 a. The additional task bit 1333 a appended to the task 4 identifier 1332 a is set to indicate there is another task to be performed after task 4. The additional task bit 1333 b appended to the task 1 identifier 1332 b is not set, indicating there is no other task to be performed after task 1.
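The pairing of a task identifier with its additional-task bit can be sketched as a simple bit-packing scheme. The field layout below (bit appended in the least significant position) is an assumption for illustration, not taken from the figures.

```python
# Hypothetical encoding of a task identifier with its trailing
# "additional task" bit (the roles of identifiers 1332a/1332b and
# bits 1333a/1333b above).

ADDITIONAL_TASK = 0x1

def encode_task_id(task_id, more_tasks_follow):
    # Shift the identifier up one bit; set the low bit when another
    # task follows in the chain.
    return (task_id << 1) | (ADDITIONAL_TASK if more_tasks_follow else 0)

def decode_task_id(word):
    # Recover the identifier and the additional-task flag.
    return word >> 1, bool(word & ADDITIONAL_TASK)
```

Under this assumed layout, the task 4 identifier with the bit set and the task 1 identifier with the bit clear decode back to (4, True) and (1, False).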
FIG. 13B illustrates a packet 1300 b used for the queue assignment 1204, as issued by the task distributor 114 b. The packet header 1302 b contains the address of the task 4 input queue 118.4 d. The packet payload comprises a task descriptor 1330 b. In comparison to the task descriptor 1330 a, the descriptor 1330 b omits the task 4 identifier 1332 a. FIG. 13C illustrates the task descriptor 1330 b as pulled 1206 from the task 4 input queue 118.4 d by the task 4 processor 134 e.
FIG. 13D illustrates a packet 1300 c used for the task 1 request 1210, as issued by the task 4 processor 134 e. The packet header 1302 c contains the address of the task distributor 114 d. The packet payload comprises a task descriptor 1330 c. In comparison to the task descriptor 1330 b, the descriptor 1330 c includes the results 1338 b from task 4. The task 4 results may be appended to the original task operands and data 1338 a (as illustrated), mixed with the original operands and data 1338 a, or the original operands and data 1338 a may be omitted.
FIG. 13E illustrates a packet 1300 d used for the queue assignment 1212, as issued by the task distributor 114 d. The packet header 1302 d contains the address of the task 1 input queue 118.1 a. The packet payload comprises a task descriptor 1330 d. In comparison to the task descriptor 1330 c, the descriptor 1330 d omits the task 1 identifier 1332 b.
FIG. 13F illustrates a packet 1300 e sent by the task 1 processor 134 a to the output queue 118 h in accordance with the normal return indicator 1334. The packet header 1302 e contains the address of the output queue 118 h. The packet payload may comprise the error reporting address 1336, the task 4 results data 1338 b, and the task 1 results data 1338 c. The task 1 and task 4 results may be separate or mixed, or the task 4 results 1338 b may be omitted. If the original task operands and data 1338 a did carry through the chain to the last processor in the chain (task 1 processor 134 a in FIG. 12, determining that it is last based on the additional task bit 1333 b), that last processor may omit the original operands and data 1338 a from the final results.
FIG. 14 is a transaction flow diagram illustrating an example where an originating processor 134 a deposits a task descriptor into an input queue, and a task-assigned processor deposits a sub-task into another input queue as a subroutine, with the end result being deposited into an output queue 118 h for the originating processor 134 a to retrieve. Task execution 1400 begins when a program executed by processor 134 a on processor chip 100 b results in issuance of a task 4 request 1402 to the task distributor 114 b on the processor chip 100 b. The task distributor 114 b, using a hash table 220 or CAM 252, assigns 1404 the task to the task 4 queue 118.4 d on processor chip 100 d. After a
processor 134 e subscribed to the task 4 input queue 118.4 d becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the task 4 processor 134 e retrieves 1406 the descriptor from the queue 118.4 d. In this example, task 4 itself uses task 1 as a subroutine, resulting in the task 4 processor 134 e sending 1410 a task 1 request to the task distributor 114 d on the processor chip 100 d. The task distributor 114 d sends 1412 the task 1 assignment to the task 1 input queue 118.1 a on processor chip 100 a. After a
processor 134 a subscribed to the task 1 input queue 118.1 a becomes free and determines from the empty flag 765 that there is a descriptor 430 b waiting to be dequeued, the processor 134 a retrieves 1414 the descriptor from the queue 118.1 a. Upon completion of the task, the task 1 processor 134 a writes 1420 (by packet) the result directly to the task 4 processor 134 e that issued the task 1 request. The task 4 processor 134 e thereafter completes task 4, using the task 1 data. Upon completion, the task 4 processor 134 e writes 1422 (by packet) the result to an output queue 118 h on processor chip 100 b, in accordance with the normal return indicator 434. The output queue 118 h generates an event signal 1430, waking the originating processor 134 a (if in a low power mode) and causing the originating processor 134 a to retrieve 1434 the results from output queue 118 h.
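The subroutine pattern of FIG. 14 differs from chaining in that the sub-task's result comes back to the issuing task, which then finishes its own work. A minimal sketch, with hypothetical handler functions standing in for the task programs:

```python
# Sketch of the subroutine flow: the task 4 handler issues a task 1
# request through a distributor stand-in, receives the result directly
# (as in step 1420), and uses it to complete task 4.

def task1(data):
    # Stand-in for the task 1 program.
    return ("task1-result", data)

def dispatch(handler, data):
    # Stand-in for routing a request through a task distributor to a
    # subscribed processor and returning the result to the requester.
    return handler(data)

def task4(data):
    sub = dispatch(task1, data)   # sub-task result returned directly
    return ("task4-result", sub)  # completed result goes to the output queue
```

Here `dispatch` collapses the distributor, queue, and remote processor into a single call purely for illustration.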
FIG. 15 is a hybrid process-flow transaction-flow diagram illustrating execution of the scheduler program 883 by a task-assigned processor, enabling the processor to autonomously subscribe to and unsubscribe from task queues. Initially, a
task processor 134 a is subscribed to a task 1 input queue 118.1 a, which has two subscribed cores (as specified in register 768). The queue depth (from register 764) is initially zero. After the queue 118.1 a receives 1520 a task, the task processor 134 a dequeues 1522 the task descriptor and executes 1524 the task, returning the results in accordance with the normal return indicator 434. The task processor 134 a starts 1526 its idle counter 887 and may enter a low power mode, waiting for an interrupt from the subscribed task queue indicating that a descriptor is ready to be dequeued. When the counter expires 1528 or reaches a specified value, the task processor 134 a runs the scheduler program 883, which determines 1530 whether there is more than one core subscribed to the task 1 queue 118.1 a (from register 768), such that the scheduler program 883 is permitted to choose a new input queue. If there is not (1530 “No”) more than one processor subscribed to the task 1 queue 118.1 a, the processor 134 a continues to wait 1534 for a new task 1 descriptor to appear in the input queue 118.1 a. Otherwise, the scheduler program 883 checks other input queues on the device to determine 1532 whether the depth of any of the other queues exceeds a minimum threshold depth “R”. The threshold depth is used to reduce the frequency with which processors unsubscribe from and subscribe to input queues, since each new subscription results in memory being accessed to retrieve the task program executable code. If none of the depths of the other input queues exceeds “R” (1532 “No”), the processor remains subscribed to the
task 1 queue 118.1 a. Otherwise, the scheduler 883 selects 1536 a new input queue. For example, the scheduler 883 may select the input queue with the greatest depth, or choose among input queues tied for the greatest depth. The scheduler 883 unsubscribes 1538 from the task 1 queue 118.1 a, decrementing register 768. The scheduler then subscribes 1540 to the task 2 input queue 118.2 a, which had the largest depth of the task input queues on the device. The scheduler 883 then loads 1542 the task 2 program into the program memory 874 of the processing element 134 a, based on the program address in the register 769 of the task 2 queue 118.2 a. After the task 2 program is loaded, the task processor 134 a resumes normal operations, retrieving 1544 a task 2 descriptor from the task 2 queue 118.2 a and executing that task 1546. The task processor 134 a will continue executing that same retrieved program until such time that its idle counter expires again without a task becoming available. The
scheduler program 883 may comprise executable software and/or firmware instructions, may be integrated into each task processor 134 as a sequential logic circuit, or may be a combination of sequential logic with executable instructions. For example, sequential logic included in the task processor 134 may set and start (1526) the idle counter, and determine (1528) that the task processor 134 has been idle for (or longer than) a specified/preset/predetermined duration (e.g., based on the counter expiring, or based on a comparison of the count on the counter equaling or exceeding the duration value). In response to determining (1528) that the task processor 134 has been idle for (or longer than) the specified duration, the sequential logic may load a remainder of the scheduler program 883 into the instruction registers 882 from the program memory 874 or another memory, based on an address stored in a specified register such as a special purpose register 886. The disclosed system allows for a simple, relatively easy to understand interface that accommodates chips with a large number of cores, and that improves scaling of a system by decoupling logical tasks from the arrangement of physical cores. A programmer writing the main program does not need to know (or care much) about how many cores will be executing assigned tasks. The number of cores can simply increase or decrease, depending on the number of tasks needing execution. Combined with the ability of cores to sleep while waiting for input, this flexible distribution of tasks also helps to reduce power consumption.
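The queue-selection policy stepped through in FIG. 15 can be sketched as follows. The class is a hypothetical stand-in for the per-queue register state; breaking depth ties by list order is an assumption, since the text allows any choice among tied queues.

```python
# Sketch of the FIG. 15 selection policy: a processor moves only when
# another core is also subscribed to its current queue (so the queue is
# not orphaned) and some other queue's depth exceeds the threshold R.

class QueueState:
    def __init__(self, depth, subscribers):
        self.depth = depth              # role of the depth register 764
        self.subscribers = subscribers  # role of register 768

def select_queue(current, others, r):
    if current.subscribers <= 1:
        return current   # sole subscriber (1530 "No"): keep waiting
    deeper = [q for q in others if q.depth > r]
    if not deeper:
        return current   # nothing exceeds threshold R (1532 "No"): stay
    return max(deeper, key=lambda q: q.depth)  # pick the greatest depth
```

The threshold check keeps subscription churn, and the memory traffic of reloading task programs, to a minimum.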
- Other addressing schemes may also be used, as well as different addressing hierarchies. Whereas a
processor core 890 may directly access its own execution registers 882 using address lines and data lines, communications between processing elements through the data transaction interfaces 872 may be via bus-based or packet-based networks. The bus-based networks may comprise address lines and data lines, conveying addresses via the address lines and data via the data lines. In comparison, the packet-based network connections may comprise a single serial data line or plural data lines, conveying addresses in packet headers and data in packet bodies via the data line(s). Aspects of the disclosed system, such as the
scheduler 883 and the various executed software and firmware instructions, may be implemented as a computer method or as an article of manufacture such as a memory device or non-transitory computer readable storage medium. The computer readable storage medium may be readable by a computer and may comprise instructions for causing a computer or other device to perform processes described in the present disclosure. The computer readable storage medium may be implemented by a volatile computer memory, non-volatile computer memory, hard drive, solid-state memory, flash drive, removable disk, and/or other media. The examples discussed herein are meant to be illustrative. They were chosen to explain the principles and application of a task-queue based computer system, and are not intended to be exhaustive or to limit such a system to the disclosed topologies, hardware structures, logic states, header formats, and descriptor formats. Many modifications and variations that utilize the operating principles of task queuing may be apparent to those of skill in the art. Persons having ordinary skill in the field of computers, microprocessor design, and network architectures should recognize that components and process steps described herein may be interchangeable with other components or steps, or combinations of components or steps, and still achieve the benefits and advantages of task queuing. Moreover, it should be apparent to one skilled in the art that the disclosure may be practiced without some or all of the specific details and steps disclosed herein.
- Different logic and logic elements can be interchanged for the disclosed logic while achieving the same results. For example, a digital comparator that determines whether the
depth 764 is equal to zero is functionally identical to a NOR gate, where the data lines conveying the depth value are input into the NOR gate, the output of which will be asserted when the binary value across the data lines equals zero. To accommodate the high speeds at which the
device 100 will ordinarily operate, it is contemplated that the FIFO queues 118 will be hardware queues, as discussed in connection with FIG. 7. However, although it is contemplated that the queues 118 will be hardware queues, software-controlled queues could be substituted. A mix of hardware queues and software-controlled queues may also be used. “Writing,” “storing,” and “saving” are used interchangeably. “Enqueuing” includes writing/storing/saving to a queue. When data is written or enqueued to a location by a component (e.g., by a processing element, a task distributor, etc.), the operation may be directed by the component or the component may send the data to be written/enqueued (e.g., sending the data by packet, together with a write instruction). As such, “writing” and “enqueuing” should be understood to encompass “causing” data to be written or enqueued. Similarly, when a component “reads” or “dequeues” from a location, the operation may be directed by the component or the component may send a request (e.g., by packet, by asserting a signal line, etc.) that causes the data to be provided to the component. Queue management (e.g., the updating of the depth, the front pointer, and the back pointer) may be performed by the queue itself, such that enqueuing and dequeuing causes queue management to occur, but does not require that the component enqueuing to or dequeuing from the queue itself be responsible for queue management.
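A software-controlled substitute for the hardware queues might look like the following ring-buffer sketch, whose visible state mirrors the roles of the depth register (764) and empty flag (765) described earlier. The capacity and error behavior are assumptions.

```python
# Minimal software-controlled FIFO: a ring buffer with front/back
# pointers, a depth count, and an empty flag derived from depth == 0
# (the comparator/NOR equivalence noted above).

class SoftwareQueue:
    def __init__(self, capacity=8):
        self.buf = [None] * capacity
        self.front = 0   # front pointer
        self.back = 0    # back pointer
        self.depth = 0   # role of the depth register 764

    @property
    def empty(self):     # role of the empty flag 765
        return self.depth == 0

    def enqueue(self, descriptor):
        if self.depth == len(self.buf):
            raise OverflowError("queue full")
        self.buf[self.back] = descriptor
        self.back = (self.back + 1) % len(self.buf)
        self.depth += 1

    def dequeue(self):
        if self.empty:
            raise IndexError("queue empty")
        descriptor = self.buf[self.front]
        self.front = (self.front + 1) % len(self.buf)
        self.depth -= 1
        return descriptor
```

As in the hardware case, queue management (pointer and depth updates) is handled by the queue itself, not by the components enqueuing and dequeuing.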
- As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.
Claims (22)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/173,017 US20170351555A1 (en) | 2016-06-03 | 2016-06-03 | Network on chip with task queues |
Publications (1)
Publication Number | Publication Date |
---|---|
US20170351555A1 true US20170351555A1 (en) | 2017-12-07 |
Family
ID=60483818
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: XL INNOVATE FUND, L.P., CALIFORNIA Free format text: SECURITY INTEREST;ASSIGNOR:KNUEDGE INCORPORATED;REEL/FRAME:040601/0917 Effective date: 20161102 |
|
AS | Assignment |
Owner name: KNUEDGE, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:COFFIN, JEROME VINCENT;REEL/FRAME:040485/0204 Effective date: 20161028 |
|
AS | Assignment |
Owner name: XL INNOVATE FUND, LP, CALIFORNIA Free format text: SECURITY INTEREST;ASSIGNOR:KNUEDGE INCORPORATED;REEL/FRAME:044637/0011 Effective date: 20171026 |
|
AS | Assignment |
Owner name: FRIDAY HARBOR LLC, NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:KNUEDGE, INC.;REEL/FRAME:047156/0582 Effective date: 20180820 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |