US20210191781A1 - Concurrent program execution optimization
- Publication number: US20210191781A1
- Authority: US (United States)
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/48—Indexing scheme relating to G06F9/48
- G06F2209/483—Multiproc
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5021—Priority
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
- G06F8/65—Updates
- G06F8/656—Updates while running
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/70—Admission control; Resource allocation
- H04L47/78—Architectures of resource allocation
- FIG. 1 shows, in accordance with an embodiment of the invention, a functional block diagram for a load balancing architecture for a bank of processor systems, such as those discussed in the following with reference to the remaining FIGS.
- FIG. 2 shows, in accordance with an embodiment of the invention, a functional block diagram for a multi-stage many core processing system shared dynamically among a set of software program instances, with the system providing capabilities for optimally scheduling inter-task communications (ITC) units between various tasks of any one of the program instances, as well as scheduling and placing instances of a given program task for execution on the processing stages of the system, at least in part based on which of the instances have available for them the input data, e.g. ITC data, needed by them to execute.
- FIG. 3 shows, in accordance with an embodiment of the invention, a functional block diagram for a receive (RX) logic module of any of the processing stages of the multi-stage manycore processor system per FIG. 2 .
- FIG. 4 shows, in accordance with an embodiment of the invention, a functional block diagram for an application program specific submodule of the processing stage RX logic module per FIG. 3 .
- FIG. 5 shows, in accordance with an embodiment of the invention, a functional block diagram for an application program instance specific submodule of the application program specific submodule per FIG. 4 .
- FIG. 6 shows, in accordance with an embodiment of the invention, a functional block diagram for logic resources within one of the processing stages of a system 1 per FIG. 2 for connecting ITC data from input buffers of the RX logic (per FIGS. 3-5 ) to the manycore processor of the local processing stage.
- FIG. 7 shows, in accordance with an embodiment of the invention, a functional block diagram for the application load adaptive manycore processor of a processing stage of the multi-stage processing system per preceding FIGS.
- The FIGS. and related descriptions in the following provide specifications for embodiments and aspects of hardware-logic based systems and methods for inter-task communications (ITC) with destination task defined source task prioritization, for input data availability based prioritization of instances of a given application task for execution on the processing cores of the processing stage hosting the given task, for architecture-based application performance isolation for ITC in a multi-stage manycore data processing system, as well as for load balancing of incoming processing data units among a group of such processing systems.
- FIG. 1 presents the load balancing architecture for a row of processing systems per this description, comprising a set 4 of T load balancers 3 and a load balancing group 2 of S processing systems 1 (T and S are positive integers).
- Each of the balancers forwards any no-instance-specified (NIS) packets (i.e. packets without a specific instance of their destination application identified) arriving to it via its network inputs to one of the processing systems of the group, based on the NIS packet forwarding preference scores (for the destination app of the given NIS packet) of the individual processing systems of the load balancing group 2.
- The load balancing per FIG. 1 for a bank 2 of the processing systems operates as follows:
- The load balancing logic 4 computes the collective sum Z of the Y numbers across all the apps (with this across-apps sum Z naturally being the same for all apps on a given processing system).
- The mechanisms above are designed to eliminate all packet drops in the system that are avoidable by system design, i.e., for reasons other than app-instance specific buffer overflows caused by systemic mismatches between the input data load to a given app-inst and the capacity entitlement level subscribed to by the given app.
- FIG. 2 provides, according to an embodiment of the invention, a functional block diagram for a multistage manycore processor system 1 shared dynamically among multiple concurrent application programs (apps), with hardware logic implemented capabilities for scheduling tasks of application program instances and prioritizing inter-task communications (ITC) among tasks of a given app instance, based at least in part on, for any given app-inst at a given time, which tasks are expecting input data from which other tasks and which tasks are ready to execute on cores of the multi-stage manycore processing system, with the ready-to-execute status of a given task being determined at least in part based on whether the given task has available to it the input data from other tasks or system 1 inputs 19 that it needs to execute at the given time, including to produce its processing outputs, such as ITC communications 20 to other tasks or program processing results.
- The multi-stage manycore processor system 1 is shared dynamically among tasks of multiple application programs (apps) and instances (insts) thereof, with, for each of the apps, each task located at one of the manycore-processor-based processing stages 300.
- The architecture per FIG. 2, with its any-to-any ITC connectivity between the stages 300, supports organizing tasks of a program flexibly for any desired mixes or matches of pipelined and/or parallelized processing.
- the system provides data processing services to be used by external parties (e.g. by clients of the programs hosted on the system) over networks.
- the system 1 receives data units (e.g. messages, requests, data packets or streams to be processed) from its users through its inputs 19 , and transmits the processing results to the relevant parties through its network outputs 50 .
- the network ports of the system of FIG. 2 can be used also for connecting with other (intermediate) resources and services (e.g. storage, databases etc.) as desired for the system to produce the requested processing results to the relevant external parties.
- the application program tasks executing on the entry stage manycore processor are typically of ‘master’ type for parallelized/pipelined applications, i.e., they manage and distribute the processing workloads for ‘worker’ type tasks running (in pipelined and/or parallel manner) on the worker stage manycore processing systems (note that the processor system hardware is similar across all instances of the processing stages 300 ).
- The instances of master tasks typically do preliminary processing (e.g. message/request classification, data organization) and workflow management based on given input data units (packets), and then typically involve appropriate worker tasks at their worker stage processors to perform the data processing called for by the given input packet, potentially in the context of and in connection with other related input packets and/or other data elements.
- the master tasks typically pass on the received data units (using direct connection techniques to allow most of the data volumes being transferred to bypass the actual processor cores) through the (conceptual) inter-stage packet-switch (PS) to the worker stage processors, with the destination application-task instance (and thereby, the destination worker stage) identified for each data unit as described in the following.
- The hardware controller of each processing stage 300, rather than any application software (executing on a given processor), inserts the application ID# bits for the data packets passed to the PS 200. That way, the tasks of any given application running on the processing stages in a system can trust that the packets they receive from the PS are from their own application.
- the controller determines, and therefore knows, the application ID # that each given core within its processor is assigned to at any given time, via the application-instance to core mapping info that the controller produces. Therefore the controller is able to insert the presently-assigned app ID # bits for the inter-task data units being sent from the cores of its processing stage over the core-specific output ports to the PS.
- the system enables external parties to communicate with any such application hosted on the system without knowledge about any specifics (incl. existence, status, location) of their internal tasks or instances.
- the incoming data units to the system are expected to identify just their destination application, and when applicable, the application instance.
- the system enables external parties to communicate with any given application hosted on a system through any of the network input ports 10 of any of the load balancers 3 , without such external parties knowing whether or at which cores 520 ( FIG. 7 ) or processing stages 300 any instance of the given application task (app-task) may be executing at any time.
- the architecture enables the aforesaid flexibility and efficiency through its hardware logic functionality, so that no system or application software running on the system needs to either keep track of whether or where any of the instances of any of the app-tasks may be executing at any given time, or which port any given inter-task or external communication may have used.
- the system while providing a highly dynamic, application workload adaptive usage of the system processing and communications resources, allows the software running on and/or remotely using the system to be designed with a straightforward, abstracted view of the system: the software (both remote and local programs) can assume that all the applications, and all their tasks and instances, hosted on the given system are always executing on their virtual dedicated processor cores within the system.
- said virtual dedicated processors can also be considered by software to be time-share slices on a single (unrealistically high speed) processor.
- The presented architecture thereby enables achieving, at the same time, the vital application software development productivity (a simple, virtually static view of the actually highly dynamic processing hardware), high program runtime performance (scalable concurrent program execution with minimized overhead), and resource efficiency (adaptively optimized resource allocation) benefits.
- Techniques enabling such benefits of the architecture are described in the following through more detailed technical description of the system 1 and its subsystems.
- the any-to-any connectivity among the app-tasks of all the processing stages 300 provided by the PS 200 enables organizing the worker tasks (located at the array of worker stage processors) flexibly to suit the individual demands (e.g. task inter-dependencies) of any given application program on the system: the worker tasks can be arranged to conduct the work flow for the given application using any desired combinations of parallel and pipelined processing.
- it is possible to have the same task of a given application located on any number of the worker stages in the architecture per FIG. 2 to provide a desired number of parallel copies of a given task per an individual application instance, i.e. to support also data-parallelism, along with task concurrency.
- the set of applications configured to run on the system can have their tasks identified by (intra-app) IDs according to their descending order of relative (time-averaged) workload levels.
- The above scheme causes the task IDs of the set of apps to be placed at the processing stages per Table 1 below (shown here illustratively for four apps A-D across four worker stages; entries are intra-app task ID #s):

  Worker stage 1: A=1, B=2, C=3, D=4 (row sum 10)
  Worker stage 2: A=2, B=3, C=4, D=1 (row sum 10)
  Worker stage 3: A=3, B=4, C=1, D=2 (row sum 10)
  Worker stage 4: A=4, B=1, C=2, D=3 (row sum 10)

- The sum of the task ID #s (with each task ID # representing the workload ranking of its task within its app) is the same for any row, i.e., for each worker stage.
- This load balancing scheme can be straightforwardly applied for differing numbers of processing stages/tasks and applications, so that the overall task processing load is, as much as possible, equal across all worker-stage processors of the system. Advantages of such schemes include achieving optimal utilization efficiency of the processing resources and eliminating, or at least minimizing, the possibility and effects of any of the worker-stage processors forming system-wide performance bottlenecks.
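- The placement rule behind Table 1 can be stated compactly in code. The sketch below is an illustration under stated assumptions, not logic from the patent: the function and variable names are invented, and the master task (task ID 0) is taken to live on the entry stage. It rotates each app's worker task IDs across the worker stages so that every stage receives the same rank sum, for any numbers of apps and worker stages.

```python
# Illustrative sketch of the rotated task-ID placement: placement[stage][app]
# gives the intra-app task ID # hosted at that worker stage. Task ID #s rank
# an app's worker tasks by descending time-averaged workload (1 = heaviest),
# so equal per-stage rank sums approximate equal per-stage processing load.

def place_tasks(num_apps: int, num_worker_stages: int) -> list[list[int]]:
    placement = []
    for stage in range(num_worker_stages):
        # Rotate the task ID sequence by one position per app column.
        row = [(stage + app) % num_worker_stages + 1 for app in range(num_apps)]
        placement.append(row)
    return placement

if __name__ == "__main__":
    for stage, row in enumerate(place_tasks(4, 4), start=1):
        # Every row sum is identical, so no worker stage is a bottleneck.
        print(f"worker stage {stage}: task IDs {row}, rank sum {sum(row)}")
```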
- A non-exclusive alternative task-to-stage placement principle targets grouping tasks from the apps in order to minimize any variety among the processing core types demanded by the set of app-tasks placed on any given individual processing stage; that way, if all app-tasks placed on a given processing stage optimally run on the same processing core type, there is no need for reconfiguring the core slots of the manycore array at the given stage regardless of which of the locally hosted app-tasks get assigned to which of its core slots (see [1], Appendix A, Ch. 5.5 for task type adaptive core slot reconfiguration, which may be used when the app-tasks located on the given processing stage demand different execution core types).
- FIGS. 3-5 present the processing stage, app, app-instance level microarchitectures for the processing stage receive (RX) logic modules 201 (which collectively accomplish the functionality of the conceptual inter-stage packet-switch (PS) module of FIG. 2 ).
- The functionality of the conceptual inter-stage PS 200 is actually realized by instantiating the logic per FIG. 3 (and its submodules) as the RX logic of each manycore processing system 300 (referred to as a stage) in the multi-stage architecture; no other logic is needed for the PS.
- The stage RX logic 201 per FIGS. 3-5 is part of the processing stage 300 that it interfaces to; i.e., in an actual hardware implementation, there is no separate PS module, as its functionality is distributed to the individual processing stages.
- FIG. 4 shows how the app-specific RX logic forms, for purposes of optimally assigning the processing cores of the local manycore processor among insts of the apps sharing the system, the following info for the given app: its capacity demand indicators 430, i.e., its core demand figure (CDF) and the execution priority order 535 of its instances (discussed further with FIG. 7).
- The app-instance specific RX logic per FIG. 5 performs multiplexing 280 of ITC packets from the source stage, i.e. source task (of a given app-inst), specific First-in First-Out buffers (FIFOs) 260 to the local manycore processor via the input port 290 of that processor dedicated to the given app instance.
- The same stage RX logic 201 is used at each stage of this multi-stage architecture, and thus its operation can be fully explained (as is done in the following) by assuming that the processing stage under study is instantiated as a worker or exit stage processing system, i.e., one that receives its input data from the other processing stages of the given multi-stage manycore processor, rather than from the load balancers of the given load balancing group as in the case of the entry-stage processors; the load balancers appear to the entry stage as virtual processing stages.
- For the entry stage, the references to 'source stage' are to be understood as actually referring to the load balancers, and the references to ITC mean input data 19 to the multi-stage manycore processor system, except in the case of the ITC 20 from the exit stage, as detailed above and as illustrated in FIG. 2.
- The description of the stage RX logic herein is thus written considering the operating context of worker and exit stage processors (with the same hardware logic being used also for the entry stage).
- the app-instance specific RX logic per FIG. 5 has a FIFO module 245 per each of the source stages.
- The source-stage specific FIFO module comprises the ITC FIFO buffer 260 itself, along with logic monitoring the buffer fill level (used to form the ITC send permit signal 212 discussed below).
- the app-instance RX module 203 per FIG. 5 further provides arbitrating logic 270 to decide, at multiplexing packet boundaries 281 , from which of the source stage FIFO modules 245 to mux 280 out the next packet to the local manycore processor via the processor data input port 290 specific to the app-instance under study.
- This muxing process operates as follows:
- The software of each given app-instance provides a logic vector 595 to the arbitrating logic 270 of its associated app-instance RX module 203, with a priority indicator bit per each of its individual source-stage specific FIFO modules 245: while a bit of such a vector relating to a particular source stage is at its active state (e.g. logic '1'), ITC from that source stage to the local task of the app-instance is considered high priority, and otherwise normal priority, by the arbitrator logic when selecting the source-stage specific FIFO from which to read the next ITC packet for the local (destination) task of the studied app-instance.
- The arbitrator selects the source-stage specific FIFO 260 (within the array 240 of the local app-instance RX module 203) for reading 265, 290 the next packet per the following source priority ranking algorithm:
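- The enumerated ranking rules are embodiment specific; the following is a minimal sketch of one plausible ranking consistent with the description above (an assumption, not quoted from the patent): non-empty FIFOs 260 whose source stage is marked high priority in the vector 595 are served before the remaining non-empty FIFOs, with round-robin rotation inside each class to keep the selection fair.

```python
# Sketch of a source-stage arbiter 270 selection step at a packet boundary 281.
# fifo_nonempty[s]: True if the FIFO specific to source stage s holds data.
# high_prio[s]: priority indicator bit for source stage s from vector 595.
# last_served: source stage read at the previous packet boundary.

def select_source_stage(fifo_nonempty, high_prio, last_served):
    n = len(fifo_nonempty)
    # Round-robin scan order, starting just after the last served stage.
    order = [(last_served + 1 + i) % n for i in range(n)]
    for want_high in (True, False):       # serve the high-priority class first
        for s in order:
            if fifo_nonempty[s] and bool(high_prio[s]) == want_high:
                return s
    return None                           # all source-stage FIFOs are empty
```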
- the ITC source task prioritization info 595 from the task software of app-instances to their RX logic modules 203 can change dynamically, as the processing state and demands of input data for a given app-instance task evolve over time, and the arbitrator modules 270 ( FIG. 5 ) apply the current state of the source task prioritization info provided to them in selecting from which of the source stages to multiplex 280 out the next ITC packet over the output port 290 of the app-instance RX logic.
- When a need arises, the local task of a given app-inst writes 575, 595 the respective ITC prioritization levels for its source tasks (of the given app-inst) to its source-task specific ITC prioritization hardware registers, which are located at (or have their info connected to) the source-stage prioritization control logic submodule 285 of the arbitrator 270 of the RX module 203 of that given app-inst.
- Please see FIG. 7 for the muxing 580 of the input data read control info (incl. source prioritization) from the app-insts executing at the cores of the array to their associated RX modules 203.
- Each of the source stage specific FIFO modules 245 of a given app-instance at the RX logic for a given processing stage maintains a signal 212 indicating whether the task (of the app instance under study) located at the source stage that the given FIFO 260 is specific to is presently permitted to send ITC to the local (destination) task of the app-instance under study: the logic denies the permit when the FIFO fill level is above a defined threshold, while it otherwise grants the permit.
- any given (source) task when assigned for execution at a core 520 ( FIG. 7 ) at the processing stage where the given task is located, receives the ITC sending permission signals from each of the other (destination) tasks of its app-instance.
- these ITC permissions are connected 213 to the processing cores of the (ITC source) stages through multiplexers 600 , which, according to the control 560 from the controller 540 at the given (ITC source) processing stage identifying the active app-instance for each execution core 520 , connect 213 the incoming ITC permission signals 212 from the other stages of the given multi-stage system 1 to the cores 520 at that stage.
- the processing stage provides core specific muxes 600 , each of which connects to its associated core the incoming ITC send permit signals from the ‘remote’ (destination) tasks of the app-instance assigned at the time to the given core, i.e., from the tasks of that app-instance located at the other stages of the given processing system.
- The (destination) task RX logic modules 203 activate the ITC permission signals for the times that the source task to which the given permission signal is directed is permitted to send further ITC data to that destination task of the given app-inst.
- Each given processing stage receives and monitors ITC permit signals 212 from those of the processing stages to which the given stage is actually able to send ITC data; please see FIG. 2 for ITC connectivity among the processing stages in the herein studied embodiment of the presented architecture.
- the ITC permit signal buses 212 will naturally be connected across the multi-stage system 1 between the app-instance specific modules 203 of the RX logic modules 202 of the ITC destination processing stages and the ITC source processing stages (noting that a given stage 300 will be both a source and destination for ITC as illustrated in FIG. 2 ), though the inter-stage connections of the ITC flow control signals are not shown in FIG. 2 .
- The starting and ending points of these signals are shown in FIG. 5 and FIG. 7 respectively, while the grouping of these ITC flow control signals according to the processing stage that a given signal group is directed to, as well as the forming of the stage specific signal groups according to the app-instance # that any given ITC flow control signal concerns, are illustrated in FIGS. 3-4.
- the principle is that, at arrival to the stage that a given set of such groups of signals is directed to, the signals from said groups are re-grouped to form, for each of the app-instances hosted on the system 1 , a bit vector where a bit of a given index indicates whether the task of a given app-instance (that the given bit vector is specific to) hosted at this (source) stage under study is permitted at that time to send ITC data to its task located at the stage ID # of that given index.
- each given bit in these bit vectors informs whether the studied task of the given app-instance is permitted to send ITC to the task of that app-instance with task ID # equal to the index of the given bit.
- the above discussed core specific muxes 600 are able to connect to any given core 520 of the local manycore array the (task-ID-indexed) ITC flow control bit vector of the app-instance presently assigned for execution at the given core.
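- A sketch of that regrouping and muxing follows. The data layout and names are assumptions for illustration; in particular it simplifies by treating the task ID # of an app-instance's task as equal to its hosting stage ID #, whereas the actual task-to-stage mapping is per-app, per Table 1.

```python
from typing import Dict, List

# permits_in[stage_id][app_inst]: ITC send permit 212 arriving from the
# destination task of app_inst hosted at stage_id.
def regroup_permits(permits_in: Dict[int, Dict[str, bool]],
                    app_insts: List[str], num_stages: int) -> Dict[str, List[bool]]:
    # One task-ID-indexed bit vector per app-instance hosted on the system.
    vectors = {inst: [False] * num_stages for inst in app_insts}
    for stage_id, per_inst in permits_in.items():
        for inst, permitted in per_inst.items():
            vectors[inst][stage_id] = permitted
    return vectors

# Mux 600: each core 520 receives the bit vector of the app-instance that the
# controller 540 has presently assigned (via control 560) to that core.
def mux_to_cores(vectors: Dict[str, List[bool]],
                 core_to_inst: Dict[int, str]) -> Dict[int, List[bool]]:
    return {core: vectors[inst] for core, inst in core_to_inst.items()}
```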
- the FIFO fill-above-threshold indications from the source stage specific FIFOs 260 of the app-instance specific submodules of the RX logic modules of the (ITC destination) processing stages of the present multi-stage system are wired directly, though as inverted, as the ITC send permission indication signals to the appropriate muxes 600 of the (ITC source) stages, without going through the arbitrator modules (of the app-instance RX logic modules at the ITC destination stages).
- an ITC permission signal indicating that the destination FIFO for the given ITC flow has its fill level presently above the configured threshold is to be understood by the source task for that ITC flow as a denial of the ITC permission (until that signal would turn to indicate that the fill level of the destination FIFO is below the configured ITC permission activation threshold).
- Each source task applies these ITC send permission signals from a given destination task of its app-instance at times that it is about to begin sending a new packet over its (assigned execution core specific) processing stage output port 210 to that given destination task.
- the ITC destination FIFO 260 monitoring threshold for allowing/disallowing further ITC data to be sent to the given destination task (from the source task that the given FIFO is specific to) is set to a level where the FIFO still has room for at least one ITC packet worth of data bytes, with the size of such ITC packets being configurable for a given system implementation, and the source tasks are to restrict the remaining length of their packet transmissions to destination tasks denying the ITC permissions according to such configured limits.
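- A sketch of this fill-level based flow control is below. The byte counts and names are illustrative assumptions; the text above fixes only the principle that the threshold leaves room for at least one maximum-size ITC packet.

```python
MAX_ITC_PACKET_BYTES = 2048     # ITC packet size, configurable per system
FIFO_CAPACITY_BYTES = 16384     # capacity of a source task specific FIFO 260

def itc_send_permit(fifo_fill_bytes: int) -> bool:
    """Forms signal 212: True means the source task may send further ITC.

    The permit is withdrawn while the destination FIFO is filled above a
    threshold chosen so that at least one full ITC packet still fits; a
    source task that sees the permit deasserted limits the remaining length
    of any in-flight packet per the configured packet-size limit."""
    threshold = FIFO_CAPACITY_BYTES - MAX_ITC_PACKET_BYTES
    return fifo_fill_bytes < threshold
```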
- The app-level RX logic per FIG. 4 arranges the instances of its app into the instance execution priority list 535 (sent via info flow 430) in descending order of their priority scores, computed for each instance based on its numbers 429 of source-stage specific non-empty FIFOs 260 (FIG. 5), as follows.
- For the priority scores, we first define (a non-negative integer) H as the number of non-empty FIFOs of the given instance whose associated source stage was assigned a high ITC priority (by the local task of the given app-instance hosted at the processing stage under study), and L as the number of its remaining (normal priority) non-empty FIFOs.
- The intra-app execution priority score P for a given instance specific module is then formed with equations as follows, with different embodiments having differing coefficients for the factors H, L and the number of tasks for the app, T:
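- One simple form of such an equation, shown here as an assumption consistent with the stated factors rather than as the patent's own coefficients, weights any high-priority input availability above all normal-priority availability:

```python
def intra_app_priority_score(H: int, L: int, T: int) -> int:
    # H: count of non-empty FIFOs 260 fed by high ITC priority source stages.
    # L: count of the instance's other (normal priority) non-empty FIFOs.
    # T: number of tasks of the app; using it as the weight on H guarantees
    #    (since L <= T - 1) that one high-priority input outranks any number
    #    of normal-priority inputs. Assumed coefficients, per the text above.
    return T * H + L
```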
- The logic for prioritizing the instances of the given app for its execution priority list 535, via a continually repeating process, signals this instance execution priority list (via hardware wires dedicated for the purpose) to the controller 540 of the local manycore processor 500 (FIG. 7), using the following format:
- the process periodically starts from priority order 0 (i.e. the app's instance with the greatest priority score P), and steps through the remaining priority orders 1 through the maximum supported number of instances for the given application (specifically, for its task located at the processing stage under study) less 1, producing one instance entry per each step on the list that is sent to the controller as such individual entries.
- Each entry of such a priority list comprises, as its core info, simply the instance ID # (as the priority order of any given instance is known from the number of clock cycles since the bit pulse marking the priority order 0 at the start of a new list).
- the controller 540 of the manycore processor uses the most recent set of complete priority order lists 535 received from the application RX modules 202 to determine which (highest priority) instances of each given app to assign for execution for the next core allocation period on that processor.
- The ITC source prioritization, program instance execution prioritization and ITC flow control techniques provide effective program execution optimization capabilities for each of a set of individual programs configured to dynamically share a given data processing system 1 per this description, without any of the programs impacting, or being impacted by, the other programs of such a set in any manner.
- the individual instances (e.g. different user sessions) of a given program are fully independent from each other.
- the herein described techniques and architecture thus provide effective performance and runtime isolation between individual programs among groups of programs running on the dynamically shared parallel computing hardware.
- any of the processing stages 300 of the multi-stage system 1 per FIG. 2 has, besides the RX logic 201 and the actual manycore processor system ( FIG. 7 ), an input multiplexing subsystem 450 , which connects input data packets from any of the app-instance specific input ports 290 to any of the processing cores 520 of the processing stage, according to which app-instance is executing at any of the cores at any given time.
- the monitoring of the buffered input data availability 261 at the destination app-instance FIFOs 260 of the processing stage RX logic enables optimizing the allocation of processing core capacity of the local manycore processor among the application tasks hosted on the given processing stage. Since the controller module 540 of the local manycore processor determines which instances of the locally hosted tasks of the apps in the system 1 execute at which of the cores of the local manycore array 515 , the controller is able to provide the dynamic control 560 for the muxes 450 per FIG. 6 to connect the appropriate app-instance specific input data port 290 from the stage RX logic to each of the core specific input data ports 490 of the manycore array of the local processor.
- Internal elements and operation of the application load adaptive manycore processor system 500 are illustrated in FIG. 7.
- For this intra-processing-stage discussion, it shall be recalled that there is no more than one task located per processing stage per each of the apps, though there can be up to X (a positive integer) parallel instances of any given app-task at its local processing stage (which has an array 515 of X cores).
- the term app-instance in the context of a single processing stage means an instance of an app-task hosted at the given processing stage under study.
- FIG. 7 provides a functional block diagram for the manycore processor system dynamically shared among instances of the locally hosted app-tasks, with capabilities for application input data load adaptive allocation of the cores 520 among the applications and for app-inst execution priority based assignment of the cores (per said allocation), as well as for accordantly dynamically reconfigured 550 , 560 I/O and memory access by the app-insts.
- the processor system 500 comprises an array 515 of processing cores 520 , which are dynamically shared among instances of the locally hosted tasks of the application programs configured to run on the system 1 , under the direction 550 , 560 of the hardware logic implemented controller 540 .
- Application program specific logic functions at the RX module ( FIG. 3-5 ) signal their associated applications' capacity demand indicators 430 to the controller.
- The core-demand-figures (CDFs) 530 express how many cores their associated app is presently able to utilize for its (ready to execute) instances.
- Each application's capacity demand expressions 430 for the controller further include a list of its ready instances in an execution priority order 535 .
- Any of the cores 520 of a processor per FIG. 7 can comprise any types of software program and data processing hardware resources, e.g. central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs) or application specific processors (ASPs) etc., and in a programmable logic (FPGA) implementation, the core type for any core slot 520 is furthermore reconfigurable per the expressed demands of its assigned app-task, e.g. per [1], Appendix A, Ch. 5.5.
- Through a periodic process, the hardware logic based controller module 540 within the processor system allocates and assigns the cores 520 of the processor among the set of applications and their instances based on the applications' core demand figures (CDFs) 530 as well as their contractual core capacity entitlements (CEs).
- This application instance to core assignment process is exercised periodically, e.g. at intervals such as once per a defined number (for instance 64, 256 or 1024, or so forth) of processing core clock or instruction cycles.
- the app-instance to core assignment algorithms of the controller produce, per the app-instances on the processor, identification 550 of their execution cores (if any, at any given time), as well as per the cores of the fabric, identification 560 of their respective app-instances to execute.
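- A behavioral sketch of one such periodic allocate-then-assign round is given below. The two-phase structure and names are assumptions for illustration; the description fixes only the inputs (CDFs 530, entitlements CEs, priority lists 535) and the outputs (550, 560). It also assumes the contractual entitlements sum to at most the core count.

```python
def allocate_and_assign(cdf, entitlement, priority_lists, num_cores):
    # Phase 1 (allocation): honor each app's entitled share of its demand
    # first, then hand out any remaining cores toward unmet demand.
    alloc = {app: min(cdf[app], entitlement[app]) for app in cdf}
    spare = num_cores - sum(alloc.values())
    for app in sorted(cdf, key=lambda a: cdf[a] - alloc[a], reverse=True):
        extra = min(spare, cdf[app] - alloc[app])
        alloc[app] += extra
        spare -= extra
    # Phase 2 (assignment): map each app's highest-priority ready instances
    # (per its list 535) onto concrete cores of the array 515.
    inst_to_core, core_to_inst, next_core = {}, {}, 0   # outputs 550 and 560
    for app, n in alloc.items():
        for inst in priority_lists[app][:n]:
            inst_to_core[(app, inst)] = next_core
            core_to_inst[next_core] = (app, inst)
            next_core += 1
    return inst_to_core, core_to_inst
```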
- the assignments 550 , 560 between app-insts and the cores of the array 515 control the access between the cores 520 of the fabric and the app-inst specific memories at the fabric network and memory subsystem 800 (which can be implemented e.g. per [1] Appendix A, Ch. 5.4).
- the app-instance to core mapping info 560 also directs the muxing 450 of input data from the RX buffers 260 of an appropriate app-instance to each core of the array 515 , as well as the muxing 580 of the input data read control signals ( 570 to 590 , and 575 to 595 ) from the core array to the RX logic submodule ( FIG. 5 ) of the app-instance that is assigned for any given core 520 at any given time.
- the core to app-inst mapping info 560 also directs the muxing 600 of the (source) app-instance specific ITC permit signals ( 212 to 213 ) from the destination processing stages to the cores 520 of the local manycore array, according to which app-instance is presently mapped to which core.
Abstract
An architecture for load-balanced groups of multi-stage manycore processors shared dynamically among a set of software applications, with capabilities for destination task defined intra-application prioritization of inter-task communications (ITC), for architecture-based ITC performance isolation between the applications, as well as for prioritizing application task instances for execution on cores of the manycore processors based at least in part on which of the task instances have available to them the input data, such as ITC data, that they need in order to execute.
Description
- This application is a continuation application of U.S. application Ser. No. 16/434,581 filed Jun. 7, 2019, which is a continuation application of U.S. application Ser. No. 15/267,153 filed Sep. 16, 2016 (now U.S. Pat. No. 10,318,353), which is a continuation application of U.S. application Ser. No. 14/318,512 filed Jun. 27, 2014 (now U.S. Pat. No. 9,448,847), which claims the benefit and priority of the following provisional applications:
- [1] U.S. Provisional Application No. 61/934,747 filed Feb. 1, 2014; and
- [2] U.S. Provisional Application No. 61/869,646 filed Aug. 23, 2013;
- This application is also related to the following co-pending or patented applications:
- [3] U.S. Utility application Ser. No. 13/184,028, filed Jul. 15, 2011;
- [4] U.S. Utility application Ser. No. 13/270,194, filed Oct. 10, 2011;
- [5] U.S. Utility application Ser. No. 13/277,739, filed Nov. 21, 2011;
- [6] U.S. Utility application Ser. No. 13/297,455, filed Nov. 16, 2011;
- [7] U.S. Utility application Ser. No. 13/684,473, filed Nov. 23, 2012;
- [8] U.S. Utility application Ser. No. 13/717,649, filed Dec. 17, 2012;
- [9] U.S. Utility application Ser. No. 13/901,566, filed May 24, 2013; and
- [10] U.S. Utility application Ser. No. 13/906,159, filed May 30, 2013.
- All above identified applications are hereby incorporated by reference in their entireties for all purposes.
- This invention pertains to the field of information processing, particularly to techniques for managing execution of multiple concurrent, multi-task software programs on parallel processing hardware.
- Conventional microprocessor and computer system architectures rely on system software for handling runtime matters relating to sharing processing resources among multiple application programs and their instances, tasks etc., as well as orchestrating the concurrent (parallel and/or pipelined) execution between and within the individual applications sharing the given set of processing resources. However, the system software itself consumes ever increasing portions of the system processing capacity as the number of applications, their instances and tasks, and the pooled processing resources grow, and as the optimization of dynamic resource management among the applications and their tasks needs to be performed ever more frequently in response to variations in the processing loads of the applications, their instances and tasks, and other variables of the processing environment. As such, the conventional approaches for supporting dynamic execution of concurrent programs on shared processing capacity pools will not scale well.
- This presents significant challenges to the scalability of the networked utility ('cloud') computing model, in particular as there will be a continuously increasing need for greater degrees of concurrent processing also at intra-application levels, in order to increase an individual application's on-time processing throughput performance without the automatic speed-up from rising processor clock rates, which is no longer available due to the practical physical and economic constraints faced by semiconductor and other hardware implementation technologies.
- To address the challenges per above, there is a need for inventions enabling scalable, multi-application dynamic concurrent execution on parallel processing systems, with high resource utilization efficiency and high application processing on-time throughput performance, as well as built-in, architecture based security and reliability.
- An aspect of the invention provides systems and methods for arranging secure and reliable, concurrent execution of a set of internally parallelized and pipelined software programs on a pool of processing resources shared dynamically among the programs, wherein the dynamic sharing of the resources is based at least in part on i) processing input data loads for instances and tasks of the programs and ii) contractual capacity entitlements of the programs.
- An aspect of the invention provides methods and systems for intelligent, destination task defined prioritization of inter-task communications (ITC) for a computer program, for architectural ITC performance isolation among a set of programs executing concurrently on a dynamically shared data processing platform, as well as for prioritizing instances of the program tasks for execution at least in part based on which of the instances have available to them their input data, including ITC data, enabling any given one of such instances to execute at the given time.
- An aspect of the invention provides a system for prioritizing instances of a software program for execution. Such a system comprises: 1) a subsystem for determining which of the instances are ready to execute on an array of processing cores, at least in part based on whether a given one of the instances has available to it input data to process, and 2) a subsystem for assigning a subset of the instances for execution on the array of cores based at least in part on the determining. Various embodiments of that system include further features such as features whereby a) the input data is from a data source to which the given instance has assigned a high priority for purposes of receiving data; b) the input data is such data that it enables the given program instance to execute; c) the subset includes cases of none, some as well as all of the instances of said program; d) the instance is: a process, a job, a task, a thread, a method, a function, a procedure or an instance of any of the foregoing, or an independent copy of the given program; and/or e) the system is implemented by hardware logic that is able to operate without software involvement.
- An aspect of the invention provides a hardware logic implemented method for prioritizing instances of a software program for execution, with such a method involving: classifying instances of the program into the following classes, listed in the order from higher to lower priority for execution, i.e., in their reducing execution priority order: (I) instances indicated as having high priority input data for processing, and (II) any other instances. Various embodiments of that method include further steps and features such as features whereby a) the other instances are further classified into the following sub-classes, listed in their reducing execution priority order: (i) instances indicated as able to execute presently without the high priority input data, and (ii) any remaining instances; b) the high priority input data is data that is from a source from which the destination instance, of said program, is expecting high priority input data; c) a given instance of the program comprises tasks, with one of said tasks referred to as a destination task and others as source tasks of the given instance, and for the given instance, a unit of the input data is considered high priority if it is from such one of the source tasks that the destination task has assigned a high priority for inter-task communications to it; d) for any given one of the instances, a step of computing a number of its non-empty source task specific buffers, among its input data buffers, that belong to source tasks of the given instance indicated at the time as high priority source tasks for communications to the destination task of the given instance, with this number referred to as an H number for its instance, and wherein, within the class (I), the instances are prioritized for execution at least in part according to magnitudes of their H numbers, in descending order such that an instance with a greater H number is prioritized before an instance with a lower H number; e) in case of two or more of the instances tied for the greatest H number, such tied instances are prioritized at least in part according to their respective total numbers of non-empty input data buffers; and/or f) at least one of the instances is either a process, a job, a task, a thread, a method, a function, a procedure, or an instance of any of the foregoing, or an independent copy of the given program.
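- As a concrete reading of the classification above, the following sketch (a hypothetical helper, not from the patent text) orders instances: class (I) by descending H number with ties broken by total non-empty input buffers, then sub-class (II)(i), then the rest.

```python
def execution_priority_order(instances):
    # instances: iterable of (inst_id, H, total_nonempty, can_execute) tuples,
    # where H counts non-empty buffers fed by high-priority source tasks.
    def rank(rec):
        inst_id, H, total_nonempty, can_execute = rec
        if H > 0:                 # class (I): high priority input data waiting
            return (0, -H, -total_nonempty)
        if can_execute:           # sub-class (II)(i): executable without it
            return (1, 0, -total_nonempty)
        return (2, 0, 0)          # sub-class (II)(ii): any remaining instances
    return [rec[0] for rec in sorted(instances, key=rank)]
```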
- An aspect of the invention provides a system for processing a set of computer program instances, with inter-task communications (ITC) performance isolation among the set of program instances. Such a system comprises: 1) a number of processing stages; and 2) a group of multiplexers connecting ITC data to a given stage among the processing stages, wherein a multiplexer among said group is specific to one given program instance among said set. The system hosts each task of the given program instance at a different one of the processing stages, and supports copies of the same task software code being located at more than one of the processing stages in parallel. Various embodiments of this system include further features such as a) a feature whereby at least one of the processing stages comprises multiple processing cores such as CPU execution units, with, for any of the cores, at any given time, one of the program instances assigned for execution; b) a set of source task specific buffers for buffering data destined for a task of the given program instance located at the given stage, referred to as a destination task, and hardware logic for forming a hardware signal indicating whether sending ITC is presently permitted to a given buffer among the source task specific buffers, with such forming based at least in part on a fill level of the given buffer, and with such a signal being connected to the source task to which the given buffer is specific; c) a feature providing, for the destination task, a set of source task specific buffers, wherein a given buffer is specific to one of the other tasks of the program instance for buffering ITC from said other task to the destination task; d) a feature wherein the destination task provides ITC prioritization information for other tasks of the program instance located at their respective ones of the stages; e) a feature whereby the ITC prioritization information is provided by the destination task via a set of one or more hardware registers, with each register of the set specific to one of the other tasks of the program instance, and with each register configured to store a value specifying a prioritization level of the task that it is specific to, for purposes of ITC communications to the destination task; f) an arbitrator controlling from which source task of the program instance the multiplexer specific to that program instance will read its next ITC data unit for the destination task; and/or g) a feature whereby the arbitrator prioritizes source tasks of the program instance for selection by the multiplexer to read its next ITC data unit based at least in part on at least one of: (i) source task specific ITC prioritization information provided by the destination task, and (ii) source task specific availability information of ITC data for the destination task from the other tasks of the program instance.
- Accordingly, aspects of the invention involve application-program instance specific hardware logic resources for secure and reliable ITC among tasks of application program instances hosted at processing stages of a multi-stage parallel processing system. Rather than seeking to inter-connect the individual processing stages or cores of the multi-stage manycore processing system as such, the invented mechanisms efficiently inter-connect the tasks of any given application program instance using the per application program instance specific inter-processing-stage ITC hardware logic resources. Due to the ITC being handled with such application program instance specific hardware logic resources, the ITC performance experienced by one application instance does not depend on the ITC resource usage (e.g. data volume and inter-task communications intensiveness) of the other applications sharing the given data processing system per the invention. This results in effective inter-application isolation for ITC in a multi-stage parallel processing system shared dynamically among multiple application programs.
- An aspect of the invention provides systems and methods for scheduling instances of software programs for execution based at least in part on (1) availability of input data of differing priorities for any given one of the instances and/or (2) availability, on their fast-access memories, of memory contents needed by any given one of the instances to execute.
- An aspect of the invention provides systems and methods for optimally allocating and assigning the input port capacity of a data processing system among data streams of multiple software programs, based at least in part on the input data load levels and contractual capacity entitlements of the programs.
- An aspect of the invention provides systems and methods for the resolution of resource access contentions, for resources including computing, storage and communication resources such as memories, queues, ports or processors. Such methods enable multiple potential user systems of a shared resource to avoid conflicting resource access decisions in a coordinated and fair manner, even while multiple user systems are deciding on access to a set of shared resources concurrently, including at the same clock cycle.
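As a purely illustrative sketch (not the patented mechanism itself), a deterministic rotating-priority rule gives the flavor of such same-cycle contention resolution: if every party evaluates the same rule over the same inputs, all reach the same non-conflicting grant, and the rotating offset keeps access fair over time. All names here are hypothetical:

```python
# Hedged sketch of coordinated, fair contention resolution for one clock cycle.
def grant(requests: dict, rotation: int) -> dict:
    """requests: {resource_id: [requester_id, ...]} collected for one cycle.
    Returns {resource_id: requester_id}, granting each contended resource to
    exactly one requester; the rotating offset spreads grants fairly across
    cycles. Deterministic, so concurrent deciders cannot disagree."""
    grants = {}
    for res, reqs in requests.items():
        if reqs:
            ordered = sorted(reqs)  # a fixed, agreed-upon ordering
            grants[res] = ordered[rotation % len(ordered)]
    return grants
```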
- An aspect of the invention provides systems and methods for load balancing, whereby the load balancer is configured to forward, by its first layer, any packets it receives from its network input that have no destination instance specified within their destination application (referred to as no-instance-specified packets, or NIS packets for short) to the one of the processing systems in the local load balancing group that presently has the highest score for accepting NIS packets for the destination app of the given NIS packet. The load balancers further have destination processing system (i.e., for each given application, instance group) specific sub-modules, which, for NIS packets forwarded to them by the first layer balancing logic, specify a destination instance, among the available, presently inactive instance resources of the destination app of a given NIS packet, to which to forward the given NIS packet. In at least some embodiments of the invention, the score for accepting NIS packets for a destination processing system among the load balancing group is based at least in part on the amount of presently inactive instance resources at the given processing system for the destination application of a given NIS packet.
-
FIG. 1 shows, in accordance with an embodiment of the invention, a functional block diagram for a load balancing architecture for a bank of processor systems, such as those discussed in the following with reference to the remaining FIGS. -
FIG. 2 shows, in accordance with an embodiment of the invention, a functional block diagram for a multi-stage manycore processing system shared dynamically among a set of software program instances, with the system providing capabilities for optimally scheduling inter-task communications (ITC) units between various tasks of any one of the program instances, as well as scheduling and placing instances of a given program task for execution on the processing stages of the system, at least in part based on which of the instances have available for them the input data, e.g. ITC data, needed by them to execute. -
FIG. 3 shows, in accordance with an embodiment of the invention, a functional block diagram for a receive (RX) logic module of any of the processing stages of the multi-stage manycore processor system per FIG. 2. -
FIG. 4 shows, in accordance with an embodiment of the invention, a functional block diagram for an application program specific submodule of the processing stage RX logic module per FIG. 3. -
FIG. 5 shows, in accordance with an embodiment of the invention, a functional block diagram for an application program instance specific submodule of the application program specific submodule per FIG. 4. -
FIG. 6 shows, in accordance with an embodiment of the invention, a functional block diagram for logic resources within one of the processing stages of a system 1 per FIG. 2 for connecting ITC data from input buffers of the RX logic (per FIGS. 3-5) to the manycore processor of the local processing stage. -
FIG. 7 shows, in accordance with an embodiment of the invention, a functional block diagram for the application load adaptive manycore processor of a processing stage of the multi-stage processing system per the preceding FIGS.
- The FIGS. and related descriptions in the following provide specifications for embodiments and aspects of hardware-logic based systems and methods for inter-task communications (ITC) with destination task defined source task prioritization, for input data availability based prioritization of instances of a given application task for execution on processing cores of a processing stage hosting the given task, for architecture-based application performance isolation for ITC in a multi-stage manycore data processing system, as well as for load balancing of incoming processing data units among a group of such processing systems.
- The invention is described herein in further detail by illustrating the novel concepts in reference to the drawings. General symbols and notations used in the drawings:
-
- Boxes indicate a functional module comprising digital hardware logic.
- Arrows indicate a digital signal flow. A signal flow may comprise one or more parallel bit wires. The direction of an arrow indicates the direction of the primary flow of information associated with it with regard to the discussion of the system functionality herein, but does not preclude information flow also in the opposite direction.
- A dotted line marks a border of a group of drawn elements that form a logical entity with internal hierarchy.
- An arrow reaching the border of a hierarchical module indicates connectivity of the associated information to/from all sub-modules of the hierarchical module.
- Lines or arrows crossing in the drawings are decoupled unless otherwise marked.
- For clarity of the drawings, signals generally present for typical digital logic operation, such as clock signals, or the enable, address and data bit components of write or read access buses, are not shown in the drawings.
- General notes regarding this specification (incl. text in the drawings):
-
- For brevity: ‘application (program)’ is occasionally written as ‘app’, ‘instance’ as ‘inst’ and ‘application-task/instance’ as ‘app-task/inst’, and so forth.
- Terms software program, application program, application and program are used interchangeably in this specification, and each generally refers to any type of executable computer program.
- In FIG. 5, and throughout the related discussions, the buffers 260 are considered to be First-in First-Out buffers (FIFO); however, buffer types other than first-in first-out can also be used in various embodiments.
- Illustrative embodiments and aspects of the invention are described in the following with references to the FIGS.
-
FIG. 1 presents the load balancing architecture for a row of processing systems per this description, comprising a set 4 of T load balancers 3 and a load balancing group 2 of S processing systems 1 (T and S are positive integers). Per this architecture, each of the balancers forwards any no-instance-specified (NIS) packets (i.e. packets without a specific instance of their destination application identified) arriving to it via its network inputs to one of the processing systems of the group, based on the NIS packet forwarding preference scores (for the destination app of the given NIS packet) of the individual processing systems of the load balancing group 2.
- The load balancing per FIG. 1 for a bank 2 of the processing systems operates as follows:
- The processing systems 1 count, for each of the application programs (apps) hosted on them:
- a number X of their presently inactive instance resources, i.e., the number of additional parallel instances of the given app at the given processing system that could be activated at the time; and
- from the above number, the portion Y (if any) of the additional activatable instances within the Core Entitlement (CE) level of the given app, wherein the CE is a number of processing cores at (any one of) the processing stages of the given processing system up to which the app in question is assured to get its requests for processing cores (to be assigned for its active instances) met;
- the difference W = X − Y. The quantities X and/or W and Y, per each of the apps hosted on the load balancing group 2, are signaled 5 from each processing system 1 to the load balancers 4.
- In addition, load balancing logic 4 computes the collective sum Z of the Y numbers across all the apps (with this across-apps sum Z naturally being the same for all apps on a given processing system).
- From the above numbers, for each app, the load balancer module 4 computes a no-instance-specified (NIS) packet forwarding preference score (NIS score) for each processing system in the given load balancing group with the formula A*Y + B*W + C*Z, where A, B and C are software-programmable coefficients, defaulting to e.g. A=4, B=1 and C=2. (A sketch of this scoring and of the resulting forwarding decision follows this bullet list.)
- In forming the NIS scores for a given app (by the formula per above), a given instance of the app under study is deemed available for NIS packets at times that the app-instance software has set an associated device register bit (specific to that app-inst) to an active value, and unavailable otherwise. The multiplexing (muxing) mechanism used to connect the app-instance software, from whichever core at its host manycore processor it may be executing at any given time, to its app-instance specific memory, is used also for connecting the app-instance software to its NIS-availability control device register.
- The app-instance NIS availability control register of a given app-instance is also reset automatically by the processing stage RX logic hardware whenever there is data at the input buffer for the given app-instance (even when the app-instance software would otherwise still keep its NIS availability control register at its active state).
- Each of the processing systems in the given load balancing group signals its NIS scores for each app hosted on the load balancing group to each of the load balancers 4 in front of the row 2 of processing systems. Also, the processing systems 1 provide to the load balancers app specific vectors (as part of info flows 9) indicating which of their local instance resources of the given app are available for receiving NIS packets (i.e. packets with no destination instance specified).
- Data packets from the network inputs 10 to the load balancing group include bits indicating whether a given packet is a NIS packet, i.e. one that has its destination app, but no particular instance of that app, specified. The load balancer 3 forwards any NIS packet it receives from its network input 10 to the processing system 1 in the local load balancing group 2 with the highest NIS score for the destination app of the given NIS packet. (In case of ties among the processing systems for the NIS score for the given destination app, the logic forwards the packet to the tied system with e.g. the lowest ID #.) The forwarding of a NIS packet to a particular processing system 1 (in the load balancing group 2 of such systems) is done by this first layer of load balancing logic by forming packet write enable vectors, where each given bit is a packet write enable bit specific to the processing system whose system index # within the given load balancing group equals the index of the given bit in the write enable bit vector. For example, the processing system of ID # 2 in a load balancing group of processing systems of ID # 0 through ID # 4 takes the bit at index 2 of the packet write enable vectors from the load balancers of the given group. In a straightforward scheme, the processing system #K within a given load balancing group hosts the instance group #K of each of the apps hosted by this group of the processing systems (where K = 0, 1, . . . , max nr of processing systems in the load balancing group less 1).
- The load balancers 3 further have destination processing system 1 (i.e., for each given app, instance group) specific submodules, which, for NIS packets forwarded to them by the first layer balancing logic (per above), specify a destination instance, among the available (presently inactive) instance resources of the destination app of a given NIS packet, to which to forward the given NIS packet. In a straightforward scheme, for each given NIS packet forwarded to it, this instance group specific load balancing submodule selects, from the at-the-time available instances of the destination app within the instance group that the given submodule is specific to, the instance resource with the lowest ID #.
- For other (non-NIS) packets, the load balancer logic 3 simply forwards a given packet to the processing system 1 in the load balancing group 2 that hosts, for the destination app of the given packet, the instance group of the identified destination instance of the packet.
- According to the forwarding decisions per the above bullet points, the (conceptual, actually distributed per the destination processing systems) packet switch module 6 filters packets from the output buses 15 of the load balancers 3 to the input buses 19 of the destination processing systems, so that each given processing system 1 in the load balancing group 2 receives as active packet transmissions (marked e.g. by write enable signaling) on its input bus 19, from the packets arriving from the load balancer inputs 10, those packets that were indicated as destined to the given system 1 at entry to the load balancers, as well as the NIS packets that the load balancers of the set 4 forwarded to that given system 1.
- Note also that the network inputs 10 to the load balancers, as well as all the bold data path arrows in the FIGS., may comprise a number of parallel (e.g. 10 Gbps) ports.
- The load balancing logic implements coordination among port modules of the same balancer, so that any given NIS packet is forwarded, according to the above destination instance selection logic, to one of the app-instances that is not, at the time of the forwarding decision, already being forwarded a packet (incl. forwarding decisions made at the same clock cycle) by port modules with higher preference rank (e.g. based on lower port #) of the same balancer. Note that each processing system supports receiving packets destined for the same app-instance concurrently from different load balancers (as explained below).
- The load balancers 3 support, per each app-inst, a dedicated input buffer per each of the external input ports (within the buses 10) to the load balancing group. The system thus supports multiple packets being received (both via the same load balancer module 3, as well as across the different load balancer modules per FIG. 1) simultaneously for the same app-instances via multiple external input ports. From the load balancer input buffers, data packets are muxed to the processing systems 1 of the load balancing group so that the entry stage processor of each of the multi-stage systems (see FIG. 2) in such a group receives data from the load balancers similarly as the non-entry-stage processors receive data from the other processing stages of the given multi-stage processing system, i.e., in a manner such that the entry stage (like the other stages) will get data, per each of its app-instances, at most via one of its input ports per a (virtual) source stage at any given time; the load balancer modules of the given load balancing group (FIG. 1) thus appear as virtual source processing stages to the entry stage of the multi-stage processing systems of such a load balancing group. The aforesaid functionality is achieved by logic at module 4 as detailed below:
- To eliminate packet drops in cases where packets directed to the same app-inst arrive in a time-overlapping manner through multiple input ports (within the buses 10) of the same balancer 3, destination processing system 1 specific submodules at modules 3 buffer input data 15 destined for the given processing system 1 at app-inst specific buffers, and assign the processing system 1 input ports (within the bus 19 connecting to their associated processing system 1) among the app-insts so that each app-inst is assigned, at any given time, at most one input port per a load balancer 3. (Note that inputs to a processing system 1 from different load balancers 3 are handled by the entry stage (FIG. 2) the same way as the other processing stages 300 handle inputs from different source stages, as detailed in connection to FIG. 5, in a manner that supports concurrent reception of packets to the same destination app-inst from multiple source stages.) More specifically, the port capacity 19 for transfer of data from the load balancers 4 to the given processing system 1 entry-stage buffers gets assigned using the same algorithm as is used for the assignment of processing cores among the app-instances at the processing stages (FIG. 7), i.e., in a realtime input data load adaptive manner, while honoring the contractual capacity entitlements and fairness among the apps for actually materialized demands. This algorithm, which allocates at most one of the cores per each of the app-insts for the core allocation periods following each of its runs, and similarly assigns at most one of the ports at the buses 19 to the given processing system 1 per each of the app-inst specific buffers queuing data destined for that processing system from any given source load balancer 3, is specified in detail in [1], Appendix A, Ch. 5.2.3. By this logic, the entry stage of the processing system (FIG. 2) will get its input data the same way as the other stages, and there thus is no need to prepare for cases of multiple packets to the same app-inst arriving simultaneously at any destination processing stage from any of its source stages or load balancers. This logic also ensures that any app with moderate input bandwidth consumption will get its contractually entitled share of the processing system input bandwidth (i.e. the logic protects moderate bandwidth apps from more input data intensive neighbors).
- Note that since packet transfer within a load balancing group (incl. within the sub-modules of the processing systems) is between app-instance specific buffers, with all the overhead bits (incl. the destination app-instance ID) transferred and buffered as parallel wires besides the data, core allocation period (CAP) boundaries will not break the packets while they are being transferred from the load balancer buffers to a given processing system 1 or between the processing stages of a given multi-stage system 1.
- The mechanisms per the above three bullet points are designed to eliminate all packet drops in the system that are avoidable by system design, i.e., for reasons other than app-instance specific buffer overflows caused by systemic mismatches between the input data load to a given app-inst and the capacity entitlement level subscribed to by the given app.
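The first-layer scoring and forwarding decision described in the bullets above can be illustrated with the following sketch. It is a software approximation of the hardware logic; the data layout and helper names are assumptions, while the A, B, C defaults are those given above:

```python
# Illustrative sketch of the NIS score and first-layer forwarding decision.
A, B, C = 4, 1, 2   # software-programmable coefficients, defaults per above

def nis_score(y: int, w: int, z: int) -> int:
    """NIS score = A*Y + B*W + C*Z for one (app, processing system) pair,
    with Y, W per the destination app and Z summed across the system's apps."""
    return A * y + B * w + C * z

def pick_destination_system(per_system_ywz: list[tuple[int, int, int]]) -> int:
    """per_system_ywz[k] = (Y, W, Z) for the destination app at processing
    system #k. Returns the system ID with the highest NIS score; ties break
    toward the lowest system ID # (max() keeps the first, i.e. lowest, index)."""
    scores = [nis_score(y, w, z) for (y, w, z) in per_system_ywz]
    return max(range(len(scores)), key=lambda k: scores[k])
```

For example, with two systems scoring equally, the NIS packet goes to system #0, matching the lowest-ID tie-break rule stated above.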
-
FIG. 2 provides, according to an embodiment of the invention, a functional block diagram for a multi-stage manycore processor system 1 shared dynamically among multiple concurrent application programs (apps), with hardware logic implemented capabilities for scheduling tasks of application program instances and prioritizing inter-task communications (ITC) among tasks of a given app instance, based at least in part on, for any given app-inst at a given time, which tasks are expecting input data from which other tasks and which tasks are ready to execute on cores of the multi-stage manycore processing system, with the ready-to-execute status of a given task being determined at least in part based on whether the given task has available to it the input data from other tasks or system 1 inputs 19 so as to enable it to execute at the given time, including producing its processing outputs, such as ITC communications 20 to other tasks, or program processing results and other communications for external parties via external outputs 50. Operation and internal structure and elements of FIG. 2, other than for the aspects described herein, are, according to at least some embodiments of the invention, per [1], which the reader may review before this specification for context and background material.
- In the architecture per FIG. 2, the multi-stage manycore processor system 1 is shared dynamically among tasks of multiple application programs (apps) and instances (insts) thereof, with, for each of the apps, each task located at one of the manycore processor based processing stages 300. Note however that, for any given app-inst, copies of the same task software (i.e. copies of the same software code) can be located at more than one of the processing stages 300 of a given system 1; thus the architecture per FIG. 2, with its any-to-any ITC connectivity between the stages 300, supports organizing the tasks of a program flexibly for any desirable mixes or matches of pipelined and/or parallelized processing.
- General operation of the application load adaptive, multi-stage parallel data processing system per FIG. 2, focusing on the main inputs-to-outputs data flows, is as follows: The system provides data processing services to be used by external parties (e.g. by clients of the programs hosted on the system) over networks. The system 1 receives data units (e.g. messages, requests, data packets or streams to be processed) from its users through its inputs 19, and transmits the processing results to the relevant parties through its network outputs 50. Naturally the network ports of the system of FIG. 2 can be used also for connecting with other (intermediate) resources and services (e.g. storage, databases etc.) as desired for the system to produce the requested processing results to the relevant external parties.
- The application program tasks executing on the entry stage manycore processor are typically of ‘master’ type for parallelized/pipelined applications, i.e., they manage and distribute the processing workloads for ‘worker’ type tasks running (in pipelined and/or parallel manner) on the worker stage manycore processing systems (note that the processor system hardware is similar across all instances of the processing stages 300). The instances of master tasks typically do preliminary processing (e.g. message/request classification, data organization) and workflow management based on given input data units (packets), and then typically involve appropriate worker tasks at their worker stage processors to perform the data processing called for by the given input packet, potentially in the context of and in connection with other related input packets and/or other data elements (e.g. in memory or storage resources accessible by the system) referred to by such packets. (The processors have access to system memories through interfaces additional to the IO ports shown in FIG. 2, e.g. as described in [1], Appendix A, Ch. 5.4.) Accordingly, the master tasks typically pass on the received data units (using direct connection techniques to allow most of the data volumes being transferred to bypass the actual processor cores) through the (conceptual) inter-stage packet-switch (PS) to the worker stage processors, with the destination application-task instance (and thereby, the destination worker stage) identified for each data unit as described in the following.
- To provide isolation among the different applications configured to run on the processors of the system, by default the hardware controller of each processor 300, rather than any application software (executing on a given processor), inserts the application ID # bits for the data packets passed to the PS 200. That way, the tasks of any given application running on the processing stages in a system can trust that the packets they receive from the PS are from their own application. Note that the controller determines, and therefore knows, the application ID # that each given core within its processor is assigned to at any given time, via the application-instance to core mapping info that the controller produces. Therefore the controller is able to insert the presently-assigned app ID # bits for the inter-task data units being sent from the cores of its processing stage over the core-specific output ports to the PS.
- While the processing of any given application (server program) at a system per FIG. 2 is normally parallelized and/or pipelined, and involves multiple tasks (many of which tasks and instances thereof can execute concurrently on the manycore arrays of the processing stages 300), the system enables external parties to communicate with any such application hosted on the system without knowledge about any specifics (incl. existence, status, location) of its internal tasks or instances. As such, the incoming data units to the system are expected to identify just their destination application, and when applicable, the application instance. Moreover, the system enables external parties to communicate with any given application hosted on a system through any of the network input ports 10 of any of the load balancers 3, without such external parties knowing whether or at which cores 520 (FIG. 7) or processing stages 300 any instance of the given application task (app-task) may be executing at any time.
- Notably, the architecture enables the aforesaid flexibility and efficiency through its hardware logic functionality, so that no system or application software running on the system needs to either keep track of whether or where any of the instances of any of the app-tasks may be executing at any given time, or which port any given inter-task or external communication may have used. Thus the system, while providing a highly dynamic, application workload adaptive usage of the system processing and communications resources, allows the software running on and/or remotely using the system to be designed with a straightforward, abstracted view of the system: the software (both remote and local programs) can assume that all the applications, and all their tasks and instances, hosted on the given system are always executing on their virtual dedicated processor cores within the system. Also, where useful, said virtual dedicated processors can be considered by software to be time-share slices on a single (unrealistically high speed) processor.
- The presented architecture thereby enables achieving, at the same time, both the vital application software development productivity benefit (a simple, virtually static view of the actually highly dynamic processing hardware) together with the high program runtime performance (scalable concurrent program execution with minimized overhead) and resource efficiency (adaptively optimized resource allocation) benefits. Techniques enabling such benefits of the architecture are described in the following through a more detailed technical description of the system 1 and its subsystems.
- The any-to-any connectivity among the app-tasks of all the processing stages 300 provided by the PS 200 enables organizing the worker tasks (located at the array of worker stage processors) flexibly to suit the individual demands (e.g. task inter-dependencies) of any given application program on the system: the worker tasks can be arranged to conduct the work flow for the given application using any desired combinations of parallel and pipelined processing. E.g., it is possible to have the same task of a given application located on any number of the worker stages in the architecture per FIG. 2, to provide a desired number of parallel copies of a given task per an individual application instance, i.e. to support also data-parallelism, along with task concurrency.
- The set of applications configured to run on the system can have their tasks identified by (intra-app) IDs according to their descending order of relative (time-averaged) workload levels. Under such an (intra-app) task ID assignment principle, the sum of the intra-application task IDs, each representing the workload ranking of its task within its application, of the app-tasks hosted at any given processing stage is equalized by appropriately configuring the tasks of differing ID #s, i.e. of differing workload levels, across the applications for each processing stage, to achieve optimal overall load balancing. For instance, in the case of T=4 worker stages, if the system is shared among M=4 applications and each of that set of applications has four worker tasks, then for each application of that set, the busiest task (i.e. the worker task most often called for or otherwise causing the heaviest processing load among the tasks of the app) is given task ID # 0, the second busiest task ID # 1, the third busiest ID # 2, and the fourth ID # 3. To balance the processing loads across the applications among the worker stages of the system, the worker stage #t gets task ID #t+m (rolling over at 3 to 0) of the application ID #m (t=0, 1, . . . , T−1; m=0, 1, . . . , M−1) (note that the master task ID # 4 of each app is located at the entry/exit stages). In this example scenario of four application streams, four worker tasks per app, as well as four worker stages, the above scheme causes the task IDs of the set of apps to be placed at the processing stages per Table 1 below (a code sketch of this placement rule follows the table):
TABLE 1
(cell = intra-app task ID # placed, per task ID #(t+m) rolling over at 3 to 0)

  Processing        App ID# m
  worker stage# t   0   1   2   3
  0                 0   1   2   3
  1                 1   2   3   0
  2                 2   3   0   1
  3                 3   0   1   2

- As seen in the example of Table 1, the sum of the task ID #s (with each task ID # representing the workload ranking of its task within its app) is the same for any row, i.e. for each worker stage. This load balancing scheme can be straightforwardly applied for differing numbers of processing stages/tasks and applications, so that the overall task processing load is to be, as much as possible, equal across all worker-stage processors of the system. Advantages of such schemes include achieving optimal utilization efficiency of the processing resources and eliminating, or at least minimizing, the possibility and effects of any of the worker-stage processors forming system-wide performance bottlenecks.
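The placement rule behind Table 1 can be expressed compactly as follows (an illustration of the stated rule, not a normative implementation; function names are hypothetical):

```python
# Sketch of the task-to-stage placement rule of Table 1: worker stage #t
# hosts intra-app task ID #(t+m) mod T of app ID #m, so each stage's task
# IDs sum to the same value, balancing the time-averaged load per stage.
def placement_table(T: int, M: int) -> list[list[int]]:
    """Rows: worker stage #t; columns: app ID #m; cell: intra-app task ID."""
    return [[(t + m) % T for m in range(M)] for t in range(T)]

for row in placement_table(4, 4):
    print(row, "row sum:", sum(row))   # every row sums to 6 for T = M = 4
```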
- A non-exclusive alternative task-to-stage placement principle targets grouping tasks from the apps in order to minimize any variety among the processing core types demanded by the set of app-tasks placed on any given individual processing stage; that way, if all app-tasks placed on a given processing stage optimally run on the same processing core type, there is no need for reconfiguring the core slots of the manycore array at the given stage regardless of which of the locally hosted app-tasks get assigned to which of its core slots (see [1], Appendix A, Ch. 5.5 for task type adaptive core slot reconfiguration, which may be used when the app-tasks located on the given processing stage demand different execution core types). A sketch of such grouping follows.
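As a hedged illustration of this alternative principle (a greedy strategy under assumptions of one stage per core type and enough stages; all names are hypothetical):

```python
# Illustrative sketch: group app-tasks onto stages by demanded core type,
# so each stage hosts tasks of a single core type and needs no core slot
# reconfiguration. This is a simplification of the placement principle above.
from collections import defaultdict

def group_by_core_type(app_tasks):
    """app_tasks: iterable of (app_id, task_id, core_type) tuples. Returns
    {stage_index: [task, ...]} with exactly one demanded core type per stage."""
    by_type = defaultdict(list)
    for task in app_tasks:
        by_type[task[2]].append(task)          # bucket tasks by core type
    return dict(enumerate(by_type.values()))   # one stage per core type bucket
```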
-
FIGS. 3-5 present the processing stage, app, and app-instance level microarchitectures for the processing stage receive (RX) logic modules 201 (which collectively accomplish the functionality of the conceptual inter-stage packet-switch (PS) module of FIG. 2).
- For a system of FIG. 2, note that the functionality of the conceptual inter-stage PS 200 is actually realized by instantiating the logic per FIG. 3 (and its submodules) as the RX logic of each manycore processing system 300 (referred to as a stage) in the multi-stage architecture; no other logic is needed for the PS. Accordingly, in the hardware implementation, the stage RX logic 201 per FIGS. 3-5 is part of the processing stage 300 that it interfaces to; i.e., in an actual hardware implementation there is no separate PS module, as its functionality is distributed to the individual processing stages.
- Besides the division of the app-specific submodules 202 of the stage RX logic per FIG. 3 further to the array 410 of app-instance specific sub-modules 203, FIG. 4 shows how the app-specific RX logic forms, for purposes of optimally assigning the processing cores of the local manycore processor among insts of the apps sharing the system, the following info for the given app:
- Formation of a request for a number of processing cores (Core Demand Figure, CDF) at the local processing stage by the given app. The logic forms the CDF for the app based on the number of instances of the app that presently have (1) input data at their input buffers (with those buffers located at the instance specific stage
RX logic submodules 203 perFIG. 5 ) and (2) their on-chip fast-access memory contents ready for the given instance to execute without access to the slower-access off-chip memories. InFIG. 4 , (1) and (2) per above are signaled to the app-specific RX logic module 209 via the info flows 429 and 499 from the app-inst specific modules 203 (FIG. 5 ) and 800 (FIG. 7 ), respectively, per each of the insts of the app under study. - The priority order of instances of the app for purposes of selecting such instances for execution on the cores of the local manycore processor.
The info per the above two bullet points are sent from theRX logic 202 of each app via theinfo flow 430 to the controller 540 (FIG. 7 ) of the localmanycore processor 500, for the controller to assign optimal sets of the app-insts for execution on thecores 520 of theprocessor 500.
- Formation of a request for a number of processing cores (Core Demand Figure, CDF) at the local processing stage by the given app. The logic forms the CDF for the app based on the number of instances of the app that presently have (1) input data at their input buffers (with those buffers located at the instance specific stage
- The app-instance specific RX logic per
FIG. 5 performs multiplexing 280 ITC packets from the source stage i.e. source task (of a given app-inst) specific First-in First-Out buffers (FIFOs) 260 to the local manycore processor via theinput port 290 of that processor dedicated to the given app instance. - Note that when considering the case of RX logic of the entry-stage processing system of the multi-stage architecture per
FIG. 4.1 , note that inFIG. 5 and associated descriptions the notion of source stage/task naturally is replaced by the source load balancer, except in case of theITC 20 from the exit stage to entry-stage, in which case the data source naturally is the exit stage processing system. However, the same actual hardware logic is instantiated for each occurrence of the processing stages 300 (incl. for theRX logic 201 of each stage) in this multi-stage architecture, and thus the operation of the stage RX logic can be fully explained by (as is done in the following) by assuming that the processing stage under study is instantiated as a worker or exit stage processing system, such that receives its input data from the other processing stages of the given multi-stage manycore processor, rather than from the load balancers of the given load balancing group, as in the case of the entry-stage processors; the load balancers appear to the entry-stage as virtual processing stages. Accordingly, when the RX logic of the entry stage manycore processor is considered, the references to ‘source stage’ are to be understood as actually referring to load balancers, and the references to ITC meaninput data 19 to the multi-stage manycore processor system—except in case of theITC 20 from the exit stage, as detailed above and as illustrated inFIG. 2 . With this caveat, the description of the stage RX logic herein is written considering the operating context of worker and exit stage processors (with the same hardware logic being used also for the entry-stage). - Before the actual multiplexer, the app-instance specific RX logic per
FIG. 5 has aFIFO module 245 per each of the source stages. The source-stage specific FIFO module comprises: -
- The
actual FIFO 260 for queuing packets from its associated source stage that are destined to the local task of the app-instance that the given module perFIG. 5 is specific to. - A write-side multiplexer 250 (to the above referred FIFO) that (1) takes as its
data inputs 20 the processing core specific data outputs 210 (seeFIG. 7 ) from the processing stage that the given source-stage specific FIFO module is specific to, (2) monitors (via the data input overhead bits identifying the app-instance and destination task within it for any given packet transmission) from which one of its input ports 210 (within the bus 20) it may at any given time be receiving a packet destined to the local task of the app-instance that the app-instance specific RX logic under study is specific to, with such an input referred to as the selected input, and (3) connects 255 to itsFIFO queue 260 the packet transmission from the present selected input. Note that at any of the processing stages, at any given time, at most one processing core will be assigned for any given app instance. Thus any of the source stagespecific FIFO modules 245 of the app-instance RX logic perFIG. 5 can, at any given time, receive data destined to the local task of the app-instance that the given app-instance RX logic module is specific to from at most one of the (processing core specific) data inputs of the write-side multiplexer (mux) 250 of the given FIFO module. Thus there is no need for separate FIFOs per each of the (e.g. 16 core specific) ports of thedata inputs 20 at these source stage specific FIFO modules, and instead, just one common FIFO suffices per each given source stagespecific buffering module 245.
For clarity, the “local” task refers to the task of the app-instance that is located at theprocessing stage 300 that the RX logic under study interfaces to, with that processing stage or processor being referred to as the local processing stage or processor. Please recall that per any given app, the individual tasks are located at separate processing stages. Note though that copies of the same task for a given app can be located at multiple processing stages in parallel. Note further that, at any of the processing stages, there can be multiple parallel instances of any given app executing concurrently, as well as that copies of the task can be located in parallel at multiple processing stages of the multi-stage architecture, allowing for processing speed via parallel execution at application as well as task levels, besides between the apps.
- The
- The app-
instance RX module 203 perFIG. 5 further provides arbitratinglogic 270 to decide, at multiplexing packet boundaries 281, from which of the sourcestage FIFO modules 245 to mux 280 out the next packet to the local manycore processor via the processordata input port 290 specific to the app-instance under study. This muxing process operates as follows: - Each given app-instance software provides a
logic vector 595 to the arbitratinglogic 270 of its associated app-instance RX module 203 such that has a priority indicator bit within it per each of its individual source stage specific FIFO modules 245: while a bit of such a vector relating to a particular source stage is at its active state (e.g. logic ‘1’), ITC from the source stage in question to the local task of the app-instance will be considered to be high priority, and otherwise normal priority, by the arbitrator logic in selecting the source stage specific FIFO from where to read the next ITC packet to the local (destination) task of the studied app-instance. - The arbitrator selects the source stage specific FIFO 260 (within the
array 240 of the local app-instance RX module 203) for reading 265, 290 the next packet per the following source priority ranking algorithm: -
- The source priority ranking logic maintains three logic vectors as follows:
- 1) A bit vector wherein each given bit indicates whether a source stage of the same index as the given bit is both assigned by the local (ITC destination) task of the app-instance under study a high priority for ITC to it and has its
FIFO 260 fill level above a configured monitoring threshold; - 2) A bit vector wherein each given bit indicates whether a source stage of the same index as the given bit is both assigned a high priority for ITC (to the task of the studied app-instance located at the local processing stage) and has its FIFO non-empty;
- 3) A bit vector wherein each given bit indicates whether a source stage of the same index as the given bit has its FIFO fill level above the monitoring threshold; and
- 4) A bit vector wherein each given bit indicates whether a source stage of the same index as the given bit has data available for reading.
- The
FIFO 260 fill level and data-availability is signaled inFIG. 5 via info flow 261 per each of the source-stagespecific FIFO modules 245 of the app-instspecific array 240 to thearbitrator 270 of the app-inst RX module, for the arbitrator, together with the its source stageprioritization control logic 285, to select 272 the next packet to read from the optimal source-stage specific FIFO module 245 (as detailed below). - The
arbitrator logic 270 also forms (by logic OR) an indicator bit for each of the above vectors 1) through 4) telling whether the vector associated with the given indicator has any bits in its active state. From these indicators, the algorithm searches the first vector, starting from vector 1) and proceeding toward vector 4), that has one or more active bits; the logic keeps searching until such a vector is detected. - From the detected highest priority ranking vector with active bit(s), the algorithm scans bits, starting from the index of the current start-source-stage (and after reaching the max bit index of the vector, continuing from bit index 0), until it finds a bit in an active state (logic ‘1’); the index of such found active bit is the index of the source stage from which the arbitrator controls its app-
instance port mux 280 to read 265 its next ITC packet for the local task of the studied app-instance. - The arbitrator logic uses a revolving (incrementing by one at each run of the algorithm, and returning to 0 from the maximum index) starting source stage number as a starting stage in its search of the next source stage for reading an ITC packet.
When the arbitrator has the appropriate data source (from the array 240) thus selected for reading 265, 290 the next packet, thearbitrator 270 directs 272 themux 280 to connect the appropriate source-stagespecific signal 265 to itsoutput 290, and accordingly activates, when enabled by the read-enablecontrol 590 from the app-inst software, the read enable 271 signal for theFIFO 260 of the presently selected source-stagespecific module 245.
- Note that the ITC source
task prioritization info 595 from the task software of app-instances to theirRX logic modules 203 can change dynamically, as the processing state and demands of input data for a given app-instance task evolve over time, and the arbitrator modules 270 (FIG. 5 ) apply the current state of the source task prioritization info provided to them in selecting from which of the source stages tomultiplex 280 out the next ITC packet over theoutput port 290 of the app-instance RX logic. In an embodiment, the local task of a given app-inst, when a need arises, writes 575, 595 the respective ITC prioritization levels for its source tasks (of the given app-inst) on its source-task specific ITC prioritization hardware registers, which are located at (or their info connected to) source-stage prioritizationcontrol logic submodule 285 of thearbitrator 270 of theRX module 203 of that given app-inst. Please seeFIG. 7 for themuxing 580 of the input data read control info (incl. source prioritization) from the app-insts executing at the cores of the array to their associatedRX modules 203. - In addition, the app-instance RX logic per
FIG. 5 participates in the inter-stage ITC flow-control operation as follows: - Each of the source stage
specific FIFO modules 245 of a given app-instance at the RX logic for a given processing stage maintains asignal 212 indicating whether the task (of the app instance under study) located at the source stage that the givenFIFO 260 is specific to is presently permitted to send ITC to the local (destination) task of the app-instance under study: the logic denies the permit when the FIFO fill level is above a defined threshold, while it otherwise grants the permit. - As a result, any given (source) task, when assigned for execution at a core 520 (
FIG. 7 ) at the processing stage where the given task is located, receives the ITC sending permission signals from each of the other (destination) tasks of its app-instance. PerFIG. 7 , these ITC permissions are connected 213 to the processing cores of the (ITC source) stages throughmultiplexers 600, which, according to thecontrol 560 from thecontroller 540 at the given (ITC source) processing stage identifying the active app-instance for eachexecution core 520, connect 213 the incoming ITC permission signals 212 from the other stages of the givenmulti-stage system 1 to thecores 520 at that stage. For this purpose, the processing stage provides corespecific muxes 600, each of which connects to its associated core the incoming ITC send permit signals from the ‘remote’ (destination) tasks of the app-instance assigned at the time to the given core, i.e., from the tasks of that app-instance located at the other stages of the given processing system. The (destination) taskRX logic modules 203 activate the ITC permission signals for times that the source task for which the given permission signal is directed to is permitted to send further ITC data to that destination task of the given app-inst. - Each given processing stage receive and monitor ITC permit signal signals 212 from those of the processing stages that the given stage actually is able to send ITC data to; please see
FIG. 2 for ITC connectivity among the processing stages in the herein studied embodiment of the presented architecture. - The ITC
permit signal buses 212 will naturally be connected across themulti-stage system 1 between the app-instancespecific modules 203 of theRX logic modules 202 of the ITC destination processing stages and the ITC source processing stages (noting that a givenstage 300 will be both a source and destination for ITC as illustrated inFIG. 2 ), though the inter-stage connections of the ITC flow control signals are not shown inFIG. 2 . The starting and ending points of the of the signals are shown, inFIG. 5 andFIG. 7 respectively, while the grouping of these ITC flow control signals according to which processing stage the given signal group is directed to, as well as forming of the stage specific signal groups according to the app-instance # that any given ITC flow control signal concerns, are illustrated also inFIGS. 3-4 . In connecting these per app-instance ID # arranged, stage specific groups of signals (FIG. 3 ) to any of the processing stages 300 (FIG. 7 ), the principle is that, at arrival to the stage that a given set of such groups of signals is directed to, the signals from said groups are re-grouped to form, for each of the app-instances hosted on thesystem 1, a bit vector where a bit of a given index indicates whether the task of a given app-instance (that the given bit vector is specific to) hosted at this (source) stage under study is permitted at that time to send ITC data to its task located at the stage ID # of that given index. Thus, each given bit in these bit vectors informs whether the studied task of the given app-instance is permitted to send ITC to the task of that app-instance with task ID # equal to the index of the given bit. With the incoming ITC flow control signals thus organized to app-instance specific bit vectors, the above discussed core specific muxes 600 (FIG. 7 ) are able to connect to any givencore 520 of the local manycore array the (task-ID-indexed) ITC flow control bit vector of the app-instance presently assigned for execution at the given core. By monitoring the destination stage (i.e. destination task) specific bits of the ITC permission bit vector thus connected to the present execution core of a task of the studied app-instance located at the ITC (source) processing stage under study (at times that the given app-instance actually is assigned for execution), that ITC source task will be able to know to which of the other tasks of its app-instance sending ITC is permitted at any given time. - Note that, notwithstanding the functional illustration in
FIG. 5 , in actual hardware implementation, the FIFO fill-above-threshold indications from the source stagespecific FIFOs 260 of the app-instance specific submodules of the RX logic modules of the (ITC destination) processing stages of the present multi-stage system are wired directly, though as inverted, as the ITC send permission indication signals to theappropriate muxes 600 of the (ITC source) stages, without going through the arbitrator modules (of the app-instance RX logic modules at the ITC destination stages). Naturally, an ITC permission signal indicating that the destination FIFO for the given ITC flow has its fill level presently above the configured threshold is to be understood by the source task for that ITC flow as a denial of the ITC permission (until that signal would turn to indicate that the fill level of the destination FIFO is below the configured ITC permission activation threshold). - Each source task applies these ITC send permission signals from a given destination task of its app-instance at times that it is about to begin sending a new packet over its (assigned execution core specific) processing
stage output port 210 to that given destination task. TheITC destination FIFO 260 monitoring threshold for allowing/disallowing further ITC data to be sent to the given destination task (from the source task that the given FIFO is specific to) is set to a level where the FIFO still has room for at least one ITC packet worth of data bytes, with the size of such ITC packets being configurable for a given system implementation, and the source tasks are to restrict the remaining length of their packet transmissions to destination tasks denying the ITC permissions according to such configured limits. - The app-level RX logic per
FIG. 4 arranges the instances of its app for the instance execution priority list 535 (sent via info flow 430) according to the descending order of their priority scores, computed for each instance based on their numbers 429 of source stage specific non-empty FIFOs 260 (FIG. 5), as follows. To describe the forming of the priority scores, we first define (a non-negative integer) H as the number of non-empty FIFOs of the given instance whose associated source stage was assigned a high ITC priority (by the local task of the given app-instance hosted at the processing stage under study). We also define (a non-negative integer) L as the number of other (non-high ITC priority source task) non-empty FIFOs of the given instance. With H and L thus defined, the intra-app execution priority score P for a given instance specific module (of the present app under study) is formed with equations as follows, with different embodiments having differing coefficients for the factors H, L and the number of tasks for the app, T:
for H > 0: P = T − 1 + 2H + L; and
for H = 0: P = L.
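These two formulas transcribe directly into code (illustrative only; the actual score formation is hardware logic, and the coefficients may differ per embodiment, as noted above):

```python
# Direct transcription of the intra-app execution priority score P above
# (T = number of tasks of the app; H and L as defined in the preceding text).
def priority_score(T: int, H: int, L: int) -> int:
    if H > 0:
        return T - 1 + 2 * H + L
    return L
```

For example, with T=4 tasks, an instance with H=2 and L=1 scores priority_score(4, 2, 1) == 8, ranking above any instance with no high-priority input data (H=0), whose score is at most its count of other non-empty FIFOs.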
execution priority list 535, via a continually repeating process, signals (via hardware wires dedicated for the purpose) to thecontroller 540 of the local manycore processor 500 (FIG. 7 ) this instance execution priority list using the following format: - The process periodically starts from priority order 0 (i.e. the app's instance with the greatest priority score P), and steps through the remaining
priority orders 1 through the maximum supported number of instances for the given application (specifically, for its task located at the processing stage under study) less 1, producing one instance entry per each step on the list that is sent to the controller as such individual entries. Each entry of such a priority list comprises, as its core info, simply the instance ID # (as the priority order of any given instance is known from the number of clock cycles since the bit pulse marking thepriority order 0 at the start of a new list). To simplify the logic, also the priority order (i.e. the number of clock cycles since the bit pulse marking the priority order 0) of any given entry on these lists is sent along with the instance ID #. - At the beginning of its core to app-instance assignment process, the
controller 540 of the manycore processor uses the most recent set of complete priority order lists 535 received from theapplication RX modules 202 to determine which (highest priority) instances of each given app to assign for execution for the next core allocation period on that processor. - Per the foregoing, the ITC source prioritization, program instance execution prioritization and ITC flow control techniques provide effective program execution optimization capabilities for each of a set of individual programs configured to dynamically share a given
data processing system 1 per this description, without any of the programs impacting or being impacted by in any manner the other programs of such set. Moreover, for ITC capabilities, also the individual instances (e.g. different user sessions) of a given program are fully independent from each other. The herein described techniques and architecture thus provide effective performance and runtime isolation between individual programs among groups of programs running on the dynamically shared parallel computing hardware. - From here, we continue by exploring the internal structure and operation of a given
processing stage 300 beyond its RX logic perFIGS. 3-5 , with references toFIGS. 6 and 7 . - Per
FIG. 6 , any of the processing stages 300 of themulti-stage system 1 perFIG. 2 has, besides theRX logic 201 and the actual manycore processor system (FIG. 7 ), aninput multiplexing subsystem 450, which connects input data packets from any of the app-instancespecific input ports 290 to any of theprocessing cores 520 of the processing stage, according to which app-instance is executing at any of the cores at any given time. - The monitoring of the buffered input data availability 261 at the destination app-
instance FIFOs 260 of the processing stage RX logic enables optimizing the allocation of processing core capacity of the local manycore processor among the application tasks hosted on the given processing stage. Since thecontroller module 540 of the local manycore processor determines which instances of the locally hosted tasks of the apps in thesystem 1 execute at which of the cores of the localmanycore array 515, the controller is able to provide thedynamic control 560 for themuxes 450 perFIG. 6 to connect the appropriate app-instance specificinput data port 290 from the stage RX logic to each of the core specificinput data ports 490 of the manycore array of the local processor. - Internal elements and operation of the application load adaptive
- Internal elements and operation of the application load adaptive manycore processor system 500 are illustrated in FIG. 7. For the intra processing stage discussion, it shall be recalled that there is no more than one task located per processing stage per each of the apps, though there can be up to X (a positive integer) parallel instances of any given app-task at its local processing stage (having an array 515 of X cores). With one task per application per processing stage 300, the term app-instance, in the context of a single processing stage, means an instance of an app-task hosted at the given processing stage under study. - FIG. 7 provides a functional block diagram for the manycore processor system dynamically shared among instances of the locally hosted app-tasks, with capabilities for application input data load adaptive allocation of the cores 520 among the applications and for app-inst execution priority based assignment of the cores (per said allocation), as well as for accordingly dynamically reconfigured 550, 560 I/O and memory access by the app-insts.
- As illustrated in FIG. 7, the processor system 500 comprises an array 515 of processing cores 520, which are dynamically shared among instances of the locally hosted tasks of the application programs configured to run on the system 1, under the direction of the controller 540. Application program specific logic functions at the RX module (FIGS. 3-5) signal their associated applications' capacity demand indicators 430 to the controller. Among these indicators, the core demand figures (CDFs) 530 express how many cores their associated app is presently able to utilize for its (ready to execute) instances. Each application's capacity demand expression 430 for the controller further includes a list of its ready instances in an execution priority order 535.
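- The capacity demand expression can be pictured as a small record per application; the following dataclass is a hypothetical rendering of it (the field names are not from the specification), shown only to fix ideas.

```python
# Sketch of a per-application capacity demand expression 430: the core
# demand figure (CDF) 530 plus the app's ready instances listed in
# execution priority order 535.
from dataclasses import dataclass, field
from typing import List

@dataclass
class CapacityDemand:
    app_id: str
    cdf: int                                # cores presently utilizable (530)
    ready_instances: List[int] = field(default_factory=list)  # priority order (535)

demand = CapacityDemand(app_id="appA", cdf=3, ready_instances=[7, 2, 5])
```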
- Any of the cores 520 of a processor per FIG. 7 can comprise any types of software program and data processing hardware resources, e.g. central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs) or application specific processors (ASPs) etc., and in a programmable logic (FPGA) implementation, the core type for any core slot 520 is furthermore reconfigurable per the expressed demands of its assigned app-task, e.g. per [1], Appendix A, Ch. 5.5.
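- For the FPGA case, the demand-driven retyping of core slots can be modeled as simple bookkeeping, as in the hypothetical sketch below; the actual reconfiguration mechanism is per [1], Appendix A, Ch. 5.5, and everything named here is an assumption for illustration.

```python
# Bookkeeping sketch of demand-driven core-slot typing: each core slot 520
# assigned to an app-task is (re)configured to the core type (e.g. CPU, GPU,
# DSP or ASP) that the app-task demands; unassigned slots keep their type.

def retype_core_slots(slot_types, assignments, demanded_type):
    """slot_types: dict slot_id -> current core type.
    assignments: dict slot_id -> app_task_id (or None).
    demanded_type: dict app_task_id -> core type the app-task demands.
    Returns the slot_id -> core type mapping after reconfiguration."""
    return {
        slot: demanded_type[assignments[slot]] if assignments.get(slot) else cur
        for slot, cur in slot_types.items()
    }

new_types = retype_core_slots({0: "CPU", 1: "CPU"},
                              {0: "fft_task", 1: None},
                              {"fft_task": "DSP"})
assert new_types == {0: "DSP", 1: "CPU"}
```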
- The hardware logic based controller module 540 within the processor system, through a periodic process, allocates and assigns the cores 520 of the processor among the set of applications and their instances based on the applications' core demand figures (CDFs) 530 as well as their contractual core capacity entitlements (CEs). This application instance to core assignment process is exercised periodically, e.g. at intervals of a defined number (for instance 64, 256 or 1024, or so forth) of processing core clock or instruction cycles. The app-instance to core assignment algorithms of the controller produce, per the app-instances on the processor, identification 550 of their execution cores (if any, at any given time), as well as, per the cores of the fabric, identification 560 of their respective app-instances to execute. Moreover, these assignments 550, 560 control the access between the cores 520 of the array 515 and the app-inst specific memories at the fabric network and memory subsystem 800 (which can be implemented e.g. per [1], Appendix A, Ch. 5.4).
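- The following simplified Python model illustrates one round of that periodic process under stated assumptions: each app is first granted cores up to the lesser of its CDF 530 and its entitlement, leftover cores go to apps with unmet demand, and each app's grant is then filled with its highest-priority ready instances from list 535, yielding the app-inst-to-core map (550) and the core-to-app-inst map (560). The tie-breaking and leftover policies here are assumptions, not the patented algorithms.

```python
# Software model of one controller 540 run: CDF/entitlement based core
# allocation followed by priority-ordered app-instance to core assignment.

def allocate_cores(cdfs, entitlements, num_cores):
    """cdfs, entitlements: dict app_id -> int. Returns app_id -> granted cores."""
    grant = {app: min(d, entitlements.get(app, 0)) for app, d in cdfs.items()}
    left = num_cores - sum(grant.values())
    # Hand any remaining cores to the apps with the largest unmet demand.
    for app in sorted(cdfs, key=lambda a: cdfs[a] - grant[a], reverse=True):
        take = min(left, cdfs[app] - grant[app])
        grant[app] += take
        left -= take
    return grant

def assign_instances(grant, priority_lists):
    """priority_lists: app_id -> ready instance IDs in priority order (535).
    Returns (map_550: (app, inst) -> core, map_560: core -> (app, inst))."""
    map_550, map_560, next_core = {}, {}, 0
    for app, n in grant.items():
        for inst in priority_lists[app][:n]:   # highest-priority instances first
            map_550[(app, inst)] = next_core
            map_560[next_core] = (app, inst)
            next_core += 1
    return map_550, map_560

grant = allocate_cores({"A": 3, "B": 2}, {"A": 2, "B": 2}, 4)
m550, m560 = assign_instances(grant, {"A": [7, 2, 5], "B": [0, 1]})
assert m560[0] == ("A", 7)
```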
- The app-instance to core mapping info 560 also directs the muxing 450 of input data from the RX buffers 260 of the appropriate app-instance to each core of the array 515, as well as the muxing 580 of the input data read control signals (570 to 590, and 575 to 595) from the core array to the RX logic submodule (FIG. 5) of the app-instance that is assigned to any given core 520 at any given time. - Similarly, the core to app-inst mapping info 560 also directs the muxing 600 of the (source) app-instance specific ITC permit signals (212 to 213) from the destination processing stages to the cores 520 of the local manycore array, according to which app-instance is presently mapped to which core. - Further reference specifications for aspects and embodiments of the invention are in the references [1] through [10].
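- A corresponding sketch of the permit-signal muxing 600: the core-to-app-inst mapping steers, to each core, the ITC permits (212 to 213) of the app-instance presently executing there. The data shapes below are illustrative assumptions.

```python
# Sketch of muxing 600: route per-app-instance ITC send permits, arriving
# from the destination processing stages, to whichever core 520 is presently
# executing that app-instance.

def mux_itc_permits(core_to_app_inst, permits_by_app_inst):
    """core_to_app_inst: dict core_id -> app_inst (or None if idle).
    permits_by_app_inst: dict app_inst -> {dest_stage: bool} send permits.
    Returns dict core_id -> the permit view for the mapped app-instance."""
    return {
        core: permits_by_app_inst.get(inst, {})
        for core, inst in core_to_app_inst.items()
        if inst is not None
    }

permits = mux_itc_permits({0: ("A", 7), 1: None}, {("A", 7): {"stage3": True}})
assert permits == {0: {"stage3": True}}
```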
- The functionality of the invented systems and methods described in this specification, where not otherwise mentioned, is implemented by hardware logic of the system (wherein hardware logic naturally also includes any necessary signal wiring, memory elements and such).
- Generally, this description and the drawings are included to illustrate the architecture and operation of practical embodiments of the invention, but are not meant to limit the scope of the invention. For instance, even though the description does specify certain system elements to certain practical types or values, persons of skill in the art will realize, in view of this description, that any design utilizing the architectural or operational principles of the disclosed systems and methods, with any set of practical types and values for the system parameters, is within the scope of the invention. Moreover, the system elements and process steps, though shown as distinct to clarify the illustration and the description, can in various embodiments be merged or combined with other elements, or further subdivided and rearranged, etc., without departing from the spirit and scope of the invention. Finally, persons of skill in the art will realize that various embodiments of the invention can use different nomenclature and terminology to describe the system elements, process phases and other technical concepts of their respective implementations. Generally, from this description many variants and modifications will be understood by one skilled in the art that are yet encompassed by the spirit and scope of the invention.
Claims (9)
1. (canceled)
2. A method performed in a data processing system, the method comprising:
receiving, by hardware logic and/or software logic, requests to perform different tasks on behalf of instances of a plurality of programs managed by a data processing system;
identifying, by the hardware logic and/or software logic for each of the instances, communication interdependencies between different processing stages of a set of processing stages of the respective instance;
based on conditions in the data processing system, dynamically varying, by the hardware logic and/or software logic, structures of field-programmable gate arrays used to process different tasks of the instances of the plurality of programs, the structures being dynamically varied by
identifying available field-programmable gate arrays of the data processing system that are available to process different processing stages of requesting instances of respective programs,
based at least on the conditions in the data processing system, identifying selected field-programmable gate arrays from the available field-programmable gate arrays to execute the different processing stages of the requesting instances of the respective programs,
configuring the selected field-programmable gate arrays to process a respective processing stage of a respective requesting instance, and
configuring certain selected field-programmable gate arrays to support communicating, by the task executing on the respective field-programmable gate array, final results to a requesting client over a network in the data processing system.
3. The method of claim 2, further comprising configuring a portion of the selected field-programmable gate arrays to communicate intermediate results of the different tasks over the network to the certain selected field-programmable gate arrays, the certain selected field-programmable gate arrays using the intermediate results to produce the final results.
4. The method of claim 2, further comprising:
maintaining availability information identifying availability of the field-programmable gate arrays; and
modifying the availability information to indicate that a particular field-programmable gate array is in use.
5. The method of claim 2, wherein the conditions comprise changes in demand expressions of the plurality of programs.
6. The method of claim 2, wherein, for at least one program of the plurality of programs, configuring the selected field-programmable gate arrays results in configuring two or more of the selected field-programmable gate arrays as two or more parallel copies of a given task of the at least one program as a parallelized processing stage.
7. The method of claim 6, wherein dynamically varying the structures of the field-programmable gate arrays comprises repeatedly configuring the structures for a plurality of iterations to dynamically adjust performance of the different tasks based at least on the conditions.
8. The method of claim 6, wherein repeatedly configuring the structures comprises, for at least one iteration of the plurality of iterations, deactivating a given copy of the two or more parallel copies in the parallelized processing stage while remaining copies of the two or more parallel copies continue executing the parallelized processing stage, wherein
the given copy is deactivated based at least on the conditions.
9. The method of claim 8, wherein the conditions comprise a decrease in processing load destined for the at least one program.
Priority Applications (10)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/195,174 US11036556B1 (en) | 2013-08-23 | 2021-03-08 | Concurrent program execution optimization |
US17/344,636 US11188388B2 (en) | 2013-08-23 | 2021-06-10 | Concurrent program execution optimization |
US17/463,098 US11347556B2 (en) | 2013-08-23 | 2021-08-31 | Configurable logic platform with reconfigurable processing circuitry |
US17/470,926 US11385934B2 (en) | 2013-08-23 | 2021-09-09 | Configurable logic platform with reconfigurable processing circuitry |
US17/747,839 US20220276903A1 (en) | 2013-08-23 | 2022-05-18 | Configurable logic platform with reconfigurable processing circuitry |
US17/859,657 US11500682B1 (en) | 2013-08-23 | 2022-07-07 | Configurable logic platform with reconfigurable processing circuitry |
US17/979,526 US11816505B2 (en) | 2013-08-23 | 2022-11-02 | Configurable logic platform with reconfigurable processing circuitry |
US17/979,542 US11687374B2 (en) | 2013-08-23 | 2022-11-02 | Configurable logic platform with reconfigurable processing circuitry |
US18/116,389 US11915055B2 (en) | 2013-08-23 | 2023-03-02 | Configurable logic platform with reconfigurable processing circuitry |
US18/394,944 US20240126609A1 (en) | 2013-08-23 | 2023-12-22 | Configurable logic platform with reconfigurable processing circuitry |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201361869646P | 2013-08-23 | 2013-08-23 | |
US201461934747P | 2014-02-01 | 2014-02-01 | |
US14/318,512 US9448847B2 (en) | 2011-07-15 | 2014-06-27 | Concurrent program execution optimization |
US15/267,153 US10318353B2 (en) | 2011-07-15 | 2016-09-16 | Concurrent program execution optimization |
US16/434,581 US10942778B2 (en) | 2012-11-23 | 2019-06-07 | Concurrent program execution optimization |
US17/195,174 US11036556B1 (en) | 2013-08-23 | 2021-03-08 | Concurrent program execution optimization |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/434,581 Continuation US10942778B2 (en) | 2012-11-23 | 2019-06-07 | Concurrent program execution optimization |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/344,636 Continuation US11188388B2 (en) | 2013-08-23 | 2021-06-10 | Concurrent program execution optimization |
Publications (2)
Publication Number | Publication Date |
---|---|
US11036556B1 (en) | 2021-06-15
US20210191781A1 (en) | 2021-06-24
Family
ID=76320887
Family Applications (13)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/318,512 Expired - Fee Related US9448847B2 (en) | 2011-07-15 | 2014-06-27 | Concurrent program execution optimization |
US15/267,153 Active US10318353B2 (en) | 2011-07-15 | 2016-09-16 | Concurrent program execution optimization |
US16/434,581 Active 2034-09-05 US10942778B2 (en) | 2012-11-23 | 2019-06-07 | Concurrent program execution optimization |
US17/195,174 Active US11036556B1 (en) | 2013-08-23 | 2021-03-08 | Concurrent program execution optimization |
US17/344,636 Active US11188388B2 (en) | 2013-08-23 | 2021-06-10 | Concurrent program execution optimization |
US17/463,098 Active US11347556B2 (en) | 2013-08-23 | 2021-08-31 | Configurable logic platform with reconfigurable processing circuitry |
US17/470,926 Active US11385934B2 (en) | 2013-08-23 | 2021-09-09 | Configurable logic platform with reconfigurable processing circuitry |
US17/747,839 Abandoned US20220276903A1 (en) | 2013-08-23 | 2022-05-18 | Configurable logic platform with reconfigurable processing circuitry |
US17/859,657 Active US11500682B1 (en) | 2013-08-23 | 2022-07-07 | Configurable logic platform with reconfigurable processing circuitry |
US17/979,526 Active US11816505B2 (en) | 2013-08-23 | 2022-11-02 | Configurable logic platform with reconfigurable processing circuitry |
US17/979,542 Active US11687374B2 (en) | 2013-08-23 | 2022-11-02 | Configurable logic platform with reconfigurable processing circuitry |
US18/116,389 Active US11915055B2 (en) | 2013-08-23 | 2023-03-02 | Configurable logic platform with reconfigurable processing circuitry |
US18/394,944 Pending US20240126609A1 (en) | 2013-08-23 | 2023-12-22 | Configurable logic platform with reconfigurable processing circuitry |
Family Applications Before (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/318,512 Expired - Fee Related US9448847B2 (en) | 2011-07-15 | 2014-06-27 | Concurrent program execution optimization |
US15/267,153 Active US10318353B2 (en) | 2011-07-15 | 2016-09-16 | Concurrent program execution optimization |
US16/434,581 Active 2034-09-05 US10942778B2 (en) | 2012-11-23 | 2019-06-07 | Concurrent program execution optimization |
Family Applications After (9)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/344,636 Active US11188388B2 (en) | 2013-08-23 | 2021-06-10 | Concurrent program execution optimization |
US17/463,098 Active US11347556B2 (en) | 2013-08-23 | 2021-08-31 | Configurable logic platform with reconfigurable processing circuitry |
US17/470,926 Active US11385934B2 (en) | 2013-08-23 | 2021-09-09 | Configurable logic platform with reconfigurable processing circuitry |
US17/747,839 Abandoned US20220276903A1 (en) | 2013-08-23 | 2022-05-18 | Configurable logic platform with reconfigurable processing circuitry |
US17/859,657 Active US11500682B1 (en) | 2013-08-23 | 2022-07-07 | Configurable logic platform with reconfigurable processing circuitry |
US17/979,526 Active US11816505B2 (en) | 2013-08-23 | 2022-11-02 | Configurable logic platform with reconfigurable processing circuitry |
US17/979,542 Active US11687374B2 (en) | 2013-08-23 | 2022-11-02 | Configurable logic platform with reconfigurable processing circuitry |
US18/116,389 Active US11915055B2 (en) | 2013-08-23 | 2023-03-02 | Configurable logic platform with reconfigurable processing circuitry |
US18/394,944 Pending US20240126609A1 (en) | 2013-08-23 | 2023-12-22 | Configurable logic platform with reconfigurable processing circuitry |
Country Status (1)
Country | Link |
---|---|
US (13) | US9448847B2 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11347556B2 (en) | 2013-08-23 | 2022-05-31 | Throughputer, Inc. | Configurable logic platform with reconfigurable processing circuitry |
Families Citing this family (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8789065B2 (en) | 2012-06-08 | 2014-07-22 | Throughputer, Inc. | System and method for input data load adaptive parallel processing |
US9130885B1 (en) | 2012-09-11 | 2015-09-08 | Mellanox Technologies Ltd. | End-to-end cache for network elements |
JPWO2015068598A1 (en) * | 2013-11-11 | 2017-03-09 | 日本電気株式会社 | Apparatus, session processing quality stabilization system, priority processing method, transmission method, relay method, and program |
US9325641B2 (en) * | 2014-03-13 | 2016-04-26 | Mellanox Technologies Ltd. | Buffering schemes for communication over long haul links |
US9584429B2 (en) | 2014-07-21 | 2017-02-28 | Mellanox Technologies Ltd. | Credit based flow control for long-haul links |
US9733987B2 (en) * | 2015-02-20 | 2017-08-15 | Intel Corporation | Techniques to dynamically allocate resources of configurable computing resources |
WO2016176650A1 (en) * | 2015-04-30 | 2016-11-03 | Amazon Technologies, Inc. | Managing load balancers associated with auto-scaling groups |
US10310820B2 (en) * | 2016-05-12 | 2019-06-04 | Basal Nuclei Inc | Programming model and interpreted runtime environment for high performance services with implicit concurrency control |
US10223317B2 (en) | 2016-09-28 | 2019-03-05 | Amazon Technologies, Inc. | Configurable logic platform |
US10162921B2 (en) | 2016-09-29 | 2018-12-25 | Amazon Technologies, Inc. | Logic repository service |
US10250572B2 (en) | 2016-09-29 | 2019-04-02 | Amazon Technologies, Inc. | Logic repository service using encrypted configuration data |
US10282330B2 (en) * | 2016-09-29 | 2019-05-07 | Amazon Technologies, Inc. | Configurable logic platform with multiple reconfigurable regions |
US11115293B2 (en) | 2016-11-17 | 2021-09-07 | Amazon Technologies, Inc. | Networked programmable logic service provider |
US10642648B2 (en) | 2017-08-24 | 2020-05-05 | Futurewei Technologies, Inc. | Auto-adaptive serverless function management |
CN108363615B (en) * | 2017-09-18 | 2019-05-14 | 清华大学 | Method for allocating tasks and system for reconfigurable processing system |
CN109976885B (en) * | 2017-12-28 | 2021-07-06 | 中移物联网有限公司 | Event processing method and device based on multitask operating system and storage medium |
WO2019207790A1 (en) * | 2018-04-27 | 2019-10-31 | 三菱電機株式会社 | Data processing device, task control method, and program |
EP3591938A1 (en) * | 2018-07-03 | 2020-01-08 | Electronics and Telecommunications Research Institute | System and method to control a cross domain workflow based on a hierachical engine framework |
US10951549B2 (en) | 2019-03-07 | 2021-03-16 | Mellanox Technologies Tlv Ltd. | Reusing switch ports for external buffer network |
KR20220016859A (en) * | 2019-05-07 | 2022-02-10 | 엑스페데라, 아이엔씨. | Method and apparatus for scheduling matrix jobs in digital processing system |
US11556382B1 (en) * | 2019-07-10 | 2023-01-17 | Meta Platforms, Inc. | Hardware accelerated compute kernels for heterogeneous compute environments |
US11144546B2 (en) * | 2020-02-13 | 2021-10-12 | International Business Machines Corporation | Dynamically selecting a data access path to improve query performance |
US11741043B2 (en) * | 2021-01-29 | 2023-08-29 | The Trustees Of Dartmouth College | Multi-core processing and memory arrangement |
US11558316B2 (en) | 2021-02-15 | 2023-01-17 | Mellanox Technologies, Ltd. | Zero-copy buffering of traffic of long-haul links |
US11973696B2 (en) | 2022-01-31 | 2024-04-30 | Mellanox Technologies, Ltd. | Allocation of shared reserve memory to queues in a network device |
US11729057B1 (en) * | 2022-02-07 | 2023-08-15 | The Bank Of New York Mellon | Application architecture drift detection system |
Family Cites Families (365)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US3032889A (en) | 1958-07-17 | 1962-05-08 | Shri Ram Inst For Ind Res | Guide roller mounting and fluid injection system for fluidized beds for textile treatment |
US3551892A (en) | 1969-01-15 | 1970-12-29 | Ibm | Interaction in a multi-processing system utilizing central timers |
JPS5416006B2 (en) | 1973-01-29 | 1979-06-19 | ||
JPS5197619A (en) | 1975-02-25 | 1976-08-27 | KESHOBAN | |
US4402046A (en) | 1978-12-21 | 1983-08-30 | Intel Corporation | Interprocessor communication system |
IT1126475B (en) | 1979-12-03 | 1986-05-21 | Honeywell Inf Systems | COMMUNICATION APPARATUS BETWEEN MORE PROCESSORS |
US4403286A (en) | 1981-03-06 | 1983-09-06 | International Business Machines Corporation | Balancing data-processing work loads |
JPS604314A (en) | 1983-06-23 | 1985-01-10 | Nippon Telegr & Teleph Corp <Ntt> | Antenna device |
FR2549621B1 (en) | 1983-07-19 | 1988-09-16 | Telecommunications Sa | MULTIPROCESSOR SYSTEM FOR COMMUNICATION OF PROCESSORS BETWEEN THEM |
DE3340123A1 (en) | 1983-11-05 | 1985-05-15 | Helmut Dipl.-Inform. 5860 Iserlohn Bähring | Communications unit for coupling microcomputers |
SU1327106A1 (en) | 1986-02-05 | 1987-07-30 | Киевское Высшее Инженерное Радиотехническое Училище Противовоздушной Обороны | Apparatus for distributing jobs to processors |
JP2738674B2 (en) | 1986-05-23 | 1998-04-08 | 株式会社日立製作所 | Parallel computer and data transfer method of parallel computer |
JPH0755527B2 (en) | 1987-11-20 | 1995-06-14 | ファナック株式会社 | Clamping device for injection molding machine |
US4956771A (en) | 1988-05-24 | 1990-09-11 | Prime Computer, Inc. | Method for inter-processor data transfer |
US5452231A (en) | 1988-10-05 | 1995-09-19 | Quickturn Design Systems, Inc. | Hierarchically connected reconfigurable logic assembly |
US5031146A (en) | 1988-12-22 | 1991-07-09 | Digital Equipment Corporation | Memory apparatus for multiple processor systems |
US5341477A (en) | 1989-02-24 | 1994-08-23 | Digital Equipment Corporation | Broker for computer network server selection |
US5519829A (en) | 1990-08-03 | 1996-05-21 | 3Dlabs Ltd. | Data-array processing and memory systems |
US5303369A (en) | 1990-08-31 | 1994-04-12 | Texas Instruments Incorporated | Scheduling system for multiprocessor operating system |
US5237673A (en) | 1991-03-20 | 1993-08-17 | Digital Equipment Corporation | Memory management method for coupled memory multiprocessor systems |
JPH05197619A (en) | 1992-01-22 | 1993-08-06 | Nec Corp | Memory control circuit for multi-cpu |
JPH064314A (en) | 1992-06-18 | 1994-01-14 | Nec Home Electron Ltd | Inter-task synchronizing communication equipment |
US5802290A (en) | 1992-07-29 | 1998-09-01 | Virtual Computer Corporation | Computer network of distributed virtual computers which are EAC reconfigurable in response to instruction to be executed |
JPH0659906A (en) | 1992-08-10 | 1994-03-04 | Hitachi Ltd | Method for controlling execution of parallel |
GB2272311A (en) | 1992-11-10 | 1994-05-11 | Ibm | Call management in a collaborative working network. |
JP3696901B2 (en) | 1994-07-19 | 2005-09-21 | キヤノン株式会社 | Load balancing method |
US5600845A (en) | 1994-07-27 | 1997-02-04 | Metalithic Systems Incorporated | Integrated circuit computing device comprising a dynamically configurable gate array having a microprocessor and reconfigurable instruction execution means and method therefor |
JP3371044B2 (en) | 1994-12-28 | 2003-01-27 | 株式会社日立製作所 | Area allocation method and disk array access method for disk array |
US6728959B1 (en) | 1995-08-08 | 2004-04-27 | Novell, Inc. | Method and apparatus for strong affinity multiprocessor scheduling |
JPH0954699A (en) | 1995-08-11 | 1997-02-25 | Fujitsu Ltd | Process scheduler of computer |
FR2752312B1 (en) | 1996-08-07 | 1998-10-30 | Motorola Semiconducteurs | METHOD AND CIRCUIT FOR DYNAMICALLY ADJUSTING THE SUPPLY VOLTAGE AND OR THE FREQUENCY OF THE CLOCK SIGNAL IN A DIGITAL CIRCUIT |
US6072781A (en) | 1996-10-22 | 2000-06-06 | International Business Machines Corporation | Multi-tasking adapter for parallel network applications |
US6317774B1 (en) | 1997-01-09 | 2001-11-13 | Microsoft Corporation | Providing predictable scheduling of programs using a repeating precomputed schedule |
US6289434B1 (en) | 1997-02-28 | 2001-09-11 | Cognigine Corporation | Apparatus and method of implementing systems on silicon using dynamic-adaptive run-time reconfigurable circuits for processing multiple, independent data and control streams of varying rates |
US5931959A (en) | 1997-05-21 | 1999-08-03 | The United States Of America As Represented By The Secretary Of The Air Force | Dynamically reconfigurable FPGA apparatus and method for multiprocessing and fault tolerance |
US5961606A (en) | 1997-06-30 | 1999-10-05 | Sun Microsystems, Inc. | System and method for remote buffer allocation in exported memory segments and message passing between network nodes |
US6212544B1 (en) | 1997-10-23 | 2001-04-03 | International Business Machines Corporation | Altering thread priorities in a multithreaded processor |
US6345287B1 (en) | 1997-11-26 | 2002-02-05 | International Business Machines Corporation | Gang scheduling for resource allocation in a cluster computing environment |
US6434687B1 (en) | 1997-12-17 | 2002-08-13 | Src Computers, Inc. | System and method for accelerating web site access and processing utilizing a computer system incorporating reconfigurable processors operating under a single operating system image |
US6353616B1 (en) | 1998-05-21 | 2002-03-05 | Lucent Technologies Inc. | Adaptive processor schedulor and method for reservation protocol message processing |
JPH11353291A (en) | 1998-06-11 | 1999-12-24 | Nec Corp | Multiprocessor system and medium recording task exchange program |
US6334175B1 (en) | 1998-07-22 | 2001-12-25 | Ati Technologies, Inc. | Switchable memory system and memory allocation method |
US6211721B1 (en) | 1998-12-28 | 2001-04-03 | Applied Micro Circuits Corporation | Multiplexer with short propagation delay and low power consumption |
US6477558B1 (en) | 1999-05-17 | 2002-11-05 | Schlumberger Resource Management Systems, Inc. | System for performing load management |
US6374300B2 (en) | 1999-07-15 | 2002-04-16 | F5 Networks, Inc. | Method and system for storing load balancing information with an HTTP cookie |
US6779016B1 (en) | 1999-08-23 | 2004-08-17 | Terraspring, Inc. | Extensible computing system |
US6438737B1 (en) | 2000-02-15 | 2002-08-20 | Intel Corporation | Reconfigurable logic for a computer |
US20020152305A1 (en) * | 2000-03-03 | 2002-10-17 | Jackson Gregory J. | Systems and methods for resource utilization analysis in information management environments |
US6769017B1 (en) | 2000-03-13 | 2004-07-27 | Hewlett-Packard Development Company, L.P. | Apparatus for and method of memory-affinity process scheduling in CC-NUMA systems |
US8195823B2 (en) | 2000-04-17 | 2012-06-05 | Circadence Corporation | Dynamic network link acceleration |
US7490328B2 (en) | 2000-05-09 | 2009-02-10 | Surf Communication Solutions, Ltd. | Method and apparatus for allocating processor pool resources for handling mobile data connections |
US6721948B1 (en) | 2000-06-30 | 2004-04-13 | Equator Technologies, Inc. | Method for managing shared tasks in a multi-tasking data processing system |
US7110417B1 (en) | 2000-07-13 | 2006-09-19 | Nortel Networks Limited | Instance memory handoff in multi-processor systems |
US6816905B1 (en) | 2000-11-10 | 2004-11-09 | Galactic Computing Corporation Bvi/Bc | Method and system for providing dynamic hosted service management across disparate accounts/sites |
WO2002009285A2 (en) | 2000-07-20 | 2002-01-31 | Celoxica Limited | System, method and article of manufacture for dynamic programming of one reconfigurable logic device from another reconfigurable logic device |
SG118081A1 (en) | 2000-07-24 | 2006-01-27 | Sony Corp | Information processing method, inter-task communication method and computer-executable program for the same |
US6909691B1 (en) | 2000-08-07 | 2005-06-21 | Ensim Corporation | Fairly partitioning resources while limiting the maximum fair share |
US7538772B1 (en) | 2000-08-23 | 2009-05-26 | Nintendo Co., Ltd. | Graphics processing system with enhanced memory controller |
US6782410B1 (en) | 2000-08-28 | 2004-08-24 | Ncr Corporation | Method for managing user and server applications in a multiprocessor computer system |
US20020129080A1 (en) | 2001-01-11 | 2002-09-12 | Christian Hentschel | Method of and system for running an algorithm |
US7596784B2 (en) * | 2000-09-12 | 2009-09-29 | Symantec Operating Corporation | Method system and apparatus for providing pay-per-use distributed computing resources |
US7599753B2 (en) | 2000-09-23 | 2009-10-06 | Microsoft Corporation | Systems and methods for running priority-based application threads on a realtime component |
US20020089348A1 (en) | 2000-10-02 | 2002-07-11 | Martin Langhammer | Programmable logic integrated circuit devices including dedicated processor components |
US7020713B1 (en) | 2000-10-10 | 2006-03-28 | Novell, Inc. | System and method for balancing TCP/IP/workload of multi-processor system based on hash buckets |
EP1358581A2 (en) | 2000-10-24 | 2003-11-05 | Koninklijke Philips Electronics N.V. | Method and device for prefetching a referenced resource |
US20020107962A1 (en) | 2000-11-07 | 2002-08-08 | Richter Roger K. | Single chassis network endpoint system with network processor for load balancing |
US7117372B1 (en) | 2000-11-28 | 2006-10-03 | Xilinx, Inc. | Programmable logic device with decryption and structure for preventing design relocation |
US6892279B2 (en) | 2000-11-30 | 2005-05-10 | Mosaid Technologies Incorporated | Method and apparatus for accelerating retrieval of data from a memory system with cache by reducing latency |
US6915502B2 (en) | 2001-01-03 | 2005-07-05 | University Of Southern California | System level applications of adaptive computing (SLAAC) technology |
WO2002059743A2 (en) | 2001-01-25 | 2002-08-01 | Improv Systems, Inc. | Compiler for multiple processor and distributed memory architectures |
US7155717B2 (en) | 2001-01-26 | 2006-12-26 | Intel Corporation | Apportioning a shared computer resource |
US6848103B2 (en) | 2001-02-16 | 2005-01-25 | Telefonaktiebolaget Lm Ericsson | Method and apparatus for processing data in a multi-processor environment |
GB2372847B (en) | 2001-02-19 | 2004-12-29 | Imagination Tech Ltd | Control of priority and instruction rates on a multithreaded processor |
US7139242B2 (en) | 2001-03-28 | 2006-11-21 | Proficient Networks, Inc. | Methods, apparatuses and systems facilitating deployment, support and configuration of network routing policies |
US7178145B2 (en) | 2001-06-29 | 2007-02-13 | Emc Corporation | Queues for soft affinity code threads and hard affinity code threads for allocation of processors to execute the threads in a multi-processor system |
US6912706B1 (en) | 2001-08-15 | 2005-06-28 | Xilinx, Inc. | Instruction processor and programmable logic device cooperative computing arrangement and method |
US7349414B2 (en) * | 2001-08-24 | 2008-03-25 | Optimum Communications Services, Inc. | System and method for maximizing the traffic delivery capacity of packet transport networks via real-time traffic pattern based optimization of transport capacity allocation |
US7165256B2 (en) | 2001-09-11 | 2007-01-16 | Sun Microsystems, Inc. | Task grouping in a distributed processing framework system and methods for implementing the same |
US7412492B1 (en) | 2001-09-12 | 2008-08-12 | Vmware, Inc. | Proportional share resource allocation with reduction of unproductive resource consumption |
TWI227616B (en) | 2001-11-20 | 2005-02-01 | Hitachi Ltd | Packet communication device, packet communication system, packet communication module, data processor and data transmission system |
US6986021B2 (en) | 2001-11-30 | 2006-01-10 | Quick Silver Technology, Inc. | Apparatus, method, system and executable module for configuration and operation of adaptive integrated circuitry having fixed, application specific computational elements |
US6605960B2 (en) | 2002-01-03 | 2003-08-12 | Altera Corporation | Programmable logic configuration device with configuration memory accessible to a second device |
US7028167B2 (en) | 2002-03-04 | 2006-04-11 | Hewlett-Packard Development Company, L.P. | Core parallel execution with different optimization characteristics to decrease dynamic execution path |
US7099813B2 (en) | 2002-04-09 | 2006-08-29 | Arm Limited | Simulating program instruction execution and hardware device operation |
GB0304628D0 (en) | 2003-02-28 | 2003-04-02 | Imec Inter Uni Micro Electr | Method for hardware-software multitasking on a reconfigurable computing platform |
EP1372084A3 (en) | 2002-05-31 | 2011-09-07 | Imec | Method for hardware-software multitasking on a reconfigurable computing platform |
US7631107B2 (en) * | 2002-06-11 | 2009-12-08 | Pandya Ashish A | Runtime adaptable protocol processor |
US7328314B2 (en) | 2002-06-19 | 2008-02-05 | Alcatel-Lucent Canada Inc. | Multiprocessor computing device having shared program memory |
US7093258B1 (en) | 2002-07-30 | 2006-08-15 | Unisys Corporation | Method and system for managing distribution of computer-executable program threads between central processing units in a multi-central processing unit computer system |
US8108656B2 (en) | 2002-08-29 | 2012-01-31 | Qst Holdings, Llc | Task definition for specifying resource requirements |
US7315897B1 (en) | 2002-09-13 | 2008-01-01 | Alcatel Lucent | Adaptable control plane architecture for a network element |
US7062606B2 (en) | 2002-11-01 | 2006-06-13 | Infineon Technologies Ag | Multi-threaded embedded processor using deterministic instruction memory to guarantee execution of pre-selected threads during blocking events |
JP2004171234A (en) | 2002-11-19 | 2004-06-17 | Toshiba Corp | Task allocation method in multiprocessor system, task allocation program and multiprocessor system |
US7171667B2 (en) | 2002-12-06 | 2007-01-30 | Agilemath, Inc. | System and method for allocating resources based on locally and globally determined priorities |
US7415540B2 (en) | 2002-12-31 | 2008-08-19 | Intel Corporation | Scheduling processing threads |
US7738496B1 (en) | 2002-12-31 | 2010-06-15 | Cypress Semiconductor Corporation | Device that provides the functionality of dual-ported memory using single-ported memory for multiple clock domains |
US20040158637A1 (en) | 2003-02-12 | 2004-08-12 | Lee Timothy Charles | Gated-pull load balancer |
US7290260B2 (en) | 2003-02-20 | 2007-10-30 | International Business Machines Corporation | Dynamic processor redistribution between partitions in a computing system |
US7450617B2 (en) | 2003-08-14 | 2008-11-11 | Broadcom Corporation | System and method for demultiplexing video signals |
US7058868B2 (en) | 2003-08-14 | 2006-06-06 | Broadcom Corporation | Scan testing mode control of gated clock signals for memory devices |
US7191329B2 (en) | 2003-03-05 | 2007-03-13 | Sun Microsystems, Inc. | Automated resource management using perceptron prediction |
US7502901B2 (en) | 2003-03-26 | 2009-03-10 | Panasonic Corporation | Memory replacement mechanism in semiconductor device |
US7627506B2 (en) | 2003-07-10 | 2009-12-01 | International Business Machines Corporation | Method of providing metered capacity of temporary computer resources |
US7093147B2 (en) | 2003-04-25 | 2006-08-15 | Hewlett-Packard Development Company, L.P. | Dynamically selecting processor cores for overall power efficiency |
US7469311B1 (en) * | 2003-05-07 | 2008-12-23 | Nvidia Corporation | Asymmetrical bus |
US7177961B2 (en) | 2003-05-12 | 2007-02-13 | International Business Machines Corporation | Managing access, by operating system images of a computing environment, of input/output resources of the computing environment |
US7996839B2 (en) | 2003-07-16 | 2011-08-09 | Hewlett-Packard Development Company, L.P. | Heterogeneous processor core systems for improved throughput |
US7200837B2 (en) | 2003-08-21 | 2007-04-03 | Qst Holdings, Llc | System, method and software for static and dynamic programming and configuration of an adaptive computing architecture |
US20050055694A1 (en) | 2003-09-04 | 2005-03-10 | Hewlett-Packard Development Company, Lp | Dynamic load balancing resource allocation |
US7478390B2 (en) | 2003-09-25 | 2009-01-13 | International Business Machines Corporation | Task queue management of virtual devices using a plurality of processors |
US20050080999A1 (en) | 2003-10-08 | 2005-04-14 | Fredrik Angsmark | Memory interface for systems with multiple processors and one memory system |
DE10353268B3 (en) | 2003-11-14 | 2005-07-28 | Infineon Technologies Ag | Parallel multi-thread processor with divided contexts has thread control unit that generates multiplexed control signals for switching standard processor body units to context memories to minimize multi-thread processor blocking probability |
US7437730B2 (en) | 2003-11-14 | 2008-10-14 | International Business Machines Corporation | System and method for providing a scalable on demand hosting system |
US7461376B2 (en) | 2003-11-18 | 2008-12-02 | Unisys Corporation | Dynamic resource management system and method for multiprocessor systems |
WO2005055058A1 (en) | 2003-12-04 | 2005-06-16 | Matsushita Electric Industrial Co., Ltd. | Task scheduling device, method, program, recording medium, and transmission medium for priority-driven periodic process scheduling |
US7802255B2 (en) | 2003-12-19 | 2010-09-21 | Stmicroelectronics, Inc. | Thread execution scheduler for multi-processing system and method |
US7380039B2 (en) | 2003-12-30 | 2008-05-27 | 3Tera, Inc. | Apparatus, method and system for aggregrating computing resources |
US7669035B2 (en) | 2004-01-21 | 2010-02-23 | The Charles Stark Draper Laboratory, Inc. | Systems and methods for reconfigurable computing |
US7565653B2 (en) | 2004-02-20 | 2009-07-21 | Sony Computer Entertainment Inc. | Methods and apparatus for processor task migration in a multi-processor system |
DE102004009497B3 (en) | 2004-02-27 | 2005-06-30 | Infineon Technologies Ag | Chip integrated multi-processor system e.g. for communications system, with 2 processors each having input/output coupled to common tightly-coupled memory |
DE102004009610B4 (en) | 2004-02-27 | 2007-08-16 | Infineon Technologies Ag | Heterogeneous Parallel Multithreaded Processor (HPMT) with Shared Contexts |
JP4171910B2 (en) | 2004-03-17 | 2008-10-29 | 日本電気株式会社 | Parallel processing system and parallel processing program |
US7257811B2 (en) | 2004-05-11 | 2007-08-14 | International Business Machines Corporation | System, method and program to migrate a virtual machine |
US7444454B2 (en) | 2004-05-11 | 2008-10-28 | L-3 Communications Integrated Systems L.P. | Systems and methods for interconnection of multiple FPGA devices |
US7112997B1 (en) | 2004-05-19 | 2006-09-26 | Altera Corporation | Apparatus and methods for multi-gate silicon-on-insulator transistors |
US7512813B2 (en) | 2004-05-28 | 2009-03-31 | International Business Machines Corporation | Method for system level protection of field programmable logic devices |
US7861063B1 (en) | 2004-06-30 | 2010-12-28 | Oracle America, Inc. | Delay slot handling in a processor |
JP4546775B2 (en) | 2004-06-30 | 2010-09-15 | 富士通株式会社 | Reconfigurable circuit capable of time-division multiplex processing |
US7720063B2 (en) | 2004-07-02 | 2010-05-18 | Vt Idirect, Inc. | Method apparatus and system for accelerated communication |
US8429660B2 (en) | 2004-08-23 | 2013-04-23 | Goldman, Sachs & Co. | Systems and methods to allocate application tasks to a pool of processing machines |
US7634774B2 (en) | 2004-09-13 | 2009-12-15 | Integrated Device Technology, Inc. | System and method of scheduling computing threads |
US7543091B2 (en) | 2004-09-22 | 2009-06-02 | Kabushiki Kaisha Toshiba | Self-organized parallel processing system |
JP4405884B2 (en) | 2004-09-22 | 2010-01-27 | キヤノン株式会社 | Drawing processing circuit and image output control device |
US8015392B2 (en) * | 2004-09-29 | 2011-09-06 | Intel Corporation | Updating instructions to free core in multi-core processor with core sequence table indicating linking of thread sequences for processing queued packets |
US8230426B2 (en) | 2004-10-06 | 2012-07-24 | Digipede Technologies, Llc | Multicore distributed processing system using selection of available workunits based on the comparison of concurrency attributes with the parallel processing characteristics |
WO2006040903A1 (en) | 2004-10-14 | 2006-04-20 | Tokyo Denki University | Exchange node and exchange node control method |
US20060136606A1 (en) | 2004-11-19 | 2006-06-22 | Guzy D J | Logic device comprising reconfigurable core logic for use in conjunction with microprocessor-based computer systems |
US7765547B2 (en) | 2004-11-24 | 2010-07-27 | Maxim Integrated Products, Inc. | Hardware multithreading systems with state registers having thread profiling data |
JP4606142B2 (en) | 2004-12-01 | 2011-01-05 | 株式会社ソニー・コンピュータエンタテインメント | Scheduling method, scheduling apparatus, and multiprocessor system |
US7665092B1 (en) | 2004-12-15 | 2010-02-16 | Sun Microsystems, Inc. | Method and apparatus for distributed state-based load balancing between task queues |
US7707578B1 (en) | 2004-12-16 | 2010-04-27 | Vmware, Inc. | Mechanism for scheduling execution of threads for fair resource allocation in a multi-threaded and/or multi-core processing system |
US7478097B2 (en) | 2005-01-31 | 2009-01-13 | Cassatt Corporation | Application governor providing application-level autonomic control within a distributed computing system |
US7631130B2 (en) | 2005-02-04 | 2009-12-08 | Mips Technologies, Inc | Barrel-incrementer-based round-robin apparatus and instruction dispatch scheduler employing same for use in multithreading microprocessor |
US20060212870A1 (en) * | 2005-02-25 | 2006-09-21 | International Business Machines Corporation | Association of memory access through protection attributes that are associated to an access control level on a PCI adapter that supports virtualization |
JP4757648B2 (en) | 2005-03-03 | 2011-08-24 | 日本電気株式会社 | Processing apparatus and failure recovery method thereof |
US7971072B1 (en) | 2005-03-10 | 2011-06-28 | Xilinx, Inc. | Secure exchange of IP cores |
US7581079B2 (en) | 2005-03-28 | 2009-08-25 | Gerald George Pechanek | Processor composed of memory nodes that execute memory access instructions and cooperate with execution nodes to execute function instructions |
US8762595B1 (en) * | 2005-04-05 | 2014-06-24 | Oracle America, Inc. | Method for sharing interfaces among multiple domain environments with enhanced hooks for exclusiveness |
US20060242647A1 (en) | 2005-04-21 | 2006-10-26 | Kimbrel Tracy J | Dynamic application placement under service and memory constraints |
CN101176061A (en) | 2005-04-22 | 2008-05-07 | Nxp股份有限公司 | Implementation of multi-tasking on a digital signal processor |
US7908606B2 (en) | 2005-05-20 | 2011-03-15 | Unisys Corporation | Usage metering system |
US8893016B2 (en) | 2005-06-10 | 2014-11-18 | Nvidia Corporation | Using a graphics system to enable a multi-user computer system |
US7743001B1 (en) | 2005-06-21 | 2010-06-22 | Amazon Technologies, Inc. | Method and system for dynamic pricing of web services utilization |
US7805706B1 (en) | 2005-06-21 | 2010-09-28 | Unisys Corporation | Process for optimizing software components for an enterprise resource planning (ERP) application SAP on multiprocessor servers |
US7389403B1 (en) | 2005-08-10 | 2008-06-17 | Sun Microsystems, Inc. | Adaptive computing ensemble microprocessor architecture |
US8429630B2 (en) | 2005-09-15 | 2013-04-23 | Ca, Inc. | Globally distributed utility computing cloud |
US7412353B2 (en) | 2005-09-28 | 2008-08-12 | Intel Corporation | Reliable computing with a many-core processor |
JP4777994B2 (en) | 2005-09-29 | 2011-09-21 | 富士通株式会社 | Multi-core processor |
GB0519981D0 (en) * | 2005-09-30 | 2005-11-09 | Ignios Ltd | Scheduling in a multicore architecture |
WO2008057071A2 (en) | 2005-10-11 | 2008-05-15 | California Institute Of Technology | Function mapping based on measured delays of a reconfigurable circuit |
US8144149B2 (en) | 2005-10-14 | 2012-03-27 | Via Technologies, Inc. | System and method for dynamically load balancing multiple shader stages in a shared pool of processing units |
US8060610B1 (en) | 2005-10-28 | 2011-11-15 | Hewlett-Packard Development Company, L.P. | Multiple server workload management using instant capacity processors |
US7447873B1 (en) * | 2005-11-29 | 2008-11-04 | Nvidia Corporation | Multithreaded SIMD parallel processor with loading of groups of threads |
US7730261B1 (en) | 2005-12-20 | 2010-06-01 | Marvell International Ltd. | Multicore memory management system |
US7616642B2 (en) | 2006-01-04 | 2009-11-10 | Sap Ag | Priority assignment and transmission of sensor data |
US20070198981A1 (en) | 2006-02-17 | 2007-08-23 | Jacobs Paul E | System and method for multi-processor application support |
US7774590B2 (en) | 2006-03-23 | 2010-08-10 | Intel Corporation | Resiliently retaining state information of a many-core processor |
US8032889B2 (en) | 2006-04-05 | 2011-10-04 | Maxwell Technologies, Inc. | Methods and apparatus for managing and controlling power consumption and heat generation in computer systems |
US8001549B2 (en) | 2006-04-27 | 2011-08-16 | Panasonic Corporation | Multithreaded computer system and multithread execution control method |
US20070283311A1 (en) | 2006-05-30 | 2007-12-06 | Theodore Karoubalis | Method and system for dynamic reconfiguration of field programmable gate arrays |
US7406407B2 (en) | 2006-06-01 | 2008-07-29 | Microsoft Corporation | Virtual machine for operating N-core application on M-core processor |
US8713574B2 (en) * | 2006-06-05 | 2014-04-29 | International Business Machines Corporation | Soft co-processors to provide a software service function off-load architecture in a multi-core processing environment |
JP4936517B2 (en) | 2006-06-06 | 2012-05-23 | 学校法人早稲田大学 | Control method for heterogeneous multiprocessor system and multi-grain parallelizing compiler |
EP1868094B1 (en) | 2006-06-12 | 2016-07-13 | Samsung Electronics Co., Ltd. | Multitasking method and apparatus for reconfigurable array |
KR100753421B1 (en) | 2006-06-19 | 2007-08-31 | 주식회사 하이닉스반도체 | Address latch circuit of semiconductor memory device |
WO2008014494A2 (en) | 2006-07-28 | 2008-01-31 | Drc Computer Corporation | Fpga co-processor for accelerated computation |
US20080046997A1 (en) | 2006-08-21 | 2008-02-21 | Guardtec Industries, Llc | Data safe box enforced by a storage device controller on a per-region basis for improved computer security |
US8312120B2 (en) | 2006-08-22 | 2012-11-13 | Citrix Systems, Inc. | Systems and methods for providing dynamic spillover of virtual servers based on bandwidth |
GB0618894D0 (en) | 2006-09-26 | 2006-11-01 | Ibm | An entitlement management system |
US20080086395A1 (en) | 2006-10-06 | 2008-04-10 | Brenner Larry B | Method and apparatus for frequency independent processor utilization recording register in a simultaneously multi-threaded processor |
GB2442984B (en) | 2006-10-17 | 2011-04-06 | Advanced Risc Mach Ltd | Handling of write access requests to shared memory in a data processing apparatus |
US8087029B1 (en) | 2006-10-23 | 2011-12-27 | Nvidia Corporation | Thread-type-based load balancing in a multithreaded processor |
US7698541B1 (en) | 2006-10-31 | 2010-04-13 | Netapp, Inc. | System and method for isochronous task switching via hardware scheduling |
US8429656B1 (en) | 2006-11-02 | 2013-04-23 | Nvidia Corporation | Thread count throttling for efficient resource utilization |
US8539207B1 (en) * | 2006-11-03 | 2013-09-17 | Nvidia Corporation | Lattice-based computations on a parallel processor |
US8326819B2 (en) * | 2006-11-13 | 2012-12-04 | Exegy Incorporated | Method and system for high performance data metatagging and data indexing using coprocessors |
WO2008061162A1 (en) | 2006-11-14 | 2008-05-22 | Star Bridge Systems, Inc. | Hybrid computing platform having fpga components with embedded processors |
US7992151B2 (en) | 2006-11-30 | 2011-08-02 | Intel Corporation | Methods and apparatuses for core allocations |
US7598766B2 (en) | 2007-01-09 | 2009-10-06 | University Of Washington | Customized silicon chips produced using dynamically configurable polymorphic network |
US8407658B2 (en) | 2007-02-01 | 2013-03-26 | International Business Machines Corporation | Methods, systems, and computer program products for using direct memory access to initialize a programmable logic device |
KR100893527B1 (en) | 2007-02-02 | 2009-04-17 | 삼성전자주식회사 | Method of mapping and scheduling of reconfigurable multi-processor system |
US7818699B1 (en) * | 2007-02-14 | 2010-10-19 | Xilinx, Inc. | Dynamic core pipeline |
US7685409B2 (en) | 2007-02-21 | 2010-03-23 | Qualcomm Incorporated | On-demand multi-thread multimedia processor |
US8447933B2 (en) | 2007-03-06 | 2013-05-21 | Nec Corporation | Memory access control system, memory access control method, and program thereof |
US8185899B2 (en) | 2007-03-07 | 2012-05-22 | International Business Machines Corporation | Prediction based priority scheduling |
EP2119134B1 (en) | 2007-03-12 | 2012-05-30 | Citrix Systems, Inc. | Systems and methods for dynamic bandwidth control by proxy |
US8510741B2 (en) | 2007-03-28 | 2013-08-13 | Massachusetts Institute Of Technology | Computing the processor desires of jobs in an adaptively parallel scheduling environment |
US9195462B2 (en) | 2007-04-11 | 2015-11-24 | Freescale Semiconductor, Inc. | Techniques for tracing processes in a multi-threaded processor |
US8279865B2 (en) | 2007-04-20 | 2012-10-02 | John Giacomoni | Efficient pipeline parallelism using frame shared memory |
US8024731B1 (en) | 2007-04-25 | 2011-09-20 | Apple Inc. | Assigning priorities to threads of execution |
US8046766B2 (en) | 2007-04-26 | 2011-10-25 | Hewlett-Packard Development Company, L.P. | Process assignment to physical processors using minimum and maximum processor shares |
US9405585B2 (en) | 2007-04-30 | 2016-08-02 | International Business Machines Corporation | Management of heterogeneous workloads |
US7814295B2 (en) | 2007-05-18 | 2010-10-12 | International Business Machines Corporation | Moving processing operations from one MIMD booted SIMD partition to another to enlarge a SIMD partition |
US7518396B1 (en) | 2007-06-25 | 2009-04-14 | Xilinx, Inc. | Apparatus and method for reconfiguring a programmable logic device |
US20090025004A1 (en) | 2007-07-16 | 2009-01-22 | Microsoft Corporation | Scheduling by Growing and Shrinking Resource Allocation |
US8544014B2 (en) | 2007-07-24 | 2013-09-24 | Microsoft Corporation | Scheduling threads in multi-core systems |
US8280974B2 (en) | 2007-07-31 | 2012-10-02 | Hewlett-Packard Development Company, L.P. | Migrating workloads using networked attached memory |
US8374929B1 (en) | 2007-08-06 | 2013-02-12 | Gogrid, LLC | System and method for billing for hosted services |
US8234652B2 (en) | 2007-08-28 | 2012-07-31 | International Business Machines Corporation | Performing setup operations for receiving different amounts of data while processors are performing message passing interface tasks |
US20090070762A1 (en) | 2007-09-06 | 2009-03-12 | Franaszek Peter A | System and method for event-driven scheduling of computing jobs on a multi-threaded machine using delay-costs |
US8136153B2 (en) * | 2007-11-08 | 2012-03-13 | Samsung Electronics Co., Ltd. | Securing CPU affinity in multiprocessor architectures |
WO2009059377A1 (en) | 2007-11-09 | 2009-05-14 | Manjrosoft Pty Ltd | Software platform and system for grid computing |
US7603428B2 (en) | 2008-02-05 | 2009-10-13 | Raptor Networks Technology, Inc. | Software application striping |
US7996346B2 (en) | 2007-12-19 | 2011-08-09 | International Business Machines Corporation | Method for autonomic workload distribution on a multicore processor |
US9342363B2 (en) | 2008-01-08 | 2016-05-17 | International Business Machines Corporation | Distributed online optimization for latency assignment and slicing |
US8185718B2 (en) | 2008-02-04 | 2012-05-22 | Mediatek Inc. | Code memory capable of code provision for a plurality of physical channels |
FR2927438B1 (en) | 2008-02-08 | 2010-03-05 | Commissariat Energie Atomique | METHOD FOR PRECHARGING IN A MEMORY HIERARCHY CONFIGURATIONS OF A RECONFIGURABLE HETEROGENETIC INFORMATION PROCESSING SYSTEM |
US8145894B1 (en) | 2008-02-25 | 2012-03-27 | Drc Computer Corporation | Reconfiguration of an accelerator module having a programmable logic device |
US7765512B1 (en) | 2008-03-25 | 2010-07-27 | Xilinx, Inc. | Relocatable circuit implemented in a programmable logic device |
US8255917B2 (en) | 2008-04-21 | 2012-08-28 | Hewlett-Packard Development Company, L.P. | Auto-configuring workload management system |
US9058483B2 (en) | 2008-05-08 | 2015-06-16 | Google Inc. | Method for validating an untrusted native code module |
US8195896B2 (en) | 2008-06-10 | 2012-06-05 | International Business Machines Corporation | Resource sharing techniques in a parallel processing computing system utilizing locks by replicating or shadowing execution contexts |
US20090320031A1 (en) | 2008-06-19 | 2009-12-24 | Song Justin J | Power state-aware thread scheduling mechanism |
US8050256B1 (en) | 2008-07-08 | 2011-11-01 | Tilera Corporation | Configuring routing in mesh networks |
US8151349B1 (en) | 2008-07-21 | 2012-04-03 | Google Inc. | Masking mechanism that facilitates safely executing untrusted native code |
US20100043008A1 (en) | 2008-08-18 | 2010-02-18 | Benoit Marchand | Scalable Work Load Management on Multi-Core Computer Systems |
US8327126B2 (en) | 2008-08-25 | 2012-12-04 | International Business Machines Corporation | Multicore processor and method of use that adapts core functions based on workload execution |
US8018866B1 (en) | 2008-08-26 | 2011-09-13 | Juniper Networks, Inc. | Adaptively applying network acceleration services with an intermediate network device |
US9910708B2 (en) * | 2008-08-28 | 2018-03-06 | Red Hat, Inc. | Promotion of calculations to cloud-based computation resources |
US8261273B2 (en) | 2008-09-02 | 2012-09-04 | International Business Machines Corporation | Assigning threads and data of computer program within processor having hardware locality groups |
US7990974B1 (en) | 2008-09-29 | 2011-08-02 | Sonicwall, Inc. | Packet processing on a multi-core processor |
US8683471B2 (en) | 2008-10-02 | 2014-03-25 | Mindspeed Technologies, Inc. | Highly distributed parallel processing on multi-core device |
CN102171627A (en) | 2008-10-03 | 2011-08-31 | 悉尼大学 | Scheduling an application for performance on a heterogeneous computing system |
WO2010043401A2 (en) | 2008-10-15 | 2010-04-22 | Martin Vorbach | Data processing device |
US8181184B2 (en) | 2008-10-17 | 2012-05-15 | Harris Corporation | System and method for scheduling tasks in processing frames |
US8040808B1 (en) | 2008-10-20 | 2011-10-18 | Juniper Networks, Inc. | Service aware path selection with a network acceleration device |
US8806611B2 (en) | 2008-12-02 | 2014-08-12 | At&T Intellectual Property I, L.P. | Message administration system |
US9390130B2 (en) | 2008-12-12 | 2016-07-12 | Hewlett Packard Enterprise Development Lp | Workload management in a parallel database system |
US8370493B2 (en) | 2008-12-12 | 2013-02-05 | Amazon Technologies, Inc. | Saving program execution state |
US8249904B1 (en) * | 2008-12-12 | 2012-08-21 | Amazon Technologies, Inc. | Managing use of program execution capacity |
US8528001B2 (en) | 2008-12-15 | 2013-09-03 | Oracle America, Inc. | Controlling and dynamically varying automatic parallelization |
US9507640B2 (en) | 2008-12-16 | 2016-11-29 | International Business Machines Corporation | Multicore processor and method of use that configures core functions based on executing instructions |
US8495342B2 (en) | 2008-12-16 | 2013-07-23 | International Business Machines Corporation | Configuring plural cores to perform an instruction having a multi-core characteristic |
US8370318B2 (en) | 2008-12-19 | 2013-02-05 | Oracle International Corporation | Time limited lock ownership |
US20100162230A1 (en) | 2008-12-24 | 2010-06-24 | Yahoo! Inc. | Distributed computing system for large-scale data handling |
US8245173B2 (en) | 2009-01-26 | 2012-08-14 | International Business Machines Corporation | Scheduling for parallel processing of regionally-constrained placement problem |
US20100228951A1 (en) | 2009-03-05 | 2010-09-09 | Xerox Corporation | Parallel processing management framework |
US8194593B2 (en) | 2009-03-11 | 2012-06-05 | Sony Corporation | Quality of service architecture for home mesh network |
US8131970B2 (en) | 2009-04-21 | 2012-03-06 | Empire Technology Development Llc | Compiler based cache allocation |
US8515965B2 (en) * | 2010-05-18 | 2013-08-20 | Lsi Corporation | Concurrent linked-list traversal for real-time hash processing in multi-core, multi-thread network processors |
US20100287320A1 (en) | 2009-05-06 | 2010-11-11 | Lsi Corporation | Interprocessor Communication Architecture |
US8296434B1 (en) * | 2009-05-28 | 2012-10-23 | Amazon Technologies, Inc. | Providing dynamically scaling computing load balancing |
US8018961B2 (en) | 2009-06-22 | 2011-09-13 | Citrix Systems, Inc. | Systems and methods for receive and transmission queue processing in a multi-core architecture |
US8429652B2 (en) * | 2009-06-22 | 2013-04-23 | Citrix Systems, Inc. | Systems and methods for spillover in a multi-core system |
US8495643B2 (en) | 2009-06-30 | 2013-07-23 | International Business Machines Corporation | Message selection based on time stamp and priority in a multithreaded processor |
US8412151B2 (en) | 2009-07-16 | 2013-04-02 | Cox Communications, Inc. | Payback calling plan |
US8561183B2 (en) | 2009-07-31 | 2013-10-15 | Google Inc. | Native code module security for arm instruction set architectures |
US8245234B2 (en) | 2009-08-10 | 2012-08-14 | Avaya Inc. | Credit scheduler for ordering the execution of tasks |
US8560758B2 (en) | 2009-08-24 | 2013-10-15 | Red Hat Israel, Ltd. | Mechanism for out-of-synch virtual machine memory management optimization |
US8310492B2 (en) | 2009-09-03 | 2012-11-13 | ATI Technologies ULC | Hardware-based scheduling of GPU work |
JP2011065645A (en) | 2009-09-18 | 2011-03-31 | Square Enix Co Ltd | Multi-core processor system |
US8174287B2 (en) | 2009-09-23 | 2012-05-08 | Avaya Inc. | Processor programmable PLD device |
US8352609B2 (en) * | 2009-09-29 | 2013-01-08 | Amazon Technologies, Inc. | Dynamically modifying program execution capacity |
JP4931978B2 (en) | 2009-10-06 | 2012-05-16 | International Business Machines Corporation | Parallelization processing method, system, and program |
CA2695564C (en) | 2010-02-26 | 2017-05-30 | Lesley Lorraine Shannon | Modular re-configurable profiling core for multiprocessor systems-on-chip |
US8566836B2 (en) | 2009-11-13 | 2013-10-22 | Freescale Semiconductor, Inc. | Multi-core system on chip |
US9292662B2 (en) | 2009-12-17 | 2016-03-22 | International Business Machines Corporation | Method of exploiting spare processors to reduce energy consumption |
US9032411B2 (en) | 2009-12-25 | 2015-05-12 | International Business Machines Corporation | Logical extended map to demonstrate core activity including L2 and L3 cache hit and miss ratio |
US8572622B2 (en) | 2009-12-30 | 2013-10-29 | International Business Machines Corporation | Reducing queue synchronization of multiple work items in a system with high memory latency between processing nodes |
US8549363B2 (en) | 2010-01-08 | 2013-10-01 | International Business Machines Corporation | Reliability and performance of a system-on-a-chip by predictive wear-out based activation of functional components |
TWI447645B (en) | 2010-02-11 | 2014-08-01 | Univ Nat Chiao Tung | A dynamically reconfigurable heterogeneous with load balancing architecture and method |
US9122538B2 (en) * | 2010-02-22 | 2015-09-01 | Virtustream, Inc. | Methods and apparatus related to management of unit-based virtual resources within a data center environment |
JP5504985B2 (en) * | 2010-03-11 | 2014-05-28 | Fuji Xerox Co., Ltd. | Data processing device |
US9141580B2 (en) | 2010-03-23 | 2015-09-22 | Citrix Systems, Inc. | Systems and methods for monitoring and maintaining consistency of a configuration |
WO2011123467A1 (en) | 2010-03-29 | 2011-10-06 | Amazon Technologies, Inc. | Managing committed request rates for shared resources |
JP5671327B2 (en) | 2010-03-31 | 2015-02-18 | Canon Inc. | Communication processing apparatus and communication processing method |
US8681619B2 (en) | 2010-04-08 | 2014-03-25 | Landis+Gyr Technologies, Llc | Dynamic modulation selection |
US20110258317A1 (en) | 2010-04-19 | 2011-10-20 | Microsoft Corporation | Application sla based dynamic, elastic, and adaptive provisioning of network capacity |
US9141350B2 (en) | 2010-04-23 | 2015-09-22 | Vector Fabrics B.V. | Embedded system performance |
EP2387270A1 (en) | 2010-05-12 | 2011-11-16 | Nokia Siemens Networks Oy | Radio link failure recovery control in communication network having relay nodes |
US8738333B1 (en) | 2010-05-25 | 2014-05-27 | Vmware, Inc. | Capacity and load analysis in a datacenter |
GB201008819D0 (en) | 2010-05-26 | 2010-07-14 | Zeus Technology Ltd | Apparatus for routing requests |
US9934079B2 (en) | 2010-05-27 | 2018-04-03 | International Business Machines Corporation | Fast remote communication and computation between processors using store and load operations on direct core-to-core memory |
US20110307661A1 (en) | 2010-06-09 | 2011-12-15 | International Business Machines Corporation | Multi-processor chip with shared fpga execution unit and a design structure thereof |
US8626970B2 (en) | 2010-06-23 | 2014-01-07 | International Business Machines Corporation | Controlling access by a configuration to an adapter function |
US8627329B2 (en) * | 2010-06-24 | 2014-01-07 | International Business Machines Corporation | Multithreaded physics engine with predictive load balancing |
US8719415B1 (en) | 2010-06-28 | 2014-05-06 | Amazon Technologies, Inc. | Use of temporarily available computing nodes for dynamic scaling of a cluster |
US8352611B2 (en) | 2010-06-29 | 2013-01-08 | International Business Machines Corporation | Allocating computer resources in a cloud environment |
US8516272B2 (en) | 2010-06-30 | 2013-08-20 | International Business Machines Corporation | Secure dynamically reconfigurable logic |
WO2012003486A1 (en) * | 2010-07-01 | 2012-01-05 | Neodana, Inc. | A system and method for virtualization and cloud security |
US8566837B2 (en) | 2010-07-16 | 2013-10-22 | International Business Machines Corporation | Dynamic run time allocation of distributed jobs with application specific metrics |
US8484287B2 (en) * | 2010-08-05 | 2013-07-09 | Citrix Systems, Inc. | Systems and methods for cookie proxy jar management across cores in a multi-core system |
US11061459B2 (en) * | 2010-08-23 | 2021-07-13 | L. Pierre de Rochemont | Hybrid computing module |
US20120079501A1 (en) | 2010-09-27 | 2012-03-29 | Mark Henrik Sandstrom | Application Load Adaptive Processing Resource Allocation |
WO2012031362A1 (en) | 2010-09-07 | 2012-03-15 | Corporation de l'Ecole Polytechnique de Montreal | Methods, apparatus and system to support large-scale micro-systems including embedded and distributed power supply, thermal regulation, multi-distributed-sensors and electrical signal propagation |
US8612330B1 (en) | 2010-09-14 | 2013-12-17 | Amazon Technologies, Inc. | Managing bandwidth for shared resources |
US8789170B2 (en) | 2010-09-24 | 2014-07-22 | Intel Corporation | Method for enforcing resource access control in computer systems |
US10013662B2 (en) * | 2010-09-30 | 2018-07-03 | Amazon Technologies, Inc. | Virtual resource cost tracking with dedicated implementation resources |
US8489787B2 (en) | 2010-10-12 | 2013-07-16 | International Business Machines Corporation | Sharing sampled instruction address registers for efficient instruction sampling in massively multithreaded processors |
US8738860B1 (en) * | 2010-10-25 | 2014-05-27 | Tilera Corporation | Computing in parallel processing environments |
US8881141B2 (en) | 2010-12-08 | 2014-11-04 | International Business Machines Corporation | Virtualization of hardware queues in self-virtualizing input/output devices |
US20120151479A1 (en) | 2010-12-10 | 2012-06-14 | Salesforce.Com, Inc. | Horizontal splitting of tasks within a homogenous pool of virtual machines |
US9507632B2 (en) | 2010-12-15 | 2016-11-29 | Advanced Micro Devices, Inc. | Preemptive context switching of processes on an accelerated processing device (APD) based on time quanta |
US9645854B2 (en) | 2010-12-15 | 2017-05-09 | Advanced Micro Devices, Inc. | Dynamic work partitioning on heterogeneous processing devices |
US8918784B1 (en) * | 2010-12-21 | 2014-12-23 | Amazon Technologies, Inc. | Providing service quality levels through CPU scheduling |
US8789065B2 (en) | 2012-06-08 | 2014-07-22 | Throughputer, Inc. | System and method for input data load adaptive parallel processing |
US20130117168A1 (en) * | 2011-11-04 | 2013-05-09 | Mark Henrik Sandstrom | Maximizing Throughput of Multi-user Parallel Data Processing Systems |
WO2012098684A1 (en) | 2011-01-21 | 2012-07-26 | Fujitsu Limited | Scheduling method and scheduling system |
US8645745B2 (en) | 2011-02-24 | 2014-02-04 | International Business Machines Corporation | Distributed job scheduling in a multi-nodal environment |
US8850574B1 (en) | 2011-02-28 | 2014-09-30 | Google Inc. | Safe self-modifying code |
CN108376097B (en) | 2011-03-25 | 2022-04-15 | Intel Corporation | Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines |
US8695009B2 (en) * | 2011-04-18 | 2014-04-08 | Microsoft Corporation | Allocating tasks to machines in computing clusters |
RU2011117765A (en) | 2011-05-05 | 2012-11-10 | LSI Corporation (US) | DEVICE (OPTIONS) AND METHOD FOR IMPLEMENTING TWO-PASS PLANNER OF LINEAR COMPLEXITY TASKS |
US8868894B2 (en) | 2011-05-06 | 2014-10-21 | Xcelemor, Inc. | Computing system with hardware scheduled reconfiguration mechanism and method of operation thereof |
US9218195B2 (en) | 2011-05-17 | 2015-12-22 | International Business Machines Corporation | Vendor-independent resource configuration interface for self-virtualizing input/output device |
US20120303809A1 (en) | 2011-05-25 | 2012-11-29 | Microsoft Corporation | Offloading load balancing packet modification |
US10061618B2 (en) * | 2011-06-16 | 2018-08-28 | Imagination Technologies Limited | Scheduling heterogenous computation on multithreaded processors |
GB2529075A (en) * | 2011-06-16 | 2016-02-10 | Imagination Tech Ltd | Graphics processor with non-blocking concurrent architecture |
US9411636B1 (en) * | 2011-07-07 | 2016-08-09 | Emc Corporation | Multi-tasking real-time kernel threads used in multi-threaded network processing |
US8745626B1 (en) | 2012-12-17 | 2014-06-03 | Throughputer, Inc. | Scheduling application instances to configurable processing cores based on application requirements and resource specification |
US8935491B2 (en) * | 2011-07-15 | 2015-01-13 | Throughputer, Inc. | Memory architecture for dynamically allocated manycore processor |
US9448847B2 (en) | 2011-07-15 | 2016-09-20 | Throughputer, Inc. | Concurrent program execution optimization |
US8793698B1 (en) * | 2013-02-21 | 2014-07-29 | Throughputer, Inc. | Load balancer for parallel processors |
US8713572B2 (en) * | 2011-09-15 | 2014-04-29 | International Business Machines Corporation | Methods, systems, and physical computer storage media for processing a plurality of input/output request jobs |
WO2013081556A1 (en) | 2011-12-01 | 2013-06-06 | National University Of Singapore | Polymorphic heterogeneous multi-core architecture |
CN103959245B (en) | 2011-12-02 | 2016-08-24 | Empire Technology Development LLC | Integrated circuit as a service |
US9372735B2 (en) | 2012-01-09 | 2016-06-21 | Microsoft Technology Licensing, Llc | Auto-scaling of pool of virtual machines based on auto-scaling rules of user associated with the pool |
US9769292B2 (en) * | 2012-01-19 | 2017-09-19 | Miosoft Corporation | Concurrent process execution |
CA2865930C (en) * | 2012-03-01 | 2016-04-19 | Cirba Inc. | System and method for providing a capacity reservation system for a virtual or cloud computing environment |
US10650452B2 (en) * | 2012-03-27 | 2020-05-12 | Ip Reservoir, Llc | Offload processing of data packets |
US9985848B1 (en) | 2012-03-27 | 2018-05-29 | Amazon Technologies, Inc. | Notification based pricing of excess cloud capacity |
WO2013154539A1 (en) * | 2012-04-10 | 2013-10-17 | Empire Technology Development LLC | Balanced processing using heterogeneous cores |
US9875204B2 (en) * | 2012-05-18 | 2018-01-23 | Dell Products, Lp | System and method for providing a processing node with input/output functionality provided by an I/O complex switch |
US9348724B2 (en) * | 2012-05-21 | 2016-05-24 | Hitachi, Ltd. | Method and apparatus for maintaining a workload service level on a converged platform |
US20130339977A1 (en) | 2012-06-19 | 2013-12-19 | Jack B. Dennis | Managing task load in a multiprocessing environment |
US9104453B2 (en) | 2012-06-21 | 2015-08-11 | International Business Machines Corporation | Determining placement fitness for partitions under a hypervisor |
US8972640B2 (en) * | 2012-06-27 | 2015-03-03 | Intel Corporation | Controlling a physical link of a first protocol using an extended capability structure of a second protocol |
MY169964A (en) * | 2012-06-29 | 2019-06-19 | Intel Corp | An architected protocol for changing link operating mode |
US9569279B2 (en) | 2012-07-31 | 2017-02-14 | Nvidia Corporation | Heterogeneous multiprocessor design for power-efficient and area-efficient computing |
DE102012017339B4 (en) * | 2012-08-31 | 2014-12-24 | Airbus Defence and Space GmbH | Computer system |
CN103677752B (en) * | 2012-09-19 | 2017-02-08 | Tencent Technology (Shenzhen) Co., Ltd. | Distributed data based concurrent processing method and system |
US9582287B2 (en) | 2012-09-27 | 2017-02-28 | Intel Corporation | Processor having multiple cores, shared core extension logic, and shared core extension utilization instructions |
US9160617B2 (en) | 2012-09-28 | 2015-10-13 | International Business Machines Corporation | Faulty core recovery mechanisms for a three-dimensional network on a processor array |
US9223635B2 (en) * | 2012-10-28 | 2015-12-29 | Citrix Systems, Inc. | Network offering in cloud computing environment |
US20140380025A1 (en) | 2013-01-23 | 2014-12-25 | Empire Technology Development LLC | Management of hardware accelerator configurations in a processor chip |
US9608933B2 (en) * | 2013-01-24 | 2017-03-28 | Hitachi, Ltd. | Method and system for managing cloud computing environment |
US9971617B2 (en) * | 2013-03-15 | 2018-05-15 | Ampere Computing LLC | Virtual appliance on a chip |
US9697161B2 (en) * | 2013-03-22 | 2017-07-04 | STMicroelectronics (Grenoble) SAS | Method of handling transactions, corresponding system and computer program product |
US9519518B2 (en) * | 2013-05-15 | 2016-12-13 | Citrix Systems, Inc. | Systems and methods for deploying a spotted virtual server in a cluster system |
JP6102511B2 (en) | 2013-05-23 | 2017-03-29 | Fujitsu Limited | Integrated circuit, control apparatus, control method, and control program |
US8910109B1 (en) | 2013-08-12 | 2014-12-09 | Altera Corporation | System level tools to support FPGA partial reconfiguration |
KR102130813B1 (en) | 2013-10-08 | 2020-07-06 | Samsung Electronics Co., Ltd. | Re-configurable processor and method for operating re-configurable processor |
US9417876B2 (en) | 2014-03-27 | 2016-08-16 | International Business Machines Corporation | Thread context restoration in a multithreading computer system |
US9503093B2 (en) | 2014-04-24 | 2016-11-22 | Xilinx, Inc. | Virtualization of programmable integrated circuits |
US9851998B2 (en) | 2014-07-30 | 2017-12-26 | Microsoft Technology Licensing, Llc | Hypervisor-hosted virtual machine forensics |
US20160087849A1 (en) | 2014-09-24 | 2016-03-24 | Infinera Corporation | Planning and reconfiguring a multilayer network |
US9483291B1 (en) | 2015-01-29 | 2016-11-01 | Altera Corporation | Hierarchical accelerator registry for optimal performance predictability in network function virtualization |
US10101981B2 (en) | 2015-05-08 | 2018-10-16 | Citrix Systems, Inc. | Auto discovery and configuration of services in a load balancing appliance |
US9589088B1 (en) | 2015-06-22 | 2017-03-07 | Xilinx, Inc. | Partitioning memory in programmable integrated circuits |
US9959418B2 (en) | 2015-07-20 | 2018-05-01 | Intel Corporation | Supporting configurable security levels for memory address ranges |
US10273962B2 (en) * | 2016-09-26 | 2019-04-30 | Caterpillar Inc. | System for selectively bypassing fluid supply to one or more operational systems of a machine |
US10223317B2 (en) | 2016-09-28 | 2019-03-05 | Amazon Technologies, Inc. | Configurable logic platform |
US10282330B2 (en) | 2016-09-29 | 2019-05-07 | Amazon Technologies, Inc. | Configurable logic platform with multiple reconfigurable regions |
US10515046B2 (en) * | 2017-07-01 | 2019-12-24 | Intel Corporation | Processors, methods, and systems with a configurable spatial accelerator |
EP3948555A4 (en) | 2019-03-29 | 2022-11-16 | Micron Technology, Inc. | Computational storage and networked based system |
2014
- 2014-06-27 US US14/318,512 patent/US9448847B2/en not_active Expired - Fee Related

2016
- 2016-09-16 US US15/267,153 patent/US10318353B2/en active Active

2019
- 2019-06-07 US US16/434,581 patent/US10942778B2/en active Active

2021
- 2021-03-08 US US17/195,174 patent/US11036556B1/en active Active
- 2021-06-10 US US17/344,636 patent/US11188388B2/en active Active
- 2021-08-31 US US17/463,098 patent/US11347556B2/en active Active
- 2021-09-09 US US17/470,926 patent/US11385934B2/en active Active

2022
- 2022-05-18 US US17/747,839 patent/US20220276903A1/en not_active Abandoned
- 2022-07-07 US US17/859,657 patent/US11500682B1/en active Active
- 2022-11-02 US US17/979,526 patent/US11816505B2/en active Active
- 2022-11-02 US US17/979,542 patent/US11687374B2/en active Active

2023
- 2023-03-02 US US18/116,389 patent/US11915055B2/en active Active
- 2023-12-22 US US18/394,944 patent/US20240126609A1/en active Pending
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11347556B2 (en) | 2013-08-23 | 2022-05-31 | Throughputer, Inc. | Configurable logic platform with reconfigurable processing circuitry |
US11385934B2 (en) | 2013-08-23 | 2022-07-12 | Throughputer, Inc. | Configurable logic platform with reconfigurable processing circuitry |
US11500682B1 (en) | 2013-08-23 | 2022-11-15 | Throughputer, Inc. | Configurable logic platform with reconfigurable processing circuitry |
US11687374B2 (en) | 2013-08-23 | 2023-06-27 | Throughputer, Inc. | Configurable logic platform with reconfigurable processing circuitry |
US11816505B2 (en) | 2013-08-23 | 2023-11-14 | Throughputer, Inc. | Configurable logic platform with reconfigurable processing circuitry |
US11915055B2 (en) | 2013-08-23 | 2024-02-27 | Throughputer, Inc. | Configurable logic platform with reconfigurable processing circuitry |
Also Published As
Publication number | Publication date |
---|---|
US20150058857A1 (en) | 2015-02-26 |
US20240126609A1 (en) | 2024-04-18 |
US20220276903A1 (en) | 2022-09-01 |
US11915055B2 (en) | 2024-02-27 |
US9448847B2 (en) | 2016-09-20 |
US20220342715A1 (en) | 2022-10-27 |
US10318353B2 (en) | 2019-06-11 |
US20230267008A1 (en) | 2023-08-24 |
US20230053365A1 (en) | 2023-02-23 |
US20190361745A1 (en) | 2019-11-28 |
US11036556B1 (en) | 2021-06-15 |
US20210397484A1 (en) | 2021-12-23 |
US20230046107A1 (en) | 2023-02-16 |
US20210406083A1 (en) | 2021-12-30 |
US10942778B2 (en) | 2021-03-09 |
US20210303361A1 (en) | 2021-09-30 |
US11500682B1 (en) | 2022-11-15 |
US11188388B2 (en) | 2021-11-30 |
US20170004017A1 (en) | 2017-01-05 |
US11816505B2 (en) | 2023-11-14 |
US11347556B2 (en) | 2022-05-31 |
US11385934B2 (en) | 2022-07-12 |
US11687374B2 (en) | 2023-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11036556B1 (en) | 2021-06-15 | Concurrent program execution optimization |
US11150948B1 (en) | | Managing programmable logic-based processing unit allocation on a parallel data processing platform |
US10133599B1 (en) | | Application load adaptive multi-stage parallel data processing architecture |
US8782665B1 (en) | | Program execution optimization for multi-stage manycore processors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
IPR | AIA trial proceeding filed before the Patent Trial and Appeal Board: inter partes review |
Free format text: TRIAL NO: IPR2022-01566 Opponent name: MICROSOFT CORPORATION Effective date: 20221107 |