WO2012036954A2

WO2012036954A2 - Scheduling amongst multiple processors

Info

Publication number: WO2012036954A2
Application number: PCT/US2011/050690
Authority: WO
Inventors: Trung Am Diep
Original assignee: Rambus Inc.
Priority date: 2010-09-15
Filing date: 2011-09-07
Publication date: 2012-03-22
Also published as: WO2012036954A3

Abstract

A system or device with multiple processors is disclosed. The multiple processors can have different performance characteristics and thus be asymmetric. A group of tasks (a.k.a., processes or threads) executing on the device are assigned to these processors based on priority data, such as values representing the relative efficiency of running each of these tasks on each of the processors. The processors can share a context memory to hold information (e.g., pointers, parameters and operands) associated with the running tasks. The system or device can be configured to store, in the context memory, little more than what is necessary to stop a task and swap the task to another one of the processors, depending on design and system operating conditions.

Description

SCHEDULING AMONGST MULTIPLE PROCESSORS

BACKGROUND

[0001] Modern electronic devices often have multiple processors. To provide just a few examples, these electronic devices can include computers, televisions (TVs), personal digital assistants (PDAs), mobile phones, wireless routers, mp3 players, videogames, GPS receivers, printers, scanners, photocopiers, and many other types of devices. The individual processors can be general-purpose processors, or they can also be special-purpose processors that are to perform specific tasks. For example, special-purpose processors can include a computer's display processor, a coprocessor used to handle communications processing telephone transmissions in multifunction phones, or a vehicle subsystem controls processor, just to name a few examples. The special-purpose processors can be capable of performing the same tasks or running the same instruction sets as general processors, depending on design, but are often optimized in some manner to perform certain tasks in a more efficient manner; for example, special-purpose processors can be designed with specialized logic circuits, additional or faster circuits of a given type, greater cache resources, or with some other special design consideration. Naturally, this list is not limiting. As electronic devices become more sophisticated, and designs have evolved to include more processors in a given electronic device, it has become increasingly common for designs to be based on multiple asymmetric processor, that is, where multiple processors with different designs or specialities perform different tasks in parallel, often with a supervisory processor or system software assigning tasks to each processor.

[0002] These asymmetric multiprocessor designs usually work well for their intended purposes, but typical designs are usually inflexible in terms of how their resources or tasks are assigned to each processor. Perhaps otherwise stated, tasks in these systems are typically assigned in a predetermined manner; not uncommonly, one processor in a device may stand by idly while another processor is computationally taxed, or operates in a manner that could be further optimized.

[0003] What is needed is a way to more efficiently allocate resources in a multiple processor system, ideally, one that responds to current system availability and requirements. The present invention satisfies this need and provides further, related advantages. BRIEF DESCRIPTION OF THE DRAWINGS

[0004] Figure 1 is a block diagram illustrating a system that schedules processors.

[0005] Figure 2 is a block diagram illustrating a system that schedules processors where the processors have a local memory.

[0006] Figure 3 is a block diagram illustrating a system that schedules processors where the processors have a local memory.

[0007] Figure 4 is a block diagram illustrating a system that schedules processors where the processors have a local "context" memory.

[0008] Figure 5 is a block diagram illustrating a system that schedules processors where the processors have a context memory and local memory.

[0009] Figure 6 is a block diagram illustrating a system that schedules processors with a software scheduler.

[0010] Figure 7 is a block diagram illustrating a system that schedules processors with a scheduler that is independent of the processors.

[0011] Figure 8 is a block diagram illustrating a system that schedules processors based on monitored performance indicators.

[0012] Figure 9A is an illustration of a table that associates priority data with specific tasks.

[0013] Figure 9B is an illustration of an alternate embodiment of table containing priority data, that is, a table that indicates an ordered preference for task by processor.

[0014] Figure 9C is an illustration of an example task list and example tracking benchmark associated with specific processors and specific tasks.

[0015] Figure 9D is an illustration of another example task list and example benchmark data associated with specific processors and specific tasks.

[0016] Figure 10 is a flowchart of a method of operating a multiprocessor system.

[0017] Figure 11 is a flowchart of a method of operating a multiprocessor system based on the storing or updating of benchmark data.

[0018] Figure 12 is a flowchart of a method of switching tasks between processors.

[0019] Figure 13 is a block diagram of a computer system.

DETAILED DESCRIPTION

[0020] This disclosure provides methods and systems to dynamically assign tasks in a multiple processor environment, in which at least some tasks are capable of execution by two or more processors and a run-time decision (e.g., a time of execution decision) can be made regarding which of the two or more processors should execute the task. In particular, monitored information is relied upon to provide information regarding at least one current operating condition. In addition to the monitored information, priority data is retrieved that represents desired priority of performance of the task by a specific one of the two or more processors. The monitored data and priority data are used by a scheduler to dynamically assign the task to one of the two or more processors, with the result that the task can be performed by one processor in one situation, and a different processor in a different situation, dependent on the system operating condition(s). A number of difference values or metrics can be looked at as a form of system operating condition; for example, other tasks awaiting execution (or in execution), processor or system state, or other values as discussed further below.

[0021] As used herein, the term "processor" includes digital logic that executes operational instructions to perform a sequence of tasks. The instructions can be stored in firmware or software, and can represent anywhere from a very limited to a very general instruction set. The processor can consist of a microprocessor, embodied on a single, dedicated die or integrated circuit ("IC") or it can be one of several "cores" that are collocated on a common die or IC with other processors. In a multiple processor ("multi-processor") system, individual processors can be the same as or different than other processors, with potentially different performance characteristics (e.g., operating speed, heat dissipation, cache sizes, pin assignments, functional capabilities, and so forth). A set of "asymmetric" processors refers to a set of two or more processors, where at least two processors in the set have different performance capabilities (or benchmark data). The terms "benchmark data," "cost data" and "task specific information" refer to information representing a relative performance characteristic of a processor or processor type relative to running a specific task, and generally refer to how well the processor would perform working on the task in isolation. For example, the benchmark data could include the number of calls that the processor would make to main memory in connection with executing the task, or the number of external dependencies, or the processors speed in producing an output, and many other factors (several will be exemplified below). Generally speaking, "benchmark data" and "cost-data" will be used interchangeably, although depending on specific context, cost-data can be used as a specific task-dependent form of benchmark data. There are many examples of such data. Also, as used herein, a task is an instance of computer code that is being executed. It typically is defined by a number of instructions, implemented as a single thread, or a larger aggregation of functions, such as several threads, which can be executed sequentially or concurrently. As used in the claims below, and in the other parts of this disclosure, the terms "processor" and "processor core" will generally be used interchangeably; that is to say, the techniques and systems provided by this disclosure can be applied to processors that are standalone integrated circuits as well as to processors that are resident on a common package or die.

[0022] This disclosure also provides methods and systems to dynamically reassign at least some tasks in a multiple processor environment, again, in which tasks to be reassigned are capable of execution on two or more processors. A task is executed on a first one of the two or more processors, with this first processor storing operating parameters associated with execution of the task by the first processor (e.g., variables, operands and the like), in a shared memory. The shared memory is local to each of the two or more processors (e.g., a local cache, or something close-by, as opposed for example to off-board main memory), and is used for the reassignment. A scheduler retrieves task specific information that identifies priority between processors; priority can be an order of preference in terms of which processor should execute the task, or (optionally) a more complex heuristic, such "benchmark data" representing a performance characteristic of a processor or a task-specific cost associated of executing a task on the different processors. A task in-progress can be reassigned to a new processor based on this priority data, or a second task awaiting execution can bump a task in-progress based on priority data. Responsive to system status (e.g., an event or other monitored operating condition) as well as to the priority information, the scheduler determines that a task in-progress should be assigned to a second one of the two or more processors (or otherwise stopped, with any appropriate state change in a task list). The scheduler initiates execution of the moved task on the second processor using the operating parameters stored in shared memory, or it otherwise maintains those operating parameters (e.g., for a stopped task) for later completion; that is to say, either a second processor picks up either where the first processor left off, using the very same local memory and the parameters stored within, or the task is later reinitiated on the first processor using this very same local memory and parameters after one or more intervening tasks have been completed.

[0023] In either of the embodiments set forth above, the priority data can be benchmark data representing relative performance of each of the processors, for example, a number, a countable event, a time, or a processor-dependent cost of performing a specific task. That is to say, instead of a fixed order for each task in terms of which processor should execute the task, the priority data can take the form of data representing relative performance characteristics of each processor in the multiprocessor system. If this benchmark data is used, it can be established by programmed parameters, for example, by use of a one-time fuse or dynamic programming, at start up or otherwise. Alternatively, the data can be empirically measured, for example, by testing each processor in the system at issue; for example, the first time the system is operated, at each power up, or on some other periodic basis. In at least one embodiment, data can be compiled during run-time for use in determining priority, on a dynamic basis if desired, such that the data changes in response to fluctuating system operating conditions (e.g., changing process, voltage or temperature or "PVT"

characteristics).

[0024] This disclosure also provides an optional method and systems for building benchmark data of the type just introduced. It is first determined whether the benchmark data exists for one or more processors in a set for a specific task. If that data does not yet exist for a specific processor, the new task is run on that processor as a preferential matter, until the benchmark data exists for each processor in the set; thereafter when the task is called, a table of benchmark data can be retrieved and used in assigning the task to a particular processor based both on the benchmark data and current system operating conditions. If desired, the data can be periodically updated as a specific task is run on various ones of the processors in the multiprocessor system at issue.

[0025] Each of the methods and systems introduced above can be independently employed, that is, each of them is optional, and each of them can be implemented in any of system (e.g., device) or method form.

[0026] To further introduce one illustrative example of these principles, a hypothetical system can consist of two processors, each capable of executing the same task, where the priority data takes the form of benchmark data representing how each processor performs in executing the task. Such a task could for example be a Fast Fourier Transform ("FFT"), essentially consisting of a set of mathematical operations that convert discrete values representing an input signal into frequency values, or vice-versa. Benchmark data can already be available to indicate that a first processor performs the task more quickly (or with less power, or fewer external calls for example) than the second processor. This information can (depending on embodiment) represent previously measured performance of the specific task of the processor at issue or processor type at issue. The data can be hardcoded into the system (i.e., based on system design), or can be empirically measured by the system itself during a calibration procedure or during prior live operations. The benchmark data can be stored in a table that identifies similar benchmark data for many different tasks for all processors in the system capable of performing the specific task. When a specific task is to be assigned by a scheduler (e.g., a FFT operation), the scheduler can check monitored information such as a current operating condition. For example, a register can indicate that the first processor is busy with a current task and that the second processor is not.

Alternatively, a monitor function can indicate that the system is using power at a level close to a predetermined limit, without much headroom. These are but two examples of information representing current system status or condition, i.e., something that fluctuates or changes dynamically during system operation. The scheduler can determine based on the benchmark data that, while it might be nominally more efficient to assign the task to the first processor rather than the second processor, the task should instead be assigned to the second processor. If desired, the scheduler can measure performance of this task by the second processor, to periodically update the benchmark data. A specific task can be assigned to a first processor when it is executed a first time, and when executed a second time, can be assigned to a different processor, as conditions change, with the benchmark information used to create prioritization between the different processors, subject to system conditions. In this specific example, the benchmark data could hypothetically be a value that represents how well each processor performs the FFT, in terms of minimal memory fetches, minimal cache misses, time require to complete the task, or some other metric.

[0027] Alternatively, a task can be dynamically switched using these techniques. For example, if a previously-busy processor becomes free, a multiprocessor system can check priority data for tasks awaiting execution, tasks in-progress, or both. For example, a task in- execution on a busy processor can be moved to a newly- freed-up processor if priority data for the task in-execution represents a strong preference for execution on that newly- freed-up processor. Alternatively, if priority data for a task awaiting execution (e.g., in a task list representing queued tasks) represents a strong preference for execution on a busy processor, the system can dynamically switch a task in-execution on that busy processor to the newly- freed-up processor, in order to optimally assign the task awaiting execution. In one embodiment, such a move can be based on benchmark data for a specific task, and in another embodiment, such a move can be based on global optimization, i.e., achieving a Pareto optimal system result based on the concurrent execution of many tasks (and consideration of relevant benchmark data). To provide an example of this latter function, a scheduler can determine that global cost is minimized by assigning a first task to a less-than-optimal "second" processor match even though a better "first" match is available, because benefits of assigning a second task to the first processor provides greater global system efficiency that outweigh the costs of the less-than-optimal assignment(s). In yet another implementation, local memory (e.g., a local context memory, such as a cache or other memory shared by or otherwise accessible to each of the processors) can optionally be used to assist with such a move.

[0028] With the principal parts of a system and method thus introduced, additional detail will now be presented to specific embodiments.

[0029] Figure 1 is a block diagram illustrating a system 110 that schedules multiple processors, optionally, as in this example, a set of asymmetric processors (that is, with at least two processors having different performance capabilities, designs or operating parameters). As with the other embodiments described below, the teachings presented herein can be applied to like-processor or asymmetric processors, but are especially useful in asymmetric processor systems. In Figure 1, asymmetric processor system 100 comprises processors 110- 113 and a scheduler 120. Two of the processors 110 and 111 are illustrated in solid lines, while a third or greater number of processors processor 3 - processor N (112-113) are illustrated in dashed lines to illustrate their optional usage. Each processor 110-113 is operatively coupled to scheduler 120. The scheduler 120 is coupled to a source of tasks, in this embodiment illustrated to be an optional task list 130. While the source can only be a source of one task, as illustrated in Figure 1, a task list 130 can have entries for plural tasks 131-134 (also shown as task #1 through task #M, respectively), each to be executed at some point in time. The task list should not be confused in the depicted embodiment with priority data 140, i.e., the task list 130 represents tasks in or awaiting execution in the system, for purposes of managing which task (#1-M) is run on which processor; the priority data provides information associated with preference between the processors in execution of a task. As indicated by dashed lines to indicate optional implementation, in one embodiment, the priority data can take the form of benchmark data 142 holding values that indicate the relative performance of a specific task on a processor capable of executing that task. As indicated by numerals 142 and 143 together, the priority data can also take the form of a table that identifies priority for each task in the task list (or potentially, a much larger set of tasks including tasks not yet a part of the task list). A least one of the tasks is executable by at least two of the processors 110-113. The benchmark data can if desired be stored in memory, as can the task list 130 (e.g., in the same or a different memory).

[0030] Because modern digital systems are often called upon to perform multiple tasks at the same time (e.g., calculations, receiving internet traffic, displaying a movie, etc.) it is desirable to allow multiple tasks to share system resources. Multitasking is used when it is necessary to concurrently perform more tasks than there are processors in a system. In the discussion below, various forms of "multitasking" will be explored, both generally and in the context of multiprocessor systems.

[0031] A system that allows multitasking typically manages tasks by managing their states. First, a task is created. That is, it is typically loaded from a secondary storage device (e.g., a hard disk, EEPROM, or CD-ROM, etc.) into a relatively more local memory. Upon being loaded into local memory, a task is assigned a "waiting" state or its equivalent by the scheduler 120. When it is time to execute the task, the scheduler assigns the task a "running" state and the assigned processor executes its instructions.

[0032] If a task needs to wait for a resource (e.g., wait for user input, a hardware interrupt, or file to open, etc) it is assigned a "blocked" state. The task state is changed back to a running state when the task no longer needs to wait on the resource. Once the task finishes execution, or is terminated by the operating system, it is moved to a "terminated" state, and the system typically removes the program associated with the task from memory 120.

[0033] As mentioned earlier, processors can be implemented as standalone, multicore, co-packaged, or as some other form of device. Thus, in one embodiment, processors 110-113 can reside on the same integrated circuit or die, as depicted by dotted outline 160. Two or more of processors 110-113 can have different performance characteristics. For example, processor 110 can be more efficient (or faster) than processor 111 (or processor 112, etc.) when performing compute-bound tasks. Processor 111 can be more efficient (or faster) than processor 110 (or processor 112, etc.) when performing memory bounded tasks. Finally, processor 112 can be more efficient than processor 110 (or processor 111, etc.) when performing input/output (I/O) bounded tasks. Again, this particular illustration should not be viewed as limiting.

[0034] Architectural differences that can lead to different performance characteristics

(and thus asymmetry) and thus desired prioritization among processors 110-113 include, but are not limited to: (1) differing amounts of instruction execution parallelism; (2) differing amounts of speculative instruction execution; (3) differing cache sizes; (4) differing cache associativity; (5) differing number of cache levels; (6) different cache organizations; (7) differing or specialized execution units (e.g., floating point, I/O processer, etc.); (8) differing operating frequencies (a.k.a., clock speed); (9) differing power supply voltage; (10) differing back bias voltages; (11) differing register files; (12) differing translation look-aside buffer configurations; (13) differing bus widths between either external memory, cache memory, or a local memory; (14) differing native instruction sets; (15) memory types with different memory access times; and (16) different processing circuit capabilities. In some

embodiments, processors can have programmably defined architectures, and thus can be occasionally reconfigured to change their characteristics (e.g., per certain FPGA designs). Depending upon implementation, the architectural differences between processors 110-113 may not be visible to software running on processors 110-113. Thus, in one embodiment, processors 110-113 have the same native instruction set and the same logical register files.

[0035] In one embodiment, the system is an asymmetric processor system 100 configured to execute two or more concurrently running tasks (e.g., two or more threads or programs). If desired, the system 100 can even be designed to run more tasks than there are processors 110-113 with the appearance that all tasks are running concurrently; to this end, asymmetric processor system 100 can time-division multiplex (i.e., multitask) the use one or more of processors 110-113 in executing the tasks represented in task list 130. In other words, asymmetric processor system 100 can switch some of the tasks associated with the entries 131-134 in task list 130 between different processors 110-113. This switching when it occurs generally is performed frequently enough that the user perceives the tasks as running at the same time. On asymmetric processor system 100, some number of these tasks will run at the same instant with each processor 110-113 running a particular task. To manage this activity, the task list 130 can be organized such that each task has a corresponding task entry 131-134 (or state). However, other means of tracking tasks that are running, or need to be run, can also be used. Whether or not the system runs more tasks than processors, and whether or not multitasking is employed, the asymmetric processor system 100 can support the assignment, stoppage, initiation and movement of tasks with a process scheduler (not shown in Figure 1).

[0036] Scheduling determines which tasks can be running on which processors 110-

113 at any given time, while other tasks are waiting to get a turn. The act of reassigning a processor 110-113 from one task to another one is called a "context switch." On asymmetric processor system 100 (i.e., a multiprocessor system), such a context switch can be performed for a number of reasons. To provide one example, the scheduler can determine that a newly arising high priority task should take precedence over a task in execution on a specific processor; a context switch can be employed to substitute in the higher priority task on the specific processor, and the replaced task can then be queued or reassigned depending on implementation. If the system is one where multitasking will be used to allow more tasks to be run than there are processors 110-113, the scheduler can reassess priority and can perform another context switch between the displaced task and yet another task. An operating system running on asymmetric processor system 100 can adopt one of several scheduling strategies from the following non-exhaustive categories: (1) a running task keeps running until it performs an operation that requires waiting for an external event or condition (e.g. waiting for a disk drive) or until the scheduler forcibly swaps the running task out of a processor 110- 113; (2) the running task is periodically required to relinquish a processor 110-113, either voluntarily or by an external event such as a hardware interrupt; or, (3) waiting tasks are guaranteed to be given a processor 110-113 when an external event occurs (e.g., when an incoming data packet arrives); clearly, other examples also exist.

[0037] For example, one hypothetical system can include just two processors 110-111 running the following tasks: (1) keyboard scan; (2) GPS location; (3) clock application; and (4) email check. In this example, the keyboard scan task periodically scans the keys of a keyboard to determine if a user has depressed a key. The GPS location task calculates the current latitude and longitude of the system based on signals received from orbiting satellites. The clock application task periodically updates a display of the current date and time. The email check task periodically queries a network server to determine if new email has been sent to an account associated with asymmetric processor system 100. Because there are four tasks and only two processors 110-111 in this example, only two of the tasks can be in execution at any given instant, with the other two tasks in a state of "waiting." All four tasks would be listed in task list 130. If this condition were to continue, the two running tasks would consume all of the processors 110-111 processing cycles while no work was performed on the non-running tasks. Thus, when a predetermined event or condition occurs, a scheduler stops at least one of the running tasks and assigns the processor(s) 110-111 to run at least one of the non-running tasks (i.e., swaps a new task in). This predetermined event can be, for example, in response to an interrupt (e.g., a periodic clock interrupt) or a task voluntarily giving up control of the processor (e.g., a system call by the task to temporarily halt its execution), or the occurrence of some other event.

[0038] To accomplish the swap or context switch, the location of the next instruction to be executed by a task, and the contents of registers or data that defines the state of the processor 110-113 necessary to restart the executing task, are saved to a specific location in memory. This "context information" is saved in a location that will not be overwritten by other tasks when they are run. Typically, the context information saved includes the program counter, stack pointer, and any internal registers being used (e.g., accumulator, floating point registers, general-purpose registers, condition code register, etc.). Any context information from the replacement task that was saved when it was last ousted from running on a processor 110-113 is then retrieved in order to restore the processor 110-113 to the state it was in immediately prior to being ousted. Once that state (if any) is restored, the scheduler causes the processor 110-113 to begin executing the replacement task at the location of the next instruction for the replacement task (i.e., adjusting for any progress in the instruction sequence achieved on a prior process). This process can be repeated for any tasks in task list 130 depending on design, e.g., to adjust for changing priorities or new events, completed tasks, new tasks, or simply such that each task represented by task entries 131-134 gets a fair allocation of processor 110-113 resources.

[0039] Asymmetric processor system 100 can use priority data to assign a particular task to a particular processor 110-113. This priority data can take the form as a programmed preference for a specific task, or as a table that has an entry for each task, indicating a relative preference for each processor in the system.

[0040] For example, priority data can take into account processor power consumption.

If a hypothetical task (e.g., an FFT operation) consumes significantly more power than other tasks, then it may be desired to identify a preference for this task in favor of the processor which consumes the least power; for a different task, for example, a task that retrieves a sensor value and processes it in some manner, it may be desirable to identify a preference for this task in favor of the processor that has the smallest size cache, so as to preserve cache resources for other tasks. These examples naturally are not limiting. Each task can have a specific order of preference between the processors assigned to it.

[0041] In embodiments where the priority data is based on benchmark data, that is, some type of measured or measurable relative performance, the data can be in the form of a numerical value such as a time or one or more countable events. Without limitation, these countable events can represent power consumption attributable to executing that task on the particular processor 110-113 used to execute that task. Examples of countable events can include, but are not limited to, the time taken to complete all or part of the task, the number of instructions, the number of external fetches, the number of branches, the number of data cache misses, and any other measure which could be used to measure performance of a task on a processor. Other countable or measurable indicators can be used to determine an indicator of power consumption or efficiency attributable to executing a task on a particular processor 110-113, and if desired, the countable events can be in the form of a statistical measure, including an averaged or estimated value. These countable events can be computed by, and stored in, one or more of processors 110-113 or associated memory. [0042] Asymmetric processor system 100 can monitor performance and establish or update benchmark data to track the efficiency or performance of the tasks as they execute on each of processors 110-113. As mentioned, this data can correlate or be indicative of energy efficiency. In an embodiment, the monitored indicators can be normalized against execution times so that a comparison of the efficiency between executing a task among processors 110- 113 is accurate.

[0043] As noted above, one embodiment of the principles provided by this disclosure applies benchmark data and monitored conditions, represented by numeral 150 in Figure 1, to dynamically schedule tasks; a second embodiment can build benchmark data, such as the data used by the first embodiment. In one implementation of the second embodiment, asymmetric processor system 100 collects benchmark data for each particular task on all of processors 110-113, and thereafter uses this information as a basis for scheduling that task. Optionally, prior to that time the asymmetric processor system 100 can assign the particular task (i.e., the particular instruction set) preferentially to processors for which it does not have

corresponding benchmark data in order to build that data. Again, the monitored condition can include any type of operating condition, state, or other information regarding dynamic or variable operation of the system.

[0044] As a sample application of these teachings, asymmetric processor system 100 can select the lowest energy consuming processor 110-113 that is available to execute the next task ready to be executed. Asymmetric processor system 100 can also implement other power management policies to limit which processors 110-113 are available to run a task. For example, if a second task further down in the task list is relatively compute intensive (and thus might consume relatively more power), the asymmetric processor system 100 can instead assign this task to the lowest energy consuming processor available. Clearly, many possibilities exist; for example, in implementing such a set of policies, the asymmetric processor system can apply Boolean logic, or any other type of formula or algorithm to aid scheduling determinations. Asymmetric processor system 100 can shut down one or more of processors 110-113 to save on power consumption. Even with one or more of processors 110-113 shut down, asymmetric processor system 100 will still be able to use this example scheduling policy by selecting the lowest energy consuming processor 110-113 from the set of processors 110-113 that are available and not shut down.

[0045] In some circumstances, asymmetric processor system 100 can limit or have a maximum number of entries allowed in task list 130. In other words, task list 130 can be limited to a maximum number of active tasks, for example, "M." Thus, when other parts of asymmetric processor system need a task scheduled, they communicate with a scheduler to determine whether there is room for a new task in task list 130. As mentioned, the maximum number of active tasks (e.g., M) can be larger than the number of processors 1 10-1 13. In this manner, the details of the underlying processors 1 10-1 13 are hidden from other parts of asymmetric processing system 100, such as an operating system. Then, when there is room in task list 130, a new task can be added to task list 130.

[0046] If all processors 1 10-1 13 are currently busy, and a new task is received, the priority data 140 can be used to determine the processor 1 10-1 13 selected to run the new task. If the implementation is one where tasks in-progress can be dynamically moved, the priority data can be use to determine whether a task-in-progress should be interrupted to allow immediate execution of the new task. Alternatively, a designer may wish to select an optimal processor and queue the new task behind any existing task-still in progress, based on priority data that best matches the capabilities of the open processor. Clearly, many other options exist.

[0047] Once a processor 1 10-1 13 is selected, the task is executed on that processor

1 10-1 13 until a condition is satisfied (or an "event" otherwise occurs).

[0048] A first condition that can be satisfied is the expiration of a set time interval.

This time interval can be set by the operating system (e.g., a multitasking system time-slice). This time interval can be set by the task itself. Typically, if this time interval is set by the operating system, the same time interval can be used uniformly for all tasks, for a class of tasks, or for the specific task only, depending on design. If this time interval is set by the task, there is more visibility into the behavior of the task so that the time interval can be customized to each task.

[0049] A second condition that can be satisfied involves the task stopping itself. For example, the task can have run to its completion. In this case, the entry in task list 130 for that task can be cleared and made available. In another example, the task can be waiting for an external event. This can also be referred to as the task being "blocked." If the task itself detects this blocked condition, then it can make a call to suspend its execution. Asymmetric processor system 100 can also detect this blocked condition through other means such as hardware timers, interrupt calls, and the like. In response, asymmetric processor system 100 can initiate a process to switch out the task. In either of these cases, the tasks context state is saved and if appropriate, benchmark data (e.g., representing measured power consumption or efficiency) can be updated. The task can then sit idle in task list 130 until asymmetric processor system 100 determines that the conditions for restarting the task (such as there being an available processor 110-113) have been satisfied.

[0050] Once an event occurs, the system can check the state of each processor or task in the system (active, in-execution or otherwise), and assign a "next task" in the last list to a free processor, or move a task in-execution to a new processor, based on the priority data (e.g., benchmark data). Thus, upon event occurrence, the system can invoke a method of assigning a new task to one of multiple processors (e.g., per the method discussed above and in connection with Figure 10) or can perform a task swap (e.g., per the method discussed above and in connection with Figure 12) based on priority data and a reevaluation of system operating conditions. Thus, each event can provide a milestone for task assignment, with priority data (e.g., benchmark data) and one or more system operating conditions being used to assign or move zero, one or more tasks as appropriate.

[0051] As mentioned, the asymmetric processor system 100 can use benchmark data that represents previous execution times or performance. This data can be established in a number of ways. For example, the system can rely on "dead reckoned" or estimated data representing typically time for running the task on the "type" of processor at issue (e.g., manufacturer model). If desired, the system can start with dead reckoned data and update this data to become more accurate for the specific system at-issue, i.e., based on calibration or dynamic updating. Alternatively, the system can periodically or always measure

performance for each specific task or function call (e.g., a specific subroutine or thread that requires load of data from remote memory) and "build" or otherwise update benchmark data. This data can be specific to each individual task, or can be established for a class of such routines or threads. The benchmark data can be associated with a group of processors 110 andl 13, or optionally, specific processors (e.g., just processor 110). For example, asymmetric processor system 100 can also electively base prioritization on monitored system status, including by way of non-limiting example, whether a specific processor is currently busy or how much time is left until the tasks next deadline. Thus, for example, two different calls for the same function (e.g., different data operands) can be prioritized differently according to priority data as well as any system status information (e.g., one of the two calls is already in progress or is blocked, or is characterized by some other state).

[0052] Aspects of asymmetric processing system 100 will be illustrated by the following example. Consider a case of three background tasks. For the purposes of this example, they will be referred to as Process A, Process B, and Process C. Process A could, for example, be a video download application storing a MPEG encoded stream that could be viewed by the user at a later time more convenient to a user. Process B could, for example, be a life-capture application in which an image of a surrounding area is captured

automatically on a timed basis. Process C could, for example, be a context-aware application that tracks its GPS location, movement, temperature, weather conditions, and other sensors to initiate any helper activity that could be of interest to the user. Notably, while these examples represent relatively complex tasks, the examples can easily be extended to small threads of instructions, or to much more complex processes or software. The mentioned processes are considered background tasks because they have the characteristics of a background task described previously.

[0053] The background processes can, however, have drastically different execution profiles from other tasks. For example, a first background task can be compute intensive and have a small working set in memory to perform its computation. A second background task can function to retrieve data from a set of sensors to memory. This task would typically have very little computational processing. A third background task can awaken frequently so that it can update its status based on a number of sensors that it tracks. This task can trigger a number of reminder activities based on its status. In all of these exemplary cases of background tasks, the computational and memory requirements can vary across a wide spectrum of compute, memory, and I/O components.

[0054] Processors 110-113 may also not have the same performance or power-saving capabilities. For the purposes of this example, three performance characteristics of classifications of these processors 110-113 will be referred to as X, Y, and Z. In the examples that follow, a characteristic can be a trait shared by one or more processors, the identity of a specific processor, or some other feature which can be used to distinguish processors.

[0055] Asymmetric processor system 100 attempts to find an optimal assignment of the background tasks (Process A, Process B, Process C) to the different performance classifications (X, Y, and Z) with respect to performance and power consumption. In other words, there is a benefit in terms of performance, power, other factors or any combination of these things, that Process A is best assigned to be executed on a processor best fitting characteristic X because Process A's execution profile fits best with that of the processors matching characteristic X. The execution profile can be based on the priority data, and in some embodiments, specifically upon benchmark data. By fitting Process A in this manner, Process A is preferably assigned to a processor that is not over provisioned so as to involve a waste of power or resources. Similarly, Process A is preferably not assigned to a processor that is inadequate, i.e., such that execution of the process would fall below desired

performance requirements. Just as was the case for Process A, Process B and Process C might be matched to processors best fitting characteristics Y and Z, respectively. In this scenario, running a background task on a non-preferred processor (or on a processor having a poorly matched characteristic) would be possible, but perhaps not the most performance and power efficient. In different embodiments, depending on design preference, the optimization can be implemented in different manners, for example: processes can be matched based on priority of process (e.g., Process A can receive its choice of processor if Process A is the most important); alternatively, processes can be simultaneously optimized based on a weighted or unweighted measure of all processes (e.g., a least squares or similar approach can be used to optimize overall system performance, such that Process A might be assigned to a less-than- optimal processor if benefits gained by assignment of Process B and Process C to other processors outweigh the disadvantages, with any necessary or appropriate weighting); other techniques can also be used. It should be appreciated that an asymmetric processor system is especially suited to the management of background tasks, simply by the inherent nature of background tasks; for example, with the mentioned hypothetical task that monitors a sensor, a processor with limited compute resources or cache size can be used (depending on design) without dedicating a power hungry or much more powerful processor to this task. The use of priority data is especially adapted to dynamically assigning tasks using asymmetric resources that can be provided in such an embodiment. Again, the teachings above can be applied to non-asymmetric multiprocessor systems as well.

[0056] In one embodiment, asymmetric processor system 100 can optionally build or otherwise update benchmark data used to provide priority. In this regard, a system can use one or more measurement or monitor functions to measure and store, or update, benchmark data. The outputs of the measuring functions can be fed back to the scheduler. For example, Process A can be initially assigned to a processor matching characteristic Z because there is no benchmark data available relating to the execution of that process on a processor matching characteristic Z. During execution of this task on the assigned processor, any pertinent or selected information (depending upon design) can be tracked to help determine how good a fit the background task is to the assigned processor (or group of processors matching characteristic Z). As mentioned previously, non- limiting examples of benchmark data are the number of compute (versus memory) instructions, the number of cache hits or misses, and the number of cycles that remain uncompleted due to unresolved instruction dependencies or blocked states. At the end of its duty cycle (recall that some types of background tasks can be awakened on a regular basis and have finite, bounded program lifetimes) or otherwise at the next "event," these monitored or measured events can be used to produce updated data.

[0057] Once all the benchmark data has been updated for all processor characteristic

(e.g., group of processors or single processor), the scheduler can then use any number of heuristics to determine the best-fit criteria in assigning an available processor to running that background task. As mentioned earlier, this processor for building benchmark data during run time is optional, depending on design.

[0058] To provide some examples of how measured (or dead reckoned or

preprogrammed) benchmark data can be applied, one application could compare a ratio of compute versus memory instructions called by each specific process, in which case a processor having characteristic X would be selected. Another criteria used to assign a process could be that the LI and L2 cache hit rates fall below some threshold, in which case a processor matching characteristic Y would be selected. Yet another heuristic could be the speed of a processor in performing a compute intensive task, in which case processor matching characteristic Z could be selected. In all of these cases, there is some benchmark data that can be used to approximate how well a background task can be fitted to the specific processor (or processor matching specific characteristics).

[0059] Figure 2 is a block diagram illustrating another system that schedules asymmetric processors. In Figure 2, asymmetric processor system 200 comprises processors 210-213, main memory 220, and an optional local memory 240. Each of processors 210-213 is operatively coupled to main memory 220 and, if used by the design, the local memory 240. The main memory 220 can hold the task list 230 as well as the priority data 240, although these two entries are seen as separated by a dashed line 242 to denote they can be in the same or different physical or logical memory (e.g., in different devices or address spaces). As indicated by numeral 244, the priority data can optionally take the form of benchmark data 242 as has been previously discussed. Task list 230 has M task entries 231-234 (also shown as task #1 through task #M, respectively, in Figure 2). Local memory 240 is depicted as currently holding task #2 250. In one embodiment, processors 210-213 reside on the same integrated circuit (e.g., as co-packaged circuits, or as a multicore processor); in another embodiment, the processors can be standalone ICs. As with the embodiments discussed above, two or more of processors 210-213 can have different performance characteristics. Thus, for the sake of brevity, the earlier example architectural differences that can lead to different performance characteristics among processors 210-213 will not be repeated here. [0060] For purposes of discussion, it should be assumed that asymmetric processor system 200 is configured to execute two or more concurrently running tasks. In order to run more tasks than there are processors 210-213 with the appearance that all tasks are running concurrently, asymmetric processor system 200 can time-division-multiplex processors 210- 213 among the tasks represented in task list 230. On asymmetric processor system 200, some number of these tasks will run at the same time with each processor 210-213 running a particular task. Each task to be run can have a corresponding task entry 231-234 in task list 230. However, other means of tracking tasks that are running or need to be run can be used. Asymmetric processor system 200 can support both time-sliced threading and multiprocessor task threading with a process scheduler (not shown).

[0061] In the implementation of FIG. 2, the asymmetric processor system 200 can store certain tasks in memory 240. For example, the asymmetric processor system can store selected background tasks in this local memory, as noted by reference numeral 250 which indicates that task #2 is stored in local memory 240. In this manner, the tasks selected for storage in local memory 240 can be executed (or restarted) using a reduced number of accesses to main memory 220. Depending on design option, local memory 240 can comprise flash or other non- volatile type memory, or volatile memory types such as SRAM or DRAM.

[0062] In addition to scheduling tasks by assigning a particular processor 210-213 to run a particular task associated with a task entry 231-234, asymmetric processor system 200 can monitor system status and can also measure task performance to update benchmark data associated with each task or task type on the various processors, as has been previously discussed. For example, asymmetric processor system 200 can measure execution of a particular task on a particular processor 210-213 and use any resultant data to update benchmark data for the particular task and the particular processor (e.g., benchmark data can be newly created if not readily available or can be used to update and modify previously stored benchmark data, whether originally programmed, dead reckoned or based on prior empirical measurements). Asymmetric processor system 200 in this embodiment can also monitor information about executing a particular task when all or part of the task is stored in local memory 240.

[0063] For example, for each execution of a task (or time slice thereof), asymmetric processor system 200 can measure one or more countable events. These countable events can allow asymmetric processor system 200 to determine an indication of power consumption attributable to executing that task on the particular processor 210-213 used to execute that task. Examples of countable events that can be indicative of power consumption or performance attributable to executing a task include, but are not limited to, the time taken to complete all or part of the task, the number of compute instruction (vs., for example, memory reference instructions) executed, the number of data cache misses, the number of local memory 240 accesses, and the like. Other countable or measurable indicators can be used to determine an indicator of power consumption or other efficiency attributable to executing a task on a particular processor 210-213.

[0064] Asymmetric processor system 200 can record measurements to track the efficiency or performance of the tasks as they execute on each of processors 210-213. This efficiency can correlate with or be indicative of energy efficiency. In one embodiment, the measured data can be normalized against execution times so that a comparison of the efficiency between executing a task among processors 210-213 is consistent and accurate.

[0065] In an embodiment that builds or updates benchmark data, once asymmetric processor system 200 stores benchmark data about running a particular task on all of processors 210-213, asymmetric processor system 200 can use this information as a basis for scheduling that task. In particular, asymmetric processor system 200 can use the

aforementioned data to aid in the selection of which processor 210-213 is selected to run the particular task. Asymmetric processor system 200 can also use the aforementioned data, and any monitored system status or parameters, to aid in the selection of which tasks to place in local memory 240. In an embodiment where this "building" is performed via a calibration process, a task can be deliberately executed on each processor to build benchmark data for future use. In an embodiment where this "building" is performed run-time, the process can be simply assigned as a matter of preference to a new processor (for which no benchmark data is yet available), and it is otherwise assigned in order of priority determined in view of the benchmark data which is available.

[0066] In another example, asymmetric processor system 200 can select the lowest energy consuming processor 210-213 for the next task ready to be executed, a selection which can (depending on design) be dependent on the particular task. Asymmetric processor system 200 can also implement other power management policies to limit which processors 210-213 are available to run a task. Asymmetric processor system 200 can shut down one or more of processors 210-213 to save on power consumption. Asymmetric processor system 200 can move tasks to local memory 240 in order to reduce power consumption. Asymmetric processor system 200 can move tasks to local memory 240 in order to reduce power consumption based on benchmark data. [0067] Depending on design, the system 200 can limit or have a maximum number of entries allowed in task list 230. As with the embodiments discussed earlier, the

characteristics of the various underlying processors 210-213 and any dynamic assignment can be hidden from other parts of asymmetric processing system 200, such as the operating system. Asymmetric processing system 200 can limit the number or size of tasks that can be stored (either partially or completely) in local memory 240. Depending on embodiment, it can be left up to the scheduler to determine whether there is room for a new task in local memory 240. It should be understood, that the maximum number and size of active tasks can be larger than the size of local memory 240. In this manner, the details of the underlying local memory 240 can be hidden from other parts of asymmetric processing system 200, such as the operating system. Of course, when there is room in task list 230, a new task can be added to task list 230.

[0068] In one embodiment that employs local memory, tasks that are specifically identified or classified as "background tasks" can be placed in local memory 240 by an operating system. Alternatively, background tasks can be hardwired or preprogrammed into local memory.

[0069] For tasks which have not yet been executed, the scheduler (not shown in Figure

2) can assign a next task to a processor 210-213 when that processor becomes available. As yet another optional instantiation of the system, the scheduler can optimize performance of multiple queued tasks by immediately assigning a lower order task, that is, a task other than the next one awaiting execution, based on priority data 240. For example, if (a) the next task awaiting assignment to a processor is task #3 (233), (b) processor #3 (212) is the lone free processor, (c) task #3 has associated priority data that indicates a strong preference for processor #1 (210), and (d) task #4 has priority data representing a strong preference for processor #3 (210), the system can be designed to "leapfrog" task #4 ahead of task #3 based on a form of global system optimization. "Leapfrogging" can also be performed if desired based on other heuristics, for example, if the priority (or preference represented by benchmark data) for task #4 for processor #3 is a threshold amount stronger than the preference for task #3 for the same processor.

[0070] Once a processor 210-213 is selected for the execution of a task by the scheduler, the task is executed on that processor 210-213 until an event or condition is satisfied. These events or conditions can be the same as or similar to those for asymmetric processing system 100, discussed earlier. Therefore, for the sake of brevity, they will not be repeated here. [0071] Asymmetric processor system 200 can prioritize the tasks that are represented in task list 230. This prioritization can be based on data that distinguishes between tasks are ready to be executed, and those that are still waiting on the completion of an external event. Asymmetric processor system 200 can base prioritization on previous execution times that have been tracked and stored as benchmark data. For example, prioritization can be based on how much time is left until the tasks next deadline, or a combination of one or more of the previous execution times and how much time is left until the tasks next deadline.

[0072] Figure 3 is a block diagram illustrating a system that schedules asymmetric processors where the processors share a local memory. In Figure 3, asymmetric processor system 300 comprises processors 310-313, main memory 320, and local memory 340. Each of processors 310-313 is operatively coupled to main memory 320 and local memory 340, with the main memory 320 holding the task list 330, the local memory 340 holding program memory 342 and data memory 346. The program memory 342 can be used to store instructions for task #2 352, while the data memory 346 can hold data for task #2 354. Once again, task information can be stored in the form of a task list 330, holding task entries #1-M (331-334 in Figure 3). As was the case for each of the embodiments discussed earlier, the processors 310-313 can reside on the same integrated circuit, or be configured as cores residing on the same integrated circuit die. Depending on design choice, local memory 340 can also reside on the same integrated circuit as one or more of processors 310-313 or a different IC. Similar to processors 210-213 in Figure 2, and processors 1 10-1 13 in Figure 1 , two or more of processors 310-313 can have different performance characteristics and thus be asymmetric. Thus, for the sake of brevity, the list of example architectural differences that can lead to different performance characteristics among processors 310-313 will not be repeated here. In this embodiment the instructions and/or data for background tasks can be placed in local memory 340 by an operating system, or instructions and/or data for background tasks can be hardwired or preprogrammed into local memory 340.

[0073] Asymmetric processor system 300 operates similarly to asymmetric processor system 200. However, because local memory 340 holds both program memory 342, and data memory 346, asymmetric processor system 300 can maintain additional benchmark data about running the task when one or more of the tasks instructions and data are stored in local memory 240. Thus, asymmetric processor system 300 can make decisions about which processor 310-313 to select based on which parts of a task (e.g., instruction or data) are stored in local memory 340. In addition, asymmetric processor system 300 can make decisions about which part of a task (e.g., instructions, data, or both, for a particular task) to place in local memory 340 based on benchmark data. The benchmark data (or priority data) and its associated storage can be the same as described with reference to the earlier Figures, and so is not repeated here.

[0074] Figure 4 is a block diagram illustrating a system that schedules processors where the processors have a context memory, in this embodiment, a shared context memory. In Figure 4, system 400 comprises processors 410-413, main memory 420, and context memory 460. Each of processors 410-413 is operatively coupled to main memory 420 and context memory 460, with main memory 420 again holding the task list 430 and the task list 430 again being comprised of M number of task entries 431-434 (also shown as task #1 through task #M, respectively, in Figure 4). The context memory 460 is seen as holding the context for task #2 462 and the context for task #3 464. As before, the processors 410-413 can reside on the same integrated circuit, or integrated circuit die, either separate from, or together with the context memory 460, and can have different performance characteristics, as exemplified above in connection with the earlier figures.

[0075] Once again, the system 400 can be configured to execute two or more concurrently running tasks, and so can time-division-multiplex processors 410-413 among the tasks represented in task list 430. In order to accomplish this, asymmetric processor system 400 can store context information about tasks that are not running in context memory 460. The context information stored in context memory 460 includes the information necessary to restart a task that was previously executing on one of processors 410-413 on the same or another of processors 410-413. Typically, this information includes the values of the registers, instruction pointers, and data pointers (e.g., stack pointers). The context information can be limited to only that information necessary to restart a task that was previously executing on one of processors 410-413. In other words, if a register or pointer was not being used, or had an invalid value, then it would not be stored in context memory 460.

[0076] Figure 5 is a block diagram illustrating a system that schedules asymmetric processors where the processors have a context memory and local memory. In Figure 5, asymmetric processor system 500 comprises processors 510-513, main memory 520, local memory 540, and context memory 560. Each of processors 510-513 is operatively coupled to main memory 520, local memory 540, and context memory 560. Once again, main memory 520 is holding task list 530, task list 530 has a number of task entries 531-534 (also shown as task #1 through task #M, respectively, in Figure 5), and context memory 560 is holding the context information 562 and 564. Local memory 540 can hold instructions and/or data 542 and 544, while the context information 562 or 564 can differ between tasks or as between tasks, instructions or data. Asymmetric processor system 500 in this embodiment operates similarly to asymmetric processor systems discussed earlier. However, Asymmetric processor system 500 has both a context memory 560 and a local memory 540.

[0077] In order to time-division-multiplex processors 510-513 among the tasks represented in at least task list 530, asymmetric processor system 500 stores context information about tasks that are not running in context memory 560. In addition, asymmetric processor system 500 can store instructions and/or data associated with both executing and non-executing tasks in local memory 540. The context information stored in context memory 560 includes the information necessary to restart a task that was previously executing on one of processors 510-513 on the same or another of processors 510-513. As alluded to earlier, this information can include the values of the registers, instruction pointers, and data pointers (e.g., stack pointers). The context information stored in context memory 560 can be limited to only the information necessary to restart a task that was previously executing on one of processors 510-513 on the same or another of processors 510-513. In other words, if a register or pointer was not being used, or had an invalid value, it does not have to be stored in context memory 560 in order to swap out a task in-execution with another task. All other information can be stored in local memory 540 or main memory 520.

[0078] Figure 6 is a block diagram illustrating another system having a scheduler. As was the case for the other embodiments already discussed, asymmetric processor system 600 has processors 610-613, main memory 620, local memory 640, and context memory 660, with the scheduler being implemented by one or more of the processors. Each of processors 610-613 is operatively coupled to main memory 620, local memory 640, and context memory 660. Main memory 620 is depicted in Figure 6 as holding task list 630 and scheduler 670, as well as priority data 680. The task list 630 can include M task entries 631-634 (also shown as task #1 through task #M, respectively), and context memory 660 can hold one or more instances of context information (depicted using numerals 662 and 664). Local memory 640 in this embodiment can hold instructions 642 and/or data and 644. As depicted by reference numeral 684, the system 600 can be optionally designed so as to use benchmark data as its priority data, basing dynamic task assignment on data representing performance

characteristics or cost of running a specific task for each different processor.

[0079] Figure 7 is a block diagram illustrating a system that schedules asymmetric processors with a scheduler that is independent of both processors and main memory. In Figure 7, the asymmetric processor system 700 as before comprises processors 710-713, main memory 720, local memory 740, context memory 760, and a scheduler 770. Each of processors 710-713 and scheduler 770 is operatively coupled to main memory 720, and the task list once again is seen to consist of M task entries 731-734. As before, the context memory 760 holds separate instances of the context information 762 and 764 while the local memory 740 holds instructions and/or data 742 and 744. Asymmetric processor system 700 operates similarly to the systems described earlier. However, Figure 7 illustrates that the aforementioned functionality relating to the selection of a processor in these systems to run a task can be controlled by a scheduler 770 that does not run on one or more processors 710- 713. In one embodiment, scheduler 770 is an independent processor running a scheduling task. In another embodiment, scheduler 770 can be dedicated scheduling circuitry or a dedicated processor. As was the case before, numerals are used to designate the use of priority data, which optionally can be in the form of benchmark data, as indicated by numerals 780 and 782.

[0080] Figure 8 is a block diagram illustrating a system similar to those discussed above, but which calls for updating benchmark data based on monitoring performance of tasks during run-time. Once again, the embodiment includes a number of processors 810- 813, main memory 820, local memory 840, context memory 860, scheduler 870, and performance monitor 880. Each of processors 810-813, scheduler 870, and performance monitor 890 is operatively coupled to main memory 820. The performance monitor 890 is operatively coupled so as to update a table of benchmark data, identified in Figure 8 by reference numeral 892. In order to keep benchmark data current to track most recent operating conditions, the system 800 can run tasks in the order defined by the task list 830, which as before is optionally stored in main memory. In part to facilitate hot swaps, context information 862 and 864 can be stored in a context memory (860) while instructions 842 and data 844 can be stored in another form of local memory (840).

[0081] Figure 8 helps illustrate that indicators of efficiency, performance, power consumption, and other measured data associated with running a task on a processor (i.e., benchmark data), can be measured or otherwise gathered by the more performance monitor 890. The performance monitor 890 can be dedicated performance monitoring circuitry (such as a power monitor, or internal register counting events), software running on one or more of processors 810-813, or a process managed by another circuit (not shown in Figure 8).

[0082] Figure 9A is an illustration of a list associating priority data specific to individual tasks. In Figure 9, a table 900 contains M row entries associated with M tasks (1 through M). Each row also contains a column each associated with one of the N processors, but in differing orders. For example, for a first row (corresponding to task #1), the table indicates that the order of preference for execution is in numerical order of the processor, #s 1-N. A second row, however, indicates that task #2 can have a different preferred execution order, for example, with processor #M being the favored processor, and processor #2 being the least favored. Similarly, this exemplary priority table can feature a row for each task indicating the most preferred processor, the second most preferred processor and so on. As an alternative, the table 900 can have specific columns dedicated to specific processors with each column having a value indicating order of priority. For example, returning to the priority data exemplified in dashed lines in Figure 1 , priority for each task could feature an entry for each processor in order, but expressing a value associated for each specific processor; for example, processor #1 (value near the left side of the priority data for each task) could be a low number for one task (e.g., indicating high preference for that processor the associated task), yet high for another task (e.g., indicating low preference for the associated processor for the associated task). Many other examples exist.

[0083] Figure 9B provides another example of a table 905 of priority data, this time indexed by processor instead of by task. That is to say, unlike the example just illustrated, each processor is a row and the table expresses an order of processor preference in terms of tasks, e.g., the specific processor #1-M performing certain tasks more efficiently (or otherwise preferentially) than other processors. In processing priority data organized in this manner, the system can simply access the represented table each time it has a task and a free processor, using the free processor as an index and searching the task list for an appropriate task based on the data. Again, any desired heuristic can be used. In this regard, the system retrieves the row associated with the "free" processor and then compares the task list (not shown in Figure 9B) with the associated priority data to determine which task should be executed. A distance metric can be used, along with any thresholding appropriate to the design. For example, if processor #1 becomes free and the task lists contains task #2 and task#3 in order awaiting execution, the system can assign task #2 to processor #1.

Alternatively, if the task list instead held task #M followed by task #2, task #2 might still be the task assigned to processor #1 based on the much weaker affiliation depicted in Figure 9B between task M and processor #1. Clearly, any prioritization algorithm or weighting appropriate to the particular design can be used. For example, as discussed above, instead of varying the order of tasks in each row of Figure 9B, a specific column could be dedicated to each task with a value stored to represent affiliation between the specific task (column) and the specific processor (row). In this manner, the values in each entry in the table can be scaled in any manner desired in order to impart appropriate weighting for each task-processor combination. Clearly many other examples exist.

[0084] As mentioned earlier, priority data can take the form of benchmark data, such as an entry for each specific processor-task combination. If desired, plural tables can be used to obtain the same end, e.g., as a relational database (e.g., a first table can contain entries for each processor for one or more performance characteristics, and a second table can contain entries for each task with desired performance characteristics, with the two being

computationally processed together to determine an optimal mapping). As mentioned above, benchmark data can be updated in some embodiments, either dynamically or at

predetermined intervals.

[0085] In Figure 9C, a table of benchmark data (910) shows three example row entries associated with three tasks. Each row contains a column associated with a particular task. The first example task is associated with a clock application. The second task is associated with a GPS position update process. The third task is associated with an email checking process. Each row in the table also contains columns associating each of the three processors 1-3 and two benchmark data entries. For each of the three processors, example benchmark data identifies number of data cache misses (D-cache misses) and the number of memory references (memory refs), for use in prioritization.

[0086] For example, in one hypothetical case, processor #1 and processor #3 can be available to have tasks assigned to them, with the scheduler determining that task #2 needs assignment to a processor. In one embodiment, an asymmetric processor system might examine and compare the benchmark data associated with task #2, processors #1 and #3. The asymmetric processor system (e.g., the scheduler/scheduling software) would observe from the entry for processor #1, that no benchmark data has been obtained that relates the performance of running task #2 on processor #1. In this case, the asymmetric processor system can choose to assign processor #1 to execute task #2 in order to "build" priority data for future use. The asymmetric processor system can choose to assign task #2 to processor #1 regardless of the values of the benchmark data associated with the performance of task #2 on the other processors; alternatively, if processor #1 was busy, depending on

implementation, the system might choose to use a different processor, instead performing assignment based on priority data available for other processors (this would involve the use of a monitored system operating condition to in part perform assignment). Until there are values of benchmark data stored that associate every processor to every task, preferentially choosing a processor with no existing priority data allows the asymmetric processor system to gather at least baseline data that it can use to make later assignments.

[0087] In connection with Figure 9D, it should again be assumed that each of processors #2 and #3 are available to have tasks assigned to them, and that task #1 is the next task that needs to be assigned to a processor. As with the previous example, the system could examine the priority data table 920 for task #1, again retrieving and comparing available data for processors #2 and #3. In this case, the asymmetric processor system would observe that processor #2 ran task #1 with ½ the number of D-cache misses and ½ the number of memory references relative to processor #3. Thus, in this case, the asymmetric processor system can choose to assign processor #2 to execute task #1 because of the two available processors, processor #2 runs task #1 more efficiently. Again, more complex heuristics can also be employed if desired for the particular application. For example, the table 920 indicates that processor #1 performs twice as well as processor #2 and four times as well as processor #3 for the particular task. The system can, as previously mentioned, employ any desired optimization algorithm such that, for example, it might chose to "bump" a task in-execution from processor #1, and then assign it to processor #2 or processor #3, depending on priority data for that task. As before, local memory and/or context memory can be employed to assist with the switch operation. Alternatively, the system could determine to queue task #1 for later execution if it was determined processor #1 would be available shortly (again, this might involve a monitored system operating condition, i.e., based on an understanding that a process in operation on processor #1 was near completion). Again, many examples exist regarding the use of priority data.

[0088] Figure 10 is a flowchart of a method of operating an asymmetric

multiprocessor system. The steps illustrated in Figure 10 can be performed by one or more elements of asymmetric processor systems discussed above.

[0089] A set of processors available to run a task is selected (1002). For example, a system can select all or a subset of processors that are available as candidates for running a task and, as mentioned, depending on the task and on system design, can also bump a task in- process. The system receives priority data for the new task (1004). It also receives at least one monitored system operating condition (1006), for example an indication of the state of each processor (e.g., free/busy) and it then determines a priority based on the priority data and the system operating condition (1008). If desired, the priority data can be benchmark data as previously discussed, or otherwise reflect the relative cost of running the specific process on the different processors in the set. [0090] Figure 1 1 shows a method of building priority data (e.g., benchmark data) based on monitored or measured task performance. As mentioned previously, such a method can be used to update dead reckoned data, as part of a calibration routine, or just simply casually as the system in run-times receives tasks for execution. The system receives a task (1 102), and it then accesses priority data; per step (1 104), the system determines that no priority data is available for a specific processor. The task is then run preferentially on that processor (1 106). A performance parameter associated with running the task on the selected processor is then determined and used to build priority for the associated task and processor (1 108). For example, the system can measure some aspect of system performance associated with running the task on the selected processor using one or more performance monitors, as previously discussed. As mentioned, benchmark data can comprise countable events that are indicative of power consumption, efficiency, or performance attributable to executing a task. This data can include performance parameters that include, but are not limited to, the time taken to complete all or part of the task, the number of compute instruction (vs., for example, memory reference instructions) executed, the number of data cache misses, and the like. Other countable or measurable performance indicators can be used as all or part of this data, or to provide an indicator of power consumption, efficiency, or performance attributable to executing a task on a selected processor.

[0091] Data representing measured performance (i.e., benchmark data) or an indicator that such data is available can then be stored (1 1 10). As one non-limiting example, the fact that a value is stored (e.g., a non-zero or non-negative value, for example) can indicate that data associated with running the task on a particular processor is in fact available. In another example, the system can maintain a flag associated with whether data associated with running the task on the first processor is stored. Changing the value of this flag can indicate priority data associated with running the task on the processor is now stored.

[0092] Figure 12 is a flowchart of a method of operating an asymmetric

multiprocessor system based on benchmark data and, in particular, shows how a task can be bumped. The steps illustrated in Figure 12 can be performed by any of the embodiments previously discussed.

[0093] An event occurs which causes the retrieval of priority data associated with running a task (1202). For example, this event can be the fact that a task has completed on one of the processors in the set, that a time limit has been reached, or that a state for the system or one of the processors has changed (e.g., a state table may have been changed to indicate that processor #2 is now in a "blocked" state). The priority data can be retrieved for each task in execution on any processor, one or more tasks (in the task list) awaiting execution, or any subset of these things. A processor is then selected to run a specific task based on the retrieved priority data (1204); for example, the specific task can be one of the tasks already in execution on another processor, or a task awaiting execution in the task list, and again, a simple or complex heuristic can be used to determine the selected task.

Irrespective of task or priority driving a swap (or "context switch"), the system determines that a swap needs to occur involving a move of at least one task to a processor for execution. Accordingly, the system stores context information in context memory (or another form of local memory) about a task which is in-execution and is to be moved, either to another processor or queued for later execution (1206). This information might have already been inherently stored by the processor already running that task. The system then performs the move (1208), for example, by causing the new processor to adopt or otherwise receive context information from the old processor, including any instruction pointers, parameters or other data needed to restart the task, or simply by starting a new (previously unstarted) task (1210). Finally, the system can change state associated with each affected processor or task, or otherwise initiate execution of the swapped-in task (1212).

[0094] A processor to run the "swapped-in" task can be selected based on priority data. For example, an asymmetric processor system can select a processor because benchmark data showed that the selected processor was more efficient or otherwise better suited to running a specific task than other tasks in the task list or a task in-execution on the specified processor. In one embodiment, background tasks can have their context information stored in local context memory, with most (or all) non-background tasks having their context information stored elsewhere during a context switch. Also, any of the methods indicated above (i.e., in connection with Figures 10-13 or elsewhere in this application) can be implemented partially or completely as instructions stored on machine readable media, e.g., using software or firmware; as mentioned, the scheduling and other processes mentioned above can be run on one of the processors in an asymmetric processor set, either loaded from permanent memory, the internet, temporary memory, or using some other mechanism.

[0095] Figure 13 illustrates a block diagram of a computer system. The computer system 1300 is seen to include communication interface 1320, processing system 1330, storage system 1340, and user interface 1360. The storage system 1340 stores software 1350 and data 1470. The processing system 1330 is operatively coupled to storage system 1340, the communication interface 1320 and user interface 1360. As will be understood by one having familiar with digital design, the computer system 1300 can comprise a programmed general-purpose computer or an embedded or other system, for example, based on a microprocessor, FPGA, or other programmable or special-purpose circuitry. The computer system 1300, as well as the communication interface or the user interface can be distributed among multiple devices, processors, storage, and/or interfaces that together comprise elements 1320-1370. The communication interface 1320 can comprise a network interface, modem, port, bus, link, transceiver, or other communication device, and the user interface 1360 can comprise a keyboard, mouse, voice recognition interface, microphone and speakers, graphical display, touch screen, or other type of user interface device. The storage system 1340 can comprise a disk, tape, integrated circuit, RAM, ROM, network storage, server, or other memory function, embodied in one device or among multiple memory devices.

[0096] As with other digital systems, the processing system 1330 retrieves and executes software or firmware 1350 from storage system 1340, and retrieves and stores any needed data 1370, if needed, via communication interface 1320. The processing system can create or modify software 1350 or data 1370 in order to achieve a tangible result, and depending on application, can perform these tasks via the communication interface 1320 or user interface 1370.

[0097] Locally and remotely stored software or firmware can comprise an operating system, utilities, drivers, networking software, and other instructional logic typically executed by a computer system. This logic can comprise an application program, applet, firmware, or other form of machine -readable processing instructions typically executed by a computer system. When executed by processing system 1330, the software or other instructional logic can direct the computer system 1300 to operate as described herein.

[0098] The above description and associated figures teach the best mode of the invention. The following claims specify the scope of the invention. Note that some aspects of the best mode may not fall within the scope of the invention as specified by the claims. Those skilled in the art will appreciate that the features described above can be combined in various ways to form multiple variations of the invention.

[0099] To provide just a few examples of variations that can readily be employed, the embodiments presented above can perform assignment of tasks based on a monitored system operating condition, such as a state of at least one processor in a set of processor, a power state of the system, a memory state of the system, the state of at least one task in a queued task list, the state of at least one task in execution on a processor in the plurality, a software state of the system, a time reminaing to completion of a task, a number of instructions remaining to completion of a task, or a deadline associated with execution of a task. Many other types of information representing a state or operating condition can be used, as can a combination of any of these things.

[00100] The embodiments presented above can also perform assignment of tasks based on priority data, including without limitation, benchmark data. For example, in lieu of a simple priority (e.g., a simple order of preference), data that can be relied upon can include a clock frequency used by a specific processor, power consumption of a specific processor, power consumtion of a specific processor when executing a specific task, a count of external memory fetches by a specific processor when executing a task, a speed with which a specific processor can complete a task, a size of cache memory associated with a specific processor, a number of cache misses by a specific processor when executing a task, or a time associated with a specific processor. Again, nearly any type of value, measured or otherwise, can be used to provide a preference indication, and a combination of these things can also be used.

[00101] As a result, the invention is not limited to the specific embodiments described above, but only by the following claims and their equivalents.

Claims

CLAIMS What is claimed is:

1. An system, comprising:

a plurality of processors including at least a first processor and a second processor; a scheduler to assign a task to a selective one of the plurality of processors, the task comprising instructions executable on each processor in the plurality;

memory to store priority data representing expected relative priority desired for performance of the task between each processor in the plurality;

wherein the scheduler is to dynamically assign the task to any processor in the plurality, dependent upon the priority data and at least one monitored system operating condition.

2. The system of claim 1, wherein the at least one monitored operating condition includes at least one state selected from the group consisting of:

a state of at least one processor in the plurality;

a power state of the system;

a memory state of the system;

the state of at least one task in a queued task list;

the state of at least one task in execution on a processor in the plurality;

a software state of the system;

a time reminaing to completion of a task;

a number of instructions remaining to completion of a task; or

a deadline associated with execution of a task.

3. The system of claim 1, where the priority data includes at least one of:

a number associated a specific processor;

a countable event associated with a specific processor; or

a time associated with a specific processor.

4. The system of claim 1, where the priority data includes benchmark data representing expected performance, including at least one of:

a clock frequency used by a specific processor; a power consumption by a specific processor;

a power consumption of a specific processor when executing a specific task;

a count of external memory fetches by a specific processor when executing a task; a speed with which a specific processor can complete a task;

a size of cache memory associated with a specific processor;

a number of cache misses by a specific processor when executing a task; or a time associated with a specific processor.

5. The system of claim 4, where the benchmark data is empirically measured during system operation for each processor in the system.

6. The system of claim 1, embodied as a single integrated circuit.

7. The system of claim 6, where the plurality of processors are embodied in a single

multicore processor die.

8. A system, comprising:

a plurality of processors;

a context memory, coupled to each processor in the plurality, to hold context

information to allow a task to be stopped on a processor and restarted on a the same or a different processor in the plurality; and

a scheduler to assign tasks to any of the processors in the plurality, the scheduler to dynamically stop the first-mentioned task based on at least one of

(1) priority data that indicates that the task is preferably performed on a second processor in the plurality, or

(2) a determination that a second, new task is more preferably performed on the the first-mentioned processor.

9. The system of claim 8, where the scheduler receives at least one indicator of a monitored operating condition, the scheduler to reassign the first-mentioned task based on the monitored operating condition.

10. The system of claim 9, where the monitored operating condition includes at least one condition selected from the group of: (1) a condition that the second processor is free or (2) a condition that the second processor is blocked.

11. The system of claim 9, where the monitored operating condition includes at least one condition selected from the group of: (1) a state of the system that matches a predetermined state; or (2) a state of the first-mentioned processor that matches a predetermined state; or (3) a state of the second processor that matches a

predetermined state.

12. The system of claim 8, where the scheduler is a second task that runs on a third processor in the plurality.

13. The system of claim 8, embodied in a single die, where each processor in the plurality is a processing core collocated on the single die.

14. The system of claim 8, embodied in a single integrated circuit.

15. The system of claim 14, where each processor retrieves instructions associated with a task without retrieving said instructions from a memory external to said integrated circuit.

16. The system of claim 8, where each of the first-mentioned processor and the second

processor has a different performance capability.

17. The system of claim 16, where the different performance capability includes at least one of:

the first-mentioned processor having a different cache capacity than the second

processor;

the first-mentioned processor having a different clock frequency than the second processor; or

the first-mentioned processor fetching a different number of bits from said local memory in a local memory access cycle than the second processor fetches in the local memory access cycle.

18. The system of claim 8, where the scheduler dynamically reassigns the task based on the determination, where the determination is based on priority data that indicates that the second task is more efficiently performed on the first-mentioned processor.

19. The system of claim 18, where the scheduler performs the determination.

20. The system of claim 18, where the priority data includes benchmark data representing expected performance, including at least one of:

a number associated with clock frequency used by a specific processor;

a power consumption by a specific processor when executing a task;

a size of cache memory associated with a specific processor;

21. The system of claim 20, where the benchmark data is empirically measured for each processor in the system.

22. The system of claim 20, further comprising means to update the benchmark data based on monitoring the execution of the second task on the second processor in the plurality.

23. The system of claim 8, further comrpising means for storing and updating a task list.

24. A method of operating a system having a set of at least two processors capable of executing the same instruction set, the method comprising:

receiving a task for execution, the task based on the instruction set;

selecting a processor from the set based on an indication that benchmark data,

representing a performance of the processor when the task is run on the processor, is not available;

running the task on the processor; and

measuring and storing the benchmark data;

performing the receiving, selecting, running and measuring until benchmark data is available for each processor in the set.

25. The method of claim 24, further comprising using the benchmark data in dynamic, runtime assignment of tasks to the set of processors.

26. The method of claim 24, further comprising using the benchmark data, during run time, to switch a task in-execution from a first processor in the set to a second processor in the set, including by:

storing instructions for the task in-execution on the first processor in a memory local to the processors in the set

storing context information for the task in-execution on the first processor in a

memory local to the processors in the set;

selecting the second processor from the set using the benchmark data; and, stopping the first processor from executing the task in-execution and initiating

execution of the task in-execution on the second processor based on the instructions as-stored and based on the context information as-stored in memory local to the processors in the set.

27. The method of claim 26, where the selecting is also dependent on a monitored system operating condition.

28. The method of claim 27, where the monitored system operating condition includes at least one of:

a state of at least one processor in the plurality;

a power state of the system;

a memory state of the system;

the state of at least one task in a queued task list;

the state of at least one task in execution on a processor in the plurality;

a software state of the system; or

a deadline associated with execution of a task.

29. The method of claim 26, further comprising updating the benchmark data associated with running the task on a specific processor in the set based on running the task on the processor a second time.

30. The method of claim 26, embodied in instructional logic, where the set of processors are collocated on a single integrated circuit or die, and where selecting the processor including selecting from amongst the set of processors on the single integrated circuit or die.

31. A method for use in operating a system having a set of at least two processors and a memory, comprising:

receiving a task for execution, the task executable by each processor in the set;

in a scheduler,

retrieving task specific information from memory, the task specific

information associating a cost associated with executing the task on at least one of the processors in the set, and

retrieving monitored information regarding status of at least one of the

processors in the set; and

assigning the task to a specific processor in the set based upon both the task specific information and the monitored information.

32. The method of claim 31 , where the memory stores task specific information for each of plural tasks, the task specific information for each task identifying a cost associated with executing the corresponding task on plural processors in the set, the method further comprising assigning the task to one of the plural processors based upon a new request for execution of the task based on the retrieved monitored information and the relative cost associated with execution of the task of at least one other processor in the set.

33. The method of claim 32, where at least two of the processors are asymmetric relative to one another, and where the set of processors are co-packaged.

34. A method for use in operating a system having a set of at least two processors, a scheduler and a shared context memory, comprising:

executing a task on a first one of the processors, wherein the first one of the

processors stores operating parameters associated with execution of the task by the first processor in the shared memory, and wherein the task is executable on each processor in the set; and

in a scheduler, retrieving task specific information that associates a cost associated of

executing the task on at least one of the processors in the set, retrieving monitored information regarding status of at least one of the

processors in the set, and

responsive to the task specific information and the monitored information, determining that the task should be assigned to a second processor; and initiating execution of the task on the second processor using the operating parameters stored in the shared memory.

35. A method for use in operating a system having a set of at least two processors and a memory, comprising:

executing a task on each processor in the set;

obtaining benchmark data regarding the execution of the task by each processor in the set;

building a table indexed by task that associates a cost of executing the task by each processor in the set;

storing the table in the memory; and

responsive to contents of the table and monitored processor information, dynamically assigning execution of new tasks to processors in the set.

36. A method of operating multiple processors in a signaling system, comprising:

receiving a task executable by each one of the multiple processors;

selecting a first processor, from the set of processors, based on an indicator that

benchmark data associated with execution of the task on the first processor has not been stored;

executing the task on the first processor;

determining benchmark data associated with execution of the task on the first

processor and storing the benchmark data; and

updating the indicator to reflect the storage of the benchmark data.

37. The method of claim 36, where the indicator is a first indicator and the benchmark data is first benchmark data, the method further comprising: selecting a second processor, from the multiple processors, based on a second indicator that benchmark data associated with execution of the task on the second processor has not been stored;

executing the task on the second processor;

determining second benchmark data associated with execution of the task on the second processor and storing the second benchmark data; and, updating the second indicator to reflect storage of the second benchmark data.

38. The method of claim 37, further comprising subsequently receiving another request to execute the task, and responsively assigning the task to the first processor based on the first benchmark data and the second benchmark data.

39. The method of claim 36, further comprising:

subsequent to executing the task on the first processor, receiving a new request to execute the task on the first processor; and

updating the benchmark data responsive to the execution on the first processor

responsive to the new request.

40. The method of claim 36, where the multiple processors are contained on a single

integrated circuit, the method embodied as a method of operating the single integrated circuit.