US20110145626A2 - Virtual processor methods and apparatus with unified event notification and consumer-produced memory operations
- Publication number
- US20110145626A2 (application number US12/605,839)
- Authority
- US
- United States
- Prior art keywords
- thread
- event
- threads
- instruction
- further improvement
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30076—Arrangements for executing specific machine instructions to perform miscellaneous control operations, e.g. NOP
- G06F9/3009—Thread control instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- G06F9/30047—Prefetch instructions; cache control instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30072—Arrangements for executing specific machine instructions to perform conditional operations, e.g. using predicates or guards
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/3013—Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3802—Instruction prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3802—Instruction prefetching
- G06F9/3814—Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4812—Task transfer initiation or dispatching by interrupt, e.g. masked
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/542—Event management; Broadcasting; Multicasting; Notifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/54—Indexing scheme relating to G06F9/54
- G06F2209/543—Local
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- Embedded processors are being used in digital televisions, digital video recorders, PDAs, mobile phones, and other appliances to support multi-media applications (including MPEG decoding and/or encoding), voice and/or graphical user interfaces, intelligent agents and other background tasks, and transparent internet, network, peer-to-peer (P2P) or other information access.
- Prior art embedded application systems typically combine: (1) one or more general purpose processors, e.g., of the ARM, MIPS or x86 variety, for handling user interface processing, high level application processing, and the operating system, with (2) one or more digital signal processors (DSPs) (including media processors) dedicated to handling specific types of arithmetic computations, at specific interfaces or within specific applications, on real time/low latency bases.
- special-purpose hardware is often provided to handle dedicated needs that a DSP is unable to handle on a programmable basis, e.g., because the DSP cannot handle multiple activities at once or because the DSP cannot meet needs for a very specialized computational element.
- a problem with the prior art systems is hardware design complexity, combined with software complexity in programming and interfacing heterogeneous types of computing elements. The result often manifests itself in embedded processing subsystems that are underpowered computationally, but that are excessive in size, cost and/or electrical power requirements. Another problem is that both hardware and software must be re-engineered for every application. Moreover, prior art systems do not load balance; capacity cannot be transferred from one hardware element to another.
- An object of this invention is to provide improved apparatus and methods for digital data processing.
- a more particular object is to provide improved apparatus and methods that support applications that have high computational requirements, real-time application requirements, multi-media requirements, voice and graphical user interfaces, intelligence, background task support, interactivity, and/or transparent Internet, networking and/or P2P access support.
- A related object is to provide such improved apparatus and methods as support multiple applications having one or more of these requirements while executing concurrently with one another.
- a further object of the invention is to provide improved apparatus and methods for processing (embedded or otherwise) that meet the computational, size, power and cost requirements of today's and future appliances, including by way of non-limiting example, digital televisions, digital video recorders, video and/or audio players, PDAs, personal knowledge navigators, and mobile phones, to name but a few.
- Yet another object is to provide improved apparatus and methods that support a range of applications, including those that are inherently parallel.
- a further object is to provide improved apparatus and methods that support multi-media and user interface driven applications.
- A still further object is to provide improved apparatus and methods for multi-tasking and multi-processing at one or more levels, including, for example, peer-to-peer multi-processing.
- Still yet another object is to provide such apparatus and methods which are low-cost, low-power and/or support robust rapid-to-market implementations.
- a virtual processor that includes one or more virtual processing units. These execute on one or more processors, and each executes one or more processes or threads (collectively, “threads”). While the threads may be constrained to executing throughout their respective lifetimes on the same virtual processing units, they need not be.
- An event delivery mechanism associates events with respective threads and notifies those threads when the events occur, regardless of which virtual processing unit and/or processor the threads happen to be executing on at the time.
- An event delivery mechanism associates hardware interrupts, software events (e.g., software-initiated events in the nature of interrupts) and memory events with those respective threads. When an event occurs, the event delivery mechanism delivers it to the appropriate thread, regardless of which virtual processing unit it is executing on at the time.
- a virtual processor as described above in which selected threads respond to notifications from the event delivery mechanism by transitioning from a suspended state to an executing state.
- a user interface thread executing on a virtual processing unit in the digital LCD TV-embedded virtual processor may transition from waiting or idle to executing in response to a user keypad interrupt delivered by the event delivery mechanism.
- Still further related aspects of invention provide a virtual processor as described above in which the event delivery mechanism notifies a system thread executing on one of the virtual processing units of an occurrence of an event associated with a thread that is not resident on a processing unit. The system thread can respond to such notification by transitioning a thread from a suspended state to an executing state.
- Still other related aspects of invention provide a virtual processor as described above wherein at least selected active threads respond to respective such notifications concurrently with one another and/or without intervention of an operating system kernel.
- Yet further aspects of invention provide a virtual processor as described above in which the event delivery mechanism includes a pending memory operation table that establishes associations between pending memory operations and respective threads that have suspended while awaiting completion of such operations.
- the event delivery mechanism signals a memory event to a thread for which all pending memory operations have completed.
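The bookkeeping implied by the pending memory operation table can be sketched in software. The following is a minimal illustrative model, not the patented hardware: the class name `PendingMemoryOpTable`, its methods `issue` and `complete`, and the use of a per-thread counter are all assumptions made for this example. It shows a memory event being signalled only once every operation a thread is waiting on has completed.

```python
from collections import defaultdict


class PendingMemoryOpTable:
    """Illustrative model only: associates pending memory operations with waiting threads."""

    def __init__(self):
        # op_id -> id of the thread that is awaiting the operation
        self._op_to_thread = {}
        # thread_id -> number of operations still outstanding for that thread
        self._outstanding = defaultdict(int)

    def issue(self, op_id, thread_id):
        """Record that thread_id has suspended while awaiting op_id."""
        self._op_to_thread[op_id] = thread_id
        self._outstanding[thread_id] += 1

    def complete(self, op_id):
        """Mark op_id complete; return the thread to signal, or None.

        A thread is returned (i.e., a memory event would be signalled)
        only when all of its pending operations have completed.
        """
        thread_id = self._op_to_thread.pop(op_id)
        self._outstanding[thread_id] -= 1
        if self._outstanding[thread_id] == 0:
            del self._outstanding[thread_id]
            return thread_id
        return None
```

In this sketch, a thread with two outstanding cache misses is not signalled when the first miss completes; the second call to `complete` returns its thread id.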
- Related aspects of the invention provide such a virtual processor that includes an event-to-thread lookup table mapping at least hardware interrupts to threads.
- Aspects of the invention provide a virtual processor as described above wherein one or more threads execute an instruction for enqueuing a software event to the event queue. According to related aspects, one or more threads executing that instruction specify which thread is to be notified of the event.
- Aspects of the invention provide a virtual processor as described above wherein at least one of the threads responds to a hardware interrupt by suspending execution of a current instruction sequence and executing an error handler.
- the thread further responds to the hardware interrupt by at least temporarily disabling event notification during execution of the error handler.
- that thread responds to the hardware interrupt by suspending the current instruction sequence following execution of the error handler.
- The invention provides digital data processors with improved data-flow-based synchronization.
- a digital data processor includes a plurality of processes and/or threads (again, collectively, “threads”), as well as a memory accessible by those threads.
- At least selected memory locations have an associated state and are capable of storing a datum for access by one or more of the threads.
- the states include at least a full state and an empty state.
- a selected thread executes a first memory instruction that references a selected memory location. If the selected location is associated with the empty state, the selected thread suspends until the selected location becomes associated with the full state.
- a related aspect of invention provides an improved such digital data processor wherein, if the selected location is associated with the full state, execution of the first instruction causes a datum stored in the selected location to be read to the selected thread and causes the selected location to become associated with the empty state.
- the plurality of executing threads are resident on one or more processing units and the suspended thread is made at least temporarily nonresident on those units.
- The invention provides, in further aspects, a digital data processor as described above wherein the selected or another thread executes a second memory instruction that references a selected memory location. If the selected location is associated with the empty state, execution of the second memory operation causes a selected datum to be stored to the selected location and causes the selected location to become associated with the full state.
- The invention provides a virtual processor comprising a memory and one or more virtual processing units that execute threads which access that memory.
- A selected thread executes a first memory instruction directed to a location in the memory. If that location is associated with an empty state, execution of the instruction causes the thread to suspend until that location becomes associated with a full state.
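The full/empty semantics of the first (consumer) and second (producer) memory instructions can be mimicked in software. This is a minimal sketch, assuming ordinary operating-system threads and a condition variable stand in for the hardware thread suspension described above; the class `FullEmptyCell` and its methods are hypothetical names. The text does not say what a producer does when the location is already full, so blocking until it becomes empty is an assumption of this sketch.

```python
import threading


class FullEmptyCell:
    """Illustrative model of one memory location with a full/empty state."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False   # the location starts in the empty state
        self._datum = None

    def is_full(self):
        with self._cond:
            return self._full

    def consume(self):
        """First (consumer) instruction: wait for full, read, mark empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()      # stands in for thread suspension
            datum = self._datum
            self._datum = None
            self._full = False
            self._cond.notify_all()
            return datum

    def produce(self, datum):
        """Second (producer) instruction: store the datum, mark full."""
        with self._cond:
            while self._full:          # assumption: wait if already full
                self._cond.wait()
            self._datum = datum
            self._full = True
            self._cond.notify_all()
```

A consumer that calls `consume()` before any datum has been produced simply waits; a later `produce(42)` on the same cell wakes it, hands over `42`, and leaves the cell empty again.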
- Related aspects of the invention provide such devices which incorporate processors with improved dataflow synchronization as described above.
- Yet further aspects of the invention provide methods paralleling the operation of the virtual processors, digital data processors and devices described above.
- FIG. 1 depicts a processor module constructed and operated in accord with one practice of the invention
- FIG. 2 contrasts thread processing by a conventional superscalar processor with that by a processor module constructed and operated in accord with one practice of the invention
- FIG. 3 depicts potential states of a thread executing in a virtual processing unit (or thread processing unit (TPU)) in a processor constructed and operated in accord with one practice of the invention
- FIG. 4 depicts an event delivery mechanism in a processor module constructed and operated in accord with one practice of the invention
- FIG. 5 illustrates a mechanism for virtual address to system address translation in a system constructed and operated in accord with one practice of the invention
- FIG. 6 depicts the organization of Level 1 and Level 2 caches in a system constructed and operated in accord with one practice of the invention
- FIG. 7 depicts the L2 cache and the logic used to perform a tag lookup in a system constructed and operated in accord with one practice of the invention
- FIG. 8 depicts logic used to perform a tag lookup in the L2 extended cache in a system constructed and operated in accord with one practice of the invention
- FIG. 9 depicts general-purpose registers, predicate registers and thread state or control registers maintained for each thread processing unit (TPU) in a system constructed and operated in accord with one practice of the invention
- FIG. 10 depicts a mechanism for fetching and dispatching instructions executed by the threads in a system constructed and operated in accord with one practice of the invention
- FIGS. 11-12 illustrate a queue management mechanism used in a system constructed and operated in accord with one practice of the invention
- FIG. 13 depicts a system-on-a-chip (SoC) implementation of the processor module of FIG. 1 including logic for implementing thread processing units in accord with one practice of the invention
- FIG. 14 is a block diagram of a pipeline control unit in a system constructed and operated in accord with one practice of the invention.
- FIG. 15 is a block diagram of an individual unit queue in a system constructed and operated in accord with one practice of the invention.
- FIG. 16 is a block diagram of the branch unit in a system constructed and operated in accord with one practice of the invention.
- FIG. 17 is a block diagram of a memory unit in a system constructed and operated in accord with one practice of the invention.
- FIG. 18 is a block diagram of a cache unit implementing any of the L1 instruction cache or L1 data cache in a system constructed and operated in accord with one practice of the invention
- FIG. 19 depicts an implementation of the L2 cache and logic of FIG. 7 in a system constructed and operated in accord with one practice of the invention
- FIG. 20 depicts the implementation of the register file in a system constructed and operated in accord with one practice of the invention
- FIGS. 21 and 22 are block diagrams of an integer unit and a compare unit in a system constructed and operated in accord with one practice of the invention
- FIGS. 23A and 23B are block diagrams of a floating point unit in a system constructed and operated in accord with one practice of the invention.
- FIGS. 24A and 24B illustrate use of consumer and producer memory instructions in a system constructed and operated in accord with one practice of the invention
- FIG. 25 is a block diagram of a digital LCD-TV subsystem in a system constructed and operated in accord with one practice of the invention.
- FIG. 26 is a block diagram of a digital LCD-TV or other application subsystem in a system constructed and operated in accord with one practice of the invention.
- FIG. 1 depicts a processor module 5 constructed and operated in accord with one practice of the invention and referred to occasionally throughout this document and the attached drawings as “SEP”.
- the module can provide the foundation for a general purpose processor, such as a PC, workstation or mainframe computer—though, the illustrated embodiment is utilized as an embedded processor.
- The module 5, which may be used singly or in combination with one or more other such modules, is suited inter alia for devices or systems whose computational requirements are parallel in nature and that benefit from multiple concurrently executing applications and/or instruction level parallelism. This can include devices or systems with real-time requirements, those that execute multi-media applications, and/or those with high computational requirements, such as image, signal, graphics and/or network processing.
- the module is also suited for integration of multiple applications on a single platform, e.g., where there is concurrent application use. It provides for seamless application execution across the devices and/or systems in which it is embedded or otherwise incorporated, as well as across the networks (wired, wireless, or otherwise) or other medium via which those devices and/or systems are coupled.
- the module is suited for peer-to-peer (P2P) applications, as well as those with user interactivity.
- The foregoing is not intended to be an extensive listing of the applications and environments to which the module 5 is suited, but merely a set of examples.
- Examples of devices and systems in which the module 5 can be embedded include, inter alia, digital LCD-TVs, e.g., of the type shown in FIG. 24, wherein the module 5 is embodied in a system-on-a-chip (SoC) configuration.
- The module need not be embodied on a single chip and, rather, may be embodied in any of a multitude of form factors, including multiple chips, one or more circuit boards, one or more separately-housed devices, and/or a combination of the foregoing.
- Further examples include digital video recorders (DVR), MP3 servers, mobile phones, applications which integrate still and video cameras, game platforms, universal networked displays (e.g., combinations of digital LCD-TV, networked information/Internet appliance, and general-purpose application platform), G3 mobile phones, personal digital assistants, and so forth.
- The module 5 includes thread processing units (TPUs) 10-20, level one (L1) instruction and data caches 22, 24, level two (L2) cache 26, pipeline control 28 and execution (or functional) units 30-38, namely, an integer processing unit, a floating-point processing unit, a compare unit, a memory unit, and a branch unit.
- TPUs 10-20 are virtual processing units, physically implemented within processor module 5, that are each bound to and process one (or more) process(es) and/or thread(s) (collectively, thread(s)) at any given instant.
- The TPUs have respective per-thread state represented in general purpose registers, predicate registers, and control registers.
- The TPUs share hardware, such as launch and pipeline control, which launches up to five instructions from any combination of threads each cycle.
- The TPUs additionally share execution units 30-38, which independently execute launched instructions without the need to know what thread they are from.
- Illustrated L2 cache 26 is shared by all of the thread processing units 10-20 and stores instructions and data on storage both internal (local) and external to the chip on which the module 5 is embodied.
- Illustrated L1 instruction and data caches 22, 24 are shared by the TPUs 10-20 and are based on storage local to the aforementioned chip. (Of course, it will be appreciated that, in other embodiments, the level 1 and level 2 caches may be configured differently, e.g., entirely local to the module 5, entirely external, or otherwise.)
- Module 5 is scalable. Two or more modules 5 may be “ganged” in an SoC or other configuration, thereby increasing the number of active threads and overall processing power. Because of the threading model used by the module 5 and described herein, the resultant increase in TPUs is software transparent. Though the illustrated module 5 has six TPUs 10-20, other embodiments may have a greater number of TPUs (as well, of course, as a lesser number). Additional functional units, moreover, may be provided, for example, boosting the number of instructions launched per cycle from five to 10-15, or higher. As evident in the discussion below of L1 and L2 cache construction, these too may be scaled.
- Illustrated module 5 utilizes Linux as an application software environment. In conjunction with multi-threading, this enables real-time and non-real-time applications to run on one platform. It also permits leveraging of open source software and applications to increase product functionality. Moreover, it enables execution of applications from a variety of providers.
- TPUs 10-20 are virtual processing units, physically implemented within a single processor module 5, that are each bound to and process one (or more) thread(s) at any given instant.
- The threads can embody a wide range of applications. Examples useful in digital LCD-TVs include MPEG2 signal demultiplexing, MPEG2 video decoding, MPEG audio decoding, digital-TV user interface operation, and operating system execution (e.g., Linux). Of course, these and/or other applications may be useful in digital LCD TVs and the range of other devices and systems in which the module 5 may be embodied.
- The threads executed by the TPUs are independent but can communicate through memory and events. During each cycle of processor module 5, instructions are launched from as many active-executing threads as necessary to utilize the execution or functional units 30-38. In the illustrated embodiment, a round-robin protocol is imposed in this regard to assure “fairness” to the respective threads (though, in other embodiments, priority or other protocols can be used instead or in addition). Although one or more system threads may be executing on the TPUs (e.g., to launch applications, facilitate thread activation, and so forth), no operating system intervention is required to execute active threads.
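The round-robin launch policy can be illustrated with a short software sketch. It is a simplification under stated assumptions: each thread is modeled as a queue of ready instructions, functional-unit types and dependencies are ignored, and the function name `launch_cycle` and its parameters are hypothetical. The default width of five mirrors the "up to five instructions" per cycle mentioned for the illustrated embodiment.

```python
from collections import deque


def launch_cycle(queues, start, width=5):
    """Pick up to `width` instructions for one cycle, round-robin across threads.

    queues: list of deques, one per thread, holding that thread's ready instructions.
    start:  index of the thread that gets first pick this cycle.
    Returns (launched, next_start), where launched is a list of
    (thread_index, instruction) pairs in launch order.
    """
    launched = []
    n = len(queues)
    if n == 0:
        return launched, start
    idx = start
    idle_visits = 0  # consecutive threads visited that had nothing to launch
    while len(launched) < width and idle_visits < n:
        q = queues[idx]
        if q:
            launched.append((idx, q.popleft()))
            idle_visits = 0
        else:
            idle_visits += 1
        idx = (idx + 1) % n
    # Rotate the starting thread so no thread is always served first.
    return launched, (start + 1) % n
```

Rotating `start` each cycle is one simple way to approximate the "fairness" the text refers to; a priority-based policy, which the text allows in other embodiments, would replace the rotation with a different selection rule.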
- Multiple active threads per processor enables a single multi-threaded processor to replace multiple application, media, signal processing and network processors. It also enables multiple threads corresponding to application, image, signal processing and networking to operate and interoperate concurrently with low latency and high performance. Context switching and interfacing overhead is minimized. Even within a single image processing application, like MP4 decode, threads can easily operate simultaneously in a pipelined manner to, for example, prepare data for frame n+1 while frame n is being composed.
- Multiple active threads per processor increases the performance of the individual processor by better utilizing functional units and tolerating memory and other event latency. It is not unusual to gain a 2× performance increase for supporting up to four simultaneously executing threads. Power consumption and die size increases are negligible, so that performance per unit power and price performance are improved. Multiple active threads per processor also lowers the performance degradation due to branches and cache misses by having another thread execute during these events. Additionally, it eliminates most context switch overhead and lowers latency for real-time activities. Moreover, it supports a general, high performance event model.
- FIG. 2 contrasts thread processing by a conventional superscalar processor with that of the illustrated processor module 5 .
- instructions from a single executing thread are dynamically scheduled to execute on available execution units based on the actual parallelism and dependencies within the code being executed. This means that on the average most execution units are not able to be utilized during each cycle. As the number of execution units increases the percentage utilization typically goes down. Also execution units are idle during memory system and branch prediction misses/waits.
- In the module 5, instructions from multiple threads (indicated by different respective stippling patterns) execute simultaneously. Each cycle, the module 5 schedules instructions from multiple threads to optimally utilize available execution unit resources. Thus the execution unit utilization and total performance are higher, while at the same time transparent to software.
- Events include hardware (or device) events, such as interrupts; software events, which are equivalent to device events but are initiated by software instructions; and memory events, such as completion of cache misses or resolution of memory producer-consumer (full-empty) transitions.
- Hardware interrupts are translated into device events which are typically handled by an idle thread (e.g., a targeted thread or a thread in a targeted group).
- Software events can be used, for example, to allow one thread to directly wake another thread.
- Each event binds to an active thread. If a specific thread binding doesn't exist, it binds to the default system thread which, in the illustrated embodiment, is always active. That thread then processes the event as appropriate including scheduling a new thread on a virtual processor. If the specific thread binding does exist, upon delivery of a hardware or software event (as discussed below in connection with the event delivery mechanism), the targeted thread is transitioned from idle to executing. If the targeted thread is already active and executing, the event is directed to the default system thread for handling.
- Threads can become non-executing (block) due to: Memory system stall (short term blockage), including cache miss and waiting on synchronization; Branch misprediction (very short term blockage); Explicitly waiting for an event (either software or hardware generated); and System thread explicitly blocking application thread.
- Events operate across physical processor modules 5 and networks, providing the basis for an efficient dynamic distributed execution environment.
- A module 5 executing in a digital LCD-TV or other device or system can execute threads and utilize memory dynamically migrated over a network (wireless, wired or otherwise) or other medium from a server or other (remote) device.
- the thread and memory-based events for example, assure that a thread can execute transparently on any module 5 operating in accord with the principles hereof. This enables, for example, mobile devices to leverage the power of other networked devices. It also permits transparent execution of peer-to-peer and multi-threaded applications on remote networked devices. Benefits include increased performance, increased functionality and lower power consumption.
- Threads run at two privilege levels, System and Application.
- A system thread can access all state of its own thread and of all other threads within the processor.
- An application thread can only access non-privileged state corresponding to itself. By default thread 0 runs at system privilege.
- Other threads can be configured for system privilege when they are created by a system privilege thread.
- Thread states are:
- Idle: Thread context is loaded into a TPU and the thread is not executing instructions.
- An Idle thread transitions to Executing, e.g., when a hardware or software event occurs.
- Waiting: Thread context is loaded into a TPU, but the thread is currently not executing instructions.
- A Waiting thread transitions to Executing when an event it is waiting for occurs, e.g., a cache operation is completed that would allow the memory instruction to proceed.
- Executing: Thread context is loaded into a TPU and the thread is currently executing instructions.
- An Executing thread transitions to Waiting, e.g., when a memory instruction must wait for cache to complete an operation, e.g., a cache miss, or when an Empty/Fill (producer-consumer memory) instruction cannot be completed.
- An Executing thread transitions to Idle when an event instruction is executed.
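- The transitions listed above can be summarized as a small state machine. The following Python sketch is illustrative only: the numeric values follow the thread state encoding given for the thread state field (0 Idle, 2 Waiting, 3 Executing), while the trigger names are hypothetical labels chosen for this example and do not appear in the specification.

```python
from enum import Enum


class ThreadState(Enum):
    IDLE = 0
    WAITING = 2
    EXECUTING = 3


# (current state, trigger) -> next state.
# Trigger names are hypothetical labels for the conditions described in the text.
TRANSITIONS = {
    (ThreadState.IDLE, "event_delivered"): ThreadState.EXECUTING,
    (ThreadState.WAITING, "awaited_event_occurred"): ThreadState.EXECUTING,
    (ThreadState.EXECUTING, "memory_op_must_wait"): ThreadState.WAITING,
    (ThreadState.EXECUTING, "event_instruction"): ThreadState.IDLE,
}


def next_state(state, trigger):
    """Return the next state, or the unchanged state if the trigger does not apply."""
    return TRANSITIONS.get((state, trigger), state)
```

- In this sketch a trigger that does not apply to the current state leaves the state unchanged, which is a simplification chosen for the example rather than documented behavior.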
- a thread enable bit (or flag or other indicator) associated with each TPU disables thread execution without disturbing any thread state for software loading and unloading of a thread onto a TPU.
- the processor module 5 load balances across active threads based on the availability of instructions to execute. The module also attempts to keep the instruction queues for each thread uniformly full. Thus, the threads that stay active the most will execute the most instructions.
- FIG. 4 shows an event delivery mechanism in a system according to one practice of the invention.
- The thread suspends execution (if currently in the Executing state) and recognizes the event by executing the default event handler, e.g., at virtual address 0x0.

| Event | Description | Thread Delivery |
| --- | --- | --- |
| Thread wait timeout | The timeout value from a wait instruction executed by thread n has expired | thread n |
| Thread exception | Executing instruction has signaled exception. | thread n |
| HW Event | | event to thread lookup |
| SW Event | Event (like sw interrupt) signaled by sw event instruction | instruction specifies thread. If that thread is not Active, Waiting or Idle, delivered to default system thread |
| Memory Event | All pending memory operations for a thread n have completed. | thread n |
- Illustrated Event Queue 40 stages events presented by hardware devices and software-based event instructions (e.g., software “interrupts”) in the form of tuples comprising virtual thread number (VTN) and event number:

| Bit | Field | Description |
| --- | --- | --- |
| | | System privilege |
| 1 | how | Specifies how the event is signaled if the thread is not in idle state. If the thread is in idle state, this field is ignored and the event is directly signaled. 0: Wait for thread in idle state. All events after this event in the queue wait also. 1: Trap thread immediately. |
| 15:4 | eventnum | Specifies the logical number for this event. The value of this field is captured in the detail field of the system exception status or application exception status register. |
| 31:16 | threadnum | Specifies the logical thread number that this event is signaled to. |
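- The field positions given above (how at bit 1, eventnum at bits 15:4, threadnum at bits 31:16) can be illustrated with a short packing and unpacking sketch in Python. The function and constant names are hypothetical, and the remaining bits, which this passage does not describe, are simply left at zero here.

```python
HOW_WAIT_FOR_IDLE = 0     # wait until the thread is idle
HOW_TRAP_IMMEDIATELY = 1  # trap the thread immediately


def pack_event(threadnum, eventnum, how):
    """Pack the three documented fields into a 32-bit word."""
    if not 0 <= threadnum < (1 << 16):
        raise ValueError("threadnum must fit in 16 bits")
    if not 0 <= eventnum < (1 << 12):
        raise ValueError("eventnum must fit in 12 bits")
    if how not in (HOW_WAIT_FOR_IDLE, HOW_TRAP_IMMEDIATELY):
        raise ValueError("how must be 0 or 1")
    return (threadnum << 16) | (eventnum << 4) | (how << 1)


def unpack_event(word):
    """Extract the three documented fields from a 32-bit word."""
    return {
        "threadnum": (word >> 16) & 0xFFFF,
        "eventnum": (word >> 4) & 0xFFF,
        "how": (word >> 1) & 0x1,
    }
```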
- the event tuples are, in turn, passed in the order received to the event-to-thread lookup table (also referred to as the event table or thread lookup table) 42 , which determines which TPU is currently handling each indicated thread.
- the events are then presented, in the form of “TPU events” comprised of event numbers, to the TPUs (and, thereby, their respective threads) via the event-to-thread delivery mechanism 44 . If no thread is yet instantiated to handle a particular event, the corresponding event is passed to a default system thread active on one of the TPUs.
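- The flow from event queue to lookup table to per-TPU delivery can be sketched in Python as follows. This is a minimal software model under stated assumptions, not the hardware described: the function name, the dictionary used for the thread-to-TPU mapping, and the `default_tpu` parameter are all hypothetical.

```python
from collections import deque


def deliver_events(event_queue, thread_to_tpu, tpu_queues, default_tpu):
    """Drain (vtn, eventnum) tuples in arrival order and route each to a TPU queue.

    thread_to_tpu maps a virtual thread number to the TPU currently handling it.
    Events for threads that are not loaded go to default_tpu, the TPU assumed
    to be running the default system thread.
    """
    while event_queue:
        vtn, eventnum = event_queue.popleft()
        tpu = thread_to_tpu.get(vtn, default_tpu)
        tpu_queues[tpu].append(eventnum)  # the TPU event carries only the event number


# Example: thread 7 is loaded on TPU 1; thread 9 is not loaded anywhere.
events = deque([(7, 1), (9, 2), (7, 3)])
queues = {0: deque(), 1: deque()}
deliver_events(events, {7: 1}, queues, default_tpu=0)
```

- After the example runs, TPU 1 holds events 1 and 3 in arrival order, and event 2 has been routed to the default system thread's TPU.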
- the event queue 40 can be implemented in hardware, software and/or a combination thereof.
- the queue is implemented as a series of gates and dedicated buffers providing the requisite queuing function.
- it is implemented in software (or hardware) linked lists, arrays, or so forth.
- the table 42 establishes a mapping between an event number (e.g., hardware interrupt) presented by a hardware device or event instruction and the preferred thread to signal the event to.
- the possible cases are:
- The table 42 may be a single storage area, dedicated or otherwise, that maintains an updated mapping of events to threads.
- the table may also constitute multiple storage areas, distributed or otherwise.
- the table 42 may be implemented in hardware, software and/or a combination thereof.
- the table is implemented by gates that perform “hardware” lookups on dedicated storage area(s) that maintains an updated mapping of events to threads. That table is software-accessible, as well—for example, by system-level privilege threads which update the mappings as threads are newly loaded into the TPUs 10 - 20 and/or deactivated and unloaded from them.
- the table 42 is implemented by a software-based lookup of the storage area that maintains the mapping.
- the event-to-thread delivery mechanism 44 may be implemented in hardware, software and/or a combination thereof.
- The mechanism 44 is implemented by gates (and latches) that route the signaled events to TPU queues which, themselves, are implemented as a series of gates and dedicated buffers 46-48 for queuing to-be-delivered events.
- the mechanism 44 is implemented in software (or other hardware structures) providing the requisite functionality and, likewise, the queues 46 - 48 are implemented in software (or hardware) linked lists, arrays, or so forth.
- Event is signaled to the TPU which is currently executing the active thread.
- The Exception Status, Exception IP and Exception MemAddress control registers are set to indicate information corresponding to the event based on the type of event. All thread state is valid.
- The TPU initiates execution at system privilege of the default event handler (at virtual address 0x0) with event signaling disabled for the corresponding thread unit.
- GP registers 0-3 and predicate registers 0-1 are utilized as scratch registers by the event handlers and are system privilege.
- GP[ 0] is the event processing stack pointer.
- the event handler saves enough state so that it can make itself re-entrant and re-enable event signaling for the corresponding thread execution unit.
- Event handler then processes the event, which could just be posting the event to a SW based queue or taking some other action.
- the event handler then restores state and returns to execution of the original thread.
- The Pending (Memory) Event Table (PET) 50 holds entries for memory operations (from memory reference instructions) which transition a thread from executing to waiting.
- The table 50, which may be implemented like the event-to-thread lookup table 42, holds the address of the pending memory operations, state information and the ID of the thread which initiated the reference.
- Event is signaled to the unit which is currently executing the active thread.
- thread wait timeouts and thread exceptions are signaled directly to the threads and are not passed through the event-to-thread delivery mechanism 44 .
- SEP supports a trap mechanism for this purpose. A list of actions based on event types follows, with a full list of the traps enumerated in the System Exception Status Register.

| Event Type | Thread State | Privilege Level | Action |
| --- | --- | --- | --- |
| Application event | Idle | Application or System | Recognize event by transitioning to execute state, application priv |
| System event | Idle | Application or System | System trap to recognize event, transition to execute state |
| Application event (wait for idle) | Waiting or executing | Application or System | Event stays queued until idle |
| Application event (trap if not idle) | Waiting or executing | Application or System | Application trap to recognize event |
| System event | Waiting or executing | Application | System trap to recognize event |
| System event (wait for idle) | Waiting or executing | System | Event stays queued until idle |
| System event (trap if not idle) | Waiting or executing | System | System trap to recognize event |
| Application Trap | Any | Application | Application trap |
| Application Trap | Any | System | System trap |
| System Trap | Any | Application | System trap, system privilege level |
| System Trap | Any | System | System trap |
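- The action table above can be read as a decision function. The Python sketch below encodes each row; the string labels for event kinds, states, privilege levels and actions are hypothetical shorthand invented for this example.

```python
def event_action(kind, state, priv, how="wait"):
    """Return the action for an event or trap.

    kind:  "app_event", "sys_event", "app_trap" or "sys_trap"
    state: "idle", "waiting" or "executing"
    priv:  "app" or "sys" (privilege level of the target thread)
    how:   "wait" (wait for idle) or "trap" (trap if not idle)
    """
    if kind == "app_trap":
        return "application_trap" if priv == "app" else "system_trap"
    if kind == "sys_trap":
        return "system_trap"
    if kind not in ("app_event", "sys_event"):
        raise ValueError(f"unknown kind: {kind}")
    if state == "idle":
        if kind == "app_event":
            return "execute_at_app_priv"
        return "system_trap_then_execute"
    # Thread is waiting or executing.
    if kind == "sys_event" and priv == "app":
        return "system_trap"
    if how == "wait":
        return "queue_until_idle"
    return "application_trap" if kind == "app_event" else "system_trap"
```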
- Illustrated processor module 5 takes the following actions when a trap occurs:
- the Privilege Level is stored into bit 0 of Exception IP register.
- the illustrated processor module 5 utilizes a virtual memory and memory system architecture having a 64-bit Virtual Address (VA) space, a 64-bit System Address (SA) (having different characteristics than a standard physical address), and a segment model of virtual address to system address translation with a sparsely filled VA or SA.
- the memory system consists of two logical levels.
- The level 1 cache is divided into separate data and instruction caches, 24, 22, respectively, for optimal latency and bandwidth.
- Illustrated level 2 cache 26 consists of an on-chip portion and off-chip portion referred to as level 2 extended.
- the level 2 cache is the memory system for the individual SEP processor(s) 5 and contributes to a distributed “all cache” memory system in implementations where multiple SEP processors 5 are used.
- those multiple processors would not have to be physically sharing the same memory system, chips or buses and could, for example, be connected over a network or otherwise.
- FIG. 5 illustrates VA to SA translation used in the illustrated system, which translation is handled on a segment basis, where (in the illustrated embodiment) those segments can be of variable size, e.g., 2^21 to 2^48 bytes.
- the SAs are cached in the memory system. So an SA that is present in the memory system has an entry in one of the levels of cache 22 / 24 , 26 . An SA that is not present in any cache (and the memory system) is effectively not present in the memory system. Thus, the memory system is filled sparsely at the page (and subpage) granularity in a way that is natural to software and OS, without the overhead of page tables on the processor.
- the virtual memory and memory system architecture of the illustrated embodiment has the following additional features: Direct support for distributed shared: Memory (DSM), Files (DSF), Objects (DSO), Peer to Peer (DSP2P); Scalable cache and memory system architecture; Segments that can be shared between threads; Fast level 1 cache, since lookup is in parallel with tag access, with no complete virtual-to-physical address translation or complexity of virtual cache.
- a virtual address in the illustrated system is the 64-bit address constructed by memory reference and branch instructions.
- the virtual address is translated on a per segment basis to a system address which is used to access all system memory and IO devices.
- Each segment can vary in size from 2^24 to 2^48 bytes.
- the virtual address 50 is used to match an entry in a segment table 52 in the manner shown in the drawing.
- The matched entry 54 specifies the corresponding system address, when taken in combination with the components of the virtual address identified in the drawing.
- The matched entry 54 specifies the corresponding segment size and privilege. That system address, in turn, maps into the system memory which in the illustrated embodiment comprises 2^64 bytes sparsely filled.
- the illustrated embodiment permits address translation to be disabled by threads with system privilege, in which case the segment table is bypassed and all addresses are truncated to the low 32 bits.
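- As a rough illustration of segment-based translation, the Python sketch below matches a virtual address against a per-thread segment table and forms the system address from the matched entry. The entry layout (`va_base`, `sa_base`, `size_log2`), the alignment of segments to their size, and the `LookupError` on a miss are all assumptions made for this example; the actual matching is defined by the drawing, which is not reproduced here.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1


@dataclass(frozen=True)
class SegmentEntry:
    va_base: int    # virtual base address, aligned to the segment size (assumed layout)
    sa_base: int    # system base address, aligned to the segment size (assumed layout)
    size_log2: int  # segment size is 2**size_log2 bytes; the text allows 24 to 48


def translate(va, segment_table, translation_enabled=True):
    """Translate a 64-bit virtual address to a system address."""
    va &= MASK64
    if not translation_enabled:
        # Segment table bypassed: the address is truncated to the low 32 bits.
        return va & 0xFFFF_FFFF
    for entry in segment_table:
        offset_mask = (1 << entry.size_log2) - 1
        if (va & ~offset_mask & MASK64) == entry.va_base:
            return entry.sa_base | (va & offset_mask)
    raise LookupError(f"no segment maps virtual address {va:#x}")
```

- Sharing a segment between two threads, as described for the segment table, would correspond in this model to each thread's table holding its own entry with the same `sa_base`.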
- Illustrated segment table 52 comprises 16-32 entries per thread (TPU).
- the table may be implemented in hardware, software and/or a combination thereof.
- the table is implemented in hardware, with separated entries in memory being provided for each thread (e.g., a separate table per thread).
- a segment can be shared among two or more threads by setting up a separate entry for each thread that points to the same system address.
- Other hardware or software structures may be used instead, or in addition, for this purpose.
- Level 1 cache is organized as separate Level 1 instruction cache 22 and level 1 data cache 24 to maximize instruction and data bandwidth.
- The on-chip L2 cache 26a consists of the tag and data portions. In the illustrated embodiment, it is 0.5-1 Mbytes in size, with 128 blocks, 16-way associative. Each block stores 128 bytes of data or 16 extended L2 tags, with 64 kbytes provided to store the extended L2 tags. A tag-mode bit within the tag indicates that the data portion consists of 16 tags for Extended L2 Cache.
- The extended L2 cache 26b is, as noted above, DDR DRAM-based, though other memory types can be employed. In the illustrated embodiment, it is up to 1 gbyte in size, 256-way associative, with 16k byte pages and 128 byte subpages. For a configuration of 0.5 mbyte L2 cache 26a and 1 gbyte L2 extended cache 26b, only 12% of on-chip L2 cache is required to fully describe L2 extended. For larger on-chip L2 or smaller L2 extended sizes the percentage is lower. The aggregation of L2 caches (on-chip and extended) makes up the distributed SEP memory system.
- both the L 1 instruction cache 22 and L 1 data cache 24 are 8-way associative with 32 bytes and 128 byte blocks.
- both level 1 caches are proper subsets of level 2 cache.
- The level 2 cache consists of an on-chip L2 cache and an off-chip extended L2 cache.
- FIG. 7 depicts the L 2 cache 26 a and the logic used in the illustrated embodiment to perform a tag lookup in L 2 cache 26 a to identify a data block 70 matching an L 2 cache address 78 .
- that logic includes sixteen Cache Tag Array Groups 72 a - 72 p, corresponding Tag Compare elements 74 a - 74 p and corresponding Data Array Groups 76 a - 76 p. These are coupled as indicated to match an L 2 cache address 78 against the Group Tag Arrays 72 a - 72 p, as shown, and to select the data block 70 identified by the indicated Data Array Group 76 a - 76 p, again, as shown.
- The Cache Tag Array Groups 72a-72p, Tag Compare elements 74a-74p, and corresponding Data Array Groups 76a-76p may be implemented in hardware, software and/or a combination thereof. In the embedded, SoC implementation represented by module 5, these are implemented as shown in FIG. 19, which shows the Cache Tag Array Groups 72a-72p embodied in 32×256 single port memory cells and the Data Array Groups 76a-76p embodied in 128×256 single port memory cells, all coupled with current state control logic 190 as shown.
- That element is, in turn, coupled to state machine 192 which facilitates operation of the L 2 cache unit 26 a in a manner consistent herewith, as well as with a request queue 192 which buffers requests from the L 1 instruction and data caches 22 , 24 , as shown.
- the logic element 190 is further coupled with DDR DRAM control interface 26 c which provides an interface to the off-chip portion 26 b of the L 2 cache. It is likewise coupled to AMBA interface 26 d providing an interface to AMBA-compatible components, such as liquid crystal displays (LCDs), audio out interfaces, video in interfaces, video out interfaces, network interfaces (wireless, wired or otherwise), storage device interfaces, peripheral interfaces (e.g., USB, USB2), bus interfaces (PCI,ATA), to name but a few.
- the DDR DRAM interface 26 c and AMBA interface 26 d are likewise coupled to an interface 196 to the L 1 instruction and data caches by way of L 2 data cache bus 198 , as shown.
- FIG. 8 likewise depicts the logic used in the illustrated embodiment to perform a tag lookup in L 2 extended cache 26 b and to identify a data block 80 matching the designated address 78 .
- That logic includes Data Array Groups 82a-82p, corresponding Tag Compare elements 84a-84p, and Tag Latch 86. These are coupled as indicated to match an L2 cache address 78 against the Data Array Groups 72a-72p, as shown, and to select a tag from one of those groups that matches the corresponding portion of the address 78, again, as shown. The physical page number from the matching tag is combined with the index portion of the address 78, as shown, to identify data block 80 in the off-chip memory 26b.
- the Data Array Groups 82 a - 82 p and Tag Compare elements 84 a - 84 p may be implemented in hardware, software and/or a combination thereof. In the embedded, SoC implementation represented by module 5 , these are implemented in gates and dedicated memory providing the requisite lookup and tag comparison functions. Other hardware or software structures may be used instead, or in addition, for this purpose.
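- The way a request from the L1 cache falls through from the on-chip L2 cache to the extended L2 (L2E) cache can be sketched in Python. This is a simplified software model: dictionaries and a set stand in for the tag and data arrays, the function name and outcome labels are hypothetical, and block versus page granularity, associativity and eviction are all omitted.

```python
def l2_read(addr, l2, l2e_tags, l2e_data):
    """Serve an L1 request and report which path was taken.

    l2:       dict mapping address -> data held in the on-chip L2 cache
    l2e_tags: set of addresses that currently have an L2E tag
    l2e_data: dict mapping address -> data held in off-chip L2E memory
    """
    if addr in l2:                  # L2 tag hit
        return l2[addr], "l2_hit"
    if addr in l2e_tags:            # L2E tag hit
        outcome = "l2e_hit"
    else:                           # miss in both: allocate an L2E tag first
        l2e_tags.add(addr)
        outcome = "l2e_allocated"
    data = l2e_data.get(addr)       # access L2E data (None if nothing is stored yet)
    l2[addr] = data                 # allocate tag in L2 and store in the L2 entry
    return data, outcome            # respond back with data to the L1 cache
```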
- L2 tag lookup
  if hit
      respond back with data to L1 cache
  else L2E tag lookup,
      if hit
          allocate tag in L2;
          access L2E data, store in corresponding L2 entry;
          respond back with data to L1 cache;
      else extended L2E tag lookup
          allocate L2E tag;
          allocate tag in L2;
          access L2E data, store in corresponding L2 entry;
          respond back with data to L1 cache;

Thread Processing Unit State
- each TPU 10 - 20 includes general-purpose registers, predicate registers, and control registers, as shown in FIG. 9 . Threads at both system and application privilege levels contain identical state, although some thread state information is only visible when at system privilege level—as indicated by the key and respective stippling patterns.
- each TPU additionally includes a pending memory event table, an event queue and an event-to-thread lookup table, none of which are shown in FIG. 9 .
- each thread has up to 128 general purpose registers depending on the implementation.
- General Purpose registers 3-0 are visible at system privilege level and can be utilized for event stack pointer and working registers during early stages of event processing.
- the predicate registers are part of the general purpose SEP predication mechanism.
- the execution of each instruction is conditional based on the value of the reference predicate register.
- the SEP provides up to 64 one-bit predicate registers as part of thread state. Each predicate register holds what is called a predicate, which is set to 1 (true) or reset to 0 (false) based on the result of executing a compare instruction.
- Predicate registers 3-1 (PR[3:1]) are visible at system privilege level and can be utilized for working predicates during early stages of event processing. Predicate register 0 is read only and always reads as 1, true. It is used by instructions to make their execution unconditional.

| Bit | Field | Description | Privilege | Per | |
| --- | --- | --- | --- | --- | --- |
| | | On reset set for thread 0, cleared for all other threads. 0: Thread operation is disabled. System thread can load or store thread state. 1: Thread operation is enabled. | System_rw | Thread | Branch |
| 3 | priv | Privilege level. On reset cleared. 0: System privilege. 1: Application privilege. | System_rw, app_r | Thread | Branch |
| 5:4 | state | Thread State. On reset set to “executing” for thread0, set to “idle” for all other threads. 0: Idle. 1: reserved. 2: Waiting. 3: Executing. | System_rw | Thread | Branch |
| 15:8 | mod[7:0] | GP Registers Modified. Cleared on reset. | App_rw | Thread | Pipe |

| Bit | Field | Description | Privilege | Per |
| --- | --- | --- | --- | --- |
| 63:4 | Doubleword | Address of instruction doubleword | app | thread |
| 3:1 | mask | Indicates which instructions within instruction doubleword remain to be executed. Bit1: first instruction doubleword bit[40:00]. Bit2: second instruction doubleword bit[81:41]. Bit3: third instruction doubleword, bit[122:82]. | app | thread |
| 0 | | Always read as zero | app | thread |
- Bit[0] is the privilege level at the time of the exception.

| Bit | Field | Description | Privilege | Per |
| --- | --- | --- | --- | --- |
| 63:4 | Doubleword | Address of instruction doubleword which signaled exception | system | thread |
| 3:1 | mask | Indicates which instructions within instruction doubleword remain to be executed. Bit1: first instruction doubleword bit[40:00]. Bit2: second instruction doubleword bit[81:41]. Bit3: third instruction doubleword, bit[122:82]. | system | thread |
| 0 | priv | Privilege level of thread at time of exception | system | thread |
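- The exception instruction pointer layout above (bits 63:4 doubleword address, bits 3:1 mask, bit 0 privilege) can be decoded with a few shifts and masks. The Python sketch below is illustrative; the function and key names are hypothetical, and the raw 60-bit field is returned as-is rather than being converted to a byte address.

```python
def decode_exception_ip(word):
    """Split a 64-bit exception IP value into its documented fields."""
    word &= (1 << 64) - 1
    mask = (word >> 1) & 0b111  # register bits 3:1
    return {
        "doubleword_field": word >> 4,  # register bits 63:4
        # Register Bit1, Bit2, Bit3 flag the first, second and third instruction.
        "pending": [bool(mask & 1), bool(mask & 2), bool(mask & 4)],
        "priv": word & 1,               # register bit 0
    }
```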
- Bit[0] is the privilege level at the time of the exception.
- Utilized by ISTE and ISTE registers to specify the STE and field that is read or written.

| Bit | Field | Description | Privilege | Per |
| --- | --- | --- | --- | --- |
| 0 | field | Specifies the low (0) or high (1) portion of Segment Table Entry | system | thread |
| 5:1 | ste number | Specifies the STE number that is read into STE Data Register. | system | thread |

| Bit | Field | Description | Privilege | Per |
| --- | --- | --- | --- | --- |
| 6:2 | bank | Specifies the bank that is read from Level1 Cache Tag Entry. The first implementation has valid banks 0x0-f. | system | thread |
| 13:7 | index | Specifies the index address within a bank that is read from Level1 Cache Tag Entry | system | thread |
- the Event Queue Control Register (EQCR) enables normal and diagnostic access to the event queue.
- the sequence for using the register is a register write followed by a register read.
- The contents of the reg_op field specify the operation for the write and the next read.
- the actual register modification or read is triggered by the write.

| Bit | Field | Description | Privilege | Per |
| --- | --- | --- | --- | --- |
| 1:0 | reg_op | Specifies the register operation for that write and the next read. Valid for register read. 0: read. 1: write. 2: push onto queue. 3: pop from queue. | system | proc |
| 17:2 | event | For writes and push specifies the event number written or pushed onto the queue. For read and pop operations contains the event number read or popped from the queue. | system | proc |
| 18 | empty | Indicates whether the queue was empty prior to the current operation. | system | proc |
| 31:19 | address | Specifies the address for read and write queue operations. Address field is don't care for push and pop operations. | system | proc |
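- The EQCR field layout (reg_op at bits 1:0, event at 17:2, empty at 18, address at 31:19) lends itself to a small encode and decode sketch. The Python below is illustrative only; the helper names are hypothetical, and it models the bit layout alone, not the write-then-read protocol or the queue itself.

```python
REG_OP = {"read": 0, "write": 1, "push": 2, "pop": 3}


def eqcr_encode(op, event=0, address=0):
    """Build the value software would write to the register for a given operation."""
    if not 0 <= event < (1 << 16):
        raise ValueError("event must fit in 16 bits")
    if not 0 <= address < (1 << 13):
        raise ValueError("address must fit in 13 bits")
    return REG_OP[op] | (event << 2) | (address << 19)


def eqcr_decode(word):
    """Split a value read back from the register into its fields."""
    return {
        "reg_op": word & 0x3,
        "event": (word >> 2) & 0xFFFF,
        "empty": (word >> 18) & 0x1,
        "address": (word >> 19) & 0x1FFF,
    }
```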
- the Event to Thread lookup table establishes a mapping between an event number presented by a hardware device or event instruction and the preferred thread to signal the event to. Each entry in the table specifies an event number with a bit mask and a corresponding thread that the event is mapped to.

| Bit | Field | Description | Privilege | Per |
| --- | --- | --- | --- | --- |
| 0 | reg_op | Specifies the register operation for that write and the next read. Valid for register read. 0: read. 1: write. | system | proc |
| 16:1 | event[15:0] | For writes specifies the event number written at the specified table address. For read operations contains the event number at the specified table address. | system | proc |
| 32:17 | mask[15:0] | Specifies whether the corresponding event bit is significant. 0: significant. 1: don't care. | system | proc |
| 40:33 | thread | | | |
| 48:41 | address | Specifies the table address for read and write operations. Address field is don't care for push and pop operations. | system | proc |
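- Matching an event number against table entries that carry a bit mask can be sketched as a masked comparison. In the Python below, a mask bit of 1 means the corresponding event bit is "don't care", as in the table above. The first-match priority and the fallback to a default thread are assumptions for this example; the function and parameter names are hypothetical.

```python
def lookup_thread(eventnum, table, default_thread=0):
    """Return the thread mapped to eventnum.

    table is a list of (event, mask, thread) tuples with 16-bit event and mask.
    A mask bit of 0 marks the event bit as significant; 1 marks it don't care.
    """
    for event, mask, thread in table:
        significant = ~mask & 0xFFFF
        if (eventnum & significant) == (event & significant):
            return thread
    return default_thread  # no entry matched: hand off to the default system thread
```

- With an entry of `(0x0010, 0x000F, 5)`, every event number from 0x0010 through 0x001F maps to thread 5, since the low four bits are ignored.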
- all timer and performance monitor registers are accessible at application privilege.

| Bit | Field | Description | Privilege | Per |
| --- | --- | --- | --- | --- |
| 31:0 | active | Saturating count of the number of cycles the thread is in active-executing state. Cleared on read. Value of all 1's indicates that the count has overflowed. | app | thread |
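- The behavior of a 32-bit saturating, clear-on-read counter such as the one above can be modeled in a few lines of Python. The class and method names are hypothetical and the model is a software illustration, not the hardware counter.

```python
class SaturatingCounter:
    """32-bit counter that sticks at all 1's on overflow and clears when read."""

    MAX = (1 << 32) - 1

    def __init__(self):
        self.value = 0

    def tick(self, cycles=1):
        # Saturate instead of wrapping around.
        self.value = min(self.MAX, self.value + cycles)

    def read(self):
        value, self.value = self.value, 0  # cleared on read
        return value
```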
- Each active thread corresponds to a virtual processor and is specified by an 8-bit active thread number (activethread[7:0]).
- The module 5 supports a 16-bit thread ID (threadid[15:0]) to enable rapid loading (activation) and unloading (de-activation) of threads.
- Other embodiments may support thread IDs of different sizes.
- FIG. 10 is an abstraction of the mechanism employed by module 5 to fetch and dispatch those instructions for execution on functional units 30 - 38 .
- instructions are fetched from the L 1 cache 22 and placed in instruction queues 10 a - 20 a associated with each respective TPU 10 - 20 .
- This is referred to as the fetch stage of the cycle.
- Three to six instructions are fetched for each single thread, with an overall goal of keeping thread queues 10a-20a at equal levels.
- different numbers of instructions may be fetched and/or different goals set for relative filling of the queues.
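- One simple policy that works toward keeping the thread queues at equal levels is to fetch next for the thread whose queue is emptiest. The Python sketch below illustrates that policy only; it is not stated to be the policy used by module 5. The function name and the default capacity of 12 (one of the queue sizes mentioned for the pipeline control) are choices made for this example, and the lower bound of three instructions per fetch is not enforced.

```python
def pick_fetch(queue_levels, capacity=12, max_fetch=6):
    """Choose which thread queue to fill next and how many instructions to fetch.

    queue_levels maps a TPU id to the number of instructions already queued.
    Returns (tpu, count), or None when the emptiest queue is already full.
    """
    # Emptiest queue first; ties are broken by the lowest TPU id.
    tpu = min(queue_levels, key=lambda t: (queue_levels[t], t))
    count = min(max_fetch, capacity - queue_levels[tpu])
    return (tpu, count) if count > 0 else None
```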
- the module 5 (and, specifically, for example, the event handling mechanisms discussed above) recognize events and transition corresponding threads from waiting to executing.
- In the dispatch stage, which executes in parallel with the fetch and execute/retire stages, instructions from each of one or more executing threads are dispatched to the functional units 30-38 based on a round-robin protocol that takes into account best utilization of those resources for that cycle.
- These instructions can be from any combination of threads.
- The compiler specifies, e.g., utilizing “stop” flags provided in the instruction set, boundaries between groups of instructions within a thread that can be launched in a single cycle.
- other protocols may be employed, e.g., ones that prioritize certain threads, ones that ignore resource utilization, and so forth.
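- A round-robin dispatch of this kind can be sketched in Python as a single pass over the executing threads, starting from a rotating index. This is a simplified model under stated assumptions: each queued instruction is represented only by the name of the functional unit it needs, each unit accepts one instruction per cycle, instructions issue in order within a thread, and compiler "stop" flags are ignored. All names are hypothetical.

```python
from collections import deque

UNITS = ("integer", "floating", "branch", "compare", "memory")


def dispatch_cycle(thread_queues, start):
    """Dispatch one cycle's worth of instructions.

    thread_queues is a list of deques, one per executing thread, each holding
    the functional unit name required by the thread's next instructions.
    Returns (issued, next_start), where issued is a list of (thread, unit).
    """
    free = set(UNITS)
    issued = []
    n = len(thread_queues)
    for i in range(n):
        t = (start + i) % n
        queue = thread_queues[t]
        # Issue in order within a thread until its next unit is already taken.
        while queue and queue[0] in free:
            unit = queue.popleft()
            free.remove(unit)
            issued.append((t, unit))
    next_start = (start + 1) % n if n else 0
    return issued, next_start
```

- Rotating the starting thread each cycle gives every thread first claim on the functional units in turn, which is the fairness property the round-robin protocol is meant to provide.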
- In the execute & retire phase, which executes in parallel with the fetch and dispatch stages, multiple instructions are executed from one or more threads simultaneously.
- up to five instructions are launched and executed each cycle, e.g., by the integer, floating, branch, compare and memory functional units 30 - 38 .
- greater or fewer instructions can be launched, for example, depending on the number and type of functional units and depending on the number of TPUs.
- An instruction is retired after execution if it completes: its result is written and the instruction is cleared from the instruction queue.
- FIG. 11 illustrates a three-pointer queue management mechanism used in the illustrated embodiment to facilitate this.
- an instruction queue and a set of three pointers is maintained for each TPU 10 - 20 .
- the queue 110 holds instructions fetched, executing and retired (or invalid) for the associated TPU—and, more particularly, for the thread currently active in that TPU.
- the next instruction for execution is identified by the Extract (or Issue) pointer 114 .
- the Commit pointer 116 identifies the last instruction whose execution has been committed. When an instruction is blocked or otherwise aborted, the Commit pointer 116 is rolled back to quash instructions between Commit and Extract pointers in the execution pipeline. Conversely, when a branch is taken, the entire queue is flushed and the pointers reset.
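- A three-pointer instruction queue can be modeled in Python as a circular buffer with insert, extract and commit positions. The sketch below is one interpretation and not the hardware design: the class and method names are hypothetical, the commit position is kept as "one past the last committed instruction", and a quash is modeled by moving the extract position back to the commit position so that issued but uncommitted instructions are issued again, which may differ from how the illustrated embodiment rolls back its pointers.

```python
class InstructionQueue:
    """Circular instruction buffer tracked by insert, extract and commit positions."""

    def __init__(self, size=12):
        self.slots = [None] * size
        self.insert = 0   # where the next fetched instruction is written
        self.extract = 0  # next instruction to issue
        self.commit = 0   # one past the last committed instruction (assumed convention)

    def fetch(self, instr):
        """Store a fetched instruction; return False if the queue is full."""
        if self.insert - self.commit >= len(self.slots):
            return False
        self.slots[self.insert % len(self.slots)] = instr
        self.insert += 1
        return True

    def issue(self):
        """Return the next instruction to execute, or None if none is waiting."""
        if self.extract == self.insert:
            return None
        instr = self.slots[self.extract % len(self.slots)]
        self.extract += 1
        return instr

    def commit_one(self):
        """Mark the oldest issued instruction as committed, freeing its slot."""
        if self.commit < self.extract:
            self.commit += 1

    def quash(self):
        """Discard issued but uncommitted work (assumed rollback behavior)."""
        self.extract = self.commit

    def flush(self):
        """Empty the queue and reset all pointers, as on a taken branch."""
        self.insert = self.extract = self.commit = 0
```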
- the queuing mechanism depicted in FIG. 11 can be implemented, for example, as shown in FIG. 12 .
- Instructions are stored in dual ported memory 120 or, alternatively, in a series of registers (not shown).
- the write address at which each newly fetched instruction is stored is supplied by Fetch pointer logic 122 that responds to a Fetch command (e.g., issued by the pipeline control) to generate successive addresses for the memory 120 .
- Issued instructions are taken from the other port, here, shown at bottom.
- the read address from which each instruction is taken is supplied by Issue/Commit pointer logic 124 . That logic responds to Commit and Issue commands (e.g., issued by the pipeline control) to generate successive addresses and/or to reset, as appropriate.
- FIG. 13 depicts an SoC implementation of the processor module 5 of FIG. 1 including, particularly, logic for implementing the TPUs 10-20.
- the implementation of FIG. 13 includes L 1 and L 2 caches 22 - 26 , which are constructed and operated as discussed above.
- the implementation includes functional units 30 - 34 comprising an integer unit, a floating-point unit, and a compare unit, respectively. Additional functional units can be provided instead or in addition.
- Logic for implementing the TPUs 10-20 includes pipeline control 130, branch unit 38, memory unit 36, register file 136 and load-store buffer 138.
- The components shown in FIG. 13 are interconnected for control and information transfer as shown, with dashed lines indicating major control, thin solid lines indicating predicate value control, thicker solid lines identifying a 64-bit data bus and still thicker lines identifying a 128-bit data bus. It will be appreciated that FIG. 13 represents one implementation of a processor module 5 according to the invention and that other implementations may be realized as well.
- pipeline control 130 contains the per-thread queues discussed above in connection with FIGS. 11-12 . These can be parameterized at 12, 15 or 18 instructions per thread.
- the control 130 picks up instructions from those queues on a round robin basis (though, as also noted, this can be performed on other bases as well). It controls the sequence of accesses to the register file 136 (which is the resource which provides source and destination registers for the instructions), as well as to the functional units 30 - 38 .
- the pipeline control 130 decodes basic instruction classes from the per-thread queues and dispatches instructions to the functional units 30 - 38 . As noted above, multiple instructions from one or more threads can be scheduled for execution by those functional units in the same cycle.
- the control 130 is additionally responsible for signaling the branch unit 38 as it empties the per-thread instruction queues, and for idling the functional units when possible, e.g., on a cycle by cycle basis, to decrease power consumption.
- FIG. 14 is a block diagram of the pipeline control unit 130 .
- the unit includes control logic 140 for the thread class queues, the thread class (or per-thread) queues 142 themselves, an instruction dispatch 144 , a longword decode unit 146 , and functional unit queues 148 a - 148 e , connected to one another (and to the other components of module 5 ) as shown in the drawing.
- the thread class (per-thread) queues are constructed and operated as discussed above in connection with FIGS. 11-12 .
- the thread class queue control logic 140 controls the input side of those queues 142 and, hence, provides the Insert pointer functionality shown in FIGS. 11-12 and discussed above.
- the control logic 140 is also responsible for controlling the input side of the unit queues 148 a - 148 e , and for interfacing with the branch unit 38 to control instruction fetching. In this latter regard, logic 140 is responsible for balancing instruction fetching in the manner discussed above (e.g., so as to compensate for those TPUs that are retiring the most instructions).
- the instruction dispatch 144 evaluates and determines, each cycle, the schedule of available instructions in each of the thread class queues. As noted above, in the illustrated embodiment the queues are handled on a round robin basis with account taken for queues that are retiring instructions more rapidly.
- the instruction dispatch 144 also controls the output side of the thread class queues 142 . In this regard, it manages the Extract and Commit pointers discussed above in connection with FIGS. 11-12 , including updating the Commit pointer when instructions have been retired and rolling that pointer back when an instruction is aborted (e.g., for thread switch or exception).
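- As a rough illustration of round robin selection across the thread class queues, the following sketch picks instructions for one cycle. It is a simplification under stated assumptions: the function name and the `width` parameter are hypothetical, at most one instruction is taken per thread per cycle, and the weighting toward queues that retire instructions more rapidly is not modeled.

```python
from collections import deque


def dispatch_cycle(queues, start, width):
    """Pick up to `width` instructions, visiting thread queues round robin.

    Returns (picked, next_start); picked holds (thread_index, instruction) pairs.
    """
    picked = []
    n = len(queues)
    for offset in range(n):
        if len(picked) == width:
            break
        t = (start + offset) % n
        if queues[t]:  # skip threads with nothing available this cycle
            picked.append((t, queues[t].popleft()))
    # Rotate the starting thread so that no queue is permanently favored.
    return picked, (start + 1) % n


queues = [deque(["a0", "a1"]), deque(), deque(["c0"])]
cycle1, start = dispatch_cycle(queues, 0, 2)
cycle2, start = dispatch_cycle(queues, start, 2)
```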
- the longword decode unit 146 decodes incoming instruction longwords from the L 1 instruction cache 22 . In the illustrated embodiment, each such longword is decoded into three instructions. This can be parameterized for decoding one or two longwords, which decode into three and six instructions, respectively. The decode unit 146 is also responsible for decoding the instruction class of each instruction.
- Unit queues 148 a - 148 e queue actual instructions which are to be executed by the functional units 30 - 38 . Each queue is organized on a per-thread basis and is kept consistent with the class queues. The unit queues are coupled to the thread class queue control 140 and to the instruction dispatch 144 for control purposes, as discussed above. Instructions from the queues 148 a - 148 e are transferred to corresponding pipelines 150 a - 150 e en route to the functional units themselves 30 - 38 . The instructions are also passed to the register file pipeline 152 .
- FIG. 15 is a block diagram of an individual unit queue, e.g., 148 a. This includes one instruction queue 154 a - 154 e for each TPU. These are coupled to the thread class queue control 140 (labeled tcqueue_ctl) and the instruction dispatch 144 (labeled idispatch) for control purposes. These are also coupled to the longword decode unit 146 (labeled lwdecode) for instruction input and to a thread selection unit 156 , as shown. That unit controls thread selection based on control signals provided by instruction dispatch 144 , as shown. Output from unit 156 is routed to the corresponding pipeline 150 a - 150 e, as well as to the register file pipeline 152 .
- integer unit pipeline 150 a and floating-point unit pipeline 150 b decode appropriate instruction fields for their respective functional units. Each pipeline also times the commands to its respective functional unit. Moreover, each pipeline 150 a , 150 b applies squashing to the respective pipeline based on branching or aborts, and each applies a powerdown signal to its respective functional unit when it is not used during a cycle. Illustrated compare unit pipeline 150 c , branch unit pipeline 150 d , and memory unit pipeline 150 e provide like functionality for their respective functional units, compare unit 34 , branch unit 38 and memory unit 36 . Register file pipeline 152 also provides like functionality with respect to register file 136 .
- illustrated branch unit 38 is responsible for instruction address generation and address translation, as well as instruction fetching. In addition, it maintains state for the thread processing units 10 - 20 .
- FIG. 16 is a block diagram of the branch unit 38 . It includes control logic 160 , thread state stores 162 a - 162 e , thread selector 164 , address adder 166 , segment translation content addressable memory (CAM) 168 , connected to one another (and to the other components of module 5 ) as shown in the drawing.
- the control logic 160 drives unit 38 based on a command signal from the pipeline control 130 . It also takes as input the instruction cache 22 state and the L 2 cache 26 acknowledgement, as illustrated.
- the logic 160 outputs a thread switch to the pipeline control 130 , as well as commands to the instruction cache 22 and the L 2 cache, as illustrated.
- the thread state stores 162 a - 162 e store thread state for each of the respective TPUs 10 - 20 . For each of those TPUs, they maintain the general-purpose registers, predicate registers and control registers shown in FIG. 3 and discussed above.
- Address information obtained from the thread state stores is routed to the thread selector, as shown, which selects the thread address from which an address computation is to be performed based on a control signal (as shown) from the control logic 160 .
- the address adder 166 increments the selected address or performs a branch address calculation, based on output of the thread selector 164 and addressing information supplied by the register file (labelled register source), as shown.
- the address adder 166 outputs a branch result.
- the newly computed address is routed to the segment translation memory 168 , which operates as discussed above in connection with FIG. 5 , which generates a translated instruction cache address for use in connection with the next instruction fetch.
- memory unit 36 is responsible for memory reference instruction execution, including data cache 24 address generation and address translation. In addition, unit 36 maintains the pending (memory) event table (PET) 50 , discussed above.
- FIG. 17 is a block diagram of the memory unit 36 . It includes control logic 170 , address adder 172 , and segment translation content addressable memory (CAM) 174 , connected to one another (and to the other components of module 5 ) as shown in the drawing.
- the control logic 170 drives unit 36 based on a command signal from the pipeline control 130 . It also takes as input the data cache 24 state and the L 2 cache 26 acknowledgement, as illustrated.
- the logic 170 outputs a thread switch to the pipeline control 130 and branch unit 38 , as well as commands to the data cache 24 and the L 2 cache, as illustrated.
- the address adder 172 increments addressing information provided from the register file 136 or performs a requisite address calculation.
- the newly computed address is routed to the segment translation memory 174 , which operates as discussed above in connection with FIG. 5 , which generates a translated data cache address for use in connection with a data access.
- the unit 36 also includes the PET, as previously mentioned.
- FIG. 18 is a block diagram of a cache unit implementing either of the L 1 instruction cache 22 or L 1 data cache 24 .
- the unit includes sixteen 128×256 byte single port memory cells 180 a - 180 p serving as data arrays, along with sixteen corresponding 32×56 byte dual port memory cells 182 a - 182 p serving as tag arrays. These are coupled to L 1 and L 2 address and data buses as shown.
- Control logic 184 and 186 are coupled to the memory cells and to L 1 cache control and L 2 cache control, also as shown.
- the register file 136 serves as the resource for all source and destination registers accessed by the instructions being executed by the functional units 30 - 38 .
- the register file is implemented as shown in FIG. 20 .
- the unit 136 is decomposed into a separate register file instance per functional unit 30 - 38 .
- each instance provides forty-eight 64-bit registers for each of the TPUs. Other embodiments may vary, depending on the number of registers allotted the TPUs, the number of TPUs and the sizes of the registers.
- Each instance 200 a - 200 e has five write ports, as illustrated by the arrows coming into the top of each instance, via which each of the functional units 30 - 38 can simultaneously write output data (thereby ensuring that the instances retain consistent data).
- Each provides a varying number of read ports, as illustrated by the arrows emanating from the bottom of each instance, via which their respective functional units obtain data.
- the instances associated with the integer unit 30 , the floating point unit 32 and the memory unit all have three read ports
- the instance associated with the compare unit 34 has two read ports
- the instance associated with the branch unit 38 has one port, as illustrated.
- the register file instances 200 a - 200 e can be optimized by having all ports read for a single thread each cycle. In addition, storage bits can be folded under wires to port access.
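- The replication scheme described above (one register file instance per functional unit, with every write broadcast to all instances) can be modeled as follows. This is a behavioral sketch only: the class name, unit names and thread count are hypothetical, and port limits and timing are not modeled.

```python
class ReplicatedRegisterFile:
    """One register file copy per functional unit, kept consistent by broadcast writes."""

    MASK_64 = 0xFFFFFFFFFFFFFFFF

    def __init__(self, units, num_threads, regs_per_thread=48):
        # instances[unit][thread][register] holds a 64-bit value.
        self.instances = {
            unit: [[0] * regs_per_thread for _ in range(num_threads)]
            for unit in units
        }

    def write(self, thread, reg, value):
        """A result from any functional unit is written to every instance."""
        value &= self.MASK_64
        for instance in self.instances.values():
            instance[thread][reg] = value

    def read(self, unit, thread, reg):
        """Each functional unit reads only from its own instance."""
        return self.instances[unit][thread][reg]


units = ["integer", "floating_point", "compare", "branch", "memory"]
rf = ReplicatedRegisterFile(units, num_threads=4)
rf.write(thread=2, reg=5, value=(1 << 64) | 7)  # truncated to 64 bits
values = [rf.read(unit, thread=2, reg=5) for unit in units]
```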
- FIGS. 21 and 22 are block diagrams of the integer unit 30 and the compare unit 34 , respectively.
- FIGS. 23A and 23B are block diagrams, respectively, of the floating point unit 32 and the fused multiply-add unit employed therein. The construction and operation of these units is evident from the components, interconnections and labelling supplied with the drawings.
- the processor module 5 provides memory instructions that permit this to be done easily, enabling threads to wait on the availability of data and transparently wake up when another thread indicates the data is available.
- Such software transparent consumer-producer memory operations enable higher performance fine grained thread level parallelism with an efficient data oriented, consumer-producer programming style.
- the illustrated embodiment provides a “Fill” memory instruction, which is used by a thread that is a data producer to load data into a selected memory location and to associate a state with that location, namely, the “full” state. If the location is already in that state when the instruction is executed, an exception is signalled.
- the embodiment also provides an “Empty” instruction, which is used by a data consumer to obtain data from a selected location. If the location is associated with the full state, the data is read from it (e.g., to a designated register) and the instruction causes the location to be associated with an “empty” state. Conversely, if the location is not associated with the full state at the time the Empty instruction is executed, the instruction causes the thread that executed it to temporarily transition to the idle (or, in an alternative embodiment, an active, non-executing) state, re-transitioning it back to the active, executing state (and executing the Empty instruction to completion) once the location becomes so associated.
- Using the Empty instruction enables a thread to execute when its data is available with low overhead and software transparency.
- the pending (memory) event table (PET) 50 stores status information regarding memory locations that are the subject of Fill and Empty operations. This includes the addresses of those locations, their respective full or empty states, and the identities of the “consumers” of data for those locations, i.e., the threads that have executed Empty instructions and are waiting for the locations to fill. It can also include the identities of the producers of the data, which can be useful, for example, in signalling and tracking causes of exceptions (e.g., as where two successive Fill instructions are executed for the same address, with no intervening Empty instructions).
- the data for the respective locations is not stored in the PET 50 but, rather, remains in the caches and/or memory system itself, just like data that is not the subject of Fill and/or Empty instructions.
- the status information is stored in the memory system, e.g., alongside the locations to which it pertains and/or in separate tables, linked lists, and so forth.
- When an Empty instruction is executed with respect to a memory location, the PET is checked to determine whether it has an entry indicating that same location is currently in the full state. If so, that entry is changed to empty and a read is effected, moving data from the memory location to the register designated by the Empty instruction.
- When a Fill instruction is executed, the PET is checked to determine whether it has an entry indicating that same location is currently in the empty state. Upon finding such an entry, its state is changed to full, and the event delivery mechanism 44 ( FIG. 4 ) is used to route a notification to the consumer-thread identified in that entry. If that thread is in an active, waiting state in a TPU, the notification goes to that TPU, which enters active, executing state and re-executes the Empty instruction, this time to completion (since the memory location is now in the full state). If that thread is in the idle state, the notification goes to the system thread (in whatever TPU it is currently executing), which causes the thread to be loaded into a TPU in the executing, active state so that the Empty instruction can be re-executed.
- this use of the PET for consumer/producer-like memory operations is only effected with respect to selected memory instructions, e.g., Fill and Empty, but not with the more conventional Load and Store memory instructions.
- Thus, for example, if a Load instruction is executed with respect to a memory location that is currently the subject of an Empty instruction, no notification is made to the thread that executed that Empty instruction so that the instruction can be re-executed.
- Other embodiments may vary in this regard.
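- The PET bookkeeping described above can be sketched as a small table keyed by address. The sketch is a software model under stated assumptions: the class, method and thread names are hypothetical, one waiting consumer is tracked per location, and a plain list of notifications stands in for the event delivery mechanism 44 . As in the description, the data itself is kept outside the table.

```python
class PendingEventTable:
    """Tracks full/empty state and waiting consumers for Fill/Empty locations."""

    def __init__(self):
        self.entries = {}        # address -> {"state", "consumer", "producer"}
        self.memory = {}         # data stays in the memory system, not in the table
        self.notifications = []  # (consumer thread, address) pairs to be delivered

    def _entry(self, address):
        return self.entries.setdefault(
            address, {"state": "empty", "consumer": None, "producer": None}
        )

    def fill(self, address, value, producer):
        """Fill: store the datum, mark the location full, notify a waiting consumer."""
        entry = self._entry(address)
        if entry["state"] == "full":
            raise RuntimeError(f"Fill on a full location by {producer}")
        self.memory[address] = value
        entry["state"] = "full"
        entry["producer"] = producer
        if entry["consumer"] is not None:
            self.notifications.append((entry["consumer"], address))

    def empty(self, address, consumer):
        """Empty: return (True, value) if full, else record the waiter and return (False, None)."""
        entry = self._entry(address)
        if entry["state"] == "full":
            entry["state"] = "empty"
            entry["consumer"] = None
            return True, self.memory[address]
        entry["consumer"] = consumer  # the thread must wait and re-execute later
        return False, None


pet = PendingEventTable()
first_try = pet.empty(0xA0, consumer="thread 232")   # location not yet full
pet.fill(0xA0, 7, producer="thread 230")             # queues a notification
second_try = pet.empty(0xA0, consumer="thread 232")  # re-executed to completion
```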
- FIG. 24A depicts three interdependent threads, 230 , 232 and 234 , the synchronization of and data transfer between which can be facilitated by Fill and Empty instructions according to the invention.
- thread 230 is an MPEG2 demultiplexing thread, responsible for demultiplexing an MPEG2 signal obtained, for example, from an MPEG2 source 236 , e.g., a tuner, a streaming source or otherwise. It is assumed to be in an active, executing state on TPU 10 , to continue the example.
- Thread 232 is a video decoding Step 1 thread, responsible for a first stage of decoding a video signal from a demultiplexed MPEG2 signal.
- Thread 234 is a video decoding Step 2 thread, responsible for a second stage of decoding a video signal from a demultiplexed MPEG2 signal for output via an LCD interface 238 or other device. It is assumed to be in an active, executing state on TPU 14 .
- each of the threads 230 - 234 continually processes data provided by its upstream source and does so in parallel with the other threads.
- FIG. 24B illustrates use of the Fill and Empty instructions to facilitate this in a manner which ensures synchronization and facilitates data transfer between the threads.
- arrows 240 a - 240 g indicate fill dependencies between the threads and, particularly, between data locations written to (filled) by one thread and read from (emptied) by another thread.
- thread 230 processes data destined for address A 0
- thread 232 executes an Empty instruction targeted to that location
- thread 234 executes an Empty instruction targeted to address B 0 (which thread 232 will ultimately Fill).
- thread 232 enters a wait state (e.g., active, non-executing or idle) while awaiting completion of the Fill of location A 0
- thread 234 enters a wait state while awaiting completion of the Fill of location B 0 .
- On completion of thread 230 's Fill of A 0 , thread 232 's Empty completes, allowing that thread to process the data from A 0 , with the result destined for B 0 via a Fill instruction. Thread 234 remains in a wait state, still awaiting completion of that Fill. In the meantime, thread 230 begins processing data destined for address A 1 and thread 232 executes the Empty instruction, placing it in a wait state while awaiting completion of the Fill of A 1 .
- When thread 232 executes the Fill for B 0 , thread 234 's Empty completes, allowing that thread to process the data from B 0 , with the result destined for C 0 , whence it is read by the LCD interface (not shown) for display to the TV viewer.
- the three threads 230 , 232 , 234 continue processing and executing Fill and Empty instructions in this manner, as illustrated in the drawing, until processing of the entire MPEG2 stream is completed.
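- The three-thread arrangement of FIGS. 24A-24B can be imitated in software with ordinary threads and a condition variable. The sketch below is an analogy, not the hardware mechanism: `FullEmptyCell` and `run_pipeline` are hypothetical names, the "processing" is a placeholder string operation, and the final reads of the C locations stand in for the LCD interface.

```python
import threading


class FullEmptyCell:
    """A memory location with an associated full/empty state (hypothetical helper)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def fill(self, value):
        """Producer side: store the value and mark the location full."""
        with self._cond:
            if self._full:
                raise RuntimeError("Fill executed on a location that is already full")
            self._value = value
            self._full = True
            self._cond.notify_all()

    def empty(self):
        """Consumer side: wait until full, then read and mark the location empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            return self._value


def run_pipeline(packets):
    """Run three stages in parallel, handing data over through A, B and C locations."""
    n = len(packets)
    a = [FullEmptyCell() for _ in range(n)]  # A0, A1, ... filled by the demultiplexer
    b = [FullEmptyCell() for _ in range(n)]  # B0, B1, ... filled by decoding step 1
    c = [FullEmptyCell() for _ in range(n)]  # C0, C1, ... filled by decoding step 2

    def demux():
        for i, packet in enumerate(packets):
            a[i].fill(f"demux({packet})")

    def decode_step1():
        for i in range(n):
            b[i].fill(f"step1({a[i].empty()})")

    def decode_step2():
        for i in range(n):
            c[i].fill(f"step2({b[i].empty()})")

    # Start the consumers first so that they wait on locations not yet filled.
    threads = [threading.Thread(target=f) for f in (decode_step2, decode_step1, demux)]
    for t in threads:
        t.start()
    frames = [c[i].empty() for i in range(n)]
    for t in threads:
        t.join()
    return frames


frames = run_pipeline(["p0", "p1"])
```

- Each stage blocks only on the location it needs next, so no explicit locks or queues appear in the stage bodies.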
- Empty instructs the memory system to check the state of the effective address. If the state is full, the Empty instruction changes the state to empty and loads the value into dreg. If the state is already empty, the instruction waits until the location is full, with the waiting behavior specified by the thread field.
- ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; if false, the instruction is not executed (no side effects).
- stop: 0 specifies that an instruction group is not delineated by this instruction; 1 specifies that an instruction group is delineated by this instruction.
- im: 0 specifies the index register (ireg) for address calculation; 1 specifies disp for address calculation.
- ireg: Specifies the index register of the instruction.
- breg: Specifies the base register of the instruction.
- disp: Specifies the two's complement displacement constant (8 bits) for memory reference instructions.
- dreg: Specifies the destination register of the instruction.
- Register s1reg is written to the word in memory at the effective address.
- the effective address is calculated by adding breg (base register) and either ireg (index register) or disp (displacement) based on the im (immediate memory) field.
- the state of the effective address is changed to full. If the state is already full an exception is signaled.
- Operands and Fields:
- ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; if false, the instruction is not executed (no side effects).
- stop: 0 specifies that an instruction group is not delineated by this instruction; 1 specifies that an instruction group is delineated by this instruction.
- im: 0 specifies the index register (ireg) for address calculation; 1 specifies disp for address calculation.
- ireg: Specifies the index register of the instruction.
- breg: Specifies the base register of the instruction.
- disp: Specifies the two's complement displacement constant (8 bits) for memory reference instructions.
- s1reg: Specifies the register that contains the first operand of the instruction.
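- The effective address calculation used by these memory instructions (base register plus either the index register or the 8-bit displacement, selected by im) can be written out as below. The function names are hypothetical, and the sketch assumes 64-bit wraparound and an unscaled displacement, neither of which is stated in the description.

```python
def sign_extend_8(disp):
    """Interpret the low 8 bits of disp as a two's complement value."""
    disp &= 0xFF
    return disp - 0x100 if disp & 0x80 else disp


def effective_address(breg_value, im, ireg_value=0, disp=0):
    """Add breg to ireg (im == 0) or to the sign-extended displacement (im == 1)."""
    offset = ireg_value if im == 0 else sign_extend_8(disp)
    # Assumption: addresses wrap at 64 bits.
    return (breg_value + offset) & 0xFFFFFFFFFFFFFFFF


indexed = effective_address(0x1000, im=0, ireg_value=0x20)
displaced = effective_address(0x1000, im=1, disp=0xF8)  # 0xF8 encodes -8
```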
- the EVENT instruction polls the event queue for the executing thread. If an event is present the instruction completes with the event status loaded into the exception status register. If no event is present in the event queue, the thread transitions to idle state.
- Operands and Fields:
- ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; if false, the instruction is not executed (no side effects).
- stop: 0 specifies that an instruction group is not delineated by this instruction; 1 specifies that an instruction group is delineated by this instruction.
- s1reg: Specifies the register that contains the first source operand of the instruction.
- the SWEvent instruction enqueues an event onto the Event Queue to be handled by a thread. See xxx for the event format.
- Operands and Fields:
- ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; if false, the instruction is not executed (no side effects).
- stop: 0 specifies that an instruction group is not delineated by this instruction; 1 specifies that an instruction group is delineated by this instruction.
- s1reg: Specifies the register that contains the first source operand of the instruction.
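- The interplay of the EVENT and SWEvent instructions can be modeled with a per-thread queue. The sketch is illustrative: `VirtualThread`, `event` and `sw_event` are hypothetical names, events are opaque values because the event format is not given here, and waking an idle thread is reduced to a direct state change rather than delivery through the event mechanism.

```python
from collections import deque


class VirtualThread:
    """Minimal thread model: a state, an event queue and an exception status register."""

    def __init__(self, name):
        self.name = name
        self.state = "executing"
        self.event_queue = deque()
        self.exception_status = None


def event(thread):
    """EVENT: poll the queue; load the status if an event is present, else go idle."""
    if thread.event_queue:
        thread.exception_status = thread.event_queue.popleft()
        return True
    thread.state = "idle"
    return False


def sw_event(target, ev):
    """SWEvent: enqueue an event for the target thread (assumed to wake it if idle)."""
    target.event_queue.append(ev)
    if target.state == "idle":
        target.state = "executing"


ui = VirtualThread("user interface")
polled_empty = event(ui)       # nothing queued, so the thread goes idle
state_after_poll = ui.state
sw_event(ui, "keypad")         # another thread posts a software event
polled_again = event(ui)       # this time the event status is loaded
```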
- the Control Field (CTLFLD) instruction modifies the control field specified by cfield. Other fields within the control register are unchanged.
- Operands and Fields:
- ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; if false, the instruction is not executed (no side effects).
- stop: 0 specifies that an instruction group is not delineated by this instruction; 1 specifies that an instruction group is delineated by this instruction.
- u: 0 specifies access to this thread's control registers; 1 specifies access to the control register of the thread specified by the IDr_indirect field (thread indirection) (privileged).
- cfield: cfield[4:0] control field privilege
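- Modifying one control field while leaving the rest of the control register unchanged amounts to a masked read-modify-write. The sketch below shows that operation in general form; the function name and the field positions and widths used in the example are hypothetical, since the actual control register layout is not given here.

```python
def modify_control_field(control_register, shift, width, value):
    """Replace the `width`-bit field at bit position `shift`; other bits are preserved."""
    mask = ((1 << width) - 1) << shift
    return (control_register & ~mask) | ((value << shift) & mask)


# Hypothetical layout: a 2-bit field at bits [3:2] and a 4-bit field at bits [7:4].
set_field = modify_control_field(0b11110000, shift=2, width=2, value=0b11)
cleared_field = modify_control_field(0xFF, shift=4, width=4, value=0)
```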
- FIG. 25 is a block diagram of a digital LCD-TV subsystem 242 according to the invention embodied in a SoC format.
- the subsystem 242 includes a processor module 5 constructed as described above and operated to simultaneously execute threads providing MPEG2 signal demultiplexing, MPEG2 video decoding, MPEG audio decoding, digital-TV user interface operation, and operating system execution (e.g., Linux), e.g., as described above.
- the module 5 is coupled to DDR DRAM flash memory comprising the off-chip portion of the L 2 cache 26 , also as discussed above.
- the module includes an interface (not shown) to an AMBA AHB bus 244 , via which it communicates with “intellectual property” or “IP” 246 providing interfaces to other components of the digital LCD-TV, namely, a video input interface, a video output interface, an audio output interface and an LCD interface.
- the module 5 communicates with optional IP via which the digital LCD-TV obtains source signals and/or is controlled, such as DMA engine 248 , high speed I/O device controller 250 and low speed device controllers 252 (via APB bridge 254 ) or otherwise.
- FIG. 26 is a block diagram of a digital LCD-TV or other application subsystem 256 according to the invention, again, embodied in a SoC format.
- the illustrated subsystem is configured as above, except insofar as it is depicted with APB and AHB/APB bridges and APB macros 258 in lieu of the specific IP 246 shown in FIG. 25 .
- elements 258 may comprise a video input interface, a video output interface, an audio output interface and an LCD interface, as in the implementation above, or otherwise.
- the illustrated subsystem further includes a plurality of modules 5 , e.g., from one to twenty such modules (or more) that are coupled via an interconnect that interfaces with and, preferably, forms part of the off-chip L 2 cache 26 b utilized by the modules 5 .
- That interconnect may be in the form of a ring interconnect (RI) comprising a shift register bus shared by the modules 5 and, more particularly, by the L 2 caches 26 .
- it may be an interconnect of another form, proprietary or otherwise, that facilitates the rapid movement of data within the combined memory system of the modules 5 .
- the L 2 caches are preferably coupled so that the L 2 cache for any one module 5 is not only the memory system for that individual processor but also contributes to a distributed, all-cache memory system for all of the processor modules 5 .
- the modules 5 do not have to physically share the same memory system, chips or buses and could, instead, be connected over a network or otherwise.
Description
- This application is a continuation of, and claims the benefit of priority of, co-pending U.S. patent application Ser. No. 10/449,732, filed May 30, 2003, and entitled “Virtual Processor Methods and Apparatus With Unified Event Notification and Consumer-Produced Memory Operations,” the teachings of which are incorporated herein by reference. The invention pertains to digital data processing and, more particularly, to virtual processor methods and apparatus with unified event notification and consumer-producer memory operations.
- There have been three broad phases to computing and applications evolution. First, there was the mainframe and minicomputer phase. This was followed by the personal computer phase. We are now in the embedded processor or “computers in disguise” phase.
- Increasingly, embedded processors are being used in digital televisions, digital video recorders, PDAs, mobile phones, and other appliances to support multi-media applications (including MPEG decoding and/or encoding), voice and/or graphical user interfaces, intelligent agents and other background tasks, and transparent internet, network, peer-to-peer (P2P) or other information access. Many of these applications require complex video, audio or other signal processing and must run in real-time, concurrently with one another.
- Prior art embedded application systems typically combine: (1) one or more general purpose processors, e.g., of the ARM, MIPS or x86 variety, for handling user interface processing, high level application processing, and operating system, with (2) one or more digital signal processors (DSPs) (including media processors) dedicated to handling specific types of arithmetic computations, at specific interfaces or within specific applications, on real time/low latency bases. Instead of or in addition to the DSPs, special-purpose hardware is often provided to handle dedicated needs that a DSP is unable to handle on a programmable basis, e.g., because the DSP cannot handle multiple activities at once or because the DSP cannot meet needs for a very specialized computational element.
- A problem with the prior art systems is hardware design complexity, combined with software complexity in programming and interfacing heterogeneous types of computing elements. The result often manifests itself in embedded processing subsystems that are underpowered computationally, but that are excessive in size, cost and/or electrical power requirements. Another problem is that both hardware and software must be re-engineered for every application. Moreover, prior art systems do not load balance; capacity cannot be transferred from one hardware element to another.
- An object of this invention is to provide improved apparatus and methods for digital data processing.
- A more particular object is to provide improved apparatus and methods that support applications that have high computational requirements, real-time application requirements, multi-media requirements, voice and graphical user interfaces, intelligence, background task support, interactivity, and/or transparent Internet, networking and/or P2P access support. A related object is to provide such improved apparatus and methods as support multiple applications having one or more of these requirements while executing concurrently with one another.
- A further object of the invention is to provide improved apparatus and methods for processing (embedded or otherwise) that meet the computational, size, power and cost requirements of today's and future appliances, including by way of non-limiting example, digital televisions, digital video recorders, video and/or audio players, PDAs, personal knowledge navigators, and mobile phones, to name but a few.
- Yet another object is to provide improved apparatus and methods that support a range of applications, including those that are inherently parallel.
- A further object is to provide improved apparatus and methods that support multi-media and user interface driven applications.
- Yet a still further object is to provide improved apparatus and methods for multi-tasking and multi-processing at one or more levels, including, for example, peer-to-peer multi-processing.
- Still yet another object is to provide such apparatus and methods which are low-cost, low-power and/or support robust rapid-to-market implementations.
- These and other objects are attained by the invention which provides, in one aspect, a virtual processor that includes one or more virtual processing units. These execute on one or more processors, and each executes one or more processes or threads (collectively, “threads”). While the threads may be constrained to executing throughout their respective lifetimes on the same virtual processing units, they need not be. An event delivery mechanism associates events with respective threads and notifies those threads when the events occur, regardless of which virtual processing unit and/or processor the threads happen to be executing on at the time.
- By way of example, an embedded virtual processor according to the invention for use in a digital LCD television comprises a processor module executing multiple virtual processing units, each processing a thread that handles a respective aspect of digital LCD-TV operation (e.g., MPEG demultiplexing, video decoding, user interface, operating system, and so forth). An event delivery mechanism associates hardware interrupts, software events (e.g., software-initiated events in the nature of interrupts) and memory events with those respective threads. When an event occurs, the event delivery mechanism delivers it to the appropriate thread, regardless of which virtual processing unit it is executing on at the time.
- Related aspects of the invention provide a virtual processor as described above in which selected threads respond to notifications from the event delivery mechanism by transitioning from a suspended state to an executing state. Continuing the above example, a user interface thread executing on a virtual processing unit in the digital LCD TV-embedded virtual processor may transition from waiting or idle to executing in response to a user keypad interrupt delivered by the event delivery mechanism.
- Still further related aspects of the invention provide a virtual processor as described above in which the event delivery mechanism notifies a system thread executing on one of the virtual processing units of an occurrence of an event associated with a thread that is not resident on a processing unit. The system thread can respond to such notification by transitioning a thread from a suspended state to an executing state.
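- The routing decision described in the preceding aspects (deliver to the thread itself if it is resident on a processing unit, otherwise to the system thread) can be sketched as a pair of table lookups. All names here are hypothetical, including the event and thread identifiers, and the tables are plain dictionaries rather than hardware structures.

```python
def deliver(event, event_to_thread, thread_to_tpu, system_thread):
    """Route an event to its thread, or to the system thread if that thread is not resident."""
    thread = event_to_thread[event]
    tpu = thread_to_tpu.get(thread)
    if tpu is not None:
        # The thread is resident: notify it on whichever unit it occupies.
        return {"notify": thread, "tpu": tpu}
    # Not resident: the system thread is asked to bring the thread back into execution.
    return {"notify": system_thread, "tpu": thread_to_tpu[system_thread], "wake": thread}


event_to_thread = {"keypad_irq": "ui", "tuner_irq": "demux"}
thread_to_tpu = {"system": 0, "demux": 1}  # the "ui" thread is not resident
resident_case = deliver("tuner_irq", event_to_thread, thread_to_tpu, "system")
idle_case = deliver("keypad_irq", event_to_thread, thread_to_tpu, "system")
```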
- Still other related aspects of invention provide a virtual processor as described above wherein at least selected active threads to respective such notifications concurrently with one another and/or without intervention of an operating system kernel.
- Yet further aspects of the invention provide a virtual processor as described above in which the event delivery mechanism includes a pending memory operation table that establishes associations between pending memory operations and respective threads that have suspended while awaiting completion of such operations. The event delivery mechanism signals a memory event to a thread for which all pending memory operations have completed. Related aspects of the invention provide such a virtual processor that includes an event-to-thread lookup table mapping at least hardware interrupts to threads.
- In still other aspects, the invention provides a virtual processor as described above wherein one or more threads execute an instruction for enqueuing a software event to the event queue. According to related aspects, that instruction specifies which thread is to be notified of the event.
- Other aspects of the invention provide a virtual processor as described above wherein at least one of the threads responds to a hardware interrupt by suspending execution of a current instruction sequence and executing an error handler. In a related aspect, the thread further responds to the hardware interrupt by at least temporarily disabling event notification during execution of the error handler. In a further related aspect, that thread responds to the hardware interrupt by suspending the current instruction sequence following execution of the error handler.
- In still other aspects, the invention provides digital data processors with improved data-flow-based synchronization. Such a digital data processor includes a plurality of processes and/or threads (again, collectively, “threads”), as well as a memory accessible by those threads. At least selected memory locations have an associated state and are capable of storing a datum for access by one or more of the threads. The states include at least a full state and an empty state. A selected thread executes a first memory instruction that references a selected memory location. If the selected location is associated with the empty state, the selected thread suspends until the selected location becomes associated with the full state.
- A related aspect of the invention provides an improved such digital data processor wherein, if the selected location is associated with the full state, execution of the first instruction causes a datum stored in the selected location to be read to the selected thread and causes the selected location to become associated with the empty state. According to a further related aspect of the invention, the plurality of executing threads are resident on one or more processing units and the suspended thread is made at least temporarily nonresident on those units.
- The invention provides, in further aspects, a digital data processor as described above wherein the selected or another thread executes a second memory instruction that references a selected memory location. If the selected location is associated with the empty state, execution of the second memory instruction causes a selected datum to be stored to the selected location and causes the selected location to become associated with the full state.
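- Purely as an illustration, the full/empty behavior described above can be modeled in software. The Python sketch below is not the hardware mechanism of the invention: the class name `FullEmptyCell` and the methods `consume` and `produce` are hypothetical, a condition variable stands in for thread suspension, and the choice to make a producer wait on a location that is already full is an assumption, since the text above only describes the case in which the location is empty.

```python
import threading


class FullEmptyCell:
    """Software model of one memory location with a full/empty state (hypothetical)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False   # assumption: a location starts in the empty state
        self._datum = None

    def consume(self):
        """Model of the first memory instruction: wait for full, read, mark empty."""
        with self._cond:
            while not self._full:      # the consumer suspends while the location is empty
                self._cond.wait()
            datum, self._datum = self._datum, None
            self._full = False
            self._cond.notify_all()
            return datum

    def produce(self, datum):
        """Model of the second memory instruction: store to an empty location, mark full."""
        with self._cond:
            while self._full:          # assumption: a producer waits on a full location
                self._cond.wait()
            self._datum = datum
            self._full = True
            self._cond.notify_all()


def run_demo():
    """Start a consumer first, then produce one datum; return what the consumer read."""
    cell = FullEmptyCell()
    received = []
    consumer = threading.Thread(target=lambda: received.append(cell.consume()))
    consumer.start()
    cell.produce(42)
    consumer.join()
    return received
```

In this model, calling `run_demo()` starts a consumer that blocks on the empty location until the producer stores a datum, mirroring the suspend-until-full behavior described in the text.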
- In still other aspects, the invention provides a virtual processor comprising a memory and one or more virtual processing units that execute threads which access that memory. A selected thread executes a first memory instruction directed to a location in the memory. If that location is associated with an empty state, execution of the instruction causes the thread to suspend until that location becomes associated with a full state.
- Related aspects of the invention provide a virtual processor as described above that additionally includes an event delivery mechanism as previously described.
- Further aspects of the invention provide digital LCD televisions, digital video recorders (DVR) and servers, MP3 servers, mobile phones, and/or other devices incorporating one or more virtual processors as described above. Related aspects of the invention provide such devices which incorporate processors with improved dataflow synchronization as described above.
- Yet further aspects of the invention provide methods paralleling the operation of the virtual processors, digital data processors and devices described above.
- These and other aspects of the invention are evident in the drawings and the description that follows.
- A more complete understanding of the invention may be attained by reference to the drawings, in which:
-
FIG. 1 depicts a processor module constructed and operated in accord with one practice of the invention; -
FIG. 2 contrasts thread processing by a conventional superscalar processor with that by a processor module constructed and operated in accord with one practice of the invention; -
FIG. 3 depicts potential states of a thread executing in a virtual processing unit (or thread processing unit (TPU)) in a processor constructed and operated in accord with one practice of the invention; -
FIG. 4 depicts an event delivery mechanism in a processor module constructed and operated in accord with one practice of the invention; -
FIG. 5 illustrates a mechanism for virtual address to system address translation in a system constructed and operated in accord with one practice of the invention; -
FIG. 6 depicts the organization of Level 1 and Level 2 caches in a system constructed and operated in accord with one practice of the invention; -
FIG. 7 depicts the L2 cache and the logic used to perform a tag lookup in a system constructed and operated in accord with one practice of the invention; -
FIG. 8 depicts logic used to perform a tag lookup in the L2 extended cache in a system constructed and operated in accord with one practice of the invention; -
FIG. 9 depicts general-purpose registers, predicate registers and thread state or control registers maintained for each thread processing unit (TPU) in a system constructed and operated in accord with one practice of the invention; -
FIG. 10 depicts a mechanism for fetching and dispatching instructions executed by the threads in a system constructed and operated in accord with one practice of the invention; -
FIGS. 11-12 illustrate a queue management mechanism used in a system constructed and operated in accord with one practice of the invention; -
FIG. 13 depicts a system-on-a-chip (SoC) implementation of the processor module of FIG. 1 including logic for implementing thread processing units in accord with one practice of the invention; -
FIG. 14 is a block diagram of a pipeline control unit in a system constructed and operated in accord with one practice of the invention; -
FIG. 15 is a block diagram of an individual unit queue in a system constructed and operated in accord with one practice of the invention; -
FIG. 16 is a block diagram of the branch unit in a system constructed and operated in accord with one practice of the invention; -
FIG. 17 is a block diagram of a memory unit in a system constructed and operated in accord with one practice of the invention; -
FIG. 18 is a block diagram of a cache unit implementing any of the L1 instruction cache or L1 data cache in a system constructed and operated in accord with one practice of the invention; -
FIG. 19 depicts an implementation of the L2 cache and logic of FIG. 7 in a system constructed and operated in accord with one practice of the invention; -
FIG. 20 depicts the implementation of the register file in a system constructed and operated in accord with one practice of the invention; -
FIGS. 21 and 22 are block diagrams of an integer unit and a compare unit in a system constructed and operated in accord with one practice of the invention; -
FIGS. 23A and 23B are block diagrams of a floating point unit in a system constructed and operated in accord with one practice of the invention; -
FIGS. 24A and 24B illustrate use of consumer and producer memory instructions in a system constructed and operated in accord with one practice of the invention; -
FIG. 25 is a block diagram of a digital LCD-TV subsystem in a system constructed and operated in accord with one practice of the invention; and -
FIG. 26 is a block diagram of a digital LCD-TV or other application subsystem in a system constructed and operated in accord with one practice of the invention. -
FIG. 1 depicts a processor module 5 constructed and operated in accord with one practice of the invention and referred to occasionally throughout this document and the attached drawings as “SEP”. The module can provide the foundation for a general purpose processor, such as a PC, workstation or mainframe computer, though the illustrated embodiment is utilized as an embedded processor.
- The module 5, which may be used singly or in combination with one or more other such modules, is suited inter alia for devices or systems whose computational requirements are parallel in nature and that benefit from multiple concurrently executing applications and/or instruction level parallelism. This can include devices or systems with real-time requirements, those that execute multi-media applications, and/or those with high computational requirements, such as image, signal, graphics and/or network processing. The module is also suited for integration of multiple applications on a single platform, e.g., where there is concurrent application use. It provides for seamless application execution across the devices and/or systems in which it is embedded or otherwise incorporated, as well as across the networks (wired, wireless, or otherwise) or other medium via which those devices and/or systems are coupled. Moreover, the module is suited for peer-to-peer (P2P) applications, as well as those with user interactivity. The foregoing is not intended to be an exhaustive listing of the applications and environments to which the module 5 is suited, but merely a set of examples.
- Examples of devices and systems in which the module 5 can be embedded include inter alia digital LCD-TVs, e.g., of the type shown in FIG. 24, wherein the module 5 is embodied in a system-on-a-chip (SoC) configuration. (Of course, it will be appreciated that the module need not be embodied on a single chip and, rather, can be embodied in any of a multitude of form factors, including multiple chips, one or more circuit boards, one or more separately-housed devices, and/or a combination of the foregoing.) Further examples include digital video recorders (DVR) and servers, MP3 servers, mobile phones, applications which integrate still and video cameras, game platforms, universal networked displays (e.g., combinations of digital LCD-TV, networked information/Internet appliance, and general-purpose application platform), G3 mobile phones, personal digital assistants, and so forth.
- The
module 5 includes thread processing units (TPUs) 10-20, level one (L1) instruction and data caches 22, 24, level two (L2) cache 26, pipeline control 28 and execution (or functional) units 30-38, namely, an integer processing unit, a floating-point processing unit, a compare unit, a memory unit, and a branch unit. The units 10-38 are coupled as shown in the drawing and more particularly detailed below.
- By way of overview, TPUs 10-20 are virtual processing units, physically implemented within processor module 5, that are each bound to and process one (or more) process(es) and/or thread(s) (collectively, thread(s)) at any given instant. The TPUs have respective per-thread state represented in general purpose registers, predicate registers and control registers. The TPUs share hardware, such as launch and pipeline control, which launches up to five instructions from any combination of threads each cycle. As shown in the drawing, the TPUs additionally share execution units 30-38, which independently execute launched instructions without the need to know what thread they are from.
- By way of further overview, illustrated L2 cache 26 is shared by all of the thread processing units 10-20 and stores instructions and data on storage both internal (local) and external to the chip on which the module 5 is embodied. Illustrated L1 instruction and data caches level 1 and level 2 caches may be configured differently, e.g., entirely local to the module 5, entirely external, or otherwise).
- The design of module 5 is scalable. Two or more modules 5 may be “ganged” in an SoC or other configuration, thereby increasing the number of active threads and overall processing power. Because of the threading model used by the module 5 and described herein, the resultant increase in TPUs is software transparent. Though the illustrated module 5 has six TPUs 10-20, other embodiments may have a greater number of TPUs (as well, of course, as a lesser number). Additional functional units, moreover, may be provided, for example, boosting the number of instructions launched per cycle from five to 10-15, or higher. As evident in the discussion below of L1 and L2 cache construction, these too may be scaled.
-
Illustrated module 5 utilizes Linux as an application software environment. In conjunction with multi-threading, this enables real-time and non-real-time applications to run on one platform. It also permits leveraging of open source software and applications to increase product functionality. Moreover, it enables execution of applications from a variety of providers. - Multi-Threading
- As noted above, TPUs 10-20 are virtual processing units, physically implemented within a
single processor module 5, that are each bound to and process one (or more) thread(s) at any given instant. The threads can embody a wide range of applications. Examples useful in digital LCD-TVs include MPEG2 signal demultiplexing, MPEG2 video decoding, MPEG audio decoding, digital-TV user interface operation, and operating system execution (e.g., Linux). Of course, these and/or other applications may be useful in digital LCD-TVs and the range of other devices and systems in which the module 5 may be embodied.
- The threads executed by the TPUs are independent but can communicate through memory and events. During each cycle of processor module 5, instructions are launched from as many active-executing threads as necessary to utilize the execution or functional units 30-38. In the illustrated embodiment, a round robin protocol is imposed in this regard to assure “fairness” to the respective threads (though, in other embodiments, priority or other protocols can be used instead or in addition). Although one or more system threads may be executing on the TPUs (e.g., to launch applications, facilitate thread activation, and so forth), no operating system intervention is required to execute active threads.
- The underlying rationales for supporting multiple active threads (virtual processors) per processor are:
- Functional Capability
- Multiple active threads per processor enables a single multi-threaded processor to replace multiple application, media, signal processing and network processors. It also enables multiple threads corresponding to application, image, signal processing and networking to operate and interoperate concurrently with low latency and high performance. Context switching and interfacing overhead is minimized. Even within a single image processing application, like MP4 decode, threads can easily operate simultaneously in a pipelined manner to for example prepare data for frame n+1 while frame n is being composed.
- Performance
- Multiple active threads per processor increases the performance of the individual processor by better utilizing functional units and tolerating memory and other event latency. It is not unusual to gain a 2× performance increase for supporting up to four simultaneously executing threads. Power consumption and die size increases are negligible so that performance per unit power and price performance are improved. Multiple active threads per processor also lowers the performance degradation due to branches and cache misses by having another thread execute during these events. Additionally, it eliminates most context switch overhead and lowers latency for real-time activities. Moreover, it supports a general, high performance event model.
- Implementation
- Multiple active threads per processor leads to simplification of the pipeline and overall design. There is no need for complex branch prediction, since another thread can run. It leads to lower cost of single processor chips vs. multiple processor chips, and to lower cost when other complexities are eliminated. Further, it improves performance per unit power.
-
FIG. 2 contrasts thread processing by a conventional superscalar processor with that of the illustrated processor module 5. Referring to FIG. 2A, in a superscalar processor, instructions from a single executing thread (indicated by diagonal stippling) are dynamically scheduled to execute on available execution units based on the actual parallelism and dependencies within the code being executed. This means that, on average, most execution units cannot be utilized during each cycle. As the number of execution units increases, the percentage utilization typically goes down. Execution units are also idle during memory system and branch prediction misses/waits.
- In contrast, referring to FIG. 2B, in the module 5, instructions from multiple threads (indicated by different respective stippling patterns) execute simultaneously. Each cycle, the module 5 schedules instructions from multiple threads to optimally utilize available execution unit resources. Thus the execution unit utilization and total performance are higher, while at the same time transparent to software.
- Events and Threads
- In the illustrated embodiment, events include hardware (or device) events, such as interrupts; software events, which are equivalent to device events but are initiated by software instructions; and memory events, such as completion of cache misses or resolution of memory producer-consumer (full-empty) transitions. Hardware interrupts are translated into device events which are typically handled by an idle thread (e.g., a targeted thread or a thread in a targeted group). Software events can be used, for example, to allow one thread to directly wake another thread.
- Each event binds to an active thread. If a specific thread binding doesn't exist, it binds to the default system thread which, in the illustrated embodiment, is always active. That thread then processes the event as appropriate, including scheduling a new thread on a virtual processor. If the specific thread binding does exist, upon delivery of a hardware or software event (as discussed below in connection with the event delivery mechanism), the targeted thread is transitioned from idle to executing. If the targeted thread is already active and executing, the event is directed to the default system thread for handling.
- In the illustrated embodiment, threads can become non-executing (block) due to: Memory system stall (short term blockage), including cache miss and waiting on synchronization; Branch misprediction (very short term blockage); Explicitly waiting for an event (either software or hardware generated); and System thread explicitly blocking application thread.
- In preferred embodiments of the invention, events operate across
physical processor modules 5 and networks, providing the basis for an efficient dynamic distributed execution environment. Thus, for example, a module 5 executing in a digital LCD-TV or other device or system can execute threads and utilize memory dynamically migrated over a network (wireless, wired or otherwise) or other medium from a server or other (remote) device. The thread and memory-based events, for example, assure that a thread can execute transparently on any module 5 operating in accord with the principles hereof. This enables, for example, mobile devices to leverage the power of other networked devices. It also permits transparent execution of peer-to-peer and multi-threaded applications on remote networked devices. Benefits include increased performance, increased functionality and lower power consumption.
- Threads run at two privilege levels, System and Application. A system thread can access all state of its own thread and of all other threads within the processor. An application thread can only access non-privileged state corresponding to itself. By default thread 0 runs at system privilege. Other threads can be configured for system privilege when they are created by a system privilege thread.
- Referring to FIG. 3, in the illustrated embodiment, thread states are:
- Idle (or Non-Active)
- Thread context is loaded into a TPU and thread is not executing instructions. An Idle thread transitions to Executing, e.g., when a hardware or software event occurs.
- Waiting (or Active, Waiting)
- Thread context is loaded into a TPU, but is currently not executing instructions. A Waiting thread transitions to Executing when an event it is waiting for occurs, e.g., a cache operation is completed that would allow the memory instruction to proceed.
- Executing (or Active, Executing)
- Thread context is loaded into a TPU and is currently executing instructions. A thread transitions to Waiting, e.g., when a memory instruction must wait for the cache to complete an operation, e.g., a cache miss or an Empty/Fill (producer-consumer memory) instruction that cannot be completed. A thread transitions to Idle when an event instruction is executed.
- A thread enable bit (or flag or other indicator) associated with each TPU disables thread execution without disturbing any thread state for software loading and unloading of a thread onto a TPU.
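- The three thread states and the transitions named above can be summarized as a small lookup table. The Python sketch below is illustrative only: the stimulus labels (`hw_or_sw_event`, `awaited_event`, `memory_stall`, `event_instruction`) are hypothetical names for the conditions described in the text, and treating any unlisted state/stimulus pair as "no change" is an assumption.

```python
from enum import Enum


class ThreadState(Enum):
    IDLE = "idle"            # context loaded, not executing instructions
    WAITING = "waiting"      # context loaded, blocked on an awaited event
    EXECUTING = "executing"  # context loaded, executing instructions


# (current state, stimulus) -> next state; stimulus names are hypothetical labels
TRANSITIONS = {
    (ThreadState.IDLE, "hw_or_sw_event"): ThreadState.EXECUTING,
    (ThreadState.WAITING, "awaited_event"): ThreadState.EXECUTING,
    (ThreadState.EXECUTING, "memory_stall"): ThreadState.WAITING,
    (ThreadState.EXECUTING, "event_instruction"): ThreadState.IDLE,
}


def next_state(state, stimulus):
    """Return the next state, or the unchanged state if no transition is listed."""
    return TRANSITIONS.get((state, stimulus), state)
```

For instance, `next_state(ThreadState.EXECUTING, "memory_stall")` yields `ThreadState.WAITING`, corresponding to a memory instruction that must wait on the cache.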
- The
processor module 5 load balances across active threads based on the availability of instructions to execute. The module also attempts to keep the instruction queues for each thread uniformly full. Thus, the threads that stay active the most will execute the most instructions. - Events
-
FIG. 4 shows an event delivery mechanism in a system according to one practice of the invention. When an event is signaled to a thread, the thread suspends execution (if currently in the Executing state) and recognizes the event by executing the default event handler, e.g., at virtual address 0x0.
- In the illustrated embodiment, there are five different event types that can be signaled to a specific thread:
Event | Description | Thread Delivery
Thread wait timeout | The timeout value from a wait instruction executed by thread n has expired. | thread n
Thread exception | Executing instruction has signaled exception. | thread n
HW Event | Event (like interrupt) generated by hardware device. | thread n as determined by event-to-thread lookup
SW Event | Event (like sw interrupt) signaled by sw event instruction. | Instruction specifies thread. If that thread is not Active, Waiting or Idle, delivered to default system thread.
Memory Event | All pending memory operations for a thread n have completed. | thread n
-
Bit | Field | Description
0 | priv | Privilege that the event will be signaled at: 0 = System privilege; 1 = Application privilege.
1 | how | Specifies how the event is signaled if the thread is not in idle state. If the thread is in idle state, this field is ignored and the event is directly signaled. 0 = Wait for thread in idle state (all events after this event in the queue wait also); 1 = Trap thread immediately.
15:4 | eventnum | Specifies the logical number for this event. The value of this field is captured in the detail field of the system exception status or application exception status register.
31:16 | threadnum | Specifies the logical thread number that this event is signaled to.
- Of course, it will be appreciated that the events presented by the hardware devices and software instructions may be presented in other forms and/or containing other information.
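- As a hedged illustration of the bit layout in the table above, the following Python sketch packs and unpacks a 32-bit event word. The function names `pack_event` and `unpack_event` are hypothetical, and bits 3:2, which the table does not describe, are simply left as zero here.

```python
def pack_event(priv, how, eventnum, threadnum):
    """Pack the four fields into a 32-bit word using the bit positions in the table."""
    if priv not in (0, 1) or how not in (0, 1):
        raise ValueError("priv and how are single-bit fields")
    if not 0 <= eventnum < (1 << 12):
        raise ValueError("eventnum occupies bits 15:4 (12 bits)")
    if not 0 <= threadnum < (1 << 16):
        raise ValueError("threadnum occupies bits 31:16 (16 bits)")
    # Bits 3:2 are not described in the table and are left as zero.
    return priv | (how << 1) | (eventnum << 4) | (threadnum << 16)


def unpack_event(word):
    """Extract the fields from a 32-bit event word."""
    return {
        "priv": word & 0x1,
        "how": (word >> 1) & 0x1,
        "eventnum": (word >> 4) & 0xFFF,
        "threadnum": (word >> 16) & 0xFFFF,
    }
```

For example, an application-privilege event number 5 directed at logical thread 3, with the "wait for idle" setting, would be encoded by `pack_event(1, 0, 5, 3)`.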
- The event tuples are, in turn, passed in the order received to the event-to-thread lookup table (also referred to as the event table or thread lookup table) 42, which determines which TPU is currently handling each indicated thread. The events are then presented, in the form of “TPU events” comprised of event numbers, to the TPUs (and, thereby, their respective threads) via the event-to-thread delivery mechanism 44. If no thread is yet instantiated to handle a particular event, the corresponding event is passed to a default system thread active on one of the TPUs.
- The
event queue 40 can be implemented in hardware, software and/or a combination thereof. In the embedded, system-on-a-chip (SoC) implementation represented by module 5, the queue is implemented as a series of gates and dedicated buffers providing the requisite queuing function. In alternate embodiments, it is implemented in software (or hardware) linked lists, arrays, or so forth.
- The table 42 establishes a mapping between an event number (e.g., hardware interrupt) presented by a hardware device or event instruction and the preferred thread to signal the event to. The possible cases are:
- No entry for event number: signal to default system thread.
- Present to thread: signal to specific thread number if thread is in Executing, Active or Idle, otherwise signal to specified system thread
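- The two lookup cases above can be sketched in software as follows. This Python fragment is an illustration, not the gate-level lookup of the illustrated embodiment: the function `route_event`, the dictionary-based table and the `resident_threads` set are hypothetical, and using thread 0 as the fallback system thread is an assumption made for the example.

```python
DEFAULT_SYSTEM_THREAD = 0  # assumption: thread 0 serves as the fallback system thread


def route_event(event_num, event_table, resident_threads,
                system_thread=DEFAULT_SYSTEM_THREAD):
    """Return the thread number that should be signaled for event_num.

    event_table maps event numbers to preferred thread numbers;
    resident_threads is the set of threads currently loaded on a TPU
    (that is, in the Executing, Waiting or Idle state).
    """
    target = event_table.get(event_num)
    if target is None:
        return system_thread      # no entry for the event number
    if target in resident_threads:
        return target             # preferred thread is loaded: signal it
    return system_thread          # preferred thread not loaded: fall back
```

With a table of `{7: 2}`, event 7 is routed to thread 2 only while thread 2 is resident; otherwise, and for any unmapped event number, the system thread receives it.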
- The table 42 may be a single storage area, dedicated or otherwise, that maintains an updated mapping of events to threads. The table may also constitute multiple storage areas, distributed or otherwise. Regardless, the table 42 may be implemented in hardware, software and/or a combination thereof. In the embedded, SoC implementation represented by module 5, the table is implemented by gates that perform “hardware” lookups on dedicated storage area(s) that maintain an updated mapping of events to threads. That table is software-accessible as well, for example, by system-level privilege threads which update the mappings as threads are newly loaded into the TPUs 10-20 and/or deactivated and unloaded from them. In alternate embodiments, the table 42 is implemented by a software-based lookup of the storage area that maintains the mapping.
- The event-to-thread delivery mechanism 44, too, may be implemented in hardware, software and/or a combination thereof. In the embedded, SoC implementation represented by module 5, the mechanism 44 is implemented by gates (and latches) that route the signaled events to TPU queues which, themselves, are implemented as a series of gates and dedicated buffers 46-48 for queuing to-be-delivered events. As above, in alternate embodiments, the mechanism 44 is implemented in software (or other hardware structures) providing the requisite functionality and, likewise, the queues 46-48 are implemented in software (or hardware) linked lists, arrays, or so forth.
- An outline of a procedure for processing hardware and software events (i.e., software-initiated signaling events or “software interrupts”) in the illustrated embodiment is as follows:
- 1. Event is signaled to the TPU that is currently executing the active thread.
- 2. That TPU suspends execution of the active thread. The Exception Status, Exception IP and Exception MemAddress control registers are set to indicate information corresponding to the event based on the type of event. All thread state is valid.
- 3. The TPU initiates execution at system privilege of the default event handler at virtual address 0x0 with event signaling disabled for the corresponding thread unit. GP registers 0-3 and predicate registers 0-1 are utilized as scratch registers by the event handlers and are system privilege. By convention GP[0] is the event processing stack pointer.
- 4. The event handler saves enough state so that it can make itself re-entrant and re-enable event signaling for the corresponding thread execution unit.
- 5. Event handler then processes the event, which could just be posting the event to a SW based queue or taking some other action.
- 6. The event handler then restores state and returns to execution of the original thread.
- Memory-related events are handled only somewhat differently. The Pending (Memory) Event Table (PET) 50 holds entries for memory operations (from memory reference instructions) which transition a thread from executing to waiting. The table 50, which may be implemented like the event-to-thread lookup table 42, holds the address of the pending memory operations, state information and the thread ID which initiated the reference. When a memory operation corresponding to an entry in the PET is completed and no other pending operations are in the PET for that thread, a PET event is signaled to the corresponding thread.
- An outline of memory event processing according to the illustrated embodiment is as follows:
- 1. Event is signaled to the unit that is currently executing the active thread.
- 2. If the thread is in active-wait state and the event is a Memory Event the thread transitions to active-executing and continues execution at the current IP. Otherwise the event is ignored.
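- The bookkeeping performed by the PET can be approximated in software as shown below. This Python sketch is a simplified model under stated assumptions: the class `PendingEventTable` and its methods are hypothetical, per-entry state information is omitted, and each (thread, address) pair is assumed to have at most one outstanding operation.

```python
class PendingEventTable:
    """Simplified software model of the pending memory event table (hypothetical)."""

    def __init__(self):
        # thread id -> set of addresses with outstanding memory operations
        self._pending = {}

    def add(self, thread_id, address):
        """Record a memory operation that moved thread_id from executing to waiting."""
        self._pending.setdefault(thread_id, set()).add(address)

    def complete(self, thread_id, address):
        """Retire one operation; return True if a memory event should be signaled.

        A memory event is due only when the completed operation was the last
        one pending for that thread.
        """
        ops = self._pending.get(thread_id)
        if ops is None or address not in ops:
            return False              # nothing recorded for this thread/address
        ops.discard(address)
        if ops:
            return False              # other operations are still pending
        del self._pending[thread_id]
        return True
```

A thread with two outstanding references therefore receives its memory event only when the second one completes, matching the "no other pending operations" condition in the text.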
- As further shown in the drawing, in the illustrated embodiment, thread wait timeouts and thread exceptions are signaled directly to the threads and are not passed through the event-to-thread delivery mechanism 44.
- Traps
- The goal of multi-threading and events is that normal program execution of a thread is not disturbed. The events and interrupts which occur get handled by the appropriate thread that was waiting for the event. There are cases where this is not possible and normal processing must be interrupted. SEP supports a trap mechanism for this purpose. A list of actions based on event types follows, with a full list of the traps enumerated in the System Exception Status Register.
Event Type | Thread State | Privilege Level | Action
Application event | Idle | Application or System | Recognize event by transitioning to execute state, application priv
System event | Idle | Application or System | System trap to recognize event, transition to execute state
Application event (wait for idle) | Waiting or executing | Application or System | Event stays queued until idle
Application event (trap if not idle) | Waiting or executing | Application or System | Application trap to recognize event
System event | Waiting or executing | Application | System trap to recognize event
System event (wait for idle) | Waiting or executing | System | Event stays queued until idle
System event (trap if not idle) | Waiting or executing | System | System trap to recognize event
Application Trap | Any | Application | Application trap
Application Trap | Any | System | System trap
System Trap | Any | Application | System trap, system privilege level
System Trap | Any | System | System trap
-
Illustrated processor module 5 takes the following actions when a trap occurs:
- 1. The IP (Instruction Pointer) specifying the next instruction to be executed is loaded in the Exception IP register.
- 2. The Privilege Level is stored into bit 0 of the Exception IP register.
- 3. The Exception type is loaded into the Exception State register.
- 4. If the exception is related to a memory unit instruction, the memory address corresponding to the exception is loaded into the Exception Memory Address register.
- 5. Current privilege level is set to system.
- 6. IP (Instruction Pointer) is cleared (zero).
- 7. Execution begins at IP 0.
- Virtual Memory and Memory System
- The illustrated
processor module 5 utilizes a virtual memory and memory system architecture having a 64-bit Virtual Address (VA) space, a 64-bit System Address (SA) (having different characteristics than a standard physical address), and a segment model of virtual address to system address translation with a sparsely filled VA or SA. - All memory accessed by the TPUs 10-20 is effectively managed as cache, even though off-chip memory may utilize DDR DRAM or other forms of dynamic memory. Referring back to
FIG. 1, in the illustrated embodiment, the memory system consists of two logical levels. The level 1 cache is divided into separate data and instruction caches, 24, 22, respectively, for optimal latency and bandwidth. Illustrated level 2 cache 26 consists of an on-chip portion and an off-chip portion referred to as level 2 extended. As a whole, the level 2 cache is the memory system for the individual SEP processor(s) 5 and contributes to a distributed “all cache” memory system in implementations where multiple SEP processors 5 are used. Of course, it will be appreciated that those multiple processors would not have to be physically sharing the same memory system, chips or buses and could, for example, be connected over a network or otherwise. -
FIG. 5 illustrates VA to SA translation used in the illustrated system, which translation is handled on a segment basis, where (in the illustrated embodiment) those segments can be of variable size, e.g., 221-248 bytes. The SAs are cached in the memory system. So an SA that is present in the memory system has an entry in one of the levels ofcache 22/24, 26. An SA that is not present in any cache (and the memory system) is effectively not present in the memory system. Thus, the memory system is filled sparsely at the page (and subpage) granularity in a way that is natural to software and OS, without the overhead of page tables on the processor. - In addition to the foregoing the virtual memory and memory system architecture of the illustrated embodiment has the following additional features: Direct support for distributed shared: Memory (DSM), Files (DSF), Objects (DSO), Peer to Peer (DSP2P); Scalable cache and memory system architecture; Segments that can be shared between threads;
Fast level 1 cache, since lookup is in parallel with tag access, with no complete virtual-to-physical address translation or complexity of virtual cache. - Virtual Memory Overview
- A virtual address in the illustrated system is the 64-bit address constructed by memory reference and branch instructions. The virtual address is translated on a per segment basis to a system address which is used to access all system memory and IO devices. Each segment can vary in size from 2^24 to 2^48 bytes. More specifically, referring to FIG. 5, the virtual address 50 is used to match an entry in a segment table 52 in the manner shown in the drawing. The matched entry 54 specifies the corresponding system address, when taken in combination with the components of the virtual address identified in the drawing. In addition, the matched entry 54 specifies the corresponding segment size and privilege. That system address, in turn, maps into the system memory, which in the illustrated embodiment comprises 2^64 bytes sparsely filled. The illustrated embodiment permits address translation to be disabled by threads with system privilege, in which case the segment table is bypassed and all addresses are truncated to the low 32 bits.
- Illustrated segment table 52 comprises 16-32 entries per thread (TPU). The table may be implemented in hardware, software and/or a combination thereof. In the embedded, SoC implementation represented by module 5, the table is implemented in hardware, with separate entries in memory being provided for each thread (e.g., a separate table per thread). A segment can be shared among two or more threads by setting up a separate entry for each thread that points to the same system address. Other hardware or software structures may be used instead, or in addition, for this purpose.
- Cache Memory System Overview
- As noted above, the Level 1 cache is organized as separate level 1 instruction cache 22 and level 1 data cache 24 to maximize instruction and data bandwidth.
- Referring to FIG. 6, the on-chip L2 cache 26 a consists of tag and data portions. In the illustrated embodiment, it is 0.5-1 Mbytes in size, with 128 blocks, 16-way associative. Each block stores 128 bytes of data or 16 extended L2 tags, with 64 kbytes provided to store the extended L2 tags. A tag-mode bit within the tag indicates that the data portion consists of 16 tags for Extended L2 Cache.
- The extended L2 cache 26 b is, as noted above, DDR DRAM-based, though other memory types can be employed. In the illustrated embodiment, it is up to 1 Gbyte in size, 256-way associative, with 16 kbyte pages and 128 byte subpages. For a configuration of 0.5 Mbyte L2 cache cache 26 b, only 12% of on-chip L2 cache is required to fully describe L2 extended. For larger on-chip L2 or smaller L2 extended sizes the percentage is lower. The aggregation of L2 caches (on-chip and extended) makes up the distributed SEP memory system.
- In the illustrated embodiment, both the L1 instruction cache 22 and L1 data cache 24 are 8-way associative with 32 bytes and 128 byte blocks. As shown in the drawing, both level 1 caches are proper subsets of the level 2 cache. The level 2 cache consists of an on-chip cache and an off-chip extended L2 cache.
FIG. 7 depicts the L2 cache 26 a and the logic used in the illustrated embodiment to perform a tag lookup in L2 cache 26 a to identify a data block 70 matching an L2 cache address 78. In the illustrated embodiment, that logic includes sixteen Cache Tag Array Groups 72 a-72 p, corresponding Tag Compare elements 74 a-74 p and corresponding Data Array Groups 76 a-76 p. These are coupled as indicated to match an L2 cache address 78 against the Group Tag Arrays 72 a-72 p, as shown, and to select the data block 70 identified by the indicated Data Array Group 76 a-76 p, again, as shown.
- The Cache Tag Array Groups 72 a-72 p, Tag Compare elements 74 a-74 p and corresponding Data Array Groups 76 a-76 p may be implemented in hardware, software and/or a combination thereof. In the embedded, SoC implementation represented by module 5, these are implemented as shown in FIG. 19, which shows the Cache Tag Array Groups 72 a-72 p embodied in 32×256 single port memory cells and the Data Array Groups 76 a-76 p embodied in 128×256 single port memory cells, all coupled with current state control logic 190 as shown. That element is, in turn, coupled to state machine 192, which facilitates operation of the L2 cache unit 26 a in a manner consistent herewith, as well as with a request queue 192 which buffers requests from the L1 instruction and data caches 22, 24.
- The logic element 190 is further coupled with DDR DRAM control interface 26 c, which provides an interface to the off-chip portion 26 b of the L2 cache. It is likewise coupled to AMBA interface 26 d providing an interface to AMBA-compatible components, such as liquid crystal displays (LCDs), audio out interfaces, video in interfaces, video out interfaces, network interfaces (wireless, wired or otherwise), storage device interfaces, peripheral interfaces (e.g., USB, USB2), bus interfaces (PCI, ATA), to name but a few. The DDR DRAM interface 26 c and AMBA interface 26 d are likewise coupled to an interface 196 to the L1 instruction and data caches by way of L2 data cache bus 198, as shown.
- FIG. 8 likewise depicts the logic used in the illustrated embodiment to perform a tag lookup in L2 extended cache 26 b and to identify a data block 80 matching the designated address 78. In the illustrated embodiment, that logic includes Data Array Groups 82 a-82 p, corresponding Tag Compare elements 84 a-84 p, and Tag Latch 86. These are coupled as indicated to match an L2 cache address 78 against the Data Array Groups 72 a-72 p, as shown, and to select a tag from one of those groups that matches the corresponding portion of the address 78, again, as shown. The physical page number from the matching tag is combined with the index portion of the address 78, as shown, to identify data block 80 in the off-chip memory 26 b.
- The Data Array Groups 82 a-82 p and Tag Compare elements 84 a-84 p may be implemented in hardware, software and/or a combination thereof. In the embedded, SoC implementation represented by module 5, these are implemented in gates and dedicated memory providing the requisite lookup and tag comparison functions. Other hardware or software structures may be used instead, or in addition, for this purpose.
- The following pseudo-code illustrates L2 and L2E cache operation in the illustrated embodiment:
L2 tag lookup, if hit
    respond back with data to L1 cache
else L2E tag lookup, if hit
    allocate tag in L2;
    access L2E data, store in corresponding L2 entry;
    respond back with data to L1 cache;
else extended L2E tag lookup
    allocate L2E tag;
    allocate tag in L2;
    access L2E data, store in corresponding L2 entry;
    respond back with data to L1 cache;
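The lookup order of the foregoing pseudo-code can be modeled in software as follows. This is a minimal sketch under stated assumptions, not the gate-level logic of FIGS. 7-8: the class `L2Hierarchy`, its dictionaries, and the `path` string it returns are hypothetical, and associativity, replacement and subpage handling are deliberately left out.

```python
class L2Hierarchy:
    """Toy model of the on-chip L2 and extended L2 (L2E) lookup order."""

    def __init__(self, l2e_data, l2e_tags=()):
        self.l2 = {}                    # on-chip L2: address -> data
        self.l2e_tags = set(l2e_tags)   # addresses with an allocated L2E tag
        self.l2e_data = dict(l2e_data)  # hypothetical off-chip contents

    def read(self, address):
        """Return (data, path), where path names the branch that was taken."""
        if address in self.l2:
            # L2 tag lookup hit: respond directly to the L1 cache.
            return self.l2[address], "L2 hit"
        if address in self.l2e_tags:
            path = "L2E hit"
        else:
            self.l2e_tags.add(address)     # allocate L2E tag
            path = "L2E miss"
        data = self.l2e_data.get(address)  # access L2E data
        self.l2[address] = data            # allocate tag in L2, store data
        return data, path
```

A second read of the same address takes the first branch, since the earlier access left the block in the on-chip L2.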
Thread Processing Unit State
- Referring to FIG. 9, the illustrated embodiment has six TPUs supporting up to six active threads. Each TPU 10-20 includes general-purpose registers, predicate registers, and control registers, as shown in FIG. 9. Threads at both system and application privilege levels contain identical state, although some thread state information is only visible when at system privilege level, as indicated by the key and respective stippling patterns. In addition to registers, each TPU additionally includes a pending memory event table, an event queue and an event-to-thread lookup table, none of which are shown in FIG. 9.
- Depending on the embodiment, there can be from 48 (or fewer) to 128 (or greater) general-purpose registers, with the illustrated embodiment having 128; 24 (or fewer) to 64 (or greater) predicate registers, with the illustrated embodiment having 32; six (or fewer) to 256 (or greater) active threads, with the illustrated embodiment having 8; a pending memory event table of 16 (or fewer) to 512 (or greater) entries, with the illustrated embodiment having 16; a number of pending memory events per thread, preferably of at least two (though potentially less); an event queue of 256 (or greater, or fewer); and an event-to-thread lookup table of 16 (or fewer) to 256 (or greater) entries, with the illustrated embodiment having 32.
- General Purpose Registers
- In the illustrated embodiment, each thread has up to 128 general purpose registers depending on the implementation. General Purpose registers 3-0 (GP[3:0]) are visible at system privilege level and can be utilized for event stack pointer and working registers during early stages of event processing.
- Predication Registers
- The predicate registers are part of the general purpose SEP predication mechanism. The execution of each instruction is conditional based on the value of the referenced predicate register.
- The SEP provides up to 64 one-bit predicate registers as part of thread state. Each predicate register holds what is called a predicate, which is set to 1 (true) or reset to 0 (false) based on the result of executing a compare instruction. Predicate registers 3-1 (PR[3:1]) are visible at system privilege level and can be utilized for working predicates during early stages of event processing.
Predicate register 0 is read only and always reads as 1 (true). It is used by instructions to make their execution unconditional.
- Control Registers
-
-
| Bit | Field | Description | Privilege | Per | Design Usage |
|---|---|---|---|---|---|
| 0 | strapen | System trap enable. On reset cleared. Signalling of system trap resets this bit and atrapen until it is set again by software when it is once again re-entrant. 0 - System traps disabled; 1 - Events enabled | system_rw | Thread | Branch |
| 1 | atrapen | Application trap enable. On reset cleared. Signalling of application trap resets this bit until it is set again by software when it is once again re-entrant. Application trap is caused by an event that is marked as application level when the privilege level is also application level. 0 - Events disabled (events are disabled on event delivery to thread); 1 - Events enabled | app_rw | Thread | |
| 2 | tenable | Thread Enable. On reset set for thread 0, cleared for all other threads. 0 - Thread operation is disabled. System thread can load or store thread state. 1 - Thread operation is enabled. | System_rw | Thread | Branch |
| 3 | priv | Privilege level. On reset cleared. 0 - System privilege; 1 - Application privilege | System_rw, app_r | Thread | Branch |
| 5:4 | state | Thread State. On reset set to “executing” for thread0, set to “idle” for all other threads. 0 - Idle; 1 - reserved; 2 - Waiting; 3 - Executing | System_rw | Thread | Branch |
| 15:8 | mod[7:0] | GP Registers Modified. Cleared on reset. bit 8 - registers 0-15; bit 9 - registers 16-31; bit 10 - registers 32-47; bit 11 - registers 48-63; bit 12 - registers 64-79; bit 13 - registers 80-95; bit 14 - registers 96-111; bit 15 - registers 112-127 | App_rw | Thread | Pipe |
| 16 | endian | Endian Mode. On reset cleared. 0 - little endian; 1 - big endian | System_rw, App_r | Proc | Mem |
| 17 | align | Alignment check. When clear, unaligned memory references are allowed. When set, all unaligned memory references result in unaligned data reference fault. On reset cleared. | System_rw | Proc | Mem |
| 18 | iaddr | Instruction address translation enable. On reset cleared. 0 - disabled; 1 - enabled | System_rw, App_r | Proc | Branch |
| 19 | daddr | Data address translation enable. On reset cleared. 0 - disabled; 1 - enabled | System_rw, App_r | Proc | Mem |
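The bit positions listed in the foregoing table lend themselves to a simple software decoder, e.g., for use in a simulator or debugger. The sketch below is illustrative only; the names `CONTROL_FIELDS`, `decode_control` and `modified_register_ranges` are hypothetical, and it assumes that each bit of the mod field covers sixteen consecutive general-purpose registers.

```python
# Field name -> (least significant bit, width), taken from the table above.
CONTROL_FIELDS = {
    "strapen": (0, 1),
    "atrapen": (1, 1),
    "tenable": (2, 1),
    "priv": (3, 1),
    "state": (4, 2),
    "mod": (8, 8),
    "endian": (16, 1),
    "align": (17, 1),
    "iaddr": (18, 1),
    "daddr": (19, 1),
}

# Encoding of the two-bit thread state field.
THREAD_STATES = {0: "idle", 1: "reserved", 2: "waiting", 3: "executing"}


def decode_control(value):
    """Split a control register value into its named fields."""
    return {name: (value >> lsb) & ((1 << width) - 1)
            for name, (lsb, width) in CONTROL_FIELDS.items()}


def modified_register_ranges(mod):
    """Assumes mod bit k covers GP registers 16*k through 16*k + 15."""
    return [(16 * k, 16 * k + 15) for k in range(8) if (mod >> k) & 1]
```

For instance, the reset state described for thread 0 (thread enabled, system privilege, state "executing") corresponds to the value 0x34 under this layout.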
-
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 7:0 | type | Processor type and revision [7:0] | read only | Proc |
| 15:8 | id | Processor ID[7:0] - Virtual processor number | read only | Thread |
| 31:16 | thread_id | Thread ID[15:0] | System_rw, App_ro | Thread |
- Specifies the 64-bit virtual address of the next instruction to be executed.
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 63:4 | Doubleword | Address of instruction doubleword | app | thread |
| 3:1 | mask | Indicates which instructions within instruction doubleword remain to be executed. Bit1 - first instruction, doubleword bit[40:00]; Bit2 - second instruction, doubleword bit[81:41]; Bit3 - third instruction, doubleword bit[122:82] | app | thread |
| 0 | | Always read as zero | app | thread |
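As an illustration of how the foregoing layout might be interpreted, the sketch below splits an instruction pointer value into the doubleword address and the set of instruction slots that remain to be executed, and extracts a slot from a 128-bit doubleword. It assumes that the mask occupies bits 3:1 (as the Bit1-Bit3 entries suggest) and that the address is the register value with its low four bits cleared; the function and constant names are hypothetical.

```python
# (high bit, low bit) of each instruction slot within the doubleword.
INSTRUCTION_BITS = ((40, 0), (81, 41), (122, 82))


def decode_ip(ip):
    """Split an IP value into (doubleword address, pending slot indices)."""
    address = ip & ~0xF  # bits 63:4; the low four bits hold the mask and bit 0
    pending = [slot for slot in range(3) if (ip >> (slot + 1)) & 1]
    return address, pending


def extract_instruction(doubleword, slot):
    """Return the 41-bit instruction held in the given slot (0, 1 or 2)."""
    high, low = INSTRUCTION_BITS[slot]
    return (doubleword >> low) & ((1 << (high - low + 1)) - 1)
```

Slot indices here are zero-based, so Bit1 of the mask corresponds to slot 0.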
-
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 3:0 | etype | Exception Type. 0 - none; 1 - event; 2 - timer event; 3 - SystemCall; 4 - Single Step; 5 - Protection Fault; 6 - Protection Fault, system call; 7 - Memory reference Fault; 8 - SW event; 9 - HW fault; 10 - others | read only | Thread |
| 15:4 | detail | Fault details. Valid for the following exception types: Memory reference fault details (type 5): 1 - None; 2 - waiting for fill; 3 - waiting for empty; 4 - waiting for completion of cache miss; 5 - memory reference error. Event (type 1): specifies the 12 bit event number | | |
-
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 3:0 | etype | Exception Type. 0 - none; 1 - event; 2 - timer event; 3 - SystemCall; 4 - Single Step; 5 - Protection Fault; 6 - Protection Fault, system call; 7 - Memory reference Fault; 8 - SW event; 9 - HW event; 10 - Others | read only | Thread |
| 15:4 | detail | Protection Fault details. Valid for the following exception types: Event (type 1): specifies the 12 bit event number | | |
- Address of instruction corresponding to signaled exception to system privilege. Bit[0] is the privilege level at the time of the exception.
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 63:4 | Doubleword | Address of instruction doubleword which signaled exception | system | thread |
| 3:1 | mask | Indicates which instructions within instruction doubleword remain to be executed. Bit1 - first instruction, doubleword bit[40:00]; Bit2 - second instruction, doubleword bit[81:41]; Bit3 - third instruction, doubleword bit[122:82] | system | thread |
| 0 | priv | Privilege level of thread at time of exception | system | thread |

- Address of instruction corresponding to signaled exception. Bit[0] is the privilege level at the time of the exception.
-
- Address of instruction corresponding to signaled exception to application privilege.
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 63:4 | Doubleword | Address of instruction doubleword which signaled exception | system | thread |
| 3:1 | mask | Indicates which instructions within instruction doubleword remain to be executed. Bit1 - first instruction, doubleword bit[40:00]; Bit2 - second instruction, doubleword bit[81:41]; Bit3 - third instruction, doubleword bit[122:82] | system | thread |

- Address of instruction corresponding to signaled exception. Bit[0] is the privilege level at the time of the exception.
-
- Address of memory reference that signaled exception. Valid only for memory faults. Holds the address of the pending memory operation when the Exception Status register indicates memory reference fault, waiting for fill or waiting for empty.
-
- Utilized by ISTE and DSTE registers to specify the STE and field that is read or written.

| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 0 | field | Specifies the low (0) or high (1) portion of Segment Table Entry | system | thread |
| 5:1 | ste number | Specifies the STE number that is read into STE Data Register. | system | thread |
- When read, the STE specified by the ISTE register is placed in the destination general register. When written, the STE specified by ISTE or DSTE is written from the general purpose source register. The format of the segment table entry is specified in Chapter 6, section titled Translation Table organization and entry description.
- Specifies the Instruction Cache Tag entry that is read or written by the ICTE or DCTE.
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 6:2 | bank | Specifies the bank that is read from Level1 Cache Tag Entry. The first implementation has valid banks 0x0-f. | system | thread |
| 13:7 | index | Specifies the index address within a bank that is read from Level1 Cache Tag Entry | System | thread |
- When read, the Cache Tag specified by the ICTP or DCTP register is placed in the destination general register. When written, the Cache Tag specified by ICTP or DCTP is written from the general purpose source register. The format of the cache tag entry is specified in Chapter 6, section titled Translation Table organization and entry description.
-
- The Event Queue Control Register (EQCR) enables normal and diagnostic access to the event queue. The sequence for using the register is a register write followed by a register read. The contents of the reg_op field specify the operation for the write and the next read. The actual register modification or read is triggered by the write.
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 1:0 | reg_op | Specifies the register operation for that write and the next read. Valid for register read. 0 - read; 1 - write; 2 - push onto queue; 3 - pop from queue | system | proc |
| 17:2 | event | For writes and push specifies the event number written or pushed onto the queue. For read and pop operations contains the event number read or popped from the queue | system | proc |
| 18 | empty | Indicates whether the queue was empty prior to the current operation. | system | proc |
| 31:19 | address | Specifies the address for read and write queue operations. Address field is don't care for push and pop operations. | system | proc |
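The write-then-read sequence described for the EQCR can be illustrated with a small software model. The sketch below packs and unpacks the fields at the bit positions listed above and models only the push and pop operations; the class `EventQueueModel`, the helper names, the FIFO ordering and the choice to return event number 0 when popping an empty queue are all assumptions made for the example, and the diagnostic read and write operations addressed by the address field are not modeled.

```python
from collections import deque

OP_READ, OP_WRITE, OP_PUSH, OP_POP = range(4)


def encode_eqcr(reg_op, event=0, address=0):
    """Pack reg_op (bits 1:0), event (17:2) and address (31:19)."""
    return (reg_op & 0x3) | ((event & 0xFFFF) << 2) | ((address & 0x1FFF) << 19)


def decode_eqcr(value):
    """Unpack an EQCR value into its four fields."""
    return {
        "reg_op": value & 0x3,
        "event": (value >> 2) & 0xFFFF,
        "empty": (value >> 18) & 0x1,
        "address": (value >> 19) & 0x1FFF,
    }


class EventQueueModel:
    """Hypothetical model: a write triggers the operation, a read returns it."""

    def __init__(self):
        self.queue = deque()
        self.result = 0

    def write(self, value):
        fields = decode_eqcr(value)
        was_empty = int(not self.queue)  # state prior to this operation
        event = 0
        if fields["reg_op"] == OP_PUSH:
            self.queue.append(fields["event"])
            event = fields["event"]
        elif fields["reg_op"] == OP_POP and self.queue:
            event = self.queue.popleft()
        self.result = encode_eqcr(fields["reg_op"], event) | (was_empty << 18)

    def read(self):
        return self.result
```

In use, software would call `write(encode_eqcr(OP_POP))` and then inspect `decode_eqcr(read())` for the popped event number and the empty indication.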
- The Event to Thread lookup table establishes a mapping between an event number presented by a hardware device or event instruction and the preferred thread to signal the event to. Each entry in the table specifies an event number with a bit mask and a corresponding thread that the event is mapped to.
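The masked matching described above can be sketched as follows, using the mask convention given in the register layout for this table (a mask bit of 0 marks the corresponding event bit as significant, 1 marks it as don't care). The sketch is illustrative only: the `Entry` tuple, the `lookup_thread` function, and the choice that the first matching entry wins and that an unmatched event yields `None` are assumptions, not details stated for the illustrated embodiment.

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    event: int   # 16-bit event number
    mask: int    # 16-bit mask: 0 = bit is significant, 1 = don't care
    thread: int  # thread the event is mapped to


def lookup_thread(table, event) -> Optional[int]:
    """Return the thread of the first entry whose significant bits match."""
    for entry in table:
        significant = ~entry.mask & 0xFFFF
        if (event ^ entry.event) & significant == 0:
            return entry.thread
    return None
```

With an entry such as `Entry(0x0010, 0x000F, 2)`, all sixteen event numbers 0x0010 through 0x001F map to thread 2.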
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 0 | reg_op | Specifies the register operation for that write and the next read. Valid for register read. 0 - read; 1 - write | system | proc |
| 16:1 | event[15:0] | For writes specifies the event number written at the specified table address. For read operations contains the event number at the specified table address | system | proc |
| 32:17 | mask[15:0] | Specifies whether the corresponding event bit is significant. 0 - significant; 1 - don't care | system | proc |
| 40:33 | thread | | | |
| 48:41 | address | Specifies the table address for read and write operations. Address field is don't care for push and pop operations. | system | proc |

- Timers and Performance Monitor
- In the illustrated embodiment, all timer and performance monitor registers are accessible at application privilege.
-
-
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 63:0 | clock | Number of clock cycles since processor reset | app | proc |
-
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 31:0 | count | Saturating count of the number of instructions executed. Cleared on read. Value of all 1's indicates that the count has overflowed. | app | thread |
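The saturating, cleared-on-read behavior described for this count can be modeled in a few lines. The following is a sketch for illustration; the class name `SaturatingCounter` and its methods are hypothetical.

```python
class SaturatingCounter:
    """32-bit counter that saturates at all 1's and is cleared when read."""

    MAX = 0xFFFFFFFF

    def __init__(self):
        self.value = 0

    def increment(self, amount=1):
        # Saturate rather than wrap, so all 1's signals an overflow.
        self.value = min(self.value + amount, self.MAX)

    def read(self):
        value, self.value = self.value, 0  # cleared on read
        return value
```

Software sampling such a counter periodically would treat a reading of 0xFFFFFFFF as "at least this many" rather than as an exact count.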
-
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 31:0 | active | Saturating count of the number of cycles the thread is in active-executing state. Cleared on read. Value of all 1's indicates that the count has overflowed. | app | thread |
-
| Bit | Field | Description | Privilege | Per |
|---|---|---|---|---|
| 31:0 | timeout | Count of the number of cycles remaining until a timeout event is signaled to thread. Decrements by one each clock cycle. | app | thread |

- Virtual Processor and Thread ID
- In the illustrated embodiment, each active thread corresponds to a virtual processor and is specified by an 8-bit active thread number (activethread[7:0]). The module 5 supports a 16-bit thread ID (threadid[15:0]) to enable rapid loading (activation) and unloading (de-activation) of threads. Other embodiments may support thread IDs of different sizes.
- Thread-Instruction Fetch Abstraction
- As noted above, the TPUs 10-20 of module 5 share L1 instruction cache 22, as well as pipeline control hardware that launches up to five instructions each cycle from any combination of the threads active in those TPUs. FIG. 10 is an abstraction of the mechanism employed by module 5 to fetch and dispatch those instructions for execution on functional units 30-38.
- As shown in that drawing, during each cycle, instructions are fetched from the L1 cache 22 and placed in instruction queues 10 a-20 a associated with each respective TPU 10-20. This is referred to as the fetch stage of the cycle. In the illustrated embodiment, three to six instructions are fetched for each single thread, with an overall goal of keeping thread queues 10 a-20 a at equal levels. In other embodiments, different numbers of instructions may be fetched and/or different goals set for relative filling of the queues. Also during the fetch stage, the module 5 (and, specifically, for example, the event handling mechanisms discussed above) recognizes events and transitions corresponding threads from waiting to executing.
- During the dispatch stage, which executes in parallel with the fetch and execute/retire stages, instructions from each of one or more executing threads are dispatched to the functional units 30-38 based on a round-robin protocol that takes into account best utilization of those resources for that cycle. These instructions can be from any combination of threads. The compiler specifies, e.g., utilizing “stop” flags provided in the instruction set, boundaries between groups of instructions within a thread that can be launched in a single cycle. In other embodiments, other protocols may be employed, e.g., ones that prioritize certain threads, ones that ignore resource utilization, and so forth.
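One cycle of such a round-robin dispatch can be sketched in software as shown below. This is a simplified illustration, not the protocol of the illustrated embodiment: it assumes one functional unit of each of five kinds, that a group delimited by a stop flag is launched as a whole or not at all, that the compiler never places two instructions for the same unit in one group, and that the starting thread simply rotates by one each cycle. The names `UNITS`, `next_group` and `dispatch_cycle` are hypothetical.

```python
from collections import deque

UNITS = ("integer", "float", "compare", "branch", "memory")


def next_group(queue):
    """Leading instructions of `queue` up to and including the first stop flag."""
    group = []
    for unit, stop in queue:
        group.append((unit, stop))
        if stop:
            break
    return group


def dispatch_cycle(queues, start):
    """Launch at most one group per thread, visiting threads from `start`.

    `queues` is a list of deques of (unit, stop) pairs, one per thread.
    Returns the (thread, unit) pairs launched and the next starting thread.
    """
    free = set(UNITS)
    launched = []
    count = len(queues)
    for offset in range(count):
        thread = (start + offset) % count
        group = next_group(queues[thread])
        needed = [unit for unit, _ in group]
        # Launch only if every unit the group needs is still free this cycle.
        if group and len(set(needed)) == len(needed) and set(needed) <= free:
            free -= set(needed)
            for _ in group:
                launched.append((thread, queues[thread].popleft()[0]))
    return launched, (start + 1) % count
```

A thread whose next group needs a unit already claimed in the cycle is skipped and retried in a later cycle, which is one simple way of taking resource utilization into account.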
- During the execute & retire phase—which executes in parallel with the fetch and dispatch stages—multiple instructions are executed from one or more threads simultaneously. As noted above, in the illustrated embodiment, up to five instructions are launched and executed each cycle, e.g., by the integer, floating, branch, compare and memory functional units 30-38. In other embodiments, greater or fewer instructions can be launched, for example, depending on the number and type of functional units and depending on the number of TPUs.
- An instruction is retired after execution if it completes: its result is written and the instruction is cleared from the instruction queue.
- On the other hand, if an instruction blocks, the corresponding thread is transitioned from executing to waiting. The blocked instruction and all instructions following it for the corresponding thread are subsequently restarted when the condition that caused the block is resolved. FIG. 11 illustrates a three-pointer queue management mechanism used in the illustrated embodiment to facilitate this.
- Referring to that drawing, an instruction queue and a set of three pointers is maintained for each TPU 10-20. Here, only a single such queue 110 and set of pointers 112-116 is shown. The queue 110 holds instructions fetched, executing and retired (or invalid) for the associated TPU and, more particularly, for the thread currently active in that TPU. As instructions are fetched, they are inserted at the queue's top, which is designated by the Insert (or Fetch) pointer 112. The next instruction for execution is identified by the Extract (or Issue) pointer 114. The Commit pointer 116 identifies the last instruction whose execution has been committed. When an instruction is blocked or otherwise aborted, the Commit pointer 116 is rolled back to quash instructions between Commit and Extract pointers in the execution pipeline. Conversely, when a branch is taken, the entire queue is flushed and the pointers reset.
- Though the queue 110 is shown as circular, it will be appreciated that other configurations may be utilized as well. The queuing mechanism depicted in FIG. 11 can be implemented, for example, as shown in FIG. 12. Instructions are stored in dual ported memory 120 or, alternatively, in a series of registers (not shown). The write address at which each newly fetched instruction is stored is supplied by Fetch pointer logic 122 that responds to a Fetch command (e.g., issued by the pipeline control) to generate successive addresses for the memory 120. Issued instructions are taken from the other port, here, shown at bottom. The read address from which each instruction is taken is supplied by Issue/Commit pointer logic 124. That logic responds to Commit and Issue commands (e.g., issued by the pipeline control) to generate successive addresses and/or to reset, as appropriate.
- Processor Module Implementation
-
FIG. 13 depicts an SoC implementation of the processor module 5 of FIG. 1 including, particularly, logic for implementing the TPUs 10-20. As in FIG. 1, the implementation of FIG. 13 includes L1 and L2 caches 22-26, which are constructed and operated as discussed above. Likewise, the implementation includes functional units 30-34 comprising an integer unit, a floating-point unit, and a compare unit, respectively. Additional functional units can be provided instead or in addition. Logic for implementing the TPUs 10-20 includes pipeline control 130, branch unit 38, memory unit 36, register file 136 and load-store buffer 138. The components shown in FIG. 13 are interconnected for control and information transfer as shown, with dashed lines indicating major control, thin solid lines indicating predicate value control, thicker solid lines identifying a 64-bit data bus and still thicker lines identifying a 128-bit data bus. It will be appreciated that FIG. 13 represents one implementation of a processor module 5 according to the invention and that other implementations may be realized as well.
- Pipeline Control Unit
- In the illustrated embodiment, pipeline control 130 contains the per-thread queues discussed above in connection with FIGS. 11-12. These can be parameterized at 12, 15 or 18 instructions per thread. The control 130 picks up instructions from those queues on a round robin basis (though, as also noted, this can be performed on other bases as well). It controls the sequence of accesses to the register file 136 (which is the resource which provides source and destination registers for the instructions), as well as to the functional units 30-38. The pipeline control 130 decodes basic instruction classes from the per-thread queues and dispatches instructions to the functional units 30-38. As noted above, multiple instructions from one or more threads can be scheduled for execution by those functional units in the same cycle. The control 130 is additionally responsible for signaling the branch unit 38 as it empties the per-thread instruction queues, and for idling the functional units when possible, e.g., on a cycle by cycle basis, to decrease power consumption.
FIG. 14 is a block diagram of the pipeline control unit 130. The unit includes control logic 140 for the thread class queues, the thread class (or per-thread) queues 142 themselves, an instruction dispatch 144, a longword decode unit 146, and functional unit queues 148 a-148 e, connected to one another (and to the other components of module 5) as shown in the drawing. The thread class (per-thread) queues are constructed and operated as discussed above in connection with FIGS. 11-12. The thread class queue control logic 140 controls the input side of those queues 142 and, hence, provides the Insert pointer functionality shown in FIGS. 11-12 and discussed above. The control logic 140 is also responsible for controlling the input side of the unit queues 148 a-148 e, and for interfacing with the branch unit 38 to control instruction fetching. In this latter regard, logic 140 is responsible for balancing instruction fetching in the manner discussed above (e.g., so as to compensate for those TPUs that are retiring the most instructions).
- The instruction dispatch 144 evaluates and determines, each cycle, the schedule of available instructions in each of the thread class queues. As noted above, in the illustrated embodiment the queues are handled on a round robin basis with account taken for queues that are retiring instructions more rapidly. The instruction dispatch 144 also controls the output side of the thread class queues 142. In this regard, it manages the Extract and Commit pointers discussed above in connection with FIGS. 11-12, including updating the Commit pointer when instructions have been retired and rolling that pointer back when an instruction is aborted (e.g., for thread switch or exception).
- The longword decode unit 146 decodes incoming instruction longwords from the L1 instruction cache 22. In the illustrated embodiment, each such longword is decoded into three instructions. This can be parameterized for decoding one or two longwords, which decode into three and six instructions, respectively. The decode unit 146 is also responsible for decoding the instruction class of each instruction.
- Unit queues 148 a-148 e queue actual instructions which are to be executed by the functional units 30-38. Each queue is organized on a per-thread basis and is kept consistent with the class queues. The unit queues are coupled to the thread class queue control 140 and to the instruction dispatch 144 for control purposes, as discussed above. Instructions from the queues 148 a-148 e are transferred to corresponding pipelines 150 a-150 e en route to the functional units themselves 30-38. The instructions are also passed to the register file pipeline 152.
FIG. 15 is a block diagram of an individual unit queue, e.g., 148 a. This includes one instruction queue 154 a-154 e for each TPU. These are coupled to the thread class queue control 140 (labeled tcqueue_ctl) and the instruction dispatch 144 (labelled idispatch) for control purposes. These are also coupled to the longword decode unit 146 (labeled Iwdecode) for instruction input and to a thread selection unit 156, as shown. That unit controls thread selection based on control signals provided byinstruction dispatch 144, as shown. Output from unit 156 is routed to the corresponding pipeline 150 a-150 e, as well as to theregister file pipeline 152. - Referring back to
FIG. 14 ,integer unit pipeline 150 a and floating-point unit pipeline 150 b decode appropriate instruction fields for their respective functional units. Each pipeline also times the commands to that respective functional units. Moreover, eachpipeline unit pipeline 150 c,branch unit pipeline 150 d, andmemory unit pipeline 150 e, provide like functionality for their respective functional units, compareunit 34,branch unit 38 andmemory unit 36. Register file pipeline 150 also provide like functionality with respect to registerfile 136. - Referring, now back to
FIG. 13 , illustratedbranch unit 38 is responsible for instruction address generation and address translation, as well as instruction fetching. In addition, it maintains state for the thread processing units 10-20.FIG. 16 is a block diagram of thebranch unit 38. It includescontrol logic 160, thread state stores 162 a-162 e, thread selector 164, address adder 166, segment translation content addressable memory (CAM) 168, connected to one another (and to the other components of module 5) as shown in the drawing. - The control logic drives 160
unit 38 based on a command signal from thepipeline control 130. It also takes as input theinstruction cache 22 state and theL2 cache 26 acknowledgement, as illustrated. Thelogic 160 outputs a thread switch to thepipeline control 130, as well as commands to theinstruction cache 22 and the L2 cache, as illustrated. The thread state stores 162 a-162 e store thread state for each of the respective TPUs 10-20. For each of those TPUs, it maintaines the general-purpose registers, predicate registers and control registers shown inFIG. 3 and discussed above. - Address information obtained from the thread state stores is routed to the thread selector, as shown, which selects the thread address from which and address computation is to be performed based on a control signal (as shown) from the
control 160. The address adder 166 increments the selected address or performs a branch address calculation, based on output of the thread selector 164 and addressing information supplied by the register file (labeled register source), as shown. In addition, the address adder 166 outputs a branch result. The newly computed address is routed to the segment translation memory 168, which operates as discussed above in connection with FIG. 5 and which generates a translated instruction cache address for use in connection with the next instruction fetch. - Functional Units
- Turning back to
FIG. 13, memory unit 36 is responsible for memory reference instruction execution, including data cache 24 address generation and address translation. In addition, unit 36 maintains the pending (memory) event table (PET) 50, discussed above. FIG. 17 is a block diagram of the memory unit 36. It includes control logic 170, address adder 172, and segment translation content addressable memory (CAM) 174, connected to one another (and to the other components of module 5) as shown in the drawing. - The control logic 170 drives
unit 36 based on a command signal from the pipeline control 130. It also takes as input the data cache 24 state and the L2 cache 26 acknowledgement, as illustrated. The logic 170 outputs a thread switch to the pipeline control 130 and branch unit 38, as well as commands to the data cache 24 and the L2 cache, as illustrated. The address adder 172 increments addressing information provided from the register file 136 or performs a requisite address calculation. The newly computed address is routed to the segment translation memory 174, which operates as discussed above in connection with FIG. 5 and which generates a translated data cache address for use in connection with a data access. Though not shown in the drawing, the unit 36 also includes the PET, as previously mentioned. -
FIG. 18 is a block diagram of a cache unit implementing any of the L1 instruction cache 22 or L2 data cache 24. The unit includes sixteen 128×256 byte single port memory cells 180 a-180 p serving as data arrays, along with sixteen corresponding 32×56 byte dual port memory cells 182 a-182 p serving as tag arrays. These are coupled to L1 and L2 address and data buses as shown. Control logic 184 and 186 are coupled to the memory cells and to L1 cache control and L2 cache control, also as shown. - Returning, again, to
FIG. 13, the register file 136 serves as the resource for all source and destination registers accessed by the instructions being executed by the functional units 30-38. The register file is implemented as shown in FIG. 20. As shown there, to reduce delay and wiring overhead, the unit 136 is decomposed into a separate register file instance per functional unit 30-38. In the illustrated embodiment, each instance provides forty-eight 64-bit registers for each of the TPUs. Other embodiments may vary, depending on the number of registers allotted the TPUs, the number of TPUs and the sizes of the registers. - Each
instance 200 a-200 e has five write ports, as illustrated by the arrows coming into the top of each instance, via which each of the functional units 30-38 can simultaneously write output data (thereby ensuring that the instances retain consistent data). Each provides a varying number of read ports, as illustrated by the arrows emanating from the bottom of each instance, via which their respective functional units obtain data. Thus, the instances associated with the integer unit 30, the floating point unit 32 and the memory unit all have three read ports, the instance associated with the compare unit 34 has two read ports, and the instance associated with the branch unit 38 has one port, as illustrated. - The register file instances 200 a-200 e can be optimized by having all ports read for a single thread each cycle. In addition, storage bits can be folded under wires to port access.
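The replicated organization described above can be illustrated with a small software model. The following Python sketch is an illustration only, under stated assumptions: the class name `ReplicatedRegisterFile`, the unit names and the method signatures are hypothetical, and the number of TPUs is left as a parameter. Only the register count (forty-eight 64-bit registers per TPU), the broadcast of every write to all instances and the per-instance read-port counts are taken from the description.

```python
# Hypothetical software model of the replicated register file; all names
# are illustrative and are not taken from the patent.
NUM_REGS = 48                 # registers per TPU in each instance
MASK64 = (1 << 64) - 1        # registers are 64 bits wide

# Read ports per instance, as described for the illustrated embodiment.
READ_PORTS = {"integer": 3, "float": 3, "memory": 3, "compare": 2, "branch": 1}


class ReplicatedRegisterFile:
    def __init__(self, num_tpus):
        # One full copy of every TPU's registers per functional unit.
        self.instances = {
            unit: [[0] * NUM_REGS for _ in range(num_tpus)]
            for unit in READ_PORTS
        }

    def write(self, tpu, reg, value):
        # A write from any functional unit is broadcast to every instance,
        # which is what keeps the copies consistent.
        for regs in self.instances.values():
            regs[tpu][reg] = value & MASK64

    def read(self, unit, tpu, regs):
        # A unit reads only from its own instance, limited by its port count.
        if len(regs) > READ_PORTS[unit]:
            raise ValueError(f"{unit} instance has {READ_PORTS[unit]} read ports")
        return [self.instances[unit][tpu][r] for r in regs]
```

In this model the cost of replication is storage (five copies of each register), while the benefit mirrors the one stated above: each functional unit reads locally without contending for a shared set of read ports.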
-
FIGS. 21 and 22 are block diagrams of the integer unit 30 and the compare unit 34, respectively. FIGS. 23A and 23B are block diagrams, respectively, of the floating point unit 32 and the fused multiply-add unit employed therein. The construction and operation of these units are evident from the components, interconnections and labeling supplied with the drawings. - Consumer-Producer Memory
- In prior art multiprocessor systems, the synchronization overhead and programming difficulty of implementing data-based processing flow between threads or processors (for multiple steps of image processing, for example) are very high. The
processor module 5 provides memory instructions that permit this to be done easily, enabling threads to wait on the availability of data and transparently wake up when another thread indicates the data is available. Such software transparent consumer-producer memory operations enable higher performance fine grained thread level parallelism with an efficient data oriented, consumer-producer programming style. - The illustrated embodiment provides a “Fill” memory instruction, which is used by a thread that is a data producer to load data into a selected memory location and to associate a state with that location, namely, the “full” state. If the location is already in that state when the instruction is executed, an exception is signalled.
- The embodiment also provides an “Empty” instruction, which is used by a data consumer to obtain data from a selected location. If the location is associated with the full state, the data is read from it (e.g., to a designated register) and the instruction causes the location to be associated with an “empty” state. Conversely, if the location is not associated with the full state at the time the Empty instruction is executed, the instruction causes the thread that executed it to temporarily transition to the idle (or, in an alternative embodiment, an active, non-executing) state. The thread is transitioned back to the active, executing state, and the Empty instruction is executed to completion, once the location becomes so associated. Using the Empty instruction enables a thread to execute when its data is available with low overhead and software transparency.
- In the illustrated embodiment, it is the pending (memory) event table (PET) 50 that stores status information regarding memory locations that are the subject of Fill and Empty operations. This includes the addresses of those locations, their respective full or empty states, and the identities of the “consumers” of data for those locations, i.e., the threads that have executed Empty instructions and are waiting for the locations to fill. It can also include the identities of the producers of the data, which can be useful, for example, in signalling and tracking the cause of exceptions (e.g., as where two successive Fill instructions are executed for the same address, with no intervening Empty instructions).
- The data for the respective locations is not stored in the
PET 50 but, rather, remains in the caches and/or memory system itself, just like data that is not the subject of Fill and/or Empty instructions. In other embodiments, the status information is stored in the memory system, e.g., alongside the locations to which it pertains and/or in separate tables, linked lists, and so forth. - Thus, for example, when an Empty instruction is executed on a given memory location, the PET is checked to determine whether it has an entry indicating that same location is currently in the full state. If so, that entry is changed to empty and a read is effected, moving data from the memory location to the register designated by the Empty instruction.
- If, on the other hand, when the Empty instruction is executed, there is no entry in the PET for the given memory location (or if any such entry indicates that the location is currently empty) then an entry is created (or updated) in the PET to indicate that the given location is empty and to indicate that the thread which executed the Empty instruction is a consumer for any data subsequently stored to that location by a Fill instruction.
- When a Fill instruction is subsequently executed (presumably, by another thread), the PET is checked to determine whether it has an entry indicating that same location is currently in the empty state. Upon finding such an entry, its state is changed to full, and the event delivery mechanism 44 (
FIG. 4) is used to route a notification to the consumer-thread identified in that entry. If that thread is in an active, waiting state in a TPU, the notification goes to that TPU, which enters the active, executing state and re-executes the Empty instruction, this time to completion (since the memory location is now in the full state). If that thread is in the idle state, the notification goes to the system thread (in whatever TPU it is currently executing), which causes the thread to be loaded into a TPU in the executing, active state so that the Empty instruction can be re-executed. - In the illustrated embodiment, this use of the PET for consumer/producer-like memory operations is only effected with respect to selected memory instructions, e.g., Fill and Empty, but not with the more conventional Load and Store memory instructions. Thus, for example, even if a Load instruction is executed with respect to a memory location that is currently the subject of an Empty instruction, no notification is made to the thread that executed that Empty instruction so that the instruction can be re-executed. Other embodiments may vary in this regard.
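The PET bookkeeping described above can be summarized in a short software model. The Python sketch below is a simplified illustration rather than the patented implementation: the names `PendingEventTable`, `FillError`, `fill`, `empty` and `notified` are hypothetical, a single waiting consumer per location is assumed, and the `notified` list merely stands in for the event delivery mechanism. A return value of `None` from `empty` represents the case in which the executing thread would enter a wait state.

```python
# Hypothetical model of the pending (memory) event table (PET). All class,
# method and field names are illustrative assumptions.
class FillError(Exception):
    """Raised when a Fill targets a location that is already full."""


class PendingEventTable:
    def __init__(self):
        self.memory = {}    # address -> data (data is not kept in the PET)
        self.entries = {}   # address -> state, waiting consumer, producer
        self.notified = []  # (thread, address) notifications delivered

    def fill(self, producer, address, value):
        entry = self.entries.get(address)
        if entry and entry["state"] == "full":
            raise FillError(f"location {address:#x} is already full")
        self.memory[address] = value
        consumer = entry["consumer"] if entry else None
        self.entries[address] = {"state": "full", "consumer": None,
                                 "producer": producer}
        if consumer is not None:
            # Stand-in for the event delivery mechanism waking the consumer.
            self.notified.append((consumer, address))

    def empty(self, consumer, address):
        entry = self.entries.get(address)
        if entry and entry["state"] == "full":
            entry["state"] = "empty"
            return self.memory[address]       # Empty completes
        # Not full: record the waiting consumer; the thread would now wait
        # and re-execute the Empty once notified.
        self.entries[address] = {"state": "empty", "consumer": consumer,
                                 "producer": None}
        return None
```

For example, an `empty` issued before the matching `fill` returns `None` and registers the consumer; the later `fill` records a notification, after which re-executing `empty` returns the data and leaves the location empty again.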
-
FIG. 24A depicts three interdependent threads, 230, 232 and 234, the synchronization of and data transfer between which can be facilitated by Fill and Empty instructions according to the invention. By way of example, thread 230 is an MPEG2 demultiplexing thread, responsible for demultiplexing an MPEG2 signal obtained, for example, from an MPEG2 source 236, e.g., a tuner, a streaming source or otherwise. It is assumed to be in an active, executing state on TPU 10, to continue the example. Thread 232 is a video decoding Step 1 thread, responsible for a first stage of decoding a video signal from a demultiplexed MPEG2 signal. It is assumed to be in an active, executing state on TPU 12. Thread 234 is a video decoding Step 2 thread, responsible for a second stage of decoding a video signal from a demultiplexed MPEG2 signal for output via an LCD interface 238 or other device. It is assumed to be in an active, executing state on TPU 14. - To accommodate data streaming from the source 236 in real-time, each of the threads 230-234 continually processes data provided by its upstream source and does so in parallel with the other threads.
FIG. 24B illustrates use of the Fill and Empty instructions to facilitate this in a manner which ensures synchronization and facilitates data transfer between the threads. - Referring to the drawing,
arrows 240 a-240 g indicate fill dependencies between the threads and, particularly, between data locations written to (filled) by one thread and read from (emptied) by another thread. Thus, thread 230 processes data destined for address A0, while thread 232 executes an Empty instruction targeted to that location and thread 234 executes an Empty instruction targeted to address B0 (which thread 232 will ultimately Fill). As a result of the Empty instructions, thread 232 enters a wait state (e.g., active, non-executing or idle) while awaiting completion of the Fill of location A0 and thread 234 enters a wait state while awaiting completion of the Fill of location B0. - On completion of
thread 230's Fill of A0, thread 232's Empty completes, allowing that thread to process the data from A0, with the result destined for B0 via a Fill instruction. Thread 234 remains in a wait state, still awaiting completion of that Fill. In the meanwhile, thread 230 begins processing data destined for address A1 and thread 232 executes the Empty instruction, placing it in a wait state while awaiting completion of the Fill of A1. - When
thread 232 executes the Fill demand for B0, thread 234's Empty completes, allowing that thread to process the data from B0, with the result destined for C0, whence it is read by the LCD interface (not shown) for display to the TV viewer. The three threads 230, 232 and 234 continue processing in this manner. - A further appreciation of the Fill and Empty instructions may be attained by review of their instruction formats.
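Before turning to those formats, the hand-off between the three threads described above can be mimicked in software. The Python sketch below is an analogy only: it uses operating-system threads and a condition variable in place of the hardware full/empty state and event delivery, and the names `FullEmptyCell` and `run_pipeline`, the tuple payloads and the `displayed` list (standing in for locations C0, C1 and so on) are hypothetical.

```python
import threading


class FullEmptyCell:
    """One memory location with a full/empty state (hypothetical helper)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def fill(self, value):
        with self._cond:
            if self._full:
                raise RuntimeError("Fill of a location that is already full")
            self._value, self._full = value, True
            self._cond.notify_all()      # wake any waiting consumer

    def empty(self):
        with self._cond:
            while not self._full:        # consumer waits for the Fill
                self._cond.wait()
            self._full = False
            return self._value


def run_pipeline(packets):
    n = len(packets)
    a = [FullEmptyCell() for _ in range(n)]   # A0, A1, ... (demux -> step 1)
    b = [FullEmptyCell() for _ in range(n)]   # B0, B1, ... (step 1 -> step 2)
    displayed = []                            # stands in for C0, C1, ...

    def demux():                              # role of thread 230
        for i, p in enumerate(packets):
            a[i].fill(("demuxed", p))

    def decode_step1():                       # role of thread 232
        for i in range(n):
            _, p = a[i].empty()
            b[i].fill(("step1", p))

    def decode_step2():                       # role of thread 234
        for i in range(n):
            _, p = b[i].empty()
            displayed.append(("step2", p))

    # Consumers are started first, so they typically wait before any Fill.
    threads = [threading.Thread(target=f)
               for f in (decode_step2, decode_step1, demux)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return displayed
```

No explicit locks or flags appear in the three worker functions; each stage simply empties its input location and fills its output location, which is the data-oriented programming style the Fill and Empty instructions are intended to support.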
- Empty
- Format: ps EMPTY.cache.threads dreg, breg, ireg {,stop}
- Description: Empty instructs the memory system to check the state of the effective address. If the state is full, the Empty instruction changes the state to empty and loads the value into dreg. If the state is already empty, the instruction waits until the state is full, with the waiting behavior specified by the thread field.
- Operands and Fields:
Field | Value | Description |
---|---|---|
ps | | The predicate source register that specifies whether the instruction is executed. If true the instruction is executed, else if false the instruction is not executed (no side effects). |
stop | 0 | Specifies that an instruction group is not delineated by this instruction. |
stop | 1 | Specifies that an instruction group is delineated by this instruction. |
thread | 0 | unconditional, no thread switch |
thread | 1 | unconditional thread switch |
thread | 2 | conditional thread switch on stall (block execution of thread) |
thread | 3 | reserved |
scache | 0 | tbd with reuse cache hint |
scache | 1 | read/write with reuse cache hint |
scache | 2 | tbd with no-reuse cache hint |
scache | 3 | read/write with no-reuse cache hint |
im | 0 | Specifies index register (ireg) for address calculation |
im | 1 | Specifies disp for address calculation |
ireg | | Specifies the index register of the instruction. |
breg | | Specifies the base register of the instruction. |
disp | | Specifies the two's complement displacement constant (8-bits) for memory reference instructions |
dreg | | Specifies the destination register of the instruction. |
Fill - Format: ps FILL.cache.threads s1reg, breg, ireg {,stop}
- Description: Register s1reg is written to the word in memory at the effective address. The effective address is calculated by adding breg (base register) and either ireg (index register) or disp (displacement) based on the im (immediate memory) field. The state of the effective address is changed to full. If the state is already full an exception is signaled.
- Operands and Fields:
Field | Value | Description |
---|---|---|
ps | | The predicate source register that specifies whether the instruction is executed. If true the instruction is executed, else if false the instruction is not executed (no side effects). |
stop | 0 | Specifies that an instruction group is not delineated by this instruction. |
stop | 1 | Specifies that an instruction group is delineated by this instruction. |
thread | 0 | unconditional, no thread switch |
thread | 1 | unconditional thread switch |
thread | 2 | conditional thread switch on stall (block execution of thread) |
thread | 3 | reserved |
scache | 0 | tbd with reuse cache hint |
scache | 1 | read/write with reuse cache hint |
scache | 2 | tbd with no-reuse cache hint |
scache | 3 | read/write with no-reuse cache hint |
im | 0 | Specifies index register (ireg) for address calculation |
im | 1 | Specifies disp for address calculation |
ireg | | Specifies the index register of the instruction. |
breg | | Specifies the base register of the instruction. |
disp | | Specifies the two's complement displacement constant (8-bits) for memory reference instructions |
s1reg | | Specifies the register that contains the first operand of the instruction. |
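The effective address calculation shared by the Fill and Empty formats can be illustrated as follows. This Python sketch is a minimal illustration under assumptions that the text does not confirm: the 8-bit displacement is taken to be sign-extended and unscaled, addresses are taken to wrap at 64 bits, and the function names `sign_extend8` and `effective_address` are hypothetical.

```python
def sign_extend8(disp):
    """Interpret the low 8 bits of disp as a two's complement value."""
    disp &= 0xFF
    return disp - 0x100 if disp & 0x80 else disp


def effective_address(breg, im, ireg=0, disp=0):
    """Return breg + ireg when im == 0, or breg + disp when im == 1."""
    if im == 0:
        offset = ireg                    # index register selected
    elif im == 1:
        offset = sign_extend8(disp)      # displacement constant selected
    else:
        raise ValueError("im is a one-bit field")
    # Assumed 64-bit wraparound; the text does not state the address width.
    return (breg + offset) & 0xFFFFFFFFFFFFFFFF
```

With a base of 0x1000, an index of 0x20 selects address 0x1020, while a displacement encoded as 0xFF (that is, -1) selects address 0xFFF.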
Software Events - A more complete understanding of the processing of hardware and software events may be attained by review of their instruction formats:
- Event
- Format: ps EVENT s1reg{,stop}
- Description: The EVENT instruction polls the event queue for the executing thread. If an event is present the instruction completes with the event status loaded into the exception status register. If no event is present in the event queue, the thread transitions to idle state.
- Operands and Fields:
Field | Value | Description |
---|---|---|
ps | | The predicate source register that specifies whether the instruction is executed. If true the instruction is executed, else if false the instruction is not executed (no side effects). |
stop | 0 | Specifies that an instruction group is not delineated by this instruction. |
stop | 1 | Specifies that an instruction group is delineated by this instruction. |
s1reg | | Specifies the register that contains the first source operand of the instruction. |
SW Event - Format: ps SWEVENT s1reg{,stop}
- Description: The SWEvent instruction enqueues an event onto the Event Queue to be handled by a thread. See xxx for the event format.
- Operands and Fields:
Field | Value | Description |
---|---|---|
ps | | The predicate source register that specifies whether the instruction is executed. If true the instruction is executed, else if false the instruction is not executed (no side effects). |
stop | 0 | Specifies that an instruction group is not delineated by this instruction. |
stop | 1 | Specifies that an instruction group is delineated by this instruction. |
s1reg | | Specifies the register that contains the first source operand of the instruction. |
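The interplay of the EVENT and SWEVENT instructions can be sketched with a simple per-thread queue. The following Python model is illustrative only: the class `ThreadEvents` and its attribute names are hypothetical, first-in first-out ordering is an assumption, and waking an idle thread directly on event arrival is a simplification of the event delivery mechanism described elsewhere in this document.

```python
from collections import deque


class ThreadEvents:
    """Per-thread event queue (hypothetical model of EVENT and SWEVENT)."""

    def __init__(self):
        self.queue = deque()
        self.state = "executing"
        self.exception_status = None

    def swevent(self, event):
        # SWEVENT: enqueue an event for this thread to handle.
        self.queue.append(event)
        if self.state == "idle":
            self.state = "executing"   # assumed wake-up on event arrival

    def event(self):
        # EVENT: poll the queue; complete if an event is present, else idle.
        if self.queue:
            self.exception_status = self.queue.popleft()
            return True
        self.state = "idle"
        return False
```

A thread that polls an empty queue goes idle; a subsequent `swevent` makes it runnable again, and the next `event` call completes with the event status available in `exception_status`.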
CTL FLD - Format: ps.CtlFld.ti cfield, {,stop}
- Description: The Control Field instruction modifies the control field specified by cfield. Other fields within the control register are unchanged.
- Operands and Fields:
Field | Value | Description |
---|---|---|
ps | | The predicate source register that specifies whether the instruction is executed. If true the instruction is executed, else if false the instruction is not executed (no side effects). |
stop | 0 | Specifies that an instruction group is not delineated by this instruction. |
stop | 1 | Specifies that an instruction group is delineated by this instruction. |
u | 0 | Specifies access to this thread's control registers. |
u | 1 | Specifies access to control register of thread specified by IDr_indirect field (thread indirection) (privileged). |
cfield | | Control field selector, encoded as cfield[4:0] in the following table. |

cfield[4:0] | Control field | Privilege |
---|---|---|
000nn | Thread state (nn value: 00 idle, 01 reserved, 10 waiting, 11 executing) | application |
0010S | System trap enable | system |
0011S | Application trap enable | application |
0100S | Thread Enable | system |
0101S | Privilege Level | system |
0110S | Registers Modified | application |
0111S | Instruction address translation enable | system |
1000S | Data address translation enable | system |
1001S | Alignment Check | system |
1010S | Endian Mode | system |
1011S | reserved | |
11**S | reserved | |
S = 0 clear, S = 1 set
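The cfield encoding can be made concrete with a small decoder. The Python sketch below simply transcribes the encoding listed above; the function name `decode_cfield`, the dictionary names and the shape of the returned tuple are hypothetical and chosen only for illustration.

```python
THREAD_STATES = {0b00: "idle", 0b01: "reserved",
                 0b10: "waiting", 0b11: "executing"}

# Upper four bits of cfield -> (control field, privilege), per the encoding.
CONTROL_FIELDS = {
    0b0010: ("System trap enable", "system"),
    0b0011: ("Application trap enable", "application"),
    0b0100: ("Thread enable", "system"),
    0b0101: ("Privilege level", "system"),
    0b0110: ("Registers modified", "application"),
    0b0111: ("Instruction address translation enable", "system"),
    0b1000: ("Data address translation enable", "system"),
    0b1001: ("Alignment check", "system"),
    0b1010: ("Endian mode", "system"),
}


def decode_cfield(cfield):
    """Decode a 5-bit cfield value into (field, value, privilege)."""
    if not 0 <= cfield < 32:
        raise ValueError("cfield is a 5-bit field")
    if cfield >> 2 == 0b000:
        # 000nn: the low two bits carry the new thread state.
        return ("Thread state", THREAD_STATES[cfield & 0b11], "application")
    selector, s = cfield >> 1, cfield & 1
    if selector in CONTROL_FIELDS:
        name, privilege = CONTROL_FIELDS[selector]
        return (name, "set" if s else "clear", privilege)
    return ("reserved", None, None)     # 1011S and 11**S
```

For instance, the value 0b00101 decodes as setting the system trap enable, and 0b00011 decodes as a transition of the thread state to executing.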
Devices Incorporating Processor Module 5 -
FIG. 25 is a block diagram of a digital LCD-TV subsystem 242 according to the invention embodied in a SoC format. The subsystem 242 includes a processor module 5 constructed as described above and operated to simultaneously execute threads providing MPEG2 signal demultiplexing, MPEG2 video decoding, MPEG audio decoding, digital-TV user interface operation, and operating system execution (e.g., Linux), e.g., as described above. The module 5 is coupled to DDR DRAM flash memory comprising the off-chip portion of the L2 cache 26, also as discussed above. The module includes an interface (not shown) to an AMBA AHB bus 244, via which it communicates with “intellectual property” or “IP” 246 providing interfaces to other components of the digital LCD-TV, namely, a video input interface, a video output interface, an audio output interface and an LCD interface. Of course, other IP may be provided in addition or instead, coupled to the module 5 via the AHB bus 244 or otherwise. For example, in the drawing, illustrated module 5 communicates with optional IP via which the digital LCD-TV obtains source signals and/or is controlled, such as DMA engine 248, high speed I/O device controller 250 and low speed device controllers 252 (via APB bridge 254) or otherwise. -
FIG. 26 is a block diagram of a digital LCD-TV or other application subsystem 256 according to the invention, again, embodied in a SoC format. The illustrated subsystem is configured as above, except insofar as it is depicted with APB and AHB/APB bridges and APB macros 258 in lieu of the specific IP 246 shown in FIG. 25. Depending on application needs, elements 258 may comprise a video input interface, a video output interface, an audio output interface and an LCD interface, as in the implementation above, or otherwise. - The illustrated subsystem further includes a plurality of
modules 5, e.g., from one to twenty such modules (or more) that are coupled via an interconnect that interfaces with and, preferably, forms part of the off-chip L2 cache 26 b utilized by the modules 5. That interconnect may be in the form of a ring interconnect (RI) comprising a shift register bus shared by the modules 5 and, more particularly, by the L2 caches 26. Alternatively, it may be an interconnect of another form, proprietary or otherwise, that facilitates the rapid movement of data within the combined memory system of the modules 5. Regardless, the L2 caches are preferably coupled so that the L2 cache for any one module 5 is not only the memory system for that individual processor but also contributes to a distributed all-cache memory system for all of the processor modules 5. Of course, as noted above, the modules 5 do not have to physically share the same memory system, chips or buses and could, instead, be connected over a network or otherwise. - Described above are apparatus, systems and methods meeting the desired objects. It will be appreciated that the embodiments described herein are examples of the invention and that other embodiments, incorporating changes therein, fall within the scope of the invention, of which we claim:
Claims (48)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/605,839 US8087034B2 (en) | 2003-05-30 | 2009-10-26 | Virtual processor methods and apparatus with unified event notification and consumer-produced memory operations |
US13/295,777 US8621487B2 (en) | 2003-05-30 | 2011-11-14 | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/449,732 US7653912B2 (en) | 2003-05-30 | 2003-05-30 | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
US12/605,839 US8087034B2 (en) | 2003-05-30 | 2009-10-26 | Virtual processor methods and apparatus with unified event notification and consumer-produced memory operations |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/449,732 Continuation US7653912B2 (en) | 2003-05-30 | 2003-05-30 | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
Related Child Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/295,777 Continuation US8621487B2 (en) | 2003-05-30 | 2011-11-14 | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
Publications (3)
Publication Number | Publication Date |
---|---|
US20100162028A1 US20100162028A1 (en) | 2010-06-24 |
US20110145626A2 US20110145626A2 (en) | 2011-06-16 |
US8087034B2 US8087034B2 (en) | 2011-12-27 |
Family
ID=33451855
Family Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US10/449,732 Active 2026-03-03 US7653912B2 (en) | 2003-05-30 | 2003-05-30 | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
US10/735,610 Active 2025-02-21 US7685607B2 (en) | 2003-05-30 | 2003-12-12 | General purpose embedded processor |
US12/605,839 Expired - Fee Related US8087034B2 (en) | 2003-05-30 | 2009-10-26 | Virtual processor methods and apparatus with unified event notification and consumer-produced memory operations |
US12/700,211 Expired - Fee Related US8271997B2 (en) | 2003-05-30 | 2010-02-04 | General purpose embedded processor |
US13/295,777 Expired - Lifetime US8621487B2 (en) | 2003-05-30 | 2011-11-14 | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
US13/614,011 Abandoned US20130185543A1 (en) | 2003-05-30 | 2012-09-13 | General purpose embedded processor |
Country Status (2)
Country | Link |
---|---|
US (6) | US7653912B2 (en) |
JP (2) | JP4870914B2 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100228954A1 (en) * | 2003-05-30 | 2010-09-09 | Steven Frank | General purpose embedded processor |
US20120272247A1 (en) * | 2011-04-22 | 2012-10-25 | Scott Steven L | Software emulation of massive hardware threading for tolerating remote memory references |
WO2013113206A1 (en) * | 2012-02-01 | 2013-08-08 | 中兴通讯股份有限公司 | Smart cache and smart terminal |
Families Citing this family (104)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7779165B2 (en) * | 2002-01-11 | 2010-08-17 | Oracle America, Inc. | Scalable method for producer and consumer elimination |
US7849465B2 (en) * | 2003-02-19 | 2010-12-07 | Intel Corporation | Programmable event driven yield mechanism which may activate service threads |
US7487502B2 (en) | 2003-02-19 | 2009-02-03 | Intel Corporation | Programmable event driven yield mechanism which may activate other threads |
US7669203B2 (en) * | 2003-12-19 | 2010-02-23 | Intel Corporation | Virtual multithreading translation mechanism including retrofit capability |
US7559065B1 (en) * | 2003-12-31 | 2009-07-07 | Emc Corporation | Methods and apparatus providing an event service infrastructure |
US8244891B2 (en) * | 2004-03-08 | 2012-08-14 | Ixia | Simulating a large number of users |
US8515741B2 (en) * | 2004-06-18 | 2013-08-20 | Broadcom Corporation | System (s), method (s) and apparatus for reducing on-chip memory requirements for audio decoding |
US7849441B2 (en) * | 2005-06-27 | 2010-12-07 | Ikoa Corporation | Method for specifying stateful, transaction-oriented systems for flexible mapping to structurally configurable, in-memory processing semiconductor device |
US8082423B2 (en) * | 2005-08-11 | 2011-12-20 | International Business Machines Corporation | Generating a flush vector from a first execution unit directly to every other execution unit of a plurality of execution units in order to block all register updates |
WO2007036072A1 (en) * | 2005-09-29 | 2007-04-05 | Intel Corporation | Apparatus and method for expedited virtual machine (vm) launch in vm cluster environment |
GB0605383D0 (en) * | 2006-03-17 | 2006-04-26 | Williams Paul N | Processing system |
US7802073B1 (en) | 2006-03-29 | 2010-09-21 | Oracle America, Inc. | Virtual core management |
US7954099B2 (en) | 2006-05-17 | 2011-05-31 | International Business Machines Corporation | Demultiplexing grouped events into virtual event queues while in two levels of virtualization |
EP2095226A1 (en) * | 2006-12-11 | 2009-09-02 | Nxp B.V. | Virtual functional units for vliw processors |
US8171270B2 (en) * | 2006-12-29 | 2012-05-01 | Intel Corporation | Asynchronous control transfer |
US8185722B2 (en) * | 2007-03-14 | 2012-05-22 | XMOS Ltd. | Processor instruction set for controlling threads to respond to events |
US9367321B2 (en) * | 2007-03-14 | 2016-06-14 | Xmos Limited | Processor instruction set for controlling an event source to generate events used to schedule threads |
US8219788B1 (en) * | 2007-07-23 | 2012-07-10 | Oracle America, Inc. | Virtual core management |
WO2009029549A2 (en) * | 2007-08-24 | 2009-03-05 | Virtualmetrix, Inc. | Method and apparatus for fine grain performance management of computer systems |
US8185894B1 (en) | 2008-01-10 | 2012-05-22 | Hewlett-Packard Development Company, L.P. | Training a virtual machine placement controller |
US8332847B1 (en) | 2008-01-10 | 2012-12-11 | Hewlett-Packard Development Company, L. P. | Validating manual virtual machine migration |
US8397216B2 (en) * | 2008-02-29 | 2013-03-12 | International Business Machines Corporation | Compiler for a declarative event-driven programming model |
US8627299B2 (en) * | 2008-02-29 | 2014-01-07 | International Business Machines Corporation | Virtual machine and programming language for event processing |
US8365149B2 (en) * | 2008-02-29 | 2013-01-29 | International Business Machines Corporation | Debugger for a declarative event-driven programming model |
JP5173714B2 (en) * | 2008-09-30 | 2013-04-03 | ルネサスエレクトロニクス株式会社 | Multi-thread processor and interrupt processing method thereof |
WO2010095182A1 (en) * | 2009-02-17 | 2010-08-26 | パナソニック株式会社 | Multithreaded processor and digital television system |
US8989705B1 (en) | 2009-06-18 | 2015-03-24 | Sprint Communications Company L.P. | Secure placement of centralized media controller application in mobile access terminal |
US10169072B2 (en) * | 2009-09-23 | 2019-01-01 | Nvidia Corporation | Hardware for parallel command list generation |
US8522000B2 (en) * | 2009-09-29 | 2013-08-27 | Nvidia Corporation | Trap handler architecture for a parallel processing unit |
US9165394B2 (en) * | 2009-10-13 | 2015-10-20 | Nvidia Corporation | Method and system for supporting GPU audio output on graphics processing unit |
US8312195B2 (en) * | 2010-02-18 | 2012-11-13 | Red Hat, Inc. | Managing interrupts using a preferred binding between a device generating interrupts and a CPU |
US8782653B2 (en) * | 2010-03-26 | 2014-07-15 | Virtualmetrix, Inc. | Fine grain performance resource management of computer systems |
US8677071B2 (en) * | 2010-03-26 | 2014-03-18 | Virtualmetrix, Inc. | Control of processor cache memory occupancy |
US20110276966A1 (en) * | 2010-05-06 | 2011-11-10 | Arm Limited | Managing task dependency within a data processing system |
US8423750B2 (en) * | 2010-05-12 | 2013-04-16 | International Business Machines Corporation | Hardware assist thread for increasing code parallelism |
US8694994B1 (en) * | 2011-09-07 | 2014-04-08 | Amazon Technologies, Inc. | Optimization of packet processing by delaying a processor from entering an idle state |
US8874786B2 (en) * | 2011-10-25 | 2014-10-28 | Dell Products L.P. | Network traffic control by association of network packets and processes |
GB2495959A (en) | 2011-10-26 | 2013-05-01 | Imagination Tech Ltd | Multi-threaded memory access processor |
US20130239124A1 (en) | 2012-01-20 | 2013-09-12 | Mentor Graphics Corporation | Event Queue Management For Embedded Systems |
EP2828741A4 (en) * | 2012-03-21 | 2016-06-29 | Nokia Technologies Oy | Method in a processor, an apparatus and a computer program product |
US8712407B1 (en) | 2012-04-05 | 2014-04-29 | Sprint Communications Company L.P. | Multiple secure elements in mobile electronic device with near field communication capability |
US8494576B1 (en) | 2012-05-03 | 2013-07-23 | Sprint Communications Company L.P. | Near field communication authentication and validation to access corporate data |
US8504097B1 (en) * | 2012-05-03 | 2013-08-06 | Sprint Communications Company L.P. | Alternative hardware and software configuration for near field communication |
US9027102B2 (en) | 2012-05-11 | 2015-05-05 | Sprint Communications Company L.P. | Web server bypass of backend process on near field communications and secure element chips |
US8862181B1 (en) | 2012-05-29 | 2014-10-14 | Sprint Communications Company L.P. | Electronic purchase transaction trust infrastructure |
US9282898B2 (en) | 2012-06-25 | 2016-03-15 | Sprint Communications Company L.P. | End-to-end trusted communications infrastructure |
US9066230B1 (en) | 2012-06-27 | 2015-06-23 | Sprint Communications Company L.P. | Trusted policy and charging enforcement function |
US8649770B1 (en) | 2012-07-02 | 2014-02-11 | Sprint Communications Company, L.P. | Extended trusted security zone radio modem |
US8667607B2 (en) | 2012-07-24 | 2014-03-04 | Sprint Communications Company L.P. | Trusted security zone access to peripheral devices |
US8863252B1 (en) | 2012-07-25 | 2014-10-14 | Sprint Communications Company L.P. | Trusted access to third party applications systems and methods |
US9183412B2 (en) | 2012-08-10 | 2015-11-10 | Sprint Communications Company L.P. | Systems and methods for provisioning and using multiple trusted security zones on an electronic device |
US8954588B1 (en) | 2012-08-25 | 2015-02-10 | Sprint Communications Company L.P. | Reservations in real-time brokering of digital content delivery |
US9215180B1 (en) | 2012-08-25 | 2015-12-15 | Sprint Communications Company L.P. | File retrieval in real-time brokering of digital content |
US9015068B1 (en) | 2012-08-25 | 2015-04-21 | Sprint Communications Company L.P. | Framework for real-time brokering of digital content delivery |
US8752140B1 (en) | 2012-09-11 | 2014-06-10 | Sprint Communications Company L.P. | System and methods for trusted internet domain networking |
US9122786B2 (en) * | 2012-09-14 | 2015-09-01 | Software Ag | Systems and/or methods for statistical online analysis of large and potentially heterogeneous data sets |
US9710874B2 (en) * | 2012-12-27 | 2017-07-18 | Nvidia Corporation | Mid-primitive graphics execution preemption |
US9578664B1 (en) | 2013-02-07 | 2017-02-21 | Sprint Communications Company L.P. | Trusted signaling in 3GPP interfaces in a network function virtualization wireless communication system |
US9161227B1 (en) | 2013-02-07 | 2015-10-13 | Sprint Communications Company L.P. | Trusted signaling in long term evolution (LTE) 4G wireless communication |
US9317291B2 (en) | 2013-02-14 | 2016-04-19 | International Business Machines Corporation | Local instruction loop buffer utilizing execution unit register file |
US9104840B1 (en) | 2013-03-05 | 2015-08-11 | Sprint Communications Company L.P. | Trusted security zone watermark |
US9613208B1 (en) | 2013-03-13 | 2017-04-04 | Sprint Communications Company L.P. | Trusted security zone enhanced with trusted hardware drivers |
US8881977B1 (en) | 2013-03-13 | 2014-11-11 | Sprint Communications Company L.P. | Point-of-sale and automated teller machine transactions using trusted mobile access device |
US9049186B1 (en) | 2013-03-14 | 2015-06-02 | Sprint Communications Company L.P. | Trusted security zone re-provisioning and re-use capability for refurbished mobile devices |
US9049013B2 (en) | 2013-03-14 | 2015-06-02 | Sprint Communications Company L.P. | Trusted security zone containers for the protection and confidentiality of trusted service manager data |
US9374363B1 (en) | 2013-03-15 | 2016-06-21 | Sprint Communications Company L.P. | Restricting access of a portable communication device to confidential data or applications via a remote network based on event triggers generated by the portable communication device |
US8984592B1 (en) | 2013-03-15 | 2015-03-17 | Sprint Communications Company L.P. | Enablement of a trusted security zone authentication for remote mobile device management systems and methods |
US9021585B1 (en) | 2013-03-15 | 2015-04-28 | Sprint Communications Company L.P. | JTAG fuse vulnerability determination and protection using a trusted execution environment |
US9191388B1 (en) | 2013-03-15 | 2015-11-17 | Sprint Communications Company L.P. | Trusted security zone communication addressing on an electronic device |
US9171243B1 (en) | 2013-04-04 | 2015-10-27 | Sprint Communications Company L.P. | System for managing a digest of biographical information stored in a radio frequency identity chip coupled to a mobile communication device |
US9454723B1 (en) | 2013-04-04 | 2016-09-27 | Sprint Communications Company L.P. | Radio frequency identity (RFID) chip electrically and communicatively coupled to motherboard of mobile communication device |
US9324016B1 (en) | 2013-04-04 | 2016-04-26 | Sprint Communications Company L.P. | Digest of biographical information for an electronic device with static and dynamic portions |
US9838869B1 (en) | 2013-04-10 | 2017-12-05 | Sprint Communications Company L.P. | Delivering digital content to a mobile device via a digital rights clearing house |
US9443088B1 (en) | 2013-04-15 | 2016-09-13 | Sprint Communications Company L.P. | Protection for multimedia files pre-downloaded to a mobile device |
US9069952B1 (en) | 2013-05-20 | 2015-06-30 | Sprint Communications Company L.P. | Method for enabling hardware assisted operating system region for safe execution of untrusted code using trusted transitional memory |
US9560519B1 (en) | 2013-06-06 | 2017-01-31 | Sprint Communications Company L.P. | Mobile communication device profound identity brokering framework |
US9454482B2 (en) | 2013-06-27 | 2016-09-27 | Apple Inc. | Duplicate tag structure employing single-port tag RAM and dual-port state RAM |
US9183606B1 (en) | 2013-07-10 | 2015-11-10 | Sprint Communications Company L.P. | Trusted processing location within a graphics processing unit |
US9274591B2 (en) | 2013-07-22 | 2016-03-01 | Globalfoundries Inc. | General purpose processing unit with low power digital signal processing (DSP) mode |
US9208339B1 (en) | 2013-08-12 | 2015-12-08 | Sprint Communications Company L.P. | Verifying Applications in Virtual Environments Using a Trusted Security Zone |
US9185626B1 (en) | 2013-10-29 | 2015-11-10 | Sprint Communications Company L.P. | Secure peer-to-peer call forking facilitated by trusted 3rd party voice server provisioning |
US9191522B1 (en) | 2013-11-08 | 2015-11-17 | Sprint Communications Company L.P. | Billing varied service based on tier |
US9161325B1 (en) | 2013-11-20 | 2015-10-13 | Sprint Communications Company L.P. | Subscriber identity module virtualization |
WO2015097494A1 (en) * | 2013-12-23 | 2015-07-02 | Intel Corporation | Instruction and logic for identifying instructions for retirement in a multi-strand out-of-order processor |
US9118655B1 (en) | 2014-01-24 | 2015-08-25 | Sprint Communications Company L.P. | Trusted display and transmission of digital ticket documentation |
JP6176166B2 (en) * | 2014-03-25 | 2017-08-09 | 株式会社デンソー | Data processing device |
US9226145B1 (en) | 2014-03-28 | 2015-12-29 | Sprint Communications Company L.P. | Verification of mobile device integrity during activation |
JP2014211890A (en) * | 2014-06-25 | 2014-11-13 | ルネサスエレクトロニクス株式会社 | Multi-thread processor and interrupt processing method of the same |
US9230085B1 (en) | 2014-07-29 | 2016-01-05 | Sprint Communications Company L.P. | Network based temporary trust extension to a remote or mobile device enabled via specialized cloud services |
US9692813B2 (en) * | 2014-08-08 | 2017-06-27 | Sas Institute Inc. | Dynamic assignment of transfers of blocks of data |
US9779232B1 (en) | 2015-01-14 | 2017-10-03 | Sprint Communications Company L.P. | Trusted code generation and verification to prevent fraud from maleficent external devices that capture data |
US9838868B1 (en) | 2015-01-26 | 2017-12-05 | Sprint Communications Company L.P. | Mated universal serial bus (USB) wireless dongles configured with destination addresses |
US9582312B1 (en) | 2015-02-04 | 2017-02-28 | Amazon Technologies, Inc. | Execution context trace for asynchronous tasks |
US9473945B1 (en) | 2015-04-07 | 2016-10-18 | Sprint Communications Company L.P. | Infrastructure for secure short message transmission |
US9819679B1 (en) | 2015-09-14 | 2017-11-14 | Sprint Communications Company L.P. | Hardware assisted provenance proof of named data networking associated to device data, addresses, services, and servers |
US20170116154A1 (en) * | 2015-10-23 | 2017-04-27 | The Intellisis Corporation | Register communication in a network-on-a-chip architecture |
US10282719B1 (en) | 2015-11-12 | 2019-05-07 | Sprint Communications Company L.P. | Secure and trusted device-based billing and charging process using privilege for network proxy authentication and audit |
US9678901B2 (en) | 2015-11-16 | 2017-06-13 | International Business Machines Corporation | Techniques for indicating a preferred virtual processor thread to service an interrupt in a data processing system |
US9817992B1 (en) | 2015-11-20 | 2017-11-14 | Sprint Communications Company L.P. | System and method for secure USIM wireless network access |
US10248593B2 (en) | 2017-06-04 | 2019-04-02 | International Business Machines Corporation | Techniques for handling interrupts in a processing unit using interrupt request queues |
US10210112B2 (en) | 2017-06-06 | 2019-02-19 | International Business Machines Corporation | Techniques for issuing interrupts in a data processing system with multiple scopes |
US10499249B1 (en) | 2017-07-11 | 2019-12-03 | Sprint Communications Company L.P. | Data link layer trust signaling in communication network |
US11366690B2 (en) * | 2019-12-02 | 2022-06-21 | Alibaba Group Holding Limited | Scheduling commands in a virtual computing environment |
CN115145864B (en) * | 2022-09-05 | 2022-11-04 | 深圳比特微电子科技有限公司 | Data processing method, system, electronic device and storage medium |
Citations (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4689739A (en) * | 1983-03-28 | 1987-08-25 | Xerox Corporation | Method for providing priority interrupts in an electrophotographic machine |
US5692193A (en) * | 1994-03-31 | 1997-11-25 | Nec Research Institute, Inc. | Software architecture for control of highly parallel computer systems |
US5721855A (en) * | 1994-03-01 | 1998-02-24 | Intel Corporation | Method for pipeline processing of instructions by controlling access to a reorder buffer using a register file outside the reorder buffer |
US6219780B1 (en) * | 1998-10-27 | 2001-04-17 | International Business Machines Corporation | Circuit arrangement and method of dispatching instructions to multiple execution units |
US6240508B1 (en) * | 1992-07-06 | 2001-05-29 | Compaq Computer Corporation | Decode and execution synchronized pipeline processing using decode generated memory read queue with stop entry to allow execution generated memory read |
US6272520B1 (en) * | 1997-12-31 | 2001-08-07 | Intel Corporation | Method for detecting thread switch events |
US20010016879A1 (en) * | 1997-09-12 | 2001-08-23 | Hitachi, Ltd. | Multi OS configuration method and computer system |
US6408381B1 (en) * | 1999-10-01 | 2002-06-18 | Hitachi, Ltd. | Mechanism for fast access to control space in a pipeline processor |
US6427195B1 (en) * | 2000-06-13 | 2002-07-30 | Hewlett-Packard Company | Thread local cache memory allocator in a multitasking operating system |
US6470443B1 (en) * | 1996-12-31 | 2002-10-22 | Compaq Computer Corporation | Pipelined multi-thread processor selecting thread instruction in inter-stage buffer based on count information |
US6493741B1 (en) * | 1999-10-01 | 2002-12-10 | Compaq Information Technologies Group, L.P. | Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit |
US20030120896A1 (en) * | 2001-06-29 | 2003-06-26 | Jason Gosior | System on chip architecture |
US6658490B1 (en) * | 1995-01-31 | 2003-12-02 | Microsoft Corporation | Method and system for multi-threaded processing |
US20040049672A1 (en) * | 2002-05-31 | 2004-03-11 | Vincent Nollet | System and method for hardware-software multitasking on a reconfigurable computing platform |
US6799317B1 (en) * | 2000-06-27 | 2004-09-28 | International Business Machines Corporation | Interrupt mechanism for shared memory message passing |
US6829769B2 (en) * | 2000-10-04 | 2004-12-07 | Microsoft Corporation | High performance interprocess communication |
US6912647B1 (en) * | 2000-09-28 | 2005-06-28 | International Business Machines Corporation | Apparatus and method for creating instruction bundles in an explicitly parallel architecture |
US6988186B2 (en) * | 2001-06-28 | 2006-01-17 | International Business Machines Corporation | Shared resource queue for simultaneous multithreading processing wherein entries allocated to different threads are capable of being interspersed among each other and a head pointer for one thread is capable of wrapping around its own tail in order to access a free entry |
US7051337B2 (en) * | 2000-04-08 | 2006-05-23 | Sun Microsystems, Inc. | Method and apparatus for polling multiple sockets with a single thread and handling events received at the sockets with a pool of threads |
US7082519B2 (en) * | 1999-12-22 | 2006-07-25 | Ubicom, Inc. | System and method for instruction level multithreading scheduling in a embedded processor |
US7363474B2 (en) * | 2001-12-31 | 2008-04-22 | Intel Corporation | Method and apparatus for suspending execution of a thread until a specified memory access occurs |
US7653912B2 (en) * | 2003-05-30 | 2010-01-26 | Steven Frank | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
Family Cites Families (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5282201A (en) * | 1987-12-22 | 1994-01-25 | Kendall Square Research Corporation | Dynamic packet routing network |
US5226039A (en) * | 1987-12-22 | 1993-07-06 | Kendall Square Research Corporation | Packet routing switch |
US5341483A (en) * | 1987-12-22 | 1994-08-23 | Kendall Square Research Corporation | Dynamic hierarchial associative memory |
US5251308A (en) * | 1987-12-22 | 1993-10-05 | Kendall Square Research Corporation | Shared memory multiprocessor with data hiding and post-store |
US5335325A (en) * | 1987-12-22 | 1994-08-02 | Kendall Square Research Corporation | High-speed packet switching apparatus and method |
US5055999A (en) * | 1987-12-22 | 1991-10-08 | Kendall Square Research Corporation | Multiprocessor digital data processing system |
US5119481A (en) * | 1987-12-22 | 1992-06-02 | Kendall Square Research Corporation | Register bus multiprocessor system with shift |
JPH0340035A (en) * | 1989-07-06 | 1991-02-20 | Toshiba Corp | Multi-task processing system |
JPH03208131A (en) * | 1990-01-11 | 1991-09-11 | Oki Electric Ind Co Ltd | Task control system for operating system |
CA2078315A1 (en) * | 1991-09-20 | 1993-03-21 | Christopher L. Reeve | Parallel processing apparatus and method for utilizing tiling |
US5313647A (en) * | 1991-09-20 | 1994-05-17 | Kendall Square Research Corporation | Digital data processor with improved checkpointing and forking |
JP3547482B2 (en) | 1994-04-15 | 2004-07-28 | 株式会社日立製作所 | Information processing equipment |
JPH09282188A (en) | 1996-04-16 | 1997-10-31 | Mitsubishi Electric Corp | Interruption processing method and system using the method |
JP3050289B2 (en) | 1997-02-26 | 2000-06-12 | 日本電気株式会社 | Output impedance adjustment circuit of output buffer circuit |
US6240502B1 (en) * | 1997-06-25 | 2001-05-29 | Sun Microsystems, Inc. | Apparatus for dynamically reconfiguring a processor |
JP2000076087A (en) * | 1998-08-28 | 2000-03-14 | Hitachi Ltd | Multioperating system control method |
US6279046B1 (en) | 1999-05-19 | 2001-08-21 | International Business Machines Corporation | Event-driven communications interface for logically-partitioned computer |
JP3807588B2 (en) * | 1999-08-12 | 2006-08-09 | 富士通株式会社 | Multi-thread processing apparatus, processing method, and computer-readable recording medium storing multi-thread program |
US6889319B1 (en) | 1999-12-09 | 2005-05-03 | Intel Corporation | Method and apparatus for entering and exiting multiple threads within a multithreaded processor |
US20030135716A1 (en) * | 2002-01-14 | 2003-07-17 | Gil Vinitzky | Method of creating a high performance virtual multiprocessor by adding a new dimension to a processor's pipeline |
US7076640B2 (en) * | 2002-02-05 | 2006-07-11 | Sun Microsystems, Inc. | Processor that eliminates mis-steering instruction fetch resulting from incorrect resolution of mis-speculated branch instructions |
US7062606B2 (en) * | 2002-11-01 | 2006-06-13 | Infineon Technologies Ag | Multi-threaded embedded processor using deterministic instruction memory to guarantee execution of pre-selected threads during blocking events |
US7281075B2 (en) * | 2003-04-24 | 2007-10-09 | International Business Machines Corporation | Virtualization of a global interrupt queue |
US20050108711A1 (en) * | 2003-11-13 | 2005-05-19 | Infineon Technologies North America Corporation | Machine instruction for enhanced control of multiple virtual processor systems |
2003
- 2003-05-30: US 10/449,732 (US7653912B2), Active
- 2003-12-12: US 10/735,610 (US7685607B2), Active

2004
- 2004-05-27: JP 2004158420A (JP4870914B2), Expired - Fee Related

2009
- 2009-10-26: US 12/605,839 (US8087034B2), Expired - Fee Related

2010
- 2010-02-04: US 12/700,211 (US8271997B2), Expired - Fee Related

2011
- 2011-07-04: JP 2011148703A (JP2011238266A), Pending
- 2011-11-14: US 13/295,777 (US8621487B2), Expired - Lifetime

2012
- 2012-09-13: US 13/614,011 (US20130185543A1), Abandoned
Patent Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4689739A (en) * | 1983-03-28 | 1987-08-25 | Xerox Corporation | Method for providing priority interrupts in an electrophotographic machine |
US6240508B1 (en) * | 1992-07-06 | 2001-05-29 | Compaq Computer Corporation | Decode and execution synchronized pipeline processing using decode generated memory read queue with stop entry to allow execution generated memory read |
US5721855A (en) * | 1994-03-01 | 1998-02-24 | Intel Corporation | Method for pipeline processing of instructions by controlling access to a reorder buffer using a register file outside the reorder buffer |
US5692193A (en) * | 1994-03-31 | 1997-11-25 | Nec Research Institute, Inc. | Software architecture for control of highly parallel computer systems |
US6658490B1 (en) * | 1995-01-31 | 2003-12-02 | Microsoft Corporation | Method and system for multi-threaded processing |
US6470443B1 (en) * | 1996-12-31 | 2002-10-22 | Compaq Computer Corporation | Pipelined multi-thread processor selecting thread instruction in inter-stage buffer based on count information |
US20010016879A1 (en) * | 1997-09-12 | 2001-08-23 | Hitachi, Ltd. | Multi OS configuration method and computer system |
US6272520B1 (en) * | 1997-12-31 | 2001-08-07 | Intel Corporation | Method for detecting thread switch events |
US6219780B1 (en) * | 1998-10-27 | 2001-04-17 | International Business Machines Corporation | Circuit arrangement and method of dispatching instructions to multiple execution units |
US6493741B1 (en) * | 1999-10-01 | 2002-12-10 | Compaq Information Technologies Group, L.P. | Method and apparatus to quiesce a portion of a simultaneous multithreaded central processing unit |
US6408381B1 (en) * | 1999-10-01 | 2002-06-18 | Hitachi, Ltd. | Mechanism for fast access to control space in a pipeline processor |
US7082519B2 (en) * | 1999-12-22 | 2006-07-25 | Ubicom, Inc. | System and method for instruction level multithreading scheduling in a embedded processor |
US7051337B2 (en) * | 2000-04-08 | 2006-05-23 | Sun Microsystems, Inc. | Method and apparatus for polling multiple sockets with a single thread and handling events received at the sockets with a pool of threads |
US6427195B1 (en) * | 2000-06-13 | 2002-07-30 | Hewlett-Packard Company | Thread local cache memory allocator in a multitasking operating system |
US6799317B1 (en) * | 2000-06-27 | 2004-09-28 | International Business Machines Corporation | Interrupt mechanism for shared memory message passing |
US6912647B1 (en) * | 2000-09-28 | 2005-06-28 | International Business Machines Corporation | Apparatus and method for creating instruction bundles in an explicitly parallel architecture |
US6829769B2 (en) * | 2000-10-04 | 2004-12-07 | Microsoft Corporation | High performance interprocess communication |
US6988186B2 (en) * | 2001-06-28 | 2006-01-17 | International Business Machines Corporation | Shared resource queue for simultaneous multithreading processing wherein entries allocated to different threads are capable of being interspersed among each other and a head pointer for one thread is capable of wrapping around its own tail in order to access a free entry |
US20030120896A1 (en) * | 2001-06-29 | 2003-06-26 | Jason Gosior | System on chip architecture |
US7363474B2 (en) * | 2001-12-31 | 2008-04-22 | Intel Corporation | Method and apparatus for suspending execution of a thread until a specified memory access occurs |
US20040049672A1 (en) * | 2002-05-31 | 2004-03-11 | Vincent Nollet | System and method for hardware-software multitasking on a reconfigurable computing platform |
US7653912B2 (en) * | 2003-05-30 | 2010-01-26 | Steven Frank | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
US7685607B2 (en) * | 2003-05-30 | 2010-03-23 | Steven Frank | General purpose embedded processor |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100228954A1 (en) * | 2003-05-30 | 2010-09-09 | Steven Frank | General purpose embedded processor |
US8621487B2 (en) | 2003-05-30 | 2013-12-31 | Steven J. Frank | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations |
US20120272247A1 (en) * | 2011-04-22 | 2012-10-25 | Scott Steven L | Software emulation of massive hardware threading for tolerating remote memory references |
US9201689B2 (en) * | 2011-04-22 | 2015-12-01 | Cray Inc. | Software emulation of massive hardware threading for tolerating remote memory references |
WO2013113206A1 (en) * | 2012-02-01 | 2013-08-08 | 中兴通讯股份有限公司 | Smart cache and smart terminal |
US9632940B2 (en) | 2012-02-01 | 2017-04-25 | Zte Corporation | Intelligence cache and intelligence terminal |
Also Published As
Publication number | Publication date |
---|---|
US8087034B2 (en) | 2011-12-27 |
JP2004362564A (en) | 2004-12-24 |
US20040250254A1 (en) | 2004-12-09 |
US7685607B2 (en) | 2010-03-23 |
US20100228954A1 (en) | 2010-09-09 |
US7653912B2 (en) | 2010-01-26 |
US20100162028A1 (en) | 2010-06-24 |
US20040244000A1 (en) | 2004-12-02 |
US8621487B2 (en) | 2013-12-31 |
US20130185543A1 (en) | 2013-07-18 |
JP4870914B2 (en) | 2012-02-08 |
US8271997B2 (en) | 2012-09-18 |
US20120151487A1 (en) | 2012-06-14 |
JP2011238266A (en) | 2011-11-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7653912B2 (en) | Virtual processor methods and apparatus with unified event notification and consumer-producer memory operations | |
US10929323B2 (en) | Multi-core communication acceleration using hardware queue device | |
US10379887B2 (en) | Performance-imbalance-monitoring processor features | |
US20160026574A1 (en) | General purpose digital data processor, systems and methods | |
EP2171574B1 (en) | Multiple-core processor and system with hierarchical microcode store and method therefor | |
KR100299691B1 (en) | Scalable RSC microprocessor architecture | |
US8972699B2 (en) | Multicore interface with dynamic task management capability and task loading and offloading method thereof | |
US7676664B2 (en) | Symmetric multiprocessor operating system for execution on non-independent lightweight thread contexts | |
US6044430A (en) | Real time interrupt handling for superscalar processors | |
US20050015768A1 (en) | System and method for providing hardware-assisted task scheduling | |
US8046538B1 (en) | Method and mechanism for cache compaction and bandwidth reduction | |
CN109388429B (en) | Task distribution method for MHP heterogeneous multi-pipeline processor | |
TW201423402A (en) | General purpose digital data processor, systems and methods | |
CN109408118B (en) | MHP heterogeneous multi-pipeline processor | |
US11775336B2 (en) | Apparatus and method for performance state matching between source and target processors based on interprocessor interrupts | |
CN112395000B (en) | Data preloading method and instruction processing device | |
JP2005182791A (en) | General purpose embedded processor | |
JP4631442B2 (en) | Processor |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ZAAA | Notice of allowance and fees due | Free format text: ORIGINAL CODE: NOA |
| | ZAAB | Notice of allowance mailed | Free format text: ORIGINAL CODE: MN/=. |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | FPAY | Fee payment | Year of fee payment: 4 |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 8 |
| | FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| | STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20231227 |