GB2489243A

GB2489243A - Trace cache with pointers to common blocks and high resource thread filtering

Info

Publication number: GB2489243A
Application number: GB1104760.2A
Authority: GB
Inventors: Azam Beg; Ajmal Beg
Original assignee: United Arab Emirates University
Current assignee: United Arab Emirates University
Priority date: 2011-03-21
Filing date: 2011-03-21
Publication date: 2012-09-26
Anticipated expiration: 2031-03-21
Also published as: GB201104760D0; GB2489243B

Abstract

A processor 100 has an instruction cache 101 and a trace cache 102. The trace cache includes pointers to commonly used blocks of instructions. The commonly used blocks may be identified using a signal 112. The processor uses a multiplexer 103, 104 to select instructions from the trace cache or the instruction cache to be executed in the execution engine 106. The trace cache may be divided into separate pointer and block sections. The processor may use a thread filter 107 to trace only threads specified by a thread selection signal 110. The signal may be sent by a user or by the operating system based on the resource usage of the threads. The trace cache may be disabled to save energy in response to an energy mode signal 111.

Description

ARCHITECTURE OF A PROCESSOR WITH LOW-ENERGY

INSTRUCTION CACHE

TECHNICAL FIELD

This invention relates to an architecture of a processor with low-energy instruction cache.

BACKGROUND

A processor executes instructions using one or more execution units. An execution unit in a modern processor receives the instructions from an instruction cache or a unified instruction-data cache. The execution unit in the processor remains idle if there is a "cache miss," meaning the desired instruction(s) is/are not available in the cache. This also means that the instructions have to be fetched from a higher level of memory such as RAM. The RAM access time is larger than that of the cache.

Practical programs exhibit "locality of reference", which is the tendency of sets of instructions (also called blocks) related to a single thread, to repeatedly execute.

One of the ways of increasing the chances of finding the instructions in a cache is to store the blocks of instructions in a special instruction cache such as Code Pattern Cache (CPC).

CPC generally exhibits shorter access time than the common instruction cache.

The miss rate of any cache (including CPC) can be reduced by increasing its size and/or by dedicating cache areas for different threads running on a processor.

However, increasing the size of a cache (including CPC) results in higher energy consumption and increased die area. Thus, there is need for a cache architecture that results in lower energy consumption and smaller die area while maintaining lower miss rates.

BRIEF SUMMARY OF THE INVENTION

A low energy processor, comprising: execution unit to execute instructions, instruction cache to store instructions, cache storingcommonly used instructions, cache storing blocks of instructions from threads, references to blocks in the cache storing commonly used instructions, multiplexer to select instruction from either from the instruction cache or from the cache storing blocks of instructions belonging to threads.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings together with the description, serve to explain the principles of the invention.

FIG. 1 illustrates an exemplary processor with low energy CPC (LE-CPC) according to the present invention.

FIG. 2 illustrates an exemplary executable for selecting threads which are more likely to benefit from the CPC and sending the selected thread information to the processor according to the present invention.

FIG. 3 illustrates an exemplary thread filter in the processor, which selects traces based on the information received by the exemplary executable illustrated in FIG. 2 and is according to the present invention.

FIG. 4 illustrates an exemplary trace build engine used to build traces according to the present invention.

FIG. 5 illustrates an exemplary executable which provides the processor the most commonly used blocks' information according to the present invention.

FIG. 6 illustrates an exemplary storage module which stores the traces of instructions relevant to selected threads according to the present invention.

FIG. 7 illustrates an exemplary cache which stores commonly used sets of instructions according to the present invention.

FIG. 8 illustrates an exemplary cache which stores pointers for traces of threads according to the present invention.

FIG 9 illustrates exemplary cache structures for storing instructions according to the present invention.

FIG 10 illustrates exemplary lines in the cache structures for storing compressed instructions according to the present invention.

FIG 11 illustrates exemplary lines in the cache structures for storing uncompressed instructions according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

FIG. 1 illustrates an exemplary processor 100 with LE-CPC according to the present invention.

The execution engine 106 executes instructions. The instructions to the execution engine 106 are provided by instruction cache 101 and a further cache, namely a LE-CPC storage module 102.

The instruction cache 101 receives instructions 113 from other memory area such as RAM or another level of cache.

The LE-CPC storage module 102 stores the traces of instructions from threads executed by the execution engine 106.

Two multiplexers 103, 104 select between instruction cache 101 and LE-CPC storage module 102 via lines/or as a source of instructions for the execution engine 106.

A thread filter 107 selects traces from only those threads which are using the processor extensively. The thread filter 107 selects threads based on a thread selection signal 110 which is received from a utility running at the operating system level. Using a utility to select threads which consume the processor extensively eliminates the need to build a selection unit in the processor at the hardware level, thus further reducing the die area of the processor and decreasing the energy consumption of the CPC.

LE-CPC trace build engine 108 buffers the output of the thread filter 107 to LE-CPC storage module 102 to build a trace 109.

The LE-CPC storage module 102 stores the traces in a commonly used block cache 603 and basic block cache 601 (as described below with reference to Figure 6). The commonly used block cache 603 in the LE-CPC storage module 102 stores the commonly used sets of instructions in the form of basic blocks.

The approach of taking traces of only those threads which are using the processor extensively (i.e. commonly) helps reduce the size of the CPC, thus reducing the energy consumption and the die area of the CPC.

The commonly used block cache 603 in the LE-CPC storage module is filled by the commonly used block information received by a utility running at the OS level. As such this utility is not implemented at the hardware level, and thereby it further reduces the die area, resulting in low energy consumption by the processor.

The blocks in a built trace which are the same as the blocks stored in the commonly used block cache 603 are replaced by the pointers toward the commonly used block cache 603.

That is, parts of the trace are replaced by a reference to blocks stored in the LE-CPC storage module 102 (e.g. in the commonly used block cache 603). This approach further reduces the size of the CPC, thus further resulting in reduced die area and low energy consumption.

The thread filter 107, LE-CPC trace engine 108 and LE-CPC storage module 102 have energy modes which are controlled by energy mode signals 111. In an embodiment, the thread filter 107, the LE-CPC trace engine 108 and the LE-CPC storage module 102 can be switched off during the duration of low activity of the processor 100, thus further allowing low energy consumption.

FIG. 2 illustrates an exemplary executable called "high resource thread selection utility" 203 for selecting threads which are more likely to benefit from the CPC according to the present invention and for sending the selected threadst information to the processor 100, 201.

The high resource thread selection utility 203 runs on operating system level 202 and collects threads information 208, 210. Thread information 208 from the OS level may have different details and formats compared to thread information 210 from the processor 101, 201.

The high resource thread selection utility 203 consists of two routines, "automatic high resource thread detection routine" 204 and "manual high resource thread selection routine" 205.

The high resource thread selection utility 203 also provides a graphical user interface 207.

The automatic high resource thread detection routine 204 collects information about high resource consuming threads without need of an input from a human user through a graphical user interface 207.

The manual high resource thread selection routine 205 collects information about high resource consuming threads based on an input frcsm a human user through a graphical user interface 207.

The high resource thread selection utility 203 may store high resource thread related information in OS level files 206 for later reference to make decisions related to thread selection based on the historical data.

Thread selection information 209, 211 is sent to the processor 100, 201. The thread selection information at the operating system level 209 and at the processor level 211 may have different details and formats compared to one another.

FIG. 3 illustrates an exemplary thread filter 300 in the processor (equivalent to thread filter 107 in Figure 1), which selects traces based on the thread selection signal 211, 303 (equivalent to thread selection signal 110 in Figure 1) according to the present invention. The filter 301 in the thread filter 300 receives the traces 302 of instructions related to tn number of threads which are executed by the execution engine 106. The filter 301 selects only traces of instructions that are related to n number of threads. The selection of n number of threads is based on the thread selection signal 211, 303 from the operating system 202.

As the thread filter 300 receives this information about thread selection 211, 303 from high resource thread selection utility 203, the thread filter 300 does not need to contain complex circuitry to implement decision making logic at the hardware level. Such approach reduces the die area resulting in low energy consumption. The thread filter 300 can be switched to low energy mode based on energy mode signal 111, 304 during the duration of low activity. Thus, allowing further reduction in energy consumption.

FIG. 4 illustrates an exemplary trace build engine 400 (equivalent to LE-CPC trace build engine 108 in Figure 1) used to build traces according to the present invention. The exemplary trace build engine 400 receives and buffers traces of threads from thread filter 107, 303. The exemplary trace build engine 400 is a first-in first-out buffer which consists of multiple trace buffer areas, called LE-CPC trace buffer area 403. Each LE-CPC trace buffer area 403 consists of six fields to memory areas; thread ID 404, Sequence ID 405, Head address 406, Tail address 407, Branch status 408 and Instruction area 409. In an embodiment, all six memory areas are of fixed length.

Thread ID 404 stores the thread ID the trace belongs to. Head address 406 is the address of the first instruction in a block belonging to Thread ID 404. Tail address 407 is the address of the last instruction in a block belonging to thread ID 404. Branch status 408 is the branch status of block belonging to a trace in thread ID 404. Instruction area 409 stores the block belonging to thread ID 404 that starts from the head address 406 and ends at tail address 407. Long traces which cannot be stored fully in one single instruction area 409 are divided and stored in multiple LE-CPC trace buffer areas 403.

FIG. 5 illustrates an exemplary exeOutable called "commonly used block detection utility" 503 which provides the processor 100, 501 the commonly used blocks information 112 (see Figure 1) according to the present invention.

The commonly used block detection utility 503 runs on operating system level 502 and collects commonly used blocks information 504. The commonly used blocks information at the OS level 504 may have different details and formats compared to the commonly used blocks information 505 sent to the processor 100, 501.

The commonly used block detection utility 503 keep tracks of the commonly used blocks, The commonly used block detection utility 503 also provides a graphical user interface 507 to allow users select most commonly used programs. The commonly used block detection utility 503 may store commonly used block information in OS level files 506 for later reference to make decisions related to commonly used block information based on the historical data.

Commonly used block information 504, 505, 112 is passed to the LE-CPC storage module 102.

The process does not need to implement logic at hardware level to make decisions about commonly used block detection utility. It reduces the die area and helps reduce the energy consumption.

FIG. 6 illustrates an exemplary LE-CPC storage module 600 (equivalent to LE-CPC storage module 102 of Figure 1) which stores the traces of instructions relevant to selected threads according to the present invention. The exemplary LE-CPC storage module 600 receives trace build from LE-CPC trace build engine signals 611 (equivalent to build trace signal 109 of Figure l) A commonly used block cache filler 604 receives the commonly used block information 112 613 and stores this information 112, 613 in the commonly used block cache 603.

A basic block filler 605 uses the information in the commonly used block cache 603 and stores traces in a block pointer cache 602 and basic block cache 601.

An instruction extractor 606 receives address 614 and sets a LE-CPC hit flag 607 to indicate the availability of a trace containing the address 614 in the LE-CPC storage module 600. A block count 608 produced by the instruction extractor 606 indicates the number of instructions contained in the trace. A block of instructions 610 produced by the instruction extractor 606 contains the set of instructions which form the trace. A block counter 609 produced by the instruction extractor 606 indicates the existing count number in the block count 608. With each cycle, the block counter 609 increments until it reaches the block count 608.

The LE-CPC storage module 600 can be switched to low energy mode based on energy mode signal 111 612 during the duration of low activity. Thus, allowing further reduction in energy consumption.

FIG. 7 illustrates an exemplary commonly used block cache 700 (equivalent to the commonly used block cache 603 of Figure 6) which stores commonly used sets of instructions according to the present invention. The commonly used block cache 700 consists of multiple fixed length memory areas 701 storing commonly used blocks. This fixed length memory area is called "commonly used block area" 701 here. The commonly used block area 701 consists of three fixed length memory areas: Block ID 702, SQ No. 703 and commonly used block 704.

The block ID 702 identifies uniquely the common used block. The commonly used block is stored in commonly used block 705.

The commonly used block cache may be updated periodically by the commonly used block detection utility 503.

The commonly used block detection utility 503 may get the information about the current thread selection signal 209 from the high resource thread selection utility 203 and fill the commonly used block cache 700 with the commonly used blocks that are relevant to the traces related to currently selected threads.

SQ No 703 allows storing commonly used code that spans over multiple commonly used block areas 701.

FIG. 8 illustrates an exemplary block pointer cache 800 (equivalent to block pointer cache 602 of Figure 6) which stores pointers for blocks stored for the traces in the commonly used block cache 603 and basic block cache 601 of threads stored according to the present invention.

Each row consists of multiple fields: A thread ID 801 contains the unique identifier of a thread. A trace valid bit 803 indicates the validity of the trace 803. A trace LRU 804 indicates the least recently used status of the trace. A branch status 805 indicates the branch status.

Block n 806 807 808 809 holds the pointers to the blocks in commonly used block cache 603 and basic block cache 601. SQ No 802 is the sequence number and is used in case when the trace is too long and cannot be accommodated in one row. One row in block n has four fields 810 811 812 813. A first field 810 indicates the type of cache (the commonly used block cache 603 or the basic block cache 601) which holds the block. A head address 811 holds the head address of the block. A tail address 812 holds the tail address of the block. A way ID 813 indicates the way number for the block in the trace.

FIG. 9 illustrates the basic block cache 900 (equivalent to basic block cache 601 of Figure 6). An opcode detector and encoder (ODE) 901 distinguishes between two types of incoming instructions (opcodes) 906: frequently occurring opcodes and non-frequent opcodes.

The frequently occurring opcodes are compressed before storage, while the other opcodes are not compressed before storage. We use four most frequently opcodes for compression, following the example of MIPS architecture,.in which just four opcodes make up more than 50% of the executed instructions in CPU2006 benchmarks. The compressible opcodes are encoded by ODE 901 and routed 908 to a BBC compressed data Array (BBC-CDA) 902 and the uncompressed codes 910 are dispatched to a BBC uncompressed data array (BBC-UDA) 903. ODE 901 encoding converts 6-bit opcodes into 2-bit codes and sends them to 28xn bit wide BBC-CDA 902. Storage and retrieval of 2-bit opcodes uses less energy than normal 6-bit opcodes. The remaining 6-bit opcodes (instructions) are stored in the regular 32xn bit wide BBC-UDA 903.

ODE 901 asserts a write enable signal 907 if the compressed opcodes are being sent BBC-CDA 902, or asserts the other write enable signal 909 if the uncompressed opcodes 910 are being directed to BBC-UDA 903.

Upon a hit to the BBC-CDA 902, it routes the instructions 911 first to an opcode decoder 904 and then 912 to a merging buffer 905. While a hit to BBC-UDA 903 does not require opcode decoding and instructions 913 are transferred directly to the merging buffer 905.

A multiplexer MUX 914 selects which instructions 912, 913 to send to the merging buffer 905.

The merging buffer 905 outputs instructions 915 to the execution engine (the instruction extract 606 of Figure 6).

FIG. 10 illustrates the composition of lines 1000 in BBC-CDA 902. Each line of BBC-CDA 902 is made up of 28xn bits.

FIG. 11 illustrates the composition of lines 1100 in BBC-UDA 903. Each line of BBC-UDA 903 is made up of 32xn bits.

It is to be understood that while the detailed description describes the present invention, the foregoing description is for illustrative purpose and does not limit the scope of the present invention which is defined by the scope of the appended claims. Other embodiments, arrangements and equivalents will be evident to those skilled in the art. Other embodiments, arrangements, usages and equivalents are within the scope of the present invention as defined by the appended claims.

Claims

CLAIMSI. A processor, comprising: an execution unit (106) to execute instructions; an instruction cache (101) storing instructions; a further cache (102, 600) storing traces of instructions from threads, where parts of the trace are replaced by reference to blocks stored in the further cache; a multiplexer (103, 104) to select instruction for the execution unit (106) from either the further cache storing traces of instruction from threads or from the instruction cache.
2. A processor according to claim I, characterized by further comprising a thread filter (107, 300) for selecting threads of which a trace is to stored in the further cache.
3. A processor according to claim 2, characterized in that the thread filter (107, 300) is adapted to select the threads based on a signal (110, 303) from an operating system (202) of the processor.
4. A processor according to any one of claims 1 to 3, characterized in that the further cache (102, 600) has a block memory part (603) allocated for storing commonly used instructions.
5. A processor according to claim 4, wherein the commonly used instructions are filled in the block memory part (603) based on information from an operating system (202) of the processor.I
6. A processor according to any one of claims 1 to 5, characterized in that the further cache (102, 600) has a pointer memory part (602) allocated for pointers to blocks in the block memory part thereby implementing parts of the trace to be replaced by reference to blocks.
7. A program, which when executed on a computer, causes the computer to carry out the following: a step of automatic high resource thread detection in which information about high resource consuming threads is collected and/or a step of accepting user input as to high resource threads; generating a thread selection signal based on the step of automatic high resource thread detection and/or the step of accepting new input; and selecting threads for storing in a further cache separate from an instruction cache based on the thread selection signal.
8. The program of claim 7, which further causes the computer to carry out: storage of traces of instructions from selected threads, in which parts of the traces are replaced by reference to blocks stored in the further cache.
9. The program of claim 7 or 8, which further causes the computer to carry out: storing blocks representing parts of traces of selected threads in a block memory part (603) of the further cache.
10. The program of claim 8 or 9, which further causes the computer to carry out: storing pointers to blocks stored in the further cache in a pointer memory part of the further cache.
11. The program of any of claims 7 to 10, wherein the step of accepting user input is carried out through a graphical user interface.
12. A program, which when executed on a computer, causes the computer to cany out the following: a step of accepting user input to determine commonly used blocks and/or a step of code simulating to determine commonly used blocks; passing commonly used block information to a further cache different to an instruction cache; and storing block information passed to the further cache in a block memory part of the further cache.AMENDMENTS TO CLAIMS HAVE BEEN FILED AS FOLLOWSCLAIMS1. A program, which when executed on a computer, causes the computer to carry out the following: a step of automatic high resource thread detection in which information about high resource consuming threads is collected and/or a step of accepting user input as to high resource threads; generating a thread selection signal based on the step of automatic high resource thread detection and/or the step of accepting user input; and selecting threads for storing in a further cache separate from an instruction cache based on the thread selection signal.2. The program of claim 1, which further causes the computer to carry out: storage of traces of instructions from selected threads, in which parts of the traces are replaced by reference to blocks stored in the further cache. rcc 3. The program of claim 1 or 2, which further causes the computer to carry out: 0 storing blocks representing parts of traces of selected threads in a block memory part (603) of the further cache. C.J204. The program of claim 2 or 3, which further causes the computer to carry out: storing pointers to blocks stored in the further cache in a pointer memory part of the further cache.5. The program of any of claims ito 4, wherein the step of accepting user input is carried out through a graphical user interface.