GB2390178A - Extended instruction buffer - Google Patents

Extended instruction buffer

Info

Publication number
GB2390178A
GB2390178A (application GB0130931A)
Authority
GB
United Kingdom
Prior art keywords
stack
instruction
data
buffer
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0130931A
Other versions
GB0130931D0 (en)
Inventor
Rob Macaulay
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
VULCAN MACHINES Ltd
Original Assignee
VULCAN MACHINES Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by VULCAN MACHINES Ltd filed Critical VULCAN MACHINES Ltd
Priority to GB0130931A
Publication of GB0130931D0
Publication of GB2390178A
Legal status: Withdrawn

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30149Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

There is disclosed a buffer for receiving instructions to be processed, wherein the instructions are comprised of at least one sub-set of data, the largest instruction comprising m sub sets of data, wherein n sub sets of data are loaded into the buffer on a processing cycle, further wherein the size of the buffer is the greater of [(m-1)/n + 1] or 2 multiplied by the size of the sub-set.

Description

EXTENDED INSTRUCTION BUFFER
Field of the Invention
The present invention relates to a buffer for storing data to be processed, and particularly but not exclusively to a buffer for storing instructions in a processor.
Background to the Invention
In applications such as microprocessor applications, instructions are provided to the microprocessor following accesses to memory.
In some applications, instructions are not of a fixed length. For example, instructions may vary in length from one byte to five bytes. The boundary between instructions is arbitrary, and it is possible that a full instruction is not always retrieved in a single memory access. In particular, where instructions vary in length and are longer than a memory access width, it is especially possible that a complete instruction cannot be guaranteed to be accessed on each memory access.
As such, it is possible that a processor may have to wait two or more memory cycles before a full instruction can be decoded.
It is an aim of the present invention to overcome one or more of the above-
stated problems.
Summary of the Invention
According to the present invention there is provided a buffer for receiving instructions to be processed, wherein the instructions are comprised of at least one sub-set of data, the largest instruction comprising m sub-sets of data, wherein n sub-sets of data are loaded into the buffer on a processing cycle, further wherein the size of the buffer is the greater of [(m-1)/n + 1] or 2 multiplied by the size of the sub-set.
The size of the sub-set may be one byte, n=4 and m=5.
The number of subsets of data loaded on a processing cycle may correspond to the size of a data bus.
The instructions may be microprocessor instructions.
The instructions may be accessed by a microprocessor from an external memory. The buffer may further comprise an instruction extraction unit for extracting an instruction to be processed therefrom.
Brief Description of the Figures
Figure 1 illustrates in block diagram form an exemplary processor architecture for implementing the present invention;
Figure 2 illustrates an exemplary implementation of a buffer used in the processor architecture of Figure 1;
Figure 3 illustrates an exemplary implementation of a stack architecture used in the processor architecture of Figure 1;
Figure 4 illustrates a decode operation in the processor of Figure 1;
Figure 5 illustrates examples of movement in the stack architecture of Figure 3;
Figure 6 illustrates an exemplary implementation of a control element for the stack architecture of Figure 3;
Figure 7 illustrates an example implementation of the stack architecture of Figure 3;
Figure 8 illustrates an exemplary system implementation of the processor of Figure 1;
Figure 9 illustrates a prior art write data technique; and
Figure 10 illustrates a write data technique in accordance with one aspect of the present invention.
Description of Preferred Embodiments
The processor described herein with reference to a preferred embodiment of the present invention is a 32-bit microcomputer which is optimized to run a stack-based language, such as Java. The processor incorporates a hardware stack memory, as the Java model is based on a stack. The preferred embodiment of the present invention directly supports the Java Virtual Machine (JVM).
The general structure of an exemplary architecture of a processor for illustrating the present invention is shown in Figure 1. In this preferred embodiment the processor architecture can be generally considered to consist of an instruction execution unit and a data path unit. The data path unit handles ALU operations and stack and memory access, as described further hereinafter.
The present invention is furthermore described in relation to a particular embodiment in which the processor interfaces to a (PVCI) system bus.
Referring to Figure 1, the main elements of the processor architecture comprise a VCI bus interface and resource arbiter 104, a prefetch buffer 106, an instruction fetch control block 108, an instruction extractor block 110, a microcode controller 116, a microcode address block 114, a program counter (PC) 112, a multiplexer 118, an instruction decode ROM 120, a stack control block 122, an arithmetic logic unit (ALU) 128, an internal stack memory 134, a pair of registers 130 and 132, a stack pointer (SP) register 124 and a stack base (SB) register 126.
The instruction unit comprises the elements generally included within the dashed line 100, and the data path unit comprises the elements generally included within the dashed line 102.
The instruction unit 100 fetches bytecodes, using the instruction fetch control block 108, from a memory location pointed to by the program counter 112.
The instruction fetch control unit 108 receives the memory location on a bus 136 from the program counter 112. The instruction fetch control block 108 sends a request on bus 138 to the interface and arbiter 104. The retrieved bytecodes are received via the interface and arbiter 104. The received bytecodes are loaded into the prefetch buffer 106 via bus 142 under the control of the instruction fetch control block 108 on a bus 140. The prefetch buffer 106 is a small instruction buffer comprising a set of prefetch registers.
The fetched instructions are extracted from the buffer 106 on bus 144 by the instruction extractor 110 for processing in the instruction unit 100.
Typically, more than one complete bytecode is fetched at any one time from memory, since the width of the data bus is larger than most bytecode sequences. However, some bytecode sequences may be larger than the data bus, and thus in some fetches a full bytecode sequence may not be obtained.
Referring to Figure 2, there is shown a block diagram illustrating a prefetch buffer 106 in accordance with a preferred embodiment of the present invention. Like reference numerals are used to refer to elements in Figure 2 corresponding to those shown in Figure 1. As shown in Figure 2, the prefetch buffer comprises a first buffer 202 and a second buffer 204. In this preferred embodiment data is accessed from main memory a word at a time on bus 142, and each of the buffers 202 and 204 is for storing a word.
A control block 206 of Figure 2 represents part of the instruction fetch control block 108 of Figure 1, and a selector block 208 of Figure 2 represents part of the instruction extractor block 110 of Figure 1.
As mentioned above, in this embodiment data is accessed from main memory, by control block 206 using control signals 220, a word at a time. The accessed data is presented on bus 142 to the buffers 202. As two words are stored in the embodiment of Figure 2, a first word (N-1) is stored in buffer 204 and a second word (N) is stored in buffer 202.
For the purposes of this example, it is assumed that instructions are comprised of byte-codes, and that instructions vary in length from a minimum of one byte up to a maximum of five byte codes. The byte-codes, once stored in the buffers 202 and 204, are consumed such that complete instructions are presented on line 216 at the output of the selector 208. The selector 208 is controlled by a control line 220 from the control block 206 to select the byte codes for the current instruction from the buffers 202 and 204 on the buffer output lines 212 and 214.
Thus, in this example, the provision of a buffer capable of storing two words ensures that the full bytecode sequence of the longest codeword instruction (five bytes) is always available in the prefetch buffer 106.
The advantage of ensuring that a full bytecode sequence is held in the prefetch buffer is that all information for a single instruction is guaranteed to be available in the prefetch buffer 106 on a single cycle. If the prefetch buffer were smaller than two words, then it could not be guaranteed that all necessary information was present. If the prefetch buffer were not sized in this manner, then for certain instructions it may be necessary to wait for a further memory access cycle before the next instruction is available. The provision of the appropriately sized buffer as discussed above reduces the amount of fetching of data from main memory, and consequently reduces access time on the memory bus. This can be very important where access to the memory bus is limited.
It should be noted that the implementation of Figure 2 is only an example implementation, and the specific implementation of the pre-fetch buffer is not important to this embodiment of the invention. The important feature is that the buffer is sized appropriately to allow the full sequence of bytecodes for a given instruction to be stored.
In a more general case, where data is retrieved from memory in sub-sets of data, n sub-sets at a time, and the largest instruction comprises m sub-sets of data, it can be considered that the prefetch buffer should be sized such that the size of the buffer is the greater of [(m-1)/n + 1] or 2. By applying this technique to sizing the buffer, the size of the prefetch buffer can be guaranteed to be sufficient to have all necessary information available for any instruction.
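The sizing rule above can be sketched in code. This is an illustrative sketch, not part of the patent; in particular, rounding (m-1)/n up to the next integer is an assumption, since the text does not state how a fractional quotient is handled.

```python
import math

def prefetch_buffer_words(m: int, n: int) -> int:
    """Number of load-width units (n sub-sets each) the prefetch
    buffer should hold, per the rule: the greater of (m-1)/n + 1 or 2.

    m: sub-sets of data in the largest instruction
    n: sub-sets of data loaded into the buffer per processing cycle

    The ceiling on (m-1)/n is an assumption for non-integral cases.
    """
    return max(math.ceil((m - 1) / n) + 1, 2)

# Worked example from the description: byte sub-sets, word-wide
# loads (n=4) and a five-byte maximum instruction (m=5) give a
# two-word buffer.
print(prefetch_buffer_words(5, 4))  # 2
```

For the embodiment of Figure 2 this reproduces the two-word buffer; a hypothetical nine-byte maximum instruction with word-wide loads would need three words.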
One skilled in the art will appreciate that this concept is more generally applicable than the application shown in Figures 1 and 2, and may more generally apply to any storage scenario where a required number of sets of bits is required in order to perform an operation.
Turning again to Figure 1, the instruction fetch control block 108 continues to attempt to read memory words provided there is room in the prefetch buffer 106. The instruction extractor block 110 extracts the next instruction to be processed from the prefetch buffer 106 on a bus 144. The extracted instruction is provided on line 168 to the microcode controller 116 and a first input of the multiplexer 118. The multiplexer 118 receives a control input on line 166 from the microcode controller 116. The second input of the multiplexer 118 is provided on line 170 by the microcode address block 114.
As is known in the art, the microcode address block 114 receives an input from the instruction decode ROM and provides an input to the multiplexer 118. As also known in the art, the multiplexer 118 is controlled to apply one of its inputs on lines 170 or 168 to the instruction decode ROM 120. The output of the multiplexer 118 on line 164 forms the input to the instruction decode ROM 120. The output of the instruction decode ROM 120 on line 162 provides an input to the instruction extractor 110, the microcode address block 114, the microcode controller 116, the stack control block 122 and the ALU 128.
The instruction decode ROM 120 operates, as is known in the art, to decode the current bytecode. This decode indicates, for example, the ALU operation requested and the size of the immediate arguments.
Any immediate arguments are extracted from the prefetch registers within the prefetch buffer 106, and passed on to the execution unit. As discussed hereinabove, the prefetch unit holds several memory words at once, and bytecode execution can usually be performed at a rate of one bytecode per clock tick. For example, the worst case for execution is a bytecode that requires four bytes of immediate data, where the bytecode occupies the most significant byte of a memory word. The prefetch buffer can provide all the data required. In some cases a bytecode may require several ALU operations. In such case, several microcode instructions are executed, before the next bytecode is fetched. As will be discussed further hereinbelow, in one aspect of the present invention the instruction decode ROM 120 provides two decoded signals which are advantageously used in controlling the stack operation of the datapath unit 102.
The datapath unit essentially consists of the ALU, the hardware stack memory 134, the two hardware registers 130 and 132, and a number of internal registers. ALU operations operate on the top two elements of the stack architecture, i.e. the registers 130 and 132. Data may be pushed onto the
stack memory by, for example, an immediate data instruction. Data may be popped - or pulled - from the stack, often as a result of a two-operand operation. Data is transferred to and from the registers 130 and 132 to the top of the stack memory. The stack pointer (SP) register indicates the current location in the stack for operation, and is updated automatically after an operation. As described further hereinbelow, the provision of the stack architecture of the preferred embodiment of the invention allows stack operations to be 'folded' or merged together where possible. This allows the stack memory to be a single port device.
A more detailed implementation of the stack architecture is described hereinbelow with reference to Figure 3. As is clearly illustrated in Figure 3, the stack 134, which is an internal stack, is associated with two registers 130 and 132, each having a width equivalent to the width of the stack. The provision of the two registers 130 and 132 provides for a particularly advantageous stack operation, as will be further described. As is shown in Figure 3, and is described further hereinbelow, the stack control block 122 includes first and second multiplexors 300 and 302 for providing inputs to the first and second registers 130 and 132 respectively.
The registers 130 and 132 are provided to store the operands or values which would normally be stored, in a conventional stack memory, in the top two locations of the stack. The stack pointer 124 points to the current stack address, i.e. the address in the stack memory 134 storing the current value of the stack. The stack pointer is not used to access the registers 130 and 132.
As will be discussed in the following description, in a stack PUSH operation according to this embodiment of the present invention, an operand is pushed into the top of the memory stack 134 from the register 132. In a stack PULL operation according to this embodiment of the present invention, an operand is pulled from the top of the stack memory 134 to the register 132.
Holding the values normally held in the top two locations of the stack in the registers 130 and 132, which can be considered to be special purpose hardware, leads to three pieces of information being accessible at any one
time: the contents of the two registers and the contents of the stack currently pointed to. Thus these values are available with only one access to the stack memory 134. This allows all ALU operations to be performed with a single access to the stack memory at most, therefore enabling the stack memory to be implemented as a single port device. Data is transferred to and from the hardware stack memory to the registers 130 and 132.
The control block 330, which is part of the control block 122, receives control signals on line 162 from the instruction decode ROM 120, as represented by signals 330 in Figure 3. The control block 330 generates the control signals for controlling the stack architecture of Figure 3 in accordance with this embodiment of the invention. The control block 330 differs from a control block of a normal stack architecture to allow for the control of the stack memory 134 and the two registers 130 and 132.
The inputs to the multiplexors 300 and 302 are consistent with a conventional stack architecture. The input on line 304 to the multiplexer 300 comprises an external operand or value to be pushed into the stack architecture, in this example the first register 130. The corresponding input to the multiplexer 302 is provided by the output of the register 130 on line 132. The inputs on respective input lines 306 and 312 to the multiplexors 300 and 302 are provided by a stack value input, which is the current value of the stack memory provided on line 340. The inputs on respective input lines 308 and 314 to the multiplexors 300 and 302 are provided directly by the result output of the ALU 128 on line 150. The inputs on respective input lines 310 and 316 to the multiplexors 300 and 302 are provided by the respective outputs of the multiplexors on lines 318 and 320, i.e. these inputs are feedback inputs.
The control unit 330 provides control signals on line 332 to each of the multiplexors 300 and 302, to select the one of the multiplexor inputs to be presented on the output thereof to the respective register 130 or 132.
Finally, the output of the register 132 on line 134 provides at its output a signal which may be presented to the stack memory 134, for loading into the current stack location, as pointed to by the stack pointer. The input to the
stack memory 134 is represented by line 338. The stack memory may also generate an output, as represented by line 340, for presenting a value of a current stack location.
The control unit 330 receives on line 336 from the instruction decode ROM 120, amongst other signals, a signal representing a decode phase and a signal representing an ALU phase for a current instruction in a current memory cycle. In a preferred embodiment of the present invention, these signals are used by the control block 330 to control the multiplexors 300 and 302 and, in addition, to control the movement of the stack memory 134.
Referring to Figure 4, every instruction received by the processor is decoded in two stages. A first stage is the instruction decode phase, and a second stage is the ALU phase. However, the control block 330 also adjusts the relative timing of the instruction decode phase signal and the instruction ALU phase signal from the instruction decode ROM, as illustrated in Figure 4. In the instruction decode phase, the instruction is identified and decoded by the instruction decode ROM. In the ALU phase, the ALU operation(s) required by the instruction - if any - is determined. In accordance with the preferred embodiment of the present invention, the control block 330 includes means for buffering the signals from the instruction decode ROM such that the instruction ALU phase for a given instruction is delayed by one memory cycle. As such, within the control means 330, in a given memory cycle the control means has available the result of the instruction decode phase of an instruction n, and the result of the ALU phase of an instruction n-1. The use of this information in each memory cycle to control the stack architecture is described in further detail hereinbelow.
Figure 4 shows a series of memory cycles, represented by time periods t0 to tn. Each time period represents a memory cycle. As illustrated by Figure 4, within the control block 330 a given instruction is decoded in a decode phase in one memory cycle, and then in the next memory cycle the ALU phase is determined for that instruction. In any given memory cycle, therefore, the control block 330 is provided with an instruction decode phase and an ALU phase, the respective phases corresponding to different successive
instructions. Time tn of Figure 4, for example, comprises an instruction decode phase for an instruction instrc n+1, and an ALU phase for the instruction instrc n.
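The pairing that results from delaying the ALU phase by one memory cycle can be sketched as a simple generator. This is an illustrative model only; the function name and instruction labels are invented and do not appear in the patent.

```python
def phase_pairs(instructions):
    """Yield, for each memory cycle, the (decode phase, ALU phase)
    pair visible to the control block: the decode of instruction N
    alongside the ALU phase of instruction N-1 (Figure 4). None
    marks a cycle in which a phase carries no instruction."""
    previous = None
    for instruction in instructions:
        yield (instruction, previous)
        previous = instruction
    yield (None, previous)  # final cycle drains the delayed ALU phase

# Two instructions produce the overlap of time tn in Figure 4:
# decode of "instrc n+1" alongside the ALU phase of "instrc n".
for decode, alu in phase_pairs(["instrc n", "instrc n+1"]):
    print(decode, alu)
```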
The stack architecture of the preferred embodiment of the present invention, where a stack memory 134 is provided with two hardware registers, advantageously enables the access to the stack memory to be optimised, by utilising the two-stage instruction decode described hereinabove with reference to Figure 4.
Before illustrating the particularly advantageous optimization of the stack architecture, the basic PUSH and PULL stack operations for the configuration of Figure 3 will be described with reference to Figure 5.
Figure 5(a) shows the stack architecture of Figure 3 with the values a and b loaded into the registers 130 and 132 respectively. The stack pointer is pointing at an address n-m in the internal stack, which is loaded with a value c, and labeled 404. Following a PUSH operation, the value b is moved into the address location n-m+1, labeled 406, and the stack pointer moves back one address to point at the memory location n-m+1 as shown in Figure 5(b). The value a moves into the register 132, and the register 130 is available for a new value. Thereafter, in the example of Figure 5, it is assumed that a PULL operation occurs. In a PULL operation, as shown in Figure 5(c), the value a is pulled back into the register 130, and the value b is retrieved from the current stack pointer location and loaded into register 132. The stack pointer then moves forward by one to point at the memory location n-m.
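The PUSH and PULL movements of Figure 5 can be modelled behaviourally as follows. This is a sketch under stated assumptions; the class and attribute names are invented for illustration and do not appear in the patent.

```python
class TwoRegisterStack:
    """Model of the Figure 3 arrangement: two registers hold the
    values normally kept in the top two stack locations, and the
    stack memory holds the rest (end of the list is the top)."""

    def __init__(self, reg_130=None, reg_132=None, memory=None):
        self.reg_130 = reg_130      # register 130: top-of-stack value
        self.reg_132 = reg_132      # register 132: second value
        self.memory = memory or []  # internal stack memory 134

    def push(self, value):
        # Figure 5(b): b spills into the stack memory, a shifts into
        # register 132, and the new value enters register 130.
        if self.reg_132 is not None:
            self.memory.append(self.reg_132)
        self.reg_132 = self.reg_130
        self.reg_130 = value

    def pull(self):
        # Figure 5(c): the value in register 132 returns to register
        # 130, and the top of the stack memory refills register 132.
        # Register 130's prior content is assumed already consumed
        # (e.g. by the ALU).
        self.reg_130 = self.reg_132
        self.reg_132 = self.memory.pop() if self.memory else None

# Figure 5 walk-through: a PUSH followed by a PULL restores the
# starting state, so the net stack pointer movement is zero.
s = TwoRegisterStack(reg_130="a", reg_132="b", memory=["c"])
s.push("x")
s.pull()
print(s.reg_130, s.reg_132, s.memory)
```

Note that a single access to the list (one append or one pop) suffices per operation, mirroring the single-port stack memory of the embodiment.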
Data may be fetched from a local variable within the stack. The location is specified as an offset from the stack base (SB) register 126.
The implementation of a stack architecture as described above enables the stack to be implemented using a single port memory. This enables the memory, and hence the processor device, to be made smaller, and the power consumption of the memory to be reduced. Such a single port memory is also more readily available on the market, and is more readily implementable in general process technology.
The above example of a PUSH operation followed by a PULL illustrates a particular advantage of the stack architecture of Figure 3, namely that the net change in the stack pointer movement is zero when a PUSH operation is followed by a PULL operation. In a preferred embodiment of the present invention this feature of the invention is utilized to further improve system efficiency. As discussed hereinabove, the instruction decode phase for a given instruction, and the ALU phase for a successive given instruction, occur in the same memory cycle. Each instruction decode phase is associated with an address movement on the stack, which movement may result in the stack pointer being adjusted by -1, 0, or +1. Similarly, each ALU phase is associated with an address movement on the stack, which may result in the stack pointer being adjusted by -1, 0, or +1.
Referring to Figure 6, the control means is advantageously provided with means for determining the net movement on the stack in a given memory cycle, i.e. the net movement as a result of an instruction decode phase and an ALU phase in a single memory cycle. Referring to Figure 6, in a decode phase the current instruction N is decoded in a step 502. Simultaneously thereto, in an ALU phase the ALU operation for the previous instruction N-1 is determined. The stack movement associated with the respective instruction decode and ALU phases is determined in respective steps 506 and 504. For any given instruction decode phase or ALU phase, the stack movement will be one up (+1), no change (0), or one down (-1). In a step 508 the stack movements for the two operations are summed. If the summed result is zero, then there is no overall change and the stack pointer will remain unchanged. If the summed result is non-zero, then stack movement occurs, and furthermore the result indicates the direction in which data is to be transferred between the registers 130 and 132 and the stack memory 134. The value at the output of the summing step 508 can be used to steer data on the stack and in the registers 130 and 132, as illustrated by step 510 in Figure 6. The steering of data on the stack and in the registers 130 and 132 is controlled by the control block 330, which controls the multiplexors 300 and 302 in an appropriate manner.
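The summation of step 508 amounts to the following. A minimal sketch; the function name and the sign convention (+1 for a movement in one direction, -1 for the other) are assumptions made for illustration, not taken from the patent.

```python
def net_stack_movement(decode_move: int, alu_move: int) -> int:
    """Sum the stack movements of the decode phase of instruction N
    and the ALU phase of instruction N-1 (step 508, Figure 6). Each
    phase contributes -1 (one down), 0 (no change) or +1 (one up).
    A zero sum means the stack pointer is untouched and no stack
    memory access is needed in that cycle; the sign of a non-zero
    sum gives the direction of transfer between the registers and
    the stack memory."""
    assert decode_move in (-1, 0, 1)
    assert alu_move in (-1, 0, 1)
    return decode_move + alu_move

# The iadd-then-bipush folding of Figure 7: the push required by the
# bipush decode cancels the pull required by the iadd ALU phase, so
# no stack access occurs in that memory cycle.
print(net_stack_movement(+1, -1))  # 0
```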
As discussed hereinabove, if the control block 330 determines there is no net stack movement in a given memory cycle, then the stack pointer remains unchanged and there are no operands pulled or pushed from/to the stack memory. An example of the operation of the stack architecture of Figure 3 is now described with reference to Figure 7.
In a first time frame T0, the program counter has a value P, and the current instruction bytecode is iadd. In a second time frame T1, the program counter moves to value P+1, and the current instruction bytecode is bipush 2. The values in the registers 130 and 132 are 01 and 02 respectively, and the current top of the stack has a value S3.
The iadd operation would be expected to require a pull operation, to pull the next value from the stack after the two integers in the registers 130 and 132 are added together. However, the bipush 2 operation requires a push operation, to push the value 2 into the register 132 and then to push the value in the register 132 into the stack.
In accordance with the preferred embodiment of the present invention, no stack movement is required when an iadd is followed by a bipush. The values 01 and 02 are added together, and the result steered to register 132. The value 2 is then loaded directly into register 130.
These results appear in the respective locations in time frame T2, in which the program counter moves to value P+3, and the current instruction is an iadd instruction, suggesting a pull operation to the stack is required.
In time period T3, the program counter moves to P+4, and the instruction retrieved is a nop. As such, the overall effect of the two instructions from T2 to T3 is a pull from the stack. Therefore in time period T4 the result 01+02+2 is steered to register 130, and the value S3 is pulled from the stack to register 132. The stack pointer is then moved on one location to point to the next operand therein, S4.
Thus, in the above example, it can be seen that any access to the stack can be minimized by looking at the net effect of instructions, and steering the data accordingly.
Referring now to Figure 8, there is illustrated how the processor generally illustrated by Figure 1 may be utilized in a system with a different processor.
As shown in Figure 8, the processor of Figure 1, generally identified by reference numeral 600, is shown connected to a system bus 608 via an interface or converter 602. A bus 610 transfers data between the processor 600 and the interface 602, and a bus 612 transfers data between the interface 602 and the main system bus 608. A further processor 606 communicates with the system bus via a local bus 614. As shown in Figure 8, the system memory 604 is also connected to the system bus.
For the purposes of this example, it is assumed that the system bus 608 is designed in conjunction with the further processor 606, such that the processor 606 is directly compatible with the bus 608. For example, the processor 606 may be an ARM processor, and the system bus 608 may be an AHB bus. As mentioned in relation to Figure 1, the bus is a PVCI bus in one embodiment. An ARM processor, in common with processors generally, performs read and write operations in two phases, as is illustrated by Figure 9. Specifically, in a first cycle an address ADD1 700 to be read from or written to is output. If the operation is a read cycle, then in a second cycle a next address ADD2 702 is output on the address bus, and data RDDATA1 704 associated with address ADD1 is output. If the operation is a write cycle, then in a second cycle a next address ADD2 702 is output and the data WRDATA1 706 to be written to the address ADD1 is placed on the data bus.
In accordance with a preferred embodiment of the present invention, the write cycle implemented within the processor 600 as shown in Figure 1 is adapted as shown in Figure 10. In the processor according to Figure 1, in a write operation the write data is preferably placed on the data bus at the same time as the address to which the data is to be written. Thus, referring to Figure 10, the address ADD1 700 is placed on the address data bus at the same time as the write data WRDATA1 706 in a first cycle. In a second memory cycle the address ADD2 702 is placed on the address bus at the same time as the write data WRDATA2 708 is placed on the data bus. Thus a write operation is completed - and frees the data bus - in a single memory cycle. As a result,
write data is written to memory one cycle earlier than in the write technique of the processor 606.
It should be noted that although the principle of placing write data on the data bus at the same time as a write address is described herein with specific reference to the exemplary processor environment of Figure 1, the principle is more generally applicable.
In the example of Figure 8, the write addressing technique of Figure 10 is clearly incompatible with that of the processor 606. The processor 600 therefore preferably communicates with the system bus 608 via the interface 602. In the direction of data from the processor 600, the interface 602 operates, in a write cycle, to delay the write data bus by one memory cycle. In the case of data going to the processor 600, the interface 602 operates to buffer the address bus such that the address bus is delayed by one memory cycle relative to the write data bus. In this way the data on the output bus 612 of the interface 602 is in a format compatible with the system bus 608.
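The write path of the interface can be sketched as a one-cycle register delay on the data bus (a hypothetical model, not the patent's implementation): address/data pairs driven together by the processor 600 emerge in the two-phase format the system bus expects.

```python
# Sketch of the interface's write path: hold each data beat in a register
# for one cycle, so that data driven together with its address re-emerges
# one cycle behind that address, as the two-phase system bus expects.

def adapt_writes(cycles):
    """cycles: per-cycle (address, data) pairs as driven by processor 600.

    Returns per-cycle (address, data) pairs as seen on the system bus,
    with the data bus delayed by one memory cycle."""
    out = []
    held_data = None                        # the one-cycle delay register
    for addr, data in cycles + [(None, None)]:  # extra cycle to drain
        out.append((addr, held_data))
        held_data = data
    return out

print(adapt_writes([(0x10, "WRDATA1"), (0x14, "WRDATA2")]))
# [(16, None), (20, 'WRDATA1'), (None, 'WRDATA2')]
# -- i.e. the Figure 9 two-phase pattern, recovered from Figure 10 input
```

The read-direction adaptation described in the text (delaying the address bus instead) would be the mirror image of this register delay.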
It should be noted that the interface block 602 of Figure 8 is preferably implemented as part of the block 104 of Figure 1.
The processor described herein is suitable for numerous different configurations. Example configurations include:
- As a slave microprocessor in a multi-processor system (such as a mobile telephone handset).
- As a stand-alone microprocessor in a system-on-programmable-chip application.
- As a stand-alone microprocessor in a system-on-chip application.
A preferred implementation of the processor uses a hardware/software mix.
Most of the byte-code instructions are implemented directly in hardware, with the more complex operations either microcoded or external.
Although the invention has been described herein with reference to particular implementations and embodiments, the invention is not limited to such. The scope of the present invention is defined by the appended claims.

Claims (8)

Claims
1. A buffer for receiving instructions to be processed, wherein the instructions are comprised of at least one sub-set of data, the largest instruction comprising m sub-sets of data, wherein n sub-sets of data are loaded into the buffer on a processing cycle, further wherein the size of the buffer is the greater of [(m-1)/n + 1] or 2, multiplied by the size of the sub-set.
2. A buffer according to claim 1 wherein the size of the sub-set is one byte, n=4 and m=5.
3. A buffer according to claim 1 or claim 2 wherein the number of sub-sets of data loaded on a processing cycle corresponds to the size of a data bus.
4. A buffer according to any one of claims 1 to 3 wherein the instructions are microprocessor instructions.
5. A buffer according to any one of claims 1 to 4 wherein the instructions are accessed by a microprocessor from an external memory.
6. A buffer according to any one of claims 1 to 5, further comprising an instruction extraction unit for extracting an instruction to be processed therefrom.
7. A buffer substantially as described herein with reference to, or as shown in, any one of Figures 1 or 2.
8. A buffer substantially as described herein.
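On one reading of the formula in claim 1 (assuming integer division, and taking the result as a count of n-sub-set bus loads the buffer must hold; the claim wording is ambiguous on both points), the claim 2 figures work out as follows:

```python
import math

def buffer_loads(m, n):
    """Number of n-sub-set loads the buffer holds, per one reading of
    claim 1: the greater of (m - 1)/n + 1 or 2, with integer division."""
    return max(math.floor((m - 1) / n) + 1, 2)

# Claim 2's figures: byte-wide sub-sets, 4 sub-sets per load (n = 4),
# and a largest instruction of 5 bytes (m = 5).
loads = buffer_loads(5, 4)
print(loads)       # 2 loads
print(loads * 4)   # i.e. an 8-byte buffer for byte-wide sub-sets
```

The floor of 2 in the formula ensures that even when every instruction fits in a single load, the buffer can hold the current load and the next, so that an instruction straddling a load boundary is always fully available.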
Publications (2)

Publication Number Publication Date
GB0130931D0 GB0130931D0 (en) 2002-02-13
GB2390178A true GB2390178A (en) 2003-12-31



