GB2332075A

GB2332075A - Optimized instruction storage and distribution for parallel processor architecture

Info

Publication number: GB2332075A
Application number: GB9725808A
Authority: GB
Inventors: Alexander Tulai
Original assignee: Mitel Corp
Current assignee: Microsemi Semiconductor ULC
Priority date: 1997-12-06
Filing date: 1997-12-06
Publication date: 1999-06-09
Anticipated expiration: 2017-12-06
Also published as: FR2772952A1; GB2332075B; GB9725808D0; FR2772952B1; CA2254200A1; SE9804202D0; SE9804202L; DE19854810A1

Abstract

A method of improving the utilization of program memory in a multi-parallel processor architecture which utilizes an instruction register file (IRF). The IRF is partitioned into two pages and grouping bits are added to the program instructions to designate the fetch cycle to which the instruction belongs. Routing bits are also used to route the instructions properly to the designated processor. The relative position of the routing instruction within the set of instructions is also used to provide routing information.

Description

OPTIMIZED INSTRUCTION STORAGE AND DISTRIBUTION FOR PARALLEL PROCESSORS ARCHITECTURES

Field of the Invention

This invention relates to multiple processors in a parallel configuration and more particularly to a method of improving the utilization of program memory in the process of fetching and distributing instructions when an instruction register file is used.

Background of the Invention

In a single processor architecture, the execution of a program is conventionally divided into three major phases. These phases are: instruction fetching which involves reading one instruction from the program memory into the instruction register (IR); instruction decoding which involves decoding the instructions from IR and preparing the control signals for its execution; and executing the instruction.

In a parallel processor architecture, multiple instructions have to be read from the program memory into multiple instruction registers that could be organized into an instruction register file (IRF). If there are n processors in a multi-parallel processor architecture, there should be at least n instructions which are read from the program memory into the IRF if unnecessary delays are to be avoided. In multiple processor architectures, however, it is not guaranteed that all the processors will have an instruction to execute every cycle. In this case, a no operation (NOP) instruction will have to be routed to the processor for decoding and execution. Obviously, storing NOP instructions in the program memory is wasteful of program memory and ways of eliminating NOPs have been investigated.

Simply eliminating the NOP instructions creates routing problems in a multiple parallel processor architecture, as it is impossible to successfully route the instructions without additional information.

To overcome this problem it is known to add control bits to the instructions stored in the program memory for use in grouping instructions and routing information to the intended processor. The requirement to introduce grouping and routing bits to the instructions adds complexity to the system architecture and increases power requirements.

Summary of the Invention

The present invention seeks to provide better utilization of the program memory in a multi-parallel processor implementation by allowing instructions to stretch between two consecutive instruction packs.

The present invention provides a simplified instruction distribution circuit in that the bit routing coding makes use of the instruction position within the group (for groups of two or more instructions).

In the present invention unused distribution control bits are created in certain cases and these bits may be used for additional functionality.

Therefore, in accordance with a first aspect of the present invention there is provided, in a multi-parallel processor architecture having a program memory for storing processor instructions and an instruction register file for decoding instructions fetched from the program memory for execution by selected ones of the parallel processors, a method of improving utilization of the program memory. The method comprises a) partitioning the instruction register file; b) providing a grouping bit to the instructions to identify the fetch cycle to which the instruction belongs; and c) providing routing bits to the instructions to designate which of the processors the instruction is for, wherein the relative position of the routing bit within the instruction provides routing information.

In accordance with a second aspect of the invention there is provided a system for distributing instructions from a program memory to multi-parallel processors comprising: a partitioned instruction register file (IRF) for receiving and decoding instructions from the program memory; multiple buses for carrying instructions from the memory to the IRF; routing circuitry for directing instructions to designated processors; and a bit route coding sequence to distribute instructions, the route coding utilizing the instruction position within the sequence to provide routing information.

Brief Description of the Drawings

The invention will now be described in greater detail with reference to the attached drawings wherein: Figure 1 illustrates a multi-parallel processor architecture according to the prior art; Figure 2 illustrates the problem caused by eliminating non-operational instructions from the set of instructions; and Figure 3 illustrates a four-processor architecture which implements the present invention.

Detailed Description of the Invention

As previously discussed the execution of a program is divided in most of the processors in use today into three major phases. These are:

1) instruction fetching;
2) instruction decoding; and
3) instruction execution.

In a parallel processor architecture multiple instructions have to be read from the Program Memory into multiple IRs that could be organized in an Instruction Register File (IRF). Assuming that a certain architecture uses "n" processors, at least "n" instructions should be read from the Program Memory into the IRF if unnecessary delays are to be avoided. However, in multi-processor architectures it is not guaranteed that all the processors will have an instruction to execute every cycle, in which case a NOP (No Operation) instruction will have to be routed to the processor for decoding and execution. Storing NOP instructions in the Program Memory is rather wasteful and ways of eliminating them have been sought. Figure 1 illustrates a multi-parallel processor architecture in which the program memory has stored NOP instructions respecting processors 2 and 3. These NOP instructions are fetched to the Instruction Register File for delivery to the respective processors. Obviously this results in memory usage involving no exchange of meaningful data.

If the undesired NOP instructions are eliminated, the routing of the instructions is impossible without additional information. Figure 2 illustrates the assignment problem for the case of n=4. In this example the NOP instructions relating to processors 2 and 3 have been eliminated and the program memory which would have been used for NOP instructions is used for other processor instructions. As indicated in Figure 2 this results in instructions intended for processor 4 being wrongly directed to processor 2.
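The assignment problem of Figure 2 can be reproduced with a short Python sketch. It is an illustration only: the instruction names (such as "A4", meaning cycle A, processor 4) and the helper functions are hypothetical, and routing by slot position stands in for a distribution circuit that has no control bits.

```python
NOP = "NOP"

def strip_nops(fetch_lines):
    """Drop the NOPs and repack the remaining instructions into lines of n."""
    n = len(fetch_lines[0])
    useful = [ins for line in fetch_lines for ins in line if ins != NOP]
    return [useful[k:k + n] for k in range(0, len(useful), n)]

def route_by_slot(line, n):
    """Send the instruction in slot k to processor k+1 (no control bits)."""
    padded = line + [NOP] * (n - len(line))
    return {proc + 1: padded[proc] for proc in range(n)}

# Hypothetical program for n = 4 with NOPs stored for processors 2 and 3.
original = [["A1", NOP, NOP, "A4"],
            ["B1", "B2", "B3", "B4"]]

packed = strip_nops(original)
first_cycle = route_by_slot(packed[0], 4)
print(first_cycle)  # processor 2 now receives "A4", which was meant for processor 4
```

With the NOPs removed, the first fetch line also mixes instructions of two different cycles, which is why both grouping and routing information are needed.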

To solve this problem, control bits are added to the instructions stored in the Program Memory. The number of instructions per cycle goes from constant (4 in Figure 1) to variable (anywhere from 1 to 4 in the case of a 4 processor architecture, or 1 to n in the general case of n processors). These control bits carry two kinds of information: 1) grouping information (grouping together all the instructions that have to be executed in the same cycle but on different processors); and 2) routing information (maps an instruction to a certain processing unit).

Because of these definitions they shall be referred to as grouping control bits and routing control bits.

In such systems the issues that have to be addressed are:

1) How are the instructions stored in the Program Memory and how many of them are written into the IRF in one fetch cycle?
2) What is the optimal size of the IRF (how many instructions can it accommodate)?
3) What configuration of control bits allows for an optimal distribution of the instructions from the IRF to the processing units?
4) How are the flow control changes (jumps, calls to subroutines etc.) handled? and
5) How does the size of the IRF influence the number of reads from the PM and the impact on the power consumption of the device?

The present invention demonstrates that the Program Memory waste could be further reduced and the instruction distributing circuitry could be simplified by:

1. allowing a set of instructions (that is to be executed in the same cycle) to spread over two consecutive Program Memory fetch lines;
2. dimensioning the IRF to 2n, where n is a power of 2 (but not necessarily);
3. coding the distribution control bits as follows:
3.1) use $r = [\log_2(2n-1)]$ bits per instruction for routing control;
3.2) in a set of p instructions belonging to the same cycle, with $p \ge m$ (where m is the minimum integer such that $m r \ge n$), assign each distribution control bit of the first m instructions to one of the n processors and set them to 0 or 1 to indicate which processor receives which instruction in the set of n (the matching is done positionally from left to right);
4. if the grouping control bits indicate that more than n instructions belong to the same group, do not advance the decoding pointer in the IRF;
5. upon a flow control change, set all the grouping bits of the IRF that will not be written to during the first fetch cycle to such a value that an instruction spreading over two consecutive Program Memory locations (at the address jumped to) would not be falsely grouped with instructions left over in the IRF before the flow control change occurred.

As mentioned above, in a system with n processors, at least n instructions should be read at a time from the Program Memory (PM) if delays are to be avoided.

Consequently a minimum IRF capacity of n instructions guarantees that no delays are introduced during the fetching phase.

If the IRF can store more than n instructions, that would allow the elimination of the fetching phase upon jumps to locations that are already in the IRF. From this point of view a larger IRF would behave like a cache memory. However, the size of the routing circuit needed to send an instruction from the IRF to the proper processor becomes huge when any IRF register could be routed to any processor, a situation that occurs if one wants to eliminate the memory wasteful NOPs by accepting a variable instruction register. In these conditions the size of the routing circuitry is kept to a minimum and no additional delays are introduced during the fetching phase if the capacity of the IRF is set to exactly n instructions. However, a third factor in deciding the size of the IRF is the waste of program memory locations that occurs when the size of the IRF is exactly n and the number of instructions to be decoded and executed every cycle is variable (anywhere from 1 to n).

When variable sized instructions are packed in groups of n and stored in the program memory, it could happen that the room left in the current pack is not enough to fit the next instruction, in which case the rest of the pack will be filled with NOPs and a new pack started. The worst scenario possible is that (n-1) locations are available in the current pack while the next cycle instruction length is exactly n. In such a case the waste could be as high as (n-1) instructions. The best case is obviously when instructions could be fitted exactly in an n instruction pack.

To address this waste, we could allow an instruction to stretch between two packs of length n and thus eliminate any waste of PM locations. However, this feature requires the extension of the IRF capacity from n to 2n.
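The packing waste described above can be counted with a small Python sketch. The function names are hypothetical, and each "cycle instruction" is modelled only by its length (1 to n individual instructions); the sketch counts the NOPs inserted when a cycle instruction may not cross a pack boundary, whereas allowing it to stretch between two packs needs none.

```python
def padding_nops(group_sizes, n):
    """NOPs inserted when a cycle instruction may not cross a pack boundary."""
    pos = 0    # next free program memory slot
    nops = 0
    for size in group_sizes:
        room = (-pos) % n      # slots still free in the current pack
        if size > room:        # does not fit: fill the rest of the pack with NOPs
            nops += room
            pos += room
        pos += size
    return nops

def packs_needed(slots, n):
    """Number of n-instruction packs needed to hold the given number of slots."""
    return -(-slots // n)      # ceiling division

n = 4
sizes = [1, 4, 1, 4]           # hypothetical lengths of consecutive cycle instructions
padded_slots = sum(sizes) + padding_nops(sizes, n)
print(padding_nops([1, 4], n))                                 # worst case: n-1 = 3
print(packs_needed(padded_slots, n), packs_needed(sum(sizes), n))
```

For the hypothetical sequence above, padding needs four packs while stretching across pack boundaries needs three.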

A pointer within the IRF will indicate where the next instruction to be decoded starts. When this points to an instruction that starts in one pack and finishes in the next pack, the instruction cannot be decoded unless the rest of it is fetched from the PM and available in the IRF. That is why doubling the size of the IRF from n to 2n solves the problem, as we could alternately fetch into one half of the IRF or the other, and when an instruction that stretches between two consecutive packs is to be decoded both the beginning and the end of the instruction are found in the IRF. Wrap-around is used in such a case to maintain the continuity of an instruction.
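The alternating fetch and the wrap-around of the pointer can be sketched as follows. This is a behavioural Python model under stated assumptions, not the circuit of the invention: the class name `TwoPageIRF`, its methods and the instruction labels are hypothetical, and the size of each cycle instruction is passed in directly rather than derived from grouping bits.

```python
class TwoPageIRF:
    """Behavioural model of an IRF with two pages of n instructions each."""

    def __init__(self, n):
        self.n = n
        self.slots = [None] * (2 * n)
        self.pointer = 0      # where the next cycle instruction starts
        self.next_page = 0    # page overwritten by the next fetch

    def fetch(self, pack):
        """Write one pack of n instructions into the page whose turn it is."""
        assert len(pack) == self.n
        base = self.next_page * self.n
        self.slots[base:base + self.n] = pack
        self.next_page ^= 1   # alternate between the two halves

    def read_group(self, size):
        """Read `size` instructions from the pointer, wrapping modulo 2n."""
        idx = [(self.pointer + k) % (2 * self.n) for k in range(size)]
        self.pointer = (self.pointer + size) % (2 * self.n)
        return [self.slots[i] for i in idx]

irf = TwoPageIRF(4)
irf.fetch(["a", "b1", "b2", "b3"])     # page 0
irf.fetch(["b4", "c", "d1", "d2"])     # page 1
first = irf.read_group(1)              # ["a"]
second = irf.read_group(4)             # stretches from page 0 into page 1
irf.fetch(["d3", "e", "f", "g"])       # page 0 is reused
third = irf.read_group(1)              # ["c"]
fourth = irf.read_group(3)             # wraps from page 1 back to page 0
```

The last read shows the wrap-around: the cycle instruction starts in slots 6 and 7 of the IRF and ends in slot 0.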

Having two pages of n instructions significantly increases the size of the IRF circuitry and that of the routing circuitry; however, considerable program memory savings are made possible (some examples on a 4 processor architecture have shown savings of up to 20% for certain programs). In addition to this, 2n locations are enough for the code of some tight loops, the kind we encounter in filtering, to be fully stored in the IRF, and that would avoid program memory fetches during filtering and consequently would reduce the overall power consumption of the chip. Considering these advantages an IRF with a capacity of 2n is optimal. Increasing the capacity of the IRF beyond 2n instructions could reduce the power consumption in certain cases, but at a very high cost in increased IRF and routing circuitry, and it is not justifiable in general.

The control bits used for grouping are used to indicate which instructions from the IRF should be routed to the n processors for decoding during the current cycle. The minimum number of bits used for this operation for each instruction is 1. The routing circuitry will analyse the grouping bit for n consecutive IRF instructions and the decisions taken are summarized in Table 1.

Table 1: Grouping control bits decoder (a)

Instr. 1   Instr. 2   Instr. 3   ...   Instr. n-1   Instr. n   Decision
p̄          x          x          ...   x            x          1 instruction cycle
p          p̄          x          ...   x            x          2 instruction cycle
p          p          p̄          ...   x            x          3 instruction cycle
...                                                            and so on
p          p          p          ...   p̄            x          n-1 instruction cycle
p          p          p          ...   p            p̄          n instruction cycle
p          p          p          ...   p            p          0 instruction cycle; NOPs will be pushed to all processors

a. x = don't care; p = 0|1; p̄ = 1|0.

$r = [\log_2(2n-1)]$ bits are required to identify to what processor an individual instruction should be routed, where [ ] is the integer part function defined as:

$$[x] = \begin{cases} x, & x \in \mathbb{N} \\ n, & n < x < n+1,\ n \in \mathbb{N} \end{cases}$$

However, if each instruction carries its own routing bits, a redundancy appears when groups of p instructions with $p > m$, where m is such that $m r \ge n$, do not exploit the position of the instruction in the group.
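The grouping decision of Table 1 can be sketched in Python under one reading of the table, which is an assumption here: every instruction of a group carries the grouping bit value p except the last one, which carries the complement of p. The function name `decode_group` and the choice p = 0 (so that the closing bit is 1) are hypothetical.

```python
def decode_group(window, end_bit=1):
    """Number of instructions to issue this cycle, from n grouping bits.

    `window` holds the grouping bits of the n consecutive IRF instructions
    starting at the decoding pointer. Assumption: the last instruction of a
    group carries `end_bit`; all earlier ones carry its complement.
    Returns 0 when no closing bit is found, in which case NOPs are pushed
    to all processors and the pointer is not advanced.
    """
    for k, bit in enumerate(window):
        if bit == end_bit:
            return k + 1      # bits after the closing one are "don't care"
    return 0

print(decode_group([1, 0, 1, 1]))  # 1 instruction cycle
print(decode_group([0, 0, 1, 0]))  # 3 instruction cycle
print(decode_group([0, 0, 0, 0]))  # 0 instruction cycle
```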

Consider the example where n=4.

The number of routing bits r required for routing one instruction is: $r = [\log_2(2 \cdot 4 - 1)] = [2.8] = 2$, which indeed corresponds to the 4 possible combinations one can make with two bits.

For m=2 we have: mr=4=n so for any group of p > 2 instructions we have some redundancy within the routing bits if the position of the instruction in the group is not exploited.
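The two quantities used above, r and m, can be computed for any n with the following Python sketch. The function names are hypothetical; the integer part of the base-2 logarithm is obtained exactly from the bit length of the integer 2n-1, which avoids floating-point rounding.

```python
def routing_bits(n):
    """r = [log2(2n - 1)], the integer part of log2(2n - 1)."""
    if n < 2:
        raise ValueError("at least two processors are assumed")
    return (2 * n - 1).bit_length() - 1

def min_group_for_mask(n):
    """m = the minimum integer such that m * r >= n."""
    r = routing_bits(n)
    return -(-n // r)          # ceiling of n / r

for n in (2, 4, 8, 16):
    print(n, routing_bits(n), min_group_for_mask(n))
```

For n=4 this gives r=2 and m=2, in agreement with the worked example.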

For p=3, r=2, n=4 and the following three instructions:

$g^1\; r_0^1 r_1^1\; i_1^1 i_2^1 \ldots i_L^1$
$g^2\; r_0^2 r_1^2\; i_1^2 i_2^2 \ldots i_L^2$
$g^3\; r_0^3 r_1^3\; i_1^3 i_2^3 \ldots i_L^3$

where:
g's are the grouping bits,
r's are the routing bits,
i's are the instruction bits, and
L is the instruction length,

and let's assume as well that the 3 instructions should go to processors 1, 3 and 4. With g=1 and the natural assignment of the 2 bit combinations we have the following control bits for the example given:

$0\; 00\; i_1^1 i_2^1 \ldots i_L^1$
$0\; 10\; i_1^2 i_2^2 \ldots i_L^2$
$1\; 11\; i_1^3 i_2^3 \ldots i_L^3$

However, the following group of identical instructions would be distributed just as well to the processors 1, 3 and 4, for the simple reason that each instruction carries its own routing bits and the order of the instructions in the group doesn't count:

$0\; 11\; i_1^3 i_2^3 \ldots i_L^3$
$0\; 00\; i_1^1 i_2^1 \ldots i_L^1$
$1\; 10\; i_1^2 i_2^2 \ldots i_L^2$

This introduces a redundancy that translates into a somewhat faster circuit, but a significantly larger one than in the case when the position of the instructions in the group would be exploited.
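The order independence described above can be checked with a short Python sketch. It assumes the "natural assignment" means that the 2-bit value read as a binary number selects the processor (00 for processor 1 up to 11 for processor 4); the function name and the instruction labels are hypothetical.

```python
def route_by_own_bits(group, n=4):
    """Route each instruction by its own routing bits.

    `group` is a list of (routing_bits, name) pairs, where routing_bits is a
    string such as "10". Assumption: "00" selects processor 1, "01"
    processor 2, "10" processor 3 and "11" processor 4.
    """
    outputs = ["NOP"] * n
    for routing_bits, name in group:
        outputs[int(routing_bits, 2)] = name
    return outputs

in_order = [("00", "instr 1"), ("10", "instr 2"), ("11", "instr 3")]
reordered = [("11", "instr 3"), ("00", "instr 1"), ("10", "instr 2")]

print(route_by_own_bits(in_order))
print(route_by_own_bits(reordered))   # same routing, whatever the order
```

Both orderings deliver the instructions to processors 1, 3 and 4, which is the redundancy that the positional coding removes.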

If we assume that the three instructions to be routed to processors 1, 3 and 4 are placed exactly in this order when packed and placed in the program memory, we would need exactly n=4 bits to show how the mapping is done.

By concatenating the routing bits of the first two instructions ($r_0^1 r_1^1$ and $r_0^2 r_1^2$) we get exactly the 4 bits needed to show how the assignment is done, and the following table will cover all possible cases for groups of 2, 3 and 4 instructions.

Table 2: Proposed routing bits assignment

In the table, $i^k$ stands for the k-th instruction of the group, $i_1^k i_2^k \ldots i_L^k$.

p   $r_0^1 r_1^1 r_0^2 r_1^2$   Routing decision
2   0011   NOP -> proc. 1, NOP -> proc. 2, $i^1$ -> proc. 3, $i^2$ -> proc. 4
2   0101   NOP -> proc. 1, $i^1$ -> proc. 2, NOP -> proc. 3, $i^2$ -> proc. 4
2   1001   $i^1$ -> proc. 1, NOP -> proc. 2, NOP -> proc. 3, $i^2$ -> proc. 4
2   0110   NOP -> proc. 1, $i^1$ -> proc. 2, $i^2$ -> proc. 3, NOP -> proc. 4
2   1010   $i^1$ -> proc. 1, NOP -> proc. 2, $i^2$ -> proc. 3, NOP -> proc. 4
2   1100   $i^1$ -> proc. 1, $i^2$ -> proc. 2, NOP -> proc. 3, NOP -> proc. 4
3   0111   NOP -> proc. 1, $i^1$ -> proc. 2, $i^2$ -> proc. 3, $i^3$ -> proc. 4
3   1011   $i^1$ -> proc. 1, NOP -> proc. 2, $i^2$ -> proc. 3, $i^3$ -> proc. 4
3   1101   $i^1$ -> proc. 1, $i^2$ -> proc. 2, NOP -> proc. 3, $i^3$ -> proc. 4
3   1110   $i^1$ -> proc. 1, $i^2$ -> proc. 2, $i^3$ -> proc. 3, NOP -> proc. 4
4   1111   $i^1$ -> proc. 1, $i^2$ -> proc. 2, $i^3$ -> proc. 3, $i^4$ -> proc. 4

Depending on how the circuit is implemented, the last coding could prove redundant just as well, because for the case p=4 it is clear which instruction goes to which processor (because we have an equal number of instructions and processors).
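The positional assignment of Table 2 can be sketched in Python as follows. The function name `route_by_position` and the instruction labels are hypothetical; the mask is the concatenation of the routing bits of the first two instructions, read from left to right, with one bit per processor.

```python
def route_by_position(mask, instructions):
    """Distribute a group using one mask bit per processor.

    `mask` is a string of n bits (for n = 4, the routing bits of the first
    two instructions concatenated). A 1 in position k means processor k+1
    takes the next instruction of the group, in order; a 0 means it gets
    a NOP.
    """
    if mask.count("1") != len(instructions):
        raise ValueError("mask must select one processor per instruction")
    pending = iter(instructions)
    return [next(pending) if bit == "1" else "NOP" for bit in mask]

# The earlier example: three instructions for processors 1, 3 and 4.
print(route_by_position("1011", ["i1", "i2", "i3"]))
# A group of two instructions for processors 3 and 4.
print(route_by_position("0011", ["i1", "i2"]))
```

Because the order of the instructions in the group now carries information, the routing bits of the third and fourth instructions are left free.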

Going back to the previous example, we see now that bits $r_0^3 r_1^3$ are not used any more. These bits are redundant.

In the case of 4 instructions in the group, not only are two more bits becoming redundant ($r_0^4 r_1^4$), but depending on the implementation even the first four bits ($r_0^1 r_1^1$ and $r_0^2 r_1^2$) could be redundant.

Because the routing of the instructions from the IRF to the processors is done after the instructions have been fetched from the program memory into the IRF, at least one cycle of delay will be introduced during an instruction flow change to a PM location that is not already loaded into IRF.

During this time NOPs will be pushed to all processors.

However if the instruction at the address jumped to, is one that stretches over two consecutive packs of n instructions, an additional cycle is needed to load the second pack into the second IRF page, before the instruction could actually be routed to the appropriate processor.

However, because of the previous instructions left over in that IRF page, an early and wrong routing might take place. To prevent this, during a jump the second page of the IRF (the first in this case being the one into which the first PM location is fetched) will have all its grouping bits set to a value such that the decoder will be forced to default to the last case in Table 1, with NOPs being pushed to all processors and the pointer within the IRF preserving the old value.

This is a very elegant way of handling the jumps because it does not require any additional circuitry.

Moreover, the circuit will work just as well during RESET, when all the grouping bits will be held at the same value and NOPs will be pushed automatically to all units while the pointer will be locked at 0.
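The jump handling can be illustrated with a Python sketch built on one assumed reading of the grouping scheme: the last instruction of a group carries grouping bit 1, all earlier ones carry 0, and a window with no closing bit yields a 0 instruction cycle. All names and bit patterns below are hypothetical.

```python
def decode_group(window, end_bit=1):
    """Instructions to issue this cycle; 0 means NOPs to all processors."""
    for k, bit in enumerate(window):
        if bit == end_bit:
            return k + 1
    return 0

def window_at(grouping_bits, pointer, n):
    """Grouping bits of the n IRF slots starting at the pointer (wrapping)."""
    size = len(grouping_bits)
    return [grouping_bits[(pointer + k) % size] for k in range(n)]

n = 4
# Page 0 was just fetched from the jump target; the target group starts in
# slot 3 and continues in the next pack (3 instructions in total).
page0 = [0, 1, 1, 0]
stale_page1 = [1, 0, 0, 1]     # grouping bits left over from before the jump
cleared_page1 = [0, 0, 0, 0]   # forced to the non-closing value on the jump
loaded_page1 = [0, 1, 0, 1]    # second pack of the target, once fetched

wrong = decode_group(window_at(page0 + stale_page1, 3, n))
held = decode_group(window_at(page0 + cleared_page1, 3, n))
right = decode_group(window_at(page0 + loaded_page1, 3, n))
print(wrong, held, right)
```

With the stale bits the decoder would issue a group that mixes new and old instructions; with the cleared bits it issues NOPs and waits until the second pack is loaded. The same all-equal pattern covers the RESET case.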

Figure 3 shows an architecture with 4 processors, 3 control bits (one for grouping and two for routing), and an IRF with two pages of 4 instructions each. Not shown in Figure 3 is the circuitry that enables writing to the IRF, based on the value of the current IRF pointer, the grouping and control bits and the instructions to be executed. Figure 3 does show a 4 processor implementation wherein the IRF has two pages of four instructions each. From the program memory, four buses carry four instructions to the IRF during a fetch cycle.

Each bus is $L_c + 1$ bits wide, with three control bits and L instruction bits, so that $L_c = L + 2$.
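The layout of one bus word can be sketched in Python. The bit order is an assumption (grouping bit first, then the two routing bits, then the L instruction bits, matching the order in which the fields are written in the examples), and the function names are hypothetical.

```python
def pack_word(g, r, instr, L):
    """Build one bus word of L + 3 bits: grouping bit, 2 routing bits, L instruction bits."""
    assert g in (0, 1) and 0 <= r < 4 and 0 <= instr < (1 << L)
    return (g << (L + 2)) | (r << L) | instr

def unpack_word(word, L):
    """Split a bus word back into (grouping bit, routing bits, instruction bits)."""
    g = (word >> (L + 2)) & 1
    r = (word >> L) & 0b11
    instr = word & ((1 << L) - 1)
    return g, r, instr

L = 8                               # hypothetical instruction length
word = pack_word(1, 0b11, 0xA5, L)
L_c = L + 2                         # index of the most significant bus bit
print(word.bit_length() == L_c + 1, unpack_word(word, L))
```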

Although one implementation of the invention has been described and illustrated, it will be apparent to one skilled in the art that several alterations can be made without departing from the basic concept. It is to be understood that such alterations will fall within the scope of the invention as defined by the appended claims.


Claims

1. In a multi-parallel processor architecture having a program memory for storing processor instructions and an instruction register file for decoding instructions fetched from said program memory for execution by selected ones of said parallel processors, a method of improving utilization of said program memory comprising: a) partitioning said instruction register file; b) providing a grouping bit to said instructions to identify the fetch cycle to which said instruction belongs; and c) providing routing bits to said instructions to designate to which of said processors said instruction is to be routed, wherein the position of said routing bit within said instructions provides routing information.
2. A method as defined in claim 1, wherein said instruction register file is partitioned into two sections.
3. A method as defined in claim 2 wherein the number of parallel processors is n and the capacity of the instruction register file is 2n.
4. A method as defined in claim 3 wherein the number of routing bits (r) is in accordance with the expression: $r = [\log_2(2n-1)]$.

5. A method as defined in claim 1 wherein said grouping bit is used to indicate which instructions from the instruction register file are to be decoded in the current cycle.
6. A system for distributing instructions from a program memory to multiparallel processors comprising: a partitioned instruction register file (IRF) for receiving and decoding instructions from the program memory; multiple buses for carrying instructions from the memory to the IRF; routing circuitry for directing instructions to designated processors; and a bit route coding sequence to distribute instructions, the route coding utilizing the instruction position within the sequence to provide routing information.
7. A system as defined in claim 6 wherein said instruction register file is partitioned into two pages.
8. A system substantially as herein described, with reference to figure 3 of the accompanying drawings.
