WO2001090888A1 - A data processing system having an address generation unit with hardwired multidimensional memory indexing support - Google Patents

A data processing system having an address generation unit with hardwired multidimensional memory indexing support

Info

Publication number
WO2001090888A1
WO2001090888A1 PCT/EP2000/004671 EP0004671W WO0190888A1 WO 2001090888 A1 WO2001090888 A1 WO 2001090888A1 EP 0004671 W EP0004671 W EP 0004671W WO 0190888 A1 WO0190888 A1 WO 0190888A1
Authority
WO
WIPO (PCT)
Prior art keywords
generation unit
address generation
processing device
instance
memory
Prior art date
Application number
PCT/EP2000/004671
Other languages
French (fr)
Inventor
Jean-Paul Theis
Original Assignee
Theis Jean Paul
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Theis Jean Paul
Priority to PCT/EP2000/004671
Publication of WO2001090888A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/0207: Addressing or allocation; Relocation with multidimensional access, e.g. row/column, matrix
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34: Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • G06F9/345: Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes of multiple operands or results
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Figure 1 shows a processing system containing a processing device (microprocessor, CPU, DSP, micro-controller) and an address generation unit with 'hardwired' multidimensional memory indexing support as based on the present invention.
  • Figure 1 shows the register transfer level architecture of a processing system containing a processing device (microprocessor, CPU, DSP, micro-controller) and an address generation unit with 'hardwired' multidimensional memory indexing support as based on the present invention.
  • Figure 1 shows the data required for initialization and operation of the address generation unit and the data exchanged between the processing device and the address generation unit.
  • Also shown is a memory/cache from/to which the address generation unit may optionally load/store data according to the addresses given by the computed memory indexes.
  • the address generation unit is capable of computing a number of predefined forms of memory indexes, e.g. the form corresponding to a linearized memory index (see above for details), which are stored internally and which are selected by control data during initialization of the address generation unit as described below.
  • the address generation unit is initialized for each instance considered in 1. : a. with control data which select, out of several predefined forms, the form of the memory index of the considered instance to be computed, e.g. the form of a linearized memory index.
  • the form of a memory index depends on how the physical memory/cache is addressed.
  • an offset which may be multidimensional, and which is added to a preliminary computed memory index of the considered instance to obtain the final memory index
  • the data mentioned under 2.a-2.c are transmitted from the processing device to the address generation unit, where no further data than those mentioned under 2.a-2.c are required to initialize the address generation unit, where however this does not exclude the possibility that, for practical reasons, additional data may be exchanged between processing device and address generation unit in order to initialize the address generation unit.
  • When the processing device starts executing said machine code, it transmits control data to the address generation unit which tell the address generation unit to start operation. During execution of said machine code, the operation of the address generation unit is as follows : d. Whenever a loop index of a loop being part of said block of nested loops changes value (due to execution of said machine code), the new value of said loop index is transmitted from the processing device to the address generation unit, such that the latest transmitted values of the loop indexes of said loops being part of said block of nested loops, transmitted from the processing device to the address generation unit as stated here, represent the so-called actual values of the loop indexes of said loops.
  • this transmission scheme, triggered by the change of one or more loop index values, does not exclude the possibility of synchronizing, through the use of a clock signal, the transmission of the values of the loop indexes as well as the calculation of the memory index of the considered instance
  • e. for each instance considered in 1., the address generation unit calculates the corresponding memory index using the data as specified in 2.a - 2.c and based on the actual values of the loop indexes as specified in 4.d.
  • the address generation unit can well calculate said memory index with modified (e.g. incremented) values of the loop indexes. In other words, the address generation unit may well modify, e.g.
  • the memory/cache address given by a memory index which is calculated using, for example, incremented loop index values contains a value (namely that of the considered instance) which might be required by the program a few iterations later (how many iterations later depends on which loop index values are incremented).
  • Such a memory index (address) calculation ahead of the actual program code execution, together with the subsequent loading of the memory/cache data stored at the address given by the calculated memory index, makes it possible to hide the latency (access time) of the memory/cache and to avoid slowing down program execution.
  • the address generation unit either loads the value of the considered instance from the physical memory/cache address given by the new value of the memory index and transmits that value to the processing device, or stores the value of the considered instance to said physical memory/cache address, in which case the value of the considered instance was transmitted from the processing device to the address generation unit prior to being used by the address generation unit
  • g. the address generation unit calculates the memory index of an instance considered in 1.
  • an offset which may be multidimensional, which is added to a preliminary computed memory index of said instance to obtain the final memory index, where said offset is computed by the processing device and transmitted from the processing device to the address generation unit whenever said offset changes value and prior to the usage of the new value of said offset by the address generation unit
  • h. during operation, the address generation unit requires no further data than those mentioned under 2.a - 2.c, 4.d and optionally 4.g to calculate the memory index of an instance considered in 1. However, this does not exclude the possibility that, for practical reasons, additional data may be exchanged between processing device and address generation unit.
  • the execution (on said processing device) of one or more loops being part of said block of nested loops is overlapping, in other words, one or more of the instructions/operations contained in the loop body which have to be performed during some iteration of one of said loops are executed before all the instructions/operations to be executed during the previous iteration of said loop have been completely executed.
  • the processing device modifies the value of the loop index of said loop and transmits the new value to the address generation unit.
  • the present invention concerns a processing system containing a processing device (microprocessor, DSP, CPU, micro-controller) and an address generation unit with 'hardwired' multidimensional memory indexing support according to claim 1.
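The initialization and operation steps listed above can be sketched as a small software model in C. This is only an illustration under stated assumptions, not the register transfer level design of the invention: the names agu_instance, agu_init, agu_set_loop_index and agu_memory_index, the limit MAX_LOOPS and the lookahead parameter are hypothetical, and only the linearized form of memory index is modelled, so the control data that selects the form is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOOPS 8  /* hypothetical limit on the nesting depth k */

/* Hypothetical software model of the data kept for one instance. */
typedef struct {
    size_t k;                    /* number of nested loops */
    long coeff[MAX_LOOPS + 1];   /* b_1 ... b_k followed by the constant b_{k+1} */
    long offset;                 /* added to the preliminary memory index */
    long loop_index[MAX_LOOPS];  /* latest ("actual") loop index values */
} agu_instance;

/* Initialization: coefficients and offset are transmitted once. */
void agu_init(agu_instance *agu, size_t k, const long *coeff, long offset)
{
    assert(agu && coeff && k >= 1 && k <= MAX_LOOPS);
    agu->k = k;
    for (size_t j = 0; j <= k; j++)
        agu->coeff[j] = coeff[j];
    for (size_t j = 0; j < k; j++)
        agu->loop_index[j] = 0;
    agu->offset = offset;
}

/* Operation: a loop index is sent only when its value changes. */
void agu_set_loop_index(agu_instance *agu, size_t level, long value)
{
    assert(agu && level < agu->k);
    agu->loop_index[level] = value;
}

/* Linearized memory index. As an assumption of this sketch, `lookahead`
   is added to the innermost loop index only, to model a calculation
   ahead of the actual program execution. */
long agu_memory_index(const agu_instance *agu, long lookahead)
{
    assert(agu && agu->k >= 1);
    long index = agu->coeff[agu->k] + agu->offset;
    for (size_t j = 0; j < agu->k; j++) {
        long m = agu->loop_index[j];
        if (j == agu->k - 1)
            m += lookahead;
        index += agu->coeff[j] * m;
    }
    return index;
}
```

For the instance a[m+1][m] of a matrix with row length 10, the coefficients would be {11, 10}, since 10*(m+1)+m = 11*m + 10; with m = 3 the model yields index 43, and 54 with a look-ahead of one iteration.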

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention describes a data processing system (microprocessor, CPU, DSP, micro-controller) with an address generation unit with hardwired multidimensional memory indexing support. The address generation unit is able to compute one or more multidimensional memory indexes based on only a limited number of initialization and operation data without requiring explicit instructions in the program code. The advantages of such an 'intelligent' hardwired address generation unit are substantial savings in program code size and power consumption, and a higher effective processing speed/power.

Description

A data processing system having an address generation unit with hardwired multidimensional memory indexing support
1. Field of the invention
The present invention relates to the field of architecture design of data processing systems in general. More specifically, the invention deals with architecture design issues at the register transfer level of a processing system containing a processing device and an address generation unit with hardwired multidimensional memory indexing support.
2. Conventions, definition of terms, terminology
In the context of the present invention, the term 'processing device' means one of the following : microprocessor, CPU, DSP or micro-controller, the meaning of these terms being the one commonly described in the literature. As usual, it is further assumed that the machine code of a program which is running or executed on said processing device contains exclusively instructions specific to said processing device, and that said machine code is obtained either by compiling the source code of said program or by manual writing. The source code of said program is usually written in a high level programming language like C, Pascal, Basic, Fortran or Java. In the context of the present invention, the term 'processing system' means a processing device (microprocessor, CPU, DSP or micro-controller) coupled (connected) to an address generation unit as shown in figure 1. However, in practice the address generation unit based on the present invention can be part of the processing device itself, the processing device and address generation unit together forming one integrated circuit (IC) with the same functionality as a microprocessor, CPU, DSP or micro-controller. The reason for using two different terms, namely 'processing system' and 'processing device', is to be able to clearly delimit and define the functionality of the address generation unit by conceptually splitting it off (as shown in figure 1) from the 'rest of the processing system' and by identifying the term 'rest of the processing system' with the term 'processing device'. Furthermore, this makes it possible to define the data exchange (communication) between the address generation unit and the processing device.
Despite this conceptual splitting-off from the processing device, here the term 'address generation unit' (AGU) has the same meaning as in the literature, namely hardware circuitry used to perform address calculations, the calculated addresses referring to data (including instruction data) used by a program which is running on the processing device. An address generation unit may be part of a memory management unit but not vice versa. Furthermore, an address generation unit may also load/store the program data from/to a memory or cache, in which case it may also be part of a cache controller. The register transfer level architecture of said processing system considers only (1) elementary building blocks, e.g. the address generation unit and the processing device, (2) input and output data of each building block, (3) the functionality of each building block, e.g. how the output data are calculated by using the input data, (4) the connections and data exchanged between the building blocks. Therefore, implementation details like intermediate amplifiers, buffers, latches and registers, which might be inserted between or inside elementary building blocks, are not considered since (1) they do not change the register transfer level architecture and (2) although they may change the timing (due to the insertion of buffers, latches and registers), they do not change the functionality.
Since the term 'multidimensional memory indexing' has no clearly defined meaning in the literature, it shall now be defined in detail for the purpose of the present invention. Consider a block of nested loops, including the case of a single loop, being part of the source code of a program running (being executed) on the considered processing device. Assuming that the program source code is specified in some high level programming language like C, Pascal, Basic, Fortran or Java, the term 'loop' refers as usual to a 'for'-, 'while'- or 'do'-loop, conditional statements refer to 'if-then-else' statements, and branch/jump statements to 'goto'- or 'exit'-statements. Furthermore, loop bodies of 'loops' may contain any mixture of conditional, branch and jump statements, where said statements may also be nested. Consider an n-dimensional variable b[i_1][i_2]...[i_n], with i_j, j=1,...,n, being the indexes of the variable b, and a block of k nested loops with loop indexes m_1, m_2, ..., m_k. An instance of the n-dimensional variable b appearing in a loop body of a loop being part of the considered block of nested loops is of the form b[expr_1(m_1, m_2, ..., m_k)][expr_2(m_1, m_2, ..., m_k)] ... [expr_n(m_1, m_2, ..., m_k)], where expr_i(m_1, m_2, ..., m_k), i=1,2,...,n, is an arbitrarily complex expression representing index i of the considered instance and depending on the loop indexes m_1, m_2, ..., m_k. The memory index of the considered instance is defined to be the address within a physical memory or cache to/from which the value of (or the data corresponding to) the considered instance has (have) to be loaded/stored. The dimension of the memory index corresponding to an instance of an n-dimensional variable is equal to n. Furthermore, to each instance of a multidimensional variable corresponds a different memory index with possibly a different form. The form taken by the memory index depends on how the memory/cache is organized. Two important forms of memory indexes (and thus memory organizations) shall be considered below.
Note that n (the dimension of the considered variable) may actually be bigger than k (the number of nested loops), as exemplified by the excerpt of a program source code listed below, where only one loop is present (k=1) but a 2-dimensional variable a[i_1][i_2] appears in the body of the for-loop (see below for details).
First, consider two examples illustrating the concept of instances of multidimensional variables. The excerpt of a program source code written in C and listed in section 3 contains a single for-loop with a loop index m, and there are 7 different instances of a 2-dimensional variable a[i_1][i_2] present in the loop body, namely a[m][m], a[m+1][m], a[m+1][m+1], a[m-1][m], a[m-1][m-1], a[m+2][m+1], a[m-1][m-1]. Note that the indexes of each instance involve different expressions in the loop index m. Another simple example is given by the following program, which multiplies two 2-dimensional n x n matrices a and b together. The program contains 3 nested for-loops with loop indexes m1, m2, m3 :
for (m1=1; m1<=n; m1++)
  for (m2=1; m2<=n; m2++)
    for (m3=1; m3<=n; m3++)
      c[m1][m2] += a[m1][m3]*b[m3][m2];
In this example, 3 variable instances appear in the body of the innermost for-loop : 1 instance of variable c, namely c[m1][m2], 1 instance of variable a, namely a[m1][m3], and 1 instance of variable b, namely b[m3][m2]. Note again that the indexes of each instance depend on different loop indexes. This example shall also illustrate the concept of the nesting level of a loop which is part of a block of nested loops. In the above example, the outermost loop with loop index m1 has nesting level 1, the loop with loop index m2 has nesting level 2, while the innermost loop with loop index m3 has nesting level 3. This concept is extended in the same way to the general case of a block of k nested loops.
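The matrix multiplication example can be written as a compilable C function. This is a minimal sketch that assumes zero-based loop indexes (the text counts from 1) and a fixed matrix size N chosen only for illustration; the function name matmul_accumulate is hypothetical.

```c
#include <assert.h>

#define N 2  /* matrix size, chosen only for this sketch */

/* Accumulates the product a*b into c for N x N matrices.
   The three instances c[m1][m2], a[m1][m3] and b[m3][m2]
   each depend on a different pair of loop indexes. */
void matmul_accumulate(double a[N][N], double b[N][N], double c[N][N])
{
    assert(a && b && c);
    for (int m1 = 0; m1 < N; m1++)          /* nesting level 1 */
        for (int m2 = 0; m2 < N; m2++)      /* nesting level 2 */
            for (int m3 = 0; m3 < N; m3++)  /* nesting level 3 */
                c[m1][m2] += a[m1][m3] * b[m3][m2];
}
```

The caller is expected to set c to zero beforehand if a plain product, rather than an accumulation, is wanted.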
Two specific forms of memory indexes of high practical interest are now discussed.
(1) If the physical memory/cache is linearly addressed, in other words if it is organized as a linear (one-dimensional) array, the n-dimensional memory index corresponding to an instance b[expr_1(m_1, m_2, ..., m_k)][expr_2(m_1, m_2, ..., m_k)] ... [expr_n(m_1, m_2, ..., m_k)] of an n-dimensional variable b[i_1][i_2]...[i_n] appearing in the body of a loop being part of a block of k nested loops is very often of the form b_1*m_1 + b_2*m_2 + ... + b_k*m_k + b_{k+1}, where b_j, j=1,...,k+1, are integer coefficients and m_j, j=1,...,k, are the loop indexes of the nested loops. Memory indexes of the form b_1*m_1 + b_2*m_2 + ... + b_k*m_k + b_{k+1} are also called linearized memory indexes. For example, in the above listed program, the linearized memory index corresponding to instance b[m3][m2] would be n*m3 + m2. Note that the coefficient n is given by the upper boundary value of loop index m2. Referring to the before mentioned example of a program source code excerpt written in C and listed below, the linearized 2-dimensional memory indexes corresponding to the 7 instances of the 2-dimensional variable a[i_1][i_2], 1 ≤ i_1, i_2 ≤ k, k an integer constant, are k*m+m, k*(m+1)+m, k*(m+1)+m+1, k*(m-1)+m, k*(m-1)+m-1, k*(m+2)+m+1 and k*(m-1)+m-1 respectively.
(2) If the memory/cache is organized as an n-dimensional array, then the memory index corresponding to an instance b[expr_1(m_1, m_2, ..., m_k)][expr_2(m_1, m_2, ..., m_k)] ... [expr_n(m_1, m_2, ..., m_k)] of an n-dimensional variable b[i_1][i_2]...[i_n] appearing in the body of a loop being part of a block of k nested loops has the form of the n-tuple [expr_1(m_1, m_2, ..., m_k)][expr_2(m_1, m_2, ..., m_k)] ... [expr_n(m_1, m_2, ..., m_k)], where expr_i(m_1, m_2, ..., m_k), i=1,2,...,n, is often of the form b_{1,i}*m_1 + b_{2,i}*m_2 + ... + b_{k,i}*m_k + b_{k+1,i}, with b_{j,l}, j=1,...,k+1, l=1,...,n, being integer coefficients. Concerning the before mentioned excerpt of a program source code listed below, the memory indexes of the 7 instances of the 2-dimensional variable a[i_1][i_2] appearing in the for-loop are : [m][m], [m+1][m], [m+1][m+1], [m-1][m], [m-1][m-1], [m+2][m+1], [m-1][m-1].
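Both forms can be illustrated in C. The following is a minimal sketch, assuming zero-based coefficient arrays in which b[0..k-1] hold b_1 to b_k and b[k] holds the constant term b_{k+1}; the function names linearized_index and tuple_index and the row-wise layout of the coefficient table are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Form (1): linearized memory index
   b[0]*m[0] + ... + b[k-1]*m[k-1] + b[k]. */
long linearized_index(size_t k, const long *b, const long *m)
{
    assert(b != NULL && m != NULL);
    long index = b[k];  /* constant term b_{k+1} */
    for (size_t j = 0; j < k; j++)
        index += b[j] * m[j];
    return index;
}

/* Form (2): n-tuple memory index. `b` holds n rows of k+1 coefficients,
   one row per dimension, and out[i] receives component i of the tuple. */
void tuple_index(size_t n, size_t k, const long *b, const long *m, long *out)
{
    assert(b != NULL && m != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = linearized_index(k, b + i * (k + 1), m);
}
```

For the instance a[m+2][m+1] with a single loop (k=1) and a row length of 10, the linearized coefficients are {11, 21}, because 10*(m+2)+(m+1) = 11*m + 21, while the tuple form uses the rows {1, 2} and {1, 1}.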
As can be seen from these two examples, the memory index of an instance of a multidimensional variable appearing in the loop body of a loop being part of a block of nested loops changes value whenever one or more of the loop indexes on which it depends change value. Therefore, the data required to compute the memory index of an instance b[expr_1(m_1, m_2, ..., m_k)][expr_2(m_1, m_2, ..., m_k)] ... [expr_n(m_1, m_2, ..., m_k)] of an n-dimensional variable b appearing in the body of a loop being part of a block of k nested loops are :
(1) the actual values of the loop indexes m_1, m_2, ..., m_k corresponding to the iteration counts of the loops being part of said block of nested loops, the iteration counts being given by the execution of said block of loops on said processing device at a given moment in time
(2) the coefficients required to calculate the memory index, e.g. the integer coefficients b_j, j=1,...,k+1, in the case of a linearized memory index of the form b_1*m_1 + b_2*m_2 + ... + b_k*m_k + b_{k+1}
(3) an offset, which may be added to a preliminary computed memory index in order to obtain the final memory index. In case the memory is organized and addressed as an n-dimensional array with the same dimension as the memory index of an instance of an n-dimensional variable, the offset is n-dimensional and is of the form [c_1][c_2]...[c_n], with c_i, i=1,2,...,n, integer constants. For example, if the instance b[m_1+3][m_2+10][m_3] of a 3-dimensional variable b has to be loaded from memory and has a corresponding offset [2][3][4], then its final memory index is given by preliminary memory index + offset = [m_1+3][m_2+10][m_3] + [2][3][4] = [m_1+3+2][m_2+10+3][m_3+4]. It is clear that in case the memory index is linearized, the offset is just an integer value.
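The offset addition of item (3) can be sketched in C as a component-wise sum. The function name add_offset and the use of plain arrays for the n-tuples are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Adds an n-dimensional offset to a preliminary n-dimensional memory
   index, component by component, giving the final memory index. */
void add_offset(size_t n, const long *preliminary, const long *offset,
                long *final_index)
{
    assert(preliminary && offset && final_index);
    for (size_t i = 0; i < n; i++)
        final_index[i] = preliminary[i] + offset[i];
}
```

With loop index values 1, 2 and 3, the instance with indexes [m_1+3][m_2+10][m_3] has the preliminary index [4][12][3]; adding the offset [2][3][4] gives [6][15][7]. In the linearized case the same function applies with n = 1.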
3. Prior Art
Address generation units found in today's microprocessors, CPUs, DSPs and micro-controllers perform a restricted number of relatively simple address calculation modes. The most important address calculation modes commonly supported are (1) indirect (2) indexed (3) displacement (4) postincrement/decrement (5) modulo (6) modulo wrap-around (circular addressing). However, none of these modes supports multidimensional memory indexing efficiently. As a consequence, a lot of instructions in the machine code of a program are required just to calculate the memory indexes of instances of multidimensional variables appearing in the program source code. With an address generation unit based on the present invention, however, all these instructions become obsolete and can be dropped, since the memory indexes of instances of multidimensional variables are 'hardwired', in other words they are calculated automatically and autonomously without requiring any instructions in the program code. Furthermore, 'hardwired' does not exclude the possibility of selecting between several forms of memory indexes (e.g. linearized and others) which are predefined and stored internally in the address generation unit. In this case, control data can be used to tell the address generation unit which form of memory index to select for each instance of a multidimensional variable.
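For comparison, two of the conventional modes named above, post-increment and modulo wrap-around (circular) addressing, can be modelled in C as follows. The function names and the word-addressed unsigned address type are assumptions made for this sketch. Each mode updates a single address register by a fixed step, which suggests why an index such as k*(m+1)+m needs additional arithmetic instructions in the program code.

```c
#include <assert.h>

/* Post-increment addressing: return the current address,
   then advance the address register by `step`. */
unsigned post_increment(unsigned *reg, unsigned step)
{
    assert(reg);
    unsigned address = *reg;
    *reg += step;
    return address;
}

/* Circular (modulo wrap-around) addressing inside a buffer of
   `length` words starting at `base`. */
unsigned circular_next(unsigned base, unsigned length,
                       unsigned address, unsigned step)
{
    assert(length > 0 && address >= base && address < base + length);
    return base + (address - base + step) % length;
}
```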
Multidimensional memory indexing naturally occurs in applications involving matrix and vector operations, e.g. FEM calculations, as well as image and multidimensional signal processing. As mentioned before, with a prior art address generation unit, a lot of instructions/operations and hence a lot of program (machine) code size and computation power are required just to perform multidimensional memory indexing. This shall be exemplified by the following excerpt of a program source code written in C (taken from 'Numerical Recipes in C', W.H. Press et al.), which determines the eigenvalues of a 2-dimensional upper Hessenberg matrix a, hence requiring 2-dimensional memory indexing. As already mentioned, a total of seven 2-dimensional memory indexes appear in the for-loop, corresponding to 7 different instances of the 2-dimensional matrix variable a, and they have to be recalculated for each iteration of the for-loop body.
Without 2-dimensional memory indexing support, the corresponding program code (in assembler format) of a processing device (microprocessor, CPU, DSP, micro-controller) with a prior art address generation unit and 3-operand instruction set, is listed below and would typically require about 56 instructions, under the assumption that the physical memory is addressed linearly (as a one- dimensional array).
However, by using an address generation unit with 'hardwired' multidimensional memory indexing support as based on the present invention, the program (machine) code (listed after the prior art one) can be reduced to about 33 instructions. This represents a saving of 42 % in program (machine) code size. Furthermore, such an address generation unit with 'hardwired' multidimensional memory indexing support can perform these calculations a lot faster. This allows the program (machine) code to be executed faster, resulting in higher effective processing (computation) power. Note that 7 special instructions, denoted by 'MMU', are required to initialize the address generation unit. Furthermore, executing fewer instructions also means consuming less power. About the same improvements are achieved when the physical memory is organized and addressed as a 2-dimensional array.
Therefore, the advantages of address generation units with hardwired multidimensional memory indexing support are threefold : (1) reduced program (machine) code size (2) accelerated program execution (3) reduced power consumption.
The following excerpt of program source code written in C is taken from 'Numerical Recipes in C', W.H. Press et al., and determines the eigenvalues of a 2-dimensional upper Hessenberg matrix a.
for (m = nn-2; m >= 1; m--)
{
    z = a[m][m];
    r = x - z;
    s = y - z;
    p = (r*s - w) / a[m+1][m] + a[m][m+1];
    q = a[m+1][m+1] - z - r - s;
    r = a[m+2][m+1];
    s = abs(p) + abs(q) + abs(r);
    p = p/s;
    q = q/s;
    r = r/s;
    if (m == 1) goto 4;
    u = abs(a[m][m-1]) * (abs(q) + abs(r));
    v = abs(p) * (abs(a[m-1][m-1]) + abs(z) + abs(a[m+1][m+1]));
    if (u+v == v) goto 4;
}
4:
The following program code (in assembler format) of the previous program source code corresponds to a processing device with a conventional address generation unit according to the prior art :
1: MAC R2,#n,R2
LD R2,(R2)
SUB R3,R4,R2
INC R7,R1
MAC R7,#n,R1
LD R7,(R7)
INC R8,R1
MAC R1,#n,R8
LD R1,(R1)
MULT R9,R3,R5
SUB R9,R10
DIV R9,R7
ADD R9,R1
LD R1,#(nn-2)
INC R1
MAC R1,#n,R1
LD R11,R1
SUB R11,R2
SUB R11,R3
SUB R11,R5
INC R3
ADD R4,#2
MAC R4,#n,R3
LD R4,(R4)
ADD R5,|R9|,|R11|
ADD R5,R5,|R4|
DIV R9,R5
DIV R11,R5
DIV R4,R5
LD R1,#(nn-2)
CMP R1,#1
JMPE 56
SUB R12,R1,#1
MAC R1,#n,R12
LD R1,(R1)
ADD R3,|R11|,|R4|
MULT R3,|R1|,R3
DEC R1
MAC R1,#n,R1
LD R1,(R1)
ADD R11,|R1|,|R2|
LD R1,#(nn-2)
INC R1
MAC R1,#n,R1
LD R1,(R1)
ADD R11,|R1|
MULT R11,R9
ADD R3,R11
CMP R3,R11
JMPE 56
LD R1,#(nn-2)
JMPE R1-,#1
56:
The following program code (in assembler format) of the previous program source code corresponds to a processing system containing a processing device and an address generation unit based on the present invention :
MMU1,R1,#n,R1
MMU2,R1+,#n,R1
MMU3,R1,#n,R1+
MMU4,R1+,#n,R1+
MMU5,R1+#2,#n,R1+
MMU6,R1,#n,R1-
MMU7,R1-,#n,R1-
1: SUB R3,R4,A1
SUB R5,R6,A1
MULT R9,R3,R5
SUB R9,R10
DIV R9,A2
ADD R9,A3
SUB R11,A4
SUB R11,R3
SUB R11,R5
ADD R5,|R9|,|R11|
ADD R5,R5,|A5|
DIV R9,R5
DIV R11,R5
DIV R4,R5
CMP R1,#1
JMPE 26
ADD R4,|R4|,|R11|
MULT R4,R4,|R11|
ADD R11,|R11|,|R12|
ADD R11,R12,A4
MULT R11,R11,|R9|
ADD R4,R11
CMP R4,R11
JMPE 26
JMPE R1-,#1
26:
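The seven MMU initializations can be read as seven fixed (row offset, column offset) pairs relative to the loop index m, one per matrix instance in the C excerpt. The following C sketch models that reading in software. The struct name instance_desc, the table INSTANCES, the function compute_indexes and the interpretation of the operands R1+, R1- and R1+#2 as m+1, m-1 and m+2 are assumptions made for illustration, not a documented instruction encoding.

```c
#include <assert.h>
#include <stddef.h>

/* One descriptor per matrix instance: a[m + row_off][m + col_off].
 * The offsets follow an assumed reading of the MMU1..MMU7 operands. */
struct instance_desc {
    int row_off;
    int col_off;
};

static const struct instance_desc INSTANCES[7] = {
    { 0,  0},   /* MMU1: a[m][m]     */
    { 1,  0},   /* MMU2: a[m+1][m]   */
    { 0,  1},   /* MMU3: a[m][m+1]   */
    { 1,  1},   /* MMU4: a[m+1][m+1] */
    { 2,  1},   /* MMU5: a[m+2][m+1] */
    { 0, -1},   /* MMU6: a[m][m-1]   */
    {-1, -1},   /* MMU7: a[m-1][m-1] */
};

/* Recompute all seven linearized indexes for loop index m, assuming
 * row-major storage with n elements per row. */
void compute_indexes(int m, int n, long out[7])
{
    assert(n > 0);
    for (size_t k = 0; k < 7; k++)
        out[k] = (long)(m + INSTANCES[k].row_off) * n
               + (m + INSTANCES[k].col_off);
}
```

In this model the descriptors are set up once before the loop, and only the new value of m has to be supplied per iteration, which mirrors the division of work between the MMU instructions and the shortened loop body in the listing.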
4. Brief description of the drawings
Figure 1 shows a processing system containing a processing device (microprocessor, CPU, DSP, micro-controller) and an address generation unit with 'hardwired' multidimensional memory indexing support as based on the present invention.
5. Detailed description of the drawings
The main aspects of the present invention are described by referring to figure 1 mentioned in this section.
Figure 1 shows the register transfer level architecture of a processing system containing a processing device (microprocessor, CPU, DSP, micro-controller) and an address generation unit with 'hardwired' multidimensional memory indexing support as based on the present invention. In addition, figure 1 shows the data required for initialization and operation of the address generation unit and the data exchanged between the processing device and the address generation unit. Also shown is a memory/cache from/to which the address generation unit may optionally load/store data according to the addresses given by the computed memory indexes.
The data required for the initialization and the operation of the address generation unit are now discussed in more detail. The address generation unit is capable of computing a number of predefined forms of memory indexes, e.g. the form corresponding to a linearized memory index (see above for details). These forms are stored internally and are selected by control data during initialization of the address generation unit as described below.
1. Given a block of nested loops, with loop indexes $m_1, m_2, \ldots, m_k$, being part of the source code of a program whose machine code is going to be executed on the processing device and where the loop bodies of said loops may contain any mixture of conditional, branch and jump statements and where said statements may also be nested. Given one or more instances of multidimensional variables appearing in some loop bodies of loops being part of said block of nested loops. The initialization and the operation of the address generation unit are as follows :
2. Before execution of said machine code, the address generation unit is initialized for each instance considered in 1. :
a. with control data which select, out of several predefined forms, the form of the memory index of the considered instance to be computed, e.g. the form of a linearized memory index. Remember that the form of a memory index depends on how the physical memory/cache is addressed.
b. with the integer coefficients required to calculate the memory index, said memory index having the form that was selected as specified in 2.a, e.g. the coefficients $b_j$, $j = 1, \ldots, k+1$, in case of a linearized memory index of the form $b_1 m_1 + b_2 m_2 + \cdots + b_k m_k + b_{k+1}$
c. optionally with an offset, which may be multidimensional, and which is added to a preliminary computed memory index of the considered instance to obtain the final memory index
where the data mentioned under 2.a-2.c are transmitted from the processing device to the address generation unit, where no further data than those mentioned under 2.a-2.c are required to initialize the address generation unit, where however this does not exclude the possibility that, for practical reasons, additional data may be exchanged between processing device and address generation unit in order to initialize the address generation unit.
3. When the processing device starts executing said machine code, it transmits control data to the address generation unit which tell the address generation unit to start operation.
4. During execution of said machine code, the operation of the address generation unit is as follows :
d. Whenever a loop index of a loop being part of said block of nested loops changes value (due to execution of said machine code), the new value of said loop index is transmitted from the processing device to the address generation unit such that the latest transmitted values of the loop indexes of said loops being part of said block of nested loops, transmitted from the processing device to the address generation unit as stated hereunder, represent the so-called actual values of the loop indexes of said loops. Note that this transmission scheme, triggered by the change of one or more loop index values, does not exclude the possibility of synchronizing, through the use of a clock signal, the transmission of the values of the loop indexes as well as the calculation of the memory index of the considered instance.
e. For each instance considered in 1., the address generation unit calculates the corresponding memory index using the data as specified in 2.a-2.c and based on the actual values of the loop indexes as specified in 4.d. Note that the address generation unit may also calculate said memory index with modified (e.g. incremented) values of the loop indexes. In other words, the address generation unit may modify, e.g. increment, the actual values of the loop indexes, which are transmitted by the processing device to the address generation unit as specified in 4.d, and calculate a memory index using these modified loop index values. The consequence is that the memory/cache address given by a memory index which is calculated using, for example, incremented loop index values contains a value (namely that of the considered instance) which might be required by the program a few iterations later (how many iterations later depends on which loop index values are incremented). Such a memory index (address) calculation ahead of the actual program code execution, together with the subsequent loading of the memory/cache data stored at the address given by the calculated memory index, makes it possible to hide the latency (access time) of the memory/cache and to avoid slowing down program execution.
f. Optionally, whenever the memory index of an instance considered in 1. changes value, the address generation unit either loads from the physical memory/cache address given by the new value of the memory index the value of the considered instance and transmits that value to the processing device, or stores to said physical memory/cache address the value of the considered instance, in which case the value of the considered instance was transmitted, prior to being used by the address generation unit, from the processing device to the address generation unit.
g. Optionally, during operation, the address generation unit calculates the memory index of an instance considered in 1. by using, in addition to the data as specified in 2.a-2.c and 4.d, an offset, which may be multidimensional, which is added to a preliminary computed memory index of said instance to obtain the final memory index, where said offset is computed by the processing device and transmitted from the processing device to the address generation unit whenever said offset changes value and prior to the usage of the new value of said offset by the address generation unit.
h. During operation, the address generation unit requires no further data than those mentioned under 2.a-2.c, 4.d and optionally 4.g to calculate the memory index of an instance considered in 1. However, this does not exclude the possibility that, for practical reasons, additional data may be exchanged between processing device and address generation unit.
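The initialization and operation steps above can be sketched as a small software model. This is only an illustrative sketch under stated assumptions: the names agu_instance, agu_init, agu_set_loop_index and agu_index, the limit AGU_MAX_LOOPS, and the ahead parameter used to model index calculation with modified loop index values are all hypothetical. Only the linearized form is modelled, so the form-selecting control data of step 2.a is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define AGU_MAX_LOOPS 8   /* arbitrary limit chosen for this sketch */

/* State kept for one instance of a multidimensional variable. */
struct agu_instance {
    size_t k;                         /* number of nested loops           */
    long   coeff[AGU_MAX_LOOPS + 1];  /* b_1..b_k followed by b_{k+1}     */
    long   offset;                    /* optional offset (2.c), 0 if none */
    long   loop[AGU_MAX_LOOPS];       /* actual loop index values (4.d)   */
};

/* Step 2: initialize with k+1 coefficients (2.b) and an offset (2.c). */
void agu_init(struct agu_instance *u, size_t k, const long *coeff, long offset)
{
    assert(k >= 1 && k <= AGU_MAX_LOOPS);
    u->k = k;
    for (size_t j = 0; j <= k; j++)
        u->coeff[j] = coeff[j];
    u->offset = offset;
    for (size_t j = 0; j < k; j++)
        u->loop[j] = 0;
}

/* Step 4.d: the processing device reports a changed loop index value. */
void agu_set_loop_index(struct agu_instance *u, size_t j, long value)
{
    assert(j < u->k);
    u->loop[j] = value;
}

/* Step 4.e: linearized memory index b_1*m_1 + ... + b_k*m_k + b_{k+1},
 * plus the offset. If ahead is not NULL, ahead[j] is added to loop
 * index j first, which models calculating an index ahead of execution. */
long agu_index(const struct agu_instance *u, const long *ahead)
{
    long idx = u->coeff[u->k] + u->offset;
    for (size_t j = 0; j < u->k; j++) {
        long m = u->loop[j] + (ahead ? ahead[j] : 0);
        idx += u->coeff[j] * m;
    }
    return idx;
}
```

For a 2-dimensional array with 10 elements per row, the coefficients would be {10, 1, 0} in this model. Passing a non-NULL ahead array yields the index that a later iteration will need, which is the value a real unit could use to start a load early and hide memory latency.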
Note that the operation of the address generation unit as well as the before mentioned data required by the address generation unit for the calculation of a memory index includes the following two cases :
(1) one or more instructions/operations to be performed within the loop body of a loop being part of said block of nested loops are executed in parallel (simultaneously), sequentially, or partially sequentially and partially in parallel
(2) the execution (on said processing device) of one or more loops being part of said block of nested loops is overlapping; in other words, one or more of the instructions/operations contained in the loop body which have to be performed during some iteration of one of said loops are executed before all the instructions/operations to be executed during the previous iteration of said loop have been completely executed. In this case, whenever a new iteration of said loop is started and executed, the processing device modifies the value of the loop index of said loop and transmits the new value to the address generation unit.
6. Summary of the invention
The present invention concerns a processing system containing a processing device (microprocessor, DSP, CPU, micro-controller) and an address generation unit with 'hardwired' multidimensional memory indexing support according to claim 1.

Claims

What is claimed is :
1. A processing system containing a processing device and an address generation unit, where the machine code of a program, whose source code contains a block of nested loops, is going to be executed on said processing device, with one or more instances of multidimensional variables appearing in some loop bodies of loops being part of said block of nested loops, where said address generation unit contains one or more predefined forms of memory indexes which are selectable by control data as described under 1.k and where, before the execution of said machine code, the address generation unit is initialized, for each previously considered instance :
i. with control data which select, out of one or more predefined forms, the form of the memory index of said instance to be computed
j. with the integer coefficients required to calculate said memory index, said memory index being of the form that was selected as specified in 1.i
where no further data than those mentioned under 1.i, 1.j are required to initialize the address generation unit, where the data mentioned under 1.i, 1.j are transmitted from the processing device to the address generation unit, where, upon starting execution of said machine code, the processing device transmits control data to the address generation unit which tell the address generation unit to start operation, where, during execution on the processing device of said machine code, the operation of the address generation unit is as follows :
k. whenever a loop index of a loop being part of said block of nested loops changes value, due to execution of said machine code on the processing device, the new value of the considered loop index is transmitted from the processing device to the address generation unit such that the latest transmitted values of the loop indexes of said loops being part of said block of nested loops, transmitted from the processing device to the address generation unit as stated hereunder, represent the so-called actual values of the loop indexes of said loops used to calculate the memory index of each considered instance as stated in 1.l
l. for each considered instance, the address generation unit calculates the corresponding memory index using the data as specified in 1.i, 1.j and 1.k
m. where, during operation, the address generation unit requires no further data than those specified in 1.i, 1.j and 1.k to calculate the memory index of the considered instance
where all the memory indexes computed by the address generation unit have at least dimension 2
2. A processing system containing a processing device and an address generation unit as claimed in claim 1, where the address generation unit is initialized, for each considered instance, and in addition to the data mentioned in 1.i and 1.j, with an offset which may be multidimensional and which is added to a preliminary computed memory index of the considered instance to obtain the final memory index, and where no further data than those mentioned hereunder are required to initialize the address generation unit
3. A processing system containing a processing device and an address generation unit as claimed in claim 1 where, during operation, the address generation unit calculates the memory index of a considered instance by using, in addition to the data as specified in 1.i, 1.j and 1.k, an offset, which may be multidimensional, which is added to a preliminary computed memory index of said instance to obtain the final memory index, where said offset is computed by the processing device, and transmitted from the processing device to the address generation unit whenever said offset changes value and prior to the usage of the new value of said offset by the address generation unit, and where during operation, the address generation unit requires no further data than those mentioned hereunder to calculate the memory index of said instance
4. A processing system containing a processing device and an address generation unit as claimed in claim 1 where, whenever the memory index of a considered instance changes value, the address generation unit either loads from the physical memory/cache address given by the new value of said memory index the value of said instance and transmits that value to the processing device or stores to said physical memory/cache address the value of said instance, in which case the value of said instance was transmitted, prior to being used by the address generation unit, from the processing device to the address generation unit
5. A processing system containing a processing device and an address generation unit as claimed in claim 1, where all the memory indexes computed by the address generation unit are linearized
6. A processing system containing a processing device and an address generation unit as claimed in claim 5, where the address generation unit is initialized, for each considered instance, and in addition to the data mentioned in 1.i and 1.j, with an offset which may be multidimensional and which is added to a preliminary computed memory index of the considered instance to obtain the final memory index, and where no further data than those mentioned hereunder are required to initialize the address generation unit
7. A processing system containing a processing device and an address generation unit as claimed in claim 5 where, during operation, the address generation unit calculates the memory index of a considered instance by using, in addition to the data as specified in 1.i, 1.j and 1.k, an offset which may be multidimensional, which is added to a preliminary computed memory index of said instance to obtain the final memory index, where said offset is computed by the processing device, and transmitted from the processing device to the address generation unit whenever said offset changes value and prior to the usage of the new value of said offset by the address generation unit, and where during operation, the address generation unit requires no further data than those mentioned hereunder to calculate the memory index of said instance
8. A processing system containing a processing device and an address generation unit as claimed in claim 5 where, whenever the memory index of a considered instance changes value, the address generation unit either loads from the physical memory/cache address given by the new value of said memory index the value of said instance and transmits that value to the processing device or stores to said physical memory/cache address the value of said instance, in which case the value of said instance was transmitted, prior to being used by the address generation unit, from the processing device to the address generation unit
PCT/EP2000/004671 2000-05-23 2000-05-23 A data processing system having an address generation unit with hardwired multidimensional memory indexing support WO2001090888A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2000/004671 WO2001090888A1 (en) 2000-05-23 2000-05-23 A data processing system having an address generation unit with hardwired multidimensional memory indexing support

Publications (1)

Publication Number Publication Date
WO2001090888A1 (en) 2001-11-29

Family

ID=8163960

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2000/004671 WO2001090888A1 (en) 2000-05-23 2000-05-23 A data processing system having an address generation unit with hardwired multidimensional memory indexing support

Country Status (1)

Country Link
WO (1) WO2001090888A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2163280A (en) * 1984-08-15 1986-02-19 Tektronix Inc Address computation system for digital processing apparatus
EP0227900A2 (en) * 1985-12-02 1987-07-08 International Business Machines Corporation Three address instruction data processing apparatus
EP0689142A1 (en) * 1994-06-21 1995-12-27 France Telecom Electronic memory addressing apparatus especially for a bank organized memory
US5696947A (en) * 1995-11-20 1997-12-09 International Business Machines Corporation Two dimensional frame buffer memory interface system and method of operation thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIUEH T -C: "SUNDER: A PROGRAMMABLE HARDWARE PREFETCH ARCHITECTURE FOR NUMERICAL LOOPS", PROCEEDINGS OF THE SUPERCOMPUTING CONFERENCE,US,LOS ALAMITOS, IEEE COMP. SOC. PRESS, vol. CONF. 7, 14 November 1994 (1994-11-14), pages 488 - 497, XP000533912, ISBN: 0-8186-6607-2 *
HULINA P T ET AL: "DESIGN AND VLSI IMPLEMENTATION OF AN ADDRESS GENERATION COPROCESSOR", IEE PROCEEDINGS: COMPUTERS AND DIGITAL TECHNIQUES,GB,IEE, vol. 142, no. 2, 1 March 1995 (1995-03-01), pages 145 - 151, XP000507030, ISSN: 1350-2387 *
PLESZKUN AND DAVIDSON: "Structured memory access architecture", PROCEEDINGS INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, 23 August 1983 (1983-08-23) - 26 August 1983 (1983-08-26), COLUMBUS, OHIO, US, pages 461 - 471, XP000212140 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1962190A2 (en) * 2007-02-22 2008-08-27 Samsung Electronics Co., Ltd. Memory access method using three dimensional address mapping
EP1962190A3 (en) * 2007-02-22 2009-05-06 Samsung Electronics Co., Ltd. Memory access method using three dimensional address mapping
US7779225B2 (en) * 2007-02-22 2010-08-17 Samsung Electronics Co., Ltd. Memory access method using three dimensional address mapping
CN107038018B (en) * 2016-02-03 2019-07-19 谷歌有限责任公司 Access the data in multidimensional tensor
EP3226121A3 (en) * 2016-02-03 2018-10-31 Google LLC Accessing data in multi-dimensional tensors
US10228947B2 (en) 2016-02-03 2019-03-12 Google Llc Accessing data in multi-dimensional tensors
CN107038018A (en) * 2016-02-03 2017-08-11 谷歌公司 Access the data in multidimensional tensor
CN110457069A (en) * 2016-02-03 2019-11-15 谷歌有限责任公司 Access the data in multidimensional tensor
CN110457069B (en) * 2016-02-03 2020-08-18 谷歌有限责任公司 Accessing data in a multidimensional tensor
US10838724B2 (en) 2016-02-03 2020-11-17 Google Llc Accessing data in multi-dimensional tensors
JP7507271B2 (en) 2016-02-03 2024-06-27 グーグル エルエルシー Apparatus, system, and computer-implemented method for processing instructions for accessing N-dimensional tensors
JP7433356B2 (en) 2017-05-23 2024-02-19 グーグル エルエルシー Accessing data in multidimensional tensors using adders
US10504022B2 (en) 2017-08-11 2019-12-10 Google Llc Neural network accelerator with parameters resident on chip
US11501144B2 (en) 2017-08-11 2022-11-15 Google Llc Neural network accelerator with parameters resident on chip
US11727259B2 (en) 2017-08-11 2023-08-15 Google Llc Neural network accelerator with parameters resident on chip

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE

121 Ep: the epo has been informed by wipo that ep was designated in this application