WO2002041142A1 - A data processing system performing just-in-time data loading - Google Patents

A data processing system performing just-in-time data loading

Info

Publication number
WO2002041142A1
WO2002041142A1 (PCT/EP2001/008496)
Authority
WO
WIPO (PCT)
Prior art keywords
data
time
data processing
datum
processing device
Prior art date
Application number
PCT/EP2001/008496
Other languages
French (fr)
Inventor
Jean-Paul Theis
Original Assignee
Theis Jean Paul
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Theis Jean Paul filed Critical Theis Jean Paul
Priority to EP01978262A priority Critical patent/EP1410176A1/en
Publication of WO2002041142A1 publication Critical patent/WO2002041142A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824Operand accessing

Definitions

  • E.g. synchronously clocked microprocessors use the cycles, the cycle times or the periods of one or more periodic clock signals to measure time.
  • a clock signal is referred to simply as a clock.
  • the cycle of said clock may change over time or during execution of a machine code on said microprocessor, e.g. the SpeedStep Technology used by Intel Corporation in the design of the Pentium IV microprocessor.
  • asynchronously clocked microprocessors use the travel times required by signals to go through some specific hardware circuitry as time units.
  • said time axis can be defined by starting to count and label the clock cycles of said clock from a certain point in time onwards, this point in time usually being the point in time when said microprocessor starts operation and begins to execute machine code.
  • if a data processing device is able to measure time, then this implies that said data processing device is able to find out the chronological order of any two points in time or of any two time events on said time axis.
  • a point in time (whose value is) denoted by 't1' is said to lie chronologically ahead of or behind another point in time denoted by 't2' if t2 - t1 < 0.
  • a point in time (whose value is) denoted by 't1' is said to lie chronologically before another point in time denoted by 't2' if t2 - t1 > 0.
  • time measurement is made possible by letting said microprocessor operate with a clock in order to measure time with multiples (maybe integer or fractional) of the cycle of said clock, where one cycle of said clock can be seen as a logical time unit.
  • a time delay (time interval) is then expressed as a multiple of the cycle of said clock, the cycle time of said clock being equal to f. ex. 12.3 ns.
  • the clock which is used to measure time is often the clock with the shortest cycle time such that said cycle is the smallest time unit (logical or physical) used by a synchronously clocked microprocessor in order to perform instruction scheduling and execution , e.g. to schedule all internal operations and actions necessary to execute a given machine code in a correct way.
  • the scope of the present invention is independent of whether said data processing device is synchronously clocked or whether it uses asynchronous clocking, asynchronous timing or any other operating method or timing method to run and execute machine code.
  • said data processing device has one or more instruction pipelines, each containing several (pipeline) stages, and instructions may each take different amounts of time (in the case of a synchronously clocked microprocessor : several cycles of said clock) to go through the different stages of said instruction pipeline before completing execution.
  • the first pipeline stage is usually a 'prefetch' stage, followed by 'decode' and 'dispatch' stages, the last pipeline stage being often a 'write back' or an 'execution' stage.
  • One often speaks of different phases through which an instruction has to go e.g. 'fetch', 'decode', 'dispatch', 'execute', 'write-back' phases etc., each phase containing several pipeline stages.
  • the execution of an instruction may include the pipeline stages (and the amount of time) which are required to write or to store or to save operands or data results into some memory location, e.g. into a register, into a cache or into main memory.
  • multiples (integer or fractional) of the cycle of said clock can be used as well to specify the depth and the number of the instruction pipeline stages of a microprocessor.
  • the number of pipeline stages that a given instruction has to go through is often called the latency of said instruction.
  • said latency is often given in cycle units of a clock.
  • An instruction is said to begin execution or to be executed on a FU of said data processing device or to have commenced execution on a FU of said processing device if said instruction enters a certain pipeline stage, where said pipeline stage is often the first stage of the execution phase.
  • An instruction is said to have finished execution if it leaves a certain pipeline stage, said pipeline stage being often the last stage of the execution phase.
  • the point in time (on said time axis) at which a given instruction enters a pipeline stage is called the 'entrance point' of said instruction into said pipeline stage.
  • the point in time at which a given instruction leaves a pipeline stage is called the 'exit point' of said instruction out of said pipeline stage.
  • microcode and micro-operations usually differ from pipeline stage to pipeline stage; note that microcode is not to be confused with machine code. Furthermore, an instruction may enter a stage of an instruction pipeline before another instruction has left another stage of the same instruction pipeline.
  • an instruction A1 may enter stage P2 at some point in time t1 while another instruction labeled by B1 enters stage P4 at the same point in time t1.
  • an instruction pipeline of a data processing device is such that instruction A1 may enter a stage before another instruction B1 has left the same stage.
  • the term 'instruction pipeline' is still valid and keeps the same meaning even if instructions are not pipelined.
  • in that case, an instruction pipeline has one single stage.
  • an instruction usually takes one cycle of said clock to go through one stage of an instruction pipeline.
  • Typical depths of instruction pipelines of prior-art microprocessors range from 5 to 15 stages.
  • the Pentium IV processor of Intel Corporation has an instruction pipeline containing 20 stages such that instructions may require up to 20 clock cycles to go through the entire pipeline, whereas the Alpha 21264 processor from Compaq Corporation has only 7 stages.
  • the term 'instruction scheduling and execution' plays an important role for the scope of the present invention.
  • the term 'instruction scheduling and execution' refers to the determination of the points in time on a time axis (as defined above) at which some operations or some time events are occurring (or are taking place) within said data processing device in order to allow for a correct execution of machine code on said data processing device
  • the term 'instruction scheduling and execution' refers to the determination of the points in time on said time axis at which a given instruction of a machine code running on said data processing device enters or leaves one or more stages of an instruction pipeline of said data processing device in order to complete (finish) execution.
  • said points in time can be integer or fractional multiples of a cycle, cycle time or period of a clock.
  • instructions which perform data loading explicitly may be any kind of data load-, pre-fetch, move-, copy-, or transfer-instructions.
  • instructions which perform data loading implicitly are arithmetic/logic instructions which implicitly perform the loading of their operands from some memory (register file) of the memory system but without requiring a separate data load instruction for doing so.
  • an 'ADD R1,R2,R3' instruction which adds the contents of registers R1 and R2 together and stores the result (sum) into register R3 implicitly (automatically) performs the loading of the contents of registers R1 and R2 of some register file such that said register contents (values) are available for computation within a FU on which the ADD-instruction is going to be executed.
  • the term 'data loading' therefore covers both any loading of a datum performed implicitly by an instruction (e.g. an arithmetic/logic instruction) in order to have said datum available as operand of that same instruction and any explicit data loading instruction (e.g. data load-, store-, copy-, transfer-, move-, pre-fetch instructions) which has the aim to make data available as operands of other instructions.
  • the points in time at which the loading of a datum is started and is finished are determined by the scheduling and execution of said data load instruction.
  • the loading of a datum starts as soon as a corresponding instruction enters a certain stage of an instruction pipeline of the data processing device and is finished as soon as said instruction has left a certain stage of a said instruction pipeline.
  • the point in time at which the loading of a datum starts is also called the starting point, and the point in time at which it finishes is also called the end point.
  • said pipeline stages which may define the starting point and the end point of the loading of a datum may be different for each instruction.
  • the loading of a datum by a 'move'-instruction may start as soon as said instruction has left a 'decode' stage, but the loading of a datum by an arithmetic instruction may only start when said instruction has entered an 'issue' stage or when said datum is valid.
  • said starting points and end points are not known exactly before the instructions performing said data loading actually begin execution.
  • said starting points and end points can often be estimated by determining optimistic or as-soon-as-possible (ASAP) instruction schedules.
  • the estimated starting points and end points represent earliest possible starting points and end points.
  • said data loading will not in any case start and end before said earliest possible points in time.
  • Starting points and end points are said to be determined if they can be exactly calculated. Otherwise starting points and end points are said to be estimated.
  • the loading of a datum may only be started if resources are available, otherwise the start of said loading may have to be delayed or postponed until resources are available.
  • resources may be of any kind, e.g. number and type of ALUs, FPUs or load/store units, number and bandwidth of busses, number and type of read/write ports of memories etc ...
  • data loading may also occur autonomously without requiring instructions to start or initiate data loading.
  • said data processing device may have means to load or move data from one cache into another cache of the memory system, without requiring instructions to do so, but only by using some caching strategies such as a least-recently-used (LRU) strategy or such as some random replacement strategy. In this case, said caching strategies decide when the loading of a datum is started.
  • a definition of the points in time at which the loading of a datum is started and finished respectively, which is independent of whether the loading of said datum is initiated by instructions or whether it is performed autonomously, is as follows : the loading of a datum starts as soon as said data processing device applies some memory control signals (e.g. any kind of clock signals, column/row address strobe signals, chip select signals etc.) and/or applies valid data address signals on some read address bus (or some read address ports) of said memory or of said memory system, in order to signal a request for a data access to a particular memory of the memory system or in order to start a data read operation of said datum from said memory; the loading of a datum is finished as soon as said datum is returned by the memory system and/or is valid on some data read bus (or data read ports) of said memory or of said memory system, or when the transmission of said datum to a memory location of the same or of another memory is finished.
  • this definition assumes as well that the loading of a datum can only be started if resources are available, otherwise the start of said loading may have to be delayed until resources are available.
  • the lifetime of a datum denotes a time interval on said time axis.
  • the two points in time (on said time axis) defining the lifetime of a datum are respectively :
  • data lifetimes depend on instruction scheduling and execution as well. If instructions are scheduled using static scheduling (as in the case of DSPs), then most of the data lifetimes can be exactly calculated, e.g. by relying on array data flow analysis. E.g. if instructions are dynamically scheduled (as is the case for super-scalar microprocessors), then most of the data lifetimes can only be estimated by using array data flow analysis or by determining optimistic schedules like ASAP (as soon as possible) schedules. Data may be re-used at multiple and different points in time. Minimal and maximal data lifetimes are determined by the points in time where data are reused for the first and for the last time respectively. For the scope of the present invention, it is not relevant whether data lifetimes represent minimal, maximal or some intermediate lifetimes. Furthermore, it does not matter whether data lifetimes are calculated in an exact way or whether they are approximated or estimated by some method.
  • Fractional numbers are often expressed (specified) in some physical time unit (e.g. in [ns]) while integer numbers are often expressed (specified) in logical time units such as the cycle units of a clock of a synchronously clocked microprocessor. However, fractional numbers may also be expressed (specified) in logical time units.
  • a run time window denotes a time interval where one of the two points in time defining said run-time window is the actual execution state of a machine code running on said data processing device.
  • the actual execution state of a machine code running on said data processing device can be defined in multiple ways. Conceptually speaking, the actual execution state of a machine code running on said data processing device allows one to determine all the instructions (of said given machine code) which have been executed or which have entered a certain instruction pipeline stage since the start of execution of said machine code by said data processing device.
  • the term 'the execution state' always refers to the actual execution state whereas the terms 'an execution state' and 'one or more execution states' refer to execution states which are not further specified or which may lie chronologically before said actual execution state.
  • the execution state of a machine code is defined to be the point in time when the latest instruction of said machine code was fetched from some memory (e.g. instruction cache) or when the latest instruction fetched so far enters a certain stage of an instruction pipeline of said processing device.
  • another possibility consists in defining the execution state in form of an integer number which represents the number of clock cycles of said clock which have elapsed since said program has started execution.
  • whichever definition of the execution state is chosen, it does not affect the scope of the present invention.
  • the definition of the execution state of a machine code running on a data processing device must not be confused with the execution state of the data processing device itself.
  • the term 'execution state' refers to the execution state of a machine code running on a data processing device as defined before.
  • the other point in time defining said run-time window is always a point in time lying chronologically ahead of said execution state.
  • if t_ahead denotes said point in time and t_exec denotes said execution state, then : t_exec < t_ahead.
  • the time difference between the two end points is called the size of the run time window.
  • dynamic instruction scheduling and execution means that the data processing device can often only estimate which instructions will begin execution within a run time window of a certain size and which will not. As soon as the data processing device knows which instructions are likely to begin execution within said run-time window, said data processing device is also able to determine which data are required by said instructions. Usually, this is done by fetching and decoding instructions ahead of the actual execution state by using branch or trace prediction methods. Therefore, in the following we assume that said data processing device has means to do so.
  • Prior-art data loading strategies load data from some memory into the same or into another memory of the memory system or into a load buffer of the data processing device as soon as one or more of the following three conditions are satisfied :
  • said data are known to be used by instructions which are known or estimated to begin execution and/or to finish execution within a run time window of certain size
  • condition (1) is checked by a data processing device after having fetched instructions ahead of the actual execution state according to branch or trace prediction methods and after having decoded said instructions together with their operands.
  • Condition (2) is checked by a data processing device by checking if all data hazards, e.g. read-after-write (RAW) hazards, have been resolved or are satisfied.
  • Data hazards are satisfied if instructions are executed in a chronological order which satisfies the data dependencies and the control dependencies between instructions.
  • data which are used by some instruction denoted by Op1' in form of instruction operands are only valid if said data are not equal to the result of (or are not produced by) another instruction which has not yet finished execution.
  • condition (2) does not necessarily have to be satisfied. Data can well be loaded from some memory and used by other instructions even before all data hazards are known to be satisfied or not. This is called speculative data loading and speculative instruction execution.
  • Condition (3) checks if there are resources available to schedule and execute instructions which use said data as operands.
  • Resources may be of any kind, e.g. number and type of ALUs, FPUs or load/store units, number and bandwidth of busses, number and type of read/write ports of memories etc ... If condition (3) is not satisfied, the loading of a datum may have to be delayed or postponed until resources become available.
  • the main difference between prior-art data loading and just-in-time data loading as based on the present invention consists in the fact that, in prior-art data loading, neither the access time of the memory in which a datum is stored nor the lifetime of said datum is used to determine or to estimate the point in time when the loading of said datum may start, e.g. the point in time when said processing device applies some memory control signals and/or applies valid data address signals on some read address bus (read address ports) of the memory system in order to signal a request for a data access to a particular memory of the memory system or in order to start a data read operation of said data from said memory.
  • the access time may only determine the end point as well as any point in time lying chronologically between the starting point and the end point of the loading of said datum.
  • an arithmetic instruction such as 'ADD R1,R2,R3' loads its operands R1 and R2 from a register file and although the amount of time required for loading said operands out of said register file and hence the end points of the loading of said operands are determined by the access time of said register file, the point in time at which the loading of said operands is started or initiated does not depend on the access time of said register file.
  • just-in-time data loading delays the start of the loading of a datum from a memory as long as the access time for data read of the memory in question still allows the datum to be loaded into the same or into another memory of said memory system or into a load buffer of the data processing device just-in-time, e.g. just before the instruction, which uses said datum as an operand, is calculated or estimated to begin execution.
  • a data processing system containing : a data processing device and a memory system which may comprise one or more of the following memories : one or more register files of said data processing device itself, any data caches (e.g. L0-, L1-, L2- data caches) and a main memory; where a machine code is running (is executed) on said data processing device, where said machine code contains instructions which use data in form of operands and which produce data in form of instruction results (data operation results), where said data processing device has means to perform just-in-time data loading, and where said just-in-time data loading is defined as follows :
  • said data processing device uses the access time of a memory where said datum is stored and/or the lifetime of said datum in order to determine or to estimate a point in time at which the loading of said datum may start, where said point in time may be postponed if resources are not available
  • said data processing device determines all of or part of the data used by instructions (of said machine code) which are known or estimated to begin and/or end execution within a run time window, where the size of said run-time window may vary during execution of said machine code, where said instructions refer to any instructions performing or involving data loading
  • said data processing device uses the access time of a memory where a datum of said data is stored and/or the lifetime of said datum in order to determine or estimate a point in time at which the loading of said datum may start, where said point in time may be postponed if resources are not available
  • just-in-time data loading assumes that said data processing device has means to determine one or more of said memories where a datum is stored.
  • the access time of the memory where a datum is stored may determine the starting point of the loading of said datum itself, e.g. the entrance point of a load instruction into an 'execution' stage of the instruction pipeline.
  • said data processing device determines all of or part of the data used by instructions (of said machine code) which are known or estimated to begin and/or end execution within a run time window, where the size of said run-time window may vary during execution of said machine code, where said instructions refer to instructions performing or involving data loading
  • said data processing device uses the access time of a memory where a datum of said data is stored and/or the lifetime of said datum in order to determine if the loading of said datum may start at a point in time given by the actual execution state or not
  • said instructions of said machine code may refer to any instruction performing data loading, e.g. explicit load instructions, whether they be implicit and/or potential, and/or any integer and floating-point arithmetic/logic instructions.
  • a practical and efficient implementation of the steps 2. to 5. would exploit the following 'buffering' property : the set of data which, for a given execution state t_exec, is used by instructions which are known or estimated to begin or to end execution within a run time window of a given size s_1 is identical to the set of data which, at a later execution state given by t_exec + Δ, is used by instructions whose execution times lie within a run time window of size s_1 − Δ, Δ being some positive amount of time.
  • a data processing device using this buffering property determines only once if a certain datum is used within a given run-time window and when the loading of said datum may start. In other words, as soon as the data processing device knows or estimates that a datum is used within a certain run-time window, it determines or estimates a point in time at which the loading of said datum may start. The processing device then stores said point in time (maybe together with a label specifying to which datum this point in time refers) in some kind of buffer. As soon as the actual execution state has advanced and has reached said point in time, the loading of said datum is started, provided that resources are available. A sketch of such a buffering scheme is given after this list.
  • This buffering property reduces the amount of computation as well as the costs required to implement just-in-time data loading.
  • Steps 3. and 5. have to be performed according to a given procedure.
  • Said procedure determines how said access time is used in order to calculate or to estimate the point in time at which said data loading may start.
  • a procedure of particular interest is as follows :
  • the value of b[1][2] is stored at memory 1 and the value of b[0][2] is stored at memory 0. Since b[0][2] does not belong to SD_0 but belongs to SD_1 and since s_0 ≤ t_r,0 < s_1 (t_r,0 denoting the access time for data read of memory 0), the value of b[0][2] is loaded from memory for the given execution state.
  • b[1][2] is not loaded from memory because, since b[1][2] does not belong to SD_1 but belongs to SD_2, the condition s_1 ≤ t_r,1 < s_2 is not satisfied.
  • the loading of the value of b[1][2] can be delayed by another 10 ns, while still guaranteeing that said value can be loaded in-time f. ex. into a load buffer or into a register file of the data processing device before the instruction using said value as operand begins execution. Therefore the value of b[1][2] is not loaded for the given execution state, but only at a later execution state.
  • just-in-time data loading delays the loading of a datum as long as the access time for data read of the memory in which said datum is stored still allows the datum to be loaded f. ex. into a load buffer of the data processing device just-in-time, e.g. just before the instruction, which uses said datum as an operand, is calculated or estimated to begin execution.
  • the present invention concerns a data processing system containing a data processing device and a memory system and where said data processing device performs just-in-time data loading according to claim 1.
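To make steps 2. to 5. and the 'buffering' property referred to above more tangible, here is a minimal Python sketch of a just-in-time loading controller. It is only an illustration of the idea, not the claimed hardware: the class and function names, the fixed table of read access times and the use of read ports as the only resource are assumptions introduced for this example.

```python
# Illustrative sketch (not the patented hardware): a just-in-time loading
# controller that delays each load as long as the read access time of the
# memory holding the datum still allows the datum to arrive before the
# instruction that uses it is estimated to begin execution.
from dataclasses import dataclass
from itertools import count
import heapq

# Assumed read access times (ns) per memory hierarchy level, mirroring the
# example figures given earlier in the text.
READ_ACCESS_TIME_NS = {0: 1, 1: 2, 2: 4, 3: 6, 4: 10, 5: 30}

@dataclass
class PendingLoad:
    latest_start: float   # latest point in time at which the load may start
    datum: str            # label of the datum to be loaded
    level: int            # memory hierarchy level where the datum is stored

class JitLoadController:
    def __init__(self, window_size: float):
        self.window_size = window_size   # size of the run-time window
        self.buffered = []               # min-heap of (latest_start, tie, PendingLoad)
        self.seen = set()                # each datum is examined only once ('buffering' property)
        self._tie = count()

    def observe(self, t_exec: float, instructions):
        """Find data used by instructions estimated to begin execution within
        the run-time window [t_exec, t_exec + window_size] and buffer, for each
        datum, the latest point in time at which its loading may start."""
        for est_start, operands in instructions:
            if not (t_exec <= est_start <= t_exec + self.window_size):
                continue
            for datum, level in operands:
                if datum in self.seen:
                    continue
                self.seen.add(datum)
                # The read access time of the memory where the datum is stored
                # determines how long the start of the load may be delayed.
                latest = est_start - READ_ACCESS_TIME_NS[level]
                heapq.heappush(self.buffered,
                               (latest, next(self._tie), PendingLoad(latest, datum, level)))

    def issue(self, t_exec: float, free_read_ports: int):
        """Start buffered loads whose latest start point has been reached,
        provided resources (here: memory read ports) are available."""
        started = []
        while self.buffered and free_read_ports > 0 and self.buffered[0][0] <= t_exec:
            _, _, load = heapq.heappop(self.buffered)
            started.append(load)
            free_read_ports -= 1
        return started

# Usage example: one instruction estimated to start at t = 40 ns, using a
# datum 'b[1][2]' stored at hierarchy level 4 (read access time 10 ns).
ctrl = JitLoadController(window_size=50.0)
ctrl.observe(t_exec=0.0, instructions=[(40.0, [("b[1][2]", 4)])])
print(ctrl.issue(t_exec=0.0, free_read_ports=2))    # []  -> the start of the load is still delayed
print(ctrl.issue(t_exec=30.0, free_read_ports=2))   # load starts at 30 ns, just-in-time for 40 ns
```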

Abstract

The present invention describes a data processing system containing a data processing device (in form of a microprocessor, CPU, DSP, ASIP, micro-controller, multi-processor system, array of more or less tightly coupled microprocessors, CPUs, DSPs, ASIPs or micro-controllers) and a memory system and where said data processing device performs just-in-time data loading. The key feature of just-in-time data loading is that the access times of the memories where data are stored determine when the loading of said data may start. In particular, the loading of a datum from some memory is postponed as long as the access time for data read of the memory where said datum is stored still allows the datum to be loaded f. ex. into a load buffer or into a register file of said data processing device just-in-time, e.g. just before the instruction, which uses said datum as an operand, is known or is estimated to begin execution.

Description

A data processing system performing just-in-time data loading.
1. Field of the invention
The present invention relates to the field of architecture design of data processing systems in general. More specifically, the invention is dealing with a data processing system containing a data processing device performing just-in-time data loading.
2. Conventions, definition of terms, terminology
In the context of the present invention, the term 'data processing device' means one of the following terms : microprocessor, central processing unit (CPU), digital signal processor (DSP), micro-controller, any special-purpose processor (e.g. graphics processor) or any application specific instruction set processor (ASIP), whether embedded or stand-alone, any multi-processor system, any array of more or less tightly coupled microprocessors, CPUs, DSPs and/or ASIPs. The meaning of these terms is the same as the one commonly described in the literature. A good reference book on the subject of the present invention is 'Computer Architecture : A Quantitative Approach, J. Hennessy and D. Patterson, Morgan Kaufmann Publishers, 1996'. One of the main characteristics of a data processing device as defined before is the fact that it has an instruction set. In other words, the machine code of a program which is running or executed on said data processing device, contains instructions belonging to said instruction set. Said machine code is usually obtained by compiling the source code of a given program or is obtained by manual writing. The source code of said program is usually written in a high level programming language like C++, Basic, Fortran or Java. A said instruction set may be dynamically reconfigurable during execution (run-time) of said machine code or may be fixed. The scope of the present invention is independent thereof.
It is assumed in the following that said data processing device contains one or more functional units (FUs) representing any kind of arithmetic logic units (ALUs) and/or floating point units (FPUs) and/or load/store units and/or address generation units and/or memory management units and/or any other functional units. Instructions of a machine code running on said data processing device are executed on (by) the FUs of said data processing device. E.g. arithmetic/logic instructions like addition, multiplication, bit-wise OR instructions are executed on (by) ALUs, address mode instructions are executed on (by) address generation units, load/store instructions are usually executed on (by) load/store units etc. The meaning of expressions like 'an instruction is executed' or 'an instruction begins execution' or 'an instruction has finished execution' is explained in more detail below.
The data used (read) by instructions are often called instruction operands. When an instruction is executed on a FU, it performs a number of data operations on its operands and generates data results, also called instruction results. E.g. an 'ADD' instruction (arithmetic addition) uses (reads) two operands (two numbers) and generates a result equal to the sum of the two operands. E.g. a load-, store- or prefetch instruction often uses as operand an explicitly specified memory address or the memory address stored within a register of a register file and returns the content (value) stored at said memory address as the (data) result of said load/store instruction.
As usual, instructions of a machine code being executed on said data processing device have an instruction format. While the definition and meaning of an instruction format is well known, it is recalled that an instruction format is made up of (contains) one or more so-called bit-fields where specific information is stored. Common bit-fields appearing in instruction formats are f. ex. 'opcode'-, 'operand'- and 'destination' bit-fields which specify a particular instruction (e.g. an 'ADD' instruction), its operands and its destination respectively. However, the order, the number and the type of bit-fields as well as the information stored within each bit-field may vary from data processing device to data processing device.
However, in the context of the present invention and in the text that follows, the term 'instruction format' has a slightly broader meaning than the one normally found in the literature and includes instruction formats where no instruction (or data operation) is specified in an 'opcode' bit-field or in any other bit-field of the instruction format. In other words, either one or more 'implicit' instructions or one or more 'implicit and potential' instructions are associated to the data (or operands) specified by the 'operand' bit-fields or by any other bit-fields contained in the instruction format. However, in this case we still speak of an instruction having such an instruction format although there is no instruction explicitly specified by an 'opcode' bit-field or by any other bit-field in said instruction format.
An 'implicit' instruction is defined to be an instruction which is known by the data processing device prior to execution of said instruction and which does not have to be specified by an 'opcode' bit-field or any other bit-field in an instruction format of said instruction. However, as mentioned before, an 'implicit' instruction may well have one or more operands and one or more destinations specified in corresponding bit-fields of said instruction format. It is also possible that an 'implicit' instruction has no operands and no destination specified in any bit-field of the instruction format. E.g., the 'implicit' instruction may be a special-purpose instruction which initializes some hardware circuitry of the data processing device or has some other well defined meaning or purpose.
Always in the context of a machine code running on said data processing device, an 'implicit and potential' instruction is an 'implicit' instruction where the data results or the outcome of instructions which have not yet finished execution decide whether : 1) said 'implicit and potential' instruction shall be executed or not
2) an already commenced execution of said 'implicit and potential' instruction is valid or not or shall be canceled or not
3) the data result of a said 'implicit and potential' instruction which has finished execution is valid or not
In other words, the execution of an 'implicit and potential' instruction is delayed, and is only decided upon once other instructions have finished execution, although said instruction may have already entered an instruction pipeline stage like f. ex. a 'fetch' or 'decode' stage. It is important to see that predicated instructions are special cases of 'implicit and potential' instructions. The concepts of instruction pipeline and instruction pipeline stages are further explained below.
Two small examples shall clarify the meaning of an 'implicit' instruction and an 'implicit and potential' instruction.
E.g. assume a data processing device having an instruction format (among other instruction formats) as based on the present invention and running a machine code containing instructions out of an instruction set of said data processing device. Furthermore, assume that said instruction format contains two 'operand' bit-fields and no other bit-fields. Furthermore, assume that said data processing device has to execute an instruction having said instruction format and that said two bit-fields specify two operands designated f. ex. by 'Op1' and 'Op2'. In this case, an example of an 'implicit instruction' associated to these two operands can be any kind of instruction (or data operation) like the addition or the multiplication of these two operands or the loading of these two operands from a memory or a register file etc., and where said implicit instruction may be specified by convention for the whole time of execution of said machine code or may be specified by another instruction which was executed prior to said instruction. An example of an 'implicit and potential instruction' associated to these two operands is a load- or a move-instruction which is loading the two operands from some memory 1) only after certain instructions not yet executed have been executed and 2) only if the outcome of the data results of said instructions satisfy certain conditions.
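To illustrate this broader notion of 'instruction format', the following sketch decodes a hypothetical 16-bit instruction word that contains only two 'operand' bit-fields and no 'opcode' bit-field; the data operation is an 'implicit' instruction selected either by convention or by a previously executed mode-setting instruction. The field widths, the register file and the operation table are assumptions made for this example only.

```python
# Hypothetical 16-bit instruction word: bits 15..8 = operand 1 register id,
# bits 7..0 = operand 2 register id.  No 'opcode' bit-field is present: the
# data operation applied to the two operands is an 'implicit' instruction.

IMPLICIT_OPS = {                # selectable by a prior mode-setting instruction
    "ADD": lambda a, b: a + b,
    "MUL": lambda a, b: a * b,
}

class ImplicitDecoder:
    def __init__(self, register_file):
        self.regs = register_file
        self.mode = "ADD"                 # implicit instruction chosen by convention

    def set_mode(self, mode: str):
        """Models an earlier instruction that specifies the implicit operation."""
        self.mode = mode

    def execute(self, word: int):
        op1 = (word >> 8) & 0xFF          # 'operand' bit-field 1
        op2 = word & 0xFF                 # 'operand' bit-field 2
        return IMPLICIT_OPS[self.mode](self.regs[op1], self.regs[op2])

regs = {1: 10, 2: 32}
dec = ImplicitDecoder(regs)
print(dec.execute((1 << 8) | 2))   # implicit ADD of R1 and R2 -> 42
dec.set_mode("MUL")
print(dec.execute((1 << 8) | 2))   # same instruction word, now implicit MUL -> 320
```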
In the context of the present invention, the organization and architecture of the memory system and memory hierarchy of said data processing system plays an important role. In the text that follows, the terms 'memory system' and 'memory hierarchy' are defined such as to comprise one or more of the following memories :
(1) one or more register files being part of the data processing device itself
(2) one or more data caches which may be part of the data processing device itself, e.g. L0-, L1- and L2-caches
(3) main memory.
Often, data used by instructions are stored in caches and memories separate from those holding the instructions themselves. However, data caches and instruction caches may be unified such that data and instructions are stored within a same memory. The scope of the present invention is independent thereof.
Furthermore, not mentioned are any kind of 'load buffers' and 'write buffers'. Load and write buffers are usually part of the data processing device itself, e.g. they may also be part of so-called reservation stations used within super-scalar microprocessors. Often, when data are pre-fetched or loaded from main memory or from a data cache, they may first be loaded into said load buffers where they are kept either until they are read by instructions to be executed on the functional units (FUs) of said data processing device or until they are stored in a register file of the data processing device. Similarly, instructions executed on the FUs of the data processing device often write their instruction results first into said write buffers before they are stored into a register file or into another memory of the memory system. Although the load and write buffers are not relevant for the scope of the present invention, they ease the conceptual description. Furthermore, said load and write buffers can be bypassed if required and are fully transparent for the programmer.
In the following, the term 'memory' denotes either a register file, a data cache or main memory. The register file(s), the data caches and main memory each have different access times (latencies) for data read and for data write. Since one or more of these memories may be part of the memory system, we adopt a definition of the access time for data read/write (whether register file, data cache or main memory) which considers the memory system as a unit having data address ports, memory control ports (e.g. any kind of clock signal inputs, column/row address strobe signal inputs, chip select signal inputs etc.) and data read/write ports. The memory system is accessed by said data processing device through these ports. Therefore, the access times for data read/write are defined with respect to the timing of the signals and data applied to these ports. Note that these ports may be those of the memory system or those of a particular memory of the memory system.
We define the access time for data read of a memory of said memory system as the time that elapses between : the point in time when said data processing device applies some memory control signals (e.g. any kind of clock signals, column/row address strobe signals, chip select signals etc.) and/or applies valid data address signals on some read address bus (or some read address ports) of said memory or of said memory system, in order to signal a request for a data access to a particular memory of the memory system or in order to start a data read operation of said data from said memory; and the point in time when said data are returned by the memory system and/or are valid on some data read bus (or data read ports) of said memory or of said memory system, or when the transmission of said data to a memory location of the same or of another memory is finished.
Similarly, the access time for data write of a memory of the memory system (whether register file, data cache or main memory) is defined to be the time that elapses between : the point in time when said data processing device applies some memory control signals (e.g. any kind of clock signals, column/row address strobe signals, chip select signals etc.) and/or applies valid data address signals on some write address bus (or some write address ports) of said memory or of said memory system, in order to signal a request for a data write operation to a particular memory of the memory system or in order to start a data write operation of said data into said memory; and the point in time until which the data to be written must be kept stable and/or valid on some data write bus (or data write ports) of the memory or memory system, or when the writing operation of said data into said memory is finished.
Note that this definition of the access time for data read/write of a memory is independent of any particular embodiment of the memory system and of the memories in question, be it a synchronous static/dynamic RAM (SSRAM, SDRAM) or a packet-switched or packet-oriented memory (see memories produced by Rambus Inc.) .
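As a purely illustrative restatement of these two definitions, the sketch below computes read and write access times from a hypothetical signal trace; the event names and timestamps are invented for the example and do not correspond to any particular memory device.

```python
# Access times as defined above, measured from a (hypothetical) signal trace:
# each entry is (time_ns, event).  The read access time is the time between
# applying address/control signals and the data becoming valid on the read
# bus; the write access time is the time between applying address/control
# signals and the end of the interval during which the write data must be
# kept stable (here: completion of the write).

def access_time(trace, start_event, end_event):
    t_start = next(t for t, e in trace if e == start_event)
    t_end = next(t for t, e in trace if e == end_event)
    return t_end - t_start

read_trace = [(100, "read_address_valid"), (104, "read_data_valid")]
write_trace = [(200, "write_address_valid"), (203, "write_data_stable_end")]

print(access_time(read_trace, "read_address_valid", "read_data_valid"))          # 4 ns
print(access_time(write_trace, "write_address_valid", "write_data_stable_end"))  # 3 ns
```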
As usual, the term 'memory hierarchy' refers to the fact that said memory system has so-called memory hierarchy levels. In other words, each memory within said memory system has a specific memory hierarchy level. Said memory hierarchy level is often determined by the access times for data read/write of said memory. A memory having a certain memory hierarchy level denoted by 'j' is also said to be of or to have memory hierarchy level j, and said memory hierarchy level j is said to be the memory hierarchy level of said memory. Usually, the shorter the access times for data read/write of a memory, the lower the memory hierarchy level of that memory. E.g., an L0 (level-0) cache has shorter access times than an L1-cache, an L1-cache has shorter access times than an L2-cache and so on. Note that a register file is sometimes called an L0-cache.
Therefore, the term 'memory hierarchy level' is based in essence on either an upwards or downwards sorting and labeling of the memory hierarchy levels of a memory system according to the access times for data read/write of the different memories. E.g. if a memory A has a shorter access time for data write than a memory B, then said memory A has a lower (for upwards sorting) or higher (for downwards sorting) hierarchy level than said memory B. Often however, an upwards sorting of the memory hierarchy levels is used, in which case a lower memory hierarchy level implies a shorter (or faster) data access time, often concerning both data read and data write. A datum is said to be stored at memory hierarchy level denoted by 'j' if the memory holding the value of said datum has memory hierarchy level j.
There are many possible ways to define memory hierarchy levels in terms of data access times. A particularly interesting procedure is as follows :
(1) all the memories of the memory system are sorted and labeled upwards according to the sum of their access times for data read and data write
(2) if the sum of the access times for data read and data write is the same for two memories, then they share the same label and hierarchy level
(3) the labels obtained according to (1) and (2) denote the memory hierarchy levels of the memories in question
It is clear that this procedure associates exactly one memory hierarchy level to each memory.
A small example shall clarify the concepts. Assume the following access times for data read/write of the different memories of the memory system :
- register file called 'R1' : access time for data write = 2.5 ns, access time for data read = 2 ns
- register file called 'R2' : access time for data write = 2 ns, access time for data read = 1 ns
- the L0-cache : access time for data write = 3.5 ns, access time for data read = 4 ns
- the L1-cache : access time for data write = 7 ns, access time for data read = 6 ns
- the L2-cache : access time for data write = 15 ns, access time for data read = 10 ns
- the main memory : access time for data write = 35 ns, access time for data read = 30 ns
From there, we deduce the following memory hierarchy levels for the different memories :
- register file called 'R2' : access time for data write + access time for data read = 3 ns => level = 0
- register file called 'R1' : access time for data write + access time for data read = 4.5 ns => level = 1
- the L0-cache : access time for data write + access time for data read = 7.5 ns => level = 2
- the L1-cache : access time for data write + access time for data read = 13 ns => level = 3
- the L2-cache : access time for data write + access time for data read = 25 ns => level = 4
- the main memory : access time for data write + access time for data read = 65 ns => level = 5
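For illustration only, the sorting procedure (1)-(3) can be sketched in a few lines of Python; the dictionary layout, the function name and the printed output are assumptions made for this sketch and are not part of the invention.

```python
def hierarchy_levels(access_times):
    """Assign a memory hierarchy level to each memory by sorting upwards
    on the sum of its access times for data read and data write.
    Memories with equal sums share the same level."""
    sums = {name: rd + wr for name, (rd, wr) in access_times.items()}
    levels = {}
    for level, total in enumerate(sorted(set(sums.values()))):
        for name, s in sums.items():
            if s == total:
                levels[name] = level
    return levels

# Access times (read, write) in nanoseconds, taken from the example above.
memories = {
    "R2":          (1.0,  2.0),
    "R1":          (2.0,  2.5),
    "L0-cache":    (4.0,  3.5),
    "L1-cache":    (6.0,  7.0),
    "L2-cache":    (10.0, 15.0),
    "main memory": (30.0, 35.0),
}

print(hierarchy_levels(memories))
# {'R2': 0, 'R1': 1, 'L0-cache': 2, 'L1-cache': 3, 'L2-cache': 4, 'main memory': 5}
```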
In order to ease the description, in the text that follows the terms 'loading of a datum' and 'data loading' refer also to any reading, pre-fetching, moving, copying and transferring of a datum (of data) from a memory of the memory system into the same or into another memory of the memory system or into a load buffer. Similarly, in the following the verb 'load' stands also for verbs like 'pre-fetch', 'move', 'copy', 'read', 'transfer' and for any other synonym. It is also clear that the loading of a datum assumes that said datum is loaded from some memory location (read address) of said memory and that said loading always involves and assumes at the same time the writing, moving, copying, transferring or storing of said datum into a memory location (write address) of the same or of another memory.
Furthermore, for clarification in the following the term 'datum' denotes the singular of the term 'data'. The terms 'data' and 'datum' refer to their values taken during execution of a machine code running on said data processing device. Furthermore, if a datum is loaded from some memory of the memory system and transmitted, copied or moved to the same or to another memory of the memory system, this means of course that the value of said datum is loaded and transmitted to said data processing device. Several concepts used within the scope of the present invention require and assume that said data processing device has means (hardware circuitry) to measure time by using some method, otherwise machine code that is running on said data processing device may produce wrong data or wrong results. Said terms 'measure time' or 'time measurement' have a very broad meaning and implicitly assume the definition of a time axis and of a time unit such that all points in time, time intervals, time delays or any arbitrary time events refer to said time axis. Said time axis can be defined by starting to measure the time that elapses from a certain point in time onwards, this point in time usually being the point in time when said data processing device starts operation and begins to execute said machine code. Said time unit, which is used to express the length of time intervals and time delays as well as the position on said time axis of points in time or any other time events, may be a physical time unit (e.g. nanosecond) or a logical time unit (e.g. the cycle of a clock used by a synchronously clocked microprocessor).
E.g. synchronously clocked microprocessors use the cycles, the cycle times or the periods of one or more periodic clock signals to measure time. In the text that follows, a clock signal is referred to simply as a clock. However, the cycle of said clock may change over time or during execution of a machine code on said microprocessor, e.g. with the SpeedStep technology used by Intel Corporation in the design of the Pentium IV microprocessor. E.g. asynchronously clocked microprocessors use the travel times required by signals to go through some specific hardware circuitry as time units. In case of a synchronously clocked microprocessor, said time axis can be defined by starting to count and label the clock cycles of said clock from a certain point in time onwards, this point in time usually being the point in time when said microprocessor starts operation and begins to execute machine code.
Therefore, if a data processing device is able to measure time, then this implies that said data processing device is able to find out the chronological order of any two points in time or of any two time events on said time axis. Assuming a positive time axis, a point in time (whose value is) denoted by 't1' is said to lie chronologically ahead of or behind another point in time denoted by 't2' if t2 - t1 < 0. A point in time (whose value is) denoted by 't1' is said to lie chronologically before another point in time denoted by 't2' if t2 - t1 > 0.
In the case of a synchronously clocked microprocessor, time measurement is made possible by letting said microprocessor operate with a clock in order to measure time with multiples (maybe integer or fractional) of the cycle of said clock, where one cycle of said clock can be seen as a logical time unit. E.g., if a time delay (time interval) is equal to 34.4 ns and the cycle time of said clock is equal to 12.3 ns, then said time delay would be equal to 34.4 divided by 12.3, i.e. approximately 2.8 logical time units or 2.8 cycle units. Furthermore, the clock which is used to measure time is often the clock with the shortest cycle time such that said cycle is the smallest time unit (logical or physical) used by a synchronously clocked microprocessor in order to perform instruction scheduling and execution, e.g. to schedule all internal operations and actions necessary to execute a given machine code in a correct way. However the scope of the present invention is independent of whether said data processing device is synchronously clocked or whether it uses asynchronous clocking, asynchronous timing or any other operating method or timing method to run and execute machine code.
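As a minimal numerical sketch of the conversion just described (the function name is purely illustrative):

```python
def to_cycle_units(delay_ns, cycle_time_ns):
    """Express a physical time delay as a (possibly fractional)
    multiple of the clock cycle used as logical time unit."""
    return delay_ns / cycle_time_ns

print(to_cycle_units(34.4, 12.3))  # ~2.7967 logical time units, i.e. about 2.8 cycles
```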
Whatever the clocking scheme or the operating method (synchronous or asynchronous) or the time measurement method used by a data processing device, it is usual that instructions are pipelined. This means that :
1) said data processing device has one or more instruction pipelines which each contain several (pipeline) stages, and instructions may each take different amounts of time (in case of a synchronously clocked microprocessor : several cycles of said clock) to go through the different stages of said instruction pipeline before completing execution. The first pipeline stage is usually a 'prefetch' stage, followed by 'decode' and 'dispatch' stages, the last pipeline stage often being a 'write back' or an 'execution' stage. One often speaks of different phases through which an instruction has to go, e.g. 'fetch', 'decode', 'dispatch', 'execute', 'write-back' phases etc., each phase containing several pipeline stages. Therefore, the execution of an instruction may include the pipeline stages (and the amount of time) which are required to write or to store or to save operands or data results into some memory location, e.g. into a register, into a cache or into main memory. E.g. in the case of a synchronously clocked microprocessor, multiples (integer or fractional) of the cycle of said clock can be used as well to specify the depth and the number of the instruction pipeline stages of a microprocessor. The number of pipeline stages that a given instruction has to go through is often called the latency of said instruction. In case of a synchronously clocked microprocessor, said latency is often given in cycle units of a clock.
An instruction is said to begin execution or to be executed on a FU of said data processing device or to have commenced execution on a FU of said processing device if said instruction enters a certain pipeline stage, where said pipeline stage is often the first stage of the execution phase. An instruction is said to have finished execution if it leaves a certain pipeline stage, said pipeline stage often being the last stage of the execution phase. The point in time (on said time axis) at which a given instruction enters a pipeline stage is called the 'entrance point' of said instruction into said pipeline stage. The point in time at which a given instruction leaves a pipeline stage is called the 'exit point' of said instruction out of said pipeline stage.
From the operating principles of instruction pipelines in general, it is recalled that if an instruction enters a certain pipeline stage then said instruction usually triggers certain operations (also called microoperations) or events internal to the microprocessor which are required to operate and to execute machine code correctly, which are determined by the functionality of said pipeline stage and which are usually part of a so-called microcode of said instruction. Therefore, microcode and microoperations usually differ from pipeline stage to pipeline stage. Note that microcode is not to be confused with machine code. 2) an instruction may enter a stage of an instruction pipeline before another instruction has left another stage of the same instruction pipeline. E.g. if an instruction pipeline has 4 stages denoted by P1,P2,P3,P4, then an instruction A1 may enter stage P2 at some point in time t1 while another instruction labeled B1 enters stage P4 at the same point in time t1. It is also possible that an instruction pipeline of a data processing device is such that instruction A1 may enter a stage before another instruction B1 has left the same stage.
The term instruction pipeline is still valid and keeps the same meaning even if instructions are not pipelined. In this case, an instruction pipeline has one single stage. E.g. in case of a synchronously clocked microprocessor, an instruction usually takes one cycle of said clock to go through one stage of an instruction pipeline. Typical depths of instruction pipelines of prior-art microprocessors range from 5 to 15 stages. E.g. the Pentium IV processor of Intel Corporation has an instruction pipeline containing 20 stages such that instructions may require up to 20 clock cycles to go through the entire pipeline, whereas the Alpha 21264 processor from Compaq Corporation has only 7 stages.
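To make the notions of entrance and exit points concrete, the following Python sketch models a synchronously clocked, in-order pipeline with one cycle per stage and no stalls; this simplified model and all names are assumptions made for illustration, not a required implementation.

```python
def entrance_exit_points(issue_cycle, num_stages):
    """For an instruction entering the first stage at 'issue_cycle', return the
    (entrance, exit) cycle of every pipeline stage, assuming one cycle per stage
    and no stalls (a deliberate simplification)."""
    return [(issue_cycle + s, issue_cycle + s + 1) for s in range(num_stages)]

# A 4-stage pipeline P1..P4: instruction B1 issued at cycle 0, instruction A1
# issued at cycle 2. At cycle 3, A1 enters stage P2 while B1 enters stage P4.
for name, issue in (("B1", 0), ("A1", 2)):
    print(name, entrance_exit_points(issue, 4))
```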
In the following, the term 'instruction scheduling and execution' plays an important role for the scope of the present invention. In order to show the generality of the scope of the present invention, we give first a broader definition of this term : in the context of a data processing device executing some machine code, the term 'instruction scheduling and execution' refers to the determination of the points in time on a time axis (as defined above) at which some operations or some time events are occurring (or are taking place) within said data processing device in order to allow for a correct execution of machine code on said data processing device
A definition of the previous term which is closer to a physical use and implementation of an instruction format as based on the present invention and which is included in and is a special case of the previous definition, is as follows : the term 'instruction scheduling and execution' refers to the determination of the points in time on said time axis at which a given instruction of a machine code running on said data processing device enters or leaves one or more stages of an instruction pipeline of said data processing device in order to complete (finish) execution. In case of a synchronously clocked data processing device, said points in time can be integer or fractional multiples of a cycle, cycle time or period of a clock.
It is recalled that instruction scheduling and execution applies to all instructions of a machine code.
It is important to see that in practice almost all instructions of a machine code perform or initiate data loading, either explicitly or implicitly. E.g. instructions which perform data loading explicitly may be any kind of data load-, pre-fetch-, move-, copy- or transfer-instructions. E.g. instructions which perform data loading implicitly are arithmetic/logic instructions which implicitly perform the loading of their operands from some memory (register file) of the memory system but without requiring a separate data load instruction for doing so. E.g. an 'ADD R1,R2,R3' instruction which adds the contents of registers R1 and R2 together and stores the result (sum) into register R3 implicitly (automatically) performs the loading of the contents of registers R1 and R2 of some register file such that said register contents (values) are available for computation within a FU on which the ADD-instruction is going to be executed.
Therefore, we may distinguish between :
- any loading of a datum performed implicitly by an instruction (e.g. an arithmetic/logic instruction) in order to have said datum available as operand of that same instruction, and
- any explicit data loading instruction (e.g. data load-, store-, copy-, transfer-, move-, pre-fetch instructions) whose aim is to make data available as operands of other instructions
If data loading is performed or initiated by instructions, then the points in time at which the loading of a datum is started and is finished are determined by the scheduling and execution of said data load instruction. In other words, the loading of a datum starts as soon as a corresponding instruction enters a certain stage of an instruction pipeline of the data processing device and is finished as soon as said instruction has left a certain stage of said instruction pipeline. However, the points in time at which the loading of a datum starts (also called starting point) and finishes (also called end point) depend on a specific embodiment (realization) and do not have to coincide exactly with an entrance or an exit point into a pipeline stage. Furthermore, said pipeline stages which may define the starting point and the end point of the loading of a datum may be different for each instruction. E.g. the loading of a datum by a 'move'-instruction may start as soon as said instruction has left a 'decode' stage, but the loading of a datum by an arithmetic instruction may only start when said instruction has entered an 'issue' stage or when said datum is valid.
Furthermore, if dynamic instruction scheduling and execution is used then said starting points and end points are not known exactly before the instructions performing said data loading actually begin execution. However, said starting points and end points can often be estimated by determining optimistic or as-soon-as-possible (ASAP) instruction schedules. In this case the estimated starting points and end points represent earliest possible starting points and end points. In other words, said data loading will not in any case start and end before said earliest possible points in time. Starting points and end points are said to be determined if they can be exactly calculated. Otherwise starting points and end points are said to be estimated. Furthermore, it is assumed in the following that the loading of a datum may only be started if resources are available, otherwise the start of said loading may have to be delayed or postponed until resources are available. In other words, even if some point in time where the loading of a datum may start is determined or estimated according to some method, said point in time may have to be delayed or postponed if resources are not available. Resources may be of any kind, e.g. number and type of ALUs, FPUs or load/store units, number and bandwidth of busses, number and type of read/write ports of memories etc ... However, data loading may also occur autonomously without requiring instructions to start or initiate data loading. E.g. said data processing device may have means to load or move data from one cache into another cache of the memory system, without requiring instructions to do so, but only by using some caching strategies such as a least-recently-used (LRU) strategy or such as some random replacement strategy. In this case, said caching strategies decide when the loading of a datum is started.
A definition of the points in time at which the loading of a datum is started and finished respectively, which is independent of whether the loading of said datum is initiated by instructions or whether it is performed autonomously, is as follows : the loading of a datum
- starts as soon as said data processing device applies some memory control signals (e.g. any kind of clock signals, column/row address strobe signals, chip select signals etc.) and/or applies valid data address signals on some read address bus (or some read address ports) of a memory or of the memory system in order to signal a request for a data access to said memory of the memory system or in order to start a data read operation of said data from said memory, and
- is finished as soon as said data are returned by the memory system and/or are valid on some data read bus (or data read ports) of said memory or of said memory system or when the transmission of said data to a memory location of the same or of another memory is finished.
As before, this definition assumes as well that the loading of a datum can only be started if resources are available, otherwise the start of said loading may have to be delayed until resources are available.
Another important concept in the context of the present invention is that of 'data lifetime'. The lifetime of a datum denotes a time interval on said time axis. The two points in time (on said time axis) defining the lifetime of a datum are respectively :
(1) a first point in time at which an instruction has finished execution on a FU of the data processing device and where the result of said instruction is equal to (the value of) said datum
(2) a second point in time lying chronologically ahead of said first point in time and at which (the value of) said datum is used again by one or more instructions being executed on a FU of the data processing device
Clearly, data lifetimes depend on instruction scheduling and execution as well. If instructions are scheduled using static scheduling (as in the case of DSPs), then most of the data lifetimes can be exactly calculated, e.g. by relying on array data flow analysis. If instructions are dynamically scheduled (as is the case for super-scalar microprocessors), then most of the data lifetimes can only be estimated by using array data flow analysis or by determining optimistic schedules like ASAP (as soon as possible) schedules. Data may be re-used at multiple and different points in time. Minimal and maximal data lifetimes are determined by the points in time where data are reused for the first and for the last time respectively. For the scope of the present invention, it is not relevant whether data lifetimes represent minimal, maximal or some intermediate lifetimes. Furthermore, it does not matter whether data lifetimes are calculated in an exact way or whether they are approximated or estimated by some method.
Data lifetimes are given in some time unit. As mentioned before, this can be done f. ex. in form of fractional numbers or in form of integer numbers. Fractional numbers are often expressed (specified) in some physical time unit (e.g. in [ns]) while integer numbers are often expressed (specified) in logical time units such as the cycle units of a clock of a synchronously clocked microprocessor. However, fractional numbers may also be expressed (specified) in logical time units.
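As a sketch only, minimal and maximal data lifetimes as defined above can be derived from a (calculated or estimated) schedule as follows; the function name and the argument layout are assumptions made for this illustration.

```python
def data_lifetimes(produced_at, reuse_times):
    """Return (minimal_lifetime, maximal_lifetime) of a datum.

    produced_at  -- point in time (on the time axis) at which the instruction
                    producing the datum finishes execution
    reuse_times  -- points in time at which the datum is used again
    """
    first_reuse = min(reuse_times)
    last_reuse = max(reuse_times)
    return first_reuse - produced_at, last_reuse - produced_at

# A datum produced at t = 100 ns and reused at 140 ns, 180 ns and 260 ns:
print(data_lifetimes(100.0, [140.0, 180.0, 260.0]))  # (40.0, 160.0)
```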
The concept and the term of a 'run time window' is important in the context of the present invention and is now further explained. A run time window denotes a time interval where one of the two points in time defining said run-time window is the actual execution state of a machine code running on said data processing device. The actual execution state of a machine code running on said data processing device can be defined in multiple ways. Conceptually speaking, the actual execution state of a machine code running on said data processing device makes it possible to determine all the instructions (of said given machine code) which have been executed or which have entered a certain instruction pipeline stage since the start of execution of said machine code by said data processing device. In the text that follows, unless explicitly mentioned otherwise the term 'the execution state' always refers to the actual execution state whereas the terms 'an execution state' and 'one or more execution states' refer to execution states which are not further specified or which may lie chronologically before said actual execution state.
Often, the execution state of a machine code is defined to be the point in time when the latest instruction of said machine code was fetched from some memory (e.g. instruction cache) or when the latest instruction fetched so far enters a certain stage of an instruction pipeline of said processing device. If said data processing device operates with some clock, then another possibility consists in defining the execution state in form of an integer number which represents the number of clock cycles of said clock which have elapsed since said program has started execution. Whatever definition of the execution state is taken, it does not affect the scope of the present invention. Note that the execution state of a machine code running on a data processing device is not to be confused with the execution state of the data processing device itself. In the following, the term 'execution state' refers to the execution state of a machine code running on a data processing device as defined before.
The other point in time defining said run-time window is always a point in time lying chronologically ahead of said execution state. In other words, if t_ahead denotes said point in time and t_exec denotes said execution state, then t_exec < t_ahead. In other words, some positive amount of time elapses before the execution state reaches (is equal to) this end point. The time difference between the two end points is called the size of the run time window. E.g. if t_ahead = t_exec + 350 ns, then the size s = t_ahead - t_exec = 350 ns.
As mentioned before, dynamic instruction scheduling and execution means that the data processing device can often only estimate which instructions will begin execution within a run time window of a certain size and which will not. As soon as the data processing device knows which instructions are likely to begin execution within said run-time window, said data processing device is also able to determine which data are required by said instructions. Usually, this is done by fetching and decoding instructions ahead of the actual execution state by using branch or trace prediction methods. Therefore, in the following we assume that said data processing device has means to do so.
3. Prior Art
Prior-art data loading strategies load data from some memory into the same or into another memory of the memory system or into a load buffer of the data processing device as soon as one or more of the following three conditions are satisfied :
(1) said data are known to be used by instructions which are known or estimated to begin execution and/or to finish execution within a run time window of certain size
(2) said data are valid
(3) computing resources are available
These 3 conditions are now discussed in further detail in order to show the difference with the present invention.
As mentioned in section 2, condition (1) is checked by a data processing device after having fetched instructions ahead of the actual execution state according to branch or trace prediction methods and after having decoded said instructions together with their operands.
Condition (2) is checked by a data processing device by checking whether all data hazards, e.g. read-after-write (RAW) hazards, have resolved or are satisfied. Data hazards are satisfied if instructions are executed in a chronological order which satisfies the data dependencies and the control dependencies between instructions. E.g., data which are used by some instruction denoted by 'Op1' in form of instruction operands are only valid if said data are not equal to the result of (or are not produced by) another instruction which has not yet finished execution. As mentioned before, condition (2) does not necessarily have to be satisfied. Data may well be loaded from some memory and used by other instructions even before it is known whether all data hazards are satisfied or not. This is called speculative data loading and speculative instruction execution.
Condition (3) checks if there are resources available to schedule and execute instructions which use said data as operands. Resources may be of any kind, e.g. number and type of ALUs, FPUs or load/store units, number and bandwidth of busses, number and type of read/write ports of memories etc ... If condition (3) is not satisfied, the loading of a datum may have to be delayed or postponed until resources become available.
The main difference between prior-art data loading and just-in-time data loading as based on the present invention consists in the fact that neither the access time of the memory in which a datum is stored nor the lifetime of a datum are used to determine or to estimate the point in time when the loading of said datum may start, e.g. the point in time when said processing device applies some memory control signals and/or applies valid data address signals on some read address bus (read address ports) of the memory system in order to signal a request for a data access to a particular memory of the memory system or in order to start a data read operation of said data from said memory. In prior-art data loading, the access time may only determine the end point as well as any point in time lying chronologically between the starting point and the end point of the loading of said datum. E.g. although an arithmetic instruction such as 'ADD R1,R2,R3' loads its operands R1 and R2 from a register file and although the amount of time required for loading said operands out of said register file and hence the end points of the loading of said operands are determined by the access time of said register file, the point in time at which the loading of said operands is started or initiated does not depend on the access time of said register file.
In a particular embodiment, just-in-time data loading delays the start of the loading of a datum from a memory as long as the access time for data read of the memory in question still allows the datum to be loaded into the same or into another memory of said memory system or into a load buffer of the data processing device just-in-time, e.g. just before the instruction, which uses said datum as an operand, is calculated or estimated to begin execution.
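A minimal sketch of this particular embodiment, under the assumption that the device has a (calculated or estimated) point in time at which the consuming instruction begins execution; the helper name and the simple subtraction model are illustrative assumptions, not a prescribed circuit.

```python
def latest_load_start(use_time, read_access_time, current_time):
    """Return the latest point in time at which the loading of a datum may
    start so that it is still available just-in-time, i.e. just before the
    instruction using it as an operand is calculated or estimated to begin
    execution. The start may never lie before the current execution state."""
    return max(current_time, use_time - read_access_time)

# Operand needed at t = 500 ns, stored in a memory with a 20 ns read access
# time, current execution state at t = 430 ns: loading may be delayed to 480 ns.
print(latest_load_start(500.0, 20.0, 430.0))  # 480.0
```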
4. Description of the invention
The main aspects of the present invention are now described by referring to a data processing system containing : a data processing device, a memory system which may comprise one or more of the following memories : one or more register files of said data processing device itself, any data caches, e.g. L0-, L1-, L2- data caches, a main memory, where a machine code is running (is executed) on said data processing device, where said machine code contains instructions which use data in form of operands and which produce data in form of instruction results (data operation results), where said data processing device has means to perform just-in-time data loading, where said just-in-time data loading is defined as follows :
1. given a datum which is used by one or more instructions during the execution of said machine code; said data processing device uses the access time of a memory where said datum is stored and/or the lifetime of said datum in order to determine or to estimate a point in time at which the loading of said datum may start, where said point in time may be postponed if resources are not available
Note that this definition applies both to data loading performed autonomously and to data loading performed or initiated by instructions (as explained in section 2). A definition of just-in-time data loading which applies only to data loading performed or initiated by instructions is as follows :
2. said data processing device determines all of or part of the data used by instructions (of said machine code) which are known or estimated to begin and/or end execution within a run time window, where the size of said run-time window may vary during execution of said machine code, where said instructions refer to any instructions performing or involving data loading
3. said data processing device uses the access time of a memory where a datum of said data is stored and/or the lifetime of said datum in order to determine or estimate a point in time at which the loading of said datum may start, where said point in time may be postponed if resources are not available
Since data can be stored multiple times in different memories at different memory hierarchy levels, just-in-time data loading assumes that said data processing device has means to determine one or more of said memories where a datum is stored.
As explained in section 2, it is no contradiction that the access time of the memory where a datum is stored may determine the starting point of the loading of said datum itself, e.g. the entrance point of a load instruction into an 'execution' stage of the instruction pipeline.
As explained in section 2, even if it is not explicitly mentioned it is assumed in the following that the loading of a datum may only be started if resources are available, otherwise the start of said loading may have to be delayed until resources are available. In other words, even if said data processing device has determined or estimated some point in time where the loading of a datum may start, said point in time may have to be postponed if resources are not available.
A particular embodiment of just-in-time data loading, as defined in steps 2. and 3., which is closer to a practical implementation is as follows :
4. said data processing device determines all of or part of the data used by instructions (of said machine code) which are known or estimated to begin and/or end execution within a run time window, where the size of said run-time window may vary during execution of said machine code, where said instructions refer to instructions performing or involving data loading
5. said data processing device uses the access time of a memory where a datum of said data is stored and/or the lifetime of said datum in order to determine if the loading of said datum may start at a point in time given by the actual execution state or not
As mentioned in steps 2. and 4., said instructions of said machine code may refer to any instruction performing data loading, whether explicitly, implicitly and/or potentially, e.g. explicit data load instructions and/or any integer and floating-point arithmetic/logic instructions. A practical and efficient implementation of the steps 2. to 5. would exploit the following 'buffering' property : the set of data which, for a given execution state t_exec, is used by instructions which are known or estimated to begin or to end execution within a run time window of a given size s_1 is identical to the set of data which, at a later execution state given by t_exec + δ, is used by instructions whose execution times lie within a run time window of size s_1 - δ, δ being some positive amount of time
A data processing device using this buffering property determines only once if a certain datum is used within a given run-time window and when the loading of said datum may start. In other words, as soon as the data processing device knows or estimates that a datum is used within a certain run-time window, it determines or estimates a point in time at which the loading of said datum may start. The processing device then stores said point in time (maybe together with a label specifying to which datum this point in time refers) in some kind of buffer. As soon as the actual execution state has advanced and has reached said point in time, the loading of said datum is started, provided that resources are available. This buffering property reduces the amount of computation as well as the costs required to implement just-in-time data loading.
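The buffering property could be exploited roughly as sketched below: once a datum is known or estimated to be used within a run-time window, its load start point is determined once, buffered together with a label, and acted upon when the execution state reaches it (resources permitting). All class and method names are illustrative assumptions.

```python
import heapq

class JustInTimeLoadBuffer:
    """Buffer of (start_time, datum_label) pairs exploiting the buffering
    property: each datum is examined only once, and its load is issued when
    the execution state reaches the buffered start point (resources permitting)."""

    def __init__(self):
        self._pending = []   # min-heap ordered by buffered start time
        self._seen = set()

    def schedule(self, datum_label, start_time):
        if datum_label not in self._seen:      # determine only once per datum
            self._seen.add(datum_label)
            heapq.heappush(self._pending, (start_time, datum_label))

    def loads_to_issue(self, execution_state, resources_available=True):
        """Return the labels of all data whose loading may start now."""
        ready = []
        while (self._pending and resources_available
               and self._pending[0][0] <= execution_state):
            _, label = heapq.heappop(self._pending)
            ready.append(label)
        return ready

buf = JustInTimeLoadBuffer()
buf.schedule("b[0][2]", 430.0)
buf.schedule("b[1][2]", 480.0)
print(buf.loads_to_issue(450.0))  # ['b[0][2]'] -- b[1][2] is still delayed
```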
Steps 3. and 5. have to be performed according to a given procedure. Said procedure determines how said access time is used in order to calculate or to estimate the point in time at which said data loading may start. A procedure of particular interest is as follows :
Let M define a positive integer and let tr_K denote the longest of all access times for data read of all memories of said memory system. Let s_0, s_1, ..., s_{M-1} define a sequence of run time windows whose sizes satisfy :
(1) for all 1 ≤ j ≤ M-1 : s_j ≥ s_{j-1}
(2) s_{M-1} ≥ tr_K
For all j satisfying 0 ≤ j ≤ M-1 : let SD_j denote all of or part of the data required (used) by instructions which are known or estimated to begin and/or to end execution within the run time window of size s_j. The loading of a datum of said set of data SD_j is started at the point in time given by the actual execution state if and only if the access time for data read (denoted by tr) of a memory where said datum is stored lies within a narrow time interval centered around the point in time where said datum is calculated or estimated to be required (used) by instructions of said machine code, in other words if and only if the following conditions are satisfied :
(3) there exists k such that said datum does not belong to SD_k but belongs to SD_{k+1}
(4) s_k ≤ tr ≤ s_{k+1}
A small example shall illustrate how this procedure works in practice. To this end, we consider a source program containing some indexed variable b[i][k], i and k denoting the indices. Suppose we want to determine, for a given execution state of the machine code of said source program, if some data values of variable b have to be loaded and transmitted from some memory to the data processing device. Suppose we choose M = 3 and s_0 = 1 ns, s_1 = 15 ns, s_2 = 30 ns. Suppose we have three memories labeled 0, 1 and 2, having access times for data read of tr_0 = 1 ns, tr_1 = 5 ns, tr_2 = 20 ns respectively. Suppose that the data processing device finds out that SD_0 = { }, SD_1 = { b[0][2] }, SD_2 = { b[0][2], b[1][2] }. Suppose that the value of b[1][2] is stored at memory 1 and the value of b[0][2] is stored at memory 0. Since b[0][2] does not belong to SD_0 but belongs to SD_1 and since s_0 ≤ tr_0 ≤ s_1, the value of b[0][2] is loaded from memory for the given execution state. However, b[1][2] is not loaded from memory because, although b[1][2] does not belong to SD_1 but belongs to SD_2, the condition s_1 ≤ tr_1 ≤ s_2 is not satisfied. In fact, since s_1 - tr_1 = 10 ns, the loading of the value of b[1][2] can be delayed by another 10 ns, while still guaranteeing that said value can be loaded in time f. ex. into a load buffer or into a register file of the data processing device before the instruction using said value as operand begins execution. Therefore the value of b[1][2] is not loaded for the given execution state, but only at a later execution state !
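The numerical example can be replayed with the following sketch of conditions (3) and (4); the set representation, the function name and the way the SD_j sets are supplied are assumptions, only the decision rule itself is taken from the procedure above.

```python
def start_loading_now(datum, sd_sets, window_sizes, read_access_time):
    """Decide whether the loading of 'datum' may start at the actual
    execution state, according to conditions (3) and (4):
      (3) there exists k with datum not in SD_k but in SD_{k+1}
      (4) s_k <= tr <= s_{k+1}, tr being the access time for data read of
          the memory where the datum is stored."""
    for k in range(len(sd_sets) - 1):
        if datum not in sd_sets[k] and datum in sd_sets[k + 1]:
            return window_sizes[k] <= read_access_time <= window_sizes[k + 1]
    return False

# M = 3, s_0 = 1 ns, s_1 = 15 ns, s_2 = 30 ns
s = [1.0, 15.0, 30.0]
SD = [set(), {"b[0][2]"}, {"b[0][2]", "b[1][2]"}]

print(start_loading_now("b[0][2]", SD, s, read_access_time=1.0))  # True  (loaded now)
print(start_loading_now("b[1][2]", SD, s, read_access_time=5.0))  # False (delayed ~10 ns)
```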
As can be seen from this example, the key feature behind just-in-time data loading is the fact that the access time of a memory at which a datum is stored determines the point in time at which the loading of said datum may start. In particular, the loading of data from the memory system is delayed as much as possible. In other words, just-in-time data loading delays the loading of a datum as long as the access time for data read of the memory in which said datum is stored still allows the datum to be loaded f. ex. into a load buffer of the data processing device just-in-time, e.g. just before the instruction, which uses said datum as an operand, is calculated or estimated to begin execution.
5. Summary of the invention
The present invention concerns a data processing system containing a data processing device and a memory system and where said data processing device performs just-in-time data loading according to claim 1.

Claims

What is claimed is :
1. a data processing system containing : a data processing device, a memory system which may comprise one or more of the following memories : one or more register files of said data processing device any data caches a main memory, where a machine code is executed on said data processing device, where said machine code contains instructions which use data in form of operands and which produce data in form of instruction results, where said data processing device has means to perform just-in-time data loading, where said just-in-time data loading is defined as follows : a. given a datum which is used by one or more instructions during the execution of said machine code; said data processing device uses the access time of a memory where said datum is stored and/or the lifetime of said datum in order to determine or to estimate a point in time at which the loading of said datum may start, where said point in time may be postponed if resources are not available
2. a data processing system containing : a data processing device, a memory system which may comprise one or more of the following memories : one or more register files of said data processing device any data caches a main memory, where a machine code is executed on said data processing device, where said machine code contains instructions which use data in form of operands and which produce data in form of instruction results, where said data processing device has means to perform just-in-time data loading, where said just-in-time data loading is defined as follows : b. said data processing device determines all of or part of the data used by instructions which are known or estimated to begin and/or to end execution within a run time window, where the size of said run-time window may vary during execution of said machine code c. said data processing device uses the access time of a memory where a datum of said data is stored and/or the lifetime of said datum in order to determine or to estimate a point in time at which the loading of said datum may start, where said point in time may be postponed if resources are not available
3. a data processing system containing : a data processing device, a memory system which may comprise one or more of the following memories : one or more register files of said data processing device any data caches a main memory, where a machine code is executed on said data processing device, where said machine code contains instructions which use data in form of operands and which produce data in form of instruction results, where said data processing device has means to perform just-in-time data loading, where said just-in-time data loading is defined as follows : a. said data processing device determines all of or part of the data used by instructions which are known or estimated to begin and/or to end execution within a run time window, where the size of said run-time window may vary during execution of said machine code b. said data processing device uses the access time of a memory where a datum of said data is stored and/or the lifetime of said datum in order to determine or to estimate if the loading of said datum may start at a point in time given by the actual execution state or not
4. a data processing system containing : a data processing device, a memory system which may comprise one or more of the following memories : one or more register files of said data processing device any data caches a main memory, where a machine code is executed on said data processing device, where said machine code contains instructions which use data in form of operands and which produce data in form of instruction results, where said data processing device has means to perform just-in-time data loading, where said just-in-time data loading is defined as follows : a. Let M define a positive integer and let tr_K denote the longest of all access times for data read of all memories of said memory system. For all j satisfying 0 ≤ j ≤ M-1 : let SD_j denote all of or part of the data used by instructions which are known or estimated to begin and/or to end execution within the run time window of size s_j. Let s_0, s_1, ..., s_{M-1} define a sequence of run time windows whose sizes satisfy :
(i) for all 1 ≤ j ≤ M-1 : s_j ≥ s_{j-1}
(ii) s_{M-1} ≥ tr_K
b. The loading of a datum of said set of data SD_j is started at the point in time given by the actual execution state if and only if all of the following conditions are satisfied :
(iii) there exists k such that 0 ≤ k ≤ M-1 and such that said datum does not belong to SD_k but belongs to SD_{k+1}
(iv) s_k < tr ≤ s_{k+1}, tr denoting the access time for data read of a memory where said datum is stored
5. A data processing system as claimed in claim 2, where as soon as the data processing device knows that a datum is used by instructions which are known or estimated to begin and/or to end execution within a run time window of a certain size, said data processing device uses the access time of a memory where said datum is stored and/or the lifetime of said datum in order to determine or to estimate a point in time at which the loading of said datum may start. The processing device then stores said point in time, maybe together with a label specifying to which datum this point in time refers, in some kind of buffer. As soon as the actual execution state has advanced and has reached said point in time, the loading of said datum may start.
6. A data processing system as claimed in claim 3, where as soon as the data processing device knows that a datum is used by instructions which are known or estimated to begin and/or to end execution within a run time window of a certain size, said data processing device uses the access time of a memory where said datum is stored and/or the lifetime of said datum in order to determine or to estimate a point in time at which the loading of said datum may start. The processing device then stores said point in time, maybe together with a label specifying to which datum this point in time refers, in some kind of buffer. As soon as the actual execution state has advanced and has reached said point in time, the loading of said datum may start.
7. A data processing system as claimed in claim 4, where as soon as the data processing device knows that a datum is used by instructions which are known or estimated to begin and/or to end execution within a run time window of a certain size, said data processing device uses the access time of a memory where said datum is stored and/or the lifetime of said datum in order to determine or to estimate a point in time at which the loading of said datum may start. The processing device then stores said point in time, maybe together with a label specifying to which datum this point in time refers, in some kind of buffer. As soon as the actual execution state has advanced and has reached said point in time, the loading of said datum may start.
8. a data processing system as claimed in claim 1, where said memory system comprises one or more register files, one or more L1-caches, one or more L2-caches, a main memory
9. a data processing system as claimed in claim 2, where said memory system comprises one or more register files, one or more L1-caches, one or more L2-caches, a main memory
10. a data processing system as claimed in claim 3, where said memory system comprises one or more register files, one or more L1-caches, one or more L2-caches, a main memory
11. a data processing system as claimed in claim 4, where said memory system comprises one or more register files, one or more L1-caches, one or more L2-caches, a main memory
12. a data processing system as claimed in claim 5, where said memory system comprises one or more register files, one or more L1-caches, one or more L2-caches, a main memory
13. a data processing system as claimed in claim 6, where said memory system comprises one or more register files, one or more L1-caches, one or more L2-caches, a main memory
14. a data processing system as claimed in claim 7, where said memory system comprises one or more register files, one or more L1-caches, one or more L2-caches, a main memory
PCT/EP2001/008496 2000-11-17 2001-07-23 A data processing system performing just-in-time data loading WO2002041142A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP01978262A EP1410176A1 (en) 2000-11-17 2001-07-23 A data processing system performing just-in-time data loading

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EPPCT/EP00/11458 2000-11-17
EP0011458 2000-11-17

Publications (1)

Publication Number Publication Date
WO2002041142A1 true WO2002041142A1 (en) 2002-05-23

Family

ID=8164163

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2001/008496 WO2002041142A1 (en) 2000-11-17 2001-07-23 A data processing system performing just-in-time data loading

Country Status (1)

Country Link
WO (1) WO2002041142A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5664193A (en) * 1995-11-17 1997-09-02 Sun Microsystems, Inc. Method and apparatus for automatic selection of the load latency to be used in modulo scheduling in an optimizing compiler
US6092180A (en) * 1997-11-26 2000-07-18 Digital Equipment Corporation Method for measuring latencies by randomly selected sampling of the instructions while the instruction are executed


Similar Documents

Publication Publication Date Title
US7487340B2 (en) Local and global branch prediction information storage
KR100958705B1 (en) System and method for linking speculative results of load operations to register values
US7814469B2 (en) Speculative multi-threading for instruction prefetch and/or trace pre-build
US20160098279A1 (en) Method and apparatus for segmented sequential storage
US5848269A (en) Branch predicting mechanism for enhancing accuracy in branch prediction by reference to data
US20070288733A1 (en) Early Conditional Branch Resolution
US20120023314A1 (en) Paired execution scheduling of dependent micro-operations
Schlansker et al. EPIC: An architecture for instruction-level parallel processors
EP1296230A2 (en) Instruction issuing in the presence of load misses
US20020124155A1 (en) Processor architecture
US20020087794A1 (en) Apparatus and method for speculative prefetching after data cache misses
EP1296229A2 (en) Scoreboarding mechanism in a pipeline that includes replays and redirects
JP2001175473A (en) Method and device for actualizing execution predicate in computer-processing system
WO2004001584A2 (en) A method for executing structured symbolic machine code on a microprocessor
US8301871B2 (en) Predicated issue for conditional branch instructions
US20050251621A1 (en) Method for realizing autonomous load/store by using symbolic machine code
WO2002008894A1 (en) A microprocessor having an instruction format containing timing information
US20070288732A1 (en) Hybrid Branch Prediction Scheme
US20070288731A1 (en) Dual Path Issue for Conditional Branch Instructions
US20080162908A1 (en) structure for early conditional branch resolution
US20070288734A1 (en) Double-Width Instruction Queue for Instruction Execution
US6092184A (en) Parallel processing of pipelined instructions having register dependencies
US7484075B2 (en) Method and apparatus for providing fast remote register access in a clustered VLIW processor using partitioned register files
US5737562A (en) CPU pipeline having queuing stage to facilitate branch instructions
WO2004099978A2 (en) Apparatus and method to identify data-speculative operations in microprocessor

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): JP US

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2001978262

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2001978262

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP

WWW Wipo information: withdrawn in national office

Ref document number: 2001978262

Country of ref document: EP