WO2008067918A1 - Methods and devices for enhancing caching
- Publication number: WO2008067918A1 (application PCT/EP2007/010175)
- Authority: WO (WIPO, PCT)
- Prior art keywords: cache, memory, processor, request, data
- Prior art date: 2006-12-05
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0862—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
Definitions
- the present invention relates to cache-based memory and processor systems. More specifically, the described subject matter relates to devices for and methods of enhanced cache performance comprising pre-fetch.
- Modern processors often need memory systems with memories in several layers where the memories closest to the processor typically are small and fast, often referred to as caches.
- the memories in the layers further from the processor are typically larger and slower.
- a reason for this layering of different memories is that fast memories are extremely costly compared to slower memories and thus it would be very costly and difficult to have a whole program fit in a tightly connected and fast memory.
- By tightly connected is meant a connection with a sufficiently large bandwidth that the connection does not act as a bottleneck. It is well known that if the cache sizes are sufficiently large, the performance degradation for the processor system is quite small even though only a smaller portion of the memories are fast.
- a processor system can have caches for instructions and/or data. Instruction caches typically exist only for read (or fill) operations while data caches may exist for both read and write operations.
- a typical instruction cache operation may e.g. be described as:
- the processor executes a program by issuing a program address and tries to fetch the corresponding instruction from the cache memory. If an instruction corresponding to the issued address resides in the cache, the processor fetches the instruction. The processor then continues with execution by generating another address. If the instruction does not reside in the cache a so-called cache miss happens and the cache circuit then generates a number of memory reads to upper layer memories in order to fill the cache with correct content where the latter typically is referred to as a cache line fill. Once the correct content is in the cache, execution resumes. Obviously, it is important to try to minimize the number of cache misses since this is the kind of event that degrades the performance.
- the cost of a cache miss is the number of clock cycles it takes to fill up the cache line where this time mostly depends on the size of the cache line and the type of memory it is fetched from.
- the cost of a cache miss may also be a use of bandwidth for a bus or the like connecting the processor or the cache memory to the memory where the information is retrieved from, especially if the bus or the like is also used by other functions or circuits.
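- To make this cost model concrete, the small C sketch below computes the miss penalty as one memory setup plus one transfer per word of the line; all cycle constants are illustrative assumptions, not figures from this document:

```c
#include <stdio.h>

/* Hypothetical cost model for a cache line fill: one memory setup
 * penalty plus one transfer per word of the line. All constants are
 * illustrative assumptions. */
enum {
    LINE_BYTES      = 32,  /* cache line size in bytes          */
    WORD_BYTES      = 2,   /* 16-bit memory interface           */
    SETUP_CYCLES    = 12,  /* memory setup time, in bus cycles  */
    CYCLES_PER_WORD = 2    /* transfer cost per word, in cycles */
};

static unsigned line_fill_cost(void)
{
    unsigned words = LINE_BYTES / WORD_BYTES;
    return SETUP_CYCLES + words * CYCLES_PER_WORD;
}

int main(void)
{
    printf("cost of one cache line fill: %u bus cycles\n",
           line_fill_cost());
    return 0;
}
```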
- As an example, consider a DSP (Digital Signal Processor) with the following memory setup: a 32 kByte instruction cache, 32 kBytes of tightly coupled program memory, and 64 kBytes of tightly coupled data memory.
- a DSP subsystem comprising a DSP (101), comprising a DSP core (102), an instruction cache (103), a tightly coupled instruction memory (104) and a tightly coupled data memory (105), both being internal memories of the DSP (101).
- the DSP subsystem (100) is typically located on an ASIC (Application Specific Integrated Circuit).
- the DSP subsystem (100) typically further comprises an additional internal (to the DSP subsystem (100) but external to the DSP (101)) memory (106), which typically is user configurable with respect to size and which may comprise SRAM memory and ROM memory of several hundreds of kilobytes, that also may be located on the ASIC comprising the DSP subsystem (100).
- the main content of the additional internal memory (106) will typically be the relevant program code. Execution is done from the instruction cache (103) and cache fills are typically done from the additional internal memory (106), e.g. either from the SRAM or ROM memory.
- the DSP subsystem (100) will also typically comprise one or more peripherals (107) controlled by the DSP core (102). Examples of such peripherals (107) are timers, interrupt controllers, etc.
- a DSP subsystem may vary in certain details and layout but will function in a similar fashion.
- DSP solutions are quite expensive and occupy a relatively large area on the ASIC compared to the other, function-specific circuits (e.g. for digital baseband ASIC solutions) that the DSP (101) and DSP subsystem (100) are to be used in connection with.
- the internal memories (104, 105) typically need to be about the same size as the DSP core (102) itself for proper or useful performance, which is very expensive due to the cost of such internal and tightly coupled memories (104, 105).
- patent specification US 6,981,127 discloses a processor consuming a variable number of bytes in each clock cycle into the processor's fetch unit.
- a pre-fetch unit consuming a fixed amount of data is used.
- the pre-fetch buffer requires address decoding logic, increasing the complexity of the design.
- Patent application US 2001/018735 discloses an alternative to traditional instruction cache structure where a whole code line is pre-fetched from external memory and stored in buffers. If the code is linear, i.e. it does not contain loops, the next coming instructions will already be fetched. If the code is not linear, i.e. contains loops, this can be handled efficiently as long as the loop is small enough to be contained in the buffers. However, if the software comprising the instructions to be executed is linear the benefit of an instruction cache pre-fetch is small.
- Patent specification US 6,792,496 discloses pre-fetching of data in a system with variable latency in time for a response from a memory or peripheral.
- Patent specification US 6,195,735 discloses a pre-fetch system that utilizes system information to adjust the size of the pre-fetch in order to optimize system performance.
- Patent specification US 5,787,475 discloses a method and apparatus for determining when an Input/Output (I/O) module should pre-fetch cache lines of data from a main memory.
- An object of the invention is to reduce the need for expensive high-speed memory tightly coupled to the processor. An additional object is to enable this without causing significant, if any, performance degradation.
- a further object is to enable this by a simple design.
- An additional object is to avoid or mitigate the impact on the bandwidth to external (to the DSP or processor, e.g. off-chip) memory that arises when the DSP or processor uses some of that bandwidth to access the external memory for the DSP or processor code.
- Yet a further object in one embodiment is to reduce the impact of using external (e.g. off-chip) memories with a setup time being large compared to the time required for the actual data access.
- a method (and corresponding device) of enhanced cache performance comprising pre-fetch, the method comprising: receiving a first pre-fetch request in a cache boosting circuit from a processor, where the first pre-fetch request requests cache data from a memory being external to the processor; merging or combining the first pre-fetch request with an additional pre-fetch request requesting additional cache data, resulting in a single merged request for cache data, and forwarding the single merged request to the memory, where the additional cache data is consecutive to the data of the first pre-fetch request; receiving at least a first part of the requested cache data of the single merged request in the cache boosting circuit from the memory and supplying the first part of the requested cache data to the processor for processing, where the first part of the requested cache data corresponds to the data requested by the first pre-fetch request; and receiving a second pre-fetch request in the cache boosting circuit after the processor has received the first part of the requested cache data, where the cache boosting circuit does not forward the second pre-fetch request to the memory and where the cache boosting circuit instead supplies a second part of the requested cache data of the single merged request to the processor, the second part corresponding to the data requested by the second pre-fetch request.
- the second (and any additional, if applicable) pre-fetch request(s) (henceforth denoted the second cache line fill) will only take the pure access time to the external memory, as it is done as part of the same request as the first pre-fetch request (henceforth denoted the first cache line fill).
- the internal delays and the setup time to the external memory for the second access are saved.
- the additional pre-fetch request requesting additional cache data comprises the second (and any additional, if applicable) pre-fetch request(s) from the processor.
- high-speed memory close to the processor is not needed or can be reduced in size as it can be replaced partly or completely with external (to the processor) memory that can be slower and thereby cheaper without significant disadvantages of performance.
- the issues of using a pre-fetch function together with memory, and especially off-chip memory with a larger setup time, are effectively alleviated.
- the cost of a cache line fill is reduced, whereby the performance is enhanced, especially for control-type code where the instruction cache overhead can be significant.
- the external memory is occupied for fewer cycles, giving a reduction in occupied time, which is very important since the bandwidth to the off-chip memory is a scarce resource.
- the single merged request comprises an integer number of pre-fetch requests, each pre-fetch request corresponding to a pre-fetch request from the processor, where the integer number is selected from the group of 2, 4, 8, 16, ... and wherein the cache boosting circuit supplies received cache data of the pre-fetch requests in turn to the processor in response to a pre-fetch request from the processor and does not forward any of the pre-fetch requests to the memory.
- various implementations may fetch or request data for 2 or 4 or 8, etc. pre-fetch requests at a time paying the memory setup time only once.
- the specific integer number may be selected depending on what is best for a given implementation or setup. Alternatively, the integer number may be any natural number larger than 1.
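- The effect of the chosen integer number can be illustrated with a short C sketch: a merged burst of n line fills pays the setup penalty once, so the saving grows as (n - 1) setups. The cycle constants are assumed, hypothetical values:

```c
#include <stdio.h>

/* One merged burst pays the setup penalty once; n separate line fills
 * pay it n times. SETUP and ACCESS are assumed cycle counts. */
enum { SETUP = 12, ACCESS = 32 };

int main(void)
{
    for (unsigned n = 2; n <= 16; n *= 2) {  /* the group 2, 4, 8, 16 */
        unsigned merged   = SETUP + n * ACCESS;
        unsigned separate = n * (SETUP + ACCESS);
        printf("n=%2u: merged %3u cycles, separate %3u, saved %3u\n",
               n, merged, separate, separate - merged);
    }
    return 0;
}
```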
- the processor starts processing the first part of the requested cache data of the single merged request after requesting the second pre-fetch request.
- the memory is an off-chip memory and the processor is on-chip. In another embodiment, the memory is an on-chip memory and the processor is on-chip but where the memory is external to the processor.
- the memory has a data retrieval time or setup time for requested data that is large relative to the time it takes to retrieve the requested data.
- the processor is a digital signal processor.
- the memory is a 16-bit DDR or a 16-bit SDR memory.
- the cache boosting circuit receives an enable control signal controlling whether the cache boosting circuit should be active or not.
- the invention also relates to a cache boosting device for enhancing cache performance comprising pre-fetch, the cache boosting device comprising a control logic circuit adapted to: receive a first pre-fetch request from a processor, where the first pre-fetch request requests cache data from a memory being external to the processor; merge or combine the first pre-fetch request with an additional pre-fetch request requesting additional cache data, resulting in a single merged request for cache data, and forward the single merged request to the memory, where the additional cache data is consecutive to the data of the first pre-fetch request; receive at least a first part of the requested cache data of the single merged request from the memory and supply the first part of the requested cache data to the processor for processing, where the first part of the requested cache data corresponds to the data requested by the first pre-fetch request; and receive a second pre-fetch request after the processor has received the first part of the requested cache data, where the cache boosting device does not forward the second pre-fetch request to the memory and instead supplies a second part of the requested cache data of the single merged request to the processor, the second part corresponding to the data requested by the second pre-fetch request.
- the embodiments of the device according to the present invention correspond to the embodiments of the method according to the present invention and have the same advantages for the same reasons. Further, the invention also relates to a computer readable medium having stored thereon instructions for causing one or more processing units to execute the method according to the present invention.
- Figure 1 schematically illustrates a prior art DSP subsystem;
- Figure 2 schematically illustrates an embodiment of the present invention;
- Figure 3 schematically illustrates interfaces between a DSP and (off-chip) memory according to an embodiment;
- Figure 4 schematically illustrates memory timings for different configurations; and
- Figure 5 schematically illustrates a block diagram of an embodiment of a cache boosting circuit.
- Figure 1 schematically illustrates a prior art DSP subsystem. Shown are, as described earlier, a DSP subsystem (100) comprising a DSP (101) where the DSP (101) comprises a DSP core (102), an instruction cache (103), a tightly coupled instruction memory (104) and a tightly coupled data memory (105). Furthermore, the DSP subsystem (100) comprises an additional internal (to the DSP subsystem (100) but external to the DSP (101)) memory (106) and one or more peripherals (107) controlled by the DSP core (102).
- in certain systems, a cache line fill from the internal memory (106), e.g. SRAM and/or ROM, will cost 10 DSP cycles, while a line fill from off-chip memory in the same system would require 70 DSP cycles (for a 104 MHz, single data rate, 16-bit off-chip memory).
- the instruction cache (103) will have a feature often referred to as pre-fetch or similar that enhances performance for some memory systems. If pre-fetch is enabled, once a cache line fill is completed, then the instruction cache (103) initiates a cache line fill request for the consecutive cache line. Simultaneously, the DSP core (102) starts executing. With a high probability the next cache line fill will be the consecutive 32 bytes. If pre-fetch is enabled the request is already started and the instruction data is on its way into the instruction cache (103), which will shorten the wait time. In this way, a higher performance is achieved.
- if the second cache line is not required after all, the important and usually scarce bandwidth has been wasted and other accesses to the off-chip memory have been halted unnecessarily.
- if the second cache line is required, then it is very likely required soon after the first one.
- the gain is then just the few cycles of execution before the next line is requested, since the large setup time delays the access and retrieval of information.
- the pre-fetch function is therefore far from optimal for off-chip memories, as the gain, when successful, is limited (only a few cycles, since the large setup time causes a wait even though the correct second cache line is being retrieved) while the cost, when not successful, is very high (waste of bandwidth, possibly blocking other accesses, and locking the needed cache line by the ongoing pre-fetch).
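- This trade-off can be made concrete with a back-of-the-envelope C sketch of the expected gain of conventional pre-fetch; the hit probability, the few-cycle gain and the wasted-burst cost are all hypothetical assumptions:

```c
#include <stdio.h>

/* Expected value of conventional pre-fetch against a slow off-chip
 * memory: a correct guess gains a few cycles, a wrong guess wastes a
 * whole bus occupation. All numbers are hypothetical. */
int main(void)
{
    const double gain_hit  = 5.0;   /* cycles gained when the guess is right */
    const double cost_miss = 44.0;  /* setup + transfer cycles wasted        */

    for (double p = 0.5; p <= 1.001; p += 0.1) {
        double expected = p * gain_hit - (1.0 - p) * cost_miss;
        printf("P(next line needed)=%.1f -> expected gain %+5.1f cycles\n",
               p, expected);
    }
    return 0;
}
```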
- Figure 2 schematically illustrates an embodiment of the present invention. Shown is a DSP (101) or another type of processor connected to a cache boosting circuit (110) according to the present invention, which in turn is connected to a memory (114) that may be external to both the DSP (101) and the DSP subsystem (not shown).
- the DSP (101) may be any processor using a pre-fetch function or the like as explained in connection with Figure 1, i.e. a pre-fetch function that performs a first cache line fill and, when the first cache line fill is completed, initiates a second cache line fill while the processor core at the same time starts executing the instructions of the first cache line fill, whereby the instruction data of the second cache line fill is already on its way into the cache, thereby shortening the wait time when the processor core is done executing the instructions of the first cache line fill.
- the memory (114) need not be off-chip and relatively slower although the advantages of the present invention are largest for memories with large setup times relative to data access time. However, there is a benefit with all memory types.
- the setup penalty for accessing the (off-chip) memory (114) is paid only once.
- the cache boosting circuit (110) can be turned on and off using the enable control signal (109) being derived from or supplied by the DSP or processor (101). If pre-fetch is disabled the cache boosting circuit (110) should be turned off.
- the enable control signal (109) may be derived or supplied from elsewhere.
- the cache boosting circuit (110) merges the requests for cache use by, once the first cache line fill is requested, generating a request for twice the amount of data to be retrieved, i.e. for a first and a second cache line fill. As soon as the cache boosting circuit (110) receives half of the total amount of data, it supplies that half to the DSP or processor (101). After half of the requested amount of data has been received by the DSP or processor (101), it generates a request for the second cache line fill and starts executing the received data as normal. The second cache line fill request is then 'captured' by the cache boosting circuit (110) but not passed on as it normally would be.
- the corresponding data of the second cache line fill is already present in the cache boosting circuit (110), or at least on its way, as twice the amount of data was requested at the time of the first cache line fill. Consequently, the DSP or processor (101) gets a very fast reply for the second cache line request, as the data is already available or on its way, without having to spend additional setup time for the memory (114) or use bandwidth for the second request. As mentioned later in connection with Tables 1 and 2, this could in one example save 38 cycles out of 140 cycles for SDR memory, or out of 108 cycles for DDR memory.
- the cache boosting circuit (110) may request another amount of data to be retrieved for an integer number of cache line fills, e.g. 3, 4, etc. times the amount of data to be retrieved corresponding to 3, 4, etc. cache line fills. Then the cache boosting circuit (110) receives an appropriate part or fraction of the total requested data instead of a half.
- the integer number may be selected from the set of integers 2^N, where N ∈ {1, 2, 3, ...}.
- the cache boosting operation can be done transparently for the DSP or processor (101), as it still functions as specified, requesting two cache line fills, and does not require modification of the DSP or CPU core.
- the cache boosting circuit (110) automatically doubles the requested amount of data to be retrieved when receiving a request for a first cache line fill.
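- A minimal behavioral sketch of this doubling-and-capturing behavior, written in C, is given below; the type names, the memory-read callback and the fixed doubling factor of two are illustrative assumptions, not the actual circuit design:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Behavioral sketch of the cache boosting circuit (110): the first
 * line-fill request becomes one merged burst of twice the data, and
 * the expected second request is answered from a local buffer without
 * touching the memory (114). */
enum { LINE_BYTES = 32 };

typedef void (*mem_read_fn)(uint32_t addr, uint8_t *dst, size_t len);

struct cache_booster {
    mem_read_fn mem_read;              /* path towards the memory (114) */
    bool        armed;                 /* next line already buffered?   */
    uint32_t    next_addr;             /* address the buffer holds      */
    uint8_t     buffered[LINE_BYTES];  /* captured second cache line    */
};

/* Called for every line-fill request arriving from the processor. */
void booster_fill(struct cache_booster *b, uint32_t addr, uint8_t *line)
{
    if (b->armed && addr == b->next_addr) {
        /* Second fill: served locally, no request forwarded. */
        memcpy(line, b->buffered, LINE_BYTES);
        b->armed = false;
        return;
    }
    /* First fill: one merged request for two consecutive lines,
     * so the memory setup time is paid only once. */
    uint8_t burst[2 * LINE_BYTES];
    b->mem_read(addr, burst, sizeof burst);
    memcpy(line, burst, LINE_BYTES);
    memcpy(b->buffered, burst + LINE_BYTES, LINE_BYTES);
    b->next_addr = addr + LINE_BYTES;
    b->armed = true;
}
```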
- Figure 3 schematically illustrates interfaces between a DSP and (off-chip) memory according to an embodiment. Shown is a DSP subsystem (100) or another type of processing subsystem comprising a DSP or another type of processor, e.g. corresponding to the DSP subsystem and DSP shown and explained in connection with Figure 1.
- the DSP subsystem further comprises a bus protocol master (111), e.g. an AHB (AMBA High-performance Bus) bus protocol master from ARM.
- the cache boosting circuit's (110) function can be turned on and off using an enable control signal (109) derived from or supplied by the DSP or processor (101).
- Interface 1 is an interface between a program port (108) of the DSP (101) and the cache boosting circuit (110).
- in one embodiment, Interface 1 is a 64-bit bus interface, e.g. a 64-bit AHB interface from ARM.
- Interface 2 is an interface between the cache boosting circuit (110) and the bus protocol master (111) of the DSP subsystem.
- in one embodiment, Interface 2 is also a 64-bit bus interface, e.g. a 64-bit AHB interface from ARM.
- Interface 3 is an interface between the bus protocol master (111) of the DSP subsystem and the (external) memory interface (112).
- in one embodiment, this interface is a 32-bit bus interface, e.g. a 32-bit AHB interface from ARM.
- Interface 4 is an interface between the (external) memory interface (112) and the (off-chip) memory (114). In one embodiment, this interface typically is a 16-bit interface.
- the bus widths require a certain clock speed in order for an interface not to become a bottleneck of the system, where the required speed depends on the specific task(s) and design of the DSP or processor subsystem. In one embodiment, they should be designed to handle 104 MHz Double Data Rate (DDR) memories. On a 16-bit interface (Interface 4) such a memory allows a 416 MByte/s transfer rate. In order to handle this data efficiently, Interface 3 should run at 104 MHz. In the same way, Interfaces 1 and 2 should run at 52 MHz or more to keep up.
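- The quoted rates can be checked with a few lines of C; only the 104 MHz DDR clock and the interface widths from the paragraph above go in:

```c
#include <stdio.h>

/* A 16-bit DDR memory at 104 MHz moves two 16-bit words per clock;
 * the wider on-chip buses need the quoted clock rates to keep up. */
int main(void)
{
    const double clk_hz        = 104e6;
    const double bytes_per_clk = 2.0 * 2.0;  /* DDR, 16-bit words */
    const double mbyte_s       = clk_hz * bytes_per_clk / 1e6;

    printf("Interface 4 (16-bit DDR @104 MHz): %.0f MByte/s\n", mbyte_s);
    printf("Interface 3 (32-bit) must run at %.0f MHz\n", mbyte_s / 4.0);
    printf("Interfaces 1-2 (64-bit) must run at %.0f MHz\n", mbyte_s / 8.0);
    return 0;
}
```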
- a sequence of events in a cache line fill if the cache boosting circuit (110) is not enabled and pre-fetch is enabled may be as follows.
- the cache line fill is generated as 4 64-bit read requests on both DSP or processor program port (108) and the DSP or processor subsystem boundary, i.e. at Interfaces 1 and 2 and the cache boosting circuit (110) simply forwards the requests unaltered.
- the request is first converted to 8 32-bit read requests on Interface 3, and finally to 16 16-bit read requests on Interface 4.
- the data propagates from the (e.g. off-chip) memory through the different interfaces into the instruction cache (not shown; see 103 in Figure 1), and the processor resumes execution.
- a second cache line fill request for the next 32 consecutive bytes, i.e. an amount equal to the cache line size, is then issued and the process is repeated again.
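- These request streams are just the same 32-byte line sliced to the bus width, as this small C sketch verifies (it also shows the doubled counts of the merged burst discussed below):

```c
#include <stdio.h>

/* Bus beats needed to move one 32-byte cache line across an interface
 * of a given width; reproduces the 4/8/16 read counts above, and the
 * 8/16/32 counts of the merged (doubled) burst. */
static unsigned beats(unsigned bytes, unsigned width_bits)
{
    return bytes / (width_bits / 8);
}

int main(void)
{
    const unsigned widths[] = { 64, 32, 16 };  /* Interfaces 1-2, 3, 4 */
    for (int i = 0; i < 3; i++)
        printf("%2u-bit bus: %2u reads per 32-byte line, %2u per merged burst\n",
               widths[i], beats(32, widths[i]), beats(64, widths[i]));
    return 0;
}
```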
- the DSP or processor (101 ) issues 4 64-bit read requests as usual.
- the cache boosting circuit (110) accepts the request but issues 8 64-bit read requests instead, and these 8 64-bit read requests are presented at the DSP or processor subsystem boundary, i.e. at Interface 2.
- the request is converted into 16 32-bit read requests on Interface 3 and finally to 32 16-bit read requests on Interface 4.
- the requested data propagates from the (e.g. off-chip) memory through the different interfaces into the instruction cache of the DSP or processor (not shown; see e.g. 103 in Figure 1).
- the cache line fill request from the DSP or processor has then been completed. Consequently, the DSP or processor resumes execution and issues a second cache line fill request for the next 32 consecutive bytes.
- the second cache line fill request is accepted by the cache boosting circuit (110) but the cache boosting circuit (110) does not forward the read request, since the corresponding data is already being collected from the (e.g. off-chip) memory (114).
- the cache boosting circuit (110) merely gathers and delays the streaming data, e.g. in a small number of FIFO registers (e.g. as in the exemplary embodiment of Figure 5) or the like. In this way, the DSP or processor (101) gets an immediate or as-soon-as-possible reply for its second cache line fill request, thereby enhancing the cache operation.
- bus rates and lengths of read requests are only exemplary. Other values may e.g. be 64-bit and 128-bit reads or 8- bit and 16-bit reads and correspondingly for the bus rates. As another example the 16-bit size of Interface 4 is dependent on the bit size of the used memory and could e.g. be 32 bit, 64 bit, etc.
- the cache boosting circuit (110) may request more than two times the amount of data requested by the DSP or processor (101), and the cache boosting circuit (110) would receive an appropriate part or fraction of the total requested data instead of a half.
- Figure 4 schematically illustrates memory timings for different configurations. Shown are memory timings for combinations of Single and Double Data Rate with 16 and 32 16-bit reads, namely 1) Single Data Rate with 16 16-bit words, 2) Single Data Rate with 32 16-bit words, 3) Double Data Rate with 16 16-bit words, 4) Double Data Rate with 32 16-bit words, and an indication of the time needed for memory setup, which in this specific example is approximately 60 ns.
- This specific example is done using off-chip memories where such a memory typically operates at 104 MHz and has a setup time of approximately 60 ns.
- the specific example assumes a single data rate memory with a 16-bit wide interface. For one cache line fill, a single setup time and 16 cycles at 104 MHz are spent. Thereby, approximately 12 cycles at 208 MHz for the setup and 32 cycles at 208 MHz for actually reading the data from the memory are used.
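- This arithmetic can be reproduced with a short C sketch; the 60 ns setup and the 16 reads at 104 MHz come from the text, while internal delays (which the 38-cycle figure quoted earlier also includes) are deliberately left out:

```c
#include <stdio.h>

/* Worked version of the paragraph above, in 208 MHz cycles. Only the
 * 60 ns setup and the 16 SDR reads at 104 MHz come from the text;
 * internal delays are not modeled here. */
int main(void)
{
    const double setup = 60e-9 * 208e6;  /* ~12.5 cycles at 208 MHz */
    const double read  = 16.0 * 2.0;     /* 16 SDR reads at 104 MHz */

    double one_fill  = setup + read;
    double two_fills = 2.0 * one_fill;      /* boosting disabled */
    double merged    = setup + 2.0 * read;  /* boosting enabled  */

    printf("one SDR line fill : ~%.0f cycles\n", one_fill);
    printf("two separate fills: ~%.0f cycles\n", two_fills);
    printf("one merged burst  : ~%.0f cycles (setup paid once, ~%.0f saved)\n",
           merged, two_fills - merged);
    return 0;
}
```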
- the cost of the cache fills is summarized below in Table 1 illustrating the costs for the first and second cache fill if the cache boosting circuit is not enabled and Table 2 illustrating the costs for the first and second cache line fill if the cache boosting circuit is enabled.
- if the cache boosting circuit is enabled and we make a longer access to off-chip memory than otherwise, then the first cache line fill will never be more expensive.
- the benefit of the cache boosting circuit is that the second cache line fill will only take the pure access time to the off-chip memory as it is done as part of the same request as the first cache line fill. The internal delays and the setup time to the off-chip memory for the second time are saved.
- the cache boosting circuit is a very simple design and it does not have any negative effect if pre-fetch is disabled, since the signals may then pass directly through the cache boosting circuit with no loss of cycles.
- Figure 5 schematically illustrates a block diagram of an embodiment of a cache boosting circuit. Shown is a cache boosting circuit (110) connected to a program port (108) of a processor, DSP or the like and to a bus bridge (503).
- the cache boosting circuit (110) comprises in this example a control logic circuit (501) and a data storage element (502), like a FIFO queue element or another type of storage.
- the control logic circuit (501) and the data storage element (502) receive an enable control signal (109) for controlling whether the cache boosting circuit (110) is enabled or not, as explained earlier.
- the control logic circuit (501) is connected to receive one or more control signals from the program port (108) and to transmit one or more control signals to the bridge (503).
- the data storage element (502) is connected to receive data on a data bus from the bus bridge (503) and to supply data on a data bus to the program port (108).
- the control logic circuit (501) will (when enabled), after receiving a request for data in the form of a single cache line fill from a processor or DSP via control signal(s) from the program port (108), generate a merged or combined request for a number of consecutive cache line fills or memory requests.
- the merged or combined request will be transmitted via control signal(s) to the bus bridge (503) which will cause the requested data to be fetched from the appropriate memory as explained before.
- the requested data will be supplied to the data storage element (502) via the data bus connection between the bus bridge (503) and the data storage element (502) and held there.
- this data is provided via a data bus to the program port (108) and to the processor or DSP as explained earlier.
- the additional part or parts of data being requested (arising from the merged or combined request) are gathered and kept in the data storage element (502).
- when the processor or DSP issues another request for data in the form of a cache line fill to the control logic circuit (501), this request is halted and is not passed on to the bridge (503), since this data was already requested as part of the combined or merged request. Instead the data is passed on to the program port (108) from the data storage element (502) if it is already present there, or when it is received by the data storage element (502).
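- A structural sketch of this arrangement in C is given below; the type names, FIFO depth and 64-bit word size are illustrative assumptions rather than the actual implementation:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Structural sketch of Figure 5: a control logic circuit (501) plus a
 * small FIFO as the data storage element (502), gated by the enable
 * control signal (109). Names and sizes are hypothetical. */
enum { FIFO_DEPTH = 8 };  /* e.g. 8 x 64-bit words = one extra line */

struct data_storage_502 {
    uint64_t word[FIFO_DEPTH];
    unsigned head, tail, count;
};

struct control_logic_501 {
    bool     enable;        /* enable control signal (109)        */
    bool     line_pending;  /* merged data still held in the FIFO */
    uint32_t pending_addr;  /* address of the buffered line       */
};

static void fifo_push(struct data_storage_502 *f, uint64_t w)
{
    f->word[f->tail] = w;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count++;
}

static uint64_t fifo_pop(struct data_storage_502 *f)
{
    uint64_t w = f->word[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return w;
}

int main(void)
{
    struct control_logic_501 ctl = { .enable = true,
                                     .line_pending = true,
                                     .pending_addr = 0x2000 };
    struct data_storage_502 fifo = {0};

    /* Data of the merged burst arriving from the bridge (503). */
    for (uint64_t w = 0; w < 4; w++)
        fifo_push(&fifo, 0xCAFE0000u + w);

    /* A second line-fill request is not forwarded; it is served by
     * draining the FIFO towards the program port (108). */
    if (ctl.enable && ctl.line_pending)
        while (fifo.count > 0)
            printf("to program port (108): %#llx\n",
                   (unsigned long long)fifo_pop(&fifo));
    return 0;
}
```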
- any reference signs placed between parentheses shall not be construed as limiting the claim.
- the word “comprising” does not exclude the presence of elements or steps other than those listed in a claim.
- the word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements.
- the invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer.
- in the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware.
- the mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Abstract
The present invention relates to enhancing cache performance when using a pre-fetch function. If a pre-fetch function is enabled, a single merged memory access request is made that comprises the data for both a cache line fill and the pre-fetch function. When the cache has consumed the data corresponding to the cache line fill, it requests the data corresponding to the pre-fetch function; however, this data has already been retrieved from the memory. In this way, a simple, reliable and efficient way of enhancing a caching operation is obtained, as the second pre-fetch request (henceforth denoted the second cache line fill) will only take the pure access time to the external memory, since it is done as part of the same request as the first pre-fetch request (henceforth denoted the first cache line fill). The internal delays and the setup time to the external memory for the second access are saved. Furthermore, high-speed memory close to the processor is not needed or can be reduced in size, as it can be replaced, partly or completely, with external (to the processor) memory that can be slower and thereby cheaper, without significant performance disadvantages. In addition, the issues of using a pre-fetch function together with memory, and especially off-chip memory with a larger setup time, are effectively alleviated. Moreover, on average, the cost of a cache line fill is reduced, whereby performance is enhanced, especially for control-type code where the instruction cache overhead can be significant. Also, on average, the external memory is occupied for fewer cycles, giving a reduction in occupied time, which is very important since the bandwidth to the off-chip memory is a scarce resource.
Applications Claiming Priority (4)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
EP06388066A (EP1930811A1) | 2006-12-05 | 2006-12-05 | Methods and devices for cache enhancement
EP06388066.0 | 2006-12-05 | |
US86875006P | 2006-12-06 | 2006-12-06 |
US60/868,750 | 2006-12-06 | |
Publications (1)

Publication Number | Publication Date
---|---
WO2008067918A1 | 2008-06-12

Family ID: 38935949

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/EP2007/010175 | Methods and devices for enhancing caching | 2006-12-05 | 2007-11-23

Country Status (1)

Country | Link
---|---
WO | WO2008067918A1
Citations (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US5963481A | 1998-06-30 | 1999-10-05 | Enhanced Memory Systems, Inc. | Embedded enhanced DRAM, and associated method
US20010018735A1 | 2000-02-24 | 2001-08-30 | Yasuyuki Murakami | Data processor and data processing system
Legal Events

Date | Code | Title | Description
---|---|---|---
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 07846776; Country of ref document: EP; Kind code of ref document: A1
| DPE1 | Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101) |
| NENP | Non-entry into the national phase | Ref country code: DE
| 122 | Ep: pct application non-entry in european phase | Ref document number: 07846776; Country of ref document: EP; Kind code of ref document: A1