CN116701245A - Pipelined cache data caching method and device with variable delay and bit width - Google Patents

Pipelined cache data caching method and device with variable delay and bit width

Info

Publication number
CN116701245A
CN116701245A
Authority
CN
China
Prior art keywords
data
request
data array
width
array group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310610570.1A
Other languages
Chinese (zh)
Inventor
曾坤
周宏伟
黄胜渝
金辉
邵靖杰
饶建波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202310610570.1A
Publication of CN116701245A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a pipelined cache data caching method and device with variable delay and bit width, wherein the method comprises the following steps: S01, when a cache request is received, buffering the request in a pipelined manner into the buffers corresponding to a plurality of data array groups, and generating, by changing information in the request, read-write micro-requests that access each data array group in a pipelined manner; S02, determining the number of data array groups and the number of RAMs according to the configured pipeline width and cache data block width; S03, when each data array group is accessed, passing the request for each data array group into a register and on to the next pipeline stage, obtaining the configured number of registers for RAM multi-cycle access, and selecting the request in the register matching the currently configured number of RAM multi-cycle delay beats. The invention realizes pipelined caching of data with variable delay and variable bit width, and has the advantages of simple implementation, low cost, high caching efficiency, and strong flexibility and adaptability.

Description

Pipelined cache data caching method and device with variable delay and bit width
Technical Field
The present invention relates to the technical field of cache data caching, and in particular to a pipelined cache data caching method and apparatus with variable delay and bit width.
Background
To extend the battery life of portable devices (e.g., cell phones, MP3 players, multimedia players, notebook computers), new energy-saving technologies must be developed. As a new energy-saving approach, DVFS (dynamic voltage and frequency scaling) can effectively optimize processor power consumption according to the different computing demands of the applications running on the chip, but it also creates multiple requirements on the RAM multi-cycle delay. In addition, different processors have different bus bandwidth requirements: a processor with a high bus bandwidth requirement needs a wider cache data pipeline, and vice versa. In the prior art, the cache data pipeline usually has a fixed data width and a fixed RAM multi-cycle delay, so it can neither adapt to scenarios where processor frequency scaling changes the required delay nor meet the bus bandwidth requirements of different processors.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides a pipelined cache data caching method and device with variable delay and bit width that are simple to implement, low in cost, high in caching efficiency, and strong in flexibility and adaptability. They realize a pipelined cache with variable delay and variable bit width, suit scenarios where processor frequency scaling changes the required RAM delay, and meet the bus bandwidth requirements of different processors.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a variable delay and bit width pipelined cache data caching method, comprising the steps of:
s01, when a cache request is received, the request is cached in a buffer corresponding to a plurality of data array groups in a running way, and a read-write micro-request for accessing each data array group in the running way is spontaneously generated by changing information in the request;
s02, configuring the width of a pipeline and the width of a cache data block according to the currently required data bit width, and determining the number of data array groups and the number of RAMs (random access memories) used for caching data in each data array group according to the configured width of the pipeline and the configured width of the cache data block so as to realize variable bit width data caching;
s03, accessing each data array group in a streaming mode according to the request, accessing each data array group request, streaming and transmitting the request to a next station when accessing each data array group, obtaining the number of currently configured RAM multi-period accessed registers, and selecting the requests in the registers consistent with the currently configured RAM multi-period delay beat number so as to realize variable-delay data caching.
Further, the step S01 includes: when the pipeline is in a state in which it can receive a new request, receiving a cache request through a handshake; after the handshake succeeds, buffering the request, where the value of the data ID in the request corresponds to one data array group in sequence; performing micro-operations on the buffered request, incrementing the ID number corresponding to the data array group in the request through a control signal, and internally generating micro-requests for each data array group in a pipelined manner, until the request for the last data array group is buffered in the buffer of the last data array group.
Further, in the step S02, the number of RAMs in a data array group is determined according to the pipeline data width, the number of data array groups is determined according to the cache data block width, and the cache data block width is determined by the expression: cache data block width = number of data array groups × 128 × number of RAMs in one data array group.
Further, in the step S02, the maximum value of the micro-operation ID number is determined according to the number of RAMs, so as to be used for accessing the different data array groups.
Further, after the step S01 and before the step S02, each data array group correspondingly receives a selection signal, so that the data array groups are accessed without gaps after the request enters the pipeline.
Further, step S02 and step S03 are implemented using Python, where the pipeline width and the cache data block width are configured by adjusting parameters in Python, and the number of registers for RAM multi-cycle access is likewise configured by adjusting parameters in Python.
A variable delay and bit width pipelined cache data caching apparatus, comprising a plurality of data array groups, each data array group including a plurality of RAMs for caching data, and further comprising:
a request receiving module, configured to receive a cache request and, when the cache request is received, buffer the request in a pipelined manner into the buffers corresponding to the plurality of data array groups and generate micro-requests that access each data array group in a pipelined manner;
a bit width control module, configured to set the pipeline width and the cache data block width according to the currently required data bit width, and determine the number of data array groups and the number of RAMs (random access memories) used for caching data in each data array group according to the configured pipeline width and cache data block width, so as to realize data caching with variable bit width;
and a RAM delay control module, configured to access each data array group in a pipelined manner according to the request, where the request for each data array group enters a register in a pipelined manner and is passed to the next pipeline stage when that data array group is accessed; the module obtains the currently configured number of registers for RAM multi-cycle access and selects the request in the register matching the currently configured number of RAM multi-cycle delay beats, so as to realize data caching with variable delay.
Further, the apparatus includes a counter for controlling the pipeline state and the request micro-operations; by counting the life cycle of a request from entering the pipeline to the completion of its RAM access, the counter determines that the pipeline can receive a new request when the count completes.
Further, each data array group correspondingly receives a selection signal: the first selection signal selects the request information on the interface when the request has just entered, and otherwise selects the request information in the corresponding first buffer; each of the remaining selection signals selects the request information in the preceding buffer when the data ID equals its corresponding ID value, and otherwise selects the request information in its own buffer.
Further, the request information of each selection signal is used for accessing the corresponding data array group. When the data array groups are accessed in sequence, upon accessing the first data array group the request enters the first register corresponding to the first data array group, and the current data ID is the first ID value; on the next beat, the request whose data ID is the first ID value enters the second register, and the request whose data ID is the second ID value enters the first register; on the beat after that, the request whose data ID is the third ID value enters the first register, the request whose data ID is the second ID value enters the second register, and the request whose data ID is the first ID value enters the third register; and so on.
Compared with the prior art, the invention has the following advantages. When a cache request is received, the request is buffered in a pipelined manner into the buffers corresponding to a plurality of data array groups, and micro-requests that access each data array group in a pipelined manner are generated. The number of data array groups and the number of RAMs in each data array group are determined according to the pipeline width and the cache data block width, which makes the cached data width variable and flexibly adapts to the data width requirements of different processors on the cache data pipeline. Meanwhile, the requests access each data array group in a pipelined manner, and the number of registers is determined according to the number of RAM multi-cycle delay beats, which makes the delay variable. The invention thus applies flexibly to the various RAM multi-cycle delay beats caused by changes in processor frequency, avoids the design limitation of a fixed number of RAM multi-cycle delay beats, and greatly improves the flexibility and adaptability of data caching.
Drawings
FIG. 1 is a flow chart illustrating an implementation of the variable delay and bit width pipelined cache data caching method of the present embodiment.
FIG. 2 is a schematic diagram of a pipeline architecture formed in a specific application embodiment (pipeline width 256 bits, cache data width 1024 bits).
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments, but the scope of protection of the invention is not thereby limited.
As shown in FIG. 1, the steps of the pipelined cache data caching method with variable delay and bit width of the present embodiment include:
S01, when a cache request is received, buffering the request in a pipelined manner into the buffers corresponding to a plurality of data array groups, and spontaneously generating, by changing information in the request, read-write micro-requests that access each data array group in a pipelined manner;
S02, configuring the pipeline width and the cache data block width according to the currently required data bit width, and determining the number of data array groups and the number of RAMs (random access memories) used for caching data in each data array group according to the configured pipeline width and cache data block width, so as to realize data caching with variable bit width;
S03, accessing each data array group in a pipelined manner according to the request; when each data array group is accessed, the request for that data array group enters a register in a pipelined manner and is passed to the next pipeline stage; obtaining the currently configured number of registers for RAM multi-cycle access, and selecting the request in the register matching the currently configured number of RAM multi-cycle delay beats, so as to realize data caching with variable delay.
In this embodiment, when a cache request is received, the request is buffered in a pipelined manner into the buffers corresponding to a plurality of data array groups, and micro-requests that access each data array group in a pipelined manner are generated. The number of data array groups and the number of RAMs in each data array group are determined according to the pipeline width and the cache data block width, which makes the cached data width variable and flexibly adapts to the data width requirements of different processors on the cache data pipeline. Meanwhile, the requests access each data array group in a pipelined manner, and the number of registers is determined according to the number of RAM multi-cycle delay beats, which makes the delay variable. The method thus applies flexibly to the various RAM multi-cycle delay beats caused by changes in processor frequency, avoids the design limitation of a fixed number of RAM multi-cycle delay beats, and greatly improves the flexibility and adaptability of data caching.
In this embodiment, the specific steps of step S01 include: when the pipeline is in a state in which it can receive a new request, receiving a cache request through a handshake; after the handshake succeeds, buffering the request, where the value of the data ID in the request corresponds to one data array group in sequence; performing micro-operations on the buffered request, incrementing the ID number corresponding to the data array group in the request through a control signal, and internally generating micro-requests for each data array group in a pipelined manner, until the request for the last data array group is buffered in the buffer of the last data array group.
In a specific application embodiment, the state of the pipeline is controlled by a counter. For example, a counter is set to count the life cycle of a request from entering the pipeline to the completion of its RAM access; when the count completes, the pipeline can receive a new request, i.e., it is in the Ready state, and while the pipeline is in this state the cache request is received through a handshake. The terminal value of the counter can be calculated automatically in advance from the configuration parameter information. The buffering itself is done with registers, and the buffering behavior is controlled by the counter; each data array group is configured with its own buffer, and the value of the data ID in the request corresponds to one data array group in sequence. After the handshake succeeds, the request is buffered into buffer0, where the data ID is 0; on the next beat it is buffered into buffer 1, where the data ID is 1, and so on, until the request for the last data array group is buffered in buffer N. The requests in the buffers change only when the next request enters the pipeline, and the number of buffers can be calculated automatically in advance from the configuration parameter information.
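As an illustration of this handshake control, the following minimal Python sketch models the Ready counter. It is an assumption-laden behavioral model, not the patent's RTL: the class and method names are invented, and the life cycle is assumed to be one beat per data array group plus the configured RAM multi-cycle delay, which the text says is derived automatically from the configuration parameters.

    class PipelineState:
        """Hypothetical model of the pipeline-state (Ready) counter."""
        def __init__(self, num_groups, ram_delay_beats):
            # Assumed life cycle: one beat per data array group to fan out
            # the micro-requests, plus the RAM multi-cycle delay beats.
            self.lifecycle = num_groups + ram_delay_beats
            self.count = self.lifecycle  # count complete -> Ready state

        def ready(self):
            return self.count >= self.lifecycle

        def accept_request(self):
            assert self.ready()  # handshake succeeds only in Ready state
            self.count = 0       # restart the life-cycle count

        def tick(self):
            if self.count < self.lifecycle:
                self.count += 1

    state = PipelineState(num_groups=4, ram_delay_beats=2)
    state.accept_request()
    for _ in range(6):
        state.tick()
    assert state.ready()  # the pipeline can now take the next request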
In step S02 of this embodiment, the number of RAMs in a data array group is determined according to the pipeline data width, and the number of data array groups is determined according to the cache data block width. During access, the different data array groups are accessed in a pipelined manner according to the request, and the required pipelined cache structure can be formed by modifying the parameters related to the pipeline width and the cache data block, so that the bit width of the cached data is variable.
Taking a RAM width of 144 bits (128 data bits plus 16 ECC bits) as an example: the data width cached by each RAM is an integer multiple of 128 bits, so the number of RAMs in a data array group can be determined from the pipeline width, while the cache data block width determines the number of data array groups and can be computed by the following formula:
cache data block width = number of data array groups × 128 × number of RAMs in one data array group.
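The arithmetic of this formula can be checked with a short Python fragment (an illustrative sketch only; the helper name is invented, and each RAM is assumed to contribute 128 data bits, its 16 ECC bits not counting toward the widths):

    RAM_DATA_BITS = 128  # a 144-bit RAM carries 128 data bits + 16 ECC bits

    def derive_config(pipeline_width, cache_block_width):
        # RAMs per group: enough 128-bit-data RAMs to cover one pipeline beat.
        rams_per_group = pipeline_width // RAM_DATA_BITS
        # Groups: cache block width = groups * 128 * RAMs per group.
        num_groups = cache_block_width // (RAM_DATA_BITS * rams_per_group)
        return num_groups, rams_per_group

    # The FIG. 2 configuration: a 256-bit pipeline with a 1024-bit cache
    # data block yields 4 data array groups of 2 RAMs each.
    assert derive_config(256, 1024) == (4, 2)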
In step S02 of this embodiment, when the pipeline width and the cache data block parameters are modified, the appropriate number of RAM instances is generated automatically to form the data array groups, and the maximum value of the micro-operation ID number is determined according to the number of RAMs; that is, the maximum micro-operation ID is calculated and generated automatically from the number of RAMs and is used for accessing the different data array groups.
In this embodiment, after step S01 and before step S02, each data array group correspondingly receives a selection signal, so that the data array groups are accessed without gaps after the request enters the pipeline. For example, each data array group corresponds to one selection signal: when a request has just entered, selection signal 0 selects the request information on the interface, and otherwise selects the request information in buffer0; selection signal 1 selects the request information in buffer0 when the data ID is 1, and otherwise selects that in buffer 1; selection signal 2 selects the request information in buffer 1 when the data ID is 2, and otherwise selects that in buffer 2; and so on.
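The bypass selection just described can be modeled roughly as below; this is an illustrative sketch, with the function and its arguments invented and each request assumed to carry its data ID in an "id" field:

    def select(i, interface_req, buffers):
        """Request seen by data array group i on the current beat."""
        if i == 0:
            # Selection signal 0: take the interface request on the beat
            # it enters; otherwise fall back to buffer0.
            return interface_req if interface_req is not None else buffers[0]
        prev = buffers[i - 1]
        # Selection signal i: take the request from buffer i-1 once its
        # data ID has advanced to i; otherwise keep buffer i's request.
        if prev is not None and prev["id"] == i:
            return prev
        return buffers[i]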
In this embodiment, the RAM multi-cycle delay can be configured as 1, 2, 3, or 4 beats. When the data array groups are accessed, the request for each data array group enters a register in a pipelined manner and is passed backward; a selector selects the request in the register corresponding to the configured number of RAM multi-cycle delay beats, and the number of registers for RAM multi-cycle access can be configured by modifying parameters, so that the number of beats of the RAM multi-cycle access can be changed.
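A minimal behavioral model of this register chain and selector follows, assuming the maximum of 4 delay beats configured in this embodiment (class and method names are invented):

    from collections import deque

    class DelayLine:
        """Register chain tapped at the configured RAM delay."""
        def __init__(self, max_beats=4):  # 1-4 beats, as in this embodiment
            self.regs = deque([None] * max_beats, maxlen=max_beats)

        def tick(self, req):
            # Each beat the requests shift one register down the chain.
            self.regs.appendleft(req)

        def tap(self, delay_beats):
            # The selector picks the register matching the configured beats.
            return self.regs[delay_beats - 1]

    dl = DelayLine()
    dl.tick("req0")
    dl.tick("req1")
    assert dl.tap(2) == "req0"  # with a 2-beat delay, req0 emerges now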
In a specific application embodiment, step S02 and step S03 may be implemented using Python: parameters in Python are modified to configure the pipeline width and the cache data block width, and parameters such as the number of registers for RAM multi-cycle access (equivalently, the number of RAM multi-cycle beats) are configured in the same way, so that Verilog with different hardware configurations is generated automatically. The method can thus be applied flexibly to different processors and meet their different requirements on the data width of the cache data pipeline. The detailed steps for implementing the above data caching with Python are as follows:
request handshake: when the pipeline is in a state where a new request can be received, the request is received by handshake.
Request buffering: each data array group has its own buffer; the buffers are generated automatically according to the parameters in Python, and their number is unlimited. After a request is received, it is buffered in a pipelined manner into the buffer corresponding to each data array group, and the ID number corresponding to each data array group in the request is incremented by a control signal, thereby generating in a pipelined manner the micro-requests that access each data array group; that is, the micro-request for each data array group is generated internally and in a pipelined fashion.
Request selection: each data array group has a corresponding selection signal to implement a bypass function, so that the data array groups are accessed without gaps after the request enters the pipeline.
Data array group access: Verilog is generated automatically with Python; changing the parameters in Python (the pipeline data width and the cache data block width) generates different numbers of RAM instances to form the data array groups, yielding cache structures for cached data of different bit widths. The number of data array groups is determined by the width of the cached data block, and the number of RAMs in each data array group is determined by the pipeline data width. During access, the different data array groups are accessed in a pipelined manner according to the request. The maximum value of the micro-operation ID number is calculated automatically from the number of RAMs, and the corresponding Verilog is generated for accessing the different data array groups.
RAM multi-cycle implementation: when the data array groups are accessed, the request for each data array group enters a register in a pipelined manner and is passed to the next pipeline stage; a selector selects the request in the register corresponding to the configured number of RAM multi-cycle delay beats, and the number of registers is determined by the number of RAM multi-cycle beats (configured as 1, 2, 3, or 4 beats in this embodiment). Verilog is generated automatically with Python, and the number of registers for RAM multi-cycle access is configured by modifying parameters in Python, thereby changing the number of beats of the RAM multi-cycle access and generating cache structures with different numbers of RAM multi-cycle delay beats. A hypothetical sketch of this generation step is given below.
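The patent gives no generator code, so the fragment below only suggests the shape such a Python-to-Verilog step could take; the emitted register chain mirrors the delay registers described above, and all signal names are assumptions:

    def gen_delay_regs(delay_beats):
        """Emit a Verilog register chain with delay_beats stages."""
        lines = []
        for i in range(delay_beats):
            src = "req_in" if i == 0 else "req_r%d" % (i - 1)
            lines.append("always @(posedge clk) req_r%d <= %s;" % (i, src))
        # The selector output taps the stage matching the configured delay.
        lines.append("assign req_out = req_r%d;" % (delay_beats - 1))
        return "\n".join(lines)

    print(gen_delay_regs(2))  # regenerate the RTL for a 2-beat RAM delay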
The variable delay and bit width pipelined cache data caching apparatus of the present embodiment includes a plurality of data array groups, each data array group including a plurality of RAMs for caching data, and further includes:
a request receiving module, configured to receive a cache request and, when the cache request is received, buffer the request in a pipelined manner into the buffers corresponding to the plurality of data array groups and generate micro-requests that access each data array group in a pipelined manner;
a bit width control module, configured to set the pipeline width and the cache data block width according to the currently required data bit width, and determine the number of data array groups and the number of RAMs (random access memories) used for caching data in each data array group according to the configured pipeline width and cache data block width, so as to realize data caching with variable bit width;
and a RAM delay control module, configured to access each data array group in a pipelined manner according to the request, where the request for each data array group enters a register in a pipelined manner and is passed to the next pipeline stage when that data array group is accessed; the module obtains the currently configured number of registers for RAM multi-cycle access and selects the request in the register matching the currently configured number of RAM multi-cycle delay beats, so as to realize data caching with variable delay.
The apparatus of this embodiment can generate different numbers of RAMs according to different data bit widths and form a cache data array pipeline with variable delay and bit width; it suits scenarios where processor frequency scaling changes the required delay as well as the bus bandwidths of different processors, and has high flexibility and adaptability.
In this embodiment, the apparatus also includes a counter for controlling the pipeline state and the request micro-operations; by counting the life cycle of a request from entering the pipeline to the completion of its RAM access, the counter determines that the pipeline can receive a new request when the count completes.
In this embodiment, the apparatus further includes several registers for implementing functions such as the request micro-operations, the pipeline design, and the variable RAM multi-cycle delay.
In this embodiment, each data array group corresponds to a selection signal: the first selection signal selects the request information on the interface when the request has just entered, and otherwise selects the request information in the corresponding first buffer; each of the remaining selection signals selects the request information in the preceding buffer when the data ID equals its corresponding ID value, and otherwise selects the request information in its own buffer. For example, when a request has just entered, selection signal 0 selects the request information on the interface, and otherwise selects the request information in buffer0; selection signal 1 selects the request information in buffer0 when the data ID is 1, and otherwise selects that in buffer 1; selection signal 2 selects the request information in buffer 1 when the data ID is 2, and otherwise selects that in buffer 2; and so on.
In this embodiment, the request information of each selection signal is used to access the corresponding data array group. When the data array groups are accessed in sequence, upon accessing the first data array group the request enters the first register corresponding to the first data array group, and the current data ID is the first ID value; on the next beat, the request whose data ID is the first ID value enters the second register, and the request whose data ID is the second ID value enters the first register; on the beat after that, the request whose data ID is the third ID value enters the first register, the request whose data ID is the second ID value enters the second register, and the request whose data ID is the first ID value enters the third register; and so on.
Specifically, the request information of selection signal 0 is used to access data array group 0, and the request information of selection signal 1 is used to access data array group 1. The data array groups are accessed in sequence: when data array group 0 is accessed, the request enters register 0, and the data ID is 0; on the next beat, the request with data ID 0 enters register 1 and the request with data ID 1 enters register 0; on the beat after that, the request with data ID 2 enters register 0, the request with data ID 1 enters register 1, and the request with data ID 0 enters register 2; and so on.
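This staggered entry can be traced with a short illustrative script (names invented; it simply prints which micro-request each register receives on each beat):

    def trace(num_groups, beats):
        for beat in range(beats):
            row = []
            for reg in range(min(beat + 1, num_groups)):
                # On beat b, register r receives the micro-request whose
                # data ID entered the sequence b - r beats earlier.
                row.append("register %d <- data ID %d" % (reg, beat - reg))
            print("beat %d: %s" % (beat, ", ".join(row)))

    trace(num_groups=4, beats=3)
    # beat 0: register 0 <- data ID 0
    # beat 1: register 0 <- data ID 1, register 1 <- data ID 0
    # beat 2: register 0 <- data ID 2, register 1 <- data ID 1, register 2 <- data ID 0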
In a specific application embodiment, the detailed flow by which the apparatus of the invention caches data is as follows:
request handshake: the state of the pipeline is controlled by a counter, the life cycle from entering the pipeline to completing the access to the RAM is calculated by the counter, the completion of the counting indicates that the pipeline can receive a new request, namely in a Ready state, and the value of the counter is automatically calculated by Python according to the configuration parameter information.
Request buffering: the buffering is done with registers, and the buffering behavior is controlled by the counter. Each data array group has its own buffer, and the value of the data ID in the request corresponds to one data array group in sequence. After the handshake succeeds, the request is buffered into buffer0, where the data ID is 0; on the next beat it is buffered into buffer 1, where the data ID is 1, and so on, until the request for the last data array group is buffered in buffer N. The requests in the buffers change only when the next request enters the pipeline, and the number of buffers is calculated automatically by Python from the configuration parameter information.
request selection: each data array group corresponds to a selection signal, when a request just enters, the selection signal 0 selects the request information on the interface, otherwise, the request information in the buffer0 is selected; selecting the information in the buffer0 by the selection signal 1 when the data ID is 1, otherwise selecting the information in the buffer 1; the selection signal 2 selects the information requested in the buffer 1 when the data ID is 2, otherwise selects the information requested in the buffer 2. And so on;
RAM access: the request information of selection signal 0 is used to access data array group 0, and the request information of selection signal 1 is used to access data array group 1. The data array groups are accessed in sequence: when data array group 0 is accessed, the request enters register 0, and the data ID is 0; on the next beat, the request with data ID 0 enters register 1 and the request with data ID 1 enters register 0; on the beat after that, the request with data ID 2 enters register 0, the request with data ID 1 enters register 1, and the request with data ID 0 enters register 2; and so on. The number of data array groups and the number of RAMs in each group are determined by the pipeline width and the cache data size set in Python; once they are determined, the corresponding Verilog can be generated, so the numbers are not limited.
RAM multi-cycle delay implementation logic: when the data array groups are accessed, the request for each data array group enters a register in a pipelined manner and is passed to the next pipeline stage, and a selector selects the request in the register corresponding to the number of RAM multi-cycle delay beats. For example, when configured as 2 beats, the request in register 1 is selected. The number of registers for the RAM multi-cycle beats is determined by the parameters in Python; different parameter values produce different numbers of registers in the generated Verilog.
In a specific application embodiment, taking a configuration with a pipeline width of 256 bits and a cache data width of 1024 bits as an example, the pipeline structure constructed by the method is shown in FIG. 2. Each data array group has two RAM instances, so each group reads and writes 256 bits of data in the pipeline, and with four groups the cached data width is 1024 bits (4 groups × 2 RAMs × 128 bits = 1024 bits). When an external read-write request to access the first data array group is received, the request is buffered into the buffer0 corresponding to the first group and advanced to the next pipeline stage by the selector mux; the information in the request is then modified to spontaneously generate the pipelined read-write requests for the remaining three groups, which are buffered into the corresponding buffers and advanced to the next stage by the selector mux. The access period of each RAM instance has four options, corresponding to the four stages h5, h50, h51, and h52 in the figure; one or more of these four stages are generated by adjusting the Python parameters to achieve the configurable effect. After the data array groups are accessed, the data read by each group is output in a pipelined manner through a mux.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. While the invention has been described above with reference to preferred embodiments, these are not intended to be limiting. Therefore, any simple modification, equivalent variation, or improvement made to the above embodiments according to the technical substance of the present invention shall fall within the scope of protection of the technical solution of the present invention.

Claims (10)

1. A method for variable delay and bit width pipelined caching of data, comprising the steps of:
s01, when a cache request is received, the request is cached in a buffer corresponding to a plurality of data array groups in a running way, and a read-write micro-request for accessing each data array group in the running way is spontaneously generated by changing information in the request;
s02, configuring the width of a pipeline and the width of a cache data block according to the currently required data bit width, and determining the number of data array groups and the number of RAMs (random access memories) used for caching data in each data array group according to the configured width of the pipeline and the configured width of the cache data block so as to realize variable bit width data caching;
s03, accessing each data array group in a streaming mode according to the request, accessing each data array group request, streaming and transmitting the request to a next station when accessing each data array group, obtaining the number of currently configured RAM multi-period accessed registers, and selecting the requests in the registers consistent with the currently configured RAM multi-period delay beat number so as to realize variable-delay data caching.
2. The pipelined cache data caching method with variable delay and bit width according to claim 1, wherein the step S01 comprises: when the pipeline is in a state in which it can receive a new request, receiving a cache request through a handshake; after the handshake succeeds, buffering the request, where the value of the data ID in the request corresponds to one data array group in sequence; performing micro-operations on the buffered request, incrementing the ID number corresponding to the data array group in the request through a control signal, and internally generating micro-requests for each data array group in a pipelined manner, until the request for the last data array group is buffered in the buffer of the last data array group.
3. The pipelined cache data caching method with variable delay and bit width according to claim 1, wherein in step S02, the number of RAMs in a data array group is determined according to the pipeline data width, the number of data array groups is determined according to the cache data block width, and the cache data block width is determined by the expression: cache data block width = number of data array groups × 128 × number of RAMs in one data array group.
4. The pipelined cache data caching method with variable delay and bit width according to claim 1, wherein in step S02, the maximum value of the micro-operation ID number is determined according to the number of RAMs, so as to be used for accessing the different data array groups.
5. The pipelined cache data caching method with variable delay and bit width according to any one of claims 1-4, further comprising, after step S01 and before step S02, each data array group correspondingly receiving a selection signal, so that the data array groups are accessed without gaps after the request enters the pipeline.
6. The pipelined cache data caching method with variable delay and bit width according to any one of claims 1-4, wherein step S02 and step S03 are implemented using Python, the pipeline width and the cache data block width being configured by adjusting parameters in Python, and the number of registers for RAM multi-cycle access likewise being configured by adjusting parameters in Python.
7. A variable delay and bit width pipelined cache data caching apparatus, comprising a plurality of data array groups, each data array group comprising a plurality of RAMs for caching data, and further comprising:
a request receiving module, configured to receive a cache request and, when the cache request is received, buffer the request in a pipelined manner into the buffers corresponding to the plurality of data array groups and generate micro-requests that access each data array group in a pipelined manner;
a bit width control module, configured to set the pipeline width and the cache data block width according to the currently required data bit width, and determine the number of data array groups and the number of RAMs (random access memories) used for caching data in each data array group according to the configured pipeline width and cache data block width, so as to realize data caching with variable bit width;
and a RAM delay control module, configured to access each data array group in a pipelined manner according to the request, where the request for each data array group enters a register in a pipelined manner and is passed to the next pipeline stage when that data array group is accessed; the module obtains the currently configured number of registers for RAM multi-cycle access and selects the request in the register matching the currently configured number of RAM multi-cycle delay beats, so as to realize data caching with variable delay.
8. The variable delay and bit width pipelined cache data caching apparatus of claim 7, further comprising a counter for controlling the pipeline state and the request micro-operations; by counting the life cycle of a request from entering the pipeline to the completion of its RAM access, the counter determines that the pipeline can receive a new request when the count completes.
9. The variable delay and bit width pipelined cache data caching apparatus according to claim 7 or 8, wherein each data array group respectively receives a selection signal: the first selection signal selects the request information on the interface when the request has just entered, and otherwise selects the request information in the corresponding first buffer; each of the remaining selection signals selects the request information in the preceding buffer when the data ID equals its corresponding ID value, and otherwise selects the request information in its own buffer.
10. The variable delay and bit width pipelined cache data caching apparatus of claim 9, wherein the request information of each selection signal is used to access the corresponding data array group. When the data array groups are accessed in sequence, upon accessing the first data array group the request enters the first register corresponding to the first data array group, and the current data ID is the first ID value; on the next beat, the request whose data ID is the first ID value enters the second register, and the request whose data ID is the second ID value enters the first register; on the beat after that, the request whose data ID is the third ID value enters the first register, the request whose data ID is the second ID value enters the second register, and the request whose data ID is the first ID value enters the third register; and so on.
CN202310610570.1A 2023-05-26 2023-05-26 Pipelined cache data caching method and device with variable delay and bit width Pending CN116701245A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310610570.1A CN116701245A (en) 2023-05-26 2023-05-26 Pipelined cache data caching method and device with variable delay and bit width

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310610570.1A CN116701245A (en) 2023-05-26 2023-05-26 Pipelined cache data caching method and device with variable delay and bit width

Publications (1)

Publication Number Publication Date
CN116701245A true CN116701245A (en) 2023-09-05

Family

ID=87828545

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310610570.1A Pending CN116701245A (en) 2023-05-26 2023-05-26 Pipelined cache data caching method and device with variable delay and bit width

Country Status (1)

Country Link
CN (1) CN116701245A (en)

Similar Documents

Publication Publication Date Title
CN113553277B (en) DDR5 SDRAM high-throughput-rate and low-delay PHY interface circuit device
EP2290549B1 (en) Protocol for communication with dynamic memory
US6209071B1 (en) Asynchronous request/synchronous data dynamic random access memory
US6108795A (en) Method for aligning clock and data signals received from a RAM
US7916570B2 (en) Low power memory device
US8239607B2 (en) System and method for an asynchronous data buffer having buffer write and read pointers
US9343127B1 (en) Memory device having an adaptable number of open rows
CN114490460B (en) FLASH controller for ASIC and control method thereof
JPH1173400A (en) Logic mixed dram lsi
US11295803B2 (en) Memory with dynamic voltage scaling
US6950350B1 (en) Configurable pipe delay with window overlap for DDR receive data
CN100343778C (en) Transferring data between differently clocked busses
CN116701245A (en) Pipelined cache data caching method and device with variable delay and bit width
US9367495B1 (en) High speed integrated circuit interface
US7330991B2 (en) Method and/or apparatus for paging to a dynamic memory array
US10304530B2 (en) Per-pin compact reference voltage generator
US11493949B2 (en) Clocking scheme to receive data
CN112463668B (en) Multichannel high-speed data access structure based on STT-MRAM
Xiang et al. Design and verification of convolutional neural network accelerator
CN114766055A (en) Method and system for memory control
Cosoroaba SDRAM-high performance memory devices running CPU bus speeds
CN116741224A (en) Data writing circuit, data writing method, and memory
Aqueel et al. A high performance DDR3 SDRAM controller
US7639768B1 (en) Method for improving performance in a mobile device
CN115934627A (en) System on chip and application processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination