CN116069719A - Processor, memory controller, system-on-chip and data prefetching method - Google Patents

Processor, memory controller, system-on-chip and data prefetching method

Info

Publication number
CN116069719A
Authority
CN
China
Prior art keywords: cache line, data corresponding, target, request message, prefetched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111290828.1A
Other languages
Chinese (zh)
Inventor
王科兵
邸千力
周永彬
陈章麒
蒋志军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202111290828.1A
Publication of CN116069719A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/781 On-chip cache; Off-chip memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F 12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0862 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0877 Cache access modes
    • G06F 12/0882 Page mode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F 9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a processor, a memory controller, a system-on-chip, and a data prefetching method, relating to the field of storage and intended to improve the coverage rate of data prefetching. The processor comprises a core, a cache controller, and a cache, with a thread running on the core. The cache controller is configured to: acquire the page access density of at least one physical page accessed by the thread, where the page access density is the ratio, within a preset time, of the cache lines accessed by the thread in a single physical page to all the cache lines in that page; when the thread accesses data corresponding to a target cache line of a target physical page and the page access density of the at least one physical page satisfies a first condition, send a request message to a memory controller, where the request message requests data corresponding to a plurality of cache lines in a memory, the plurality of cache lines comprising the target cache line and a prefetched cache line associated with the target cache line in the target physical page; and receive the data corresponding to the plurality of cache lines from the memory controller and store it in the cache.

Description

Processor, memory controller, system-on-chip and data prefetching method
Technical Field
The present application relates to the field of memory, and more particularly, to a processor, a memory controller, a system on chip (SoC), and a data prefetching method.
Background
Because processor performance has improved far faster than memory performance, memory performance constrains the processor and has become the main performance bottleneck of the system. This problem can be mitigated by data caching and data prefetching. Data caching means temporarily storing data from the memory in a cache inside the processor, so that the processor accesses the cache rather than accessing the memory directly. Data prefetching means loading data from the memory into the cache in advance, so that it is ready for the processor to access.
In the prior art, the granularity of data prefetching is a cache line, typically 32 or 64 bytes in size. The amount of data prefetched at a time is small, so the proportion of the data required by the processor that is prefetched into the cache in advance is relatively low; that is, the coverage rate is low.
Disclosure of Invention
The embodiments of the present application provide a processor, a memory controller, a system-on-chip, and a data prefetching method for improving the coverage rate of data prefetching.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
in a first aspect, there is provided a processor comprising a core, a cache controller, and a cache, wherein a thread runs on the core. The cache controller is configured to: acquire the page access density of at least one physical page accessed by the thread, where the page access density is the ratio, within a preset time, of the cache lines accessed by the thread in a single physical page to all the cache lines in that page; when the thread accesses data corresponding to a target cache line of a target physical page and the page access density of the at least one physical page satisfies a first condition, send a request message to a memory controller, where the request message requests data corresponding to a plurality of cache lines in a memory, the plurality of cache lines comprising the target cache line and a prefetched cache line associated with the target cache line in the target physical page; and receive the data corresponding to the plurality of cache lines from the memory controller and store it in the cache.
In the processor provided by the embodiments of the present application, the page access density of the at least one physical page accessed by the thread run by the core satisfying the first condition means that, judging by the thread's historical behavior when accessing physical pages, the thread tends to access the data of many cache lines within the same physical page. Therefore, when the thread accesses data corresponding to a target cache line in a new physical page, the processor requests from the memory controller not only the data corresponding to the target cache line but also the data corresponding to the prefetched cache line associated with the target cache line in the same physical page, thereby improving the coverage rate of data prefetching.
In one possible implementation, the page access density of the at least one physical page satisfying the first condition includes: the average value of the page access densities of the at least one physical page is greater than a first threshold, or the maximum value of the page access densities of the at least one physical page is greater than a second threshold, or each value of the page access densities of the at least one physical page is greater than a third threshold. The physical meaning of the first condition is: if most of the physical pages accessed by a thread have a high page access density, then when the thread accesses a new physical page, the data corresponding to a plurality of cache lines of that page can be prefetched from the memory into the cache in advance, which improves the prefetch coverage rate without lowering the accuracy rate too much.
In one possible implementation, the cache controller is specifically configured to: receive the data corresponding to the target cache line from the memory controller, and then receive the data corresponding to the prefetched cache line from the memory controller. The processor first receives the data corresponding to the target cache line, which the core needs immediately, and then receives the data corresponding to the prefetched cache line, which the core may need in the future, so the core can begin processing as soon as the data corresponding to the target cache line arrives.
In one possible implementation, the cache refers to a mid-level cache or a last-level cache. For the large-capacity mid-level cache (second level cache) and last-level cache (third level cache), the coverage rate matters more, so the embodiments of the present application are better suited to implementing data prefetching for these caches.
In a second aspect, a memory controller is provided, including a first decision maker, a second decision maker, and a communication interface. The communication interface is used for receiving a request message from the processor, wherein the request message is used for requesting data corresponding to a plurality of cache lines in the memory, and the plurality of cache lines comprises a target cache line in a physical page accessed by the processor and a prefetched cache line associated with the target cache line. The first decision maker is used for sending the data corresponding to the target cache line to the processor through the communication interface; the second decision maker is used for sending the data corresponding to the prefetched cache line to the processor through the communication interface. The first decision maker and the second decision maker may be combined into one decision maker, or may be integrated into one functional module.
The memory controller provided by the embodiments of the present application can send to the processor both the data corresponding to the target cache line and the data corresponding to the prefetched cache line associated with the target cache line, thereby improving the coverage rate of data prefetching.
In one possible implementation, for the target cache line and the prefetched cache line corresponding to the same request message, the first decision maker sends the data corresponding to the target cache line first, and the second decision maker sends the data corresponding to the prefetched cache line afterwards. Because the data corresponding to the target cache line is what the core needs most urgently, the memory controller sends it first, so that the core obtains the data corresponding to the target cache line as soon as possible, reducing the latency of the core fetching data from the memory.
In one possible implementation, if there are multiple prefetched cache lines, the second decision maker preferentially sends the data corresponding to the prefetched cache line with the lowest latency. For example, when the memory controller addresses the memory, the latency of switching between different ranks > the latency of switching between different rows > the latency of switching between different banks > the latency of switching between different columns of the same row, so the second decision maker may preferentially send the data corresponding to a prefetched cache line in the same rank and the same row but a different bank.
In one possible implementation, the request message includes a first request message and a second request message; the first request message requests to acquire data corresponding to a first target cache line and data corresponding to a first prefetched cache line associated with the first target cache line, and the second request message requests to acquire data corresponding to a second target cache line and data corresponding to a second prefetched cache line associated with the second target cache line; if the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are located in the same row of the same memory bank in the memory, after the first decision maker sends the data corresponding to the first target cache line, the second decision maker preferentially sends the data corresponding to the first prefetched cache line, then the first decision maker sends the data corresponding to the second target cache line, and the second decision maker sends the data corresponding to the second prefetched cache line.
Since the time to switch between different columns of the same row is the shortest, the latency of sending data located in different columns of the same row of the same bank is the smallest.
In one possible implementation, the request message includes a first request message and a second request message; the first request message requests to acquire data corresponding to a first target cache line and data corresponding to a first prefetched cache line associated with the first target cache line, and the second request message requests to acquire data corresponding to a second target cache line and data corresponding to a second prefetched cache line associated with the second target cache line; if the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are not located in the same row of the same memory bank in the memory, the first decision maker firstly sends the data corresponding to the first target cache line and the data corresponding to the second target cache line, and then the second decision maker sends the data corresponding to the first prefetched cache line and the data corresponding to the second prefetched cache line.
Because the data corresponding to the target cache lines is what the core needs most urgently, the memory controller transmits it first and transmits the data corresponding to the prefetched cache lines, which the core may need in the future, afterwards. The core can therefore process the target data as soon as it arrives; compared with transmitting the data corresponding to the prefetched cache lines first, this reduces the latency of the core fetching data from the memory.
In one possible implementation, when the first decision maker or the second decision maker sends data within a same row in the memory, the data in the same row is sent in any one of the following orders: from a low address to a high address; or from a high address to a low address; or toward the high address and the low address, centered on the data corresponding to the target cache line; or in a random order.
The present application is not limited to the several ways described above.
In a third aspect, a system-on-chip is provided, including a processor as described in the first aspect and any of its embodiments, a memory controller as described in the second aspect and any of its embodiments, and a memory.
In a fourth aspect, a data prefetching method is provided, including: acquiring the page access density of at least one physical page accessed by a thread, wherein the page access density refers to the ratio, within a preset time, of the cache lines accessed by the thread in a single physical page to all the cache lines in that page; when the thread accesses data corresponding to a target cache line of a target physical page and the page access density of the at least one physical page satisfies a first condition, sending a request message to a memory controller, wherein the request message is used for requesting data corresponding to a plurality of cache lines in a memory, and the plurality of cache lines comprises the target cache line and a prefetched cache line associated with the target cache line in the target physical page; and receiving the data corresponding to the plurality of cache lines from the memory controller and storing the data in the cache.
In one possible implementation, the page access density of the at least one physical page satisfies a first condition, including: the average value of the page access densities of the at least one physical page is greater than a first threshold, or the maximum value of the page access densities of the at least one physical page is greater than a second threshold, or each value of the page access densities of the at least one physical page is greater than a third threshold.
In one possible implementation, receiving data corresponding to a plurality of cache lines from a memory controller includes: data corresponding to the target cache line is received from the memory controller, and then data corresponding to the prefetched cache line is received from the memory controller.
In one possible implementation, the cache refers to a mid-level cache or a last-level cache.
In a fifth aspect, a data prefetching method is provided, including: receiving a request message from a processor, wherein the request message is used for requesting to acquire data corresponding to a plurality of cache lines in a memory, and the plurality of cache lines comprise target cache lines in physical pages accessed by the processor and prefetch cache lines associated with the target cache lines; and sending the data corresponding to the target cache line and the data corresponding to the prefetched cache line to the processor.
In one possible implementation, sending data corresponding to the target cache line and data corresponding to the prefetch cache line to the processor includes: and for the target cache line and the prefetched cache line corresponding to the same request message, firstly sending the data corresponding to the target cache line, and then sending the data corresponding to the prefetched cache line.
In one possible embodiment, the method further comprises: if there are a plurality of prefetched cache lines, the data corresponding to the prefetched cache line with the lowest delay is preferentially sent.
In one possible implementation, the request message includes a first request message and a second request message; the first request message requests to acquire data corresponding to a first target cache line and data corresponding to a first prefetched cache line associated with the first target cache line, and the second request message requests to acquire data corresponding to a second target cache line and data corresponding to a second prefetched cache line associated with the second target cache line; sending data corresponding to the target cache line and data corresponding to the prefetched cache line to the processor, including: if the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are located in the same row of the same memory bank in the memory, after the data corresponding to the first target cache line is sent, the data corresponding to the first prefetched cache line is sent preferentially, then the data corresponding to the second target cache line is sent, and then the data corresponding to the second prefetched cache line is sent.
In one possible implementation, the request message includes a first request message and a second request message; the first request message requests to acquire data corresponding to a first target cache line and data corresponding to a first prefetched cache line associated with the first target cache line, and the second request message requests to acquire data corresponding to a second target cache line and data corresponding to a second prefetched cache line associated with the second target cache line; sending data corresponding to the target cache line and data corresponding to the prefetched cache line to the processor, including: if the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are not located in the same row of the same memory bank in the memory, the data corresponding to the first target cache line and the data corresponding to the second target cache line are sent first, and then the data corresponding to the first prefetched cache line and the data corresponding to the second prefetched cache line are sent.
In one possible embodiment, the method further comprises: when transmitting the data in the same row in the memory, transmitting the data in the same row according to any one of the following modes: transmitting data in the same row from a low address to a high address; alternatively, the data in the same row is transmitted from the high address to the low address; or, data corresponding to the target cache line is used as a center to send data in the same row to the high address and the low address; alternatively, the data within the same row is transmitted in random order.
In a sixth aspect, there is provided a computer readable storage medium having instructions stored therein which, when executed on a processor, cause the processor to perform the method of the fourth aspect and any embodiment thereof.
In a seventh aspect, there is provided a computer readable storage medium having instructions stored therein that, when executed on a memory controller, cause the memory controller to perform the method of the fifth aspect and any embodiment thereof.
In an eighth aspect, there is provided a computer program product comprising instructions which, when executed on a processor, cause the processor to perform the method of the fourth aspect and any of its embodiments.
In a ninth aspect, there is provided a computer program product comprising instructions which, when executed on a memory controller, cause the memory controller to perform the method of the fifth aspect and any of its embodiments.
For the technical effects of the third through ninth aspects, refer to the technical effects of the first and second aspects and any implementation thereof.
Drawings
FIG. 1 is a schematic structural diagram of an SoC chip according to an embodiment of the present application;
FIG. 2 is a flow chart of a data prefetching method according to an embodiment of the present application;
FIG. 3A is a schematic diagram of a prefetched cache line and a target cache line according to an embodiment of the present application;
FIG. 3B is a schematic diagram of another prefetched cache line and target cache line according to an embodiment of the present application;
FIG. 3C is a schematic diagram of yet another prefetched cache line and target cache line according to an embodiment of the present application;
FIG. 3D is a schematic diagram of still another prefetched cache line and target cache line according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a memory and a memory controller according to an embodiment of the present application.
Detailed Description
It should be noted that the terms "first," "second," and the like in the embodiments of the present application are used to distinguish between features of the same type and are not to be construed as indicating relative importance, quantity, or order.
The terms "exemplary" or "such as" and the like, as used in connection with embodiments of the present application, are intended to be exemplary, or descriptive. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
The terms "coupled" and "connected" in connection with embodiments of the present application are to be construed broadly, and may refer, for example, to a physical direct connection, or to an indirect connection via electronic devices, such as, for example, a connection via electrical resistance, inductance, capacitance, or other electronic devices.
Most current data prefetching schemes adopt an accuracy-first design: the granularity of data prefetching is a single cache line, and the amount of data prefetched each time is small, so the proportion of all the data used by the processor that is prefetched into the cache in advance is low; that is, the coverage rate is low. Physical pages and page access density are described in detail below.
As shown in FIG. 1, an embodiment of the present application provides an SoC chip 10, including a processor 11, a memory controller 12, and a memory 13 (e.g., a dynamic random access memory (DRAM)). The processor 11 may be a central processing unit (CPU), and the SoC chip 10 may further include other types of processors, such as a graphics processing unit (GPU) or a neural network processing unit (NPU). The processor 11 includes at least one core 111, a cache controller 112, and a cache 113 (e.g., a static random access memory (SRAM)). The cache 113 may include a first level cache (FLC) 1131, a second level cache (also called a mid level cache (MLC)) 1132, and a third level cache (also called a last level cache (LLC)) 1133.
At least one core 111 may run threads that read and write data stored in the memory 13 through the memory controller 12. However, since the operating speeds of the core 111 and the memory 13 differ greatly, a cache 113 is provided between the processor 11 and the memory 13 to hold the data exchanged between the core 111 and the memory 13; the cache 113 has a smaller capacity, but its operating speed is close to that of the processor 11. Threads exhibit locality, namely temporal locality and spatial locality: temporal locality means that data recently accessed by the core 111 will likely be accessed by the core 111 again in the short term; spatial locality means that data near the data accessed by the core 111 will likely also be accessed by the core 111 in the short term. Therefore, if the data the core 111 has just accessed, together with nearby data, is cached in the cache 113, whose operating speed is much faster than that of the memory 13, then the next time the core 111 accesses that data it can be obtained directly from the cache 113, improving the data access speed by an order of magnitude.
The operating speeds of the first level cache 1131, the second level cache 1132, and the third level cache 1133 in the cache 113 decrease in sequence, while their capacities increase in sequence. The first level cache 1131 caches the instructions and data that the core 111 hits most often; it runs in the same clock cycle as the core 111 and is very fast, but its capacity is particularly small because of its high cost. Each core 111 has a corresponding second level cache 1132, which may cache data exclusive to that core 111. There is generally one third level cache 1133, whose stored data is shared among the multiple cores 111; it is mainly used to reduce the latency of data transfer between the cores 111 and the memory 13.
The cache controller 112 is configured to load related data stored in the memory 13 into the cache 113 in advance, according to the address in the memory 13 of the data accessed by the core 111, thereby implementing data prefetching.
The granularity of data prefetching is a cache line. A thread run by the core 111 does not directly access physical addresses in the memory 13; it accesses virtual addresses, which are mapped to physical addresses in units of physical pages. Each physical page may include a plurality of cache lines, and the physical addresses of the cache lines within the same physical page are consecutive.
The performance indicators for evaluating data prefetching include:
Coverage rate: the proportion of all the data used by the core 111 that is prefetched into the cache 113. The higher the coverage rate, the better; a high coverage rate means most of the data used by the core 111 was prefetched into the cache 113.
Accuracy rate: the proportion of all prefetched data that is actually accessed by the core 111. The higher the accuracy rate, the better; if it is too low, useless prefetched data pollutes the cache 113 and occupies the bandwidth of the memory 13, degrading system performance.
Timeliness: if data has just been prefetched into the cache 113 by the time it is accessed, timeliness is good. If data has not yet been prefetched into the cache 113 when it is accessed, timeliness is poor. Likewise, if data is prefetched into the cache 113 too early, it may evict other useful data, which is also poor timeliness.
There is a trade-off between coverage and accuracy. To improve coverage, the cache 113 prefetches more data, which increases the probability of prefetching useless data and thus lowers accuracy. To improve accuracy, the cache 113 prefetches only the data most likely to be used, which may leave out useful data and thus lowers coverage. The priority of these performance indicators differs for caches of different capacities: for the small-capacity first level cache 1131, accuracy is more important; for the large-capacity second level cache 1132 and third level cache 1133, coverage is more important, so the embodiments of the present application are better suited to implementing data prefetching for those caches.
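By way of illustration only, and not as part of the disclosed embodiments, the two metrics can be expressed as ratios over event counters; the following C sketch uses structure and field names that are assumptions made for this illustration.

/* Illustrative sketch: coverage and accuracy as ratios over hypothetical
 * event counters (names are assumptions, not part of the embodiments). */
struct prefetch_stats {
    unsigned long demand_lines;     /* cache lines the core actually used      */
    unsigned long prefetched_lines; /* cache lines brought in by prefetching   */
    unsigned long prefetch_hits;    /* prefetched lines later used by the core */
};

/* coverage: proportion of all data used by the core that was prefetched */
static double coverage(const struct prefetch_stats *s)
{
    return s->demand_lines ? (double)s->prefetch_hits / s->demand_lines : 0.0;
}

/* accuracy: proportion of all prefetched data actually used by the core */
static double accuracy(const struct prefetch_stats *s)
{
    return s->prefetched_lines ? (double)s->prefetch_hits / s->prefetched_lines : 0.0;
}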
In this embodiment, the cache controller 112 and the memory controller 12 are configured to perform the data prefetching method shown in fig. 2, including:
s101, the cache controller 112 obtains the page access density of the thread to access at least one physical page.
The page access density refers to the ratio, within a preset time, of the cache lines accessed by the thread in a single physical page to all the cache lines in that page, i.e., how many of the cache lines of a single physical page were actually accessed by the thread running on the core 111. For example, assuming a physical page size of 4 KB and a cache line size of 64 bytes, each physical page includes 16 cache lines; if 8 of them are actually accessed by the core 111, the page access density is 50%. A high page access density means that most of the data of a physical page is actually accessed.
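For illustration only, the density computation can be sketched in C as follows, matching the example above (4 KB page, 64-byte cache lines, so 16 lines per page); the helper name is an assumption, and the sketch uses the GCC/Clang popcount builtin.

#include <stdint.h>

/* Illustrative sketch: page access density of one physical page, derived
 * from a 16-bit access bitmap (4 KB page / 64-byte lines = 16 lines). */
#define LINES_PER_PAGE 16

static double page_access_density(uint16_t access_bitmap)
{
    int accessed = __builtin_popcount(access_bitmap); /* set bits = accessed lines */
    return (double)accessed / LINES_PER_PAGE;         /* e.g. 8 of 16 -> 50% */
}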
The cache controller 112 may maintain a page access tracking (PAT) table, shown in Table 1, used to track the page access density of at least one physical page for different threads. The page access tracking table includes a thread identifier (TID), a physical address, a cache line access bitmap, and a timer. The thread identifier indicates which thread of the core 111 accessed a given physical page. The physical address indicates the physical address of the physical page accessed by the thread; since the physical addresses of the cache lines in the same physical page are consecutive, the physical page can be determined from the physical address of the cache line accessed by the thread. Thus, the corresponding entry in the page access tracking table can be found from the thread identifier of the thread and the physical address of the cache line it accessed.
The cache line access bitmap indicates which cache lines in the physical page have been accessed by the thread. For example, each bit of the bitmap corresponds to one cache line in the physical page; a bit value of 1 indicates that the corresponding cache line has been accessed by the thread and 0 indicates that it has not, or vice versa. The physical meaning of the cache line access bitmap is the page access density: for example, assuming each physical page includes 16 cache lines, the bitmap has 16 bits, and if 8 of the cache lines have actually been accessed by the thread, then 8 bits of the bitmap are set and the corresponding page access density is 50%.
The timer indicates the time window over which the page access density is counted. The timer of an entry is cleared whenever a new entry is created; it then increases monotonically every clock cycle, and its overflow indicates that the entry has become invalid.
TABLE 1

TID | Physical address | Cache line access bitmap | Timer
XXXXXXXXXX | XXXXXXXXXX | XXXXXXXXXX | XXXXXXXXXX
XXXXXXXXXX | XXXXXXXXXX | XXXXXXXXXX | XXXXXXXXXX
XXXXXXXXXX | XXXXXXXXXX | XXXXXXXXXX | XXXXXXXXXX
When a thread on the core 111 accesses a target cache line in a physical page, if this is not the thread's first access to that physical page, that is, if there is already an entry whose thread identifier matches the thread and whose physical address corresponds to the physical page, the cache controller 112 sets the bit corresponding to that cache line in the entry's cache line access bitmap.
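A minimal C sketch of this lookup-and-update step follows, for illustration only; the field widths, the 4 KB page size, and the 64-byte line size are assumptions, and a real implementation would also search the table and handle entry allocation.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of one page access tracking (PAT) entry and the
 * bitmap update on a repeated access; widths and masks are assumptions. */
struct pat_entry {
    uint32_t tid;          /* thread identifier                            */
    uint64_t page_addr;    /* physical address of the tracked page         */
    uint16_t line_bitmap;  /* one bit per cache line in the page           */
    uint32_t timer;        /* cleared on allocation, incremented per cycle */
};

static bool pat_update_on_access(struct pat_entry *e, uint32_t tid,
                                 uint64_t line_addr)
{
    uint64_t page = line_addr & ~0xFFFull;  /* 4 KB page base (assumed)   */
    unsigned line = (line_addr >> 6) & 0xF; /* 64-byte line index in page */
    if (e->tid != tid || e->page_addr != page)
        return false;                       /* entry does not match       */
    e->line_bitmap |= (uint16_t)1u << line; /* mark the line as accessed  */
    return true;
}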
S102: when the thread accesses the data corresponding to the target cache line in the target physical page, and the page access density of at least one physical page satisfies the first condition, the cache controller 112 sends a request message to the memory controller 12.
The request message is used to request data corresponding to a plurality of cache lines in the memory 13 (to be stored, specifically, in the second level cache 1132 or the third level cache 1133). The plurality of cache lines includes the target cache line and a prefetched cache line associated with the target cache line in the target physical page, where the target cache line is the cache line the core requests to access, and the prefetched cache line associated with the target cache line may be at least one cache line adjacent to the target cache line. For example, as shown in FIG. 3A, the prefetched cache lines associated with the target cache line may be at least one cache line adjacent to the target cache line in the high-address direction and in the low-address direction; as shown in FIG. 3B, they may be at least one cache line adjacent to the target cache line in the high-address direction; as shown in FIG. 3C, they may be at least one cache line adjacent to the target cache line in the low-address direction; as shown in FIG. 3D, they may be all the other cache lines of the target physical page besides the target cache line, i.e., the plurality of cache lines may be all the cache lines of the entire target physical page. In addition, the plurality of cache lines may be a set of cache lines including the target cache line that makes up some proportion (e.g., one half or one quarter) of the target physical page. A sketch of these window shapes follows below.
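For illustration only, the four window shapes of FIGS. 3A-3D can be expressed as an index range of cache lines within the target page; the enum names and the 'radius' parameter are assumptions of this sketch.

/* Illustrative sketch of the prefetch-window shapes of FIGS. 3A-3D as a
 * clamped index range [*lo, *hi] of cache lines within the target page. */
enum window_mode { BOTH_SIDES, HIGH_SIDE, LOW_SIDE, WHOLE_PAGE };

static void prefetch_window(enum window_mode m, int target, int radius,
                            int lines_per_page, int *lo, int *hi)
{
    switch (m) {
    case BOTH_SIDES: *lo = target - radius; *hi = target + radius;    break; /* FIG. 3A */
    case HIGH_SIDE:  *lo = target;          *hi = target + radius;    break; /* FIG. 3B */
    case LOW_SIDE:   *lo = target - radius; *hi = target;             break; /* FIG. 3C */
    case WHOLE_PAGE: *lo = 0;               *hi = lines_per_page - 1; break; /* FIG. 3D */
    }
    if (*lo < 0) *lo = 0;                   /* clamp to page bounds */
    if (*hi > lines_per_page - 1) *hi = lines_per_page - 1;
}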
The request message may include the physical address of the target cache line. The request message may also include the physical address of the prefetched cache line (i.e., an explicit indication); alternatively, the request message may merely indicate that, in addition to the data in the memory 13 corresponding to the target cache line, the data in the memory 13 corresponding to a prefetched cache line associated with the target cache line is also requested, with the specific prefetched cache line determined by the memory controller 12.
The page access density of the at least one physical page satisfying the first condition includes: the average value of the page access densities of the at least one physical page is greater than a first threshold (e.g., 50%, 75%, or 80%); or the maximum value of the page access densities of the at least one physical page is greater than a second threshold; or each value of the page access densities of the at least one physical page is greater than a third threshold.
The physical meaning of the first condition is: if most of the physical pages accessed by a thread have a high page access density, then when the thread accesses a new physical page, the data corresponding to a plurality of cache lines of that page can be prefetched from the memory into the cache in advance, which improves the prefetch coverage rate without lowering the accuracy rate too much.
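By way of illustration only, the first condition can be sketched in C as below; the thresholds are free parameters, and which of the three variants is applied is an implementation choice (the sketch simply exposes all three).

#include <stdbool.h>

/* Illustrative sketch of the three variants of the first condition over
 * the densities of the tracked physical pages; thresholds are assumed. */
static bool first_condition(const double *density, int n,
                            double t_avg, double t_max, double t_each)
{
    double sum = 0.0, max = 0.0;
    bool each_above = true;
    for (int i = 0; i < n; i++) {
        sum += density[i];
        if (density[i] > max) max = density[i];
        if (density[i] <= t_each) each_above = false;
    }
    /* a real implementation would pick one variant; any of:         */
    return (n > 0 && sum / n > t_avg)  /* average above first threshold  */
        || (max > t_max)               /* maximum above second threshold */
        || (n > 0 && each_above);      /* every value above third threshold */
}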
In addition, the cache controller 112 may create a new entry in the page access tracking table for the target physical page. In the case where the cache controller 112 pre-allocates memory space for the page access tracking table, if there is free space a new entry may be created directly; if there is no free space, an old entry may be replaced with the new entry, for example using a least recently used (LRU) replacement algorithm.
S103: the memory controller 12 sends the data corresponding to the plurality of cache lines in the memory 13 to the cache controller 112.
As shown in FIG. 4, the memory 13 includes a plurality of ranks 131, each rank 131 including a plurality of banks 1311, and each bank 1311 including N rows of storage cells 13111, where N is a positive integer. The memory controller 12 includes a first decision maker 121, a second decision maker 122, and a communication interface 123; the first decision maker 121 and the second decision maker 122 may be combined into one decision maker, or may be integrated into one functional module. The first decision maker 121 and the second decision maker 122 are connected to each row of storage cells 13111 in the memory 13, and the communication interface 123 is connected to the first decision maker 121 and the second decision maker 122, respectively. The communication interface 123 is configured to receive the request message from the processor 11 and to exchange data with the cache controller 112 of the processor 11; the first decision maker 121 is configured to read and write the data corresponding to the target cache line and send it through the communication interface 123; the second decision maker 122 is configured to read and write the data corresponding to the prefetched cache line and send it through the communication interface 123.
The memory controller 12 sends the data corresponding to the plurality of cache lines following a latency minimization principle, thereby improving the data transmission bandwidth. Specifically:
for the target cache line and the prefetched cache line corresponding to the same request message, the first decision maker 121 sends the data corresponding to the target cache line first, and the second decision maker 122 sends the data corresponding to the prefetched cache line afterwards. Because the data corresponding to the target cache line is what the core 111 needs most urgently, the memory controller 12 sends it first, so the core 111 can process it immediately on arrival; compared with sending the data corresponding to the prefetched cache line first, this reduces the latency of the core 111 fetching data from the memory 13. If there are multiple prefetched cache lines, the second decision maker 122 preferentially sends the data corresponding to the prefetched cache line with the lowest latency. For example, when the memory controller 12 addresses the memory, the latency of switching between different ranks > the latency of switching between different rows > the latency of switching between different banks > the latency of switching between different columns of the same row, so the second decision maker 122 may preferentially send the data corresponding to a prefetched cache line in the same rank and the same row but a different bank, as sketched below.
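By way of illustration, the latency hierarchy above can be captured as an ordinal cost relative to the line just sent; the numeric weights below are arbitrary and only their ordering matters, so this is a sketch rather than a timing model.

/* Illustrative cost model for ordering prefetched lines: switching rank
 * costs most, then row, then bank, then column within the same row.    */
struct dram_addr { int rank, bank, row, col; };

static int switch_cost(struct dram_addr cur, struct dram_addr next)
{
    if (next.rank != cur.rank) return 4; /* rank switch: highest latency  */
    if (next.row  != cur.row)  return 3; /* row switch                    */
    if (next.bank != cur.bank) return 2; /* bank switch                   */
    if (next.col  != cur.col)  return 1; /* column switch within same row */
    return 0;                            /* same address: no switch       */
}
/* The second decision maker would send the candidate with the lowest
 * cost relative to the most recently sent line: a line in the same row
 * and bank (column switch) first, then a line in the same rank and row
 * but a different bank, before lines requiring a row or rank switch.   */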
Assuming that the memory controller 12 receives a first request message and a second request message from the processor 11, the first request message requests to acquire data corresponding to a first target cache line and data corresponding to a first prefetch cache line associated with the first target cache line, and the second request message requests to acquire data corresponding to a second target cache line and data corresponding to a second prefetch cache line associated with the second target cache line.
If the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are located in the same row of the same bank in the memory 13, then after the first decision maker 121 sends the data corresponding to the first target cache line, the second decision maker 122 preferentially sends the data corresponding to the first prefetched cache line; the first decision maker 121 then sends the data corresponding to the second target cache line, and the second decision maker 122 then sends the data corresponding to the second prefetched cache line. Since the time to switch between different columns of the same row is the shortest, the latency of sending data located in the same row of the same bank is the smallest.
If the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are not located in the same row of the same bank in the memory 13, the first decision maker 121 first sends the data corresponding to the first target cache line and the data corresponding to the second target cache line, and then the second decision maker 122 sends the data corresponding to the first prefetched cache line and the data corresponding to the second prefetched cache line. Because the data corresponding to the target cache lines is what the core 111 needs most urgently, the memory controller 12 sends it first, so that the core 111 obtains the data corresponding to the target cache lines as soon as possible, reducing the latency of the core 111 fetching data from the memory 13.
In addition, the order in which the first decision maker 121 or the second decision maker 122 sends the data within a same row of the memory 13 may be optimized according to the sending latency. For example, the data in the same row may be sent from the low address to the high address; or from the high address to the low address; or toward the high address and the low address, centered on the data corresponding to the target cache line; or in a random order. These four orders are sketched below.
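A C sketch of the four in-row send orders follows, for illustration only; 'n' is the number of cache lines held in the row, 'center' is the index of the target cache line, and the mode numbering is an assumption.

#include <stdlib.h>

/* Illustrative sketch: fill 'order' with the indices of the n cache lines
 * of one row in the chosen send order (mode numbering is an assumption). */
static void in_row_send_order(int n, int center, int mode, int *order)
{
    switch (mode) {
    case 0: /* from low address to high address */
        for (int i = 0; i < n; i++) order[i] = i;
        break;
    case 1: /* from high address to low address */
        for (int i = 0; i < n; i++) order[i] = n - 1 - i;
        break;
    case 2: /* outward toward high and low addresses, centered on target */
        order[0] = center;
        for (int i = 1, k = 1; k < n; i++) {
            if (center + i < n)  order[k++] = center + i;
            if (center - i >= 0) order[k++] = center - i;
        }
        break;
    case 3: /* random order (Fisher-Yates shuffle) */
        for (int i = 0; i < n; i++) order[i] = i;
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }
        break;
    }
}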
S104: the cache controller 112 receives the data corresponding to the plurality of cache lines from the memory controller 12 and stores it in the cache 113.
The cache controller 112 stores the data into the cache lines in the order in which the data is received; the receiving order follows the order in which the memory controller 12 sends the data, described in step S103, and is not repeated here.
In the processor, the memory controller, the system-on-chip, and the data prefetching method provided by the embodiments of the present application, the page access density of the at least one physical page accessed by the thread run by the core satisfying the first condition means that, judging by the thread's historical behavior when accessing physical pages, the thread tends to access the data of many cache lines within the same physical page. Therefore, when the thread accesses data corresponding to a target cache line in a new physical page, the processor requests from the memory controller not only the data corresponding to the target cache line but also the data corresponding to the prefetched cache line associated with the target cache line in the same physical page, thereby improving the coverage rate of data prefetching.
Embodiments of the present application also provide a computer readable storage medium having instructions stored therein that, when executed on a processor or memory controller, cause the processor or memory controller to perform the method of fig. 2.
Embodiments of the present application also provide a computer program product comprising instructions that, when executed on a processor or memory controller, cause the processor or memory controller to perform the method of fig. 2.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system, apparatus and module may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, e.g., the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or components may be combined or integrated into another device, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, indirect coupling or communication connection of devices or modules, electrical, mechanical, or other form.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physically separate, i.e., may be located in one device, or may be distributed over multiple devices. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated in one device, or each module may exist alone physically, or two or more modules may be integrated in one device.
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using a software program, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A processor, comprising: a core, a cache controller, and a cache, wherein a thread runs on the core, and the cache controller is configured to:
acquiring the page access density of at least one physical page accessed by the thread, wherein the page access density refers to the ratio, within a preset time, of the cache lines accessed by the thread in a single physical page to all the cache lines in that page;
when the thread accesses data corresponding to a target cache line of a target physical page and the page access density of at least one physical page meets a first condition, sending a request message to a memory controller, wherein the request message is used for requesting to acquire data corresponding to a plurality of cache lines in a memory, and the plurality of cache lines comprise the target cache line and a prefetched cache line associated with the target cache line in the target physical page;
and receiving data corresponding to the plurality of cache lines from the memory controller and storing the data in the cache.
2. The processor of claim 1, wherein the page access density of the at least one physical page satisfies a first condition, comprising:
the average value of the page access densities of the at least one physical page is greater than a first threshold, or the maximum value of the page access densities of the at least one physical page is greater than a second threshold, or each value of the page access densities of the at least one physical page is greater than a third threshold.
3. The processor of any of claims 1-2, wherein the cache controller is specifically configured to:
and receiving the data corresponding to the target cache line from the memory controller, and then receiving the data corresponding to the prefetched cache line from the memory controller.
4. A processor according to any one of claims 1-3, wherein the cache refers to a mid-level cache or a last-level cache.
5. A memory controller, comprising a first decision maker, a second decision maker, and a communication interface;
the communication interface is used for receiving a request message from the processor, wherein the request message is used for requesting to acquire data corresponding to a plurality of cache lines in a memory, and the plurality of cache lines comprise a target cache line in a physical page accessed by the processor and a prefetched cache line associated with the target cache line;
The first decision maker is used for sending data corresponding to the target cache line to the processor through the communication interface;
the second decision maker is used for sending data corresponding to the prefetched cache line to the processor through the communication interface.
6. The memory controller according to claim 5, wherein, for the target cache line and the prefetched cache line corresponding to the same request message, the first decision maker first sends the data corresponding to the target cache line, and the second decision maker then sends the data corresponding to the prefetched cache line.
7. The memory controller according to claim 5 or 6, wherein, if there are a plurality of prefetched cache lines, the second decision maker preferentially sends the data corresponding to the prefetched cache line with the lowest latency.
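Claims 6 and 7 together fix an ordering: demand data first, then prefetched lines cheapest-first. A hedged C sketch, with an assumed per-line latency estimate standing in for whatever the second decision maker actually measures:

#include <stdio.h>

typedef struct {
    unsigned line;             /* cache line index within the page   */
    unsigned latency_cycles;   /* estimated access latency (assumed) */
} pf_entry_t;

/* The first decision maker sends the target line at once; the second
 * decision maker then repeatedly picks the cheapest remaining
 * prefetched line (repeated selection rather than a full sort, to
 * mirror a cycle-by-cycle arbiter). */
void serve_request(unsigned target_line, pf_entry_t *pf, unsigned n)
{
    printf("send target line %u\n", target_line);

    for (unsigned sent = 0; sent < n; sent++) {
        unsigned best = 0;
        for (unsigned i = 1; i < n; i++)
            if (pf[i].latency_cycles < pf[best].latency_cycles)
                best = i;
        printf("send prefetched line %u (latency %u)\n",
               pf[best].line, pf[best].latency_cycles);
        pf[best].latency_cycles = ~0u;   /* mark entry as already sent */
    }
}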
8. The memory controller according to any one of claims 5-7, wherein the request message comprises a first request message and a second request message; the first request message requests data corresponding to a first target cache line and data corresponding to a first prefetched cache line associated with the first target cache line, and the second request message requests data corresponding to a second target cache line and data corresponding to a second prefetched cache line associated with the second target cache line; and
if the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are located in the same row of the same memory bank in the memory, after the first decision maker sends the data corresponding to the first target cache line, the second decision maker preferentially sends the data corresponding to the first prefetched cache line; the first decision maker then sends the data corresponding to the second target cache line, and the second decision maker sends the data corresponding to the second prefetched cache line.
9. The memory controller according to any one of claims 5-7, wherein the request message comprises a first request message and a second request message; the first request message requests data corresponding to a first target cache line and data corresponding to a first prefetched cache line associated with the first target cache line, and the second request message requests data corresponding to a second target cache line and data corresponding to a second prefetched cache line associated with the second target cache line; and
if the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are not located in the same row of the same memory bank in the memory, the first decision maker first sends the data corresponding to the first target cache line and the data corresponding to the second target cache line, and the second decision maker then sends the data corresponding to the first prefetched cache line and the data corresponding to the second prefetched cache line.
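Claims 8 and 9 are the two branches of one row-buffer-aware rule: a prefetched line that hits the open row of its target is drained immediately, saving a precharge/activate cycle; otherwise both demand lines go ahead of all prefetched lines. A C sketch under assumed row/bank bit positions (row = addr >> 16, bank = bits 13-15), which are illustrative only:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t target; uint64_t prefetch; } req_t;

static bool same_row_same_bank(uint64_t a, uint64_t b)
{
    return (a >> 16) == (b >> 16) &&            /* assumed row bits  */
           ((a >> 13) & 7) == ((b >> 13) & 7);  /* assumed bank bits */
}

void schedule_two(const req_t *r1, const req_t *r2)
{
    if (same_row_same_bank(r1->target, r1->prefetch)) {
        /* Row-buffer hit: drain request 1 completely while its row
         * is open, then serve request 2 (claim 8). */
        printf("send 0x%llx (target 1)\n",   (unsigned long long)r1->target);
        printf("send 0x%llx (prefetch 1)\n", (unsigned long long)r1->prefetch);
        printf("send 0x%llx (target 2)\n",   (unsigned long long)r2->target);
        printf("send 0x%llx (prefetch 2)\n", (unsigned long long)r2->prefetch);
    } else {
        /* No row hit: both demand (target) lines go first (claim 9). */
        printf("send 0x%llx (target 1)\n",   (unsigned long long)r1->target);
        printf("send 0x%llx (target 2)\n",   (unsigned long long)r2->target);
        printf("send 0x%llx (prefetch 1)\n", (unsigned long long)r1->prefetch);
        printf("send 0x%llx (prefetch 2)\n", (unsigned long long)r2->prefetch);
    }
}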
10. The memory controller according to any one of claims 5-9, wherein, when the first decision maker or the second decision maker sends data located in the same row in the memory, the data in the same row is sent in any one of the following orders:
sending the data in the same row from a low address to a high address; or
sending the data in the same row from a high address to a low address; or
sending the data in the same row toward high and low addresses, centered on the data corresponding to the target cache line; or
sending the data in the same row in a random order.
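The four within-row transmission orders of claim 10 reduce to four index sequences over the cache lines of one row. A C sketch, assuming at most 64 lines per row and using rand() as a stand-in for whatever randomization the hardware would use; the center-out order is essentially critical-word-first generalized to whole cache lines.

#include <stdio.h>
#include <stdlib.h>

/* Low address to high address. */
void order_low_to_high(unsigned n) {
    for (unsigned i = 0; i < n; i++) printf("%u ", i);
    printf("\n");
}

/* High address to low address. */
void order_high_to_low(unsigned n) {
    for (unsigned i = n; i-- > 0; ) printf("%u ", i);
    printf("\n");
}

/* Centered on the target line t, alternating toward high and low. */
void order_center_out(unsigned n, unsigned t) {
    printf("%u ", t);
    for (unsigned d = 1; d < n; d++) {
        if (t + d < n) printf("%u ", t + d);
        if (t >= d)    printf("%u ", t - d);
    }
    printf("\n");
}

/* Random order (Fisher-Yates shuffle over the line indices). */
void order_random(unsigned n) {
    unsigned idx[64];
    if (n > 64) n = 64;   /* assumed cap: 64 lines per row */
    for (unsigned i = 0; i < n; i++) idx[i] = i;
    for (unsigned i = n; i-- > 1; ) {
        unsigned j = (unsigned)rand() % (i + 1);
        unsigned tmp = idx[i]; idx[i] = idx[j]; idx[j] = tmp;
    }
    for (unsigned i = 0; i < n; i++) printf("%u ", idx[i]);
    printf("\n");
}

For a 4-line row with the target at index 1, order_center_out(4, 1) yields 1 2 0 3.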
11. A system on chip, comprising the processor according to any one of claims 1-4, the memory controller according to any one of claims 5-10, and a memory.
12. A data prefetching method, comprising:
obtaining a page access density at which a thread accesses at least one physical page, wherein the page access density refers to a ratio of cache lines accessed by the thread to all cache lines in a single physical page within a preset time period;
when the thread accesses data corresponding to a target cache line of a target physical page and the page access density of the at least one physical page satisfies a first condition, sending a request message to a memory controller, wherein the request message requests data corresponding to a plurality of cache lines in a memory, and the plurality of cache lines comprise the target cache line and a prefetched cache line associated with the target cache line in the target physical page; and
receiving the data corresponding to the plurality of cache lines from the memory controller and storing the data in a cache.
13. The method according to claim 12, wherein the page access density of the at least one physical page satisfying the first condition comprises:
the average value of the page access densities of the at least one physical page being greater than a first threshold, or the maximum value of the page access densities of the at least one physical page being greater than a second threshold, or each of the page access densities of the at least one physical page being greater than a third threshold.
14. The method according to claim 12 or 13, wherein receiving the data corresponding to the plurality of cache lines from the memory controller comprises:
receiving the data corresponding to the target cache line from the memory controller first, and then receiving the data corresponding to the prefetched cache line from the memory controller.
15. A data prefetching method, comprising:
receiving a request message from a processor, wherein the request message requests data corresponding to a plurality of cache lines in a memory, and the plurality of cache lines comprise a target cache line in a physical page accessed by the processor and a prefetched cache line associated with the target cache line; and
sending the data corresponding to the target cache line and the data corresponding to the prefetched cache line to the processor.
16. The method according to claim 15, wherein sending the data corresponding to the target cache line and the data corresponding to the prefetched cache line to the processor comprises:
for the target cache line and the prefetched cache line corresponding to the same request message, first sending the data corresponding to the target cache line, and then sending the data corresponding to the prefetched cache line.
17. The method according to claim 15 or 16, further comprising:
if there are a plurality of prefetched cache lines, preferentially sending the data corresponding to the prefetched cache line with the lowest latency.
18. The method according to any one of claims 15-17, wherein the request message comprises a first request message and a second request message; the first request message requests data corresponding to a first target cache line and data corresponding to a first prefetched cache line associated with the first target cache line, and the second request message requests data corresponding to a second target cache line and data corresponding to a second prefetched cache line associated with the second target cache line; and sending the data corresponding to the target cache line and the data corresponding to the prefetched cache line to the processor comprises:
if the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are located in the same row of the same memory bank in the memory, after sending the data corresponding to the first target cache line, preferentially sending the data corresponding to the first prefetched cache line, then sending the data corresponding to the second target cache line, and then sending the data corresponding to the second prefetched cache line.
19. The method according to any one of claims 15-17, wherein the request message comprises a first request message and a second request message; the first request message requests data corresponding to a first target cache line and data corresponding to a first prefetched cache line associated with the first target cache line, and the second request message requests data corresponding to a second target cache line and data corresponding to a second prefetched cache line associated with the second target cache line; and sending the data corresponding to the target cache line and the data corresponding to the prefetched cache line to the processor comprises:
if the data corresponding to the first prefetched cache line and the data corresponding to the first target cache line are not located in the same row of the same memory bank in the memory, first sending the data corresponding to the first target cache line and the data corresponding to the second target cache line, and then sending the data corresponding to the first prefetched cache line and the data corresponding to the second prefetched cache line.
20. The method according to any one of claims 15-19, further comprising: when data located in the same row in the memory is sent, sending the data in the same row in any one of the following orders:
sending the data in the same row from a low address to a high address; or sending the data in the same row from a high address to a low address; or sending the data in the same row toward high and low addresses, centered on the data corresponding to the target cache line; or sending the data in the same row in a random order.
CN202111290828.1A 2021-11-02 2021-11-02 Processor, memory controller, system-on-chip and data prefetching method Pending CN116069719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111290828.1A 2021-11-02 2021-11-02 Processor, memory controller, system-on-chip and data prefetching method

Publications (1)

Publication Number Publication Date
CN116069719A (en) 2023-05-05

Family

ID=86175659



Legal Events

Date Code Title Description
PB01 Publication