CN116680214A - Data access method, readable storage medium and electronic equipment - Google Patents


Info

Publication number
CN116680214A
CN116680214A
Authority
CN
China
Prior art keywords
cache
cxl
target data
data
processor
Prior art date
Legal status: Pending (assumed; not a legal conclusion)
Application number
CN202310532247.7A
Other languages
Chinese (zh)
Inventor
陈健
黄涛
Current Assignee: Alibaba China Co Ltd
Original Assignee: Alibaba China Co Ltd
Application filed by Alibaba China Co Ltd
Priority to CN202310532247.7A
Publication of CN116680214A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14: Handling requests for interconnection or transfer
    • G06F 13/16: Handling requests for interconnection or transfer for access to memory bus
    • G06F 13/1668: Details of memory controller
    • G06F 13/1673: Details of memory controller using buffers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 13/00: Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/38: Information transfer, e.g. on bus
    • G06F 13/42: Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F 13/4282: Bus transfer protocol on a serial bus, e.g. I2C bus, SPI bus
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2213/00: Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 2213/0026: PCI express
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

This specification provides a data access method, a readable storage medium, and an electronic device. The method includes: in response to a cache data read request from a processor core for target data, predicting whether the target data exists in a target processor cache corresponding to the processor core; and, when the prediction result indicates that the target data does not exist in the target processor cache, sending a data prefetch request for the target data to a Compute Express Link (CXL) device corresponding to the processor core, so that the target data is stored in the cache corresponding to the CXL device and the processor core can obtain the target data from that cache.

Description

Data access method, readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data access method, a readable storage medium, and an electronic device.
Background
Compute Express Link (CXL) is an emerging interconnect standard. Compared with the conventional Peripheral Component Interconnect Express (PCI-Express, or PCIe) high-speed serial expansion bus standard, CXL maintains coherence between the memory space of the processor (Central Processing Unit, CPU) and the memory of a connected device more effectively, thereby improving the speed at which the CPU accesses that device's memory. However, the CXL interfaces and CXL switches introduced by the technology add significant extra latency to this data access path, limiting the further application and development of CXL.
Disclosure of Invention
In view of the above, this specification provides a data access method, a readable storage medium, and an electronic device to address the shortcomings of the related art.
Specifically, this specification is implemented through the following technical solutions:
according to a first aspect of embodiments of the present specification, there is provided a data access method; the method comprises the following steps:
responding to a cache data reading request of a processor core aiming at target data, and predicting whether the target data exists in a target processor cache corresponding to the processor core;
and under the condition that the prediction result shows that the target data does not exist in the target processor cache, sending a data prefetching request for the target data to a computing quick link CXL device corresponding to the processor core, so that the target data is stored in the cache corresponding to the CXL device, and the processor core can acquire the target data from the cache corresponding to the CXL device.
According to a second aspect of the embodiments of this specification, a data access method is provided. The method includes:
generating a cache data read request for target data, where the request instructs a cache predictor corresponding to a processor core to send, upon predicting that the target data does not exist in a target processor cache corresponding to the processor core, a data prefetch request for the target data to a Compute Express Link (CXL) device corresponding to the processor core, so that the target data is stored in the cache corresponding to the CXL device;
and obtaining the target data from the cache corresponding to the CXL device.
According to a third aspect of the embodiments of this specification, a data access system is provided. The system includes a cache predictor and a processor core, where:
the cache predictor is configured to implement the steps of the method according to the first aspect;
the processor core is configured to implement the steps of the method according to the second aspect.
According to a fourth aspect of embodiments of the present specification, there is provided a processor comprising a data access system as described in the third aspect.
According to a fifth aspect of embodiments of the present specification, there is provided an electronic device comprising a data access system as described in the third aspect or a processor as described in the fourth aspect.
According to a sixth aspect of embodiments of the present description, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method according to the first or second aspect.
According to a seventh aspect of embodiments of the present specification, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first or second aspect when the program is executed.
In the technical solution provided by this specification, the cache predictor predicts whether the target data exists in the target processor cache, and when the prediction result indicates that it does not, a data prefetch request for the target data is sent to the CXL device so that the target data is stored in the cache corresponding to the CXL device. Thus, once the processor core fails to find the target data in the target processor cache, it can quickly obtain the data from the cache corresponding to the CXL device, avoiding the wait while the CXL device queries its memory for the target data and thereby improving the speed of CPU access to the CXL device.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
To describe the technical solutions of the embodiments of this specification or of the prior art more clearly, the drawings required by the embodiments are briefly introduced below. Obviously, the drawings described below show only some of the embodiments of this specification; a person of ordinary skill in the art may derive other drawings from them.
FIG. 1 is a schematic diagram of an architecture of a data access system according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating a method of data access according to an exemplary embodiment of the present disclosure;
FIG. 3a is a schematic diagram illustrating a data interaction of a processor core with a CXL device according to an exemplary embodiment of the present disclosure;
FIG. 3b is a schematic diagram illustrating another processor core data interaction with a CXL device according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a counting bloom filter according to an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a data access method implemented under a multi-core system according to an exemplary embodiment of the present disclosure;
FIG. 6 is a flow chart illustrating yet another data access method according to an exemplary embodiment of the present disclosure;
FIG. 7 is a flow chart illustrating yet another data access method according to an exemplary embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of an electronic device shown in an exemplary embodiment of the present disclosure;
fig. 9 is a schematic structural view of a data access apparatus according to an exemplary embodiment of the present specification;
fig. 10 is a schematic structural view of another data access apparatus according to an exemplary embodiment of the present specification.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present description.
It should be noted that in other embodiments, the steps of the corresponding method are not necessarily performed in the order shown and described in this specification, and the method may include more or fewer steps than described here. Furthermore, an individual step described in this specification may, in other embodiments, be split into multiple steps, and multiple steps described here may be combined into a single step in other embodiments. It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, such information should not be limited by these terms, which serve only to distinguish one type of information from another. For example, first information may also be referred to as second information and, similarly, second information as first information, without departing from the scope of this specification. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
The user information (including but not limited to user device information and personal information) and data (including but not limited to data for analysis, stored data, and displayed data) referred to in this specification are information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to choose to authorize or refuse.
In the related art, CXL provides unified, coherent device memory access by dynamically multiplexing three protocols (CXL.io, CXL.cache, and CXL.mem) on the PCIe Gen5 physical layer. Any device that connects via CXL to a host containing a CPU and main memory (which can be regarded as the host's memory or cache) may be called a CXL device. The combination of protocols multiplexed may vary with the usage scenario of the CXL device. For example, when a user wants to access and cache data in host memory through a CXL device, the device can use the CXL.io protocol so that the CPU can discover and configure it, and the CXL.cache protocol so that the device can access host memory. Alternatively, when the CXL device is configured with memory such as Double Data Rate synchronous dynamic random-access memory (DDR SDRAM) or High Bandwidth Memory (HBM) and the user wants the host and the device to interact, the CXL.mem protocol can be used on top of the CXL.io and CXL.cache protocols to allow the host CPU to access the device's memory. Or, when the CXL device carries such memory and the user wants the host to access it with load/store instructions, the device can connect to the host using the CXL.io and CXL.mem protocols.
However, in the scenario where the host accesses CXL device memory, although the latency of the CXL access mode (the host accessing device memory via CXL) is much lower than that of a PCIe access mode based on, for example, memory-mapped I/O (MMIO) or Direct Memory Access (DMA), the access path of the CXL mode involves at least the CXL root port, the CXL device port, and the memory controller of the CXL device, whereas the path for accessing local memory essentially involves only the main-memory controller. The latency of the former is therefore generally higher than that of the latter and, depending on the application being executed, can cause a performance degradation of 5-40%, which hinders large-scale adoption of CXL devices for data storage.
Accordingly, the present specification proposes the following technical means to solve the above-described problems.
Fig. 1 is a schematic architecture diagram of a data access system according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the system may include a processor core 11, a cache predictor 12, a CXL device 13, and a prefetch cache 14.
The processor core 11 is the core portion of the CPU. During system operation, the processor core may generate a cache data read request for target data according to an input instruction from a user or an instruction from an application program, and obtain the corresponding target data from the prefetch cache via the cache predictor. The number of processor cores may affect how modules such as the cache predictor are deployed; this is described in detail later and is not repeated here.
The cache predictor 12 may be used to predict whether target data hits in the CPU's multi-level processor cache. During system operation, the cache predictor can send a data prefetch request for the target data to the CXL device according to the prediction, so that the CXL device performs the subsequent operations. The cache predictor may be a software module at the logical level, for example implemented by the processor core executing preset code; alternatively, it may be a hardware module, for example placed in the CPU together with the processor core, or a separate hardware device connected to the CPU. This specification does not limit this.
The CXL device 13 may be a storage device that adopts CXL technology and includes a memory (e.g., DDR memory) and a corresponding memory controller; the CXL device may support byte-addressable memory access through a corresponding CXL interface. During system operation, the CXL device may receive a data prefetch request from the cache predictor, ensure that the target data required by the processor core is stored in the prefetch cache, and return the corresponding target data to the processor core. When the processor core needs to obtain data in the CXL device's memory, the device may, as described above, connect to the host of the processor core using the CXL.io protocol. Of course, multiple CXL devices may also be deployed according to actual storage requirements and interact with the host through a corresponding CXL switch; this is described in detail later and is not repeated here.
The prefetch cache 14 is one example of the "cache corresponding to the CXL device" in this specification, and may be dedicated to pre-storing data that the processor core may need to access. Of course, the cache corresponding to the CXL device may take forms other than the prefetch cache 14, which this specification does not limit: in practice, any cache accessible to both the CXL device and the processor core can serve as the cache corresponding to the CXL device, pre-storing data the processor core may need and thereby improving its data access efficiency. Taking the prefetch cache 14 as an example, its storage space is separate from the CXL device's memory, and it may be placed, according to actual storage requirements, in the CXL device itself, in another CXL device, or in a CXL switch connected to the CXL device; this specification does not limit the placement. During system operation, the CXL device can pre-store target data from its memory into the prefetch cache, whose read efficiency is higher, so that the host can obtain the target data more quickly, improving the host's access speed to data in the CXL device's memory.
The technical solution of this specification is explained below with reference to the embodiment shown in FIG. 2. FIG. 2 is a flow chart of a data access method according to an exemplary embodiment of the present disclosure. As shown in FIG. 2, the method is applied to a cache predictor corresponding to a processor core and may include the following steps:
s201, responding to a cache data reading request of a processor core aiming at target data, and predicting whether the target data exists in a target processor cache corresponding to the processor core.
When the processor core needs to obtain the target data, it may generate a cache data read request for that data, so that the cache predictor can respond to the request and predict whether the target data exists in the target processor cache corresponding to the processor core. The processor core may also query the target processor cache for the target data while generating the cache data read request; in other words, the processor core's query of the target processor cache can run concurrently with the cache predictor's prediction, improving the overall efficiency of obtaining the target data.
The processor core may correspond to a multi-level processor cache, and the timing at which the cache data read request is initiated, as well as which caches constitute the target processor cache, can be adjusted and optimized according to the query status of each cache level.
In an embodiment, the processor core may initiate the cache data read request when it fails to find the target data in part of the multi-level processor cache, and the target processor cache includes the levels that the processor core has not yet queried for the target data. Taking the CPU shown in fig. 3a or fig. 3b as an example, assume the processor core corresponds to a three-level processor cache (the capacities of the first-level cache L1, the second-level cache L2, and the third-level cache L3 increase in sequence). When obtaining the target data, the processor core reads the caches in order: it first reads L1 and queries the target data; if L1 misses, it reads L2; if L2 also misses, it reads the next level, the third-level cache L3, which is the Last Level Cache (LLC); and if the LLC also misses, the data is read from the memory of the CXL device through the CXL.mem protocol. Here, the "part of the processor cache" may be set to L1, in which case the target processor cache is determined to be L2 and L3: the processor core initiates the cache data read request on an L1 miss, and the cache predictor only needs to predict whether the target data hits in L2 or L3, without caring whether it hits in L1.
In this embodiment, because the target processor cache is defined as the levels of the multi-level cache that the processor core has not yet queried, and the cache data read request is initiated only after the target data is missed in part of the cache, the processor core must query at least one level starting from the lowest level (i.e., L1), and the queried levels must not contain the target data; otherwise the cache data read request is not initiated (since in that case the target data has already hit in part of the processor cache) and the cache predictor performs no prediction. Compared with having the cache predictor predict all cache levels, this hierarchical division of the multi-level cache filters out the temporal and spatial locality caused by the low capacity and low associativity of a low-level cache such as L1, reduces the frequency of hash collisions in the cache predictor, and improves prediction accuracy.
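For illustration only, the staged lookup described above can be sketched as follows. The cache objects, the `may_hit`/`prefetch`/`read` method names, and the sequential control flow are assumptions for the sketch (in hardware, the predictor is consulted in parallel with the L2/LLC lookups, not after them):

```python
def read_target_data(addr, l1, l2, llc, predictor, cxl_device):
    """Hypothetical L1 -> L2 -> LLC -> CXL lookup with a cache predictor.

    `l1`, `l2`, `llc` are dict-like stand-ins for the cache levels;
    `predictor` and `cxl_device` are stand-ins for the cache predictor
    and the CXL device with its prefetch cache.
    """
    if addr in l1:                       # L1 hit: predictor is never involved
        return l1[addr], "L1"
    # L1 miss: the predictor is consulted; on a predicted L2/LLC miss,
    # a data prefetch request warms the CXL-side prefetch cache early.
    if not predictor.may_hit(addr):
        cxl_device.prefetch(addr)
    if addr in l2:
        return l2[addr], "L2"
    if addr in llc:
        return llc[addr], "LLC"
    return cxl_device.read(addr), "CXL"  # served from the (pre-warmed) device side
```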
The cache predictor may also selectively predict only part of the target data, according to the specific type of the target data.
In an embodiment, the cache predictor may predict whether the target processor cache contains the target data only after determining that the storage address of the target data falls within the address range corresponding to the CXL device. The point of checking the storage address is to determine whether the target data comes from the CXL device; in other words, if the target data comes from the memory of the host where the processor core is located, the cache predictor does not need to predict it. The check of whether the storage address falls within the CXL device's address range may be performed by an independent address filter. In the CPU shown in fig. 3a or fig. 3b, the address filter sits in the CPU and is connected to the processor core and the cache predictor respectively. It can monitor all address streams that miss the L1 cache, specifically those that miss the L1 data cache (D-Cache); the corresponding instruction cache (I-Cache) holds instructions rather than data, so the address filter only needs to screen part of the L1 traffic, which further reduces its processing load and improves screening efficiency. The filter then forwards only the target data whose addresses lie within the CXL device's address range to the cache predictor for prediction, raising the predictor's effective prediction rate and avoiding unnecessary consumption of its computing resources.
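The address filter's check amounts to a range test against the physical address windows mapped to CXL device memory. A minimal sketch, in which the window bounds are purely hypothetical values:

```python
# Hypothetical physical address windows backed by CXL device memory,
# as half-open [start, end) ranges. Real systems learn these from the
# platform's memory map; these values are illustrative only.
CXL_ADDRESS_RANGES = [(0x1_0000_0000, 0x2_0000_0000)]

def is_cxl_address(addr):
    """Return True if an L1-missed address falls in a CXL window,
    i.e. the miss should be forwarded to the cache predictor."""
    return any(start <= addr < end for start, end in CXL_ADDRESS_RANGES)
```

Addresses that fail this test belong to host memory and bypass the predictor entirely.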
The cache predictor may predict whether the target data exists in the target processor cache based on a counting bloom filter. Taking fig. 4 as an example, the cache predictor may generate multiple hash values from the storage address of the input target data using multiple different preset hash functions; the resulting counters may be stored in Static Random-Access Memory (SRAM) arrays of a preset bit width, with as many arrays as there are hash functions. For example, fig. 4 shows 4 different hash functions H(n1) to H(n4) and 4 arrays whose elements are 8 bits wide. Specifically, after the predictor receives the target address "0x0000FFFF", the outputs of the hash functions for that address index into the corresponding arrays (for example, the 4 hash values of the target address "0x0000FFFF" are "00000001", "00000010", "00000011", and "00000100").
The entries selected across the arrays are then input as a whole to a zero detector, which checks the count at each indexed entry. If the count in any array equals zero (for example, the count at one indexed entry is 0 while the others are 1), the LLC cannot hit the target data corresponding to that entry, so the cache predictor must send the data prefetch request described below so that the CPU performs an off-chip access to the CXL device. If none of the entries equals zero, the L2 cache or the LLC may hit the target data. Of course, hash functions suffer from hash collisions (for example, the storage addresses "0x0000FFFF" and "0x0000AAAA" of different target data may produce the same hash values through the hash functions), so the LLC may still miss the target data even when no entry is zero, requiring the CPU to obtain it from the CXL device; this specification provides corresponding measures for that case as well. When the CPU obtains the target data from the CXL device memory, the count update engine in the figure increments the counts of the corresponding entries to update the cache predictor (e.g., incrementing by 1 the count values of the 4 entries indexed by "01100001", "01100010", "01100011", and "01100100" for the target address "0x0000AAAA"); conversely, when the cache line corresponding to target data is evicted from the LLC, the cache predictor decrements the counts of the corresponding entries via the count update engine.
In addition, compared with approaches that train on the processor core's context to predict whether target data is in the processor cache, the counting bloom filter in this embodiment requires no additional training phase and no retraining after each context switch, which effectively improves prediction accuracy.
Those skilled in the art will appreciate that the number of hash functions and the length of the bloom filter (i.e., the bit width of each element and the number of entries in each array) may be adjusted according to the actual situation, reducing the false-positive rate while avoiding wasted space; this specification does not limit these parameters.
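To make the mechanism concrete, the counting bloom filter described above can be sketched as follows. The class name, the seed-based hash mixing, and the default sizes are assumptions for the sketch, not details taken from this specification:

```python
class CountingBloomPredictor:
    """Sketch of a counting bloom filter tracking which cache lines
    (by storage address) are present in the tracked cache levels."""

    def __init__(self, num_arrays=4, array_size=256, counter_bits=8):
        self.num_arrays = num_arrays
        self.array_size = array_size
        self.max_count = (1 << counter_bits) - 1  # saturating 8-bit counters
        self.arrays = [[0] * array_size for _ in range(num_arrays)]

    def _indices(self, addr):
        # Stand-in for the preset hash functions: one index per array,
        # mixing the address with a different seed for each function.
        return [hash((seed, addr)) % self.array_size
                for seed in range(self.num_arrays)]

    def insert(self, addr):
        # Called by the count update engine when a line for `addr`
        # is filled into the tracked caches.
        for arr, i in zip(self.arrays, self._indices(addr)):
            if arr[i] < self.max_count:
                arr[i] += 1

    def remove(self, addr):
        # Called when the corresponding cache line is evicted from the LLC.
        for arr, i in zip(self.arrays, self._indices(addr)):
            if arr[i] > 0:
                arr[i] -= 1

    def may_hit(self, addr):
        # Zero-detector logic: if any counter is zero, the line is
        # definitely absent; otherwise it *may* be present
        # (false positives are possible, false negatives are not).
        return all(arr[i] > 0
                   for arr, i in zip(self.arrays, self._indices(addr)))
```

The counting (rather than single-bit) entries are what make eviction-time decrements possible without rebuilding the filter.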
The CPUs involved in the above embodiments may be based on a single-core or a multi-core system. Taking fig. 3a or fig. 3b as an example, although the CPU in the figure is implemented on a single-core system, the same technical concept can be replicated in the multi-core systems of this specification.
The implementation of the data access method in a multi-core system is explained below with reference to fig. 5. As shown in fig. 5, a multi-core CPU system may be designed on, for example, a tile (module) architecture, where each tile connects to a corresponding Network-on-Chip (NoC). The local memory controller for the host memory and the CXL root port that connects to the CXL device may occupy different modules in the mesh formed by the NoC; for example, in the figure they are placed in the lower-left and lower-right corners of the NoC mesh according to the usage scenario, while each module elsewhere consists of a processor core, a bank or slice of the LLC, the CXL address filter described above, and a cache predictor. Specifically, the LLC slice may connect to the L2 cache through the NoC router of the mesh, and the CXL address filters and cache predictors of different modules likewise connect through NoC routers. In contrast to a single-core system, the cache predictor in such a module-based multi-core system is responsible only for predicting whether the cache line corresponding to the target data exists in the LLC slice of its own module. Because the CXL address range is also divided among the modules, the predictor in each module can generally be smaller than in a single-core system, reducing its footprint in each processor core's module.
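As a hypothetical sketch of the division of the CXL address range among modules mentioned above, a simple cache-line-interleaved mapping could look like this; the interleaving scheme, module count, and line size are illustrative assumptions, not the specification's design:

```python
NUM_MODULES = 4   # assumed number of tiles/modules in the NoC mesh
CACHE_LINE = 64   # assumed cache-line size in bytes

def home_module(addr):
    """Map a storage address to the module whose LLC slice and cache
    predictor are responsible for it (line-interleaved scheme)."""
    return (addr // CACHE_LINE) % NUM_MODULES
```

Under such a scheme, each module's predictor only ever sees 1/NUM_MODULES of the CXL address space, which is why its arrays can be smaller than a single-core predictor's.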
S202, when the prediction result indicates that the target data does not exist in the target processor cache, sending a data prefetch request for the target data to the Compute Express Link (CXL) device corresponding to the processor core, so that the target data is stored in the cache corresponding to the CXL device and the processor core can acquire the target data from that cache.
In the case where the cache predictor predicts that the target data is not present in the target processor cache (i.e., the LLC will miss the target data), a data prefetch request for the target data may be sent to the CXL device corresponding to the processor core, so that the CXL device ensures the target data is stored in the cache corresponding to the CXL device (e.g., the prefetch cache) for the processor core to acquire from that cache. The data prefetch request may carry the storage address of the target data, so that the CXL device can locate the target data in the prefetch cache or in the corresponding memory. In addition, the prefetch cache may preferentially be built on SRAM, providing shorter access latency, and it may work at the cache-line granularity (or multi-cache-line granularity) of the processor cache. This memory type and processing granularity are better suited to CXL device memory accesses than a conventional cache system that buffers data at 4 KB page granularity based on dynamic random access memory (Dynamic Random Access Memory, DRAM).
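A cache-line-granularity prefetch buffer of the kind described above can be sketched as a small address-indexed store. The capacity, the FIFO-style eviction, and the dict-based storage are illustrative assumptions; a real SRAM buffer would use fixed-size ways and a hardware replacement policy.

```python
# Sketch of a prefetch buffer working at cache-line granularity (not 4 KB
# pages); capacity and eviction policy are assumptions for illustration.
from collections import OrderedDict

CACHE_LINE_BYTES = 64

class PrefetchBuffer:
    def __init__(self, capacity_lines: int = 1024):
        self.lines = OrderedDict()   # aligned line address -> line data
        self.capacity = capacity_lines

    def _line_addr(self, addr: int) -> int:
        """Align an address down to its cache-line boundary."""
        return addr & ~(CACHE_LINE_BYTES - 1)

    def contains(self, addr: int) -> bool:
        return self._line_addr(addr) in self.lines

    def fill(self, addr: int, data: bytes) -> None:
        """Store one line; evict the oldest line when over capacity."""
        la = self._line_addr(addr)
        self.lines[la] = data
        self.lines.move_to_end(la)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)

    def read(self, addr: int) -> bytes:
        return self.lines[self._line_addr(addr)]
```

Keying the buffer by aligned line address means any byte address within a prefetched line hits, matching the processor cache's own granularity.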
As previously described, the prefetch cache may be disposed, according to actual storage requirements, in the CXL device, in another CXL device distinct from that CXL device, or in a CXL switch connected to the CXL device (a design choice driven by factors such as where data prefetch requests are processed, how much CXL device bandwidth prefetching the target data is expected to consume, and how much latency reduction is desired). This also shows that the CXL device corresponding to the prefetch cache can be coupled to the processor core in a variety of ways.
In one embodiment, the prefetch cache may be provided in the CXL device. Taking fig. 3a as an example, the CXL device in the figure is a cxl.mem device from which the processor core obtains target data based on the cxl.mem protocol; the CXL port of the cxl.mem device may be used to send the target data in the corresponding memory to the CXL root port of the processor core, and the prefetch cache may store at least a portion of that memory so that the target data can be returned to the processor core faster.
In another embodiment, the prefetch cache may be provided in a CXL switch. As illustrated in fig. 3b, the CXL switch may be connected to the CXL root port of the processor, to n cxl.mem devices (n ≥ 1), or to other CXL switches. Compared with the previous embodiment, the prefetch cache of this embodiment is not disposed in a cxl.mem device but in the CXL switch connected to a plurality of cxl.mem devices, which means the prefetch cache in the CXL switch can be shared by those cxl.mem devices. The prefetch cache can be connected to the arbitration and multiplexing module in the CXL switch, thereby avoiding collisions that may occur when multiple cxl.mem devices transmit target data to the CXL switch and improving the utilization of the channels between the cxl.mem devices and the switch.
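The arbitration between the n cxl.mem devices sharing the switch-resident prefetch cache can be illustrated with a simple round-robin arbiter. This is only a sketch of the idea: a real CXL switch's arbitration and multiplexing module is far more involved, and the port-set interface here is an assumption.

```python
# Illustrative round-robin arbiter for the prefetch cache shared by n
# cxl.mem ports in a CXL switch (interface and policy are assumptions).
from collections import deque

class RoundRobinArbiter:
    def __init__(self, num_ports: int):
        # Ports are granted in rotating order so no device starves.
        self.order = deque(range(num_ports))

    def grant(self, requesting_ports):
        """Return the next requesting port in round-robin order, or None."""
        for _ in range(len(self.order)):
            port = self.order[0]
            self.order.rotate(-1)   # move the considered port to the back
            if port in requesting_ports:
                return port
        return None
```

Rotating the queue on every considered port keeps the grant order fair even when the same subset of devices requests repeatedly.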
It should be noted that, compared with a traditional CPU prefetching scheme, the present application of the prefetch cache introduces an independent prefetch cache in or near the CXL device (e.g., in a CXL switch), so that data in the prefetch cache does not pollute data in the LLC and the bandwidth of the CXL device itself is not wasted. More importantly, in connection with the above embodiment, the process of storing data into the prefetch cache is coordinated at the hardware level: the prefetch cache stores target data only when the cache predictor predicts that the target data is not present in the target processor cache. In a traditional CPU prefetching scheme, the hardware prefetcher usually determines the storage address of the target data in an address-sequence manner, predicting the next address from the past address sequence; this works well only when the storage addresses of the target data follow a certain pattern. In most cases, therefore, the prediction achieved by the address-sequence approach does not meet actual usage requirements, with the result that memory bandwidth of the CXL device is consumed by uselessly storing target data into the prefetch cache, and other data in the LLC is subsequently polluted.
When the CXL device receives the data prefetch request, it may judge whether the target data exists in the prefetch cache. Specifically, when the CXL device determines that the target data does not exist in the prefetch cache, it may look up the target data in the memory of the CXL device and store it into the prefetch cache. In the case where the CXL device determines that the target data already exists in the prefetch cache, the CXL device has previously received a data prefetch request for the same target data and has already stored the corresponding target data from memory into the prefetch cache, so processing of the current data prefetch request can be terminated.
Of course, multiple situations may arise in an actual scenario. For example, on the premise that the LLC misses the target data, the processor core may send a CXL data read request for the target data to the CXL device. If, after receiving the CXL data read request, the CXL device determines that there is no outstanding data prefetch request for the target data and the target data is present in the prefetch cache, the CXL device may directly read and return the target data from the prefetch cache. Alternatively, if the CXL device has received a data prefetch request for the target data but that request has not yet completed, the CXL device may wait for the data prefetch request to finish executing and return data based on its response. Or, if the CXL device has received no data prefetch request for the target data and the target data is not in the prefetch cache, it may query and return the target data from the memory corresponding to the CXL device.
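The three cases above can be sketched as a small dispatch function. The dict-based stand-ins for the prefetch cache, the device memory, and the set tracking in-flight prefetches are illustrative assumptions, not the patent's concrete structures.

```python
# Sketch of how the CXL device might dispatch a CXL data read request
# across the three cases in the text (all data structures are assumptions).

def handle_cxl_read(addr, prefetch_cache, in_flight_prefetches, device_memory):
    if addr in in_flight_prefetches:
        # Case 2: a prefetch for this address is outstanding; wait for it
        # to complete and answer from its response.
        return "wait"
    if addr in prefetch_cache:
        # Case 1: no outstanding prefetch and the data is already cached;
        # return directly from the prefetch cache.
        return prefetch_cache[addr]
    # Case 3: no prefetch was ever received and the cache misses;
    # fall back to the CXL device's own memory.
    return device_memory[addr]
```

In hardware the "wait" branch would park the request until the matching prefetch response arrives, rather than returning a sentinel.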
To further illustrate the execution of the processor core, cache predictor, CXL device, and prefetch cache in the various scenarios described above, the present description will proceed in the following steps in conjunction with FIG. 6:
S601, the processor core issues a cache read request for the target data.
In one embodiment, the processor core may generate and issue a cache read request for the target data for obtaining the target data, where the cache read request may carry the storage address X of the target data.
S602, judging whether the target data hits the L1 cache.
In one embodiment, to obtain the target data in the shortest time, the processor core may first attempt to query the first-level cache L1 for the target data. If the query succeeds (i.e., the target data hits L1), S601 may be re-executed for new target data; if not, S611 and S621 may be performed simultaneously.
S611, the address filter judges that the storage address of the target data is in the storage address range corresponding to the CXL device.
In an embodiment, an address filter may be disposed in the processor to monitor the stream of addresses that miss L1 and determine whether each address in the stream is within the storage address range corresponding to the CXL device. If not, S622 may be executed; if so, S612 may be executed.
S612, the address filter sends a cache data reading request to the cache predictor.
In one embodiment, the address filter may send a cache data read request including a storage address of target data to the cache predictor when it determines that there is a miss L1 and the target data is in a storage address range corresponding to the CXL device.
S613, the cache predictor predicts whether the target data exists in the target processor cache.
In one embodiment, after the cache predictor receives the cache data read request, it may predict whether the target data exists in the target processor cache based on the counter bloom filter, if not, S614 may be performed, and if so, S622 may be performed.
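The counter bloom filter used by the predictor can be sketched as below. The counter array size, the number of hash functions, and the use of BLAKE2 as the hash are illustrative assumptions; the essential property is that counters (rather than single bits) allow entries to be removed when a line is evicted from the LLC, while lookups never produce false negatives.

```python
# Minimal counting Bloom filter sketch for the "counter bloom filter"
# predictor mentioned above (sizes and hash choice are assumptions).
import hashlib

class CountingBloomFilter:
    def __init__(self, num_counters: int = 4096, num_hashes: int = 3):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, key: int):
        """Derive num_hashes counter indexes from the line address."""
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % len(self.counters)

    def insert(self, key: int) -> None:
        """Called when a cache line is filled into the LLC."""
        for idx in self._indexes(key):
            self.counters[idx] += 1

    def remove(self, key: int) -> None:
        """Called when a cache line is evicted from the LLC."""
        for idx in self._indexes(key):
            self.counters[idx] = max(0, self.counters[idx] - 1)

    def may_contain(self, key: int) -> bool:
        """No false negatives; occasional false positives are possible."""
        return all(self.counters[idx] > 0 for idx in self._indexes(key))
```

A false positive merely skips one prefetch opportunity; a false negative would issue a redundant prefetch, which the structure rules out by construction.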
S614, the cache predictor sends a data prefetch request to the CXL root port.
In an embodiment, in the case where the prediction result indicates that the target data does not exist in the target processor cache, S624 below will necessarily be executed after S623, so the cache predictor may send a data prefetch request for the target data to the CXL device in advance through the CXL root port, where the data prefetch request may include the storage address of the target data.
S615, the CXL device determines whether the target data is already in the prefetch cache.
In an embodiment, in the case of receiving the data prefetch request from the cache predictor, the CXL device may detect in advance whether the target data is already stored in the prefetch cache, and if so, the CXL device may execute S618 without executing an additional operation; if not, S616 may be performed.
S616, the CXL device sends a data prefetch request to the memory controller.
In one embodiment, since the target data is not stored in the prefetch cache, the CXL device can only attempt to query the memory of the CXL device for the target data through the memory controller.
S617, the memory controller stores the target data in the memory into the prefetch buffer.
In an embodiment, the memory controller responds to the data prefetch request to query target data in the memory of the CXL device, and if the query is successful, the queried target data may be stored in the prefetch buffer, and if the query fails, a result of the query failure may be returned to the processor core through the CXL port.
S618, the CXL device terminates processing the data prefetch request.
In one embodiment, the CXL device may squash the data prefetch request, set a discard flag, or the like, so as to avoid the data prefetch request occupying additional space in the CXL device.
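Steps S615 through S618 on the device side can be sketched together as follows. The dict-based stand-ins for the prefetch cache and the device memory, and the string status codes, are assumptions made for illustration.

```python
# Sketch of S615-S618: on a data prefetch request, check the prefetch
# cache first; only on a miss go through the memory controller.

def handle_prefetch_request(addr, prefetch_cache, device_memory):
    if addr in prefetch_cache:
        # S618: the data was already prefetched; terminate with no extra work.
        return "terminated"
    if addr in device_memory:
        # S616/S617: memory controller fills the prefetch cache from memory.
        prefetch_cache[addr] = device_memory[addr]
        return "filled"
    # Query failure is reported back to the processor core via the CXL port.
    return "query_failed"
```

Terminating early on a prefetch-cache hit is what makes duplicate prefetch requests for the same address cheap.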
S621, the processor core accesses L2 and LLC and queries the target data.
S622, waiting for the query result.
In an embodiment, the processor core may access L2 and the LLC to query the target data while S611 is executed. If the address filter has determined that the storage address of the target data is not in the storage address range corresponding to the CXL device, or the cache predictor has predicted that the target data exists in the target processor cache, S623 may be executed after the query result of the processor core for L2 and the LLC is produced.
S623, LLC hits target data.
In an embodiment, if the LLC cache hits the target data, S6211 may be executed to directly return the target data in the processor cache to the processor core; if not, S624 may be performed.
S624, judging whether the storage address of the target data is in the storage address range corresponding to the CXL device.
In an embodiment, the processor core may determine whether the storage address of the target data is within the storage address range corresponding to the CXL device, either by following the earlier result of the address filter, by reusing the address filter, or through its own preset processing logic. If not, S625 may be performed; if so, S626 may be performed.
S625, the processor core accesses the target data in the local memory.
In an embodiment, this step corresponds to a general access scenario of the processor core to the local memory, and is not related to the CXL device, so the description will not be repeated here.
S626, the processor core sends a CXL data read request to the CXL root port.
In one embodiment, the processor core may send a CXL data read request containing the storage address of the target data to the CXL device via the CXL root port.
S627, the CXL device determines whether there is a processed data prefetch request matching the CXL data read request.
In one embodiment, the CXL device upon receiving the CXL data read request may attempt to match the currently outstanding data prefetch request for the target data, and if so, may execute S628; if not, S629 may be performed.
S628, the CXL device waits to process the data returned by the data prefetch request.
In one embodiment, the execution of the data prefetch request may be awaited because the previously sent data prefetch request for the target data has been received in the CXL device.
S629, the CXL device determines whether or not target data exists in the prefetch buffer.
In one embodiment, the CXL device may determine whether the target data is present in the prefetch cache, and if so, may execute S6211; if not, S6210 may be performed.
S6210, the CXL device sends a memory read request to the memory controller.
In one embodiment, the CXL device determines that there is no target data in the prefetch cache, and therefore needs to send a memory read request to the memory controller to cause the memory controller to fetch the target data in memory.
S6211, returning the target data to the processor core.
In one embodiment, the target data may be obtained from the local memory, the prefetch cache, or the memory of the CXL device, respectively, and is eventually returned to the processor core.
FIG. 7 is a flowchart of another data access method according to an exemplary embodiment of the present disclosure. As shown in FIG. 7, the method is applied to a processor core and may include the following steps:
S701, generating a cache data read request for target data, where the cache data read request is used to instruct the cache predictor corresponding to the processor core to: in the case where it is predicted that the target data does not exist in the target processor cache corresponding to the processor core, send a data prefetch request for the target data to the Compute Express Link (CXL) device corresponding to the processor core, so that the target data is stored in the cache corresponding to the CXL device;
S702, obtaining the target data from a cache corresponding to the CXL equipment.
As previously described, the above method further comprises:
querying the target data in the target processor cache in response to the cache data read request;
the obtaining the target data from the cache corresponding to the CXL device includes: and under the condition that the target data is not queried in the target processor cache, acquiring the target data from the cache corresponding to the CXL equipment.
As described above, the obtaining the target data from the cache corresponding to the CXL device includes: sending a CXL data read request for the target data to the CXL device, and receiving the target data returned by the CXL device; where the target data is read and returned from the cache corresponding to the CXL device in the case where the CXL device determines that no outstanding data prefetch request for the target data exists and the target data exists in the cache corresponding to the CXL device;
The method further comprises: sending a CXL data read request for the target data to the CXL device, and receiving the target data returned by the CXL device; where the CXL device queries and returns the target data from the memory corresponding to the CXL device in the case where the CXL device determines that no outstanding data prefetch request for the target data exists and the target data does not exist in the cache corresponding to the CXL device.
As previously described, the processor cores correspond to multi-level processor caches; the cache data reading request is initiated by the processor core under the condition that the target data is not queried in part of processor caches in the multi-level processor cache, and the target processor cache comprises a processor cache in the multi-level cache structure, which is not queried by the processor core for the target data.
As described above, the target processor cache is queried while the cache predictor predicts whether the target data is present in the target processor cache.
As previously described, the data prefetch request is used to instruct the CXL device to:
searching the target data and storing the target data into a cache corresponding to the CXL equipment under the condition that the target data does not exist in the cache corresponding to the CXL equipment;
and terminating the processing when the target data exists in the cache corresponding to the CXL device.
As described above, the cache data read request is used to instruct the cache predictor: and under the condition that the storage address of the target data is determined to be in the storage address range corresponding to the CXL device, predicting whether the target data exists in the target processor cache.
As described above, the setting location of the cache corresponding to the CXL device includes any one of the following: in the CXL device, in another CXL device distinct from the CXL device, or in a CXL switch connected to the CXL device.
As described above, the cache data read request is used to instruct the cache predictor: predicting whether the target data exists in the target processor cache through a counter bloom filter.
In addition, the technical solution of the present disclosure may introduce a paging mechanism in memory to migrate hot pages from the memory of the CXL device to the local memory of the host to which the processor core belongs, according to page-heat statistics, thereby reducing the effective access latency when using CXL memory. Page migration operations may incur the overhead of translation lookaside buffer (Translation Lookaside Buffer, TLB) shootdowns, which can carry significant time costs. Under suitable conditions (for example, improving the precision of page migration to reduce invalid migrations and hence TLB shootdowns, or reducing the cost of a TLB shootdown at the software or hardware level), the original CXL access pattern can be converted into local access with higher execution efficiency.
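The page-heat statistics driving such migration can be sketched as a per-page access counter with a threshold. The 4 KB page size, the threshold value, and the counter structure are illustrative assumptions, not the patent's concrete policy.

```python
# Sketch of page-heat tracking for hot-page migration from CXL memory
# to host-local memory (page size, threshold, and counters are assumed).

PAGE_SIZE = 4096
HOT_THRESHOLD = 8

access_counts: dict = {}

def record_access(addr: int) -> None:
    """Count one access against the 4 KB page containing addr."""
    page = addr // PAGE_SIZE
    access_counts[page] = access_counts.get(page, 0) + 1

def pages_to_migrate():
    """Pages whose access count reaches the threshold become migration
    candidates; migrating only genuinely hot pages limits the number of
    TLB shootdowns the migrations trigger."""
    return [p for p, c in access_counts.items() if c >= HOT_THRESHOLD]
```

Raising the threshold trades migration latency benefit for fewer invalid migrations, which is exactly the precision/shootdown trade-off the text describes.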
According to the above embodiments, in the scheme of the present disclosure the cache predictor and the cache corresponding to the CXL device are combined with each other: target data that misses a specific processor cache (for example, L1) is predicted in a hardware-coordinated manner, the target data predicted to be absent from the LLC is stored in advance in the cache corresponding to the CXL device, and together with the LLC access latency saved by the cache predictor, the average access latency of CXL memory can be significantly reduced, approaching the access latency of local DDR memory.
Fig. 8 is a schematic structural diagram of an electronic device in an exemplary embodiment. Referring to fig. 8, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a nonvolatile memory, and may include other required hardware. The processor reads the corresponding computer program from the nonvolatile memory into memory and runs it, forming a data access apparatus at the logic level. Of course, the present description does not exclude other implementations, such as logic devices or combinations of hardware and software; that is, the execution subject of the following processing flows is not limited to logic units, but may also be hardware or logic devices.
Corresponding to the foregoing embodiments of the data access method, the present specification also provides embodiments of a data access apparatus.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating a data access apparatus according to an exemplary embodiment. As shown in fig. 9, in a software implementation, the apparatus may include:
a target data prediction unit 901, configured to predict whether target data exists in a target processor cache corresponding to a processor core, in response to a cache data read request of the processor core for the target data;
And a data prefetch request sending unit 902, configured to, when the prediction result indicates that the target data does not exist in the target processor cache, send a data prefetch request for the target data to the Compute Express Link (CXL) device corresponding to the processor core, so that the cache corresponding to the CXL device stores the target data and the processor core obtains the target data from the cache corresponding to the CXL device.
Optionally, the processor core corresponds to a multi-level processor cache; the cache data reading request is initiated by the processor core under the condition that the target data is not queried in part of processor caches in the multi-level processor cache, and the target processor cache comprises a processor cache in the multi-level cache structure, which is not queried by the processor core for the target data.
Optionally, the target data prediction unit 901 is specifically configured to: and predicting whether the target data exists in the target processor cache while the processor core queries the target processor cache.
Optionally, the data prefetch request is used to instruct the CXL device to:
Searching the target data and storing the target data into a cache corresponding to the CXL equipment under the condition that the target data does not exist in the cache corresponding to the CXL equipment;
and terminating the processing when the target data exists in the cache corresponding to the CXL device.
Optionally, the target data prediction unit 901 is specifically configured to: and under the condition that the storage address of the target data is determined to be in the storage address range corresponding to the CXL device, predicting whether the target data exists in the target processor cache.
Optionally, the setting location of the buffer corresponding to the CXL device includes any one of the following:
in the CXL device, in another CXL device distinct from the CXL device, in a CXL switch connected to the CXL device.
Optionally, the target data prediction unit 901 is specifically configured to: predicting whether the target data exists in the target processor cache through a counter bloom filter.
The present specification also provides another embodiment of a data access apparatus. Referring to fig. 10, fig. 10 is a schematic diagram illustrating a structure of a data access device according to an exemplary embodiment. As shown in fig. 10, in a software implementation, the apparatus may include:
A cache data read request generating unit 1001, configured to generate a cache data read request for target data, where the cache data read request is used to instruct the cache predictor corresponding to a processor core to: in the case where it is predicted that the target data does not exist in the target processor cache corresponding to the processor core, send a data prefetch request for the target data to the Compute Express Link (CXL) device corresponding to the processor core, so that the target data is stored in the cache corresponding to the CXL device;
and a first target data obtaining unit 1002, configured to obtain the target data from a cache corresponding to the CXL device.
Optionally, the apparatus further includes:
a target data querying unit 1003 configured to query the target processor cache for the target data in response to the cache data read request;
the first target data obtaining unit 1002 is specifically configured to: and under the condition that the target data is not queried in the target processor cache, acquiring the target data from the cache corresponding to the CXL equipment.
Optionally, the first target data obtaining unit 1002 is specifically configured to: send a CXL data read request for the target data to the CXL device, and receive the target data returned by the CXL device; where the target data is read and returned from the cache corresponding to the CXL device in the case where the CXL device determines that no outstanding data prefetch request for the target data exists and the target data exists in the cache corresponding to the CXL device;
The apparatus further comprises:
a second target data obtaining unit 1004, configured to send a CXL data read request for the target data to the CXL device, and receive the target data returned by the CXL device; where the CXL device queries and returns the target data from the memory corresponding to the CXL device in the case where the CXL device determines that no outstanding data prefetch request for the target data exists and the target data does not exist in the cache corresponding to the CXL device.
Optionally, the processor core corresponds to a multi-level processor cache; the cache data reading request is initiated by the processor core under the condition that the target data is not queried in part of processor caches in the multi-level processor cache, and the target processor cache comprises a processor cache in the multi-level cache structure, which is not queried by the processor core for the target data.
Optionally, the target data query unit 1003 is specifically configured to: querying the target processor cache while the cache predictor predicts whether the target data is present in the target processor cache.
Optionally, the data prefetch request is used to instruct the CXL device to:
searching the target data and storing the target data into a cache corresponding to the CXL equipment under the condition that the target data does not exist in the cache corresponding to the CXL equipment;
and terminating the processing when the target data exists in the cache corresponding to the CXL device.
Optionally, the cache data read request generated by the cache data read request generating unit 1001 is used to instruct the cache predictor to: predict whether the target data exists in the target processor cache in the case where it is determined that the storage address of the target data is within the storage address range corresponding to the CXL device.
Optionally, the setting location of the buffer corresponding to the CXL device includes any one of the following:
in the CXL device, in another CXL device distinct from the CXL device, in a CXL switch connected to the CXL device.
Optionally, the cache data read request generated by the cache data read request generating unit 1001 is used to instruct the cache predictor to: predict whether the target data exists in the target processor cache through a counter bloom filter.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The present disclosure provides a data access system, where the system is as described in the embodiment of fig. 1, and a cache predictor in the data access system may exist in various forms, and if the cache predictor is implemented by a processor core through executing a preset code or is co-deployed with the processor core in a CPU, the data access system may be deployed in the processor; if the cache predictor is a separate hardware device coupled to the CPU, the data access system may be comprised of a processor and an additional cache predictor.
The present specification provides a processor deployed with a data access system as described above.
The present specification provides an electronic device having a data access system or processor as described above deployed.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general purpose and/or special purpose microprocessors, or any other type of processing unit. Typically, the processing unit will receive instructions and data from a read-only memory and/or a random access memory. The essential elements of a computer include a processing unit for executing or carrying out instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks, etc. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Moreover, the processes depicted in the accompanying drawings need not be performed in the particular order shown, or in sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only and is not intended to limit the scope of the disclosure; any modifications, equivalents, improvements, and the like that fall within the spirit and principles of the disclosure are intended to be included within its scope.

Claims (14)

1. A data access method, the method comprising:
in response to a cache data read request from a processor core for target data, predicting whether the target data is present in a target processor cache corresponding to the processor core;
and, when the prediction result indicates that the target data is not present in the target processor cache, sending a data prefetch request for the target data to a Compute Express Link (CXL) device corresponding to the processor core, so that the target data is stored in a cache corresponding to the CXL device, from which the processor core can then obtain the target data.
2. The method of claim 1, wherein the processor core corresponds to a multi-level processor cache; the cache data read request is initiated by the processor core when the target data is not found in some levels of the multi-level processor cache, and the target processor cache comprises the levels of the multi-level cache structure that the processor core has not yet queried for the target data.
3. The method of claim 1, wherein predicting whether the target data is present in the target processor cache corresponding to the processor core comprises:
predicting whether the target data is present in the target processor cache in parallel with the processor core querying the target processor cache.
4. The method of claim 1, wherein the data prefetch request instructs the CXL device to:
look up the target data and store it in the cache corresponding to the CXL device when the target data is not present in that cache;
and terminate processing when the target data is already present in the cache corresponding to the CXL device.
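As an illustration of the behaviour claim 4 describes, the sketch below models the CXL device's handling of a prefetch request: it fetches and caches the target data only when the data is not already in the device-side cache, and otherwise terminates early. The function and variable names are assumptions for illustration, not taken from the patent.

```python
# Hypothetical sketch of the claim-4 prefetch handling on the CXL device.
# Caches and memory are modelled as plain dicts keyed by address.
def handle_prefetch_request(address, cxl_cache, backing_memory):
    """Return True if the data was fetched into the CXL-side cache,
    False if it was already present and processing terminated early."""
    if address in cxl_cache:
        return False  # claim 4: target data already cached, terminate
    # claim 4: look up the target data and store it in the CXL-side cache
    cxl_cache[address] = backing_memory[address]
    return True

memory = {0x2000: b"payload"}
cache = {}
print(handle_prefetch_request(0x2000, cache, memory))  # True: fetched into cache
print(handle_prefetch_request(0x2000, cache, memory))  # False: already cached
```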
5. The method of claim 1, wherein predicting whether the target data is present in the target processor cache corresponding to the processor core comprises:
predicting whether the target data is present in the target processor cache when the storage address of the target data is determined to fall within the storage address range corresponding to the CXL device.
6. The method of claim 1, wherein the cache corresponding to the CXL device is located in any one of:
the CXL device itself, another CXL device distinct from the CXL device, or a CXL switch connected to the CXL device.
7. The method of claim 1, wherein predicting whether the target data is present in the target processor cache corresponding to the processor core comprises: predicting whether the target data is present in the target processor cache by means of a counting Bloom filter.
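The counting Bloom filter of claim 7 can be sketched as follows: counters are incremented when a cache line is filled and decremented when it is evicted, so the filter can answer "possibly cached" or "definitely not cached" without touching the cache itself. The class and method names, sizes, and hash scheme are illustrative assumptions, not details from the patent.

```python
# Hypothetical sketch of a counting Bloom filter used as a cache-presence
# predictor. A zero at any probed counter guarantees the line is absent;
# all-nonzero means "probably present" (false positives are possible).
import hashlib

class CountingBloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _indexes(self, key):
        # Derive num_hashes independent indexes from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        # Cache line brought into the processor cache.
        for idx in self._indexes(key):
            self.counters[idx] += 1

    def remove(self, key):
        # Cache line evicted from the processor cache.
        for idx in self._indexes(key):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1

    def may_contain(self, key):
        # Predicted hit if every probed counter is nonzero.
        return all(self.counters[idx] > 0 for idx in self._indexes(key))

predictor = CountingBloomFilter()
predictor.add(0x1000)                  # line 0x1000 filled into the cache
print(predictor.may_contain(0x1000))   # True: predicted present
predictor.remove(0x1000)               # line 0x1000 evicted
print(predictor.may_contain(0x1000))   # False: predicted miss, prefetch to CXL
```

Because removals decrement rather than clear counters, the filter tolerates multiple lines hashing to the same slots, which a plain bit-vector Bloom filter cannot do.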
8. A data access method, the method comprising:
generating a cache data read request for target data, the cache data read request instructing a cache predictor corresponding to a processor core to: when it is predicted that the target data is not present in a target processor cache corresponding to the processor core, send a data prefetch request for the target data to a Compute Express Link (CXL) device corresponding to the processor core, so that the target data is stored in a cache corresponding to the CXL device;
and obtaining the target data from the cache corresponding to the CXL device.
9. The method of claim 8, wherein
obtaining the target data from the cache corresponding to the CXL device comprises: sending a CXL data read request for the target data to the CXL device and receiving the target data returned by the CXL device, the target data being read and returned from the cache corresponding to the CXL device when the CXL device determines that no outstanding data prefetch request for the target data exists and the target data is present in that cache;
the method further comprising: sending a CXL data read request for the target data to the CXL device and receiving the target data returned by the CXL device, the CXL device looking up and returning the target data from a memory corresponding to the CXL device when it determines that no outstanding data prefetch request for the target data exists and the target data is not present in the cache corresponding to the CXL device.
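The two branches of claim 9 can be sketched as a single read path on the CXL device: when no prefetch for the address is outstanding, the device serves the read from its cache if the data is present, and from device memory otherwise. All names, the retry behaviour for an in-flight prefetch, and the dict-based modelling are assumptions for illustration only.

```python
# Hedged sketch of the claim-9 read path on the CXL device.
def serve_cxl_read(address, cxl_cache, device_memory, outstanding_prefetches):
    """Serve a CXL data read request; returns (data, source)."""
    if address in outstanding_prefetches:
        # An unfinished prefetch exists; how the device waits or retries
        # is not specified in the claims, so we just signal it here.
        raise RuntimeError("prefetch in flight; retry once it completes")
    if address in cxl_cache:
        return cxl_cache[address], "cache"    # claim 9, first branch
    return device_memory[address], "memory"   # claim 9, second branch

cache = {0x3000: b"hot"}
memory = {0x3000: b"hot", 0x4000: b"cold"}
print(serve_cxl_read(0x3000, cache, memory, set()))  # served from the CXL-side cache
print(serve_cxl_read(0x4000, cache, memory, set()))  # falls back to device memory
```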
10. A data access system, the system comprising a cache predictor and a processor core, wherein:
the cache predictor is configured to implement the steps of the method according to any one of claims 1 to 7;
and the processor core is configured to perform the steps of the method according to claim 8 or 9.
11. A processor comprising the data access system of claim 10.
12. An electronic device comprising the data access system of claim 10 or the processor of claim 11.
13. A computer-readable storage medium having stored thereon a computer program which, when executed, performs the steps of the method according to any one of claims 1 to 9.
14. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 9 when executing the program.
CN202310532247.7A 2023-05-08 2023-05-08 Data access method, readable storage medium and electronic equipment Pending CN116680214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310532247.7A CN116680214A (en) 2023-05-08 2023-05-08 Data access method, readable storage medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN116680214A true CN116680214A (en) 2023-09-01

Family

ID=87784488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310532247.7A Pending CN116680214A (en) 2023-05-08 2023-05-08 Data access method, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN116680214A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009264A (en) * 2023-09-13 2023-11-07 上海云豹创芯智能科技有限公司 Method, system, chip and storage medium for realizing high-speed memory expansion in SOC
CN117009264B (en) * 2023-09-13 2023-12-19 上海云豹创芯智能科技有限公司 Method, system, chip and storage medium for realizing high-speed memory expansion in SOC

Similar Documents

Publication Publication Date Title
EP2642398B1 (en) Coordinated prefetching in hierarchically cached processors
US9280474B2 (en) Adaptive data prefetching
TWI437490B (en) Microprocessor and method for reducing tablewalk time
US7555597B2 (en) Direct cache access in multiple core processors
US9047198B2 (en) Prefetching across page boundaries in hierarchically cached processors
US8285926B2 (en) Cache access filtering for processors without secondary miss detection
TW201214114A (en) Method and system to reduce the power consumption of a memory device
CN112416437B (en) Information processing method, information processing device and electronic equipment
KR20150079408A (en) Processor for data forwarding, operation method thereof and system including the same
US10114761B2 (en) Sharing translation lookaside buffer resources for different traffic classes
KR20060023963A (en) Apparatus and method to provide multithreaded computer processing
CN116680214A (en) Data access method, readable storage medium and electronic equipment
EP2524314B1 (en) System and method to access a portion of a level two memory and a level one memory
US20070271407A1 (en) Data accessing method and system for processing unit
CN114925001A (en) Processor, page table prefetching method and electronic equipment
KR102482516B1 (en) memory address conversion
US8850159B2 (en) Method and system for latency optimized ATS usage
CN112416436B (en) Information processing method, information processing device and electronic equipment
US20140173225A1 (en) Reducing memory access time in parallel processors
US11048637B2 (en) High-frequency and low-power L1 cache and associated access technique
WO2013086060A1 (en) Selective access of a store buffer based on cache state
CN111198827B (en) Page table prefetching method and device
CN115269492A (en) Streaming data management method and device for multi-port cache of reconfigurable processor
JP2001142698A (en) Memory access system
CN115587052A (en) Processing method of cache performance and related equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination