CN118035131A - Data prefetching method and device, processor and computer readable storage medium - Google Patents

Info

Publication number: CN118035131A
Application number: CN202410350103.4A
Authority: CN (China)
Prior art keywords: data, thread, prefetcher, shared, threads
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 麻莹莹, 金伟松
Current assignee: Haiguang Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Haiguang Information Technology Co Ltd
Application filed by Haiguang Information Technology Co Ltd; priority to CN202410350103.4A; publication of CN118035131A

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Embodiments of the present disclosure provide a data prefetching method and apparatus, a processor, and a computer readable storage medium. The data prefetching method comprises the following steps: acquiring thread information of a plurality of threads processed in a processor; and according to the thread information of the plurality of threads, in response to at least a first thread and a second thread of the plurality of threads sharing code, causing the threads that share code to share the same shared data prefetcher for data prefetching. The method can improve the overall performance of the processor and reduce the power consumption generated by prefetcher training.

Description

Data prefetching method and device, processor and computer readable storage medium
Technical Field
Embodiments of the present disclosure relate to a data prefetching method, a data prefetching apparatus, a processor, and a computer-readable storage medium.
Background
Modern multi-issue, high-performance CPUs (Central Processing Units) include at least one processor core, and each processor core includes multiple execution units to execute instructions. For example, the instruction-execution pipeline includes five stages: instruction fetch (IF), decode (ID, instruction dispatch/decode), execute (EX), memory access (MEM), and write back (WB, which updates the result of instruction execution to the registers).
For an application program, a thread is the minimum scheduling unit for running and executing a process (a process is a running instance of a program and the independent unit to which the system allocates resources). When a processor core executes instructions from a single thread, the multiple execution units and hardware resources in the processor core cannot be fully utilized most of the time; in particular, when the single thread stalls for some reason (such as an L2 cache flush), the execution units can only idle, wasting hardware resources and reducing the performance-to-power ratio.
If multiple threads run simultaneously in a process to accomplish different tasks, the design is referred to as multithreading. Simultaneous multithreading (SMT, also known as concurrent multithreading) is a hardware multithreading technology that can execute instructions from multiple threads in one operation cycle. It utilizes mechanisms such as the multiple execution units and out-of-order execution of a high-performance processor core to execute instructions of multiple threads at the same time; when one thread stalls for some reason, the other threads can still run, or the spare resources of one thread can be used by another thread, thereby improving the multithreaded throughput of the processor core, the overall performance and performance-to-power ratio of the CPU, and the utilization of hardware resources.
Disclosure of Invention
At least one embodiment of the present disclosure provides a data prefetching method, including: acquiring thread information of a plurality of threads processed in a processor; and according to the thread information of the plurality of threads, in response to at least a first thread and a second thread of the plurality of threads sharing code, causing the threads that share code to share the same shared data prefetcher for data prefetching.
In a data prefetching method provided in at least one embodiment of the present disclosure, sharing the same shared data prefetcher among the threads that share code for data prefetching includes: performing data prefetch training on the shared data prefetcher using requests issued by the first thread.
In the data prefetching method provided in at least one embodiment of the present disclosure, the training of data prefetching for the shared data prefetcher using the request issued by the first thread includes: requests issued by threads of shared code other than the first thread are blocked or filtered to avoid use of requests issued by threads of shared code other than the first thread for data prefetch training of the shared data prefetcher.
In the data prefetching method provided in at least one embodiment of the present disclosure, the sharing the threads of the shared code with the same shared data prefetcher to perform the data prefetching process, further includes: responding to the request of the first thread and the request of the second thread respectively using the shared data prefetcher.
The data prefetching method provided by at least one embodiment of the present disclosure further includes: determining, according to the thread information of the plurality of threads, that the first thread and the second thread share code, wherein the first thread uses a first data prefetcher to perform the data prefetching process, and the second thread uses a second data prefetcher to perform the data prefetching process; wherein the sharing of the same shared data prefetcher by the threads that share code for data prefetching includes: combining the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher.
In the data prefetching method provided in at least one embodiment of the present disclosure, the merging the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher includes: the first data prefetcher is used as the shared data prefetcher.
In the data prefetching method provided in at least one embodiment of the present disclosure, the merging the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher includes: at least a portion of the hardware resources of the first data prefetcher and at least a portion of the hardware resources of the second data prefetcher are combined for use with the shared data prefetcher.
The data prefetching method provided in at least one embodiment of the present disclosure further includes: and setting a sharing state identifier in an item trained by the shared data prefetcher to record that the shared data prefetcher is shared by threads of the shared code.
The data prefetching method provided in at least one embodiment of the present disclosure further includes: according to the thread information of the plurality of threads, in response to the first thread and the second thread in the plurality of threads no longer sharing codes, a first data prefetcher and a second data prefetcher are obtained by the shared data prefetcher to be used for the first thread and the second thread to conduct data prefetching processing respectively.
In the data prefetching method provided in at least one embodiment of the present disclosure, obtaining the first data prefetcher and the second data prefetcher from the shared data prefetcher for the first thread and the second thread to perform the data prefetching process, respectively, includes: causing the first data prefetcher to retain at least a first part of the entries of the shared data prefetcher, and causing the second data prefetcher to retain at least a second part of the entries of the shared data prefetcher or to be empty.
In the data prefetching method provided in at least one embodiment of the present disclosure, the first part of the entries retained by the first data prefetcher and the second part of the entries retained by the second data prefetcher are at least partially identical to each other.
In the data prefetching method provided in at least one embodiment of the present disclosure, after the first data prefetcher and the second data prefetcher are obtained from the shared data prefetcher, in response to the second data prefetcher being empty, requests issued by the second thread are used for data prefetch training of the second data prefetcher.
At least one embodiment of the present disclosure further provides a data prefetching apparatus, including: the thread information acquisition module is configured to acquire thread information of a plurality of threads processed in the processor; and the data prefetching processing module is configured to respond to at least a first thread and a second thread sharing code in the plurality of threads according to the thread information of the plurality of threads, and the threads sharing the code share the same shared data prefetcher so as to perform data prefetching processing.
In the data prefetching apparatus provided in at least one embodiment of the present disclosure, the data prefetching processing module includes: a prefetch training module configured to perform data prefetch training on the shared data prefetcher using a request issued by the first thread; and the request response module is configured to respond to requests sent by threads of the shared codes respectively by using the shared data prefetcher.
In the data prefetching apparatus provided in at least one embodiment of the present disclosure, the data prefetching processing module further includes: a sharing state determining module configured to determine, according to the thread information of the plurality of threads, that the first thread and the second thread share code, wherein the first thread uses a first data prefetcher for data prefetching and the second thread uses a second data prefetcher for data prefetching; and a resource allocation module configured to: combine the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher in response to the first thread and the second thread sharing code, or obtain the first data prefetcher and the second data prefetcher from the shared data prefetcher for the first thread and the second thread to perform data prefetching, respectively, in response to the first thread and the second thread no longer sharing code.
The data prefetching apparatus provided in at least one embodiment of the present disclosure further includes: an identifier setting module configured to set a shared state identifier in an entry trained by the shared data prefetcher to record that the shared data prefetcher is shared by the threads that share code.
At least one embodiment of the present disclosure further provides a data prefetching apparatus, including a processing unit and a memory communicatively connected to the processing unit; wherein the memory stores computer readable instructions; the processing unit executes the computer readable instructions stored in the memory to implement the data prefetching method provided by any of the embodiments of the present disclosure.
At least one embodiment of the present disclosure also provides a processor, including a data prefetching apparatus provided in any one embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a computer readable storage medium, where computer readable instructions are stored in the computer readable storage medium, and when the processor executes the computer readable instructions, the data prefetching method provided in any embodiment of the present disclosure is implemented.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly described below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure, not to limit the present disclosure.
FIG. 1A illustrates a flow chart for training and prefetching by a data prefetcher;
FIG. 1B is a flow chart illustrating a method for prefetching data according to at least one embodiment of the present disclosure;
FIG. 2A is a schematic diagram of a data prefetch training process when a first thread and a second thread are in a shared code state provided in at least one embodiment of the present disclosure;
FIG. 2B illustrates a schematic diagram of a data prefetch operation when a first thread and a second thread are in a shared code state provided by at least one embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a data prefetch operation when a first thread and a second thread are in an unshared code state provided by at least one embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a data prefetching apparatus according to at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating another data prefetching apparatus according to at least one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a computer-readable storage medium provided in accordance with at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present disclosure. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art without the need for inventive faculty, are within the scope of the present disclosure, based on the described embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. Likewise, the terms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
Computers typically include one or more processor cores (CPU cores) and a multi-level cache architecture, in which the level one (L1) cache memory has the fastest access speed but the smallest capacity and is typically located within the processor core; the last level cache (LLC, typically the third level) is the largest but the slowest to access and is typically shared by multiple processor cores; the access speed and capacity of the level two (L2) cache memory are intermediate between those of the L1 cache memory and the LLC, and it is typically also located within the processor core.
The terms that may be involved in at least one embodiment of the present disclosure are explained as follows.
Data prefetch: in a CPU architecture, program instructions and data may be stored in dynamic random access memory (Dynamic Random Access Memory, DRAM). The operating frequency of the processor core is far higher than that of the DRAM memory, so hundreds of processor-core clock cycles may be required to acquire data and instructions from memory, leaving the processor core idle while it waits for the related instructions and data and causing performance loss. Therefore, modern high-performance processor cores all contain a multi-level cache architecture to store recently accessed data, and use a prefetcher to discover the CPU's data access patterns so as to prefetch the data and instructions to be accessed into a cache in advance.
Prefetcher: a prefetcher used to prefetch instructions is referred to as an instruction prefetcher; a prefetcher used to prefetch data is referred to as a data prefetcher. For example, data prefetchers may be divided into first-level data prefetchers (i.e., data prefetchers that prefetch target data into the first-level cache memory), second-level data prefetchers (i.e., data prefetchers that prefetch target data into the second-level cache memory), last-level data prefetchers (i.e., data prefetchers that prefetch target data into the last-level cache memory), and so on.
Prefetch pattern: in data prefetching, different prefetchers learn different prefetch patterns by identifying regularities in the access history of memory addresses. A prefetcher initiates new data prefetches according to the learned prefetch pattern, so that data blocks are fetched into the cache before the corresponding memory accesses arrive; when a memory access arrives, it hits directly in the cache, thereby hiding memory access latency.
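As a concrete illustration of pattern learning, the following C++ sketch implements one common prefetch pattern, stride detection, in software. The table layout, the confidence threshold, and all names are illustrative assumptions, not something this disclosure prescribes; a hardware prefetcher would implement the same bookkeeping with a fixed-size table rather than an unbounded map.

    #include <cstdint>
    #include <unordered_map>

    // Minimal stride-pattern learner: one entry per instruction address (PC).
    struct StrideEntry {
        uint64_t last_addr = 0;    // last memory address seen for this PC
        int64_t  stride = 0;       // last observed address delta
        int      confidence = 0;   // incremented while the stride repeats
    };

    class StridePrefetcher {
    public:
        // Train on one access; returns true and a prefetch address once the
        // same stride has been observed repeatedly.
        bool Train(uint64_t pc, uint64_t addr, uint64_t* prefetch_addr) {
            StrideEntry& e = table_[pc];
            const int64_t delta = static_cast<int64_t>(addr - e.last_addr);
            if (delta == e.stride && delta != 0) {
                if (e.confidence < 3) ++e.confidence;
            } else {               // pattern broke: start learning a new stride
                e.stride = delta;
                e.confidence = 0;
            }
            e.last_addr = addr;
            if (e.confidence >= 2) {               // stride confirmed
                *prefetch_addr = addr + e.stride;  // fetch the next block early
                return true;
            }
            return false;
        }
    private:
        std::unordered_map<uint64_t, StrideEntry> table_;
    };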
In the case of SMT, prefetchers are provided for a plurality of threads, respectively, and training of the prefetchers is derived from memory address access histories of the plurality of threads; because of the overall limited hardware resources, each prefetcher's hardware resources are allocated according to the number of threads, each thread training a corresponding prefetch pattern according to the thread's requests. Thus, in the case of multithreading, the resources each thread occupies in the prefetcher are further limited, and it may not be guaranteed that the prefetcher can learn the prefetch pattern of each thread.
FIG. 1A shows a flow chart for training and prefetching by a data prefetcher. The prefetcher shown in FIG. 1A is a first-level data prefetcher, i.e., a data prefetcher that prefetches target data into the first-level cache memory. The data prefetcher is trained using virtual addresses.
For example, as shown in FIG. 1A, the data prefetcher performs data prefetch training and data prefetch operations (also simply referred to as "training" and "prefetching") through the following steps S012 to S017.
Step S012: the data prefetcher receives the virtual addresses and other attributes of all (or part of) the access requests (e.g., the historical access requests), trains with the virtual addresses of the access requests (e.g., the historical access requests) to obtain the data access rules of the processor cores, outputs the data prefetching requests based on the data access rules, and prefetches the data based on the data prefetching requests.
Step S013: the virtual address of the target data of the data prefetch request is translated to a physical address using an address translator, and then a check is made (e.g., by the processor core looking up using the physical address of the target data of the data prefetch request) as to whether the target data of the data prefetch request is in the primary cache memory.
For example, if the target data of the data prefetch request is in the primary cache memory, the data prefetch request is discarded. Correspondingly, the following steps S014-S017 need not be performed for this data prefetch request.
Step S014: if the target data of the data prefetch request is not in the first-level cache, an entry is requested from the Miss Address Buffer (MAB) 112 and allocated to the data prefetch request.
Step S015: the miss address buffer 112 requests the target data from the next-level cache memory (e.g., the second-level cache memory).
Step S016: the next-level cache memory acquires the target data and returns it to the miss address buffer.
For example, as shown in FIG. 1A, in the case where the target data of the data prefetch request is stored in the second-level cache memory, the target data is acquired directly from the second-level cache memory. In the case where the target data of the data prefetch request is not stored in the second-level cache memory, the second-level cache memory acquires the target data of the data prefetch request from the storage at the next level below the second-level cache memory.
Step S017: the miss address buffer writes the target data into the first-level cache memory.
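Steps S013 to S017 can be summarized by the following software sketch. The identity address translation, the 16-entry MAB capacity, and the container-based caches are simplifying assumptions for illustration only; real hardware implements this flow in pipeline logic rather than software.

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    struct CacheLine { uint64_t data[8] = {}; };

    std::unordered_set<uint64_t> l1_tags;               // lines currently in the L1 cache
    std::unordered_map<uint64_t, CacheLine> next_level; // next-level cache (backed by memory)
    std::unordered_set<uint64_t> mab;                   // outstanding miss addresses

    void HandlePrefetchRequest(uint64_t vaddr) {
        uint64_t paddr = vaddr;              // step S013: translate (identity-mapped here)
        if (l1_tags.count(paddr)) return;    // target already in L1: drop the request
        if (mab.size() >= 16) return;        // step S014: no free MAB entry available
        mab.insert(paddr);                   // step S014: allocate a MAB entry
        CacheLine line = next_level[paddr];  // steps S015-S016: fetch from next level
        (void)line;
        l1_tags.insert(paddr);               // step S017: MAB writes the line into L1
        mab.erase(paddr);                    // the entry retires once the fill completes
    }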
It should be noted that, while the data prefetcher is training and prefetching, the processor core reads data to perform the relevant operations. Correspondingly, for convenience of description, FIG. 1A also shows part of the flow of the processor core reading data. For example, the flow of the processor core reading data includes the following step S011.
Step S011: the processor core outputs the virtual address of the target data of an access request to an address translator (e.g., an address translation pipeline) 111, which translates the virtual address into a physical address; the processor core then checks whether the target data of the access request is in the first-level cache memory.
If the target data of the access request is in the first-level cache memory, the processor core fetches it from the first-level cache memory. If not, the miss address buffer 112 acquires the target data of the access request and writes it into the first-level cache memory, and the processor core then fetches it from the first-level cache memory.
For example, the data prefetcher shown in FIG. 1A may be trained using only the addresses of load access requests (e.g., historical load access requests). In this case, the access pattern acquired by the data prefetcher reflects only load access requests; correspondingly, the data prefetcher can only perform load prefetching, that is, it can only prefetch the target data of load access requests. Since the data prefetcher cannot prefetch the target data of store access requests, when the target data of a store access request is not located in the cache memories (e.g., the first-level and second-level cache memories), the data must be fetched from memory (DRAM), which may adversely affect CPU performance.
For example, in order for a data prefetcher to prefetch target data of a store access request, the data prefetcher (single data prefetcher) shown in FIG. 1A may be trained using both the address of a load access request (e.g., a history load access request) and the address of a store access request (e.g., a history store access request) such that the access requests output by the data prefetcher include both load access requests and store access requests.
The inventors of the present disclosure have noted that, in a simultaneous multithreading processor running a multithreaded application that shares code, two threads of the plurality of threads, for example, execute at least partially the same instruction sequence, so the two threads have substantially the same data access pattern, i.e., each thread has the same prefetch pattern. This causes the data prefetcher to be trained repeatedly for the same prefetch pattern, wasting a certain amount of resources.
In view of the above shortcomings, one or more embodiments of the present disclosure provide a data prefetching method, a data prefetching apparatus, a processor, and a computer-readable storage medium.
The data prefetching method comprises the following steps: acquiring thread information of a plurality of threads processed in a processor; and according to the thread information of the plurality of threads, in response to at least a first thread and a second thread of the plurality of threads sharing code, causing the threads that share code to share the same shared data prefetcher for data prefetching. The method can improve the utilization of the prefetcher's hardware resources by having the threads that share code share the same shared data prefetcher: the shared data prefetcher need only learn the prefetch pattern of any one of the code-sharing threads (such as the first thread or the second thread), and that prefetch pattern applies simultaneously to the requests issued by all of the code-sharing threads. In at least one embodiment, the processor thus has sufficient resources to fully learn the prefetch pattern, improving the prefetcher hit rate.
The processor is, for example, a Simultaneous Multithreaded (SMT) processor.
The present disclosure is illustrated by the following several specific examples. Detailed descriptions of known functions and known components may be omitted for the sake of clarity and conciseness in the following description of the embodiments of the present disclosure. When any element of an embodiment of the present disclosure appears in more than one drawing, the element is identified by the same or similar reference numeral in each drawing.
Fig. 1B is a schematic flow chart of a data prefetching method according to at least one embodiment of the present disclosure, where the data prefetching method includes steps S10 and S20.
Step S10: thread information for a plurality of threads being processed in a processor is obtained.
Step S20: according to the thread information of the plurality of threads, in response to at least a first thread and a second thread of the plurality of threads sharing code, the threads that share code are made to share the same shared data prefetcher for data prefetching.
For example, the thread information may be information about the instruction sequences executed by each of the plurality of threads, including shared-code information, that is, which of the plurality of threads currently share code and which do not. For example, whether code is shared may be recorded by a register; for an SMT4 (i.e., simultaneous 4-thread) processor, a 4-bit register may be used to record which of the threads currently share code. For example, bits 0 through 3 of the register correspond to threads T0, T1, T2, and T3, respectively; when two threads share code, the corresponding bit positions of those threads in the register are set to 1, and otherwise to 0, so "0101" indicates that threads T0 and T2 share code.
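A minimal sketch of such a sharing register, assuming the bit-per-thread encoding just described; the helper names are illustrative.

    #include <cstdint>

    using ShareMask = uint8_t;   // SMT4: only the low 4 bits are used

    // Bit i is set when thread Ti belongs to the code-sharing group.
    constexpr bool SharesCode(ShareMask m, int thread) {
        return ((m >> thread) & 1u) != 0;
    }

    constexpr ShareMask MarkShared(ShareMask m, int thread) {
        return static_cast<ShareMask>(m | (1u << thread));
    }

    // "0101" from the example above: threads T0 and T2 share code.
    static_assert(SharesCode(0b0101, 0) && SharesCode(0b0101, 2), "T0, T2 share");
    static_assert(!SharesCode(0b0101, 1) && !SharesCode(0b0101, 3), "T1, T3 do not");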
For example, the first thread and the second thread share code, i.e., the first thread and the second thread execute the same instruction sequence, and thus the first thread and the second thread may share the same shared data prefetcher and perform data prefetching processing in the same prefetch mode, including data prefetching training or data prefetching operations, which are performed using prefetchers obtained through the data prefetching training.
For example, for an SMT4 processor that includes 4 threads, 3 of the threads may share code and use the same data prefetcher for data prefetching; it is also possible for all 4 threads to share code and use the same data prefetcher for data prefetching, which is not limited in this disclosure.
For example, step S20 may include step S210 or step S220; there is no limitation on the execution order of step S210 and step S220, for example, they may be executed sequentially or in parallel, so that both prefetch training and prefetch operations proceed dynamically.
Step S210: the shared data prefetcher is trained for data prefetching using requests issued by the first thread.
Step S220: the shared data prefetchers are used to respond to requests by threads that share code, respectively.
In step S210, since the prefetch patterns of the code-sharing threads are the same, and the code-sharing threads include at least the first thread and the second thread, the data prefetcher may be trained with requests issued by any one of the code-sharing threads. Here the first thread is selected; of course, the second thread, or another of the code-sharing threads, could be selected instead.
When the first thread is used for data prefetch training, requests (data prefetch requests) issued by code-sharing threads other than the first thread are blocked or filtered, so that such requests are not used for data prefetch training of the shared data prefetcher; this removes a redundant training process, improves processor performance, and reduces processor power consumption. For example, data prefetch requests issued by code-sharing threads other than the first thread may be kept from entering the shared data prefetcher, or simply not processed by it, thereby blocking or filtering those requests.
It should be noted that embodiments of the present disclosure are not limited to performing data prefetch training with the first thread while blocking the other code-sharing threads; it is only necessary to train the shared data prefetcher with requests issued by any one of the code-sharing threads and to prevent the threads other than the selected one from participating in the data prefetch training process, thereby avoiding repeated training.
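The blocking/filtering step can be pictured with the following sketch, assuming the bit-per-thread sharing mask described earlier and a designated "trainer" thread; the structure and names are illustrative and are not the patent's hardware design.

    #include <cstdint>

    struct AccessRequest {
        int      thread_id;   // thread number carried by the request
        uint64_t pc;          // requesting instruction address
        uint64_t vaddr;       // requested virtual address
    };

    class TrainingFilter {
    public:
        TrainingFilter(uint8_t share_mask, int trainer)
            : share_mask_(share_mask), trainer_(trainer) {}

        // True when the request is allowed to reach prefetch training.
        bool AdmitForTraining(const AccessRequest& r) const {
            const bool shares = ((share_mask_ >> r.thread_id) & 1u) != 0;
            if (!shares) return true;        // non-sharing threads train their own prefetchers
            return r.thread_id == trainer_;  // only the selected thread trains the shared one
        }

    private:
        uint8_t share_mask_;  // bit i set: thread Ti shares code
        int     trainer_;     // the sharing thread chosen to drive training
    };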
As described above, the thread information itself may be modified in near real time as the application executes to reflect the current state of the multiple threads in the processor, so that the processor may use the shared data prefetcher for some period of time and, for another period of time, stop using the shared data prefetcher and let each thread resume its own data prefetcher, and vice versa.
For example, the data prefetching method may further include the following step S30 before the above step S20.
Step S30: and determining a first thread and a second thread to share codes according to the thread information of the threads, wherein the first thread uses a first data prefetcher to perform data prefetching processing, and the second thread uses a second data prefetcher to perform data prefetching processing.
That is, before the first thread and the second thread share the same shared data prefetcher for data prefetching, each thread performs data prefetching with its own data prefetcher; after it is determined that the first thread and the second thread share code, the data prefetching mode is switched from the per-thread data prefetchers to the shared data prefetcher.
For example, code may be shared by a first thread and a second thread of the plurality of threads by using a code injection technique; for example, specific instructions or tags may be inserted into the code to identify the shared code portions and thereby determine the threads of the shared code.
For another example, the thread information may originate from statistics for instructions within a predetermined time period (e.g., 100 milliseconds) prior to the present time, or from a compiled record of the multithreaded application by the compiler, etc.; the thread information itself may be modified in near real-time as the application executes to reflect the current state of multiple threads in the processor, as embodiments of the present disclosure are not limited in this regard.
Step S30 may also be implemented by introducing additional metadata into the I-Cache (instruction cache), such as the most recently accessed thread ID and the type of code stored, to detect whether different threads share code.
As described above, before determining that the first thread and the second thread share code, the first thread performs data prefetching processing using the first data prefetcher, and the second thread performs data prefetching processing using the second data prefetcher; correspondingly, for example, after determining that the first thread and the second thread share code, sharing the same shared data prefetcher by the first thread and the second thread for data prefetching processing includes: the first data prefetcher and the second data prefetcher are combined to obtain a shared data prefetcher.
For example, the first data prefetcher may be used directly as the shared data prefetcher for subsequent data prefetch training and data prefetch operations; in this case, the shared data prefetcher starts with the data of the first data prefetcher, and, for example, the resources occupied by the second data prefetcher may be reallocated after it is emptied. Alternatively, at least part of the hardware resources of the first data prefetcher and at least part of the hardware resources of the second data prefetcher may be combined for the shared data prefetcher; in this case, the shared data prefetcher is the union of the first data prefetcher and the second data prefetcher, its initial data is the combination of the data of both, and duplicated parts of the data may, for example, be removed.
For example, the hardware resources of the prefetcher include a prefetch table (Prefetch Table), prefetch buffers, address prediction units, state machines, controllers, and the like. The prefetch table comprises a plurality of table entries (entries) which are respectively used for recording and tracking the memory access mode and the address information of prefetch data or instructions; the prefetching buffer is used for temporarily storing prefetched data to wait for the use of the processor core; the address prediction unit is used for generating future prediction of the memory address according to the pre-fetch mode; the state machine and controller are used to manage and schedule prefetch operations, such as deciding when to initiate a prefetch operation, how to update prefetch entries, and the like.
In the prefetch table of a prefetcher, information about each predicted or executed prefetch operation is recorded as an entry. For example, each entry contains the associated thread number, the predicted memory address, the prefetch status, and associated metadata such as timestamps and hit-rate statistics.
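A sketch of one such entry as a struct, with field names and widths chosen for illustration only:

    #include <cstdint>

    struct PrefetchEntry {
        uint8_t  thread_id;       // thread (or trainer) the entry belongs to
        uint8_t  share_mask;      // shared state identifier, e.g. 4 bits for SMT4
        uint64_t predicted_addr;  // next memory address predicted by the pattern
        uint8_t  state;           // prefetch status (e.g. trained / issued / filled)
        uint32_t timestamp;       // metadata: last-touched time, for replacement
        uint16_t hits;            // metadata: hit-rate statistics
        uint16_t issues;
    };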
For example, FIG. 2A illustrates a schematic diagram of a data prefetch training process when a first thread and a second thread are in a shared code state provided by at least one embodiment of the present disclosure.
This embodiment takes a 4-thread simultaneous multithreading (SMT4) processor as an example. As shown in FIG. 2A, the SMT4 processor has 4 threads running in it (i.e., a first thread T1, a second thread T2, a third thread T3, and a fourth thread T4), which issue requests R1-R4, respectively. Before it is determined that any of threads T1-T4 share code, each thread has its own corresponding data prefetcher; the hardware resources 200 for data prefetchers in the processor may be allocated among the 4 data prefetchers corresponding to threads T1-T4. For example, the hardware resources 200 may include 16 entries, and the data prefetcher corresponding to each thread may be allocated 4 entries in an even allocation pattern. For ease of description, the hardware resources allocated to threads T1-T4 are described as a first data prefetcher (not shown, see FIG. 3 below), a second data prefetcher (not shown, see FIG. 3 below), a third data prefetcher 223, and a fourth data prefetcher 224, respectively; when implemented, these data prefetchers may, for example, share part of the processing logic but operate on different (portions of) hardware resources.
As shown in FIG. 2A, during processor operation, after it is determined that the first thread T1 and the second thread T2 share code, the first data prefetcher corresponding to the first thread T1 may be used as the shared data prefetcher 220 for data prefetch training and subsequent data prefetch operations; or at least part of the hardware resources of the first data prefetcher corresponding to the first thread T1 and at least part of the hardware resources of the second data prefetcher corresponding to the second thread T2 may be combined for use by the shared data prefetcher 220. The third thread T3 and the fourth thread T4 still perform data prefetching using their previously corresponding data prefetchers 223 and 224, respectively.
For example, the 4 entries originally occupied by the first data prefetcher and the 4 entries originally occupied by the second data prefetcher may be used for the shared data prefetcher 220, so that the shared data prefetcher 220 has 8 entries, while the third data prefetcher 223 and the fourth data prefetcher 224 each still occupy 4 entries. The shared data prefetcher 220 is thus significantly improved in prefetch processing power relative to the previous first and second data prefetchers.
Alternatively, the hardware resources 200 may be reallocated, for example, such that the shared data prefetcher 220 occupies 6 entries (e.g., the 4 entries originally occupied by the first data prefetcher and 2 entries from the original second data prefetcher), the third data prefetcher 223 occupies 5 entries (its original 4 entries and 1 entry from the original second data prefetcher), and the fourth data prefetcher 224 occupies 5 entries (its original 4 entries and 1 entry from the original second data prefetcher). The shared data prefetcher 220, the third data prefetcher 223, and the fourth data prefetcher 224, which have all been reallocated hardware resources, are all improved in processing power relative to the original hardware resource allocation scheme.
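The two allocation policies above can be stated compactly as follows; the 16-entry budget and the 8/0/4/4 and 6/0/5/5 splits mirror the example, and everything else is an illustrative assumption.

    #include <array>

    struct Allocation {
        // entries[i] is the table-entry budget of prefetcher i. After merging,
        // index 0 is the shared prefetcher and index 1 (the old second
        // prefetcher) is merged away.
        std::array<int, 4> entries;
    };

    // Policy A: the shared prefetcher absorbs both original budgets.
    Allocation MergeAbsorb() {
        return {{8, 0, 4, 4}};   // shared = 4 + 4; T3 and T4 keep 4 each
    }

    // Policy B: part of the freed budget is redistributed to the other threads.
    Allocation MergeRedistribute() {
        return {{6, 0, 5, 5}};   // shared = 6; T3 and T4 grow to 5 each
    }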
Since the first thread T1 and the second thread T2 share code, the prefetch patterns of the first thread T1 and the second thread T2 are the same, so a request of any one of the code-sharing threads may be selected to train the shared data prefetcher 220, for example, the request R1 of the first thread T1 or the request R2 of the second thread T2; meanwhile, the request R3 of the third thread T3 may be used to train the third data prefetcher 223, and the request R4 of the fourth thread T4 may be used to train the fourth data prefetcher 224. For example, when the request R1 of the first thread T1 trains the shared data prefetcher 220, the filter 210 may be used to block or filter the request R2 issued by the second thread T2 according to the thread number in the request, so that the request R2 is not used for data prefetch training of the shared data prefetcher 220, thereby avoiding a training operation for the request R2 and reducing power consumption.
FIG. 2B illustrates a schematic diagram of a data prefetch operation when a first thread and a second thread are in a shared code state provided by at least one embodiment of the present disclosure.
As shown in FIG. 2B, since the first thread T1 and the second thread T2 have been determined to share code, they share the same shared data prefetcher 220, and the requests R1 and R2 issued by the first thread T1 and the second thread T2, respectively, may trigger the shared data prefetcher 220 to perform the corresponding data prefetch operations, with the prefetched data stored in the corresponding caches. The third thread T3 does not belong to the code-sharing threads, so only the request R3 issued by the third thread T3 triggers the prefetch operation of the third data prefetcher 223. The fourth thread T4 is similar to the third thread T3 and is not described again here.
It should be noted that the multithreading shown in FIGS. 2A-2B includes 4 threads, of which the first thread T1 and the second thread T2 share code, but embodiments of the present disclosure are not limited thereto; for example, the plurality of threads may further include other threads, such as a thread T5 and a thread T6. For example, in addition to the first thread T1 and the second thread T2 sharing code, the thread T5 and the thread T6 may also share code. For example, code may be shared among three of the threads, or among all of the threads.
For example, the data prefetching method provided in at least one embodiment of the present disclosure may further include: setting a shared state identifier in the entries trained by the shared data prefetcher to record that the shared data prefetcher is shared by the threads that share code.
In one example of the present disclosure, the plurality of threads in the processor includes 4 threads, namely threads T1-T4 described above. When the CPU begins to operate, the threads that share code are determined from the thread information of threads T1-T4, and a shared state flag of, for example, an additional 4 bits may be set in each entry trained by the data prefetcher. For example, when the shared state flag is 0001, the data prefetcher resources corresponding to the entry are used by the first thread T1 alone; when the shared state flag is 0101, the first thread T1 and the third thread T3 share code, and the data prefetcher resources corresponding to the entry are used by the first thread T1 and the third thread T3 together. The shared state identifier is not limited to 4 bits; any other representation that can indicate whether the data prefetcher resources are used by one thread alone or shared by multiple threads may also serve as the shared state identifier.
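A sketch of reading this per-entry identifier, under the assumption that threads T1-T4 map to bits 0-3 as in the example (so 0001 is T1 alone and 0101 is T1 plus T3):

    #include <cstdint>

    // True when the entry's prefetcher resources are used by the given thread
    // (thread_index 0..3 for T1..T4 under the assumed mapping).
    inline bool EntryUsedByThread(uint8_t share_flag, int thread_index) {
        return ((share_flag >> thread_index) & 1u) != 0;
    }

    // True when more than one bit is set, i.e. the entry is shared.
    inline bool EntryIsShared(uint8_t share_flag) {
        return (share_flag & (share_flag - 1)) != 0;
    }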
The data prefetching method provided in at least one embodiment of the present disclosure may further include: according to the thread information of the plurality of threads, in response to the first thread and the second thread no longer sharing code, obtaining the first data prefetcher and the second data prefetcher from the shared data prefetcher for the first thread and the second thread to perform data prefetching, respectively.
For example, when the first thread and the second thread no longer share code, the first data prefetcher and the second data prefetcher are used for the process of the first thread and the second thread performing data prefetching processing, respectively.
FIG. 3 illustrates a schematic diagram of a data prefetch operation when a first thread and a second thread are in an unshared code state provided by at least one embodiment of the present disclosure.
Referring to FIG. 3, the first thread T1 and the second thread T2 switch to performing data prefetching using the first data prefetcher 221 and the second data prefetcher 222, respectively; the third data prefetcher and the fourth data prefetcher are still used by the third thread T3 and the fourth thread T4, respectively, for their data prefetching, which is not described again here.
For example, when the first data prefetcher and the second data prefetcher are obtained from the shared data prefetcher, the data inside the shared data prefetcher may be kept unchanged, and the first data prefetcher and the second data prefetcher obtained simply by reallocating the hardware resources of the shared data prefetcher.
For example, when the first data prefetcher and the second data prefetcher are obtained from the shared data prefetcher, the first data prefetcher may retain at least a first part of the entries of the shared data prefetcher and the second data prefetcher may retain at least a second part of the entries, where the retained first part and second part are at least partially identical to each other. For example, the first part and the second part may be the entries with higher confidence in the shared data prefetcher.
For example, when the first data prefetcher and the second data prefetcher are obtained from the shared data prefetcher, the second data prefetcher may instead be left empty, that is, the second data prefetcher retains no entries of the shared data prefetcher.
For example, in response to the second data prefetcher being empty, requests issued by the second thread T2 may be used for data prefetch training of the second data prefetcher, so that a prefetch pattern for the second thread T2 is trained; that is, the second data prefetcher is retrained from scratch for the data prefetching of the second thread T2.
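One way to realize this split is sketched below: the first prefetcher keeps the highest-confidence entries and the second starts empty and retrains. The entry layout, the confidence field, and the budget parameter are illustrative assumptions.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Entry { uint64_t pc; int confidence; };

    struct SplitResult {
        std::vector<Entry> first;   // retained by the first data prefetcher
        std::vector<Entry> second;  // left empty: the second retrains from scratch
    };

    SplitResult SplitShared(std::vector<Entry> shared, std::size_t first_budget) {
        // Keep the entries the shared prefetcher was most confident about.
        std::sort(shared.begin(), shared.end(),
                  [](const Entry& a, const Entry& b) { return a.confidence > b.confidence; });
        if (shared.size() > first_budget) shared.resize(first_budget);
        return {std::move(shared), {}};
    }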
The data prefetching method provided in the above embodiment is not only applicable to a data prefetcher, but also to an instruction prefetcher, which is not limited by the embodiments of the present disclosure.
Fig. 4 is a schematic structural diagram of a data prefetching apparatus according to at least one embodiment of the present disclosure.
As shown in FIG. 4, the data prefetching apparatus 400 includes a thread information acquisition module 410 and a data prefetch processing module 420. The thread information acquisition module 410 is configured to acquire thread information of a plurality of threads processed in a processor. The data prefetch processing module 420 is configured to, according to the thread information of the plurality of threads, in response to threads among the plurality of threads sharing code, cause the code-sharing threads to share the same shared data prefetcher for data prefetching, where the code-sharing threads include at least a first thread and a second thread.
For example, the data prefetch processing module of the above embodiment includes a prefetch training module and a request response module. The prefetch training module is configured to perform data prefetch training on the shared data prefetcher using requests issued by the first thread, and is further configured to block or filter requests issued by the second thread so that they are not used for data prefetch training of the shared data prefetcher. The request response module is configured to respond to requests issued by the threads that share code, respectively, using the shared data prefetcher.
The data prefetching apparatus of the above embodiment further includes a sharing state determining module and a resource allocation module. The sharing state determining module is configured to determine, according to the thread information of the plurality of threads, that the first thread and the second thread share code, wherein the first thread uses a first data prefetcher for data prefetching and the second thread uses a second data prefetcher for data prefetching.
The resource allocation module is configured to combine the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher in response to the first thread and the second thread sharing code, in which case the first thread and the second thread share the same shared data prefetcher for data prefetching. Combining the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher includes: using the first data prefetcher as the shared data prefetcher; or combining at least part of the hardware resources of the first data prefetcher and at least part of the hardware resources of the second data prefetcher for the shared data prefetcher.
The resource allocation module is also configured to obtain the first data prefetcher and the second data prefetcher from the shared data prefetcher, for the first thread and the second thread respectively to perform data prefetching, in response to the first thread and the second thread of the plurality of threads no longer sharing code.
For example, the resource allocation module causes the first data prefetcher to retain at least a first part of the entries of the shared data prefetcher, and causes the second data prefetcher to retain at least a second part of the entries of the shared data prefetcher or to be empty. For example, the first part of the entries retained by the first data prefetcher and the second part retained by the second data prefetcher are at least partially identical to each other.
The data prefetching apparatus of the above embodiment further includes an identifier setting module configured to set a shared state identifier in the entries trained by the shared data prefetcher to record that the shared data prefetcher is shared by the threads that share code.
The data prefetching apparatus of the above-described embodiments of the present disclosure or the modules therein may be implemented, for example, by hardware, software, firmware, or any combination thereof, and the embodiments of the present disclosure are not limited thereto.
Fig. 5 is a schematic structural diagram of another data prefetching apparatus according to at least one embodiment of the present disclosure.
As shown in fig. 5, the data prefetching apparatus 500 includes a processing unit 520 and a memory 510 communicatively coupled to the processing unit 520. The memory 510 has stored thereon computer readable instructions. The processing unit 520 executes computer-readable instructions stored in the memory 510 to implement the data prefetching method provided by any of the embodiments of the present disclosure.
For example, the memory 510 and the processing unit 520 may communicate with each other directly or indirectly. For example, in some examples, as shown in FIG. 5, the data prefetching apparatus 500 may further include a system bus 530, and the memory 510 and the processing unit 520 may communicate with each other via the system bus 530, e.g., the processing unit 520 may access the memory 510 via the system bus 530. For example, in other examples, components such as memory 510 and processing unit 520 may communicate via a Network On Chip (NOC) connection.
For example, the processing unit 520 may control other components in the data prefetching apparatus to perform a desired function. The processing unit 520 may be a Central Processing Unit (CPU), tensor Processor (TPU), network Processor (NP), or Graphics Processor (GPU) or the like having data processing and/or program execution capabilities, or may be a Digital Signal Processor (DSP), application Specific Integrated Circuit (ASIC), field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
For example, memory 510 may comprise any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, random Access Memory (RAM) and/or cache memory (cache) and the like. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disc read-only memory (CD-ROM), USB memory, flash memory, and the like.
For example, one or more computer-readable instructions may be stored on memory 510 and processing unit 520 may execute the computer-readable instructions to implement various functions. Various applications and various data, such as instruction processing code and various data used and/or generated by the applications, may also be stored in the computer readable storage medium.
For example, some of the computer instructions stored by memory 510, when executed by processing unit 520, may perform one or more steps in accordance with the data pre-fetching method described above.
For example, as shown in FIG. 5, the data prefetching apparatus 500 may further include an input interface 540 that allows an external device to communicate with the data prefetching apparatus 500. For example, the input interface 540 may be used to receive instructions from an external computer device, from a user, and the like. The data prefetching apparatus 500 may further include an output interface 550 interconnecting the data prefetching apparatus 500 and one or more external devices; for example, the data prefetching apparatus 500 may output data and the like via the output interface 550.
It should be noted that, the data prefetching apparatus provided by the embodiment of the present disclosure is exemplary and not limited, and the data prefetching apparatus may further include other conventional components or structures according to practical application requirements, for example, to implement the necessary functions of the data prefetching apparatus, those skilled in the art may set other conventional components or structures according to specific application scenarios, which the embodiment of the present disclosure is not limited to.
At least one embodiment of the present disclosure also provides a processor, such as an SMT processor, that includes the data pre-fetching apparatus provided by any of the embodiments of the present disclosure. The maximum number of threads supportable by an SMT processor may be, for example, 2, 4, 8, etc., may be a single-core or multi-core processor, for example, a processor core may employ a microarchitecture of X86, ARM, RISC-V, etc., may include one or more levels of cache, and embodiments of the present disclosure are not limited in this respect.
At least one embodiment of the present disclosure also provides a computer-readable storage medium.
Fig. 6 is a schematic diagram of a computer-readable storage medium according to at least one embodiment of the present disclosure.
For example, as shown in FIG. 6, the computer-readable storage medium 600 stores computer-readable instructions 610, which, when executed by a computer (including a processor), can implement the data prefetching method provided by any of the embodiments of the present disclosure.
For example, one or more computer-readable instructions may be stored on computer-readable storage medium 600. Some of the computer readable instructions stored on computer readable storage medium 600 may be, for example, instructions for implementing one or more steps of the data pre-fetching method described above.
For example, a computer-readable storage medium may include a memory component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), compact disc read-only memory (CD-ROM), flash memory, any combination of the foregoing, or other suitable storage media. For example, the computer-readable storage medium 600 may include the memory 510 in the data prefetching apparatus 500 described previously.
For the technical effects of the computer-readable storage medium provided by the embodiments of the present disclosure, reference may be made to the corresponding descriptions of the data prefetching method in the above embodiments, which are not repeated here.
At least some embodiments of the present disclosure also provide an electronic device comprising a processor of any one of the embodiments described above. Fig. 7 is a schematic block diagram of an electronic device provided in at least one embodiment of the present disclosure.
The electronic device in the embodiments of the present disclosure may be implemented as, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), or an in-vehicle terminal (e.g., an in-vehicle navigation terminal), or a stationary terminal such as a digital TV or a desktop computer.
The electronic device 700 shown in fig. 7 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
For example, as shown in fig. 7, in some examples, the electronic device 700 includes a processor 701, which may include the processor of any of the embodiments described above and may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the computer system. The processor 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
For example, the following components may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, and the like; an output device 707 including a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; a storage 708 including, for example, a magnetic tape, a hard disk, and the like; and a communication device 709, which may include a network interface card such as a LAN card or a modem. The communication device 709 may allow the electronic device 700 to perform wireless or wired communication with other apparatuses to exchange data, performing communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, may be mounted on the drive 710 as needed, so that a computer program read therefrom is installed into the storage 708 as needed. While fig. 7 illustrates the electronic device 700 as including various devices, it should be understood that not all of the illustrated devices are required to be implemented or included; more or fewer devices may alternatively be implemented or included.
For example, the electronic device 700 may further include a peripheral interface (not shown) and the like. The peripheral interface may be any of various types of interfaces, such as a USB interface or a Lightning interface. The communication device 709 may communicate with networks and other devices through wireless communication, the networks being, for example, the Internet, an intranet, and/or a wireless network such as a cellular telephone network, a wireless local area network (LAN), and/or a metropolitan area network (MAN). The wireless communication may use any of a variety of communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Bluetooth, Wi-Fi (e.g., based on the IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n standards), Voice over Internet Protocol (VoIP), WiMAX, protocols for e-mail, instant messaging, and/or Short Message Service (SMS), or any other suitable communication protocol.
In addition to the above exemplary description, the following points should be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; for other structures, reference may be made to conventional designs.
(2) In case of no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The foregoing describes merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the disclosure, which is defined by the appended claims.

Claims (19)

1. A data prefetching method, comprising:
acquiring thread information of a plurality of threads processed in a processor; and
according to the thread information of the plurality of threads, in response to at least a first thread and a second thread of the plurality of threads sharing code, causing the threads sharing the code to share a same shared data prefetcher to perform data prefetching processing.
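As an illustrative aside, a minimal C++ sketch of this selection step follows; it assumes the thread information carries an identifier of the code a thread executes, and every type and name in it is hypothetical rather than taken from the patent.

#include <cstdint>
#include <unordered_map>

// Hypothetical thread information: "code_base" stands in for whatever
// field identifies the code that a thread is executing.
struct ThreadInfo {
    int           thread_id;
    std::uint64_t code_base;
};

class PrefetcherAllocator {
public:
    // Threads whose thread information shows the same code identifier are
    // mapped to one shared data prefetcher; other threads get their own.
    int prefetcherFor(const ThreadInfo& t) {
        auto it = byCode_.find(t.code_base);
        if (it != byCode_.end())
            return it->second;             // reuse the shared prefetcher
        int id = nextId_++;                // allocate a new prefetcher
        byCode_.emplace(t.code_base, id);
        return id;
    }

private:
    std::unordered_map<std::uint64_t, int> byCode_;
    int nextId_ = 0;
};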
2. The data prefetching method of claim 1, wherein causing the threads sharing the code to share the same shared data prefetcher to perform the data prefetching processing comprises:
performing data prefetch training on the shared data prefetcher by using a request issued by the first thread.
3. The data prefetching method of claim 2, wherein performing the data prefetch training on the shared data prefetcher by using the request issued by the first thread further comprises:
blocking or filtering requests issued by the threads of the shared code other than the first thread, so that the requests issued by the threads of the shared code other than the first thread are not used for the data prefetch training of the shared data prefetcher.
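A minimal C++ sketch of the training and response paths described in claims 2-4 follows; the trainer-thread designation and all identifiers are hypothetical, and the two placeholder bodies stand in for implementation-defined hardware behavior.

#include <cstdint>
#include <cstdio>

struct Request { int thread_id; std::uint64_t addr; };

class SharedDataPrefetcher {
public:
    explicit SharedDataPrefetcher(int trainerThread)
        : trainerId_(trainerThread) {}

    void onRequest(const Request& r) {
        // Only the designated first thread's requests reach the training
        // logic; requests from the other code-sharing threads are filtered
        // out of the training path.
        if (r.thread_id == trainerId_)
            train(r.addr);
        // The shared prefetcher still responds to requests issued by every
        // thread of the shared code.
        respond(r);
    }

private:
    void train(std::uint64_t addr) {
        // Placeholder for the pattern-learning step (e.g., stride detection).
        std::printf("train on 0x%llx\n", (unsigned long long)addr);
    }
    void respond(const Request& r) {
        // Placeholder for servicing the request and issuing prefetches.
        std::printf("respond to thread %d\n", r.thread_id);
    }

    int trainerId_;
};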
4. The data prefetching method according to claim 1 or 2, wherein causing the threads sharing the code to share the same shared data prefetcher to perform the data prefetching processing further comprises:
using the shared data prefetcher to respond to requests issued by the threads of the shared code, respectively.
5. The data prefetching method of claim 1, further comprising:
determining, according to the thread information of the plurality of threads, that the first thread and the second thread share code,
wherein the first thread performs the data prefetching processing by using a first data prefetcher, and the second thread performs the data prefetching processing by using a second data prefetcher;
wherein causing the first thread and the second thread to share the same shared data prefetcher to perform the data prefetching processing comprises:
combining the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher.
6. The data prefetching method of claim 5, wherein combining the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher comprises:
using the first data prefetcher as the shared data prefetcher.
7. The data prefetching method of claim 5, wherein combining the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher comprises:
combining at least a portion of the hardware resources of the first data prefetcher with at least a portion of the hardware resources of the second data prefetcher for use as the shared data prefetcher.
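The two merge variants of claims 6 and 7 can be sketched as follows; modeling a prefetcher as a table of entries is an assumption made purely for illustration, and all names are hypothetical.

#include <cstdint>
#include <vector>

struct Entry { std::uint64_t tag; std::int64_t stride; };
struct DataPrefetcher { std::vector<Entry> table; };

// Claim 6 variant: the first thread's prefetcher simply becomes the shared one.
DataPrefetcher* adoptFirstAsShared(DataPrefetcher* first,
                                   DataPrefetcher* /*second*/) {
    return first;
}

// Claim 7 variant: combine at least part of the resources of both
// prefetchers, modeled here as concatenating their entry tables.
DataPrefetcher combineAsShared(const DataPrefetcher& a,
                               const DataPrefetcher& b) {
    DataPrefetcher shared;
    shared.table = a.table;
    shared.table.insert(shared.table.end(), b.table.begin(), b.table.end());
    return shared;
}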
8. The data prefetching method of claim 1, further comprising:
setting a sharing state identifier in an entry trained by the shared data prefetcher to record that the shared data prefetcher is shared by the threads of the shared code.
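The sharing state identifier of claim 8 can be pictured as one extra bit per trained entry; the sketch below is a hypothetical illustration, not the patent's hardware layout.

#include <cstdint>
#include <vector>

struct TrainedEntry {
    std::uint64_t tag;
    std::int64_t  stride;
    bool          shared = false;   // sharing state identifier
};

// Records, in every trained entry, that the prefetcher is currently shared
// by the threads of the shared code.
void markEntriesShared(std::vector<TrainedEntry>& entries) {
    for (auto& e : entries)
        e.shared = true;
}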
9. The data prefetching method of claim 1, further comprising:
according to the thread information of the plurality of threads, in response to the first thread and the second thread of the plurality of threads no longer sharing code, obtaining a first data prefetcher and a second data prefetcher from the shared data prefetcher for the first thread and the second thread to perform the data prefetching processing, respectively.
10. The data prefetching method of claim 9, wherein obtaining the first data prefetcher and the second data prefetcher from the shared data prefetcher for the first thread and the second thread to perform the data prefetching processing, respectively, comprises:
causing the first data prefetcher to retain at least a first partial entry of the shared data prefetcher; and
causing the second data prefetcher to retain at least a second partial entry of the shared data prefetcher, or causing the second data prefetcher to be empty.
11. The data prefetching method of claim 10, wherein the first partial entry retained by the first data prefetcher and the second partial entry retained by the second data prefetcher are at least partially identical to each other.
12. The data prefetching method of claim 10, wherein, in response to the second data prefetcher being empty after the first data prefetcher and the second data prefetcher are obtained from the shared data prefetcher, a request issued by the second thread is used for data prefetch training of the second data prefetcher.
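A minimal C++ sketch of the split described in claims 9-12 follows; again, representing a prefetcher as an entry table is an illustrative assumption, and all names are hypothetical.

#include <cstdint>
#include <utility>
#include <vector>

struct Entry { std::uint64_t tag; std::int64_t stride; };
struct DataPrefetcher { std::vector<Entry> table; };

// Splits the shared prefetcher back into two per-thread prefetchers: the
// first retains the shared entries, and the second either keeps a (possibly
// overlapping) copy or starts empty, in which case it is retrained from the
// second thread's own subsequent requests.
std::pair<DataPrefetcher, DataPrefetcher>
splitShared(const DataPrefetcher& shared, bool secondStartsEmpty) {
    DataPrefetcher first, second;
    first.table = shared.table;
    if (!secondStartsEmpty)
        second.table = shared.table;
    return {first, second};
}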
13. A data prefetching apparatus, comprising:
a thread information acquisition module configured to acquire thread information of a plurality of threads processed in a processor; and
a data prefetching processing module configured to, according to the thread information of the plurality of threads and in response to at least a first thread and a second thread of the plurality of threads sharing code, cause the threads sharing the code to share a same shared data prefetcher to perform data prefetching processing.
14. The data prefetching apparatus of claim 13, wherein the data prefetching processing module comprises:
a prefetch training module configured to perform data prefetch training on the shared data prefetcher by using a request issued by the first thread; and
a request response module configured to use the shared data prefetcher to respond to requests issued by the threads of the shared code, respectively.
15. The data prefetching apparatus of claim 13, further comprising:
a sharing state determination module configured to determine, according to the thread information of the plurality of threads, that the first thread and the second thread share code, wherein the first thread performs the data prefetching processing by using a first data prefetcher, and the second thread performs the data prefetching processing by using a second data prefetcher; and
a resource allocation module configured to:
combine the first data prefetcher and the second data prefetcher to obtain the shared data prefetcher, in response to the first thread and the second thread sharing code; or
obtain the first data prefetcher and the second data prefetcher from the shared data prefetcher for the first thread and the second thread to perform the data prefetching processing, respectively, in response to the first thread and the second thread no longer sharing code.
16. The data prefetching apparatus of claim 13, further comprising:
an identifier setting module configured to set a sharing state identifier in an entry trained by the shared data prefetcher to record that the shared data prefetcher is shared by the threads of the shared code.
17. A data prefetching apparatus, comprising a processing unit and a memory communicatively connected to the processing unit; wherein
the memory stores computer-readable instructions; and
the processing unit executes the computer-readable instructions stored in the memory to implement the data prefetching method of any one of claims 1-12.
18. A processor, comprising the data prefetching apparatus of any one of claims 13-17.
19. A computer-readable storage medium having computer-readable instructions stored therein, wherein the computer-readable instructions, when executed by a processor, implement the data prefetching method of any one of claims 1-12.
CN202410350103.4A 2024-03-25 2024-03-25 Data prefetching method and device, processor and computer readable storage medium Pending CN118035131A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410350103.4A CN118035131A (en) 2024-03-25 2024-03-25 Data prefetching method and device, processor and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN118035131A 2024-05-14

Family

ID=90986010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410350103.4A Pending CN118035131A (en) 2024-03-25 2024-03-25 Data prefetching method and device, processor and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN118035131A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination