CN118012788A - Data processor, data processing method, electronic device, and storage medium


Info

Publication number: CN118012788A (application granted and published as CN118012788B)
Application number: CN202410420140.8A
Authority: CN (China)
Prior art keywords: data, data access, access instruction, level, cache
Legal status: Granted; Active
Other languages: Chinese (zh)
Inventor: name withheld at the inventor's request
Current and original assignees: Shanghai Bi Ren Technology Co ltd; Beijing Bilin Technology Development Co ltd
Application filed by Shanghai Bi Ren Technology Co ltd and Beijing Bilin Technology Development Co ltd, with priority claimed from CN202410420140.8A.

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data processor, a data processing method, an electronic device, and a non-transitory computer-readable storage medium. The data processor comprises an N-level cache, a memory, and N levels of data proxy modules, each level of data proxy module corresponding one-to-one with a level of the N-level cache. The data proxy modules are configured to acquire the data access instruction sequence in the task currently executed by the data processor, adjust the instruction sending order of that sequence according to the storage position of the destination data of each data access instruction, send the data access instructions in turn according to the instruction sending order, and cache the destination data returned by each data access instruction into the corresponding level-1 data proxy module. The data processor shortens register occupation time, prevents data access instructions whose data is stored far away from crowding out the bandwidth of instructions whose data is stored nearby, and greatly improves data access efficiency.

Description

Data processor, data processing method, electronic device, and storage medium
Technical Field
Embodiments of the present disclosure relate to a data processor, a data processing method, an electronic device, and a non-transitory computer readable storage medium.
Background
A data access instruction (e.g., a load instruction) is a basic instruction in computer architecture. It loads data from a storage unit such as memory into a register, reading the data into a specified register for subsequent operations and processing. When a data access instruction is executed, the address to be read must be specified, and the read data is stored in a destination register. Taking reading data from memory as an example, a data access instruction generally involves the following steps:
1. Load the memory address into the Memory Address Register (MAR);
2. Load the data at that address into the Memory Data Register (MDR);
3. Transfer the data from the MDR to the destination register.
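The following minimal C++ sketch models these three steps. The Core structure and its fields are illustrative assumptions for the example, not the hardware described by this disclosure.

    #include <cstdint>
    #include <vector>

    struct Core {
        uint64_t mar = 0;        // Memory Address Register
        uint64_t mdr = 0;        // Memory Data Register
        uint64_t regs[32] = {};  // register file

        // load: read the destination data at `addr` into register `rd`
        void load(const std::vector<uint64_t>& memory, uint64_t addr, int rd) {
            mar = addr;          // step 1: latch the address into the MAR
            mdr = memory[mar];   // step 2: read memory into the MDR
            regs[rd] = mdr;      // step 3: transfer the MDR to the destination register
        }
    };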
Disclosure of Invention
The present disclosure provides a data processor comprising an N-level cache and a memory. The i-th level of the N-level cache comprises at least one level-i cache node, at least some of the level-i cache nodes share one level-(i+1) cache node, N is a positive integer greater than 1, and i is any positive integer between 1 and N-1. The N-th level of the N-level cache is electrically connected to the memory, and each level-1 cache node is connected to a corresponding computing core. The data processor further comprises N levels of data proxy modules, each level of data proxy module corresponding one-to-one with a level of the N-level cache. The N levels of data proxy modules are configured to acquire the data access instruction sequence in the task currently executed by the data processor, adjust the instruction sending order of that sequence according to the storage position in the memory of the destination data of each data access instruction, send the data access instructions in turn according to the instruction sending order, and cache the destination data returned by each data access instruction into the corresponding level-1 data proxy module.
For example, in the data processor provided in at least one embodiment of the present disclosure, when adjusting the instruction sending order of the data access instruction sequence according to the storage position in the memory of the destination data of each data access instruction, the N levels of data proxy modules perform the following operations: obtaining the physical distance corresponding to each data access instruction, i.e., the distance between the storage position in the memory of that instruction's destination data and the computing core in the data processor; and adjusting the instruction sending order of the data access instruction sequence according to the physical distance corresponding to each data access instruction, such that a data access instruction with a smaller corresponding physical distance is sent earlier.
For example, in the data processor provided in at least one embodiment of the present disclosure, when adjusting the instruction sending order of the data access instruction sequence according to the physical distance corresponding to each data access instruction, the N levels of data proxy modules perform the following operations: classifying the data access instructions into different channels according to their corresponding physical distances, with different channels holding data access instructions of different physical distances; and arranging the data access instructions channel by channel, in order of increasing physical distance, to obtain the instruction sending order. In the instruction sending order, the data access instructions belonging to a first channel are sent first and those belonging to a second channel are sent last; data access instructions belonging to the same channel are sent serially according to their relative order in the data access instruction sequence; the physical distance corresponding to the instructions in the first channel is the smallest, and that corresponding to the instructions in the second channel is the largest.
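As an illustration, the following C++ sketch reorders a sequence of access instructions by channel while preserving program order within each channel. The AccessInstr type, the channel_of mapping, and the block_bounds parameter are assumptions made for the example; the disclosure does not fix how read addresses map to channels.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct AccessInstr {
        uint64_t addr;  // read address of the destination data
        int      seq;   // position in the original data access instruction sequence
    };

    // Assumed mapping from a read address to a channel index, where channel 0
    // holds the nearest memory blocks and higher indices are farther away.
    // `block_bounds` holds ascending address boundaries between memory blocks.
    int channel_of(uint64_t addr, const std::vector<uint64_t>& block_bounds) {
        for (std::size_t c = 0; c < block_bounds.size(); ++c)
            if (addr < block_bounds[c]) return static_cast<int>(c);
        return static_cast<int>(block_bounds.size());
    }

    std::vector<AccessInstr> send_order(std::vector<AccessInstr> instrs,
                                        const std::vector<uint64_t>& block_bounds) {
        // stable_sort keeps the program-order relation inside each channel,
        // so instructions of equal distance are sent in their original order
        std::stable_sort(instrs.begin(), instrs.end(),
                         [&](const AccessInstr& a, const AccessInstr& b) {
                             return channel_of(a.addr, block_bounds) <
                                    channel_of(b.addr, block_bounds);
                         });
        return instrs;
    }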
For example, in the data processor provided in at least one embodiment of the present disclosure, when adjusting the instruction sending order of the data access instruction sequence according to the storage position in the memory of the destination data of each data access instruction, the N levels of data proxy modules perform the following operations: acquiring the priority of each data access instruction; obtaining the physical distance corresponding to each data access instruction, i.e., the distance between the storage position in the memory of that instruction's destination data and the computing core in the data processor; and adjusting the instruction sending order according to the priorities and the physical distances, such that a data access instruction with a higher priority is sent earlier and, among instructions of equal priority, one with a smaller physical distance is sent earlier.
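A sketch of this priority-then-distance ordering is given below; the field names and the convention that a larger priority value means more urgent are assumptions of the example.

    #include <algorithm>
    #include <vector>

    struct PrioritizedInstr {
        int priority;  // higher value = sent earlier (assumed convention)
        int distance;  // physical distance class of the destination data
        int seq;       // original program order, preserved on full ties by stable_sort
    };

    void order_for_sending(std::vector<PrioritizedInstr>& instrs) {
        std::stable_sort(instrs.begin(), instrs.end(),
                         [](const PrioritizedInstr& a, const PrioritizedInstr& b) {
                             if (a.priority != b.priority) return a.priority > b.priority;
                             return a.distance < b.distance;  // tie-break on distance
                         });
    }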
For example, in a data processor provided in at least one embodiment of the present disclosure, acquiring the priority of each data access instruction comprises: determining the priority of each data access instruction according to the time at which its destination data was returned when the task was pre-run. During the pre-run, the data access instructions are sent in turn in their order in the data access instruction sequence, and the time at which each instruction's destination data is returned is collected.
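The following sketch illustrates one way such priorities could be derived from a profiling pre-run. The disclosure only states that priorities are determined from the measured return times, so the exact mapping used here (earlier return, higher priority) is an assumption of the example.

    #include <algorithm>
    #include <vector>

    struct ProfiledInstr {
        int    id;           // request identifier
        double return_time;  // measured when the task was pre-run in program order
        int    priority = 0;
    };

    void assign_priorities(std::vector<ProfiledInstr>& instrs) {
        std::sort(instrs.begin(), instrs.end(),
                  [](const ProfiledInstr& a, const ProfiledInstr& b) {
                      return a.return_time < b.return_time;
                  });
        // assumed mapping: earlier return during the pre-run -> higher priority
        for (std::size_t r = 0; r < instrs.size(); ++r)
            instrs[r].priority = static_cast<int>(instrs.size() - r);
    }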
For example, in the data processor provided in at least one embodiment of the present disclosure, caching the destination data returned by each data access instruction into the corresponding level-1 data proxy module comprises: for each data access instruction, caching the returned destination data into the level-1 cache node or shared memory connected to the computing core that uses that data, where the corresponding level-1 data proxy module comprises a plurality of cache blocks, the cache blocks are mapped to the level-1 cache node or the shared memory, and the cache blocks are used to cache the received destination data.
For example, in the data processor provided in at least one embodiment of the present disclosure, the at least one level-i cache node corresponds one-to-one with at least one level-i data proxy module; the level-i data proxy modules corresponding to at least some level-i cache nodes are electrically connected to the level-(i+1) data proxy module corresponding to the shared level-(i+1) cache node; the level-1 data proxy module corresponding to each level-1 cache node is electrically connected to at least one computing core, that computing core being electrically connected to the level-1 cache node; and the level-N data proxy module corresponding to the level-N cache is electrically connected to the memory.
For example, in a data processor provided in at least one embodiment of the present disclosure, each data proxy module, whether a level-i data proxy module or the level-N data proxy module, comprises a scheduler, a request state list, a data request list, and a plurality of cache blocks. The data request list caches the data access instructions to be sent. The scheduler sends the data access instructions cached in the data request list in turn, according to the instruction sending order, and forwards received destination data to the corresponding computing core or to a cache block in the electrically connected lower-level data proxy module. The request state list comprises a plurality of state items, each of which indicates the cache block associated with the data access instruction corresponding to that state item and the instruction's position in the instruction sending order. The cache blocks cache the received destination data.
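A structural sketch of this bookkeeping in C++ is shown below; all field and type names are assumptions, and the scheduler logic is deliberately omitted.

    #include <cstdint>
    #include <deque>
    #include <vector>

    struct AccessInstr {
        uint64_t read_addr;  // input parameter: read address of the destination data
        int      dest_reg;   // output parameter: destination register
    };

    struct StateItem {
        int  request_id   = -1;     // which data access instruction this entry tracks
        int  block_id     = -1;     // associated cache block number (-1: none yet)
        int  channel      = 0;      // channel number: position in the send order
        int  priority     = 0;      // priority: refines the position in the send order
        bool request_sent = false;  // request sending state
        bool request_recv = false;  // request receiving state
        bool data_ready   = false;  // state of the returned destination data
    };

    struct DataProxyModule {
        std::deque<AccessInstr>           data_request_list;   // instructions waiting to be sent
        std::vector<StateItem>            request_state_list;  // one state item per request
        std::vector<std::vector<uint8_t>> cache_blocks;        // hold returned destination data
        // The scheduler (not modeled here) drains data_request_list in send order
        // and forwards returned data to the computing core or the level below.
    };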
For example, in the data processor provided in at least one embodiment of the present disclosure, a plurality of channels are provided in the data request list of the data proxy module for caching the data access instructions to be sent, with different channels holding data access instructions of different physical distances. The scheduler arranges the data access instructions channel by channel, in order of increasing physical distance, to obtain the instruction sending order of the instructions cached in the data request list, and sends the instructions in turn according to that order. The physical distance corresponding to each data access instruction is the distance between the storage position in the memory of that instruction's destination data and the computing core in the data processor, and is determined from the read address of the destination data that serves as the instruction's input parameter.
For example, in a data processor provided in at least one embodiment of the present disclosure, the data proxy module is configured to: in response to an unassociated cache block existing in the data proxy module, associate the unassociated cache block with one data access instruction selected, according to the instruction sending order, from the data request list of the data proxy module.

For example, in a data processor provided in at least one embodiment of the present disclosure, each cache block has a unique cache block number and each data access instruction in the data request list has a unique request number. Associating the unassociated cache block with the selected data access instruction comprises: setting the request number of the data access instruction and the cache block number of the unassociated cache block in the state item corresponding to that instruction, so as to associate the two.
For example, in the data processor provided in at least one embodiment of the present disclosure, the level-(i+1) data proxy module is further configured to: in response to the destination data cached in any of its cache blocks having been sent to the level-i data proxy module, clear the data access instruction associated with that cache block from the data request list of the level-(i+1) data proxy module, and clear the state item corresponding to the cleared instruction from its request state list.
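The following sketch illustrates this cache-block lifecycle: association on the way in, release after the data has been forwarded to the level below. The ProxyState layout and helper names are assumptions of the example.

    #include <vector>

    struct StateItem {
        int request_id = -1;
        int block_id   = -1;  // -1 means no cache block associated yet
    };

    struct ProxyState {
        std::vector<StateItem> request_state_list;  // assumed ordered by send order
        std::vector<bool>      block_free;          // one flag per cache block
    };

    // Associate a free cache block with the first request (in send order)
    // that has no block yet, by writing the block number into its state item.
    void associate_free_block(ProxyState& p, int block) {
        for (StateItem& s : p.request_state_list) {
            if (s.block_id == -1) {
                s.block_id = block;
                p.block_free[block] = false;
                return;
            }
        }
    }

    // After the block's destination data has been sent to the level below,
    // clear the request's state item and free the block for reuse (clearing
    // the instruction from the data request list is elided in this sketch).
    void release_after_forward(ProxyState& p, int block) {
        for (StateItem& s : p.request_state_list) {
            if (s.block_id == block) {
                s = StateItem{};             // clear the state item
                p.block_free[block] = true;  // block can be re-associated
                return;
            }
        }
    }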
For example, in a data processor provided in at least one embodiment of the present disclosure, the scheduler of the level-N data proxy module is configured to: send the cached data access instructions to the memory in turn, according to the instruction sending order of the instructions cached in the data request list of the level-N data proxy module; and receive the destination data returned from the memory in turn, storing each returned destination datum into the cache block associated with the corresponding data access instruction.
For example, in a data processor provided in at least one embodiment of the present disclosure, the scheduler of the level-i data proxy module is configured to: send a data request signal to the electrically connected level-(i+1) data proxy module, and cache the returned data into the cache block associated with the data access instruction corresponding to the sent data request signal. The scheduler of the level-(i+1) data proxy module is configured to: in response to receiving the data request signal, send the destination data of the data access instruction corresponding to the data request signal to the level-i data proxy module.
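A minimal sketch of this two-sided handshake follows. The ProxyLevel type and the convention of indexing cached data by request number are assumptions, and the sketch presumes the upper level has already fetched the requested data into one of its blocks.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using Data = std::vector<uint8_t>;

    struct ProxyLevel {
        std::unordered_map<int, Data> blocks_by_request;  // request id -> cached data
        ProxyLevel* upper = nullptr;                      // level i+1 (nullptr at level N)

        // Level-i side: send a data request signal upward and cache the
        // returned data in the block associated with that request.
        void request_from_upper(int request_id) {
            blocks_by_request[request_id] = upper->serve_request(request_id);
        }

        // Level-(i+1) side: answer a data request signal with the destination
        // data of the corresponding data access instruction.
        Data serve_request(int request_id) {
            return blocks_by_request.at(request_id);  // assumed already fetched
        }
    };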
For example, in the data processor provided in at least one embodiment of the present disclosure, the level-1 data proxy module is further configured to: before the task is executed, select at least one data access instruction from the data access instruction sequence in sequence order and perform an initialization operation on it, where the selected data access instruction is one executed by a computing core electrically connected to the level-1 data proxy module.

For example, in the data processor provided in at least one embodiment of the present disclosure, performing the initialization operation on the at least one data access instruction comprises: caching the at least one data access instruction into the data request list of the level-1 data proxy module; initializing the state item corresponding to the at least one data access instruction and storing it into the request state list of the level-1 data proxy module; and synchronizing the at least one data access instruction and the corresponding state item to the other levels of data proxy modules directly or indirectly electrically connected to the level-1 data proxy module.

For example, in the data processor provided in at least one embodiment of the present disclosure, the level-1 data proxy module is further configured to: in response to receiving a data access instruction sent by a computing core electrically connected to the level-1 data proxy module, determine the cache block associated with the sent instruction and send the data in the associated cache block to the destination register indicated by the instruction; clear the data access instruction associated with that cache block from the data request list of the level-1 data proxy module, and clear the corresponding state item from its request state list; and select at least one further data access instruction from the data access instruction sequence in sequence order and perform the initialization operation on it.
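The following sketch illustrates the level-1 behavior described in the three preceding paragraphs: pre-filling a window of requests, serving a compute-core access from the associated block, and refilling the freed slot. The types and the single-integer request identifiers are assumptions; state-item setup and synchronization to the other levels are elided.

    #include <algorithm>
    #include <cstdint>
    #include <deque>
    #include <unordered_map>
    #include <vector>

    struct Level1Proxy {
        std::deque<int> instruction_sequence;          // remaining request ids, program order
        std::unordered_map<int, uint64_t> block_data;  // request id -> returned destination data
        std::vector<int> pending;                      // data request list (request ids)

        // Initialization: take the next `n` instructions from the sequence
        // into the data request list.
        void init(std::size_t n) {
            while (n-- > 0 && !instruction_sequence.empty()) {
                pending.push_back(instruction_sequence.front());
                instruction_sequence.pop_front();
            }
        }

        // Hit path: a computing core issues request `id`; hand back the cached
        // data (to be loaded into the destination register), clear the entry,
        // and refill the freed slot from the instruction sequence.
        uint64_t on_core_access(int id) {
            uint64_t value = block_data.at(id);  // assumes the data has arrived
            block_data.erase(id);
            pending.erase(std::find(pending.begin(), pending.end(), id));
            init(1);
            return value;
        }
    };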
For example, in the data processor provided in at least one embodiment of the present disclosure, a state item comprises a cache block number and a request number, which indicate the data access instruction corresponding to the state item and the cache block associated with that instruction. The state item further comprises the channel number of the channel to which the data access instruction belongs and the priority of the instruction, which together indicate the instruction's position in the instruction sending order. The state item further comprises request-receiving state information and request-sending state information, which indicate the current state of the data access instruction, and data state information, which indicates the current state of the destination data returned by the instruction.
For example, in a data processor provided in at least one embodiment of the present disclosure, before the task is executed, the data processor is further configured to: select a plurality of cache lines from the cache node corresponding to each data proxy module and map them to the cache blocks in that data proxy module.
For example, in the data processor provided in at least one embodiment of the present disclosure, the data access instruction sequence is obtained by arranging all data access instructions in the task according to their order of appearance in the program corresponding to the task.
For example, in a data processor provided in at least one embodiment of the present disclosure, the data processor is a general-purpose graphics processor or a graphics processor and comprises a plurality of computing units, with N=2. The level-1 cache of the two-level cache comprises a plurality of level-1 cache nodes and the level-2 cache comprises one level-2 cache node; each level-1 cache node is used for data sharing within one computing unit, and the level-2 cache node is used for data sharing between the computing units; each level-1 cache node corresponds to one level-1 data proxy module, and the level-2 cache node corresponds to one level-2 data proxy module.
At least one embodiment of the present disclosure further provides a data processing method for a data processor comprising an N-level cache and a memory, where the i-th level of the N-level cache comprises at least one level-i cache node, at least some of the level-i cache nodes share one level-(i+1) cache node, N is a positive integer greater than 1, i is any positive integer between 1 and N-1, the N-th level of the N-level cache is electrically connected to the memory, and each level-1 cache node is connected to a corresponding computing core. The data processing method comprises: acquiring the data access instruction sequence in the task currently executed by the data processor; adjusting the instruction sending order of the data access instruction sequence according to the storage position in the memory of the destination data of each data access instruction; sending the data access instructions in turn according to the instruction sending order; and caching the destination data returned by each data access instruction into the level-1 cache node or shared memory connected to the computing core that uses that data.
For example, in the data processing method provided in at least one embodiment of the present disclosure, adjusting the instruction sending order of the data access instruction sequence according to the storage position in the memory of the destination data of each data access instruction comprises: obtaining the physical distance corresponding to each data access instruction, i.e., the physical distance between the storage position in the memory of that instruction's destination data and the computing core in the data processor; and adjusting the instruction sending order according to the physical distance corresponding to each data access instruction, such that a data access instruction with a smaller corresponding physical distance is sent earlier.

For example, in the data processing method provided in at least one embodiment of the present disclosure, adjusting the instruction sending order of the data access instruction sequence according to the storage position in the memory of the destination data of each data access instruction comprises: acquiring the priority of each data access instruction; obtaining the physical distance corresponding to each data access instruction, i.e., the physical distance between the storage position in the memory of that instruction's destination data and the computing core in the data processor; and adjusting the instruction sending order according to the priorities and the physical distances, such that a data access instruction with a higher priority is sent earlier and, among instructions of equal priority, one with a smaller physical distance is sent earlier.
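A top-level sketch of these four method steps, with the hardware behaviors abstracted as caller-supplied functions, might look as follows; every name and parameter here is an assumption for illustration.

    #include <functional>
    #include <vector>

    struct Access { int id; int distance; };

    // The four steps of the data processing method wired together:
    // fetch the task's access-instruction sequence, reorder it by storage
    // location, send in that order, and cache each returned datum next to
    // the computing core that will use it.
    void process_task(
        std::function<std::vector<Access>()> get_sequence,        // step 1
        std::function<void(std::vector<Access>&)> reorder,        // step 2
        std::function<void(const Access&)> send,                  // step 3
        std::function<void(int)> cache_near_user_core)            // step 4
    {
        std::vector<Access> seq = get_sequence();
        reorder(seq);
        for (const Access& a : seq) send(a);
        for (const Access& a : seq) cache_near_user_core(a.id);  // returns may interleave in practice
    }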
At least one embodiment of the present disclosure also provides an electronic device comprising: a memory storing computer-executable instructions non-transitorily; and a processor configured to execute the computer-executable instructions, wherein the computer-executable instructions, when executed by the processor, implement the data processing method according to any embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions that, when executed by a processor, implement a data processing method according to any embodiment of the present disclosure.
At least one embodiment of the present disclosure also provides an electronic device including a data processor according to any one of the embodiments of the present disclosure.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments are briefly described below. It is apparent that the drawings described below relate only to some embodiments of the present disclosure and are not intended to limit it.
FIG. 1 is a schematic block diagram of a multi-core chip system;
FIG. 2 is a schematic block diagram of a general purpose graphics processor;
FIG. 3 is a diagram illustrating a data access instruction flowing between a cache and a memory;
FIGS. 4A-4D are timing diagrams of a data access instruction;
FIG. 5 is a schematic block diagram of a data processor provided in accordance with at least one embodiment of the present disclosure;
FIG. 6 is a schematic block diagram of a data proxy module provided in accordance with at least one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of a data processor provided in accordance with at least one embodiment of the present disclosure;
FIGS. 8A-8G are schematic diagrams illustrating a processing procedure of a multi-stage data proxy module according to at least one embodiment of the present disclosure;
FIGS. 9A-9B are timing diagrams of a data processor according to at least one embodiment of the present disclosure;
FIG. 10 is a schematic flow chart diagram of a data processing method provided by at least one embodiment of the present disclosure;
FIG. 11 is a schematic block diagram of an electronic device provided in an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a non-transitory computer-readable storage medium provided by at least one embodiment of the present disclosure;
FIG. 13 is a schematic block diagram of another electronic device provided in accordance with at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by one of ordinary skill in the art from the described embodiments without inventive effort fall within the scope of the present disclosure.
Unless defined otherwise, technical or scientific terms used in this disclosure should be given the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The terms "first," "second," and the like, as used in this disclosure, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.
In order to keep the following description of the embodiments of the present disclosure clear and concise, the present disclosure omits a detailed description of some known functions and known components.
The working speed of a processor is far higher than that of memory; if the processor accessed memory directly to read and write data, it would have to wait a certain period of time, so a cache (Cache) is generally employed in the memory access path to improve system efficiency and the speed at which the processor accesses memory. Typically, a processor looks for data in the cache first: if the data requested by an application or software is present in the cache, this is called a cache hit; otherwise it is called a cache miss.
FIG. 1 is a schematic block diagram of a multi-core chip system. As shown in FIG. 1, the system is a typical 4-core system-on-chip, comprising 4 cores, three levels of caches (L1 cache, L2 cache, and L3 cache) for the 4 cores, an on-chip interconnect network, a memory, and other intellectual property cores (Intellectual Property Core). I-L1$ is each core's private instruction L1 cache and D-L1$ is each core's private data L1 cache; every two cores share one L2 cache, and all four cores share the L3 cache. The L3 cache and the other intellectual property cores (e.g., direct memory access/video/display, etc.) access memory through the on-chip interconnect network.
The L1 cache is closest to the core; its capacity is the smallest and its speed the fastest.

The L2 cache has a larger capacity, e.g., 256K, and is slower than the L1 cache. It can be understood as a buffer for the L1 cache: the L1 cache is expensive to manufacture and therefore limited in capacity, so the L2 cache stores data that the processor needs but the L1 cache cannot hold.

The L3 cache has the largest capacity of the three levels, e.g., 12MB, and is also the slowest to access; the L3 cache and memory can be considered buffers for the L2 cache. From the L1 cache to the L3 cache, capacity increases while unit manufacturing cost decreases.
When the processor runs, it first looks for the data it needs in the L1 cache; on a miss it looks in the L2 cache, and on a further miss in the L3 cache. If the needed data is not found in any of the three cache levels, it is fetched from memory. The longer the lookup path, the longer the access takes.
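A compact sketch of this lookup path might look like the following; the address-to-data map is a stand-in for real cache structure, and the fill of the caches on a miss is elided.

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    using CacheLevel = std::unordered_map<uint64_t, uint64_t>;  // address -> data

    uint64_t read_data(const std::vector<const CacheLevel*>& levels,  // {L1, L2, L3}
                       const CacheLevel& memory, uint64_t addr) {
        for (const CacheLevel* level : levels) {       // nearest level first
            auto hit = level->find(addr);
            if (hit != level->end()) return hit->second;  // cache hit
        }
        return memory.at(addr);  // miss in all cache levels: fetch from memory
    }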
Similarly, caches in GPGPUs (General-Purpose Graphics Processing Units) or GPUs (Graphics Processing Units) are organized much like those of the multi-core system-on-chip shown in FIG. 1, including hierarchical multi-level cache structures.
FIG. 2 is a schematic block diagram of a general-purpose graphics processor.
As shown in FIG. 2, the general-purpose graphics processor is essentially an array of stream processor clusters (Streaming Processor Cluster, SPC), including, for example, stream processor cluster 1 through stream processor cluster M in FIG. 2, where M is a positive integer greater than 1. In a general-purpose graphics processor, one stream processor cluster processes a computing task, or several stream processor clusters together process one computing task. Data is shared among the stream processor clusters through a global cache (L2 cache) or global memory.
As shown in FIG. 2, taking stream processor cluster 1 as an example, one stream processor cluster includes a plurality of computing units, for example computing unit 1, computing unit 2, ..., computing unit N in FIG. 2, where N is a positive integer. Each computing unit (CU) performs operations such as accumulation, reduction, and conventional addition, subtraction, multiplication, and division. A computing unit includes a plurality of cores (also referred to as computing cores), each of which includes an arithmetic logic unit (ALU), a floating-point computing unit, and the like for performing specific computing tasks. In addition, the computing unit includes a register file, a shared memory, and an L1 cache, which store, in a hierarchical manner, the source data and destination data involved in the computing task. The shared memory in a computing unit is used to share data among the cores of that computing unit, and the L1 cache caches data from the memory or the L2 cache that the computing unit has used. Of course, in some general-purpose graphics processors, the computing unit may store data using only the L1 cache and the register file.
As shown in FIG. 2, the general-purpose graphics processor may also include an L2 cache (also referred to as a global cache) and memory. The L2 cache can be used for data sharing among the stream processor clusters, and the memory may be, for example, HBM (High Bandwidth Memory).
In parallel computing, computing tasks are typically performed by multiple threads. Before execution in a general-purpose graphics processor (also referred to as a parallel computing processor), these threads are divided into thread blocks, and the thread blocks are then distributed to individual computing units via a thread block distribution module (not shown in FIG. 2). All threads in a thread block must be allocated to the same computing unit for execution. At the same time, a thread block is split into minimum execution units called thread bundles (warps), each of which contains a fixed number of threads or fewer, e.g., 32 threads. Multiple thread blocks may be executed in the same computing unit or in different computing units.
In each computing unit, a thread bundle scheduling/dispatching module (not shown in FIG. 2) schedules, dispatches, and distributes thread bundles so that the computing cores of the computing unit run them. Depending on the number of computing cores in the computing unit, the multiple thread bundles in a thread block may be executed simultaneously or in a time-sharing manner. The threads in each thread bundle execute the same instruction. A data access instruction may be transmitted to the L1 cache (or shared memory) in the computing unit, or further to the L2 cache or global memory, for read and write operations and the like.
FIG. 3 is a diagram illustrating a data access instruction flowing between caches and memory. FIG. 3 shows a two-level cache structure; structures with more cache levels follow similar flow logic and are not described again here.
With the multi-level cache structure described above, when a data access instruction is executed, as shown by the solid black line in FIG. 3, the computing core first queries the L1 cache according to the data address to see whether there is a cache hit; on a hit, the data is returned to the corresponding computing core. If the data misses in the L1 cache, as indicated by the black dashed line in FIG. 3, the lookup continues in the L2 cache; if the data hits in the L2 cache, it is returned to the L1 cache and the computing core. If the data also misses in the L2 cache, as indicated by the black dashed line in FIG. 3, it must be loaded from memory and returned to the L2 cache, the L1 cache, and the corresponding computing core.
A data access instruction takes as an input parameter the read address of the destination data, e.g., a memory address, and as an output parameter the address of the destination register; it reads the destination data from the read address in memory and loads it into the destination register.

Current data access instructions are not efficient enough, and choosing when to send them is demanding. This is because the time at which a data access instruction returns its data is not fixed: it depends, for example, on the distance between the storage location of the destination data and the computing core. Memory is typically composed of multiple memory blocks at different distances from the computing core, so data may reside in far memory (memory blocks farther from the computing core) or in near memory (memory blocks closer to it). In addition, since returning data from memory takes time, the returned data cannot be received immediately; the consumer must wait when using the data returned by the instruction, and the data may well not have arrived at the exact moment of use.
FIG. 4A is a timing diagram of a data access instruction.
As shown in FIG. 4A, the data access instruction is sent only when the data is needed (dotted line in FIG. 4A). The destination data resides in memory, so the instruction must travel from the computing core through the L1 cache and the L2 cache to the memory, and the data read from memory returns through the L2 cache and the L1 cache to the computing core. Hence, even if other instructions (e.g., instruction 2) are executed after the data access instruction is sent, a long data wait still occurs.

To avoid the above problem, the data access instruction is commonly sent earlier so that the data returns before it is used, avoiding data waits as much as possible. With this approach, however, the destination register is occupied from the moment the instruction is issued until the data returns. The register resource is tied up for a long time and cannot be used by other instructions, which wastes register resources.
FIGS. 4B and 4C are timing diagrams of another data access instruction.
As shown in FIG. 4B, the data arrives exactly at the point of use (dotted line in FIG. 4B), but the destination register is occupied from the moment the data access instruction is sent, so even though the data wait is short, register resources are still wasted.

As shown in FIG. 4C, since the return time is not fixed, the data may also arrive early and already be present at the point of use (dotted line in FIG. 4C). No data wait occurs, but the register is occupied even longer, again wasting register resources.

The user may set the sending times of different data access instructions empirically, but this still cannot guarantee that the data returns in the desired order, because it is difficult for the user to determine the actual storage locations of the data in the processor. The return times of data access instructions can also be out of order: if multiple threads send multiple data access instructions at the same time, the instructions need different amounts of time to return their data, so the data is likely not returned in the order in which the instructions were sent. For example, data that is not needed soon may occupy the bandwidth, so that the data wanted now is blocked and arrives late.
FIG. 4D is a timing diagram of yet another data access instruction.

As shown in FIG. 4D, the user expects, based on experience, that the return data of data access instruction 2 will arrive at the desired time (bold black dashed line in FIG. 4D). However, that return data's bandwidth may be occupied by other data that is not needed soon: in FIG. 4D, return data 1 of data access instruction 1 occupies the bandwidth, so the return data of data access instruction 2 cannot come back at the desired time and actually returns later than desired. This causes several data waits of different lengths; for example, after waiting for a period of time, other instructions (e.g., instruction 1 or instruction 2 in FIG. 4D) are executed first and then the wait resumes, which may lengthen the overall waiting time.
At least one embodiment of the present disclosure provides a data processor, a data processing method, an electronic device, and a non-transitory computer readable storage medium. The data processor comprises an N-level cache and a memory; the i-th level of the N-level cache comprises at least one level-i cache node, at least some of the level-i cache nodes share one level-(i+1) cache node, N is a positive integer greater than 1, i is any positive integer between 1 and N-1, the N-th level of the N-level cache is electrically connected to the memory, and each level-1 cache node is connected to a corresponding computing core. The data processor further comprises N levels of data proxy modules, each level of data proxy module corresponding one-to-one with a level of the N-level cache. The N levels of data proxy modules are configured to acquire the data access instruction sequence in the task currently executed by the data processor, adjust the instruction sending order of that sequence according to the storage position in the memory of the destination data of each data access instruction, send the data access instructions in turn according to the instruction sending order, and cache the destination data returned by each data access instruction into the corresponding level-1 data proxy module.

In the data processor provided in at least one embodiment of the present disclosure, the N levels of data proxy modules send the data access instructions in a unified way. For data access operations, the computing core and the memory are decoupled: the computing core does not need to concern itself with sending, receiving, and scheduling data access instructions, and only needs to fetch the data from the corresponding level-1 data proxy module when the data is needed.

Because the data is fetched into the level-1 data proxy module in advance rather than loaded directly into the destination register, a data access instruction can be sent early and its data cached in the level-1 data proxy module ahead of time. The destination register need not be occupied throughout the wait before the data is used; it is occupied only when the destination data is actually loaded into it. This greatly reduces register occupation time and register resource consumption.

In addition, the instruction sending order is adjusted according to the storage positions of the destination data of different instructions, preventing data access instructions whose data is stored far away from crowding out the bandwidth of instructions whose data is stored nearby, making the data return times match the user's expectations, and improving data access efficiency.
Embodiments of the present disclosure will be described in detail below with reference to the attached drawings, but the present disclosure is not limited to these specific embodiments.
FIG. 5 is a schematic block diagram of a data processor provided in at least one embodiment of the present disclosure.
As shown in fig. 5, the data processor 100 includes a memory, an N-level cache, and an N-level data proxy module.
FIG. 5 shows the specific structure of the level-1 cache, the level-2 cache, and the level-N cache; the structures of the other cache levels are similar and are not repeated. Of course, in some embodiments a two-level cache structure is also possible, i.e., N=2; see, for example, FIG. 7, not described again here.
For example, an i-th level cache in the N-level cache includes at least 1 i-level cache node, and at least some of the at least 1 i-level cache nodes share 1 i+1-level cache node. Here, N is a positive integer greater than 1, and i is any positive integer between 1 and N-1.
For example, as shown in FIG. 5, when i=1 the level-1 cache includes a plurality of level-1 cache nodes, and when i=2 the level-2 cache includes a plurality of level-2 cache nodes.
For example, as shown in fig. 5, an nth level cache of the N level caches includes 1N level cache node.
For example, multiple level-i cache nodes may share one level-(i+1) cache node, and all level-(N-1) cache nodes share the level-N cache node. Here, sharing means that the plurality (for example, M, where M is a positive integer) of level-i cache nodes are electrically connected to the level-(i+1) cache node, and all M level-i cache nodes may exchange data with the level-(i+1) cache node, for example, obtain required data from it.

For example, referring to FIG. 5, multiple level-1 cache nodes share one level-2 cache node; e.g., some level-1 cache nodes share one level-2 cache node while others share another. All level-(N-1) cache nodes share the level-N cache node.

For example, at each level, how many cache nodes share one higher-level cache node may be set as needed and may differ from level to level; the present disclosure does not specifically limit this. For example, when N=4, every 4 level-1 cache nodes may share one level-2 cache node, every 8 level-2 cache nodes may share one level-3 cache node, and all level-3 cache nodes share the level-4 cache node.
For example, as shown in FIG. 5, the level N cache is electrically connected to the memory, which may be, for example, a high bandwidth memory.
For example, as shown in FIG. 5, each level-1 cache node is connected to a corresponding computing core. Referring to the processor architecture shown in FIG. 1, a level-1 cache node may be an L1 cache connected to one computing core; or, referring to the processor architecture shown in FIG. 2, a level-1 cache node may be an L1 cache connected to several computing cores, e.g., within one computing unit a level-1 cache node may be connected to 4 computing cores.
As shown in FIG. 5, the data processor includes N levels of data proxy modules, each level corresponding one-to-one with a level of the N-level cache.

For example, the N levels of data proxy modules include a level-1 data proxy module corresponding to the level-1 cache, a level-2 data proxy module corresponding to the level-2 cache, and so on up to a level-N data proxy module corresponding to the level-N cache.
For example, for level 1 through N-1 data proxy modules, in some embodiments, each level cache corresponds to 1 data proxy module, e.g., all level 1 cache nodes in a level 1 cache correspond to 1 level 1 data proxy module.
For example, in other embodiments, as shown in fig. 5, a single cache node in each level of cache corresponds to 1 data agent module, e.g., a data processor includes a plurality of level 1 data agent modules that are in one-to-one correspondence with a plurality of level 1 cache nodes included in the level 1 cache.
For example, the level-N cache includes one level-N cache node, which corresponds to one level-N data proxy module.

For example, the same electrical connection relationships exist between the data proxy modules as between the corresponding cache nodes. If M level-i cache nodes share one level-(i+1) cache node, then the M level-i data proxy modules corresponding to those cache nodes are electrically connected to the level-(i+1) data proxy module corresponding to that level-(i+1) cache node. Level-i data proxy modules corresponding to other level-i cache nodes have no electrical connection to that level-(i+1) data proxy module.
In addition, each level-1 data proxy module may be connected to certain computing cores for data interaction, namely the computing cores electrically connected to the level-1 cache node corresponding to that level-1 data proxy module, i.e., the computing cores sharing that level-1 cache node.

The level-N data proxy module, like the level-N cache node, is also electrically connected to the memory and can exchange data with it.
For example, the N levels of data proxy modules are configured to acquire the data access instruction sequence in the task currently executed by the data processor, adjust the instruction sending order of that sequence according to the storage position in the memory of the destination data of each data access instruction, send the data access instructions in turn according to the instruction sending order, and cache the destination data returned by each data access instruction into the corresponding level-1 data proxy module.

Here, the level-1 data proxy module corresponds to the level-1 cache. For example, if only one level-1 data proxy module is provided in the data processor, the returned destination data is cached in that module. If a plurality of level-1 data proxy modules are provided, in one-to-one correspondence with the level-1 cache nodes, the returned destination data is cached in the level-1 data proxy module corresponding to the target level-1 cache node, and the computing core electrically connected to that cache node uses the returned data. For different data access instructions, the returned destination data may be cached in different level-1 data proxy modules.

For example, the data access instruction sequence is the original instruction sequence, obtained by arranging all data access instructions in the task according to their order of appearance in the program corresponding to the task. That is, the data access instruction sequence represents the original sending order in the task program, arranged in the order in which the user expects the data to be used.
For example, the data access instruction sequence may be obtained by scanning the program code corresponding to the task, or by pre-running the task.
For example, in some embodiments, when adjusting the instruction sending order of the data access instruction sequence according to the storage position of the destination data of each data access instruction, the N levels of data proxy modules perform the following operations: obtaining the physical distance corresponding to each data access instruction, i.e., the distance between the storage position in the memory of that instruction's destination data and the computing core in the data processor; and adjusting the instruction sending order according to the physical distance corresponding to each data access instruction, such that an instruction with a smaller corresponding physical distance is sent earlier.
For example, the destination data to be loaded by a data access instruction is placed in the memory, which has a plurality of memory blocks; as described above, different memory blocks are at different distances from the computing core. Destination data in the same memory block can be regarded as being at the same distance, and destination data in different memory blocks at different distances. Alternatively, adjacent memory blocks may be regarded as being at the same distance from the computing core, as those skilled in the art may require.
According to the different physical distances corresponding to the data access instructions, the instruction sending order of the data access instruction sequence can be adjusted; for example, on the basis of the original data access instruction sequence, the sending order is adjusted according to the physical distance corresponding to each instruction.

For example, the smaller the corresponding physical distance, the earlier the data access instruction is sent, i.e., the earlier it is placed in the instruction sending order; the larger the corresponding physical distance, the later it is sent, i.e., the later it is placed in the instruction sending order. Data access instructions with the same corresponding physical distance are sent in turn according to their relative order in the data access instruction sequence.

In at least one embodiment of the present disclosure, the physical distance corresponding to a data access instruction is not necessarily a real spatial distance value; it may be represented in any way that distinguishes distances from the computing core. For example, when the read address of a data access instruction is located in far-end memory, the instruction's physical distance may be marked as a first value, and when the read address is located in near-end memory, as a second value, so as to distinguish the physical distances corresponding to different data access instructions.
For example, when adjusting the instruction sending order of the data access instruction sequence according to the physical distance corresponding to each data access instruction, the N levels of data proxy modules perform the following operations: classifying the data access instructions into different channels according to their corresponding physical distances, with different channels holding instructions of different physical distances; and arranging the data access instructions channel by channel, in order of increasing physical distance, to obtain the instruction sending order. In the instruction sending order, the instructions belonging to a first channel are sent first and those belonging to a second channel are sent last; instructions belonging to the same channel are sent serially according to their relative order in the data access instruction sequence; the physical distance corresponding to the instructions in the first channel is the smallest, and that corresponding to the instructions in the second channel is the largest.
For example, a plurality of channels may be set in the data proxy module, and the data access instructions may be classified into different channels according to their corresponding physical distances. For example, data access instructions whose destination data is in the same memory block may be classified into one channel, or instructions whose destination data is in several memory blocks close to one another may be classified into one channel.
For example, a first channel, a second channel, a third channel, etc. may be provided. For example, the first channel is a fast channel, and the physical distance corresponding to the data access instruction in the first channel is smaller, for example, the destination data of the data access instruction in the first channel is located in the near-end memory. For example, the second channel is a slow channel, and the physical distance corresponding to the data access instruction in the second channel is larger, for example, the destination data of the data access instruction in the second channel is located in the remote memory. For example, a third channel may be further configured, where the physical distance corresponding to the data memory address in the channels is smaller than the physical distance corresponding to the data memory address in the second channel, but greater than the physical distance corresponding to the data memory address in the first channel.
For example, when the instruction transmission order is arranged, the data access instruction is arranged in units of channels. For example, assume that three channels are provided in the data proxy module, the first channel is a fast channel, the second channel is a slow channel, the third channel is a middle channel, the physical distance corresponding to the data access instruction classified to the first channel is the smallest, the physical distance corresponding to the data access instruction classified to the third channel is greater than the physical distance corresponding to the data access instruction classified to the first channel, and the physical distance corresponding to the data access instruction classified to the second channel is the largest, that is, greater than the physical distance corresponding to the data access instruction classified to the third channel. It should be noted that, according to the actual situation, more channels may be set or only two channels may be set to sort different data access instructions, which is not specifically limited in the present disclosure.
All data access instructions in the first channel are placed at the forefront of the instruction sending sequence to be sent earliest, all data access instructions in the third channel are placed after the data access instructions of the first channel, and all data access instructions of the second channel are sent after the data access instructions of the third channel.
If a certain channel comprises a plurality of data access instructions, the data access instructions can be serially transmitted according to any sequence; or the data access instructions are sequentially sent according to the relative sequence relation of the data access instructions in the data access instruction sequence, namely, for the data access instructions, the data access instructions in the front in the data access instruction sequence are sent preferentially, the data access instructions in the back are sent later, the instruction sending sequence at this time meets the instruction sending sequence expected by the program, and the target data expected to be loaded earlier can be returned earlier.
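The channel-based arrangement just described can be sketched as follows; this is a simplified illustration, assuming each instruction already carries a distance marker such as the one above.

```python
from collections import defaultdict

def arrange_send_order(instructions):
    """Classify instructions into channels by distance marker, then emit the
    channels from nearest to farthest. The grouping is stable, so within a
    channel the relative order of the data access instruction sequence is
    preserved. `instructions` is a list of (request_id, distance_marker)
    pairs given in the original sequence order."""
    channels = defaultdict(list)
    for request_id, distance in instructions:
        channels[distance].append(request_id)
    send_order = []
    for distance in sorted(channels):  # smallest physical distance first
        send_order.extend(channels[distance])
    return send_order

# req#0 and req#2 read the near-end memory, req#1 and req#3 the far end:
print(arrange_send_order([(0, 0), (1, 1), (2, 0), (3, 1)]))  # [0, 2, 1, 3]
```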
In the above embodiment, channels are introduced to manage the classification of data access instructions: instructions are classified into different channels according to how far the storage location of the destination data to be loaded is from the computing core, instructions with a small corresponding physical distance are sent preferentially, and instructions with a large corresponding physical distance are sent later. This prevents, as far as possible, long-distance requests from squeezing out the bandwidth of short-distance requests, so that the return time of the destination data matches the user's expectations as closely as possible.

For example, in some embodiments, the instruction sending order may also take the priority of the data access instructions into account. Some data access instructions correspond to a large physical distance but matter more to the user, e.g., the user expects to use the returned data earlier; such instructions may be given a higher priority so that they are sent earlier.

For example, when the N-level data proxy module adjusts the instruction sending order of the data access instruction sequence according to the storage location of the destination data of each data access instruction, it performs the following operations: acquiring the priority of each data access instruction; acquiring the physical distance corresponding to each data access instruction, i.e., the distance between the storage location of its destination data in the memory and the computing core in the data processor; and adjusting the instruction sending order according to the priority and the corresponding physical distance of each instruction, where a higher-priority instruction is sent earlier and, at the same priority, an instruction with a smaller corresponding physical distance is sent earlier.

For example, the N-level data proxy module acquires the priority of each data access instruction by performing the following operation: determining the priority of each data access instruction according to the time at which its destination data is returned when the task is pre-run, where during the pre-run the instructions are sent in turn according to their order in the data access instruction sequence and the return time of each instruction's destination data is collected.

For example, by collecting the time at which each data access instruction returns its destination data during a pre-run of the task, it can be determined which instructions need to be sent earlier; for instance, some instructions return destination data late but the data is needed early, and a high priority can be set for these instructions.

In the adjusted instruction sending order, the priority of a data access instruction is considered first, and the channel second. For example, higher-priority data access instructions are sent earlier; even if they belong to the slow channel they are sent preferentially, and the other instructions (e.g., low-priority instructions in the fast channel) are sent after a delay of the corresponding clock cycles. Instructions of the same priority are sent in order of increasing corresponding physical distance. Specifically, the instructions may be classified into different channels as described above and arranged channel by channel, in order of increasing physical distance, to obtain the instruction sending order.
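A sketch of this priority-then-distance ordering might look as follows (illustrative only; the tuple layout is an assumption):

```python
def arrange_send_order_with_priority(instructions):
    """Sort by priority first (higher sent earlier), then by distance marker
    (smaller sent earlier); Python's stable sort keeps the original sequence
    order for exact ties. `instructions` is a list of
    (request_id, priority, distance_marker) tuples in sequence order."""
    ordered = sorted(instructions, key=lambda t: (-t[1], t[2]))
    return [request_id for request_id, _, _ in ordered]

# req#1 sits in the slow channel but has high priority, so it goes first:
print(arrange_send_order_with_priority(
    [(0, 0, 0), (1, 1, 1), (2, 0, 1)]))  # [1, 0, 2]
```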
In this embodiment, setting a priority for each data access instruction allows high-priority instructions to be sent as early as possible so that their return times match the user's expectations. In addition, sending the data access instructions over multiple channels separates the fast and slow channels, preventing, as far as possible, the slower, more time-consuming requests of the slow channel from squeezing out the bandwidth of the faster requests of the fast channel.
For example, when the N-level data proxy module caches the destination data returned by each data access instruction into the corresponding level 1 data proxy module, it performs the following operation: for each data access instruction, caching the destination data it returns into a level 1 cache node or shared memory connected to the computing core that uses the destination data. For example, the corresponding level 1 data proxy module includes a plurality of cache blocks mapped to the level 1 cache node or the shared memory, and these cache blocks are used to cache the received destination data.

For example, suppose data access instruction 1 is issued by a computing core in fig. 5 and that core needs the destination data 1 returned by the instruction. The N-level data proxy module reads destination data 1 from the memory and caches it into a level 1 cache node or shared memory connected to that computing core. Specifically, the level 1 data proxy module may include a plurality of cache blocks mapped to the level 1 cache node or the shared memory; that is, the cache blocks actually use the level 1 cache node or the shared memory to store the received destination data. For example, as shown in figs. 1 and 2, the level 1 cache node may be an L1 cache, and as shown in fig. 2, the shared memory may be the shared memory connected to the computing core.

In this way, the destination data returned by each data access instruction is cached in a level 1 cache node or shared memory very close to the computing core. When the computing core needs the destination data, it sends the data access instruction to the corresponding level 1 data proxy module, which directly returns the destination data cached in the level 1 cache node or shared memory to the destination register. Because the level 1 cache node or shared memory is closest to the computing core, bandwidth is high, data transfer is fast, and no data needs to be fetched from the memory, so the data waiting time is greatly reduced. The register is occupied only when the destination data is actually used, and since the waiting time is short, the register occupation time is greatly reduced, saving register resources, lowering resource consumption, and improving data access efficiency.
For example, as described above, each data proxy module has the same electrical connection relationship as its corresponding cache node; that is, however the cache nodes are electrically connected, the corresponding data proxy modules are electrically connected in the same way. Taking the level i data proxy modules in the N-level data proxy module as an example, at least 1 level i cache node corresponds one-to-one to at least 1 level i data proxy module, and the level i data proxy modules corresponding to at least some of the level i cache nodes are each electrically connected to the level i+1 data proxy module corresponding to one level i+1 cache node.

For example, the electrical connection between a level 1 data proxy module and the computing cores mirrors that between the corresponding level 1 cache node and the computing cores; that is, if a level 1 cache node is connected to certain computing cores, the corresponding level 1 data proxy module is connected to those computing cores as well.

For example, the level N data proxy module is also electrically connected to the memory.
Fig. 6 is a schematic block diagram of a data proxy module provided in at least one embodiment of the present disclosure.
For example, the data proxy module may be any level i data proxy module or the level N data proxy module; that is, every data proxy module has the same structure, as shown in fig. 6.

As shown in fig. 6, the data proxy module 200 includes a scheduler 201, a request status list 202, and a data request list 203.

The data request list 203 is used to cache the data access instructions to be sent and includes the instruction information of each data access instruction. For example, the instruction information includes the input and output parameters of the instruction: the input parameters include the read address of the destination data, and the output parameters include the address of the destination register. In addition, a unique request number may be assigned to each data access instruction, and the instruction information may further include this request number.

For example, the data request list 203 may cache P data access instructions, where P is a positive integer. When P is less than the total number of instructions in the data access instruction sequence, the first P instructions are cached in the data request list 203 according to their order in the sequence; after a cached instruction has been processed, the (P+1)-th instruction is fetched from the sequence and cached in the data request list 203, and so on.
For example, as shown in fig. 6, a plurality of channels (the rectangular boxes in the data request list 203) are provided in the data request list 203, e.g., a first channel, a second channel, and so on. Each channel caches at least one data access instruction, and the physical distances corresponding to the instructions in different channels differ.

For example, the physical distance corresponding to a data access instruction may be determined from the read address in the instruction. When an instruction is cached in the data request list 203, its physical distance is determined from its read address and it is cached in the corresponding channel. For example, an instruction whose read address lies in the near-end memory is placed in the first channel, an instruction whose read address lies in the far-end memory is placed in the second channel, and so on. For the description of the channels, refer to the relevant parts above, which are not repeated here.

The scheduler 201 is configured to send the data access instructions cached in the data request list 203 in turn according to the instruction sending order, and to send received destination data to the corresponding computing core or to a cache block in the electrically connected upper-level data proxy module.

For example, for the level 1 data proxy module, in response to receiving a data access instruction sent by a computing core electrically connected to it, the scheduler determines the cache block associated with that instruction and sends the data in the associated cache block to the destination register indicated by the instruction.

For the data proxy modules other than the level 1 module, e.g., the level i+1 data proxy module, when a data request signal sent by the preceding level (e.g., the level i data proxy module) is received, the scheduler 201 of the level i+1 module sends the destination data of the data access instruction corresponding to that signal to the level i data proxy module.

The request status list 202 includes a plurality of status items, each of which indicates the cache block associated with the data access instruction corresponding to the status item, the position of that instruction in the instruction sending order, and so on.

As shown in fig. 6, the data proxy module further includes a plurality of cache blocks, e.g., cache block 0 to cache block N, for caching the destination data received by the data proxy module.

For example, for the level N data proxy module, the received destination data comes from the memory; for a level i data proxy module, it comes from the next level, i.e., the level i+1 data proxy module.
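The structure just described might be modeled with the following illustrative data classes; the names and field layout are assumptions, not the disclosure's definitions. Later sketches in this description reuse them.

```python
from dataclasses import dataclass, field

@dataclass
class RequestEntry:
    """One data access instruction cached in the data request list."""
    request_id: int     # unique request number
    read_address: int   # input parameter: read address of the destination data
    dest_register: int  # output parameter: address of the destination register
    channel: int        # channel chosen from the read address

@dataclass
class DataProxyModule:
    """Skeleton of one data proxy module; the scheduler operates on the lists."""
    level: int
    data_request_list: list = field(default_factory=list)    # RequestEntry items
    request_status_list: list = field(default_factory=list)  # status items
    cache_blocks: dict = field(default_factory=dict)         # block number -> data
```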
For example, before executing the task, the data processor is further configured to: select at least one cache line from the cache node corresponding to each data proxy module and map it to a cache block in that data proxy module.

For example, as shown in fig. 6, a cache node includes a plurality of cache lines, which are the basic unit of the cache. Each cache line is mapped to one cache tag in a direct-mapped, set-associative, or fully associative manner, i.e., each cache tag has a fixed, static one-to-one mapping to a cache line. For example, each cache tag includes a virtual address (VA), access information (age count), reference information (reference count/ref_cnt), and so on.

As shown in fig. 6, the data proxy module 200 selects some cache lines from the corresponding cache node and establishes a mapping to its cache blocks; that is, the destination data cached in a cache block is actually stored in the mapped cache line.

Of course, as described above, the cache blocks in the data proxy module may also be mapped to the shared memory within the computing unit; for example, on-chip memory or the like may likewise be mapped to cache blocks.

When a level 1 cache node or the shared memory is mapped to the cache blocks, its proximity to the computing core gives high data transfer bandwidth and speed, which reduces data waiting latency, shortens register occupation time, lowers resource consumption, and improves data access efficiency.
For example, the received destination data needs to be stored in the corresponding cache block, i.e., the cache block associated with the data access instruction used to load that destination data.

For example, each cache block has a unique cache block number. As shown in fig. 6, a status item includes a cache block number and a request number, which indicate the data access instruction corresponding to the status item and the cache block associated with that instruction. By setting the cache block number and the request number in a status item, a data access instruction is associated with a cache block, and the destination data returned by the instruction is stored in the associated cache block.

For example, in some embodiments, as shown in fig. 6, the status item further includes a channel number, which indicates the channel to which the data access instruction corresponding to the status item (identified by the request number) belongs. For example, the first channel is marked with channel number 0, the second channel with channel number 1, and so on. The channel number indicates the position of the data access instruction in the instruction sending order: the channel an instruction belongs to can be determined from the channel number, and the instruction is then sent according to the process described above.

For example, when sending the data access instructions cached in the data request list 203, the scheduler 201 arranges the instructions channel by channel, in order of increasing physical distance, to obtain the instruction sending order, and then sends the instructions in turn according to that order. For example, suppose two channels are provided in the data request list 203, the first being a fast channel and the second a slow channel: the instructions belonging to the first channel are sent first, those belonging to the second channel afterwards, and instructions belonging to the same channel are sent serially according to their relative order in the data access instruction sequence.

For example, in some embodiments, as shown in fig. 6, the status item may also include a priority; the channel number and the priority together indicate the position of the data access instruction in the instruction sending order.

For example, when the status item includes a priority, as described above, the scheduler 201 first considers priority when sending the instructions cached in the data request list 203: high-priority data access instructions are sent first, and the remaining low-priority instructions are then sent channel by channel, in order of increasing physical distance. For the determination of priority and the specific procedure of sending instructions using the channel number and priority, refer to the foregoing; details are not repeated here.

For example, a status item may include only the channel number and no priority, in which case the channel number alone determines the position of the data access instruction in the instruction sending order; or it may include both a channel number and a priority, in which case the two together determine that position.
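Continuing the illustrative model above, a status item carrying both fields might be sketched as follows (field names are assumptions):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StatusEntry:
    """One entry of the request status list (illustrative field names).
    request_id plus block_id associate an instruction with a cache block;
    channel, and priority when present, fix its slot in the send order."""
    request_id: Optional[int] = None  # None: not bound to any instruction yet
    block_id: Optional[int] = None    # associated cache block number
    channel: int = 0                  # e.g. 0 = fast channel, 1 = slow channel
    priority: int = 0                 # higher values are sent earlier
```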
For example, as shown in fig. 6, the status item further includes request receiving status information and request sending status information, which indicate the current status of the data access instruction; the user can monitor the status of the instruction through them.

For example, the request receiving status information covers three states: the first is that no request has been received, e.g., the status item is not bound to any data access instruction after initialization; the second is that a request is being received, e.g., an upper-level data proxy module is synchronizing a data access instruction; the third is that the request already exists, e.g., the status item already has a corresponding data access instruction whose request number has been filled into the corresponding field of the status item.

For example, the request sending status information covers three states: the first is that no request exists yet, e.g., the status item is not bound to any data access instruction after initialization; the second is that the request is being sent, e.g., the data access instruction is being synchronized or sent to the memory or the next-level data proxy module; the third is that the request has been sent.

For example, as shown in fig. 6, the status item further includes data status information indicating the current status of the destination data returned by the data access instruction.

For example, the data status information covers four states: the first is empty, e.g., the status item is not bound to any data access instruction after initialization; the second is waiting for data, e.g., the data access instruction or data request signal has been sent and the data has not yet returned; the third is that data is being received, indicating the data is on its way back from the memory or the next-level data proxy module; the fourth is that the data already exists, e.g., the data has been cached in the corresponding cache block.
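These states can be sketched as plain enumerations; this is an illustrative encoding, and the disclosure does not prescribe concrete values.

```python
from enum import Enum, auto

class RecvState(Enum):   # request receiving status information
    NO_REQUEST = auto()  # entry not bound to any instruction after init
    RECEIVING = auto()   # an upper-level module is synchronizing it
    PRESENT = auto()     # the request number is filled in

class SendState(Enum):   # request sending status information
    NO_REQUEST = auto()  # entry not bound to any instruction after init
    SENDING = auto()     # being synchronized / sent to memory or next level
    SENT = auto()        # the request has been sent

class DataState(Enum):   # data status information
    EMPTY = auto()       # entry not bound to any instruction after init
    WAITING = auto()     # request sent, waiting for data to return
    RECEIVING = auto()   # data returning from memory / next-level module
    PRESENT = auto()     # data cached in the associated cache block
```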
The specific settings of the request receiving status information, the request sending status information, and the data status information may be adjusted as needed; the present disclosure does not specifically limit them.

For example, each data proxy module is initialized before the task is executed.

For example, the level 1 data proxy module is configured to: before the task is executed, select at least one data access instruction from the data access instruction sequence according to the order of the sequence and perform an initialization operation on it. Here, the at least one data access instruction is executed by a computing core electrically connected to the level 1 data proxy module.

For example, the level 1 data proxy module selects at least one data access instruction, in sequence order, from the data access instructions executed by the computing cores electrically connected to its corresponding level 1 cache node, where the order refers to the original order of the data access instruction sequence, i.e., the order in which the instructions appear in the program code of the task. In the initialization stage before the task is executed, P data access instructions are selected and cached in the data request list for initialization; after the task starts and the destination data of an instruction has been sent to the destination register, one or more not-yet-sent instructions are selected in turn from the sequence and cached in the data request list for initialization.

For example, the level 1 data proxy module performs the initialization operation on at least one data access instruction by performing the following operations: caching the at least one data access instruction into the data request list of the level 1 data proxy module; initializing the status item corresponding to the at least one data access instruction and storing it in the request status list of the level 1 data proxy module; and synchronizing the at least one data access instruction and the corresponding status item to the other levels of data proxy modules directly or indirectly electrically connected to the level 1 data proxy module.
For example, when the level 1 data proxy module caches a data access instruction into its data request list, the instruction is cached into the corresponding channel according to its physical distance.

For example, when the level 1 data proxy module initializes the status item corresponding to at least one data access instruction, it sets the cache block number and the request number in the status item to establish which data access instruction the status item corresponds to and which cache block is associated with it. The specific process is described later and not repeated here.

The physical distance corresponding to a data access instruction can be determined from its read address, and the channel number in the status item is set accordingly; when the status item includes a priority, the priority in the status item may also be set according to the acquired priority of the instruction.

In addition, the request receiving status information, the request sending status information, and the data status information are set to initial values, which are not described here.

Afterwards, the level 1 data proxy module synchronizes the data access instructions and their corresponding status items to the other data proxy modules directly or indirectly electrically connected to it.

For example, the other data proxy modules include the level 2 through level N data proxy modules, and the electrical connections are determined by those between the corresponding cache nodes. The synchronized information includes the instruction information, status items, etc. of the data access instructions. After synchronization, the other data proxy modules likewise store the data access instructions to be sent in their data request lists and the corresponding status items in their request status lists. For example, the level N data proxy module stores all the selected data access instructions and their status items.
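Using the illustrative classes above, the level 1 initialization might be sketched as follows; the helper name and the parameter P=5 are assumptions.

```python
def initialize_level1(level1, instruction_seq, other_levels, P=5):
    """Cache the first P instructions of the sequence (program order) in the
    level 1 data request list, create their status items, and synchronize
    both to every directly or indirectly connected data proxy module."""
    for entry in instruction_seq[:P]:
        for module in [level1] + list(other_levels):  # levels 1..N
            module.data_request_list.append(entry)    # kept in its channel
            module.request_status_list.append(StatusEntry(
                request_id=entry.request_id, channel=entry.channel))
```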
For example, for any data proxy module, in response to the existence of an unassociated cache block in the module, the unassociated cache block is associated with one data access instruction selected from the module's data request list according to the instruction sending order. An unassociated cache block is one that has not yet been associated with any data access instruction, and it can be identified through the request number and cache block number in the status items.

For example, when the data proxy module associates an unassociated cache block with a data access instruction selected from its data request list according to the instruction sending order, it performs the following operation: setting, in the status item corresponding to that instruction, the request number of the instruction and the cache block number of the unassociated cache block, thereby associating the two.
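An illustrative sketch of this association step, reusing the classes above (the status items are assumed to already sit in the instruction sending order):

```python
def associate_free_block(module, block_id):
    """Bind an unassociated cache block to the next instruction that has a
    request number but no cache block yet, scanning the status items in the
    instruction sending order."""
    for status in module.request_status_list:
        if status.request_id is not None and status.block_id is None:
            status.block_id = block_id
            return status.request_id
    return None  # every cached instruction already has a block
```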
For example, the level N data proxy module is configured to: send the cached data access instructions to the memory in turn, according to the instruction sending order of the instructions cached in its data request list; and receive the destination data returned from the memory in turn, storing each piece of returned destination data into the cache block associated with the corresponding data access instruction.

That is, when the first block of data starts to be transferred, the level N data proxy module sends to the memory the data access instruction that the instruction sending order designates as earliest, and stores the received destination data into the cache block associated with that instruction. If a cache block is still idle, the instruction designated as second is then sent to the memory, the received destination data is stored into its associated cache block, and so on.
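A sketch of this level N issuing step under the model above; `memory` is a plain dict standing in for the real memory, and the `sent` bookkeeping set is an assumed detail.

```python
def issue_to_memory(levelN, memory, send_order, sent):
    """Send the earliest not-yet-sent instruction that already has an
    associated (still empty) cache block, and store the destination data
    returned by `memory` into that block."""
    for request_id in send_order:
        if request_id in sent:
            continue
        status = next((s for s in levelN.request_status_list
                       if s.request_id == request_id), None)
        if status is None or status.block_id is None:
            continue  # not cached here, or no block associated yet
        levelN.cache_blocks[status.block_id] = memory[request_id]
        sent.add(request_id)
        return request_id
    return None  # nothing ready to issue
```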
For the level 1 through level N-1 data proxy modules, i.e., any level i data proxy module, the module is configured to: in response to a cache block in the idle state existing in it, send a data request signal to the electrically connected level i+1 data proxy module, and cache the returned data into the cache block associated with the data access instruction corresponding to the sent data request signal. Correspondingly, the level i+1 data proxy module is configured to: in response to receiving the data request signal, send the destination data of the corresponding data access instruction to the level i data proxy module that sent the signal.

For example, a cache block in the idle state is one that has an associated data access instruction but stores no data, where the associated instruction has not yet been sent; such a block can be identified through the data status information and the request sending status information.

For example, when the level i data proxy module finds a cache block in the idle state, it sends a data request signal NEED_DATA to the level i+1 data proxy module. Upon receiving NEED_DATA, the level i+1 data proxy module locates, in its cache blocks, the destination data of the data access instruction corresponding to the signal and sends it to the level i data proxy module, which caches the returned data into the idle cache block, i.e., the block associated with the instruction corresponding to the sent data request signal.
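The NEED_DATA exchange between adjacent levels might be sketched as follows; this is illustrative, and the actual signalling and error handling are omitted.

```python
def pull_from_next_level(level_i, level_i1, request_id):
    """Level i raises NEED_DATA for `request_id`; level i+1 answers with the
    destination data from its cache block and frees that block for
    re-association; level i stores the data in its own associated block."""
    donor = next(s for s in level_i1.request_status_list
                 if s.request_id == request_id)
    freed_block = donor.block_id
    data = level_i1.cache_blocks.pop(freed_block)  # data leaves level i+1
    donor.request_id = None
    donor.block_id = None  # freed_block may now be re-associated, e.g. via
                           # associate_free_block(level_i1, freed_block)
    receiver = next(s for s in level_i.request_status_list
                    if s.request_id == request_id)
    level_i.cache_blocks[receiver.block_id] = data  # cached at level i
    return data
```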
Through this process, the level N data proxy module caches the destination data read from the memory into its cache blocks, the level N-1 data proxy module obtains the destination data from the level N module and caches it into its own cache blocks, and so on, until the destination data is cached in the cache blocks of the level 1 data proxy module. Every data proxy module obtains the destination data according to the same instruction sending order.

In this way, a data access instruction can be sent ahead of time and the data in the memory cached in the level 1 data proxy module; when the destination data is needed, it can be read directly from the level 1 data proxy module rather than from the memory, reducing the data waiting time. Moreover, because the data is cached in the cache blocks of the level 1 data proxy module, sending the instruction early does not require occupying the destination register early, which shortens register occupation time, lowers resource consumption, and improves data access efficiency.

For example, the level 1 data proxy module is further configured to: in response to a data access instruction sent by a computing core electrically connected to its corresponding level 1 cache node, determine the cache block associated with that instruction and send the data in the associated block to the destination register indicated by the instruction; clear the instruction from its data request list and the corresponding status item from its request status list; and select at least one further data access instruction, in order, from the data access instruction sequence and perform the initialization operation on it.

For example, after the level 1 data proxy module has sent the data to the computing core, the data access instruction is cleared from the data request list and its status item is cleared from the request status list, after which the status item can be reused for another data access instruction. Then one data access instruction is selected from the data access instruction sequence, cached in the data request list, and initialized according to the process above, e.g., setting the status item, caching the instruction in the channel matching its physical distance, and synchronizing it to the other levels of data proxy modules; details are not repeated here.
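A sketch of this serve-and-refill step, continuing the illustrative model (the refill reuses the initialize_level1 helper assumed earlier):

```python
def serve_compute_core(level1, other_levels, request_id, remaining, registers):
    """Return the cached destination data to the destination register, clear
    the instruction and its status item, then pull one not-yet-cached
    instruction from the remaining sequence and initialize it."""
    status = next(s for s in level1.request_status_list
                  if s.request_id == request_id)
    entry = next(r for r in level1.data_request_list
                 if r.request_id == request_id)
    registers[entry.dest_register] = level1.cache_blocks.pop(status.block_id)
    level1.data_request_list.remove(entry)
    level1.request_status_list.remove(status)  # in hardware the slot is reused
    if remaining:                              # refill one instruction
        initialize_level1(level1, remaining[:1], other_levels, P=1)
        del remaining[0]
```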
For the other data proxy modules, taking the level i+1 data proxy module as an example, the module is further configured to: in response to the destination data cached in any of its cache blocks having been sent to the level i data proxy module, clear the data access instruction associated with that cache block from its data request list and clear the corresponding status item from its request status list.

For example, after clearing the information related to the data access instruction, the level i+1 data proxy module may associate the cache block with another not-yet-associated data access instruction in the data request list.
Fig. 7 is a schematic block diagram of a data processor provided in at least one embodiment of the present disclosure.
For example, the data processor shown in fig. 7 may be a general-purpose graphics processor or a graphics processor and includes a plurality of computing units. For the structure in fig. 7, refer to the related description of fig. 2, which is not repeated here.

For example, N=2: the level 1 cache of the N-level cache includes a plurality of level 1 cache nodes, each used for data sharing within one computing unit; each level 1 cache node may be an L1 cache or a shared memory in fig. 7. The level 2 cache includes one level 2 cache node, used for data sharing among all the computing units; it may be the L2 cache in fig. 7, i.e., the global cache.

For example, each level 1 cache node corresponds to one level 1 data proxy module, and the level 2 cache node corresponds to one level 2 data proxy module. Taking fig. 7 as an example, one level 1 data proxy module is provided in each computing unit and connected to all the computing cores in that unit, and one level 2 data proxy module is provided for the L2 cache and is electrically connected to the memory and to all the level 1 data proxy modules.

Taking the data processor shown in fig. 7 as an example, the specific process by which the multi-level data proxy modules schedule data access instructions and cache data is described below.
Fig. 8A to 8G are schematic views illustrating a processing procedure of a multi-stage data proxy module according to at least one embodiment of the present disclosure.
First, as shown in fig. 8A, the computing core initializes each level of data proxy module before starting to execute the task.

For example, initializing each level of data proxy module includes allocating cache blocks for each module, e.g., selecting some cache lines or memory lines from the level 1 cache node or shared memory to map as cache blocks of the level 1 data proxy module, and selecting some cache lines from the level 2 cache node to map as cache blocks of the level 2 data proxy module. In addition, the level 1 data proxy module selects data access instructions from the data access instruction sequence in order and initializes them. The initialization includes caching the instructions into the corresponding channels of the level 1 data proxy module's data request list, initializing the corresponding status items and storing them in its request status list, and synchronizing the instructions and status items to the other directly or indirectly electrically connected data proxy modules (i.e., the level 2 data proxy module). The specific process is as described above and not repeated here.

After each level of data proxy module is initialized, the data request list of each module caches the data access instructions to be sent, classified by channel; in addition, the status items corresponding to these instructions have been initialized and stored in the request status list of each module.

For example, as shown in fig. 8A, 5 data access instructions are cached in the first channel: req#0, req#1, req#2, req#3, and req#4 in fig. 8A. For example, the level 1 data proxy module includes a plurality of channels, each caching at most 5 data access instructions, and the instructions are selected from front to back in the data access instruction sequence. If the physical distance corresponding to the first instruction req#0 in the sequence is very small, e.g., its read address lies in the near-end memory, req#0 is cached into the first channel, and so on. For example, the first channel is a fast channel, and instructions req#0 through req#4 keep the same relative order as in the data access instruction sequence.

For example, as shown in fig. 8A, each data proxy module provides 2 cache blocks, cache block 0 and cache block 1; in each module, the initialization stage associates, e.g., data access instruction req#0 with cache block 0 and req#1 with cache block 1.
Then, as shown in fig. 8B, when the first block of data starts to be transferred, the level 2 data proxy module selects a data access instruction from its data request list according to the instruction sending order and sends it to the memory; e.g., it selects the first instruction req#0 in the first channel (the fast channel), sends it to the memory, receives the destination data Data#0 of req#0 returned by the memory, and stores it into cache block 0 of the level 2 data proxy module, the block associated with req#0.

Then, as shown in fig. 8C, the second block of data starts to be transferred, and the destination data Data#0 is returned to the level 1 data proxy module.

For example, since cache block 1 of the level 2 data proxy module is in the idle state, the scheduler of the level 2 data proxy module sends instruction req#1 to the memory, and the destination data Data#1 of req#1 returned by the memory is stored into cache block 1 of the level 2 data proxy module, the block associated with req#1.

For example, since cache block 0 of the level 1 data proxy module is in the idle state, the scheduler of the level 1 data proxy module sends a data request signal to the level 2 data proxy module indicating that the destination data of instruction req#0 is needed. Upon receiving the signal, the level 2 data proxy module sends the destination data Data#0 to the level 1 data proxy module, clears req#0 from its data request list, and clears the corresponding status item from its request status list; that is, cache block 0 of the level 2 data proxy module can now be associated with another data access instruction, e.g., req#2. After receiving Data#0, the level 1 data proxy module stores it into cache block 0, the block associated with req#0.
For example, as shown in fig. 8D, when the computing core sends instruction req#0 to the level 1 data proxy module, the module returns the data Data#0 in cache block 0 (the block associated with req#0) to the destination register. Then req#0 is cleared from the level 1 data proxy module's data request list and its status item is cleared from the request status list; that is, cache block 0 can be associated with another data access instruction, e.g., req#2.

Then, as shown in fig. 8E, the level 1 data proxy module selects the next instruction req#5 from the data access instruction sequence and caches it into its data request list; e.g., the physical distance corresponding to req#5 is small, so it is classified into the first channel. Meanwhile, the status item corresponding to req#5 is initialized and stored into the level 1 data proxy module's request status list, and req#5 and its status item are synchronized to the level 2 data proxy module.

For example, as shown in fig. 8E, since cache block 1 of the level 1 data proxy module is in the idle state, its scheduler sends to the level 2 data proxy module a data request signal indicating that the destination data of instruction req#1 is needed. Upon receiving the signal, the level 2 data proxy module sends the destination data Data#1 to the level 1 data proxy module, clears req#1 from its data request list, and clears the corresponding status item from its request status list; that is, cache block 1 of the level 2 data proxy module can be associated with another instruction, e.g., req#3. After receiving Data#1, the level 1 data proxy module stores it into cache block 1, the block associated with req#1.

As shown in fig. 8E, since cache block 0 of the level 2 data proxy module is in the idle state, the scheduler of the level 2 data proxy module sends instruction req#2 to the memory, receives the destination data Data#2 of req#2 returned by the memory, and stores it into cache block 0 of the level 2 data proxy module, the block associated with req#2.
For example, as shown in fig. 8F, when the computing core sends instruction req#1 to the level 1 data proxy module, the module returns the data Data#1 in cache block 1 (the block associated with req#1) to the destination register. Then req#1 is cleared from the level 1 data proxy module's data request list and its status item is cleared from the request status list; that is, cache block 1 can be associated with another data access instruction, e.g., req#3.

Then, as shown in fig. 8G, the level 1 data proxy module selects the next instruction req#6 from the data access instruction sequence and caches it into its data request list; e.g., the physical distance corresponding to req#6 is small, so it is classified into the first channel. Meanwhile, the status item corresponding to req#6 is initialized and stored into the level 1 data proxy module's request status list, and req#6 and its status item are synchronized to the level 2 data proxy module.

For example, as shown in fig. 8G, since cache block 0 of the level 1 data proxy module is in the idle state, its scheduler sends to the level 2 data proxy module a data request signal indicating that the destination data of instruction req#2 is needed. Upon receiving the signal, the level 2 data proxy module sends the destination data Data#2 to the level 1 data proxy module, clears req#2 from its data request list, and clears the corresponding status item from its request status list; that is, cache block 0 of the level 2 data proxy module can be associated with another instruction, e.g., req#4. After receiving Data#2, the level 1 data proxy module stores it into cache block 0, the block associated with req#2.

For example, as shown in fig. 8G, since cache block 1 of the level 2 data proxy module is in the idle state, the scheduler of the level 2 data proxy module sends instruction req#3 to the memory, and the destination data Data#3 of req#3 returned by the memory is stored into cache block 1 of the level 2 data proxy module, the block associated with req#3.

The above process repeats, so that the data access instructions are sent in turn according to the instruction sending order and the destination data returned by each instruction is cached into the corresponding level 1 data proxy module, e.g., into the level 1 cache node corresponding to that module. Thus, even if a data access instruction is sent early, the register need not be occupied early, which shortens register occupation time; and even if the instruction is sent late, e.g., only when the destination data is about to be used, the data only needs to be read from the level 1 data proxy module, so the overall waiting time is short and data access efficiency is improved.
In some embodiments, the instruction sending order is determined by the original order of the data access instruction sequence and by the distance between the storage location of each instruction's destination data and the computing core: near requests are sent first and far requests later, preventing slow requests from squeezing out the bandwidth of fast requests as far as possible. In other embodiments, the instruction sending order additionally takes the priority of each instruction into account, so that high-priority instructions can be sent as early as possible and their return times match the user's expectations.
Fig. 9A is a timing diagram of a data processor according to at least one embodiment of the present disclosure.
As shown in fig. 9A, following the process described above, the computing core caches the destination data from the memory into the L1 cache at an earlier time; even if the data access instruction is sent late, the destination data only needs to be read from the level 1 data proxy module, so the overall waiting time is shorter than in the timing shown in fig. 4A.

In addition, because the destination data only needs to be read from the level 1 data proxy module, as shown in fig. 9A, the register occupation time is greatly reduced compared with the timings shown in figs. 4B and 4C, lowering resource consumption and improving data access efficiency.
Fig. 9B is a timing diagram of a data processor according to at least one embodiment of the present disclosure.
As shown in fig. 9B, following the process described above, the computing core caches the destination data from the memory into the L1 cache at an earlier time, and the register occupation time is shorter than in the timing shown in fig. 4D. In this method, the data access instructions are sent channel by channel: instructions close to the computing core are sent first and distant ones later, preventing slow requests from squeezing out the bandwidth of fast requests as far as possible. In addition, a priority can be set for data access requests, so that high-priority instructions are sent as early as possible and the data return time matches the user's expectations.
It should be noted that the components and structures of the data processor 100 shown in fig. 5 and the like are exemplary only and not limiting, as the data processor 100 may have other components and structures as desired.
For example, these modules may be implemented by hardware (e.g., circuit) modules, software modules, or any combination of the two; the same applies to the following embodiments and is not repeated.

For example, the data processor may be a general-purpose processor such as a central processing unit (CPU), a graphics processor, a general-purpose graphics processor, or a digital signal processor, or a special-purpose processor such as a tensor processor (TPU) or a neural network processor (NPU), or another form of processing unit having data processing and/or instruction execution capability, together with the corresponding computer instructions implementing the units.

It should be noted that in the embodiments of the present disclosure, the data processor 100 may include more or fewer circuits or units, and the connection relationships between them are not limited and may be determined according to actual requirements. The specific configuration of each circuit or unit is not limited; each may be composed of analog devices according to the circuit principle, of digital chips, or in another applicable manner.
At least one embodiment of the present disclosure also provides a data processing method. Fig. 10 is a schematic flow chart of a data processing method provided in at least one embodiment of the present disclosure.
For example, the data processing method is used for a data processor comprising an N-level cache and a memory.
For example, the data processor may be a Central Processing Unit (CPU), a graphics processor, a general purpose graphics processor, a Tensor Processor (TPU), a neural Network Processor (NPU), etc., which is not particularly limited by the present disclosure.
For example, the data processor may employ a similar architecture as shown in fig. 1 or fig. 2, as previously described.
The level i cache of the N-level cache includes at least 1 level i cache node, at least some of which share one level i+1 cache node, where N is a positive integer greater than 1 and i is any positive integer from 1 to N-1.

The level N cache of the N-level cache is electrically connected to the memory, and each level 1 cache node is electrically connected to the corresponding computing core.

For the details of the N-level cache, the memory, and the computing cores, refer to the related description of the data processor above; repetition is omitted.
As shown in fig. 10, the data processing method provided in at least one embodiment of the present disclosure at least includes steps S10 to S40.
In step S10, a data access instruction sequence in a task currently executed by the data processor is acquired.
For example, the data access instruction sequence is obtained by arranging all the data access instructions in the task according to their order of appearance in the program corresponding to the task.

For the details of acquiring the data access instruction sequence, refer to the related description of the data processor above; repetition is omitted.

In step S20, the instruction sending order of the data access instruction sequence is adjusted according to the storage location of the destination data of each data access instruction in the sequence.
For example, in some embodiments, step S20 may include: obtaining the physical distance corresponding to each data access instruction, where the physical distance corresponding to a data access instruction is the physical distance between the storage position of its destination data in the memory and the computing core in the data processor; and adjusting the instruction sending order of the data access instruction sequence according to the physical distance corresponding to each data access instruction, where a data access instruction with a smaller corresponding physical distance is sent earlier.
For example, the physical distance corresponding to each data access instruction may be obtained from the read address in the data access instruction.
The instruction sending order of the data access instruction sequence is adjusted according to the different physical distances corresponding to the data access instructions; that is, on the basis of the original data access instruction sequence, the sending order is rearranged by physical distance. The smaller the corresponding physical distance, the earlier the data access instruction is sent, i.e., the earlier it is in the instruction sending order; the larger the corresponding physical distance, the later it is sent. Data access instructions with the same corresponding physical distance are sent in their relative order within the data access instruction sequence.
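For illustration, the reordering described above behaves like a stable sort keyed on the physical distance. The following C++ sketch is illustrative only; the `AccessInstr` type and its fields are assumed names, not structures defined in this disclosure. `std::stable_sort` preserves the relative program order of instructions with equal distance, matching the rule above.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative (assumed) representation of one data access instruction.
struct AccessInstr {
    uint64_t read_addr;         // read address of the destination data
    uint32_t physical_distance; // distance between the storage position of the
                                // destination data and the computing core,
                                // derived from read_addr
};

// Reorder the sequence so that instructions with a smaller physical distance
// are sent earlier; std::stable_sort keeps the relative program order of
// instructions whose distances are equal.
void reorder_by_distance(std::vector<AccessInstr>& seq) {
    std::stable_sort(seq.begin(), seq.end(),
                     [](const AccessInstr& a, const AccessInstr& b) {
                         return a.physical_distance < b.physical_distance;
                     });
}
```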
For example, adjusting the instruction sending order of the data access instruction sequence according to the physical distance corresponding to each data access instruction may include: classifying the data access instructions into different channels according to their corresponding physical distances, where the physical distances corresponding to the data access instructions in different channels differ; and arranging the data access instructions channel by channel, in order of increasing physical distance, to obtain the instruction sending order. In the instruction sending order, the data access instructions belonging to a first channel are sent first and the data access instructions belonging to a second channel are sent last, where the physical distance corresponding to the data access instructions in the first channel is the smallest and that corresponding to the data access instructions in the second channel is the largest; data access instructions belonging to the same channel are sent one after another in their relative order within the data access instruction sequence.
For example, a plurality of channels may be set in the data proxy module, and the data access instructions are classified into the different channels according to their corresponding physical distances. For example, data access instructions whose destination data lies in the same memory block may be classified into one channel, or data access instructions whose destination data lies in several memory blocks relatively close to one another may be placed in one channel.
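As a hedged sketch of this channel-based classification, reusing the assumed `AccessInstr` type from the previous sketch: the banding rule (one channel per fixed distance band, set by `distance_per_channel`) is an illustrative assumption; the disclosure only requires that data access instructions in different channels correspond to different physical distances.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Partition instructions into channels by distance band, then drain the
// channels from nearest to farthest; program order is kept within a channel.
// num_channels must be >= 1.
std::vector<AccessInstr> build_send_order(const std::vector<AccessInstr>& seq,
                                          std::size_t num_channels,
                                          uint32_t distance_per_channel) {
    std::vector<std::vector<AccessInstr>> channels(num_channels);
    for (const AccessInstr& in : seq) {
        // Instructions whose destination data lies in the same (or nearby)
        // memory blocks fall into the same channel.
        std::size_t ch = std::min<std::size_t>(
            in.physical_distance / distance_per_channel, num_channels - 1);
        channels[ch].push_back(in);
    }
    std::vector<AccessInstr> order;
    for (const auto& channel : channels)  // first (nearest) channel sent first,
        order.insert(order.end(), channel.begin(), channel.end());
    return order;                         // second (farthest) channel sent last
}
```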
For more on adjusting the instruction sending order of the data access instruction sequence according to the physical distance corresponding to each data access instruction, reference may be made to the related description of the data processor above; repeated descriptions are omitted.
In the above embodiment, channels are introduced to classify and manage the data access instructions: the instructions are classified into different channels according to the distance between the storage position of the destination data to be loaded and the computing core, instructions with a small physical distance are sent preferentially, and instructions with a large physical distance are sent afterwards. Because data access instructions corresponding to different physical distances are sent separately, long-distance requests are prevented, as far as possible, from squeezing the bandwidth of short-distance requests, so that the return time of the destination data meets user expectations as far as possible.
For example, in other embodiments, step S20 may include: acquiring the priority of each data access instruction; obtaining the physical distance corresponding to each data access instruction, where the physical distance corresponding to a data access instruction is the physical distance between the storage position of its destination data in the memory and the computing core in the data processor; and adjusting the instruction sending order of the data access instruction sequence according to the priority and the corresponding physical distance of each data access instruction, where a data access instruction with a higher priority is sent earlier and, among data access instructions of the same priority, one with a smaller corresponding physical distance is sent earlier.
For example, acquiring the priority of each data access instruction may include: determining the priority of each data access instruction according to the time at which its destination data is returned when the task is pre-run, where, during the pre-run, the data access instructions are sent one by one in the order of the data access instruction sequence and the return time of the destination data of each data access instruction is collected.
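A sketch of deriving priorities from such a pre-run is given below. The ranking rule, namely that instructions whose destination data returned later in the profiled run receive a higher priority so that they can be issued earlier in the real run, is an assumption made for illustration; the disclosure only states that the priority is determined from the collected return times.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// return_time[k]: measured return time (e.g., in cycles) of the destination
// data of the k-th instruction when the task was pre-run in program order.
std::vector<uint32_t> priorities_from_prerun(
        const std::vector<uint64_t>& return_time) {
    std::vector<std::size_t> idx(return_time.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Rank instructions by descending return time (latest return first).
    std::stable_sort(idx.begin(), idx.end(),
                     [&](std::size_t a, std::size_t b) {
                         return return_time[a] > return_time[b];
                     });
    std::vector<uint32_t> prio(return_time.size());
    for (std::size_t rank = 0; rank < idx.size(); ++rank)
        prio[idx[rank]] = static_cast<uint32_t>(idx.size() - rank); // larger = higher
    return prio;
}
```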
For more details on acquiring the priority of each data access instruction, reference may be made to the related description of the data processor above; repeated descriptions are omitted.
In the adjusted instruction sending order, the priority of a data access instruction is considered first and the channel second. For example, a data access instruction with a higher priority is sent earlier, and it is sent preferentially even if it belongs to a slow channel, while the other data access instructions (e.g., of low priority but in a fast channel) are sent after a delay of the corresponding clock cycles. Data access instructions of the same priority are sent in order of increasing corresponding physical distance; specifically, the data access instructions may be classified into different channels as described above and arranged channel by channel, in order of increasing physical distance, to obtain the instruction sending order.
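A minimal sketch of this combined rule, extending the earlier `AccessInstr` with an assumed `priority` field: a stable sort keyed first on descending priority and then on ascending physical distance reproduces the order described above, with full ties kept in program order.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Assumed pairing of an instruction with its pre-run priority.
struct PrioritizedInstr {
    AccessInstr instr;
    uint32_t priority;  // e.g., from priorities_from_prerun() above
};

void reorder_by_priority_then_distance(std::vector<PrioritizedInstr>& seq) {
    std::stable_sort(
        seq.begin(), seq.end(),
        [](const PrioritizedInstr& a, const PrioritizedInstr& b) {
            if (a.priority != b.priority)
                return a.priority > b.priority;  // higher priority sent earlier
            // Same priority: smaller physical distance sent earlier.
            return a.instr.physical_distance < b.instr.physical_distance;
        });
}
```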
In this embodiment, a priority is set for each data access instruction, so that high-priority instructions can be sent as early as possible and their return times meet user expectations. In addition, the data access instructions are sent over multiple channels, with the instructions in fast and slow channels sent separately, so that the slower, longer-latency requests from the slow channels are prevented, as far as possible, from squeezing the bandwidth of the faster, shorter-latency requests from the fast channels.
For the specific details of adjusting the instruction sending order of the data access instruction sequence according to priority and physical distance, reference may be made to the related description of the data processor above; repeated descriptions are omitted.
In step S30, the data access instructions are sent sequentially according to the instruction sending order.
For example, each data access instruction is sequentially sent to the memory according to the instruction sending sequence.
In step S40, the destination data returned by each data access instruction is cached to a level 1 cache node or a shared memory connected to the computing core using the destination data.
For example, the level 1 cache node may be an L1 cache and the shared memory may be the shared memory in the computing unit shown in fig. 2.
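Step S40 can be pictured with the following sketch under assumed data structures: each in-flight data access instruction is associated with a cache block in the level 1 data proxy module, and that block is mapped onto a line of the L1 cache node or a slot of the shared memory next to the consuming computing core. All names here are illustrative, not part of this disclosure.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// A cache block whose backing storage is mapped onto an L1 cache line or a
// shared-memory slot close to the computing core (assumed layout).
struct CacheBlock {
    std::vector<uint8_t> bytes;
};

class Level1DataProxy {
public:
    // Called when the destination data of request `req_id` returns from a
    // lower cache level or from the memory: the data is parked in the cache
    // block associated with that request, i.e., right next to the core.
    void on_data_return(uint32_t req_id, const std::vector<uint8_t>& data) {
        blocks_.at(assoc_.at(req_id)).bytes = data;
    }

private:
    std::vector<CacheBlock> blocks_;
    std::unordered_map<uint32_t, std::size_t> assoc_;  // request id -> block index
};
```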
According to the data processing method provided by at least one embodiment of the present disclosure, destination data can be cached in advance in the storage component closest to the computing core, for example a level 1 cache node (such as an L1 cache) or a shared memory. Even if a data access instruction is sent late, the destination data then only needs to be read from the level 1 cache node or the shared memory, so the overall waiting time is short, the time for which registers are occupied is greatly reduced, resource consumption is lowered, and data access efficiency is improved. In this data processing method, data access instructions are sent channel by channel: instructions close to the computing core are sent first and instructions far from the computing core are sent later, preventing slow requests, as far as possible, from squeezing the bandwidth of fast requests. In addition, a priority may be set for each data access request, so that a high-priority data access instruction can be sent as early as possible and the data return time meets user expectations.
For example, the data processor further includes an N-level data proxy module, where each level of the N-level data proxy module corresponds one-to-one with a level of the N-level cache. For example, the N-level data proxy module may be used to implement steps S10 to S40 described above; for the specific process, reference may be made to the related description of the data processor above, and repeated descriptions are omitted.
Of course, the present disclosure is not limited thereto, and steps S10 to S40 may also be implemented in other manners, for example by hardware such as other module structures or by software such as program code, which is not particularly limited by the present disclosure.
Fig. 11 is a schematic block diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 11, the electronic device 300 is suitable for use, for example, to implement the data processing method provided by the embodiments of the present disclosure. It should be noted that the components of the electronic device 300 shown in fig. 11 are exemplary only and not limiting, and that the electronic device 300 may have other components as desired for practical applications.
As shown in fig. 11, the electronic device 300 may include a processing apparatus (e.g., central processing unit, graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with non-transitory computer readable instructions stored in memory to achieve various functions.
For example, the computer readable instructions, when executed by the processing device 301, may perform one or more steps of the data processing method according to any of the embodiments described above. For a detailed description of the processing procedure, reference may be made to the related description in the embodiments of the data processing method; repeated descriptions are omitted.
For example, the memory may comprise any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM) 303 and/or cache memory (cache); computer readable instructions may be loaded from the storage device 308 into the Random Access Memory (RAM) 303 for execution. Non-volatile memory may include, for example, Read-Only Memory (ROM) 302, a hard disk, Erasable Programmable Read-Only Memory (EPROM), portable Compact Disc Read-Only Memory (CD-ROM), USB memory, and flash memory. Various applications and various data, such as style images and various data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
For example, the processing device 301, the read only memory 302, and the random access memory 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the input/output interface 305: an input device 306 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, or a gyroscope; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, or a vibrator; a storage device 308 including, for example, a magnetic tape, a hard disk, or flash memory; and a communication device 309. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other electronic devices to exchange data. While fig. 11 shows the electronic device 300 with various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided, and the electronic device 300 may alternatively implement or provide more or fewer devices. For example, the processor 301 may control other components in the electronic device 300 to perform desired functions. The processor 301 may be a central processing unit, a tensor processor, or a graphics processor having data processing and/or program execution capabilities. The central processing unit may have an X86 or ARM architecture, etc. The graphics processor may be separately integrated directly onto the motherboard, or built into the north bridge chip of the motherboard; the graphics processor may also be built into the central processing unit.
The technical effects of the electronic device 300 shown in fig. 11 are the same as those of the data processing method provided in the present disclosure, and will not be repeated here.
Fig. 12 is a schematic diagram of a non-transitory computer readable storage medium according to at least one embodiment of the present disclosure. For example, as shown in fig. 12, the storage medium 400 may be a non-transitory computer-readable storage medium, and one or more computer-readable instructions 401 may be stored non-transitorily on the storage medium 400. For example, the computer readable instructions 401, when executed by a processor, may perform the data processing method according to any of the embodiments described above.
For example, the storage medium 400 may be applied to the above-described electronic device, and for example, the storage medium 400 may include a memory in the electronic device 300.
For example, the storage medium may include a memory card of a smart phone, a memory component of a tablet computer, a hard disk of a personal computer, random access memory, read only memory, erasable programmable read only memory, portable compact disc read only memory, flash memory, or any combination of the above storage media, as well as other suitable storage media.
For example, for a description of the storage medium 400, reference may be made to the description of the memory in the embodiment of the electronic device; repeated descriptions are omitted.
Some embodiments of the present disclosure also provide another electronic device. Fig. 13 is a schematic block diagram of another electronic device provided in accordance with at least one embodiment of the present disclosure.
For example, as shown in fig. 13, the electronic device 500 may include a data processor 501 as described in any embodiment of the present disclosure. For example, the data processor 501 may be the aforementioned data processor 100; repeated descriptions are omitted.
It should be noted that the components of the electronic device 500 shown in fig. 13 are only exemplary and not limiting, and the electronic device 500 may also have other components as needed for practical applications, for example more or fewer circuits or units, including other circuits or units that support the operation of the data processor 501; the present disclosure is not particularly limited in this respect.
The connection relationships between the circuits or units are not limited and may be determined according to actual requirements. The specific configuration of each circuit or unit is not limited; each may be constituted by analog devices according to circuit principles, by digital chips, or in other applicable manners.
Those skilled in the art will appreciate that various modifications and improvements can be made to the disclosure. For example, the various devices or components described above may be implemented in hardware, or may be implemented in software, firmware, or a combination of some or all of the three.
Further, while the present disclosure makes various references to certain elements in a system according to embodiments of the present disclosure, any number of different elements may be used and run on a client and/or server. The units are merely illustrative and different aspects of the systems and methods may use different units.
A flowchart is used in this disclosure to describe the steps of the method according to the embodiments of the present disclosure. It should be understood that the preceding or following steps are not necessarily performed in exact order; rather, the various steps may be processed in reverse order or simultaneously, and other operations may be added to these processes.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods described above may be implemented by a computer program instructing related hardware, and the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disc. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, each module/unit in the above embodiments may be implemented in the form of hardware or in the form of a software functional module. The present disclosure is not limited to any specific combination of hardware and software.
Unless defined otherwise, all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The foregoing is illustrative of the present disclosure and is not to be construed as limiting thereof. Although a few exemplary embodiments of this disclosure have been described, those skilled in the art will readily appreciate that many modifications are possible in the exemplary embodiments without materially departing from the novel teachings and advantages of this disclosure. Accordingly, all such modifications are intended to be included within the scope of this disclosure as defined in the claims. It is to be understood that the foregoing is illustrative of the present disclosure and is not to be construed as limited to the specific embodiments disclosed, and that modifications to the disclosed embodiments, as well as other embodiments, are intended to be included within the scope of the appended claims. The disclosure is defined by the claims and their equivalents.

Claims (27)

1. A data processor comprising an N-level cache and a memory, wherein an i-th level cache in the N-level cache comprises at least 1 i-th level cache node, at least part of the i-th level cache nodes in the at least 1 i-th level cache nodes share one (i+1)-th level cache node, N is a positive integer greater than 1, and i is any positive integer between 1 and N-1,
an N-th level cache in the N-level cache is electrically connected with the memory, and each level 1 cache node is electrically connected with a corresponding computing core,
The data processor also comprises an N-level data proxy module, wherein each level of data proxy module in the N-level data proxy module corresponds to each level of cache in the N-level cache one by one,
The N-level data proxy module is configured to acquire a data access instruction sequence in a task currently executed by the data processor, adjust an instruction sending sequence of the data access instruction sequence according to a storage position of destination data of each data access instruction in the data access instruction sequence in the memory, send each data access instruction in sequence according to the instruction sending sequence, and cache destination data returned by each data access instruction into a corresponding 1 st-level data proxy module, wherein the corresponding 1 st-level data proxy module corresponds to the 1 st-level cache.
2. The data processor of claim 1, wherein the N-level data proxy module performs the following operations when adjusting the instruction sending order of the data access instruction sequence according to the storage location in the memory of the destination data of each data access instruction in the data access instruction sequence:
Obtaining a physical distance corresponding to each data access instruction, wherein the physical distance corresponding to each data access instruction is the distance between the storage position of the destination data of each data access instruction in the memory and the computing core in the data processor;
and adjusting the instruction sending sequence according to the physical distance corresponding to each data access instruction, wherein the sending time of the data access instruction is earlier when the corresponding physical distance is smaller.
3. The data processor of claim 2, wherein the N-level data proxy module, when adjusting the instruction sending order of the data access instruction sequence according to the physical distance corresponding to each data access instruction, performs the following operations:
classifying the data access instructions into different channels according to the physical distances corresponding to the data access instructions, wherein the physical distances corresponding to the data access instructions in the different channels are different;
according to the order of the physical distance from small to large corresponding to the data access instructions in each channel, the data access instructions in each channel are arranged by taking the channel as a unit to obtain the instruction sending order,
In the instruction sending sequence, the data access instruction belonging to the first channel is sent first, the data access instruction belonging to the second channel is sent last, the data access instructions belonging to the same channel are sent in series according to the relative sequence relation in the data access instruction sequence, the physical distance corresponding to the data access instruction in the first channel is minimum, and the physical distance corresponding to the data access instruction in the second channel is maximum.
4. The data processor of claim 2, wherein the N-level data proxy module, when adjusting the instruction sending order of the data access instruction sequence according to the storage position, in the memory, of the destination data of each data access instruction in the data access instruction sequence, performs the following operations:
acquiring the priority of each data access instruction;
Obtaining a physical distance corresponding to each data access instruction, wherein the physical distance corresponding to each data access instruction is the distance between the storage position of the destination data of each data access instruction in the memory and the computing core in the data processor;
And adjusting the instruction sending sequence according to the priorities and the corresponding physical distances of the data access instructions, wherein in the instruction sending sequence, the sending time of the data access instruction with higher priority is earlier, and the sending time of the data access instruction with smaller corresponding physical distance is earlier under the condition of the same priority.
5. The data processor of claim 4, wherein the N-level data proxy module, when acquiring the priority of each data access instruction, performs the following operations:
determining the priority of each data access instruction according to the time at which each data access instruction returns its destination data when the task is pre-run, wherein, during the pre-run, the data access instructions are sent sequentially in the order of the data access instruction sequence and the return time of the destination data of each data access instruction is collected.
6. The data processor of claim 1, wherein the N-level data proxy module, when caching the destination data returned by each data access instruction into the corresponding level 1 data proxy module, performs the following operations:
for each data access instruction, caching the destination data returned by the data access instruction into a level 1 cache node or a shared memory connected with the computing core using the destination data, wherein the corresponding level 1 data proxy module comprises a plurality of cache blocks, the plurality of cache blocks are mapped to the level 1 cache node or the shared memory, and the plurality of cache blocks are used for caching the received destination data.
7. The data processor of any one of claims 1 to 6, wherein the at least 1 i-th level cache node corresponds one-to-one with at least 1 i-th level data proxy module, and the i-th level data proxy modules corresponding to the at least part of the i-th level cache nodes are electrically connected to the (i+1)-th level data proxy module corresponding to the one (i+1)-th level cache node,
the level 1 data proxy module corresponding to each level 1 cache node is connected with at least one computing core, wherein the at least one computing core is electrically connected with the level 1 cache node;
and the N-th level data proxy module corresponding to the N-th level cache is electrically connected with the memory.
8. The data processor of claim 7, wherein each data proxy module comprises a scheduler, a request state list, a data request list, and a plurality of cache blocks, the data proxy module being any one of an i-th level data proxy module or the N-th level data proxy module,
the data request list is used for caching data access instructions to be sent,
the scheduler is used for sequentially sending the data access instructions to be sent that are cached in the data request list according to the instruction sending order, and for sending the destination data received by the data proxy module to a corresponding computing core or to a cache block in an electrically connected upper-level data proxy module,
the request state list comprises a plurality of state items, each of which is used for indicating the cache block associated with the data access instruction corresponding to the state item and the position of that data access instruction in the instruction sending order,
and the plurality of cache blocks are used for caching the destination data received by the data proxy module.
9. The data processor of claim 8, wherein a plurality of channels are arranged in the data request list for caching the data access instructions to be sent, the physical distances corresponding to the data access instructions in different channels are different,
The scheduler is configured to arrange the data access instructions in each channel, channel by channel and in order of increasing physical distance corresponding to the data access instructions in the channels, to obtain the instruction sending order, and to sequentially send the data access instructions to be sent according to the instruction sending order,
The physical distance corresponding to each data access instruction is the distance between the storage position of the destination data of each data access instruction in the memory and the computing core in the data processor, and the physical distance corresponding to each data access instruction is determined from the read address of the destination data serving as an input parameter of each data access instruction.
10. The data processor of claim 8, wherein the data proxy module is configured to:
In response to an unassociated cache block existing in the data proxy module, associating the unassociated cache block with a data access instruction selected from the data request list according to the instruction sending order.
11. The data processor of claim 10, wherein each cache block has a unique corresponding cache block number, each data access instruction in the data request list has a unique corresponding request number,
The data proxy module performs the following operations when associating the unassociated cache block with a data access instruction selected from the data request list according to the instruction sending order:
And setting a request number of the data access instruction and a cache block number of the unassociated cache block in a state item corresponding to the data access instruction so as to associate the unassociated cache block with the data access instruction.
12. The data processor of claim 8, wherein the (i+1)-th level data proxy module is further configured to:
in response to the destination data cached in any cache block of the (i+1)-th level data proxy module having been sent to the i-th level data proxy module, clear the data access instruction associated with that cache block from the data request list of the (i+1)-th level data proxy module, and clear the state item corresponding to the cleared data access instruction in the request state list of the (i+1)-th level data proxy module.
13. The data processor of claim 8, wherein the scheduler of the N-th level data proxy module is configured to:
sequentially send the data access instructions cached in the data request list of the N-th level data proxy module to the memory according to the instruction sending order;
and sequentially receive the destination data returned from the memory, and store each returned destination data into the cache block associated with the corresponding data access instruction.
14. The data processor of claim 8, wherein the scheduler of the i-th level data proxy module corresponding to any one of the at least part of the i-th level cache nodes is configured to:
send a data request signal to the electrically connected (i+1)-th level data proxy module, and cache the received return data into the cache block associated with the data access instruction corresponding to the data request signal;
and the scheduler of the (i+1)-th level data proxy module is configured to:
in response to receiving the data request signal, send the destination data of the data access instruction corresponding to the data request signal to the i-th level data proxy module.
15. The data processor of claim 8, wherein the level 1 data proxy module is further configured to:
before the task is executed, select at least one data access instruction from the data access instruction sequence according to the order of the data access instruction sequence, and perform an initialization operation on the at least one data access instruction, wherein the at least one data access instruction is to be executed by a computing core electrically connected with the level 1 data proxy module.
16. The data processor of claim 15, wherein the level 1 data proxy module, when performing the initialization operation on the at least one data access instruction, performs the following operations:
caching the at least one data access instruction into the data request list of the level 1 data proxy module;
initializing a state item corresponding to the at least one data access instruction and storing the state item into the request state list of the level 1 data proxy module; and
synchronizing the at least one data access instruction and the corresponding state item to other levels of data proxy modules having a direct or indirect electrical connection relationship with the level 1 data proxy module.
17. The data processor of claim 8, wherein the level 1 data proxy module is further configured to:
in response to receiving a data access instruction sent by a computing core electrically connected with the level 1 data proxy module, determine the cache block associated with the sent data access instruction, and send the data in the associated cache block to the destination register indicated by the sent data access instruction;
clear the sent data access instruction from the data request list of the level 1 data proxy module, and clear the state item corresponding to the sent data access instruction in the request state list of the level 1 data proxy module;
and select at least one data access instruction from the data access instruction sequence according to the order of the data access instruction sequence, and perform an initialization operation on the at least one data access instruction.
18. The data processor of claim 8, wherein,
The state item comprises a cache block number and a request number, which are used for indicating the cache block associated with the data access instruction corresponding to the state item and the data access instruction corresponding to the state item,
The state item also comprises a channel number of a channel to which the data access instruction corresponding to the state item belongs and a priority of the data access instruction corresponding to the state item, the channel number and the priority are used for indicating the position of the data access instruction corresponding to the state item in the instruction sending sequence,
The state item also comprises request receiving state information and request sending state information, which are used for indicating the current state of the data access instruction corresponding to the state item,
The state item further comprises data state information, and the data state information is used for indicating the current state of the destination data returned by the data access instruction corresponding to the state item.
19. The data processor of claim 8, wherein, prior to executing the task, the data processor is further configured to:
select a plurality of cache lines from the cache node corresponding to each data proxy module to be mapped as the cache blocks in the data proxy module.
20. A data processor according to any one of claims 1 to 6, wherein the sequence of data access instructions is obtained by arranging all data access instructions in the task according to a sequential positional relationship occurring in a program to which the task corresponds.
21. The data processor of claim 7, wherein the data processor is a general-purpose graphics processor or a graphics processor, the data processor comprising a plurality of computing units,
in response to N=2, a level 1 cache of the N-level cache comprises a plurality of level 1 cache nodes, and a level 2 cache of the N-level cache comprises 1 level 2 cache node,
each level 1 cache node is used for data sharing within one computing unit, and the level 2 cache node is used for data sharing among the plurality of computing units,
each level 1 cache node corresponds to one level 1 data proxy module, and the level 2 cache node corresponds to one level 2 data proxy module.
22. A data processing method for a data processor comprising an N-level cache and a memory, wherein
the i-th level cache in the N-level cache comprises at least 1 i-th level cache node, at least part of the i-th level cache nodes in the at least 1 i-th level cache nodes share one (i+1)-th level cache node, N is a positive integer greater than 1, and i is any positive integer between 1 and N-1,
an N-th level cache in the N-level cache is electrically connected with the memory, and each level 1 cache node is electrically connected with a corresponding computing core,
The data processing method comprises the following steps:
acquiring a data access instruction sequence in a task currently executed by the data processor;
according to the storage position of the destination data of each data access instruction in the data access instruction sequence in the memory, adjusting the instruction sending sequence of the data access instruction sequence;
Sequentially sending the data access instructions according to the instruction sending sequence;
and caching the destination data returned by each data access instruction into a level 1 cache node or a shared memory connected with the computing core using the destination data.
23. The data processing method according to claim 22, wherein adjusting the instruction sending order of the data access instruction sequence according to the storage position of the destination data of each data access instruction in the data access instruction sequence in the memory includes:
Obtaining a physical distance corresponding to each data access instruction, wherein the physical distance corresponding to each data access instruction is the distance between the storage position of the destination data of each data access instruction in the memory and the computing core in the data processor;
and adjusting the instruction sending sequence according to the physical distance corresponding to each data access instruction, wherein in the instruction sending sequence, the sending time of the data access instruction with smaller corresponding physical distance is earlier.
24. The data processing method according to claim 22, wherein adjusting the instruction sending order of the data access instruction sequence according to the storage position of the destination data of each data access instruction in the data access instruction sequence in the memory includes:
acquiring the priority of each data access instruction;
Obtaining a physical distance corresponding to each data access instruction, wherein the physical distance corresponding to each data access instruction is the distance between the storage position of the destination data of each data access instruction in the memory and the computing core in the data processor;
And adjusting the instruction sending sequence according to the priorities and the corresponding physical distances of the data access instructions, wherein in the instruction sending sequence, the sending time of the data access instruction with higher priority is earlier, and the sending time of the data access instruction with smaller corresponding physical distance is earlier under the condition of the same priority.
25. An electronic device, comprising:
a memory non-transitory storing computer-executable instructions;
a processor configured to execute the computer-executable instructions,
Wherein the computer executable instructions, when executed by the processor, implement the data processing method according to any of claims 22-24.
26. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions,
The computer executable instructions, when executed by a processor, implement the data processing method according to any of claims 22-24.
27. An electronic device comprising a data processor according to any of claims 1-21.
CN202410420140.8A 2024-04-09 Data processor, data processing method, electronic device, and storage medium Active CN118012788B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410420140.8A CN118012788B (en) 2024-04-09 Data processor, data processing method, electronic device, and storage medium

Publications (2)

Publication Number Publication Date
CN118012788A true CN118012788A (en) 2024-05-10
CN118012788B CN118012788B (en) 2024-06-28

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175136A (en) * 2018-12-30 2019-08-27 成都海光集成电路设计有限公司 Buffer memory management method, buffer and storage medium
CN111143242A (en) * 2018-11-02 2020-05-12 华为技术有限公司 Cache prefetching method and device
CN111580754A (en) * 2020-05-06 2020-08-25 西安交通大学 Write-friendly flash memory solid-state disk cache management method
CN114579479A (en) * 2021-11-16 2022-06-03 中国科学院上海高等研究院 Low-pollution cache prefetching system and method based on instruction flow mixed mode learning
WO2022213871A1 (en) * 2021-04-06 2022-10-13 华为云计算技术有限公司 Caching apparatus, method and system
WO2023011236A1 (en) * 2021-07-31 2023-02-09 华为技术有限公司 Compilation optimization method for program source code, and related product
CN116841623A (en) * 2023-06-30 2023-10-03 摩尔线程智能科技(北京)有限责任公司 Scheduling method and device of access instruction, electronic equipment and storage medium
CN117453435A (en) * 2023-12-20 2024-01-26 北京开源芯片研究院 Cache data reading method, device, equipment and storage medium
CN117725115A (en) * 2023-12-19 2024-03-19 金篆信科有限责任公司 Database sequence processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant