CN112506437A - Chip, data moving method and electronic equipment - Google Patents

Chip, data moving method and electronic equipment

Info

Publication number
CN112506437A
Authority
CN
China
Prior art keywords
memory
data
dma
chip
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011458676.7A
Other languages
Chinese (zh)
Inventor
冷祥纶
周俊
王文强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Power Tensors Intelligent Technology Co Ltd
Original Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Power Tensors Intelligent Technology Co Ltd filed Critical Shanghai Power Tensors Intelligent Technology Co Ltd
Priority to CN202011458676.7A priority Critical patent/CN112506437A/en
Publication of CN112506437A publication Critical patent/CN112506437A/en
Priority to JP2022527673A priority patent/JP2023509818A/en
Priority to PCT/CN2021/101547 priority patent/WO2022121278A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F 3/0647 Migration mechanisms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7825 Globally asynchronous, locally synchronous, e.g. network on chip
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 Improving or facilitating administration, e.g. storage management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present application provides a chip, a data moving method, and electronic equipment. The chip may include at least one processing core and at least one memory partition. The memory partition includes a cache system, a memory system, and a direct memory access (DMA) controller. The DMA is connected to the cache system and the memory system, respectively, and is used for moving data between different storage spaces within the memory partition.

Description

Chip, data moving method and electronic equipment
Technical Field
The present application relates to computer technology, and in particular to a chip, a data moving method, and electronic equipment.
Background
With the rapid development of computer technology, the computing power of various chips continues to improve, and this improvement in chip computing power demands higher data transfer efficiency.
In the related art, when data needs to be moved between the first storage space and the second storage space in the memory partition inside the chip, the processing core needs to read data from the first storage space first and store the data in the processing core. Then, the processing core reads out the stored data and writes the data into the second storage space.
Therefore, in the related art, data transfer between different storage spaces inside a memory partition needs to pass through the processing core, and the data transfer efficiency is low.
Disclosure of Invention
In view of the above, the present application discloses a chip, which includes:
at least one processing core and at least one memory partition; the memory partition comprises a cache system, a memory system and a direct memory access controller (DMA); the DMA is respectively connected with the cache system and the memory system and is used for carrying out data transfer among different storage spaces in the memory partition. In one illustrative embodiment, a first processing core of the at least one processing core is coupled to the DMA in a first memory partition of the at least one memory partition; a first processing core of the at least one processing core is configured to send a data move instruction to a DMA in a first memory partition, where the at least one memory partition includes the first memory partition; and the DMA in the first memory partition is used for carrying out data transfer between different memory spaces in the first memory partition based on the data transfer instruction.
In an embodiment, the cache system includes a plurality of levels of cache, wherein at least a portion of a storage space of a last level of cache in the plurality of levels of cache is configured as a scratch pad memory SPM; the DMA is used to perform data transfer between the memory space configured as the SPM in the last-level cache and the memory system.
In an illustrated embodiment, the last-level cache of the cache system supports three operation modes: in the first operation mode, the whole storage space of the last-level cache is configured as a cache; in the second operation mode, the whole storage space of the last-level cache is configured as an SPM; and in the third operation mode, a part of the storage space of the last-level cache is configured as a cache and another part is configured as an SPM.
In an embodiment, the memory partition further includes a mode configurator configured to configure an operation mode of a last-level cache in the cache system based on user configuration information.
In an illustrated embodiment, the at least one processing core of the chip and the DMAs in the at least one memory partition access each other through a master network-on-chip; and/or the DMA, the cache system and the memory system access each other through a sub network-on-chip.
In an embodiment, the data movement between different storage spaces inside the memory partition includes at least one of the following: data movement between different storage spaces in the last level cache of the cache system; data movement between different storage spaces in the memory system; and data movement between the storage space in the last level cache of the cache system and the storage space in the memory system.
In an embodiment, the chip includes at least one memory partition that employs unified memory access.
In an illustrative embodiment, the processing core is configured to broadcast a data move instruction to at least one DMA in the at least one memory partition.
In an embodiment, the data moving instruction includes: data move type, data length, source memory address, and destination memory address.
In an illustrated embodiment, the data moving instruction includes a first field, a second field, a third field and a fourth field, wherein the first field is used to indicate the data move type and the data length; the second field is used to indicate a low address of the source storage address; the third field is used to indicate a high address of the source storage address and a high address of the destination storage address; and the fourth field is used to indicate a low address of the destination storage address.
In an illustrated embodiment, the DMA is configured to read data from a first storage space in the memory partition and write the read data into a second storage space in the memory partition.
In an embodiment, the memory system is a high bandwidth memory (HBM).
The present application also provides a data moving method, which is applied to the above chip; the chip comprises at least one processing core and at least one memory partition, wherein the memory partition comprises a cache system, a memory system and a direct memory access controller (DMA); the method comprises the following steps: the processing core sends a data moving instruction to the DMA; and the DMA moves data between different memory spaces within the memory partition based on the data moving instruction.
In an embodiment, the DMA, based on the data move instruction, performing data move between different memory spaces in the memory partition, includes: and the DMA in the first memory partition carries out data transfer between different memory spaces in the first memory partition based on the data transfer command.
In an embodiment, the cache system includes a plurality of levels of cache, wherein at least a portion of a storage space of a last level of cache in the plurality of levels of cache is configured as a scratch pad memory SPM; the DMA, based on the data transfer command, transfers data between different memory spaces within the memory partition, including: the DMA transfers data between the memory space configured as the SPM in the last-level cache and the memory system based on the data transfer instruction.
In an illustrated embodiment, the last-level cache of the cache system supports three operation modes: in the first operation mode, the whole storage space of the last-level cache is configured as a cache; in the second operation mode, the whole storage space of the last-level cache is configured as an SPM; and in the third operation mode, a part of the storage space of the last-level cache is configured as a cache and another part is configured as an SPM.
In an embodiment, the memory partition further comprises a mode configurator; the method further comprises: configuring, by the mode configurator, the working mode of the last-level cache in the cache system based on user configuration information.
In an illustrated embodiment, the at least one processing core of the chip and the DMAs in the at least one memory partition access each other through a master network-on-chip; and/or the DMA, the cache system and the memory system access each other through a sub network-on-chip.
In an embodiment, the data movement between different storage spaces inside the memory partition includes at least one of the following: data movement between different storage spaces in the last level cache of the cache system; data movement between different storage spaces in the memory system; and data movement between the storage space in the last level cache of the cache system and the storage space in the memory system.
In an embodiment, the chip includes at least one memory partition that employs unified memory access.
In an illustrated embodiment, the at least one memory partition included in the chip is a plurality of memory partitions, and the plurality of memory partitions all use unified memory access; the sending of the data moving instruction to the DMA by the processing core includes: the processing core broadcasting the data moving instruction to at least one DMA in the at least one memory partition.
In an embodiment, the data moving instruction includes: data move type, data length, source memory address, and destination memory address.
In an illustrated embodiment, the data moving instruction includes a first field, a second field, a third field and a fourth field, wherein the first field is used to indicate the data move type and the data length; the second field is used to indicate a low address of the source storage address; the third field is used to indicate a high address of the source storage address and a high address of the destination storage address; and the fourth field is used to indicate a low address of the destination storage address.
In an embodiment, the DMA, based on the data move instruction, performing data move between different memory spaces in the memory partition, includes: the DMA reads data from a first storage space in the memory partition based on the data move command, and writes the read data into a second storage space in the memory partition.
In an embodiment, the memory system is a high bandwidth memory (HBM).
The present application further proposes electronic equipment comprising the chip of any of the embodiments described above.
According to the technical solution above, on one hand, since the DMA is connected to the cache system and the memory system respectively and moves data between different storage spaces within the memory partition, the data movement can be kept inside the memory partition without occupying the memory access bandwidth of the chip; the memory access bandwidth inside the chip is thus freed during data movement, the data moving efficiency is improved, and the performance of the chip is improved.
On the other hand, because the processing core sends a data move instruction to the DMA, the DMA can respond to the instruction and control data movement between different storage spaces within the memory partition, so that the data to be moved stays inside the memory partition; this frees the memory access bandwidth inside the chip, improves the data moving efficiency, and improves the chip performance.
On yet another hand, since the chip improves the data moving efficiency of the memory partitions and thus has higher performance, using the chip can help improve the processing efficiency of computing tasks and thereby improve the performance of the electronic equipment.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions in one or more embodiments of the present application or in the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description cover only some of the embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is an internal structure diagram of an AI chip shown in the present application;
FIG. 2 illustrates the internal structure of a chip according to the present application;
FIG. 3 is a diagram of a chip structure shown in the present application;
FIG. 4 is a diagram of a chip structure shown in the present application;
FIG. 5 is a schematic diagram of a data move instruction shown in the present application;
FIG. 6 is a schematic diagram of a data move instruction shown in the present application;
fig. 7 is a flowchart of a data moving method according to the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It should also be understood that the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining," depending on the context.
With the rapid development of computer technology, the computing power of various chips continues to improve, and this improvement in chip computing power demands higher data transfer efficiency.
Referring to fig. 1, fig. 1 is a diagram illustrating an internal structure of an AI chip according to the present application.
As shown in fig. 1, the processing core of the AI chip is connected to the memory partition; the memory partition at least comprises a memory system and a cache system.
In the AI chip shown in fig. 1, when a part of the data in the memory system needs to be moved to the cache, the processing core first reads that data from the memory system with a read command and stores it inside the processing core; the processing core then writes the data into the cache with a write command.
Therefore, when data is moved between the cache and the memory system, the memory access bandwidth is occupied at least twice; the data moving delay is thus very large and the chip performance is greatly reduced. It will be appreciated by those skilled in the art that the above problems also exist for data movement within the cache, as well as within the memory system, which will not be described in detail herein.
In view of the above, the present application provides a chip. The chip adds a DMA (Direct Memory Access) controller connected, within the memory partition, to the cache system and the memory system respectively, so that the DMA can move data between different storage spaces in the memory partition, thereby freeing the memory access bandwidth inside the chip, improving the data moving efficiency, and improving the chip performance.
The internal structure of the chip will be described below.
Referring to fig. 2, fig. 2 is a diagram illustrating an internal structure of a chip according to the present application. As shown in fig. 2, the chip includes:
at least one processing core 21 and at least one memory partition 22;
the memory partition 22 includes a cache system, a memory system 222, and a DMA 223;
the DMA223 is connected to the cache system and the memory system, respectively, and is used for data transfer between different storage spaces in the memory partition.
It should be noted that fig. 2 exemplarily shows the last-level cache system 221 of the cache system connected to the DMA. In practical applications, the DMA may also be connected to caches at other levels, which is not particularly limited herein.
In practical applications, the DMA may read data from a first storage space in the memory partition and write the read data into a second storage space in the memory partition.
For example, when the first storage space is a memory system, the second storage space is an L2 cache system. The DMA may control data movement of data between the memory system and the L2 cache system in response to a data movement instruction issued by the processing core.
It should be noted that a memory partition may include one or more DMAs. For example, a memory partition may include a single DMA responsible for data movement between all storage spaces in the partition; alternatively, a memory partition may include a plurality of DMAs, each responsible for data movement between one or more pairs of storage spaces. When the chip includes a plurality of DMAs and a plurality of memory partitions, the present application does not limit the specific locations of the DMAs: for example, a DMA may be located in each memory partition, or the DMAs may be centrally located in one of the memory partitions.
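As a concrete illustration of how a processing core might program such a per-partition DMA, the following C sketch shows a minimal memory-mapped register sequence. The register map, the offsets, and the helper names are assumptions for illustration only and are not taken from the patent.

```c
#include <stdint.h>

/* Hypothetical MMIO register map for a per-partition DMA (assumed layout). */
#define DMA_SRC_LO  0x00u
#define DMA_SRC_HI  0x04u
#define DMA_DST_LO  0x08u
#define DMA_DST_HI  0x0Cu
#define DMA_LEN     0x10u
#define DMA_CTRL    0x14u
#define DMA_CTRL_GO (1u << 0)

static inline void mmio_write32(volatile uint32_t *base,
                                uint32_t off, uint32_t val)
{
    base[off / sizeof(uint32_t)] = val;
}

/* Program the partition's DMA to move `len` bytes from a first storage
 * space (e.g. the memory system) to a second storage space (e.g. the L2
 * cache/SPM), without the data passing through the processing core. */
void dma_copy_in_partition(volatile uint32_t *dma,
                           uint64_t src, uint64_t dst, uint32_t len)
{
    mmio_write32(dma, DMA_SRC_LO, (uint32_t)src);
    mmio_write32(dma, DMA_SRC_HI, (uint32_t)(src >> 32));
    mmio_write32(dma, DMA_DST_LO, (uint32_t)dst);
    mmio_write32(dma, DMA_DST_HI, (uint32_t)(dst >> 32));
    mmio_write32(dma, DMA_LEN,    len);
    mmio_write32(dma, DMA_CTRL,   DMA_CTRL_GO);  /* kick off the move */
}
```

The key design point, per the description above, is that only these register writes cross the core's path; the moved data itself stays inside the memory partition.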
The chip may be any chip requiring high memory access bandwidth. In practical applications, the chip may be a chip with a multi-channel DRAM (Dynamic Random Access Memory) storage system, such as a CPU, DSP, or MCU. In one embodiment, the chip may execute an artificial intelligence algorithm. For example, the chip may be an AI neural network chip (e.g., FPGA, TPU, etc.) or a GPU graphics processing chip.
The processing core is typically a computational core in the chip used for executing code, and may include one or more processing units. For example, the processing core may generally perform data migration in the memory partition according to program code specified by a developer.
In practical applications, data movement between storage spaces in a memory partition may generally include movement of data inside the last-level cache system of the memory partition, movement of data inside the memory system of the memory partition, and movement of data between the last-level cache system and the memory system of the memory partition.
The memory partitions are typically used to store data.
In practical applications, a chip typically employs a memory partition having a memory hierarchy. The memory partition may include at least one level of cache system and a memory system.
For example, with continued reference to FIG. 2, the memory partition may include a last-level cache (assuming the chip also includes at least L1 and L2 caches) as well as a memory system. When the processing core needs to obtain data, it usually accesses the L1 cache first. If the L1 cache stores the data needed by the processing core, the data acquisition completes there. If not, the processing core retrieves the needed data from the L2 cache in the memory partition, and so on down the hierarchy; if even the last-level cache does not contain the data required by the core, the processing core continues to acquire the data from the memory system.
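The fall-through just described can be summarized in a short C sketch; the lookup helpers below are assumed stand-ins for hardware behavior, not an interface from the patent.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed lookup helpers: each returns NULL on a miss. */
extern uint8_t *l1_lookup(uint64_t addr);
extern uint8_t *l2_lookup(uint64_t addr);
extern uint8_t *llc_lookup(uint64_t addr);   /* last-level cache */
extern uint8_t *memory_read(uint64_t addr);  /* always succeeds  */

/* A load falls through the hierarchy level by level, as described above. */
uint8_t *load(uint64_t addr)
{
    uint8_t *p;
    if ((p = l1_lookup(addr))  != NULL) return p;  /* L1 hit         */
    if ((p = l2_lookup(addr))  != NULL) return p;  /* L2 hit         */
    if ((p = llc_lookup(addr)) != NULL) return p;  /* last-level hit */
    return memory_read(addr);                      /* go to memory   */
}
```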
From the above example, it can be seen that the performance of the chip depends largely on the cache hit rate (CACHE HIT). In order to increase the cache hit rate, a large-capacity cache that can be directly managed by developers is provided in the chip, thereby increasing the hit rate in the large-capacity cache.
In general, the last level cache system of the at least one level cache system included in the memory partition may serve as the mass cache.
In an embodiment, the cache system includes a plurality of levels of cache, wherein at least a portion of the storage space of the last level of cache is configured as a scratch pad memory (SPM). Since configuring part of the storage space as the SPM affects the data moving efficiency of that portion of the storage space, the following arrangement is adopted to improve the data moving efficiency.
At this time, when data transfer is performed, the DMA is used to perform data transfer between the memory space configured as the SPM in the last-level cache and the memory system. Because this transfer is carried out through the DMA, the moved data can be prevented from passing through the processing core, thereby freeing bandwidth, shortening the data movement path, and improving the data moving efficiency.
In an embodiment, in order to flexibly adapt to multiple service scenarios, the last-level cache of the cache system supports three operation modes: in the first operation mode, the whole storage space of the last-level cache is configured as a cache; in the second operation mode, the whole storage space of the last-level cache is configured as an SPM; and in the third operation mode, a part of the storage space of the last-level cache is configured as a cache and another part is configured as an SPM.
In this way, developers can flexibly configure the last-level cache system according to requirements, so that the applicability of the chip is improved.
It should be noted that, in order to implement the dynamic configuration of the last-level cache system, in an embodiment, the last-level cache system may further include a mode configurator.
The mode configurator is configured to configure a working mode of a last-level cache in the cache system based on user configuration information.
In practical applications, a developer may configure the operation mode of the last-level cache through the mode configurator based on user configuration information.
For example, in a scenario of a multi-chip cascade distributed training system, communication between chips requires high capacity and low latency, and all memory space of the last level cache system can be configured as an SPM scratch pad memory.
For another example, in algorithm-development scenarios where developers do not need to manage the last-level cache system themselves, all the storage space of the last-level cache system may be configured as a CACHE buffer.
For another example, in a scenario that both data transmission efficiency and data reuse rate need to be emphasized, a part of the storage space of the last-level CACHE system may be configured as a CACHE buffer, and a part of the storage space is configured as an SPM scratch pad memory to store AI operation parameters.
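A minimal C sketch of these three modes and the configurator call follows; the enum names, the split granularity, and the llc_configure() hook are assumptions for illustration, not the patent's interface.

```c
#include <stdint.h>

/* The three operation modes of the last-level cache described above. */
enum llc_mode {
    LLC_MODE_CACHE,  /* mode 1: whole space acts as a cache              */
    LLC_MODE_SPM,    /* mode 2: whole space acts as scratch pad memory   */
    LLC_MODE_SPLIT   /* mode 3: part cache, part SPM                     */
};

struct llc_config {
    enum llc_mode mode;
    uint32_t spm_bytes;  /* SPM share, only meaningful for LLC_MODE_SPLIT */
};

extern void llc_configure(const struct llc_config *cfg);  /* assumed hook */

void configure_for_scenario(void)
{
    /* Multi-chip cascade training: all SPM for low-latency communication. */
    struct llc_config training = { LLC_MODE_SPM, 0 };
    llc_configure(&training);

    /* Mixed workload: part cache for data reuse, part SPM for AI
     * operation parameters (split size is an arbitrary example). */
    struct llc_config mixed = { LLC_MODE_SPLIT, 2u << 20 };
    llc_configure(&mixed);
}
```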
The memory system may be a global memory system. For example, the Memory system may be a Dynamic Random Access Memory (DRAM), a Synchronous Dynamic Random Access Memory (SDRAM), or the like.
In one embodiment, to increase the memory access bandwidth, the global memory system may be an HBM (High Bandwidth Memory).
It should be noted that the chip may adopt a bus or an NoC (network-on-chip) architecture, which may be chosen according to actual requirements. For the related art, please refer to fig. 3, which is a structural diagram of a chip shown in the present application. As shown in fig. 3, the DMA, the chip processing core, and the memory partition are connected by a bus.
At this time, if the memory system inside the memory partition needs to move data to the L2 cache, the chip processing core will send a data move instruction to the DMA to complete the data move.
However, it is not difficult to see that, in the above chip structure, even though the working pressure on the chip processing core is relieved through the DMA, the data still has to flow from the memory system to the core and then to the L2 cache during the transfer; the problems of data transfer preempting the memory access bandwidth and low data transfer efficiency therefore still exist with the above chip structure.
In order to solve the above problem, as shown in fig. 2, in the present application, the DMA is built in the memory partition, so that the DMA can control the data to be moved inside the memory partition without preempting the memory access bandwidth of the chip.
According to the technical solution above, since the DMA is connected to the cache system and the memory system respectively and moves data between different storage spaces within the memory partition, the data movement can be kept inside the memory partition without occupying the memory access bandwidth of the chip; the memory access bandwidth inside the chip is thus freed during data movement, the data moving efficiency is improved, and the performance of the chip is improved.
In one embodiment, a first processing core of said at least one processing core is coupled to said DMA in a first memory partition of said at least one memory partition;
a first processing core of the at least one processing core is configured to send a data move instruction to a DMA in a first memory partition, where the at least one memory partition includes the first memory partition;
and the DMA in the first memory partition is used for carrying out data transfer between different memory spaces in the first memory partition based on the data transfer instruction.
With continued reference to fig. 2, the DMA is connected to the processing core. The connection mode may be a bus mode connection.
In one embodiment, to further improve chip performance, the DMAs and the processing cores may access each other via a master network-on-chip (NoC).
The master network-on-chip may be a master network in the chip. When the chip includes a plurality of processing cores and a plurality of memory partitions, the plurality of processing cores and the DMAs in the plurality of memory partitions may access each other through the master network-on-chip.
Referring to fig. 2, the DMA is connected to the last level cache system of the at least one level cache system and the memory system, respectively. The connection mode may be a bus mode connection.
In one embodiment, to further improve chip performance, the DMA, the last-level cache system, and the memory system may access each other via a sub network-on-chip (NoC).
The sub network-on-chip may be a sub-network within the memory partition. When the chip includes a plurality of memory partitions, each memory partition may use its own sub network-on-chip, so that the DMA, the last-level cache system, and the memory system in each memory partition may access each other through it.
Because the bandwidth and capacity of a single memory (whether a cache or a memory system) are limited, in one embodiment the chip may include a plurality of memory partitions to increase the memory access bandwidth and the chip capacity. These memory partitions may be connected in parallel with the processing cores.
Referring to fig. 4, fig. 4 is a structural diagram of a chip shown in the present application. As shown in fig. 4, the chip includes a plurality of processing cores, and a plurality of memory partitions. It should be noted that only the last level cache system is illustrated in the memory partition, and other caches are not shown in fig. 4.
The processing cores and memory partitions of the chip may access each other via the network-on-chip (NoC).
In this way, the parallel connection of multiple memory partitions is realized, which widens the memory access bandwidth and the chip capacity.
In the above situation, in order to facilitate program writing by developers, the chip includes a plurality of memory partitions, and the plurality of memory partitions all use Unified Memory Access (UMA).
In practical applications, UMA may be employed between last level cache systems in the multiple memory partitions. UMA may also be used between memory systems in the multiple memory partitions.
In this way, for developers, the effective addresses of the different caches are the same, as are the effective addresses of the different memory systems. When writing data to the caches or the memory systems, only one address needs to be supplied, and there is no need to write data separately for multiple caches or multiple memory systems; this improves the programming efficiency of developers as well as the data storage efficiency.
The processing core may send a data move instruction to one or more DMAs, respectively. In some embodiments, to reduce the call overhead for the DMAs, the processing core may broadcast the data move instruction to at least one DMA in the at least one memory partition.
In practical applications, when data migration is required in the memory partitions, the processing core may broadcast a data move instruction to the DMAs in the plurality of memory partitions.
For example, assume that a chip may include 8 memory partitions. Unified memory access may be adopted between last-level cache systems of 4 memory partitions (assuming that the last-level cache system is the L2 cache) in the 8 memory partitions, and between memory systems in the multiple memory partitions.
In the above situation, if 8 MB of data needs to be moved from the memory system to the L2 cache, 1 MB of data actually needs to be moved in each memory partition. At this time, on one hand, the processing core may broadcast a data move instruction to the DMAs in the 4 memory partitions adopting the UMA architecture; on the other hand, it may send the data move instruction individually to the DMAs in the memory partitions not adopting the UMA architecture.
After receiving the data move instruction, each DMA may read 1 MB of data from the storage location in the memory system indicated by the instruction and move it to the storage location in the L2 cache indicated by the instruction, thereby completing the data move.
Because the processing core can broadcast a single data move instruction to the DMAs in the plurality of memory partitions adopting the UMA architecture to complete the data move in each of those partitions, the number of times the core calls the DMAs is reduced, thereby reducing the DMA call overhead.
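The dispatch pattern of the eight-partition example can be sketched in C as follows: one broadcast reaches the four UMA partitions, while the remaining partitions receive individual instructions. `struct dma`, the instruction type, and dma_send() are assumed primitives, not the patent's interface.

```c
struct dma;        /* opaque handle to a per-partition DMA */
struct move_insn;  /* opaque data move instruction         */

extern void dma_send(struct dma *d, const struct move_insn *insn);

/* Modeled as a loop here; on the chip this would be a single NoC
 * broadcast, so the core issues the instruction only once. */
static void dma_broadcast(struct dma *dmas[], int n,
                          const struct move_insn *insn)
{
    for (int i = 0; i < n; i++)
        dma_send(dmas[i], insn);
}

void move_8mb(const struct move_insn *uma_insn,
              struct dma *uma_dmas[4],
              struct dma *other_dmas[4],
              const struct move_insn *other_insns[4])
{
    dma_broadcast(uma_dmas, 4, uma_insn);  /* 1 MB moved per UMA partition */
    for (int i = 0; i < 4; i++)            /* non-UMA partitions are       */
        dma_send(other_dmas[i], other_insns[i]);  /* addressed one by one  */
}
```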
In an embodiment, the plurality of DMAs included in the chip may instead be centrally located in the same memory partition, with each DMA corresponding one-to-one to the memory system and cache system of one of the chip's memory partitions.
In this case, when data movement by the plurality of DMAs is needed, data movement between different storage spaces in the respective memory partitions can still be completed by broadcasting a data move instruction to the plurality of DMAs.
The following introduces an improvement that the present application makes to the data move instruction. To further reduce the call overhead for the DMA, the present application provides a DMA data move instruction with a completely new format. The instruction reduces the number of fields and assigns each field a carefully chosen meaning, thereby shortening the data move instruction and reducing the DMA call overhead.
In the related art, the data move instruction indication for DMA includes 6 fields, which are a data move type field, a data length field, a high speed memory low address field, a high speed memory high address field, a memory system low address field, and a memory system high address field.
Therefore, the data transfer instruction in the related art is relatively long, and when the DMA is called, a long data transfer instruction needs to be sent to the DMA, so that the call overhead of the DMA is increased.
To solve this problem, in an embodiment, the data move instruction at least includes a data move type, a data length, a source storage address, and a destination storage address.
The data movement type specifically indicates a data movement direction. In one embodiment, the data movement type may indicate a data flow direction in the memory partition. Specifically, the data flow direction (data transfer type) may include any one of the following four types:
the data migration from the last level cache system in the storage partition to the memory system, and the data migration from the memory system in the storage partition to the last level cache system.
In practical applications, the four data flow directions correspond to four identifiers. When the DMA is actually called, the appropriate identifier is written into the data move type field, so that the DMA can identify the flow direction of the data move.
The data length specifically indicates the size of the data volume to be transmitted. It can be understood that the data size has a corresponding relationship with the storage space, so if the starting position of the data in the memory is known, the ending position of the data in the memory can be obtained according to the data length of the data.
The source storage address is specifically a start address of a current storage location of the data to be moved. For example, if data is moved from the memory system to the last level cache system, the source storage address is the starting location of the data in the memory system.
The destination storage address is specifically a start address of a storage location where data to be moved needs to be moved. For example, if data is moved from the memory system to the last level cache system, the destination storage address is the starting location where the data is moved to the last level cache system.
It can be understood that, after the DMA receives the data move instruction, it can, first, determine the source storage space from the source storage address field and the data length in the instruction; second, determine the destination storage space from the destination storage address field and the data length; and third, move the data in the source storage space to the destination storage space according to the data move type in the instruction.
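A minimal C sketch of this interpretation is given below; the struct layout, the field names, and dma_raw_copy() are assumptions for illustration only.

```c
#include <stdint.h>

/* Logical content of a data move instruction (illustrative names). */
struct dma_move_cmd {
    unsigned  type;      /* one of the four data flow directions     */
    uint32_t  length;    /* bytes to move                            */
    uint64_t  src_addr;  /* start of the data's current location     */
    uint64_t  dst_addr;  /* start of the location the data moves to  */
};

extern void dma_raw_copy(unsigned type, uint64_t src,
                         uint64_t dst, uint32_t len);  /* assumed engine op */

/* The source span is [src_addr, src_addr + length) and the destination
 * span is [dst_addr, dst_addr + length); `type` selects the direction. */
void dma_execute(const struct dma_move_cmd *cmd)
{
    dma_raw_copy(cmd->type, cmd->src_addr, cmd->dst_addr, cmd->length);
}
```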
Referring to fig. 5, fig. 5 is a schematic diagram illustrating a data move instruction according to the present application. As shown in fig. 5, the data move instruction includes a first field, a second field, a third field, and a fourth field;
wherein, the first field is a field indicating the data moving type and the data length;
the second field is a field indicating a low address of the source storage address;
the third field is a field indicating a high address of the source memory address and a high address of the destination memory address;
the fourth field is a field indicating a low address of the destination memory address.
It should be noted that, the order of the fields of the data move instruction and the positions of the data bits indicating different meanings in the fields may be adjusted according to actual situations, and are not limited herein.
Assume that 0000 (binary) indicates data move within the last level cache system, 0001 (binary) indicates data move within the memory system, 0010 (binary) indicates data move from the memory system to the last level cache system, and 0011 (binary) indicates data move from the last level cache system to the memory system.
In the above situation, assume that 2 megabytes of data need to be moved from the memory system, at low address 0x3EAB_0000 (hexadecimal) and high address 0xAB_00 (hexadecimal), to the last-level cache system, at low address 0x3E5B_0000 (hexadecimal) and high address 0xCD_00 (hexadecimal).
At this time, when constructing the data move instruction for the DMA, the chip processing core may write 0010 into the first 4 bits of the first field and write the binary representation of 2 megabytes into the last 28 bits of the first field. Then, the processing core may convert the low address 0x3EAB_0000 of the memory system into binary and write it into the second field, and convert the high address 0xAB_00 of the memory system into binary and write it into the last sixteen bits of the third field. Finally, the processing core may write the high address 0xCD_00 of the last-level cache system into the first sixteen bits of the third field, and convert the low address 0x3E5B_0000 of the last-level cache system into binary and write it into the fourth field.
After the processing core completes the construction of the data move instruction, the instruction may be broadcast to each DMA, so that each DMA, in response to it, moves 2 megabytes of data from the low address 0x3EAB_0000 and high address 0xAB_00 of the memory system to the low address 0x3E5B_0000 and high address 0xCD_00 of the last-level cache system.
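The packing just walked through can be expressed as a C sketch following the bit positions described above; the struct and function names are illustrative, not from the patent.

```c
#include <stdint.h>

enum move_type {      /* the 4-bit codes assumed in the text above        */
    MV_IN_LLC  = 0x0, /* within the last-level cache system               */
    MV_IN_MEM  = 0x1, /* within the memory system                         */
    MV_MEM2LLC = 0x2, /* from the memory system to the last-level cache   */
    MV_LLC2MEM = 0x3  /* from the last-level cache to the memory system   */
};

struct move_insn_fig5 {
    uint32_t f1;  /* [31:28] move type, [27:0] data length              */
    uint32_t f2;  /* low 32 bits of the source address                  */
    uint32_t f3;  /* [31:16] destination high bits, [15:0] source high  */
    uint32_t f4;  /* low 32 bits of the destination address             */
};

struct move_insn_fig5 encode_fig5(enum move_type t, uint32_t len,
                                  uint64_t src, uint64_t dst)
{
    struct move_insn_fig5 in;
    in.f1 = ((uint32_t)t << 28) | (len & 0x0FFFFFFFu);
    in.f2 = (uint32_t)src;
    in.f3 = (((uint32_t)(dst >> 32) & 0xFFFFu) << 16)
          |  ((uint32_t)(src >> 32) & 0xFFFFu);
    in.f4 = (uint32_t)dst;
    return in;
}

/* The worked example: move 2 MB from the memory system (high 0xAB00,
 * low 0x3EAB_0000) to the last-level cache (high 0xCD00, low 0x3E5B_0000). */
struct move_insn_fig5 example(void)
{
    return encode_fig5(MV_MEM2LLC, 2u << 20,
                       0xAB003EAB0000ULL, 0xCD003E5B0000ULL);
}
```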
Therefore, since the data move instruction at least includes the data move type and data length field, the source storage address field, and the destination storage address field, when the DMA is called, the call overhead of the DMA can be reduced.
In one embodiment, fields of the 6-field data move instruction of the related art may be merged, so as to reduce the number of fields the instruction contains.
In practical applications, since the data move type requires only a few bits, dedicating an entire 32-bit field to it is wasteful, so the data move type and the data length can be combined into one field. Since the high-speed memory typically has a small total capacity (e.g., several megabytes), the high-speed-memory low-address field and the high-speed-memory high-address field can also be combined into one field.
Referring to fig. 6, fig. 6 is a schematic diagram illustrating a data move instruction according to the present application. As shown in fig. 6, the data move instruction at least includes a first field, a second field, a third field, and a fourth field;
wherein, the first field is a field indicating the data moving type and the data length;
the second field is a field indicating the storage address of the last-level high-speed memory;
the third field is a low address field indicating a memory system;
the fourth field is a high address field indicating a memory system.
It should be noted that the order of the fields of the data move instruction and the positions of the data bits indicating different meanings in the fields may likewise be adjusted according to actual situations, and are not limited herein.
The meaning of the first field indication can refer to the foregoing embodiments, and is not described in detail herein.
The second field indicates the start address of the storage space of the last-level high-speed memory. When the first field indicates that the data is moved from the last-level high-speed memory to the memory system, the storage address indicated by the second field is the initial position of the current storage position of the data. When the first field indicates that the data is moved from the memory system to the last-level high-speed memory, the storage address indicated by the second field is the initial position of the storage position after the data is moved.
The meaning of the third field and the fourth field may refer to the second field, and will not be described in detail.
As can be seen from the above, since the data move instruction includes only four fields, when the DMA is called, the call overhead of the DMA can be reduced.
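For comparison with the FIG. 5 layout sketched earlier, the merged-field layout of FIG. 6 might look as follows in C; because the last-level high-speed memory is small, its whole address fits one field, while the memory-system address keeps separate low/high fields. The field widths are assumptions.

```c
#include <stdint.h>

/* Sketch of the four-field FIG. 6 variant (illustrative names). */
struct move_insn_fig6 {
    uint32_t type_len;   /* move type + data length, as in FIG. 5      */
    uint32_t cache_addr; /* full last-level high-speed memory address  */
    uint32_t mem_lo;     /* memory-system address, low 32 bits         */
    uint32_t mem_hi;     /* memory-system address, high bits           */
};
```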
Correspondingly, the present application also provides a data moving method applied to the above chip. In the method, the processing core issues a data move instruction to the DMA built into the memory partition, so that the DMA can respond to the instruction and move the data to be moved inside the memory partition, thereby freeing the memory access bandwidth inside the chip, improving the data moving efficiency, and improving the performance of the chip.
Referring to fig. 7, fig. 7 is a flowchart of a method for moving data, which is applied to a chip. As shown in fig. 7, the method may include:
s702, the processing core sends a data moving instruction to the DMA.
S704, the DMA transfers data between different memory spaces in the memory partition based on the data transfer command.
The chip may have the chip structure described in any of the above embodiments. In one embodiment, the chip may adopt a chip structure as shown in fig. 2. As shown in fig. 2, the chip includes at least one processing core; at least one memory partition; the memory partition comprises a cache system, a memory system and a direct memory access controller DMA. The DMA is connected with the cache system and the memory system respectively.
It should be noted that, in practical applications, the memory partition may include at least one level of cache system, at least one memory system, and one or more DMAs, which are not particularly limited herein.
In one embodiment, the chip may execute an artificial intelligence algorithm. For example, the chip may be an AI neural network chip or a GPU graphics processing chip.
The processing core is typically a computing core in a chip, and is used for executing code operations. For example, the processing core may generally perform data migration in the memory partition according to a program code specified by a developer.
In practical applications, data movement between storage spaces in a memory partition may generally include movement of data inside the last-level cache system of the memory partition, movement of data inside the memory system of the memory partition, and movement of data between the last-level cache system and the memory system of the memory partition.
The memory partitions are typically used to store data.
In practical applications, a chip typically employs a memory partition having a memory hierarchy. The memory partition may include at least one level of cache system and a memory system.
For example, referring to FIG. 2, the memory partition may include a last-level cache (assuming the chip also includes at least L1 and L2 caches) and a memory system. When the processing core needs to obtain data, it usually accesses the L1 cache first. If the L1 cache stores the data needed by the processing core, the data acquisition completes there. If not, the processing core retrieves the needed data from the L2 cache in the memory partition, and so on down the hierarchy; if even the last-level cache does not contain the data required by the core, the processing core continues to acquire the data from the memory system.
From the above example, it can be seen that the performance of the chip depends largely on the cache hit rate (CACHE HIT). In order to increase the cache hit rate, a large-capacity cache that can be directly managed by developers is provided in the chip, thereby increasing the hit rate in the large-capacity cache.
In general, the last level cache system of the at least one level cache system included in the memory partition may serve as the mass cache.
In an embodiment, the cache system includes a plurality of levels of cache, wherein at least a portion of the storage space of the last level of cache is configured as a scratch pad memory (SPM). Since configuring part of the storage space as the SPM affects the data moving efficiency of that portion of the storage space, the following arrangement is adopted to improve the data moving efficiency.
At this time, when data transfer is performed, the DMA is used to perform data transfer between the memory space configured as the SPM in the last-level cache and the memory system. Because this transfer is carried out through the DMA, the moved data can be prevented from passing through the processing core, thereby freeing bandwidth, shortening the data movement path, and improving the data moving efficiency.
In an embodiment, in order to flexibly adapt to multiple service scenarios, the last-level cache of the cache system supports three operation modes: in the first operation mode, the whole storage space of the last-level cache is configured as a cache; in the second operation mode, the whole storage space of the last-level cache is configured as an SPM; and in the third operation mode, a part of the storage space of the last-level cache is configured as a cache and another part is configured as an SPM.
In this way, developers can flexibly configure the last-level cache system according to requirements, so that the applicability of the chip is improved.
It should be noted that, in order to implement the dynamic configuration of the last-level cache system, in an embodiment, the last-level cache system may further include a mode configurator.
The mode configurator is configured to configure a working mode of a last-level cache in the cache system based on user configuration information.
In practical applications, a developer may configure the operation mode of the last-level cache through the mode configurator based on user configuration information.
For example, in a scenario of a multi-chip cascade distributed training system, communication between chips requires high capacity and low latency, and all memory space of the last level cache system can be configured as an SPM scratch pad memory.
For another example, in algorithm-development scenarios where developers do not need to manage the last-level cache system themselves, all the storage space of the last-level cache system may be configured as a CACHE buffer.
For another example, in a scenario that both data transmission efficiency and data reuse rate need to be emphasized, a part of the storage space of the last-level CACHE system may be configured as a CACHE buffer, and a part of the storage space is configured as an SPM scratch pad memory to store AI operation parameters.
The memory system may be a global memory system. For example, the Memory system may be a Dynamic Random Access Memory (DRAM), a Synchronous Dynamic Random Access Memory (SDRAM), or the like.
In one embodiment, to increase the memory access bandwidth, the global memory system may be an HBM (High Bandwidth Memory).
The DMA is used for carrying out data transfer between different storage spaces in the memory partition.
In practical applications, the DMA may read data from a first storage space in the memory partition and write the read data into a second storage space in the memory partition.
For example, when the first storage space is a memory system, the second storage space is an L2 cache system. The DMA may control data movement of data between the memory system and the L2 cache system in response to a data movement instruction issued by the processing core.
The data transfer instruction is specifically an instruction for triggering data transfer between storage spaces inside the memory partition.
In this application, the data transfer instruction may be constructed by a processing core of the chip and sent to the DMA, so that the DMA controls to complete the data transfer.
When data needs to be moved between storage spaces inside a memory partition, the processing core sends a data move instruction to the DMA.
The DMA may control data transfer between the memory spaces within the memory partitions in response to the data transfer command after receiving the data transfer command.
According to the technical solution above, since the processing core sends the data move instruction to the DMA, the DMA can respond to the instruction and control data movement between different storage spaces in the memory partition, so that the data to be moved stays inside the memory partition; this frees the memory access bandwidth inside the chip, improves the data moving efficiency, and improves the chip performance.
In an embodiment, the chip may include a plurality of memory partitions, and in order to complete data migration in each memory partition, the processing core may send a data transfer instruction to each DMA in the plurality of memory partitions, so that each DMA may control data transfer in the memory partition in which the DMA is located.
For example, assume that a chip includes 4 memory partitions. Assuming that data needs to be moved from the memory system to the last-level high-speed memory, since there are 4 memory partitions in the chip, the processing core may send data move instructions to the DMAs in the memory partitions, respectively. After the DMA in the memory partitions receives the data transfer instruction, the DMA can control the data transfer in the memory partitions where the DMA is located.
In an embodiment, when the chip includes a plurality of memory partitions, in order to facilitate program writing by developers, the plurality of memory partitions all use Unified Memory Access (UMA).
To facilitate the developer to write the program, uniform memory access may be employed between last level cache systems in the plurality of memory partitions, and between memory systems in the plurality of memory partitions.
In practical applications, UMA may be employed between last level cache systems in the multiple memory partitions. UMA is also possible between memory systems in the multiple memory partitions.
In this way, for developers, the effective addresses of the different caches are the same, as are the effective addresses of the different memory systems. When writing data to the caches or the memory systems, only one address needs to be supplied, and there is no need to write data separately for multiple caches or multiple memory systems; this improves the programming efficiency of developers as well as the data storage efficiency.
In order to reduce the call overhead for the DMA, the processing core is configured to broadcast a data move instruction to at least one DMA in the at least one memory partition.
In practical applications, when data migration is required in the memory partitions, the processing core may broadcast a data move instruction to the DMAs in the plurality of memory partitions.
For example, assume that the chip includes 4 memory partitions, and unified memory access may be employed between last level cache systems in the four memory partitions (assume that the last level cache system is the L2 cache), and between memory systems in the multiple memory partitions.
In the above situation, if 8 MB of data needs to be moved from the memory system to the L2 cache, 2 MB of data actually needs to be moved in each memory partition. At this time, the processing core may broadcast a data move instruction to the DMAs in the plurality of memory partitions.
After receiving the data move instruction, the DMA in each of the memory partitions may read 2 MB of data from the storage location in the memory system indicated by the instruction and move it to the storage location in the L2 cache indicated by the instruction, thereby completing the data move.
Because the processing core can broadcast the data move instruction to the DMAs in the plurality of memory partitions to complete the data move in each memory partition, the number of times the core calls the DMAs is reduced, thereby reducing the DMA call overhead.
The following introduces the improvement the present application makes to the data move instruction. To further reduce the DMA call overhead, the present application provides a DMA data move instruction in a new format: it reduces the number of fields in the instruction and assigns each field a carefully chosen meaning, thereby shortening the instruction and reducing the DMA call overhead.
In the related art, the data move instruction for DMA includes 6 fields, namely a data move type field, a data length field, a high-speed memory low-address field, a high-speed memory high-address field, a memory-system low-address field, and a memory-system high-address field.
The data move instruction in the related art is therefore relatively long, and each DMA call requires sending this long instruction to the DMA, which increases the DMA call overhead.
To solve this problem, in an embodiment, the data move instruction at least includes a data move type, a data length, a source storage address, and a destination storage address.
The data movement type specifically indicates a data movement direction. In one embodiment, the data movement type may indicate a data flow direction in the memory partition. Specifically, the data flow direction (data transfer type) may include any one of the following four types:
data movement within the last-level cache system of the memory partition; data movement within the memory system of the memory partition; data movement from the last-level cache system of the memory partition to the memory system; and data movement from the memory system of the memory partition to the last-level cache system.
In practical applications, the four data flow directions may correspond to four identifiers; when the DMA is actually called, the corresponding identifier is written into the data move type field, so that the DMA can identify the flow direction of this data movement.
The data length indicates the amount of data to be moved. It can be understood that the amount of data corresponds directly to the storage space it occupies, so if the start position of the data in memory is known, its end position can be derived from the data length.
The source storage address is specifically a start address of a current storage location of the data to be moved. For example, if data is moved from the memory system to the last level cache system, the source storage address is the starting location of the data in the memory system.
The destination storage address is the start address of the storage location to which the data is to be moved. For example, if data is moved from the memory system to the last-level cache system, the destination storage address is the start position in the last-level cache system to which the data is moved.
It can be understood that, after the DMA receives the data move instruction, it can, first, determine the source storage space according to the source storage address field and the data length in the instruction; second, determine the destination storage space according to the destination storage address field and the data length; and third, move the data in the source storage space to the destination storage space according to the data move type in the instruction.
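The following C sketch models this three-step interpretation on the DMA side. It is a simplified software analogue under the assumption that source and destination can be treated as plain pointers; the struct layout is illustrative, not the on-wire instruction format.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of a data move instruction. */
typedef struct {
    uint8_t  move_type;  /* data flow direction: one of the four types above */
    uint32_t length;     /* number of bytes to move                          */
    uint64_t src_addr;   /* start of the source storage space                */
    uint64_t dst_addr;   /* start of the destination storage space           */
} dma_move_t;

/* On receipt, the DMA treats [src_addr, src_addr + length) as the source
 * space and [dst_addr, dst_addr + length) as the destination space; the
 * move_type selects which memories the two addresses refer to. memcpy
 * stands in for the hardware transfer engine. */
void dma_execute(const dma_move_t *ins)
{
    void *src = (void *)(uintptr_t)ins->src_addr;
    void *dst = (void *)(uintptr_t)ins->dst_addr;
    memcpy(dst, src, ins->length);
}
```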
Referring to fig. 5, fig. 5 is a schematic diagram illustrating a data move instruction according to the present application. As shown in fig. 5, the data move instruction includes a first field, a second field, a third field, and a fourth field;
wherein, the first field is a field indicating the data moving type and the data length;
the second field is a field indicating a low address of the source storage address;
the third field is a field indicating a high address of the source memory address and a high address of the destination memory address;
the fourth field is a field indicating a low address of the destination memory address.
It should be noted that, the order of the fields of the data move instruction and the positions of the data bits indicating different meanings in the fields may be adjusted according to actual situations, and are not limited herein.
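Assuming 32-bit fields and the bit positions used in the worked example below, one possible C representation of this format is the following; the exact widths and positions are an assumption, since, as just noted, they may be adjusted in practice.

```c
#include <stdint.h>

/* One possible packing of the four fields of fig. 5 (layout assumed). */
typedef struct {
    uint32_t field1;  /* [31:28] data move type, [27:0] data length        */
    uint32_t field2;  /* [31:0]  low 32 bits of the source address         */
    uint32_t field3;  /* [31:16] high bits of the destination address,
                         [15:0]  high bits of the source address           */
    uint32_t field4;  /* [31:0]  low 32 bits of the destination address    */
} dma_move_insn_t;
```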
Assume that 0000 (binary) indicates data move within the last level cache system, 0001 (binary) indicates data move within the memory system, 0010 (binary) indicates data move from the memory system to the last level cache system, and 0011 (binary) indicates data move from the last level cache system to the memory system.
In the above situation, assume that 2 megabytes of data are to be moved from the memory system, at low address 0x3EAB_0000 (hexadecimal) and high address 0xAB_00 (hexadecimal), to the last-level cache system, at low address 0x3E5B_0000 (hexadecimal) and high address 0xCD_00 (hexadecimal).
At this time, when constructing the data move instruction for the DMA, the chip's processing core may write 0010 into the first 4 bits of the first field and write the binary representation of 2 megabytes into the remaining 28 bits of the first field. The processing core may then write the memory system's low address 0x3EAB_0000, converted to binary, into the second field, and write the memory system's high address 0xAB_00, converted to binary, into the last sixteen bits of the third field. Finally, the processing core may write the last-level cache system's high address 0xCD_00 into the first sixteen bits of the third field, and the last-level cache system's low address 0x3E5B_0000, converted to binary, into the fourth field.
After the processing core completes construction of the data move instruction, it may broadcast the instruction to each DMA, so that each DMA, in response to the instruction, moves 2 megabytes of data from low address 0x3EAB_0000 / high address 0xAB_00 of the memory system to low address 0x3E5B_0000 / high address 0xCD_00 of the last-level cache system.
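A C sketch of this construction is given below. The type encodings are the binary values assumed above; the bit positions within each field are assumptions consistent with the description.

```c
#include <stdint.h>

enum move_type {               /* encodings as assumed above (binary) */
    MV_CACHE_TO_CACHE = 0x0,   /* within the last-level cache system  */
    MV_MEM_TO_MEM     = 0x1,   /* within the memory system            */
    MV_MEM_TO_CACHE   = 0x2,   /* memory system -> last-level cache   */
    MV_CACHE_TO_MEM   = 0x3    /* last-level cache -> memory system   */
};

/* Pack the worked example: move 2 MB from the memory system
 * (low 0x3EAB_0000, high 0xAB_00) to the last-level cache system
 * (low 0x3E5B_0000, high 0xCD_00). */
void build_example_insn(uint32_t insn[4])
{
    uint32_t len = 2u * 1024 * 1024;     /* 2 MB, fits in 28 bits */
    insn[0] = ((uint32_t)MV_MEM_TO_CACHE << 28) | (len & 0x0FFFFFFFu);
    insn[1] = 0x3EAB0000u;               /* source low address      */
    insn[2] = (0xCD00u << 16) | 0xAB00u; /* dest high | source high */
    insn[3] = 0x3E5B0000u;               /* destination low address */
}
```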
Therefore, since the data move instruction carries the data move type, data length, source storage address, and destination storage address in only four fields, it is shorter than the six-field instruction of the related art, and the overhead incurred when the DMA is called can be reduced.
In one embodiment, the 6 fields of the data move instruction in the related art may be merged, so as to reduce the number of fields the data move instruction contains.
In practical applications, the data move type requires only a few bits, so occupying an entire field (32 bits) for it alone would be wasteful; the data move type and the data length can therefore be combined into one field. Since the high-speed memory typically has a small total capacity (e.g., several megabytes), the high-speed memory low-address field and the high-speed memory high-address field can likewise be combined into one field.
Referring to fig. 6, fig. 6 is a schematic diagram illustrating a data move instruction according to the present application. As shown in fig. 6, the data move instruction at least includes a first field, a second field, a third field, and a fourth field;
wherein, the first field is a field indicating the data moving type and the data length;
the second field is a field indicating the storage address of the last-level high-speed memory;
the third field is a low address field indicating a memory system;
the fourth field is a high address field indicating a memory system.
Likewise, the order of the fields of the data move instruction and the positions of the data bits indicating different meanings within the fields may be adjusted according to the actual situation, and are not limited here.
The meaning of the first field indication can refer to the foregoing embodiments, and is not described in detail herein.
The second field indicates the start address of the storage space of the last-level high-speed memory. When the first field indicates that the data is moved from the last-level high-speed memory to the memory system, the storage address indicated by the second field is the initial position of the current storage position of the data. When the first field indicates that the data is moved from the memory system to the last-level high-speed memory, the storage address indicated by the second field is the initial position of the storage position after the data is moved.
The meaning of the third field and the fourth field may refer to the second field, and will not be described in detail.
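Under the same assumptions as before, the merged format of fig. 6 could be represented as follows; the bit layout is again an assumption. Because the last-level high-speed memory is small, its full address fits in a single field, while the larger memory system keeps separate low- and high-address fields.

```c
#include <stdint.h>

/* One possible packing of the four fields of fig. 6 (layout assumed). */
typedef struct {
    uint32_t type_and_len;   /* [31:28] data move type, [27:0] data length */
    uint32_t cache_addr;     /* full last-level high-speed memory address  */
    uint32_t mem_addr_low;   /* low 32 bits of the memory-system address   */
    uint32_t mem_addr_high;  /* high bits of the memory-system address     */
} dma_move_insn_v2_t;
```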
As can be seen from the above, since the data move instruction includes only four fields, when the DMA is called, the call overhead of the DMA can be reduced.
The application also provides an electronic device comprising the chip shown in any of the above embodiments.
For example, the electronic device may be a smart terminal such as a mobile phone, or another device that has a camera and can perform image processing. When the electronic device acquires a captured image, the image may be processed, and the chip of the embodiments of the present application may be used to perform the computing tasks involved in that processing.
Since the chip improves the data movement efficiency of the memory partitions and delivers higher performance, using it helps improve the processing efficiency of computing tasks, thereby improving the performance of the electronic device.
One skilled in the art will recognize that one or more embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
"and/or" as recited herein means having at least one of two, for example, "a and/or B" includes three scenarios: A. b, and "A and B".
The embodiments in the present application are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and functional operations described in this application may be implemented in the following: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this application and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this application can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows described above can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing system. Generally, a central processing system will receive instructions and data from a read-only memory and/or a random access memory. The essential components of a computer include a central processing system for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Although this application contains many specific implementation details, these should not be construed as limiting the scope of any disclosure or of what may be claimed, but rather as merely describing features of particular disclosed embodiments. Certain features that are described in this application in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing is merely a preferred embodiment of one or more embodiments of the present application and is not intended to limit the scope of the one or more embodiments of the present application, such that any modifications, equivalents, improvements and the like which come within the spirit and principle of one or more embodiments of the present application are included within the scope of the one or more embodiments of the present application.

Claims (20)

1. A chip, wherein the chip comprises:
at least one processing core and at least one memory partition;
wherein the memory partition comprises a cache system, a memory system, and a direct memory access controller (DMA);
the DMA is respectively connected with the cache system and the memory system and is used for carrying out data transfer between different storage spaces in the memory partition.
2. The chip of claim 1, wherein a first processing core of the at least one processing core is coupled to the DMA in a first memory partition of the at least one memory partition;
a first processing core of the at least one processing core is configured to send a data move instruction to a DMA in a first memory partition, where the at least one memory partition includes the first memory partition;
and the DMA in the first memory partition is used for carrying out data transfer between different memory spaces in the first memory partition based on the data transfer instruction.
3. The chip according to claim 1 or 2, wherein the cache system comprises a plurality of levels of caches, wherein at least a part of the storage space of the last level of cache in the plurality of levels of caches is configured as a Scratch Pad Memory (SPM);
the DMA is used for carrying out data transfer between the storage space configured as the SPM in the last-level cache and the memory system.
4. The chip of any of claims 1 to 3, wherein a last-level cache of the cache system supports three operation modes, wherein in a first operation mode, the entire storage space of the last-level cache is configured as a cache; in a second operation mode, the entire storage space of the last-level cache is configured as an SPM; and in a third operation mode, a part of the storage space of the last-level cache is configured as a cache and another part of the storage space is configured as an SPM.
5. The chip of claim 4, wherein the memory partition further comprises a mode configurator configured to configure a mode of operation of a last level cache in the cache system based on user configuration information.
6. The chip according to any one of claims 1 to 5, wherein the at least one processing core of the chip and the DMA in the at least one memory partition access each other through a main network-on-chip;
and/or the DMA, the cache system and the memory system access each other through a sub network-on-chip.
7. The chip according to any one of claims 1 to 6, wherein the data movement between different memory spaces inside the memory partitions comprises at least one of:
data movement between different storage spaces in a last-level cache of the cache system;
data movement between different storage spaces in the memory system;
data movement between a storage space in a last level cache of the cache system and a storage space in the memory system.
8. The chip according to any one of claims 1 to 7, wherein the at least one memory partition included in the chip is a plurality of memory partitions, and the plurality of memory partitions all employ unified memory access.
9. The chip of any one of claims 1 to 8, wherein the processing core is configured to broadcast a data move instruction to at least one DMA in the at least one memory partition.
10. The chip according to any one of claims 2 to 9, wherein the data movement instruction comprises: data move type, data length, source memory address, and destination memory address.
11. The chip of claim 10, wherein the data movement instruction includes a first field, a second field, a third field, and a fourth field;
wherein the first field is used for indicating a data move type and a data length;
the second field is used for indicating a low address of a source storage address;
the third field is used for indicating a high address of the source storage address and a high address of a destination storage address;
the fourth field is to indicate a low address of the destination memory address.
12. The chip of any one of claims 1-11, wherein the DMA is configured to:
reading data from a first memory space within the memory partition and writing the read data to a second memory space within the memory partition.
13. The chip according to any of claims 1 to 12, wherein the memory system is a High Bandwidth Memory (HBM).
14. A data moving method, applied to a chip, wherein the chip comprises at least one processing core and at least one memory partition, and the memory partition comprises a cache system, a memory system and a direct memory access controller (DMA);
the method comprises the following steps:
the processing core sends a data moving instruction to the DMA;
and the DMA carries out data transfer among different storage spaces in the memory partition based on the data transfer instruction.
15. The method of claim 14, wherein the sending, by the processing core, a data move instruction to the DMA comprises:
a first processing core of the at least one processing core sends a data moving instruction to a DMA in a first memory partition, wherein the at least one memory partition comprises the first memory partition;
the DMA carries out data transfer between different storage spaces in the memory partition based on the data transfer instruction, and the data transfer method comprises the following steps:
and the DMA in the first memory partition carries out data transfer among different memory spaces in the first memory partition based on the data transfer instruction.
16. The method according to claim 14 or 15, wherein the cache system comprises a plurality of levels of cache, wherein at least a part of the storage space of the last level of cache in the plurality of levels of cache is configured as a Scratch Pad Memory (SPM);
the DMA carries out data transfer between different storage spaces in the memory partition based on the data transfer instruction, and the data transfer method comprises the following steps:
and the DMA carries out data transfer between the storage space configured as the SPM in the last-level cache and the memory system based on the data transfer instruction.
17. The method of claim 16, further comprising:
configuring an operating mode of a last level cache in the cache system based on user configuration information.
18. The method according to any one of claims 14-17, wherein the at least one memory partition included in the chip is a plurality of memory partitions, each of the plurality of memory partitions employing unified memory access;
the processing core sends a data moving instruction to the DMA, and the data moving instruction comprises the following steps:
the processing core broadcasts a data move instruction to at least one DMA in the at least one memory partition.
19. The method according to any one of claims 14 to 18, wherein the DMA performs data movement between different storage spaces within the memory partition based on the data movement instruction, including:
and the DMA reads data from a first storage space in the memory partition based on the data moving instruction, and writes the read data into a second storage space in the memory partition.
20. An electronic device, comprising: the chip of any one of claims 1 to 13.
CN202011458676.7A 2020-12-10 2020-12-10 Chip, data moving method and electronic equipment Pending CN112506437A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202011458676.7A CN112506437A (en) 2020-12-10 2020-12-10 Chip, data moving method and electronic equipment
JP2022527673A JP2023509818A (en) 2020-12-10 2021-06-22 Chip, data transfer method and electronic device
PCT/CN2021/101547 WO2022121278A1 (en) 2020-12-10 2021-06-22 Chip, data moving method, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011458676.7A CN112506437A (en) 2020-12-10 2020-12-10 Chip, data moving method and electronic equipment

Publications (1)

Publication Number Publication Date
CN112506437A true CN112506437A (en) 2021-03-16

Family

ID=74973679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011458676.7A Pending CN112506437A (en) 2020-12-10 2020-12-10 Chip, data moving method and electronic equipment

Country Status (3)

Country Link
JP (1) JP2023509818A (en)
CN (1) CN112506437A (en)
WO (1) WO2022121278A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115034376B (en) * 2022-08-12 2022-11-18 上海燧原科技有限公司 Batch standardization processing method of neural network processor and storage medium

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002538522A (en) * 1999-02-22 2002-11-12 インフィネオン・テクノロジーズ・アーゲー Method and apparatus for facilitating direct memory access
US6859862B1 (en) * 2000-04-07 2005-02-22 Nintendo Co., Ltd. Method and apparatus for software management of on-chip cache
JP4204759B2 (en) * 2001-03-09 2009-01-07 インターナショナル・ビジネス・マシーンズ・コーポレーション DMA transfer control method and control apparatus
ES2345388T3 (en) * 2004-02-12 2010-09-22 Irdeto Access B.V. EXTERNAL DATA STORAGE METHOD AND SYSTEM.
US7853752B1 (en) * 2006-09-29 2010-12-14 Tilera Corporation Caching in multicore and multiprocessor architectures
JP2011086131A (en) * 2009-10-16 2011-04-28 Mitsubishi Electric Corp Data processing system
CN101930357B (en) * 2010-08-17 2013-07-31 中国科学院计算技术研究所 System and method for realizing accessing operation by adopting configurable on-chip storage device
US10078593B2 (en) * 2011-10-28 2018-09-18 The Regents Of The University Of California Multiple-core computer processor for reverse time migration
CN102521201A (en) * 2011-11-16 2012-06-27 刘大可 Multi-core DSP (digital signal processor) system-on-chip and data transmission method
JP5776821B2 (en) * 2013-08-26 2015-09-09 富士ゼロックス株式会社 Information processing apparatus, arithmetic processing apparatus, and program
US9959227B1 (en) * 2015-12-16 2018-05-01 Amazon Technologies, Inc. Reducing input/output latency using a direct memory access (DMA) engine
CN108153190B (en) * 2017-12-20 2020-05-05 新大陆数字技术股份有限公司 Artificial intelligence microprocessor
TWI720345B (en) * 2018-09-20 2021-03-01 威盛電子股份有限公司 Interconnection structure of multi-core system
CN111797034A (en) * 2020-06-24 2020-10-20 深圳云天励飞技术有限公司 Data management method, neural network processor and terminal equipment
CN112506437A (en) * 2020-12-10 2021-03-16 上海阵量智能科技有限公司 Chip, data moving method and electronic equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645052A (en) * 2008-08-06 2010-02-10 中兴通讯股份有限公司 Quick direct memory access (DMA) ping-pong caching method
CN104298645A (en) * 2014-10-09 2015-01-21 深圳市国微电子有限公司 Flexibly configured programmable system-on-chip chip and starting configuration method thereof
CN107562659A (en) * 2016-06-30 2018-01-09 中兴通讯股份有限公司 A kind of data-moving device and method
CN109933553A (en) * 2019-02-28 2019-06-25 厦门码灵半导体技术有限公司 A kind of control system and its design method, a set control system, electronic device
CN110059024A (en) * 2019-04-19 2019-07-26 中国科学院微电子研究所 A kind of memory headroom data cache method and device
CN111782154A (en) * 2020-07-13 2020-10-16 北京四季豆信息技术有限公司 Data moving method, device and system
CN111739577A (en) * 2020-07-20 2020-10-02 成都智明达电子股份有限公司 DSP-based efficient DDR test method

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022121278A1 (en) * 2020-12-10 2022-06-16 上海阵量智能科技有限公司 Chip, data moving method, and electronic device
CN113220346A (en) * 2021-04-29 2021-08-06 上海阵量智能科技有限公司 Hardware circuit, data moving method, chip and electronic equipment
WO2022227563A1 (en) * 2021-04-29 2022-11-03 上海阵量智能科技有限公司 Hardware circuit, data migration method, chip, and electronic device
WO2023220996A1 (en) * 2022-05-18 2023-11-23 深圳市韶音科技有限公司 Signal transmission control system
CN116308999A (en) * 2023-05-18 2023-06-23 南京砺算科技有限公司 Data processing method of graphic processor, graphic processor and storage medium
CN116308999B (en) * 2023-05-18 2023-08-08 南京砺算科技有限公司 Data processing method of graphic processor, graphic processor and storage medium
CN116610630A (en) * 2023-07-14 2023-08-18 上海芯高峰微电子有限公司 Multi-core system and data transmission method based on network-on-chip
CN116610630B (en) * 2023-07-14 2023-11-03 上海芯高峰微电子有限公司 Multi-core system and data transmission method based on network-on-chip
CN117667828A (en) * 2024-01-31 2024-03-08 摩尔线程智能科技(北京)有限责任公司 Network-on-chip integration method, device and storage medium
CN117667828B (en) * 2024-01-31 2024-05-03 摩尔线程智能科技(北京)有限责任公司 Network-on-chip integration method, device and storage medium

Also Published As

Publication number Publication date
WO2022121278A1 (en) 2022-06-16
JP2023509818A (en) 2023-03-10

Similar Documents

Publication Publication Date Title
CN112506437A (en) Chip, data moving method and electronic equipment
CN110309088B (en) ZYNQ FPGA chip, data processing method thereof and storage medium
US10795837B2 (en) Allocation of memory buffers in computing system with multiple memory channels
US9569381B2 (en) Scheduler for memory
CN104081366A (en) Apparatus and method to provide cache move with non-volatile mass memory system
CN110737608B (en) Data operation method, device and system
CN101310241A (en) Method and apparatus for sharing memory in a multiprocessor system
CN115033184A (en) Memory access processing device and method, processor, chip, board card and electronic equipment
CN113065643A (en) Apparatus and method for performing multi-task convolutional neural network prediction
US8621140B2 (en) Flash memory apparatus for controlling operation in response to generation of interrupt signal and method of controlling the same
CN114942831A (en) Processor, chip, electronic device and data processing method
CN113033785B (en) Chip, neural network training system, memory management method, device and equipment
US8977800B2 (en) Multi-port cache memory apparatus and method
KR20200011731A (en) Memory device and processing system
EP1513071A2 (en) Memory bandwidth control device
CN112306693A (en) Data packet processing method and device
US20190311517A1 (en) Data processing system including an expanded memory card
WO2022227563A1 (en) Hardware circuit, data migration method, chip, and electronic device
CN113778333A (en) Combined chip, storage device and operation method for storage object
CN111913812A (en) Data processing method, device, equipment and storage medium
CN102073604B (en) Method, device and system for controlling read and write of synchronous dynamic memory
CN115981751B (en) Near-memory computing system, near-memory computing method, near-memory computing device, medium and equipment
US11094368B2 (en) Memory, memory chip and memory data access method
CN117312201B (en) Data transmission method and device, accelerator equipment, host and storage medium
CN113176911A (en) Configuration method, data processing method, chip and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40039719

Country of ref document: HK