WO2022121278A1 - Chip, data movement method and electronic device - Google Patents

Chip, data movement method and electronic device

Info

Publication number
WO2022121278A1
WO2022121278A1 · PCT/CN2021/101547 · CN2021101547W
Authority
WO
WIPO (PCT)
Prior art keywords
data
memory
dma controller
cache
Prior art date
Application number
PCT/CN2021/101547
Other languages
English (en)
French (fr)
Inventor
冷祥纶
周俊
王文强
Original Assignee
上海阵量智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海阵量智能科技有限公司 filed Critical 上海阵量智能科技有限公司
Priority to JP2022527673A (published as JP2023509818A)
Publication of WO2022121278A1

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F 3/0647 Migration mechanisms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 Digital computers in general; Data processing equipment in general
    • G06F 15/76 Architectures of general purpose stored program computers
    • G06F 15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F 15/7825 Globally asynchronous, locally synchronous, e.g. network on chip
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 Improving or facilitating administration, e.g. storage management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to computer technology, in particular to a chip, a data transfer method and an electronic device.
  • when data movement between a first storage space and a second storage space needs to be performed within a memory partition inside the chip, the processing core must first read the data from the first storage space and store it inside the core; the processing core then reads out the stored data and writes it into the second storage space.
  • the present application discloses a chip, and the chip includes:
  • at least one processing core and at least one memory partition, wherein each memory partition includes a cache system, a memory system, and a direct memory access (DMA) controller; the DMA controller is connected to the cache system and the memory system respectively, and is used to perform data movement between different storage spaces within the memory partition.
  • the first processing core in the at least one processing core is configured to send a data move instruction to at least one first DMA controller, wherein the at least one first DMA controller is included in at least one first memory partition.
  • the at least one first DMA controller is configured to perform data movement between different storage spaces within the at least one first memory partition based on the data movement instruction.
  • the cache system includes a multi-level cache; the DMA controller being used to perform data movement between the storage space of the cache system and the storage space in the memory system includes the DMA controller performing data movement between the storage space of the last-level cache and the storage space in the memory system.
  • the last-level cache supports three working modes, wherein, in the first working mode, the entire storage space of the last-level cache is configured as a cache memory; in the second working mode, the entire storage space of the last-level cache is configured as a scratchpad memory (SPM); and in the third working mode, a part of the storage space of the last-level cache is configured as a cache memory and another part of the storage space is configured as SPM.
  • the memory partition further includes a mode configurator, and the mode configurator is configured to configure the working mode of the last level cache based on user configuration information.
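The three working modes and the mode configurator described above can be sketched as follows in C; the `llc_config` type, the byte-granular split, and the return-code convention are illustrative assumptions, not anything specified by the application.

```c
#include <assert.h>
#include <stdint.h>

/* Three working modes of the last-level cache (LLC), as described above. */
typedef enum {
    LLC_ALL_CACHE,  /* first mode: entire storage space is cache memory */
    LLC_ALL_SPM,    /* second mode: entire storage space is SPM         */
    LLC_SPLIT       /* third mode: part cache memory, part SPM          */
} llc_mode;

typedef struct {
    llc_mode mode;
    uint32_t spm_bytes;   /* size of the SPM portion                    */
    uint32_t total_bytes; /* total LLC storage space                    */
} llc_config;

/* Hypothetical mode configurator: applies user configuration to the LLC.
 * Returns 0 on success, -1 on an invalid split. */
static int configure_llc(llc_config *cfg, llc_mode mode, uint32_t spm_bytes)
{
    if (mode == LLC_SPLIT && (spm_bytes == 0 || spm_bytes >= cfg->total_bytes))
        return -1;  /* split mode needs both a cache part and an SPM part */
    cfg->mode = mode;
    cfg->spm_bytes = (mode == LLC_ALL_SPM)   ? cfg->total_bytes
                   : (mode == LLC_ALL_CACHE) ? 0
                                             : spm_bytes;
    return 0;
}
```

A user-configuration front end would simply translate the user's configuration information into one `configure_llc` call.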
  • the at least one processing core and the DMA controller access each other through a main on-chip network; and/or the DMA controller, the cache system, and the memory system access each other through a sub on-chip network.
  • the DMA controller being used to perform data movement between different storage spaces within the memory partition includes performing at least one of the following: data movement between different storage spaces in the cache system; data movement between different storage spaces in the memory system; data movement between the storage space of the cache system and the storage space in the memory system.
  • all or part of the different storage spaces in the memory partitions use a unified memory architecture (UMA).
  • the first processing core being configured to send a data move instruction to the at least one first DMA controller includes the first processing core being configured to broadcast the data move instruction to at least one second DMA controller, wherein the second DMA controller is included in a first memory partition in which the different storage spaces all use UMA.
  • the above-mentioned data move instruction includes: data move type, data length, source storage address, and destination storage address.
  • the data move instruction includes a first field, a second field, a third field and a fourth field; wherein the first field is used to indicate the data move type and the data length; the second field is used to indicate the low address of the source storage address; the third field is used to indicate the high address of the source storage address and the high address of the destination storage address; the fourth field is used to indicate the low address of the destination storage address.
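A hypothetical C encoding of this four-field layout is sketched below. The 32-bit field width, the 4-bit type / 28-bit length split, and the assumption of 48-bit addresses (so that both 16-bit high halves share the third field) are illustrative choices, not details specified by the application.

```c
#include <assert.h>
#include <stdint.h>

/* Four-field data-move instruction, following the layout described above. */
typedef struct {
    uint32_t field1;  /* data-move type (top 4 bits) + data length (28 bits) */
    uint32_t field2;  /* low 32 bits of the source storage address           */
    uint32_t field3;  /* high 16 bits of source addr | high 16 bits of dest  */
    uint32_t field4;  /* low 32 bits of the destination storage address      */
} dma_move_insn;

/* Pack a move request, assuming 48-bit physical addresses. */
static dma_move_insn pack_move(uint8_t type, uint32_t len,
                               uint64_t src, uint64_t dst)
{
    dma_move_insn i;
    i.field1 = ((uint32_t)type << 28) | (len & 0x0FFFFFFFu);
    i.field2 = (uint32_t)(src & 0xFFFFFFFFu);
    i.field3 = ((uint32_t)((src >> 32) & 0xFFFFu) << 16)
             | (uint32_t)((dst >> 32) & 0xFFFFu);
    i.field4 = (uint32_t)(dst & 0xFFFFFFFFu);
    return i;
}

/* Recover the full 48-bit addresses from the packed fields. */
static uint64_t unpack_src(const dma_move_insn *i)
{
    return ((uint64_t)(i->field3 >> 16) << 32) | i->field2;
}

static uint64_t unpack_dst(const dma_move_insn *i)
{
    return ((uint64_t)(i->field3 & 0xFFFFu) << 32) | i->field4;
}
```

Sharing the two high halves in one field is what lets the instruction fit in four words instead of six, which is the overhead reduction the application claims for this format.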
  • the DMA controller being used to perform data movement between different storage spaces in the memory partition includes: reading data from a first storage space in the memory partition, and writing the read data into a second storage space in the memory partition.
  • the memory system described above is a high-bandwidth memory (HBM).
  • the present application also proposes a data moving method, which is applied to a chip, wherein the chip includes at least one processing core and at least one memory partition, and each memory partition includes a cache system, a memory system, and a direct memory access (DMA) controller; the method includes: for each memory partition, performing data movement between different storage spaces within the memory partition through the DMA controller.
  • the performing data movement between different storage spaces within the memory partition by the DMA controller includes: sending, by a first processing core in the at least one processing core, a data move instruction to at least one first DMA controller, wherein the at least one first DMA controller is included in at least one first memory partition; and performing, by the at least one first DMA controller based on the data move instruction, data movement between different storage spaces within the at least one first memory partition.
  • the cache system includes a multi-level cache; the performing data movement between different storage spaces within the memory partition by the DMA controller includes: performing, by the DMA controller, data movement between the storage space of the last-level cache and the storage space in the memory system.
  • the last-level cache supports three working modes, wherein, in the first working mode, the entire storage space of the last-level cache is configured as a cache memory; in the second working mode, the entire storage space of the last-level cache is configured as SPM; and in the third working mode, a part of the storage space of the last-level cache is configured as a cache memory and another part of the storage space is configured as SPM.
  • the memory partition further includes a mode configurator; the method further includes: based on the user configuration information, configuring the working mode of the last level cache through the mode configurator.
  • the at least one processing core and the DMA controller access each other through the main on-chip network; and/or, the DMA controller, the cache system and the memory system communicate with each other through the sub-chip network access.
  • data transfer between different storage spaces within the memory partition includes at least one of the following: data transfer between different storage spaces in the cache system; data transfer between different storage spaces in the memory system Data movement between storage spaces; data movement between the storage space of the above-mentioned cache system and the storage space of the above-mentioned memory system.
  • all or part of the different storage spaces in the memory partitions use a unified memory architecture (UMA).
  • the sending the data move instruction to the at least one first DMA controller through the first processing core includes: broadcasting, by the first processing core, the data move instruction to at least one second DMA controller, wherein the second DMA controller is included in a first memory partition in which the different storage spaces all adopt the unified memory architecture (UMA).
  • the above-mentioned data move instruction includes: data move type, data length, source storage address, and destination storage address.
  • the data move instruction includes a first field, a second field, a third field and a fourth field; wherein the first field is used to indicate the data move type and the data length; the second field is used to indicate the low address of the source storage address; the third field is used to indicate the high address of the source storage address and the high address of the destination storage address; the fourth field is used to indicate the low address of the destination storage address.
  • the performing data movement between different storage spaces in the memory partition by the DMA controller includes: reading, by the DMA controller, data from a first storage space in the memory partition, and writing the read data into a second storage space in the memory partition.
  • the memory system described above is a high-bandwidth memory (HBM).
  • the present application further provides an electronic device, including: the chip shown in any of the above embodiments.
  • since the DMA controller is connected to the cache system and the memory system respectively, and is used to perform data movement between different storage spaces in the memory partition, the data can be moved entirely inside the memory partition without preempting the memory access bandwidth of the chip; during the data move, the memory access bandwidth inside the chip is thus released, the data moving efficiency is improved, and the chip performance is improved.
  • the processing core sends a data move instruction to the DMA controller, and the DMA controller, in response to the instruction, performs data movement between different storage spaces in the memory partition; the data is moved inside the memory partition, thereby releasing the memory access bandwidth inside the chip, improving the data moving efficiency, and improving the chip performance.
  • the use of the chip can help improve the processing efficiency of computing tasks, thereby improving the performance of the electronic device.
  • FIG. 1 is an internal structure diagram of an AI chip;
  • FIG. 2 is an internal structure diagram of a chip shown in this application;
  • FIG. 3 is a structural diagram of a chip shown in this application;
  • FIG. 4 is a structural diagram of a chip shown in this application;
  • FIG. 5 is a schematic diagram of a data move instruction shown in this application;
  • FIG. 6 is a schematic diagram of a data move instruction shown in this application;
  • FIG. 7 is a method flowchart of a data moving method shown in this application.
  • FIG. 1 is an internal structure diagram of an AI chip.
  • the processing core of the AI chip is connected to a memory partition; wherein, the memory partition at least includes a memory system and a cache system.
  • when part of the data in the memory system needs to be moved to the cache system, the processing core first reads that data from the memory system with a read command and stores it inside the processing core; the processing core then writes the data into the cache system with a write command.
  • the present application proposes a chip.
  • the chip adds a DMA (Direct Memory Access) controller connected to the cache system and the memory system in each memory partition, so that the DMA controller can execute data move instructions between different storage spaces inside the memory partition, thereby freeing the memory access bandwidth inside the chip, improving data moving efficiency, and improving chip performance.
  • FIG. 2 is an internal structure of a chip shown in this application. As shown in Figure 2, the above chip includes:
  • at least one processing core 21 and at least one memory partition 22.
  • each memory partition 22 includes a cache system 221 , a memory system 222 , and a DMA controller 223 .
  • the DMA controller 223 is connected to the cache system 221 and the memory system 222 respectively, and is used for data transfer between different storage spaces within the memory partition 22 .
  • the last level cache included in the cache system 221 can be connected to the DMA controller 223 .
  • the DMA controller 223 may be connected to the cache of the corresponding level involved. There is no particular limitation here.
  • the DMA controller may read data from the first storage space in the memory partition, and write the read data into the second storage space in the memory partition.
  • the above-mentioned first storage space is a memory system
  • the above-mentioned second storage space is an L2 cache.
  • the DMA controller may control data transfer between the memory system and the L2 cache in response to a data transfer instruction sent by the processing core.
  • a memory partition may include one or more DMA controllers.
  • a memory partition includes a DMA controller responsible for moving data between all storage spaces within the memory partition.
  • a memory partition includes multiple DMA controllers, and each DMA controller in the multiple DMA controllers may be responsible for data movement between one or more pairs of storage spaces in the memory partition.
  • the present application does not limit the specific locations of these DMA controllers.
  • the DMA controllers may be distributed in each memory partition, or may be centralized in one of the memory partitions.
  • the above chip may specifically be any chip that requires high memory access bandwidth.
  • the above chips may be equipped with multi-channel DRAM (Dynamic Random Access Memory) chips.
  • the above chip may be a CPU, a DSP, an MCU, or the like.
  • the aforementioned chip may execute artificial intelligence algorithms.
  • the above chip may be an AI neural network chip (e.g., FPGA, TPU, etc.) or a GPU graphics processing chip.
  • the processing core, usually a computing core in a chip, is used for executing code operations and may include one or more processing units.
  • the above-mentioned processing core can usually perform data movement in the above-mentioned memory partition according to the program code formulated by the developer.
  • data movement between storage spaces inside the memory partition may generally include movement of data within the cache system of the memory partition, movement of data within the memory system of the memory partition, and data movement between the last-level cache and the memory system of the memory partition.
  • the above memory partitions are usually used to store data.
  • the chip usually adopts memory partitions organized into storage levels.
  • the above-mentioned memory partition may include a cache system with one or more levels of caches and a memory system.
  • the cache system 221 described above may include at least L1, L2 and L3 caches.
  • when the processing core 21 needs to obtain data, it usually first accesses the L1 cache. If the required data is stored in the L1 cache, the acquisition completes. If not, the processing core 21 continues to access the L2 cache to obtain the required data, and so on. If the last-level cache, that is, the L3 cache, does not contain the required data, the processing core 21 acquires the data from the memory system 222.
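The lookup order just described can be modeled minimally as follows; the array-of-tags representation is a toy stand-in for real cache lookup hardware, used only to make the level-by-level fallback concrete.

```c
#include <assert.h>
#include <stdint.h>

#define LEVELS 3  /* L1, L2, L3 (the last-level cache) */

/* Probe each cache level in order for an address.
 * Returns the level that hit (1..3), or 0 when the request
 * falls through to the memory system. */
static int lookup(const uint64_t *level_tags[LEVELS],
                  const int counts[LEVELS], uint64_t addr)
{
    for (int lv = 0; lv < LEVELS; lv++)
        for (int i = 0; i < counts[lv]; i++)
            if (level_tags[lv][i] == addr)
                return lv + 1;   /* hit at this cache level */
    return 0;                    /* miss everywhere: fetch from memory */
}
```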
  • the last-level cache can be used as a large-capacity cache, and the DMA controller is used to perform data movement between the storage space of the last-level cache and the storage space in the memory system.
  • at least a part of the storage space of the cache system may be configured as a scratchpad memory (SPM).
  • otherwise, the data transfer efficiency of this part of the storage space would be affected.
  • at least a part of the storage space of the last level cache is configured as SPM.
  • the DMA controller when performing data transfer, the DMA controller is used to perform data transfer between the storage space configured as SPM in the last level cache and the memory system. Since the data transfer between the storage space configured as SPM in the last level cache and the memory system is performed by the DMA controller, the data to be moved can be prevented from passing through the processing core, thereby releasing the bandwidth, shortening the data transfer path, and improving the Data movement efficiency.
  • the last-level cache of the cache system supports three working modes, wherein, in the first working mode, the entire storage space of the last-level cache is configured as a cache memory; in the second working mode, the entire storage space of the last-level cache is configured as SPM; and in the third working mode, a part of the storage space of the last-level cache is configured as a cache memory and another part of the storage space is configured as SPM.
  • the above-mentioned memory partition may further include a mode configurator.
  • the above-mentioned mode configurator is configured to configure the working mode of the last-level cache in the above-mentioned cache system based on the user configuration information.
  • the developer can configure the working mode of the last-level cache through the mode configurator based on the user configuration information.
  • the entire storage space of the above-mentioned last-level cache can be configured as SPM.
  • the entire storage space of the last level cache can be configured as a cache memory.
  • part of the storage space of the last-level cache can be configured as cache memory, and part of the storage space can be configured as SPM to store AI operation parameters.
  • the above-mentioned memory system may be a global memory system.
  • it can be DRAM (Dynamic Random Access Memory), SDRAM (Synchronous Dynamic Random Access Memory), and so on.
  • the above-mentioned global memory system may be a high bandwidth memory (High Bandwidth Memory, HBM).
  • FIG. 3 is a structural diagram of a chip shown in this application. As shown in FIG. 3, the above-mentioned DMA controller, at least one processing core, and at least one memory partition are connected by a bus.
  • a processing core will send a data move instruction to the DMA controller, so that the DMA controller completes the data move.
  • the DMA controller is built into the memory partition, so that the data move can be completed inside the memory partition without preempting the memory access bandwidth of the chip.
  • since the DMA controller is connected to the cache system and the memory system respectively and performs data movement between different storage spaces within the memory partition, the data can be moved inside the memory partition without preempting the memory access bandwidth of the chip; further, during the data move, the memory access bandwidth inside the chip is released, the data moving efficiency is improved, and the chip performance is improved.
  • the first processing core in the at least one processing core is connected to at least one first DMA controller; the at least one first DMA controller is included in at least one first memory partition, and the first memory partitions may be all or part of the above memory partitions.
  • the first processing core is configured to send a data transfer instruction to the at least one first DMA controller.
  • the at least one first DMA controller is configured to perform data movement between different storage spaces within the at least one first memory partition based on the data move instruction.
  • the above-mentioned DMA controller is connected with the above-mentioned first processing core.
  • the above-mentioned connection mode may be a bus-based connection.
  • the above-mentioned DMA controller and the above-mentioned processing core can access each other through a main on-chip network (NOC, network-on-chip).
  • the above-mentioned main on-chip network may be the main network in the above-mentioned chip.
  • the chip includes multiple processing cores and multiple memory partitions, the multiple processing cores and the DMA controllers in the multiple memory partitions can access each other through the main on-chip network.
  • the above DMA controller is respectively connected with the above cache system and the above memory system.
  • the above-mentioned connection mode may be a bus-based connection.
  • the DMA controller, the cache system, and the memory system access each other through a sub on-chip network.
  • the above-mentioned sub-on-chip network may be a sub-network in the above-mentioned memory partition.
  • the above-mentioned chip includes a plurality of memory partitions
  • the multiple memory partitions can all use the sub on-chip network, so that the DMA controller, the cache system, and the memory system in each memory partition can access each other through the sub on-chip network (NoC, network-on-chip).
  • the chip may generally include multiple memory partitions. These memory partitions can be connected in parallel with the processing cores.
  • FIG. 4 is a structural diagram of a chip shown in this application.
  • the chip includes multiple processing cores and multiple memory partitions. It should be noted that only the last-level cache of the cache system is shown in each memory partition; caches of other levels are not shown in FIG. 4.
  • Multiple processing cores and multiple memory partitions in the above-mentioned chip can access each other through the above-mentioned main on-chip network.
  • since the chip includes a plurality of memory partitions, in order to make it easier for developers to write programs, the plurality of memory partitions all adopt a Unified Memory Architecture (UMA).
  • UMA can be used for the last level cache in the above-mentioned multiple memory partitions.
  • the memory system in the above-mentioned multiple memory partitions may also employ UMA.
  • the effective addresses are the same across different last-level caches and across different memory systems. Therefore, when writing data to each last-level cache or each memory system, only one address needs to be supplied, with no need to write data separately to multiple last-level caches or multiple memory systems; this simplifies programming for developers and also improves the efficiency of data storage.
  • Each processing core may send data moving instructions to one or more DMA controllers respectively.
  • in order to reduce the overhead of invoking the DMA controllers, the processing core may broadcast a data move instruction to at least one DMA controller in the at least one memory partition.
  • the processing core can broadcast and send a data movement instruction to the DMA controllers in the above-mentioned multiple memory partitions.
  • a chip may include 8 memory partitions.
  • the last-level caches of 4 of the 8 memory partitions (assuming the last-level cache is an L2 cache) and the memory systems in those 4 memory partitions can all use UMA.
  • on the one hand, the processing core can broadcast data move instructions to the DMA controllers in the four memory partitions that use UMA; on the other hand, it can separately send data move instructions to the DMA controllers in the four memory partitions that do not use UMA.
  • after receiving the data move instruction, each DMA controller can extract 1 megabyte of data from the storage location of the memory system indicated by the instruction and move it to the storage location of the L2 cache indicated by the instruction, completing the data move.
  • since the processing core can broadcast data move instructions to the DMA controllers in multiple memory partitions using UMA to complete data movement within each memory partition, the number of calls the processing core makes to the DMA controllers is reduced, thereby reducing the invocation overhead.
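The dispatch policy above (one broadcast reaching all UMA partitions, one unicast per non-UMA partition) can be sketched by counting the send operations the core issues; the `uses_uma` flag array is an assumed representation, not part of the application.

```c
#include <assert.h>
#include <stdbool.h>

/* Count the send operations a processing core issues when dispatching
 * one data-move request: a single broadcast covers every UMA partition,
 * while each non-UMA partition needs its own unicast instruction. */
static int dispatch_count(const bool uses_uma[], int n_partitions)
{
    int sends = 0;
    bool broadcast_done = false;
    for (int p = 0; p < n_partitions; p++) {
        if (uses_uma[p]) {
            if (!broadcast_done) { sends++; broadcast_done = true; }
        } else {
            sends++;  /* unicast a separate data-move instruction */
        }
    }
    return sends;
}
```

With the 8-partition example above (4 UMA, 4 non-UMA), this yields 5 sends instead of 8, which is the invocation-overhead saving the application describes.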
  • a plurality of the above-mentioned DMA controllers included in the above-mentioned chip may be centralized in the same memory partition, and respectively correspond to the memory system and the cache system included in each memory partition in a one-to-one manner.
  • data move instructions can be broadcast to the multiple DMA controllers in the memory partitions, thereby completing data movement between different storage spaces in each memory partition.
  • the following introduces the improvements to the data move instruction in this application.
  • a new format of a data transfer instruction to the DMA controller is proposed.
  • the data moving instruction reduces the number of fields of the data moving instruction and reasonably sets the meaning of each field indication, thereby reducing the length of the data moving instruction and reducing the calling overhead to the DMA controller.
  • the data move instruction sent to the DMA controller includes 6 fields: the data move type field, the data length field, the last-level cache low address field, the last-level cache high address field, the memory system low address field, and the memory system high address field.
  • the above-mentioned data moving instruction may at least include a data moving type, a data length, a source storage address, and a destination storage address.
  • the above data transfer type specifically indicates the direction of data transfer.
  • the above-mentioned data movement type may indicate the data flow direction in the memory partition.
  • the above-mentioned data flow direction may include any of the following four types:
  • movement of data within the cache system of the memory partition; movement of data within the memory system of the memory partition; data migration from the last-level cache to the memory system of the memory partition; and data movement from the memory system of the memory partition to the last-level cache.
  • the above four data flow directions can be mapped to four identifiers. When the DMA controller is actually called, one of the four identifiers is written into the above data moving type field, so that the DMA controller can identify the data flow direction of the current move.
  • the above data length specifically indicates the amount of data to be transmitted. It can be understood that the size of the data has a corresponding relationship with the storage space. Therefore, if the starting position of the data in the storage space is known, the ending position of the data in the storage space can be obtained according to the data length of the data.
  • the above-mentioned source storage address specifically refers to the starting address of the current storage location of the data to be moved. For example, if the data is moved from the memory system to the last level of cache, the above-mentioned source storage address is the starting position of the data in the above-mentioned memory system.
  • the above-mentioned destination storage address specifically refers to the starting address of the storage location where the data to be moved needs to be moved. For example, if data is moved from the memory system to the last level of cache, the destination storage address is the starting position where the data is moved to the last level of cache.
  • the source storage space can be determined according to the source storage address field and the data length in the above-mentioned data moving instruction;
  • the destination storage address field and data length determine the destination storage space;
  • the data in the source storage space can be moved to the destination storage space according to the data move type in the above data move instruction.
  • FIG. 5 is a schematic diagram of a data moving instruction shown in the present application. As shown in Figure 5, the above-mentioned data movement instruction includes a first field, a second field, a third field and a fourth field;
  • the above-mentioned first field is a field indicating data movement type and data length
  • the above-mentioned second field is a field indicating the low address of the source storage address
  • the above-mentioned third field is a field indicating the high address of the source storage address and the high address of the destination storage address;
  • the above-mentioned fourth field is a field indicating the lower address of the destination storage address.
  • 0000 indicates that the data is moved within the cache system
  • 0001 indicates that the data is moved within the memory system
  • 0010 indicates that the data is moved from the memory system to the last level of cache
  • 0011 indicates that the data is moved from the last level of cache to the memory system.
  • when the processing core of the chip constructs the data moving instruction to the DMA controller, it can write 0010 into the first 4 bits of the first field, and write the binary value of 2MB into the last 28 bits of the first field. The processing core can then write the binary value of the memory system low address 0x3EAB_0000 into the second field, and write the binary value of the memory system high address 0xAB_00 into the last sixteen bits of the third field. Finally, the processing core can write the last-level cache high address 0xCD_00 into the first sixteen bits of the third field, and write the binary value of the last-level cache low address 0x3E5B_0000 into the fourth field.
  • the data moving instruction can be broadcast to each DMA controller, so that each DMA controller, in response to the above-mentioned data moving instruction, moves 2 MB of data from the memory system location given by low address 0x3EAB_0000 and high address 0xAB_00 to the last-level cache location given by low address 0x3E5B_0000 and high address 0xCD_00.
  • since the data moving instruction need include only the data moving type and data length field, the source storage address field, and the destination storage address field, the number of fields is reduced, which shortens the instruction and reduces the call overhead to the DMA controller.
  • the six fields in the data moving instruction shown in the related art may be combined, thereby reducing the number of fields included in the data moving instruction.
  • FIG. 6 is a schematic diagram of a data moving instruction shown in the present application.
  • the above-mentioned data movement instruction includes at least a first field, a second field, a third field and a fourth field;
  • the above-mentioned first field is a field indicating data movement type and data length
  • the above-mentioned second field is a field indicating the storage address of the last level cache
  • the above-mentioned third field is a field indicating the low address of the memory system
  • the above-mentioned fourth field is a field indicating the high address of the memory system.
  • the above-mentioned second field indicates the starting address of the storage space of the last level cache.
  • when data is moved from the last-level cache to the memory system, the storage address indicated by the second field is the starting position of the data's current storage location.
  • when data is moved from the memory system to the last-level cache, the storage address indicated by the second field is the starting position of the storage location after the data is moved.
  • the present application also proposes a data transfer method, which is applied to a chip.
  • the processing core issues a data moving instruction to the built-in DMA controller of the memory partition, so that the DMA controller can respond to the data moving instruction issued by the processing core. The data to be moved can thus be moved entirely inside the memory partition, which releases the memory-access bandwidth inside the chip, improves data-moving efficiency, and improves chip performance.
  • FIG. 7 is a method flowchart of a data transfer method shown in the present application, which is applied to a chip. As shown in Figure 7, the above method may include:
  • the processing core sends a data transfer instruction to the DMA controller.
  • the DMA controller performs data transfer between different storage spaces within the memory partition based on the data transfer instruction.
  • the above-mentioned chip may be a chip having the chip structure shown in any of the above-mentioned embodiments. In one embodiment, the above-mentioned chip may adopt the chip structure shown in FIG. 2 . As shown in FIG. 2, the above-mentioned chip includes at least one processing core; at least one memory partition. Wherein, the above-mentioned memory partition includes a cache system, a memory system and a DMA controller. Wherein, the DMA controller is connected to the cache system and the memory system respectively.
  • the above-mentioned memory partition may include a cache system having one or more levels of caches, at least one memory system, and one or more DMA controllers, which are not particularly limited herein.
  • the aforementioned chip may execute artificial intelligence algorithms.
  • the above chip may be an AI neural network chip or a GPU graphics processing chip.
  • the above-mentioned processing core is usually a computing core in a chip, and is used for executing code operations.
  • the above-mentioned processing core can usually perform data movement in the above-mentioned memory partition according to the program code formulated by the developer.
  • the data movement between the storage spaces inside the above-mentioned memory partitions may generally include: movement of data within the cache system of the above-mentioned memory partitions, movement of data within the memory system of the above-mentioned memory partitions, and data movement between the last-level cache and the memory system in the above-mentioned memory partitions.
  • the above memory partitions are usually used to store data.
  • the chip usually adopts memory partitions with storage levels.
  • the above-mentioned memory partition may include a cache system with one or more levels of caches and a memory system.
  • the cache system described above may include at least L1, L2, and L3 caches.
  • when the processing core needs to fetch data, it usually first accesses the L1 cache. If the data required by the processing core is stored in the L1 cache, the processing core completes the data acquisition. If the data required by the processing core is not stored in the L1 cache, the processing core continues to access the L2 cache to obtain the required data, and so on. If the last-level cache, that is, the L3 cache, does not contain the data required by the processing core, the processing core obtains the data from the memory system.
  • the last level cache can be used as the large-capacity cache
  • the DMA controller is used to perform data movement between the storage space of the last-level cache and the storage space in the memory system.
  • in the related art, when at least a part of the storage space of the cache system is configured as SPM, the data-moving efficiency of this part of the storage space will be affected.
  • at least a part of the storage space of the last level cache is configured as SPM.
  • the DMA controller when performing data transfer, the DMA controller is used to perform data transfer between the storage space configured as SPM in the last level cache and the memory system. Since the data transfer between the storage space configured as SPM in the last level cache and the memory system is performed by the DMA controller, the data to be moved can be prevented from passing through the processing core, thereby releasing the bandwidth, shortening the data transfer path, and improving the Data movement efficiency.
  • the last level cache of the above-mentioned cache system supports three working modes, wherein, in the first working mode, all the storage space of the above-mentioned last level cache is Configured as a cache memory, in the second working mode, the entire storage space of the last level cache is configured as SPM, and in the third working mode, a part of the storage space of the last level cache is configured as a cache memory, and another part of the storage space is configured as SPM.
  • the above-mentioned memory partition may further include a mode configurator.
  • the above-mentioned mode configurator is configured to configure the working mode of the last-level cache in the above-mentioned cache system based on the user configuration information.
  • the developer can configure the working mode of the last-level cache through the mode configurator based on the user configuration information.
  • the entire storage space of the above-mentioned last-level cache can be configured as SPM.
  • the entire storage space of the last level cache can be configured as a cache memory.
  • part of the storage space of the last-level cache can be configured as cache memory, and part of the storage space can be configured as SPM to store AI operation parameters.
  • the above-mentioned memory system may be a global memory system.
  • it can be DRAM, SDRAM, etc.
  • the above-mentioned global memory system may be HBM.
  • the above-mentioned DMA controller is used to perform data transfer between different storage spaces in the above-mentioned memory partition.
  • the DMA controller may read data from the first storage space in the memory partition, and write the read data into the second storage space in the memory partition.
  • the above-mentioned first storage space is a memory system
  • the above-mentioned second storage space is an L2 cache.
  • the DMA controller may control data transfer between the memory system and the L2 cache in response to a data transfer instruction sent by the processing core.
  • the above-mentioned data movement instruction is specifically used to trigger data movement between storage spaces within the above-mentioned memory partitions.
  • the above-mentioned data transfer instruction can be constructed by the processing core of the chip and sent to the DMA controller, so that the DMA controller controls the completion of the data transfer.
  • when data movement needs to be performed between storage spaces within the memory partition, the processing core sends a data moving instruction to the DMA controller.
  • the DMA controller can control the data moving between storage spaces within the memory partition in response to the data moving instruction.
  • the above-mentioned processing core sends a data transfer instruction to the above-mentioned DMA controller
  • the above-mentioned DMA controller can control the data transfer between different storage spaces in the above-mentioned memory partition in response to the above-mentioned data transfer instruction.
  • the data to be moved is moved entirely within the above-mentioned memory partition, thereby releasing the memory-access bandwidth inside the chip, improving data-moving efficiency, and improving chip performance.
  • the chip may include multiple memory partitions, and in order to complete data migration in each memory partition, the processing core may send data moving instructions to the DMA controllers in the multiple memory partitions, so that each DMA controller can control the data movement within the memory partition where it is located.
  • the above processing core can send data moving instructions to the DMA controllers in the above 4 memory partitions respectively. After the DMA controllers in the above-mentioned four memory partitions receive the data moving instructions, they can control the data movement inside the memory partitions where they are located.
  • the above-mentioned chip when the above-mentioned chip includes multiple memory partitions, in order to facilitate the developer to write programs, the above-mentioned multiple memory partitions all use UMA.
  • the last level cache in the above multiple memory partitions and the memory system in the above multiple memory partitions can all use UMA.
  • UMA can be used for the last level cache in the above-mentioned multiple memory partitions.
  • the memory system in the above-mentioned multiple memory partitions may also employ UMA.
  • the effective addresses are the same between different last-level caches and the same between different memory systems. Therefore, when writing data to each last-level cache or each memory system, only one address needs to be specified, and there is no need to write data separately for multiple last-level caches or multiple memory systems, which improves both developers' programming efficiency and the efficiency of data storage.
  • the above-mentioned processing core is configured to broadcast a data moving instruction to at least one DMA controller in the above-mentioned at least one memory partition.
  • the processing core can broadcast and send a data movement instruction to the DMA controllers in the above-mentioned multiple memory partitions.
  • the chip includes 4 memory partitions, and the last-level cache in the above-mentioned 4 memory partitions (assuming the last-level cache is the L2 cache) and the memory system in the above-mentioned multiple memory partitions can all use UMA.
  • the processing core may broadcast and send a data moving instruction to the DMA controllers in the above-mentioned multiple memory partitions.
  • after the DMA controller in each of the above-mentioned 4 memory partitions receives the data moving instruction, it can read 2 megabytes of data from the storage location of the memory system indicated by the instruction, and move the 2 megabytes of data to the storage location of the L2 cache indicated by the instruction, thereby completing the data move.
  • since the processing core can broadcast the data moving instruction to the DMA controllers in the above four memory partitions to complete the data movement within each memory partition, the number of calls made by the processing core to the DMA controllers is reduced, thereby reducing the call overhead to the DMA controllers.
  • the present application also provides an electronic device, including the chip shown in any of the foregoing embodiments.
  • the electronic device may be a smart terminal such as a mobile phone, or other devices that have a camera and can perform image processing.
  • after the electronic device acquires a captured image, it can process the image, and the processing can use the chip of the embodiments of the present application to perform the computing tasks.
  • the use of the chip can assist in improving the processing efficiency of computing tasks, thereby improving the performance of electronic equipment.
  • one or more embodiments of the present application may be provided as a method, system or computer program product. Accordingly, one or more embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media having computer-usable program code embodied therein, including but not limited to disk storage, optical storage, and the like.
  • Embodiments of the subject matter and functional operations described in this application can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware including the structures disclosed in this application and their structural equivalents, or in a combination of one or more of them.
  • Embodiments of the subject matter described in this application may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, a data processing apparatus.
  • Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, which is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of these.
  • the processes and logic flows described in this application can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output.
  • the processes and logic flows described above can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, eg, an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).
  • Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing system.
  • a central processing system will receive instructions and data from read-only memory and/or random access memory.
  • the basic components of a computer include a central processing system for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to, one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, in order to receive data from them, transmit data to them, or both.
  • the computer does not have to have such a device.
  • the computer may be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive, to name a few.
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
  • the processor and memory may be supplemented by or incorporated in special purpose logic circuitry.


Abstract

The present application provides a chip, a data moving method, and an electronic device. The chip may include at least one processing core and at least one memory partition, where each memory partition includes a cache system, a memory system, and a direct memory access (DMA) controller. The DMA controller is connected to the cache system and the memory system respectively, and is used to perform data movement between different storage spaces within the memory partition.

Description

Chip, data moving method, and electronic device
Cross-reference statement
The present invention claims priority to Chinese patent application No. 202011458676.7, filed with the China Patent Office on December 10, 2020, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to computer technology, and in particular to a chip, a data moving method, and an electronic device.
Background
With the rapid development of computer technology, the computing power of various chips has steadily increased, and this increase in computing power demands higher data-moving efficiency.
In the related art, when data needs to be moved between a first storage space and a second storage space within a memory partition inside a chip, the processing core must first read the data out of the first storage space and store it inside the processing core. The processing core then reads the stored data out and writes it into the second storage space.
It can be seen that, in the related art, data movement between different storage spaces within a memory partition must pass through the processing core, so the data-moving efficiency is low.
Summary
In view of this, the present application discloses a chip, including:
at least one processing core and at least one memory partition; wherein each memory partition includes a cache system, a memory system, and a direct memory access (DMA) controller; the DMA controller is connected to the cache system and the memory system respectively, and is used to perform data movement between different storage spaces within the memory partition. In an illustrated embodiment, a first processing core of the at least one processing core is configured to send a data moving instruction to at least one first DMA controller, wherein the at least one first DMA controller is included in at least one first memory partition; the at least one first DMA controller is configured to perform, based on the data moving instruction, data movement between different storage spaces within the at least one first memory partition.
In an illustrated embodiment, the cache system includes multiple levels of caches; the DMA controller performing data movement between the storage space of the cache system and the storage space in the memory system includes the DMA controller performing data movement between the storage space of the last-level cache and the storage space in the memory system.
In an illustrated embodiment, the last-level cache supports three working modes, wherein in the first working mode, the entire storage space of the last-level cache is configured as cache memory; in the second working mode, the entire storage space of the last-level cache is configured as scratchpad memory (SPM); and in the third working mode, one part of the storage space of the last-level cache is configured as cache memory and the other part is configured as SPM.
In an illustrated embodiment, the memory partition further includes a mode configurator, which is configured to configure the working mode of the last-level cache based on user configuration information.
In an illustrated embodiment, the at least one processing core and the DMA controller access each other through a main network-on-chip; or the DMA controller, the cache system, and the memory system access each other through a sub network-on-chip.
In an illustrated embodiment, the DMA controller performing data movement between different storage spaces within the memory partition includes performing at least one of the following: data movement between different storage spaces of the cache system; data movement between different storage spaces within the memory system; and data movement between the storage space of the cache system and the storage space in the memory system.
In an illustrated embodiment, all or part of the different storage spaces in the memory partition adopt a unified memory architecture (UMA).
In an illustrated embodiment, the first processing core sending the data moving instruction to the at least one first DMA controller includes the first processing core broadcasting the data moving instruction to at least one second DMA controller, wherein the second DMA controller is included in a first memory partition in which all the different storage spaces adopt UMA.
In an illustrated embodiment, the data moving instruction includes: a data moving type, a data length, a source storage address, and a destination storage address.
In an illustrated embodiment, the data moving instruction includes a first field, a second field, a third field, and a fourth field; wherein the first field indicates the data moving type and the data length; the second field indicates the low address of the source storage address; the third field indicates the high address of the source storage address and the high address of the destination storage address; and the fourth field indicates the low address of the destination storage address.
In an illustrated embodiment, the DMA controller performing data movement between different storage spaces within the memory partition includes: reading data from a first storage space in the memory partition, and writing the read data into a second storage space in the memory partition.
In an illustrated embodiment, the memory system is a high bandwidth memory (HBM).
The present application further provides a data moving method, applied to a chip, wherein the chip includes at least one processing core and at least one memory partition, and each memory partition includes a cache system, a memory system, and a direct memory access (DMA) controller; the method includes: for each memory partition, performing data movement between different storage spaces within the memory partition through the DMA controller.
In an illustrated embodiment, performing data movement between different storage spaces within the memory partition through the DMA controller includes: sending, by a first processing core of the at least one processing core, a data moving instruction to at least one first DMA controller, wherein the at least one first DMA controller is included in at least one first memory partition; and performing, by the at least one first DMA controller based on the data moving instruction, data movement between different storage spaces within the at least one first memory partition.
In an illustrated embodiment, the cache system includes multiple levels of caches; performing data movement between different storage spaces within the memory partition through the DMA controller includes: performing, through the DMA controller, data movement between the storage space of the last-level cache and the storage space in the memory system.
In an illustrated embodiment, the last-level cache supports three working modes, wherein in the first working mode, the entire storage space of the last-level cache is configured as cache memory; in the second working mode, the entire storage space of the last-level cache is configured as SPM; and in the third working mode, one part of the storage space of the last-level cache is configured as cache memory and the other part is configured as SPM.
In an illustrated embodiment, the memory partition further includes a mode configurator; the method further includes: configuring the working mode of the last-level cache through the mode configurator based on user configuration information.
In an illustrated embodiment, the at least one processing core and the DMA controller access each other through a main network-on-chip; and/or the DMA controller, the cache system, and the memory system access each other through a sub network-on-chip.
In an illustrated embodiment, the data movement between different storage spaces within the memory partition includes at least one of the following: data movement between different storage spaces of the cache system; data movement between different storage spaces within the memory system; and data movement between the storage space of the cache system and the storage space in the memory system.
In an illustrated embodiment, all or part of the different storage spaces in the memory partition adopt a unified memory architecture (UMA).
In an illustrated embodiment, sending the data moving instruction to the at least one first DMA controller through the first processing core includes: broadcasting, through the first processing core, the data moving instruction to at least one second DMA controller, wherein the second DMA controller is included in a first memory partition in which all the different storage spaces adopt the unified memory architecture UMA.
In an illustrated embodiment, the data moving instruction includes: a data moving type, a data length, a source storage address, and a destination storage address.
In an illustrated embodiment, the data moving instruction includes a first field, a second field, a third field, and a fourth field; wherein the first field indicates the data moving type and the data length; the second field indicates the low address of the source storage address; the third field indicates the high address of the source storage address and the high address of the destination storage address; and the fourth field indicates the low address of the destination storage address.
In an illustrated embodiment, performing data movement between different storage spaces within the memory partition through the DMA controller includes: reading, through the DMA controller, data from a first storage space in the memory partition, and writing the read data into a second storage space in the memory partition.
In an illustrated embodiment, the memory system is a high bandwidth memory (HBM).
The present application further provides an electronic device, including the chip shown in any of the above embodiments.
It can be seen from the above technical solutions that, on the one hand, since the DMA controller is connected to the cache system and the memory system respectively and is used to perform data movement between different storage spaces within the memory partition, the data can be moved entirely inside the memory partition without preempting the memory-access bandwidth of the chip, thereby releasing the chip's internal memory-access bandwidth during the data movement, improving data-moving efficiency, and improving chip performance.
On the other hand, since the processing core sends the data moving instruction to the DMA controller, and the DMA controller can, in response to the data moving instruction, control the data movement between different storage spaces in the memory partition, the data to be moved can be moved entirely inside the memory partition, releasing the chip's internal memory-access bandwidth, improving data-moving efficiency, and improving chip performance.
In yet another aspect, since the chip improves the data-moving efficiency of the memory partitions and thus has higher performance, using the chip can help improve the processing efficiency of computing tasks and thereby improve the performance of the electronic device.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present application.
附图说明
为了更清楚地说明本申请一个或多个实施例或相关技术中的技术方案，下面将对实施例或相关技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请一个或多个实施例中记载的一些实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据这些附图获得其他的附图。
图1为一种AI芯片内部结构图;
图2为本申请示出的一种芯片的内部结构;
图3为本申请示出的一种芯片结构图;
图4为本申请示出的一种芯片结构图;
图5为本申请示出的一种数据搬移指令的示意图;
图6为本申请示出的一种数据搬移指令的示意图;
图7为本申请示出的一种数据搬移方法的方法流程图。
具体实施方式
下面将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的设备和方法的例子。
在本申请使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请和所附权利要求书中所使用的单数形式的“一种”、“上述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。还应当理解,本文中所使用的词语“如果”,取决于语境,可以被解释成为“在……时”或“当……时”或“响应于确定”。
随着计算机技术的快速发展,各类芯片的运算能力逐步提升。而芯片运算能力的提升,要求较高的数据搬移效率。
请参见图1,图1为一种AI芯片内部结构图。
如图1所示,AI芯片的处理内核与存储器分区连接;其中,上述存储器分区至少包括内存系统以及高速缓存系统。
在图1示出的AI芯片中,当内存系统中有部分数据需要搬移到高速缓存系统中时,处理内核通过读命令先将该部分数据从内存系统中读出,并存储在该处理内核内,然后,该处理内核通过写命令将该部分数据写入上述高速缓存系统中。
由此可见,在上述高速缓存系统与上述内存系统间进行数据搬移,需要至少占用两次访存带宽,这不仅使数据搬移延时很大,而且抢占了访存带宽,大大降低了芯片性能。本领域技术人员可以理解的是,上述高速缓存系统内部,以及该内存系统内部的数据搬移同样存在上述问题,在此不作详述。
有鉴于此，本申请提出一种芯片。该芯片通过在存储器分区中，加入与高速缓存系统以及内存系统分别连接的DMA(Direct Memory Access，直接存储器访问)控制器，以使上述DMA控制器可以执行上述存储器分区内部的不同存储空间之间的数据搬移指令，从而释放该芯片内部的访存带宽，提升数据搬移效率，提升芯片性能。
以下对该芯片的内部结构进行说明。
请参见图2,图2为本申请示出的一种芯片的内部结构。如图2所示,上述芯片包括:
至少一个处理内核21和至少一个存储器分区22。
其中,每个存储器分区22包括高速缓存系统221、内存系统222,以及DMA控制器223。
上述DMA控制器223,与上述高速缓存系统221以及上述内存系统222分别连接,用于进行上述存储器分区22内部的不同存储空间之间的数据搬移。
需要说明的是,当控制上述高速缓存系统221与上述内存系统222之间的数据搬移时,上述高速缓存系统221包括的最后一级高速缓存可以与上述DMA控制器223连接。当控制上述高速缓存系统221内部的数据搬移时,上述DMA控制器223可以与涉及的相应级别的高速缓存连接。在此不作特别限定。
在实际应用中,上述DMA控制器可以从上述存储器分区内的第一存储空间读取数据,并将读取到的数据写入上述存储器分区内的第二存储空间。
例如,上述第一存储空间为内存系统,上述第二存储空间为L2高速缓存。上述DMA控制器可以响应于上述处理内核发出的数据搬移指令,控制在上述内存系统与上述L2高速缓存之间的数据搬移。
在此需要说明的是,一个存储器分区可以包括一个或多个DMA控制器。例如,存储器分区包括一个DMA控制器,负责该存储器分区内所有存储空间之间的数据搬移。再例如,存储器分区包括多个DMA控制器,该多个DMA控制器中的每个DMA控制器可以负责存储器分区中的一对或多对存储空间之间的数据搬移。当存在多个存储器分区时,本申请不对这些DMA控制器的具体位置进行限定。例如DMA控制器可以分散位于各存储器分区中,也可以集中位于其中的一个存储器分区中。
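上述搬移过程，即DMA控制器从分区内第一存储空间读取数据并写入第二存储空间，可以用如下Python片段示意(其中以bytearray模拟存储空间，dma_move等命名仅为说明用途的假设，并非芯片的实际接口)：

```python
# 以 bytearray 模拟存储器分区内的两块存储空间；
# dma_move 示意 DMA 控制器的搬移行为：从源空间读出、写入目的空间，
# 搬移全程不经过处理内核
def dma_move(src_space, src_off, dst_space, dst_off, length):
    data = src_space[src_off:src_off + length]   # 从第一存储空间读取数据
    dst_space[dst_off:dst_off + length] = data   # 将读取到的数据写入第二存储空间

memory = bytearray(b"hello-dma-partition!")      # 内存系统(示意)
l2_cache = bytearray(16)                         # L2 高速缓存(示意)

dma_move(memory, 6, l2_cache, 0, 3)              # 在分区内部搬移 3 字节
print(bytes(l2_cache[:3]))                       # b'dma'
```

被搬移的数据不经过处理内核，这正是该结构能够释放芯片访存带宽的关键。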
上述芯片，具体可以是任意需要高访存带宽的芯片。在实际应用中，上述芯片可以是搭载了多通道DRAM(Dynamic Random Access Memory，动态随机存取存储器)的芯片。
例如，上述芯片可以是CPU、DSP、MCU等。在一实施例中，上述芯片可以执行人工智能算法。例如，上述芯片可以是AI神经网络芯片(例如，FPGA、TPU等)或GPU图形处理芯片。
上述处理内核,通常为芯片中的计算核心,用于执行代码运算,可以包括一个或多个处理单元。例如,上述处理内核通常可以依据开发人员制定的程序代码,在上述存储器分区中进行数据搬移。
在实际应用中,上述存储器分区内部的存储空间之间的数据搬移通常可以包括,上述存储器分区中的高速缓存系统内部数据的搬移,上述存储器分区中的内存系统内部数据的搬移,以及上述存储器分区中最后一级高速缓存与内存系统之间的数据搬移。
上述存储器分区,通常用于存储数据。
在实际应用中,通常芯片采用具有存储层次的存储器分区。其中,上述存储器分区可以包括具有一级或多级高速缓存的高速缓存系统以及内存系统。
例如，请继续参见图2，上述高速缓存系统221可以至少包括L1、L2和L3高速缓存。此时，处理内核21需要获取数据时，通常先访问L1高速缓存。如果该L1高速缓存中存储有上述处理内核21需要的数据，则上述处理内核21完成此次数据获取。如果该L1高速缓存中没有存储上述处理内核21需要的数据，上述处理内核21则继续访问上述L2高速缓存以获取需要的数据。以此类推。如果上述最后一级高速缓存即L3高速缓存中也没有存储处理内核21需要的数据，上述处理内核21则继续从上述内存系统222中获取数据。
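上述逐级查找的流程可以用如下Python片段示意(以字典模拟各级高速缓存与内存系统，load等命名仅为说明性假设)：

```python
# 逐级查找：先访问 L1，未命中则依次访问 L2、L3，各级均未命中时回退到内存系统
def load(addr, caches, memory):
    for level, cache in enumerate(caches, start=1):
        if addr in cache:                  # 缓存命中(CACHE HIT)
            return cache[addr], f"L{level}"
    return memory[addr], "memory"          # 各级均未命中，访问内存系统

l1, l2, l3 = {}, {0x100: "b"}, {}
memory = {0x100: "b", 0x200: "c"}

print(load(0x100, [l1, l2, l3], memory))   # ('b', 'L2')      在 L2 命中
print(load(0x200, [l1, l2, l3], memory))   # ('c', 'memory')  回退到内存系统
```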
在上述例子中不难发现，芯片的性能很大程度上取决于缓存命中率(CACHE HIT)。为了提升缓存命中率，目前芯片中提供了可以由开发人员直接管理的大容量高速缓存。
通常,当上述高速缓存系统包括多级高速缓存时,最后一级高速缓存可以作为上述大容量高速缓存,则上述DMA控制器用于进行上述最后一级高速缓存的存储空间与上述内存系统内的存储空间之间的数据搬移。
当高速缓存系统的至少一部分存储空间被配置为便笺存储器(Scratchpad Memory,SPM)时,会影响这部分存储空间的数据搬移效率。在一实施例中,为了提升数据搬移效率,最后一级高速缓存的至少一部分存储空间被配置为SPM。
此时,当进行数据搬移时,上述DMA控制器用于进行上述最后一级高速缓存中被配置为SPM的存储空间与上述内存系统之间的数据搬移。由于通过DMA控制器进行上述最后一级高速缓存中被配置为SPM的存储空间与上述内存系统之间的数据搬移,可以避免被搬移数据经过处理内核,从而释放带宽,缩短数据搬移路径,提升了数据搬移效率。
在一实施例中,为了灵活适用多种业务场景,上述高速缓存系统的最后一级高速缓存支持三种工作模式,其中,在第一工作模式中,上述最后一级高速缓存的全部存储空间被配置为高速缓存存储器,在第二工作模式中,上述最后一级高速缓存的全部存储空间被配置为SPM,在第三工作模式中,上述最后一级高速缓存的一部分存储空间被配置为高速缓存存储器,另一部分存储空间被配置为SPM。
通过这种方式，开发人员可以根据需求灵活配置上述最后一级高速缓存，从而提升上述芯片的适用性。
需要说明的是,为了可以实现动态配置最后一级高速缓存,在一实施例中,上述存储器分区还可以包括模式配置器。
上述模式配置器,用于基于用户配置信息,配置上述高速缓存系统中的最后一级高速缓存的工作模式。
在实际应用中,开发人员可以基于用户配置信息,通过上述模式配置器,配置上述最后一级高速缓存的工作模式。
例如,在多芯片级联分布式训练系统的场景中,由于芯片间的通信需要高容量、低延时,可以将上述最后一级高速缓存的全部存储空间配置为SPM。
再例如，在对性能要求不高的算法开发的场景中，由于无需开发人员直接管理最后一级高速缓存，可以将上述最后一级高速缓存的全部存储空间配置为高速缓存存储器。
再例如,在既需要数据传输效率,又注重数据复用率的场景中,可以将上述最后一级高速缓存的部分存储空间配置为高速缓存存储器,以及将部分存储空间配置为SPM,以存储AI运算参数。
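上述三种工作模式的配置逻辑可以用如下Python片段示意(configure_llc、spm_ratio等命名以及按比例划分的方式均为说明用途的假设，实际划分粒度由硬件决定)：

```python
# 模式配置器示意：依据用户配置信息划分最后一级高速缓存的用途
def configure_llc(total_bytes, mode, spm_ratio=0.5):
    if mode == "cache":                 # 第一工作模式：全部配置为高速缓存存储器
        return {"cache": total_bytes, "spm": 0}
    if mode == "spm":                   # 第二工作模式：全部配置为 SPM
        return {"cache": 0, "spm": total_bytes}
    if mode == "hybrid":                # 第三工作模式：一部分高速缓存、一部分 SPM
        spm = int(total_bytes * spm_ratio)
        return {"cache": total_bytes - spm, "spm": spm}
    raise ValueError("未知的工作模式")

print(configure_llc(8 << 20, "hybrid", spm_ratio=0.25))
# {'cache': 6291456, 'spm': 2097152}
```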
上述内存系统,可以是全局内存系统。例如,可以是DRAM(Dynamic Random Access Memory,动态随机存取存储器),SDRAM(synchronous dynamic random-access memory,同步动态随机存取存储器)等等。
在一实施例中,为了提升访存带宽,上述全局内存系统可以是高带宽存储器(High Bandwidth Memory,HBM)。
需要说明的是,上述芯片内部可以采用总线或NOC(network-on-chip,片上网络)架构,可以根据实际需求进行设定。在相关技术中,请参见图3,图3为本申请示出的一种芯片结构图。如图3所示,上述DMA控制器、至少一个处理内核、以及至少一个存储器分区通过总线连接。
此时,假设存储器分区内部的内存系统需要向L2高速缓存搬移数据时,一处理内核将向上述DMA控制器发送一条数据搬移指令,以使DMA控制器完成搬移数据。
但是不难发现,在上述芯片结构中,即便通过DMA控制器释放了芯片的处理内核的工作压力,但是上述数据在搬移过程中,仍需要先从内存系统流向处理内核,再流向L2高速缓存。由此可见,采用上述芯片结构仍然存在数据搬移抢占访存带宽,以及数据搬移效率低的问题。
为了解决上述问题，如图2所示，在本申请中上述DMA控制器内置于上述存储器分区中，以使DMA控制器可以控制上述数据在上述存储器分区内部完成搬移，而不会抢占上述芯片的访存带宽。
由上述技术方案可知，由于上述DMA控制器与上述高速缓存系统以及上述内存系统分别连接，并用于进行上述存储器分区内部的不同存储空间之间的数据搬移，因此可以控制上述数据在上述存储器分区内部完成搬移，而不会抢占上述芯片的访存带宽，进而在上述数据搬移过程中，释放该芯片内部的访存带宽，提升数据搬移效率，提升芯片性能。
在一实施例中,上述至少一个处理内核中的第一处理内核与至少一个第一DMA控制器连接;至少一个第一DMA控制器包括在至少一个第一存储器分区中,上述第一存储器分区可以为上述存储器分区的全部或部分。
上述第一处理内核用于向上述至少一个第一DMA控制器发送数据搬移指令。
上述至少一个第一DMA控制器，用于基于上述数据搬移指令，进行上述至少一个第一存储器分区内部的不同存储空间之间的数据搬移。
请继续参见图2,上述DMA控制器与上述第一处理内核连接。其中,上述连接方式可以是总线方式的连接。
在一实施例中,为了进一步提升芯片性能,上述DMA控制器,以及上述处理内核可以通过主片上网络(NOC,network-on-chip)互相访问。
上述主片上网络,可以是上述芯片内的主网络。当上述芯片包括多个处理内核,以及多个存储器分区时,上述多个处理内核,与上述多个存储器分区中的DMA控制器可以通过上述主片上网络互相访问。
请继续参见图2,上述DMA控制器,与上述高速缓存系统以及上述内存系统分别连接。其中,上述连接方式可以是总线方式的连接。
在一实施例中,为了进一步提升芯片性能,上述DMA控制器,上述高速缓存系统以及上述内存系统通过子片上网络互相访问。
上述子片上网络,可以是上述存储器分区内的子网络。当上述芯片包括多个存储器分区时,上述多个存储器分区均可以采用上述子片上网络,使各存储器分区中的DMA控制器、高速缓存系统以及内存系统可以通过上述子片上网络(NOC,network-on-chip)互相访问。
由于单个存储器分区(包括高速缓存系统及内存系统)的带宽以及容量有限，为了提升访存带宽以及芯片容量，在一实施例中，上述芯片通常可以包括多个存储器分区。这些存储器分区可以以并联的形式与处理内核连接。
请参见图4,图4为本申请示出的一种芯片结构图。如图4所示,上述芯片包括多个处理内核,以及多个存储器分区。需要说明的是,存储器分区中仅示意出高速缓存系统中的最后一级高速缓存,其他级别的高速缓存在图4中并未示出。
上述芯片中的多个处理内核,与多个存储器分区可以通过上述主片上网络互相访问。
采用上述方式,实现多存储器分区的并联,从而拓宽访存带宽以及芯片容量。
在上述情形中,即上述芯片包括多个存储器分区,为了方便开发人员编写程序,上述多个存储器分区均采用统一内存架构(UMA,Unified Memory Architecture)。
在实际应用中,上述多个存储器分区中的最后一级高速缓存可以采用UMA。上述多个存储器分区中的内存系统也可以采用UMA。
通过这种方式，对开发人员来讲，不同的最后一级高速缓存之间的有效地址相同，不同内存系统之间的有效地址也相同。因此，在向各最后一级高速缓存，或者各内存系统写数据时，只需输入一个地址即可，无需针对多个最后一级高速缓存或多个内存系统分别写数据，提升了开发人员编程效率，也提升了数据存储效率。
每个处理内核可以向一个或多个DMA控制器分别发送数据搬移指令。在一些实施例中，为了减小对DMA控制器的调用开销，上述处理内核可以向上述至少一个存储器分区中的至少一个DMA控制器广播数据搬移指令。
在实际应用中,当存储器分区内需要进行数据搬移时,处理内核可以向上述多个存储器分区中的DMA控制器广播发送数据搬移指令。
例如，假设芯片包括8个存储器分区。其中，上述8个存储器分区中有4个存储器分区的最后一级高速缓存(假设最后一级高速缓存为L2高速缓存)以及内存系统均采用UMA。
在上述情形下，如果需要从内存系统移动8M数据至L2高速缓存，实际上是需要在各存储器分区内完成1兆数据的搬移。此时，处理内核一方面可以向上述采用UMA的4个存储器分区中的DMA控制器广播发送数据搬移指令；另一方面，可以向未采用UMA的4个存储器分区中的DMA控制器分别发送数据搬移指令。
上述各DMA控制器在接收到数据搬移指令后,可以从内存系统的上述数据搬移指令指示的存储位置提取1兆数据,并将上述1兆数据搬移至L2高速缓存的上述数据搬移指令指示的存储位置中,从而完成数据搬移。
由于处理内核可以向采用了UMA的多个存储器分区中的DMA控制器广播发送数据搬移指令来完成各存储器分区内部的数据搬移,因此,减少了处理内核对DMA控制器的调用次数,从而减少了对DMA控制器的调用开销。
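上述“一次广播、各分区并行搬移”的调用方式可以用如下Python片段示意(DmaController、broadcast等命名均为说明性假设)：

```python
# 广播一条数据搬移指令给各采用 UMA 的存储器分区中的 DMA 控制器，
# 各控制器在各自分区内部完成同一地址段的搬移
class DmaController:
    def __init__(self, partition_id):
        self.partition_id = partition_id
        self.log = []

    def execute(self, instr):
        self.log.append(instr)             # 示意：在本分区内部完成搬移

def broadcast(instr, controllers):
    for ctrl in controllers:               # 处理内核一次调用即可驱动全部控制器
        ctrl.execute(instr)

ctrls = [DmaController(i) for i in range(4)]
broadcast({"type": "mem->llc", "length": 1 << 20}, ctrls)
print(sum(len(c.log) for c in ctrls))      # 4：每个分区各收到一次指令
```

由于各分区采用UMA，同一条指令中的地址在每个分区内均有效，故无需逐一构造并发送指令。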
在一实施例中,上述芯片包括的多个上述DMA控制器可以集中位于同一存储器分区中,并分别与各存储器分区中包括的内存系统与高速缓存系统一一对应。
此时,当需要通过该多个DMA控制器进行数据搬移时,可以通过向上述存储器分区中的多个DMA控制器广播发送数据搬移指令,从而完成各存储器分区中的不同存储空间之间的数据搬移。
以下介绍本申请对数据搬移指令的改进。在本申请中,为了进一步缩减对DMA控制器的调用开销,提出了一种全新格式的对DMA控制器的数据搬移指令。该数据搬移指令通过减少数据搬移指令字段数量,并合理的设置各字段指示的含义,从而缩减了数据搬移指令的长度,减少了对DMA控制器的调用开销。
在相关技术中,对DMA控制器的数据搬移指令包括6个字段,分别为数据搬移类型字段,数据长度字段,最后一级高速缓存低地址字段,最后一级高速缓存高地址字段,内存系统低地址字段,以及内存系统高地址字段。
由此可见,相关技术中的数据搬移指令比较冗长,当对DMA控制器进行调用时,需要对DMA控制器发送较长的数据搬移指令,从而增加对DMA控制器的调用开销。
为了解决这一问题,在一实施例中,上述数据搬移指令,至少可以包括数据搬移类型、数据长度,源存储地址,以及目的存储地址。
上述数据搬移类型,具体指示数据搬移方向。在一实施例中,上述数据搬移类型可以指示存储器分区中的数据流向。具体地,上述数据流向(数据搬移类型)可以包括以下四种中的任一:
上述存储器分区中的高速缓存系统内部数据的搬移,上述存储器分区中的内存系统内部数据的搬移,从上述存储器分区中最后一级高速缓存向内存系统的数据搬移,以及从上述存储器分区中内存系统向最后一级高速缓存的数据搬移。
在实际应用中,可以通过将上述四种数据流向与四种标识对应,并在实际调用DMA控制器时,将上述四种标识写入上述数据搬移类型,以使DMA控制器可以识别此次数据搬移的数据流向。
上述数据长度,具体指示需要传输的数据量大小。可以理解的是,数据量大小与存储空间具有对应关系,因此,如果知道该数据在存储空间中的起始位置,依据该数据的数据长度,可以得到该数据在存储空间中的终止位置。
上述源存储地址,具体是指待搬移数据当前存储位置的起始地址。例如,如果数据从内存系统搬移至最后一级高速缓存,则上述源存储地址为数据在上述内存系统中的起始位置。
上述目的存储地址,具体是指待搬移数据需要被搬移后的存储位置的起始地址。例如,如果数据从内存系统搬移至最后一级高速缓存,则上述目的存储地址为数据被搬移至上述最后一级高速缓存中的起始位置。
可以理解的是,当DMA控制器接收到数据搬移指令后,一方面,可以根据上述数据搬移指令中的源存储地址字段和数据长度确定源存储空间;另一方面,可以根据上述数据搬移指令中的目的存储地址字段和数据长度确定目的存储空间;再一方面,可以根据上述数据搬移指令中的数据搬移类型,将源存储空间的数据搬移至目的存储空间。
请参见图5,图5为本申请示出的一种数据搬移指令的示意图。如图5所示,上述数据搬移指令包括第一字段、第二字段、第三字段以及第四字段;
其中,上述第一字段为指示数据搬移类型和数据长度的字段;
上述第二字段为指示源存储地址的低地址的字段;
上述第三字段为指示源存储地址的高地址和目的存储地址的高地址的字段;
上述第四字段为指示目的存储地址的低地址的字段。
在此,需要说明的是,上述数据搬移指令中各字段的顺序,以及各字段中指示不同含义的数据位的位置可以根据实际情形进行调整,在此不作限定。
假设0000(二进制)指示数据在高速缓存系统内部搬移,0001(二进制)指示数据在内存系统内部搬移,0010(二进制)指示数据从内存系统搬移至最后一级高速缓存,0011(二进制)指示数据从最后一级高速缓存搬移至内存系统。
在上述情形下,假设从内存系统的低地址0x3EAB_0000(16进制),高地址0xAB_00(16进制),搬移2兆的数据至最后一级高速缓存的低地址0x3E5B_0000(16进制),高地址0xCD_00(16进制)。
此时，芯片的处理内核在构造对DMA控制器的数据搬移指令时，可以将0010写入第一字段的前4位，将2兆转换为二进制写入上述第一字段的后28位。然后上述处理内核可以将上述内存系统的低地址0x3EAB_0000转换为二进制写入上述第二字段，并将上述内存系统的高地址0xAB_00转换为二进制写入上述第三字段的后十六位。最后，上述处理内核可以将上述最后一级高速缓存的高地址0xCD_00写入上述第三字段的前十六位，并将上述最后一级高速缓存的低地址0x3E5B_0000转换为二进制写入上述第四字段。
当上述处理内核完成上述数据搬移指令的构造后,可以将该数据搬移指令广播发送至各DMA控制器,以使各DMA控制器响应于上述数据搬移指令,从上述内存系统的低地址0x3EAB_0000,高地址0xAB_00,搬移2兆的数据至上述最后一级高速缓存的低地址0x3E5B_0000,高地址0xCD_00。
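按照图5的字段划分，上例中数据搬移指令的构造过程可以用如下Python片段复现(pack_instruction为说明用途的假设函数名，位域划分依据上文描述)：

```python
# 按图5格式打包数据搬移指令：
# 第一字段 = 搬移类型(高4位) | 数据长度(低28位)
# 第二字段 = 源存储地址的低地址
# 第三字段 = 目的存储地址的高地址(高16位) | 源存储地址的高地址(低16位)
# 第四字段 = 目的存储地址的低地址
def pack_instruction(move_type, length, src_low, src_high, dst_low, dst_high):
    f1 = (move_type & 0xF) << 28 | (length & 0x0FFF_FFFF)
    f2 = src_low & 0xFFFF_FFFF
    f3 = (dst_high & 0xFFFF) << 16 | (src_high & 0xFFFF)
    f4 = dst_low & 0xFFFF_FFFF
    return f1, f2, f3, f4

# 上文示例：0010 表示从内存系统搬移至最后一级高速缓存，数据长度 2 兆字节
f1, f2, f3, f4 = pack_instruction(
    move_type=0b0010, length=2 << 20,
    src_low=0x3EAB_0000, src_high=0xAB00,
    dst_low=0x3E5B_0000, dst_high=0xCD00,
)
print(hex(f1), hex(f2), hex(f3), hex(f4))
# 0x20200000 0x3eab0000 0xcd00ab00 0x3e5b0000
```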
由上可知,由于上述数据搬移指令,至少可以包括数据搬移类型和数据长度字段,源存储地址字段,以及目的存储地址字段,因此,在对DMA控制器进行调用时,可以减少对DMA控制器的调用开销。
在一实施例中，可以对相关技术中示出的数据搬移指令中的6个字段进行合并，从而减少数据搬移指令包括的字段数量。
在实际应用中,由于数据搬移类型所需位数较少,占用一个字段(32位)有些浪费,因此可以将数据搬移类型与数据长度合并为一个字段。而由于最后一级高速缓存通常总容量较小(例如,几兆),因此,可以将最后一级高速缓存低地址字段和高地址字段合并为一个字段。
请参见图6,图6为本申请示出的一种数据搬移指令示意图。如图6所示,上述数据搬移指令至少包括第一字段、第二字段、第三字段以及第四字段;
其中,上述第一字段为指示数据搬移类型和数据长度的字段;
上述第二字段为指示最后一级高速缓存的存储地址的字段;
上述第三字段为指示内存系统的低地址字段;
上述第四字段为指示内存系统的高地址字段。
需要说明的是,一方面,上述数据搬移指令中各字段的顺序,以及各字段中指示不同含义的数据位的位置可以根据实际情形进行调整,在此不作限定。
上述第一字段指示的含义可参照前述实施例,在此不作详述。
上述第二字段指示最后一级高速缓存的存储空间的起始地址。当第一字段指示数据从最后一级高速缓存搬移至内存系统时,上述第二字段指示的存储地址为数据当前存储位置的起始位置。当第一字段指示数据从内存系统搬移至最后一级高速缓存时,上述第二字段指示的存储地址为数据被搬移后的存储位置的起始位置。
上述第三字段以及上述第四字段指示的含义可以参照前述实施例,在此不作详述。
由上可知，由于上述数据搬移指令只包括四个字段，因此，在对DMA控制器进行调用时，可以减少对DMA控制器的调用开销。
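图6格式中，第二字段究竟作为源地址还是目的地址，取决于第一字段指示的搬移方向。这一解析逻辑可以用如下Python片段示意(方向编码沿用上文假设的四种标识，decode为说明用途的假设函数名)：

```python
# 解析图6格式的数据搬移指令：
# 第一字段 = 搬移类型(高4位) | 数据长度(低28位)
# 第二字段 = 最后一级高速缓存的存储地址
# 第三字段 = 内存系统的低地址，第四字段 = 内存系统的高地址
MEM_TO_LLC, LLC_TO_MEM = 0b0010, 0b0011

def decode(f1, f2, f3, f4):
    move_type, length = f1 >> 28, f1 & 0x0FFF_FFFF
    mem_addr = (f4 << 32) | f3             # 拼接内存系统的高、低地址
    if move_type == MEM_TO_LLC:            # 内存 -> LLC：第二字段为目的地址
        return {"src": mem_addr, "dst": f2, "length": length}
    if move_type == LLC_TO_MEM:            # LLC -> 内存：第二字段为源地址
        return {"src": f2, "dst": mem_addr, "length": length}
    raise ValueError("其余搬移方向的解析方式为本示意未覆盖的假设")

instr = decode((MEM_TO_LLC << 28) | (2 << 20), 0x3E5B_0000, 0x3EAB_0000, 0xAB00)
print(hex(instr["dst"]), hex(instr["length"]))   # 0x3e5b0000 0x200000
```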
相应的，本申请还提出一种数据搬移方法，应用于芯片。该方法通过由处理内核向存储器分区内置的DMA控制器下发数据搬移指令，以使上述DMA控制器可以响应于上述处理内核发出的数据搬移指令，使需要搬移的数据可以在上述存储器分区内部完成搬移，从而释放该芯片内部的访存带宽，提升数据搬移效率，提升芯片性能。
请参见图7,图7为本申请示出的一种数据搬移方法的方法流程图,应用于芯片。如图7所示,上述方法可以包括:
S702,上述处理内核向上述DMA控制器发送数据搬移指令。
S704,上述DMA控制器基于上述数据搬移指令,进行上述存储器分区内部的不同存储空间之间的数据搬移。
上述芯片,可以是具有上述任一实施例示出的芯片结构的芯片。在一实施例中,上述芯片可以采用如图2示出的芯片结构。如图2所示,上述芯片包括至少一个处理内核;至少一个存储器分区。其中,上述存储器分区包括高速缓存系统、内存系统和DMA控制器。其中,上述DMA控制器与上述高速缓存系统,以及内存系统分别连接。
需要说明的是,在实际应用中,上述存储器分区中可以包括具有一级或多级高速缓存的高速缓存系统、至少一内存系统,以及一个或多个DMA控制器,在此不作特别限定。
在一实施例中,上述芯片可以执行人工智能算法。例如,上述芯片可以是AI神经网络芯片或GPU图形处理芯片。
上述处理内核,通常为芯片中的计算核心,用于执行代码运算。例如,上述处理内核通常可以依据开发人员制定的程序代码,在上述存储器分区中进行数据搬移。
在实际应用中,上述存储器分区内部的存储空间之间的数据搬移通常可以包括,上述存储器分区中的高速缓存系统内部数据的搬移,上述存储器分区中的内存系统内部数据的搬移,以及上述存储器分区中最后一级高速缓存与内存系统之间的数据搬移。
上述存储器分区,通常用于存储数据。
在实际应用中,通常芯片采用具有存储层次的存储器分区。其中,上述存储器分区可以包括具有一级或多级高速缓存的高速缓存系统以及内存系统。
例如，请参见图2，上述高速缓存系统可以至少包括L1、L2和L3高速缓存。此时，处理内核需要获取数据时，通常先访问L1高速缓存。如果该L1高速缓存中存储有上述处理内核需要的数据，则上述处理内核完成此次数据获取。如果该L1高速缓存中没有存储上述处理内核需要的数据，上述处理内核则继续访问上述L2高速缓存以获取需要的数据。以此类推。如果上述最后一级高速缓存即L3高速缓存中也没有存储处理内核需要的数据，上述处理内核则继续从上述内存系统中获取数据。
在上述例子中不难发现，芯片的性能很大程度上取决于缓存命中率(CACHE HIT)。为了提升缓存命中率，目前芯片中提供了可以由开发人员直接管理的大容量高速缓存。
通常,当上述高速缓存系统包括多级高速缓存时,最后一级高速缓存可以作为上述大容量高速缓存,则上述DMA控制器用于进行上述最后一级高速缓存的存储空间与上述内存系统内的存储空间之间的数据搬移。
当高速缓存系统的至少一部分存储空间被配置为SPM时,会影响这部分存储空间的数据搬移效率。在一实施例中,为了提升数据搬移效率,最后一级高速缓存的至少一部分存储空间被配置为SPM。
此时,当进行数据搬移时,上述DMA控制器用于进行上述最后一级高速缓存中被配置为SPM的存储空间与上述内存系统之间的数据搬移。由于通过DMA控制器进行上述最后一级高速缓存中被配置为SPM的存储空间与上述内存系统之间的数据搬移,可以避免被搬移数据经过处理内核,从而释放带宽,缩短数据搬移路径,提升了数据搬移效率。
在一实施例中,为了灵活适用多种业务场景,上述高速缓存系统的最后一级高速缓存支持三种工作模式,其中,在第一工作模式中,上述最后一级高速缓存的全部存储空间被配置为高速缓存存储器,在第二工作模式中,上述最后一级高速缓存的全部存储空间被配置为SPM,在第三工作模式中,上述最后一级高速缓存的一部分存储空间被配置为高速缓存存储器,另一部分存储空间被配置为SPM。
通过这种方式,开发人员可以根据需求灵活配置上述最后一级高速缓存,从而提升上述芯片的适用性。
需要说明的是,为了可以实现动态配置最后一级高速缓存,在一实施例中,上述存储器分区还可以包括模式配置器。
上述模式配置器,用于基于用户配置信息,配置上述高速缓存系统中的最后一级高速缓存的工作模式。
在实际应用中,开发人员可以基于用户配置信息,通过上述模式配置器,配置上述最后一级高速缓存的工作模式。
例如,在多芯片级联分布式训练系统的场景中,由于芯片间的通信需要高容量、低延时,可以将上述最后一级高速缓存的全部存储空间配置为SPM。
再例如，在对性能要求不高的算法开发的场景中，由于无需开发人员直接管理最后一级高速缓存，可以将上述最后一级高速缓存的全部存储空间配置为高速缓存存储器。
再例如,在既需要数据传输效率,又注重数据复用率的场景中,可以将上述最后一级高速缓存的部分存储空间配置为高速缓存存储器,以及将部分存储空间配置为SPM,以存储AI运算参数。
上述内存系统,可以是全局内存系统。例如,可以是DRAM,SDRAM等等。
在一实施例中,为了提升访存带宽,上述全局内存系统可以是HBM。
上述DMA控制器,用于进行上述存储器分区内部的不同存储空间之间的数据搬移。
在实际应用中,上述DMA控制器可以从上述存储器分区内的第一存储空间读取数据,并将读取到的数据写入上述存储器分区内的第二存储空间。
例如,上述第一存储空间为内存系统,上述第二存储空间为L2高速缓存。上述DMA控制器可以响应于上述处理内核发出的数据搬移指令,控制在上述内存系统与上述L2高速缓存之间的数据搬移。
上述数据搬移指令,具体用于触发上述存储器分区内部的存储空间之间的数据搬移。
在本申请中,上述数据搬移指令可以由芯片的处理内核构造并发送至DMA控制器,以使DMA控制器控制完成数据搬移。
当上述存储器分区内部的存储空间之间需要进行数据搬移时,上述处理内核向上述DMA控制器发送数据搬移指令。
上述DMA控制器在接收到上述数据搬移指令后,可以响应于上述数据搬移指令,控制上述存储器分区内部的存储空间之间的数据搬移。
由上述技术方案可知，由于上述处理内核向上述DMA控制器发送数据搬移指令，上述DMA控制器可以响应于上述数据搬移指令，控制上述存储器分区中不同的存储空间之间的数据搬移，因此，可以使需要搬移的数据在上述存储器分区内部完成搬移，从而释放该芯片内部的访存带宽，提升数据搬移效率，提升芯片性能。
在一实施例中,上述芯片可能包括多个存储器分区,为了在各存储器分区内完成数据迁移,上述处理内核可以向上述多个存储器分区中的DMA控制器分别发送数据搬移指令,以使各DMA控制器可以控制自身所处的存储器分区内部的数据搬移。
例如,假设芯片包括4个存储器分区。假设有数据需要从内存系统移动至最后一级高速缓存,由于芯片中存在4个存储器分区,因此,上述处理内核可以向上述4个存储器分区中的DMA控制器分别发送数据搬移指令。当上述4个存储器分区中的DMA控制器接收到数据搬移指令后,可以控制自身所处的存储器分区内部的数据搬移。
在一实施例中,当上述芯片包括多个存储器分区时,为了方便开发人员编写程序,上述多个存储器分区均采用UMA。
为了方便开发人员编写程序,上述多个存储器分区中的最后一级高速缓存,以及上述多个存储器分区中的内存系统可以均采用UMA。
在实际应用中,上述多个存储器分区中的最后一级高速缓存可以采用UMA。上述多个存储器分区中的内存系统也可以采用UMA。
通过这种方式，对开发人员来讲，不同的最后一级高速缓存之间的有效地址相同，不同内存系统之间的有效地址也相同。因此，在向各最后一级高速缓存，或者各内存系统写数据时，只需输入一个地址即可，无需针对多个最后一级高速缓存或多个内存系统分别写数据，提升了开发人员编程效率，也提升了数据存储效率。
为了减小对DMA控制器的调用开销,上述处理内核,用于向上述至少一个存储器分区中的至少一个DMA控制器广播数据搬移指令。
在实际应用中,当存储器分区内需要进行数据搬移时,处理内核可以向上述多个存储器分区中的DMA控制器广播发送数据搬移指令。
例如,假设芯片包括4个存储器分区,并且上述4个存储器分区中的最后一级高速缓存(假设,最后一级高速缓存为L2高速缓存),以及上述多个存储器分区中的内存系统可以均采用UMA。
在上述情形下,如果需要从内存系统移动8M数据至L2高速缓存时,实际上是需要在各存储器分区内完成2兆数据的搬移。此时,处理内核可以向上述多个存储器分区中的DMA控制器广播发送数据搬移指令。
上述4个存储器分区中的DMA控制器在接收到数据搬移指令后,可以从内存系统的上述数据搬移指令指示的存储位置提取2兆数据,并将上述2兆数据搬移至L2高速缓存的上述数据搬移指令指示的存储位置中,从而完成数据搬移。
由于处理内核可以向上述4个存储器分区中的DMA控制器广播发送数据搬移指令来完成各存储器分区内部的数据搬移,因此,减少了处理内核对DMA控制器的调用次数,从而减少了对DMA控制器的调用开销。
以下介绍本申请对数据搬移指令的改进。在本申请中,为了进一步缩减对DMA控制器的调用开销,提出了一种全新格式的对DMA控制器的数据搬移指令。该数据搬移指令通过减少数据搬移指令字段数量,并合理的设置各字段指示的含义,从而缩减了数据搬移指令的长度,减少了对DMA控制器的调用开销。
在相关技术中,对DMA控制器的数据搬移指令包括6个字段,分别为数据搬移类型字段,数据长度字段,最后一级高速缓存低地址字段,最后一级高速缓存高地址字段,内存系统低地址字段,以及内存系统高地址字段。
由此可见,相关技术中的数据搬移指令比较冗长,当对DMA控制器进行调用时,需要对DMA控制器发送较长的数据搬移指令,从而增加对DMA控制器的调用开销。
为了解决这一问题,在一实施例中,上述数据搬移指令,至少可以包括数据搬移类型、数据长度,源存储地址,以及目的存储地址。
上述数据搬移类型,具体指示数据搬移方向。在一实施例中,上述数据搬移类型可以指示存储器分区中的数据流向。具体地,上述数据流向(数据搬移类型)可以包括以下四种中的任一:
上述存储器分区中的高速缓存系统内部数据的搬移,上述存储器分区中的内存系统内部数据的搬移,从上述存储器分区中最后一级高速缓存向内存系统的数据搬移,以及从上述存储器分区中内存系统向最后一级高速缓存的数据搬移。
在实际应用中，可以通过将上述四种数据流向与四种标识对应，并在实际调用DMA控制器时，将上述四种标识写入上述数据搬移类型，以使DMA控制器可以识别此次数据搬移的数据流向。
上述数据长度,具体指示需要传输的数据量大小。可以理解的是,数据量大小与存储空间具有对应关系,因此,如果知道该数据在存储空间中的起始位置,依据该数据的数据长度,可以得到该数据在存储空间中的终止位置。
上述源存储地址,具体是指待搬移数据当前存储位置的起始地址。例如,如果数据从内存系统搬移至最后一级高速缓存,则上述源存储地址为数据在上述内存系统中的起始位置。
上述目的存储地址,具体是指待搬移数据需要被搬移后的存储位置的起始地址。例如,如果数据从内存系统搬移至最后一级高速缓存,则上述目的存储地址为数据被搬移至上述最后一级高速缓存中的起始位置。
可以理解的是,当DMA控制器接收到数据搬移指令后,一方面,可以根据上述数据搬移指令中的源存储地址字段和数据长度确定源存储空间;另一方面,可以根据上述数据搬移指令中的目的存储地址字段和数据长度确定目的存储空间;再一方面,可以根据上述数据搬移指令中的数据搬移类型,将源存储空间的数据搬移至目的存储空间。
请参见图5,图5为本申请示出的一种数据搬移指令的示意图。如图5所示,上述数据搬移指令包括第一字段、第二字段、第三字段以及第四字段;
其中,上述第一字段为指示数据搬移类型和数据长度的字段;
上述第二字段为指示源存储地址的低地址的字段;
上述第三字段为指示源存储地址的高地址和目的存储地址的高地址的字段;
上述第四字段为指示目的存储地址的低地址的字段。
在此,需要说明的是,上述数据搬移指令中各字段的顺序,以及各字段中指示不同含义的数据位的位置可以根据实际情形进行调整,在此不作限定。
假设0000(二进制)指示数据在高速缓存系统内部搬移,0001(二进制)指示数据在内存系统内部搬移,0010(二进制)指示数据从内存系统搬移至最后一级高速缓存,0011(二进制)指示数据从最后一级高速缓存搬移至内存系统。
在上述情形下,假设从内存系统的低地址0x3EAB_0000(16进制),高地址0xAB_00(16进制),搬移2兆的数据至最后一级高速缓存的低地址0x3E5B_0000(16进制),高地址0xCD_00(16进制)。
此时，芯片的处理内核在构造对DMA控制器的数据搬移指令时，可以将0010写入第一字段的前4位，将2兆转换为二进制写入上述第一字段的后28位。然后上述处理内核可以将上述内存系统的低地址0x3EAB_0000转换为二进制写入上述第二字段，并将上述内存系统的高地址0xAB_00转换为二进制写入上述第三字段的后十六位。最后，上述处理内核可以将上述最后一级高速缓存的高地址0xCD_00写入上述第三字段的前十六位，并将上述最后一级高速缓存的低地址0x3E5B_0000转换为二进制写入上述第四字段。
当上述处理内核完成上述数据搬移指令的构造后，可以将该数据搬移指令广播发送至各DMA控制器，以使各DMA控制器响应于上述数据搬移指令，从上述内存系统的低地址0x3EAB_0000，高地址0xAB_00，搬移2兆的数据至上述最后一级高速缓存的低地址0x3E5B_0000，高地址0xCD_00。
由上可知,由于上述数据搬移指令,至少可以包括数据搬移类型和数据长度字段,源存储地址字段,以及目的存储地址字段,因此,在对DMA控制器进行调用时,可以减少对DMA控制器的调用开销。
在一实施例中，可以对相关技术中示出的数据搬移指令中的6个字段进行合并，从而减少数据搬移指令包括的字段数量。
在实际应用中,由于数据搬移类型所需位数较少,占用一个字段(32位)有些浪费,因此可以将数据搬移类型与数据长度合并为一个字段。而由于最后一级高速缓存通常总容量较小(例如,几兆),因此,可以将最后一级高速缓存低地址和高地址字段合并为一个字段。
请参见图6,图6为本申请示出的一种数据搬移指令示意图。如图6所示,上述数据搬移指令至少包括第一字段、第二字段、第三字段以及第四字段;
其中,上述第一字段为指示数据搬移类型和数据长度的字段;
上述第二字段为指示最后一级高速缓存的存储地址的字段;
上述第三字段为指示内存系统的低地址字段;
上述第四字段为指示内存系统的高地址字段。
需要说明的是,一方面,上述数据搬移指令中各字段的顺序,以及各字段中指示不同含义的数据位的位置可以根据实际情形进行调整,在此不作限定。
上述第一字段指示的含义可参照前述实施例,在此不作详述。
上述第二字段指示最后一级高速缓存的存储空间的起始地址。当第一字段指示数据从最后一级高速缓存搬移至内存系统时,上述第二字段指示的存储地址为数据当前存储位置的起始位置。当第一字段指示数据从内存系统搬移至最后一级高速缓存时,上述第二字段指示的存储地址为数据被搬移后的存储位置的起始位置。
上述第三字段以及上述第四字段指示的含义可以参照前述实施例,在此不作详述。
由上可知，由于上述数据搬移指令只包括四个字段，因此，在对DMA控制器进行调用时，可以减少对DMA控制器的调用开销。
本申请还提出一种电子设备,包括上述任一实施例示出的芯片。
例如,该电子设备可以是手机等智能终端,或者是具有摄像头并可以进行图像处理的其他设备。示例性的,当该电子设备获取到采集的图像时,可以对图像进行处理,处理过程就可以采用本申请实施例的芯片来执行计算任务。
由于上述芯片可以提升存储器分区的数据搬移效率，具有更高的性能，因此，使用该芯片可以辅助提高计算任务的处理效率，从而提升电子设备性能。
本领域技术人员应明白,本申请一个或多个实施例可提供为方法、系统或计算机程序产品。因此,本申请一个或多个实施例可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。而且,本申请一个或多个实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、光学存储器等)上实施的计算机程序产品的形式。
本申请中记载的“和/或”表示至少具有两者中的其中一个,例如,“A和/或B”包括三种方案:A、B、以及“A和B”。
本申请中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于数据处理设备实施例而言,由于其基本相似于方法实施例,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
上述对本申请特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的行为或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。
本申请中描述的主题及功能操作的实施例可以在以下中实现:数字电子电路、有形体现的计算机软件或固件、包括本申请中公开的结构及其结构性等同物的计算机硬件、或者它们中的一个或多个的组合。本申请中描述的主题的实施例可以实现为一个或多个计算机程序,即编码在有形非暂时性程序载体上以被数据处理装置执行或控制数据处理装置的操作的计算机程序指令中的一个或多个模块。可替代地或附加地,程序指令可以被编码在人工生成的传播信号上,例如机器生成的电、光或电磁信号,该信号被生成以将信息编码并传输到合适的接收机装置以由数据处理装置执行。计算机存储介质可以是机器可读存储设备、机器可读存储基板、随机或串行存取存储器设备、或它们中的一个或多个的组合。
本申请中描述的处理及逻辑流程可以由执行一个或多个计算机程序的一个或多个可编程计算机执行,以通过根据输入数据进行操作并生成输出来执行相应的功能。上述处理及逻辑流程还可以由专用逻辑电路—例如FPGA(现场可编程门阵列)或ASIC(专用集成电路)来执行,并且装置也可以实现为专用逻辑电路。
适合用于执行计算机程序的计算机包括,例如通用和/或专用微处理器,或任何其他类型的中央处理系统。通常,中央处理系统将从只读存储器和/或随机存取存储器接收指令和数据。计算机的基本组件包括用于实施或执行指令的中央处理系统以及用于存储指令和数据的一个或多个存储器设备。通常,计算机还将包括用于存储数据的一个或多个大容量存储设备,例如磁盘、磁光盘或光盘等,或者计算机将可操作地与此大容量存储设备耦接以从其接收数据或向其传送数据,抑或两种情况兼而有之。然而,计算机不是必须具有这样的设备。此外,计算机可以嵌入在另一设备中,例如移动电话、个人数字助理(PDA)、移动音频或视频播放器、游戏操纵台、全球定位系统(GPS)接收 机、或例如通用串行总线(USB)闪存驱动器的便携式存储设备,仅举几例。
适合于存储计算机程序指令和数据的计算机可读介质包括所有形式的非易失性存储器、媒介和存储器设备，例如包括半导体存储器设备(例如EPROM、EEPROM和闪存设备)、磁盘(例如内部硬盘或可移动盘)、磁光盘以及CD-ROM和DVD-ROM盘。处理器和存储器可由专用逻辑电路补充或并入专用逻辑电路中。
虽然本申请包含许多具体实施细节，但是这些不应被解释为限制任何公开的范围或所要求保护的范围，而是主要用于描述特定公开的具体实施例的特征。本申请在多个实施例中描述的某些特征也可以在单个实施例中被组合实施。另一方面，在单个实施例中描述的各种特征也可以在多个实施例中分开实施或以任何合适的子组合来实施。此外，虽然特征可以如上所述在某些组合中起作用并且甚至最初如此要求保护，但是来自所要求保护的组合中的一个或多个特征在一些情况下可以从该组合中去除，并且所要求保护的组合可以指向子组合或子组合的变型。
类似地,虽然在附图中以特定顺序描绘了操作,但是这不应被理解为要求这些操作以所示的特定顺序执行或顺次执行、或者要求所有例示的操作被执行,以实现期望的结果。在某些情况下,多任务和并行处理可能是有利的。此外,上述实施例中的各种系统模块和组件的分离不应被理解为在所有实施例中均需要这样的分离,并且应当理解,所描述的程序组件和系统通常可以一起集成在单个软件产品中,或者封装成多个软件产品。
由此,主题的特定实施例已被描述。其他实施例在所附权利要求书的范围以内。在某些情况下,权利要求书中记载的动作可以以不同的顺序执行并且仍实现期望的结果。此外,附图中描绘的处理并非必需所示的特定顺序或顺次顺序,以实现期望的结果。在某些实现中,多任务和并行处理可能是有利的。
以上所述仅为本申请一个或多个实施例的较佳实施例而已，并不用以限制本申请一个或多个实施例，凡在本申请一个或多个实施例的精神和原则之内，所做的任何修改、等同替换、改进等，均应包含在本申请一个或多个实施例保护的范围之内。

Claims (20)

  1. 一种芯片,包括:
    至少一个处理内核和至少一个存储器分区;
    其中,对于每个存储器分区:
    所述存储器分区包括高速缓存系统、内存系统,以及直接存储器访问DMA控制器;
    所述DMA控制器,与所述高速缓存系统以及所述内存系统分别连接,用于进行所述存储器分区内部的不同存储空间之间的数据搬移。
  2. 根据权利要求1所述的芯片,其特征在于,所述DMA控制器用于进行所述存储器分区内部的不同存储空间之间的数据搬移包括用于进行下列中的至少一种:
    所述高速缓存系统的不同存储空间之间的数据搬移;
    所述内存系统内的不同存储空间之间的数据搬移;
    所述高速缓存系统的存储空间与所述内存系统内的存储空间之间的数据搬移。
  3. 根据权利要求2所述的芯片,其特征在于,所述高速缓存系统包括多级高速缓存;
    所述DMA控制器用于进行所述高速缓存系统的存储空间与所述内存系统内的存储空间之间的数据搬移包括所述DMA控制器用于进行最后一级高速缓存的存储空间与所述内存系统内的存储空间之间的数据搬移。
  4. 根据权利要求3中所述的芯片,其特征在于,所述最后一级高速缓存支持三种工作模式,其中,
    在第一工作模式中,所述最后一级高速缓存的全部存储空间被配置为高速缓存存储器,
    在第二工作模式中,所述最后一级高速缓存的全部存储空间被配置为便笺存储器SPM,
    在第三工作模式中,所述最后一级高速缓存的一部分存储空间被配置为高速缓存存储器,另一部分存储空间被配置为SPM。
  5. 根据权利要求4所述的芯片,其特征在于,所述存储器分区还包括模式配置器,所述模式配置器用于基于用户配置信息,配置所述最后一级高速缓存的工作模式。
  6. 根据权利要求1至5中任一项所述的芯片,其特征在于,所述至少一个处理内核与所述DMA控制器通过主片上网络互相访问;或
    所述DMA控制器、所述高速缓存系统以及所述内存系统之间通过子片上网络互相访问。
  7. 根据权利要求1-6任一所述的芯片,其特征在于,所述存储器分区中的不同存储空间全部或部分采用统一内存架构UMA。
  8. 根据权利要求1至7任一所述的芯片,其特征在于,
    所述至少一个处理内核中的第一处理内核用于向至少一个第一DMA控制器发送数据搬移指令,其中,所述至少一个第一DMA控制器包括在至少一个第一存储器分区中;
    所述至少一个第一DMA控制器,用于基于所述数据搬移指令,进行所述至少一个第一存储器分区内部的不同存储空间之间的数据搬移。
  9. 根据权利要求8所述的芯片,其特征在于,所述第一处理内核用于向所述至少一个第一DMA控制器发送数据搬移指令包括,所述第一处理内核用于向至少一个第二DMA控制器广播数据搬移指令,其中所述第二DMA控制器包括在所述不同存储空间全部采用UMA的第一存储器分区中。
  10. 根据权利要求8或9所述的芯片,其特征在于,所述数据搬移指令包括:数据搬移类型、数据长度、源存储地址、以及目的存储地址。
  11. 根据权利要求10所述的芯片,其特征在于,所述数据搬移指令包括第一字段、第二字段、第三字段以及第四字段;
    其中,所述第一字段用于指示所述数据搬移类型和所述数据长度;
    所述第二字段用于指示所述源存储地址的低地址;
    所述第三字段用于指示所述源存储地址的高地址以及所述目的存储地址的高地址;
    所述第四字段用于指示所述目的存储地址的低地址。
  12. 根据权利要求1-11任一所述的芯片,其特征在于,所述DMA控制器用于进行所述存储器分区内部的不同存储空间之间的数据搬移包括用于:
    从所述存储器分区内的第一存储空间读取数据,并将读取到的数据写入所述存储器分区内的第二存储空间。
  13. 根据权利要求1-12任一所述的芯片,其特征在于,所述内存系统为高带宽存储器HBM。
  14. 一种数据搬移方法,应用于芯片,其中所述芯片包括至少一个处理内核和至少一个存储器分区,每个存储器分区包括高速缓存系统、内存系统、以及直接存储器访问DMA控制器;
    所述方法包括:对于每个存储器分区,
    通过所述DMA控制器进行所述存储器分区内部的不同存储空间之间的数据搬移。
  15. 根据权利要求14所述的方法,其特征在于,所述高速缓存系统包括多级高速缓存;
    所述通过所述DMA控制器进行所述存储器分区内部的不同存储空间之间的数据搬移,包括:
    通过所述DMA控制器进行最后一级高速缓存的存储空间与所述内存系统内的存储空间之间的数据搬移。
  16. 根据权利要求15所述的方法,其特征在于,所述方法还包括:
    基于用户配置信息配置所述最后一级高速缓存的工作模式。
  17. 根据权利要求14至16任一所述的方法,其特征在于,所述通过所述DMA控制器进行所述存储器分区内部的不同存储空间之间的数据搬移,包括:
    通过所述至少一个处理内核中的第一处理内核向至少一个第一DMA控制器发送数据搬移指令,其中,所述至少一个第一DMA控制器包括在至少一个第一存储器分区中;
    所述至少一个第一DMA控制器,基于所述数据搬移指令,进行所述至少一个第一存储器分区内部的不同存储空间之间的数据搬移。
  18. 根据权利要求17所述的方法,其特征在于,所述通过所述第一处理内核向所述至少一个第一DMA控制器发送所述数据搬移指令,包括:
    通过所述第一处理内核向至少一个第二DMA控制器广播数据搬移指令,其中所述第二DMA控制器包括在所述不同存储空间全部采用统一内存架构UMA的第一存储器分区中。
  19. 根据权利要求14-18任一所述的方法,其特征在于,所述通过所述DMA控制器进行所述存储器分区内部的不同存储空间之间的数据搬移,包括:
    通过所述DMA控制器从所述存储器分区内的第一存储空间读取数据,并将读取到的数据写入所述存储器分区内的第二存储空间。
  20. 一种电子设备,包括:权利要求1至13任一所述的芯片。
PCT/CN2021/101547 2020-12-10 2021-06-22 芯片、数据搬移方法和电子设备 WO2022121278A1 (zh)
