WO2022111013A1 - Device, method and readable storage medium supporting multiple access modes - Google Patents

Device, method and readable storage medium supporting multiple access modes

Info

Publication number: WO2022111013A1
Authority: WO (WIPO, PCT)
Application number: PCT/CN2021/119945
Prior art keywords: data, write, registers, register, register array
Other languages: English (en), French (fr)
Inventors: 刘恩赫, 郝勇峥
Applicant: 安徽寒武纪信息科技有限公司 (Anhui Cambricon Information Technology Co., Ltd.)


Classifications

    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06F 9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30098: Register arrangements
    • G06F 9/3012: Organisation of register space, e.g. banked or distributed register file
    • G06F 9/30134: Register stacks; shift registers

Definitions

  • the present invention generally relates to the field of computers. More particularly, the present invention relates to an apparatus, method and readable storage medium supporting multiple access modes.
  • a neural network is an operation model composed of a large number of interconnected nodes (or neurons). Each node represents a specific output function, called an excitation function, and each connection between two nodes represents a weighted value for the signal passing through that connection, called the weight, which is equivalent to the memory of the artificial neural network.
  • the output of the network varies according to the connection method of the network, the weight value and the excitation function.
  • the network itself is usually an approximation of a certain algorithm or function in nature, and it may also be an expression of a logic strategy.
  • the core operation in the convolutional neural network is the convolution operation.
  • the convolution kernel slides over the image matrix to obtain various feature values, which requires considerable hardware resources. Since the multipliers and adders in an artificial intelligence processing chip are limited, in practice the convolution operation is cut into multiple small areas that are accumulated separately. Each accumulation result must be temporarily stored in the register file until the calculation of every area is complete and the accumulated results are integrated; such an operation requires a large number of register accesses.
  • current convolution operations require a variety of access modes, including sequential access and skip access. Sequential access reads or writes several results in sequence at one time; skip access reads or writes several data at intervals of several registers. These access modes are used alternately during neural network inference, and when the operation continues for a period of time, access to the register file becomes very complicated and inefficient.
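To make the register traffic concrete, the following minimal Python sketch (our own illustration, not taken from the patent; the function name and column-tiling scheme are assumptions) cuts a convolution into kernel-column tiles, so that every output element's partial sum is read, updated and written back once per tile, which is exactly the kind of repeated register access described above:

```python
import numpy as np

def tiled_conv_accumulate(image, kernel, tile_w=2):
    # Hypothetical tiling: split the kernel into column strips and
    # accumulate each strip's contribution into a partial-sum buffer,
    # which stands in for the register file of the text above.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    partial = np.zeros((oh, ow))
    for x0 in range(0, kw, tile_w):            # one small area at a time
        x1 = min(x0 + tile_w, kw)
        for i in range(oh):
            for j in range(ow):
                acc = partial[i, j]            # read the stored partial sum
                acc += np.sum(image[i:i+kh, j+x0:j+x1] * kernel[:, x0:x1])
                partial[i, j] = acc            # write the updated sum back
    return partial

img = np.arange(25.0).reshape(5, 5)
k = np.ones((3, 3))
# Tiled accumulation matches the untiled result (tile_w=3 is one full tile).
assert np.allclose(tiled_conv_accumulate(img, k),
                   tiled_conv_accumulate(img, k, tile_w=3))
```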
  • the solution of the present invention provides a device, a method and a readable storage medium supporting multiple access modes.
  • in one aspect, the present invention discloses a register file supporting multiple access modes, including a register array having P groups and Q banks, the multiple access modes including a skip write mode. The skip write mode writes N data to the register array each time, with each datum stored at intervals of M registers, and the multiple access modes are served by (M+1)×N groups and R banks of the register array, where (M+1)×N is not greater than P and R is not greater than Q.
  • in another aspect, the present invention discloses a computing device supporting multiple access modes, the multiple access modes including multiple skip write modes, where the i-th skip write mode writes N_i data each time, with each datum stored at intervals of M_i registers.
  • the computing device includes a register array with P groups and Q banks, and the multiple access modes are served by (M_q+1)×N_q groups and R banks, where (M_q+1)×N_q is not greater than P and R is not greater than Q.
  • (M_q+1)×N_q is the maximum value of (M_i+1)×N_i over the plurality of skip write modes.
  • the present invention discloses an integrated circuit device including the aforementioned computing device, and also discloses a board including the aforementioned integrated circuit device.
  • in another aspect, the present invention provides a method for using a register array to support a skip write mode, the register array having P groups and Q banks, the skip write mode writing N data to the register array at a time with each datum stored at intervals of M registers. The method includes: setting (M+1)×N groups and R banks of the register array as a sub-array, wherein (M+1)×N is not greater than P and R is not greater than Q; synchronously enabling the registers of one of the banks of the sub-array; and selecting to input the N data into the corresponding N registers among the groups, respectively.
  • in another aspect, the present invention provides a method of utilizing a register array to support multiple access modes, the register array having P groups and Q banks, the multiple access modes including multiple skip write modes, where the i-th skip write mode writes N_i data each time, with each datum stored at intervals of M_i registers.
  • the method includes: setting (M_q+1)×N_q groups and R banks of the register array as a sub-array, wherein (M_q+1)×N_q is not greater than P and R is not greater than Q; synchronously enabling the registers of one of the banks of the sub-array; and selecting to input the N_i data into the corresponding N_i registers among the groups, respectively.
  • (M_q+1)×N_q is the maximum value of (M_i+1)×N_i over the plurality of skip write modes.
  • the present invention is a computer readable storage medium having stored thereon computer program code utilizing a register array to support an access mode, the computer program code executing the aforementioned method when executed by a processing device.
  • by appropriately planning the groups and banks of the register array, the invention reduces the read and write ports of the registers and, together with the access-mode selection logic, effectively reduces the power consumption of the register array.
  • FIG. 1 is a structural diagram illustrating a board according to an embodiment of the present invention.
  • FIG. 2 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram illustrating an internal structure of a computing device according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram showing when one processor core wants to write data to a processor core of another cluster;
  • FIG. 6 is a schematic diagram illustrating a register file in an NRAM according to an embodiment of the present invention;
  • FIG. 7 is a schematic diagram illustrating a 16×4 sub-array according to an embodiment of the present invention;
  • FIG. 8 is a schematic diagram illustrating accessing the sub-array in the ORDER1 mode according to an embodiment of the present invention;
  • FIG. 9 is a schematic diagram illustrating accessing the sub-array in the ORDER2 mode according to an embodiment of the present invention;
  • FIG. 10 is a schematic diagram illustrating accessing the sub-array in the ORDER4 mode according to an embodiment of the present invention;
  • FIG. 11 is a schematic diagram illustrating accessing the sub-array in the STRIDE 3_2 mode according to an embodiment of the present invention;
  • FIG. 12 is a schematic diagram illustrating accessing the sub-array in the STRIDE 3_4 mode according to an embodiment of the present invention;
  • FIG. 13 is a schematic diagram illustrating accessing the sub-array in the STRIDE 1_2 mode according to an embodiment of the present invention;
  • FIG. 14 is a schematic diagram illustrating access to subarrays in STRIDE 1_4 mode according to an embodiment of the present invention.
  • FIG. 15 is a flowchart illustrating a method for utilizing the aforementioned register array to support multiple access modes according to an embodiment of the present invention.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • FIG. 1 shows a schematic structural diagram of a board 10 according to an embodiment of the present invention.
  • the board 10 includes a chip 101, which is a system-on-chip (SoC) integrating one or more combined processing devices.
  • the combined processing device is an artificial intelligence computing unit used to support various deep learning and machine learning algorithms, meeting the intelligent processing requirements of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • a notable feature of cloud intelligence applications is the large amount of input data, which has high requirements on the storage capacity and computing capacity of the platform.
  • the board 10 in this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, on-chip storage and powerful computing power.
  • the chip 101 is connected to an external device 103 through an external interface device 102.
  • the external device 103 is, for example, a server, a computer, a camera, a monitor, a mouse, a keyboard, a network card or a Wi-Fi interface, and the like.
  • the data to be processed can be transmitted to the chip 101 by the external device 103 through the external interface device 102.
  • the calculation result of the chip 101 can be transmitted back to the external device 103 via the external interface device 102.
  • the external interface device 102 may have different interface forms, such as a PCIe interface and the like.
  • the board 10 also includes a storage device 104 for storing data, which includes one or more storage units 105.
  • the storage device 104 is connected to the control device 106 and the chip 101 through a bus and performs data transmission.
  • the control device 106 in the board 10 is configured to control the state of the chip 101.
  • the control device 106 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 2 is a block diagram showing a combined processing device in the chip 101 of this embodiment.
  • the combined processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a DRAM 204.
  • the computing device 201 is configured to perform operations specified by the user, and is mainly implemented as a single-core or multi-core intelligent processor to perform deep learning or machine learning calculations; it can interact with the processing device 203 through the interface device 202 to jointly complete a user-specified operation.
  • the interface device 202 is used to transmit data and control instructions between the computing device 201 and the processing device 203 .
  • the computing device 201 may obtain input data from the processing device 203 via the interface device 202, and write the input data into the storage device on-chip of the computing device 201.
  • the computing device 201 can obtain the control instruction from the processing device 203 via the interface device 202 and write it into the control cache on the computing device 201 .
  • the interface device 202 can also read the data in the storage device of the computing device 201 and transmit it to the processing device 203 .
  • the processing device 203 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 201, and the like.
  • the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processors.
  • these processors include but are not limited to digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc., and their number can be determined according to actual needs.
  • the computing device 201 of the present invention can be regarded as having a single-core structure or a homogeneous multi-core structure. However, when the computing device 201 and the processing device 203 are considered together, the two are considered to form a heterogeneous multi-core structure.
  • the DRAM 204 is used to store the data to be processed; it is a DDR memory, typically 16 GB or larger, that saves the data of the computing device 201 and/or the processing device 203.
  • FIG. 3 shows a schematic diagram of the internal structure of the computing device 201 .
  • the computing device 201 is used to process input data such as computer vision, speech, natural language, and data mining.
  • the computing device 201 in the figure is designed with a multi-core hierarchical structure.
  • the computing device 201 is a system-on-chip that includes multiple clusters, and each cluster further includes multiple processor cores; in other words, the computing device 201 is organized in a hierarchy of system-on-chip, clusters, and processor cores.
  • the computing device 201 includes an external storage controller 301 , a peripheral communication module 302 , an on-chip interconnect module 303 , a synchronization module 304 , and multiple clusters 305 .
  • the peripheral communication module 302 is used for receiving a control signal from the processing device 203 through the interface device 202 to start the computing device 201 to perform tasks.
  • the on-chip interconnection module 303 connects the external storage controller 301 , the peripheral communication module 302 and the multiple clusters 305 to transmit data and control signals among the modules.
  • the synchronization module 304 is a global synchronization barrier controller (GBC), which is used to coordinate the work progress of each cluster and ensure the synchronization of information.
  • the multiple clusters 305 are the computing cores of the computing device 201; 4 are exemplarily shown in the figure. With the development of hardware, the computing device 201 of the present invention may include 8, 16, 64, or even more clusters 305. The clusters 305 are used to efficiently execute deep learning algorithms.
  • each cluster 305 includes multiple processor cores (IPU cores) 306 and one memory core (MEM core) 307 .
  • the number of processor cores 306 shown in the figure is merely exemplary; the present invention does not limit the number of processor cores 306. The internal structure of a processor core is shown in FIG. 4.
  • Each processor core 306 includes three modules: a control module 41 , an arithmetic module 42 and a storage module 43 .
  • the control module 41 is used to coordinate and control the work of the arithmetic module 42 and the storage module 43 to complete the task of deep learning, and it includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412.
  • the instruction fetching unit 411 is used to acquire the instruction from the processing device 203 , and the instruction decoding unit 412 decodes the acquired instruction, and sends the decoding result to the operation module 42 and the storage module 43 as control information.
  • the operation module 42 includes a vector operation unit 421 and a matrix operation unit 422 .
  • the vector operation unit 421 is used to perform vector operations, and can support complex operations such as vector multiplication, addition, and nonlinear transformation;
  • the matrix operation unit 422 is responsible for the core calculation of the deep learning algorithm, that is, matrix multiplication and convolution.
  • the storage module 43 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access module (IODMA) 433, and a move direct memory access module (MVDMA) 434.
  • the NRAM 431 is used to store the feature maps calculated by the processor core 306 and the intermediate results of the calculation; the WRAM 432 is used to store the weights of the deep learning network; the IODMA 433 is used to control memory access between the NRAM 431/WRAM 432 and the DRAM 204; and the MVDMA 434 is used to control memory access between the NRAM 431/WRAM 432 and the SRAM 308.
  • the storage core 307 is mainly used for storage and communication, that is, to store the shared data or intermediate results between the processor cores 306, and to execute the communication between the cluster 305 and the DRAM 204, the communication between the clusters 305, and the processor Communication among the cores 306, etc.
  • the memory core 307 has scalar operation capability for performing scalar operations.
  • the storage core 307 includes a shared storage unit (SRAM) 308, a broadcast bus 309, a cluster direct memory access (CDMA) 310 and a global direct memory access (GDMA) 311.
  • the SRAM 308 assumes the role of a high-performance data transfer station.
  • the data multiplexed between different processor cores 306 in the same cluster 305 does not need to be obtained from the DRAM 204 by each processor core 306 separately, but is relayed among the processor cores 306 through the SRAM 308.
  • the storage core 307 only needs to quickly distribute the multiplexed data from the SRAM 308 to the multiple processor cores 306, so as to improve the communication efficiency between the cores and greatly reduce the on-chip and off-chip input/output accesses.
  • the broadcast bus 309, the CDMA 310 and the GDMA 311 are used to perform the communication between the processor cores 306, the communication between the clusters 305 and the data transmission between the clusters 305 and the DRAM 204, respectively. They will be explained separately below.
  • the broadcast bus 309 is used to complete high-speed communication among the processor cores 306 in the cluster 305.
  • the broadcast bus 309 in this embodiment supports inter-core communication methods including unicast, multicast and broadcast.
  • unicast refers to point-to-point (i.e., single processor core to single processor core) data transmission; multicast is a communication method that transmits one piece of data from the SRAM 308 to certain specific processor cores 306; and broadcast, which transmits a copy of the data from the SRAM 308 to all processor cores 306, is a special case of multicast.
  • the CDMA 310 is used to control access to the SRAM 308 between different clusters 305 within the same computing device 201.
  • Figure 5 shows a schematic diagram when one processor core wants to write data to the processor cores of another cluster to illustrate the working principle of CDMA 310.
  • the same computing device includes multiple clusters. For the convenience of description, only cluster 0 and cluster 1 are shown in the figure, and cluster 0 and cluster 1 respectively include multiple processor cores. Cluster 0 shows only processor core 0, and cluster 1 shows only processor core 1. Core 0 wants to write data to Core 1.
  • first, processor core 0 sends a unicast write request to write the data into local SRAM 0, with CDMA 0 acting as the master and CDMA 1 acting as the slave. The master pushes the write request to the slave; that is, the master sends the write address AW and the write data W, transferring the data to SRAM 1 of cluster 1, and the slave then returns a write response B as an acknowledgment. Finally, processor core 1 of cluster 1 sends a unicast read request to read the data out of SRAM 1.
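The flow above can be summarized in a toy Python model (our illustration; the class and method names are assumptions, and only the order of the AW/W/B handshake comes from the description):

```python
class Cdma:
    def __init__(self, sram):
        self.sram = sram                        # local SRAM of this cluster

    def push_write(self, slave, aw, w):
        # Master side: push the write request, i.e. send the write
        # address AW and the write data W to the slave CDMA.
        return slave.accept_write(aw, w)

    def accept_write(self, aw, w):
        # Slave side: commit the data to the local SRAM, then answer
        # with a write response B.
        self.sram[aw] = w
        return "B"

sram0, sram1 = {}, {}
cdma0, cdma1 = Cdma(sram0), Cdma(sram1)
sram0[0x10] = 42                                # core 0 writes into SRAM 0
resp = cdma0.push_write(cdma1, 0x10, sram0[0x10])
assert resp == "B" and sram1[0x10] == 42        # core 1 can now read SRAM 1
```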
  • the GDMA 311 cooperates with the external memory controller 301 to control the memory access from the SRAM 308 of the cluster 305 to the DRAM 204 , or to read data from the DRAM 204 to the SRAM 308 .
  • the communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be implemented through two channels. The first channel directly connects the DRAM 204 with the NRAM 431 or WRAM 432 through the IODMA 433; the second channel transfers data between the DRAM 204 and the SRAM 308 through the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 through the MVDMA 434.
  • a data transmission channel can be selected according to its own hardware conditions.
  • in some embodiments, the GDMA 311 and the functionality of the IODMA 433 may be integrated in the same component; for convenience of description, the GDMA 311 and the IODMA 433 are regarded herein as different components.
  • in other embodiments, the functions of the GDMA 311, the IODMA 433, the CDMA 310 and the MVDMA 434 can also be realized by the same component.
  • the storage core 307 will cut the image to be calculated into a plurality of small blocks and assign them to each processor core 306.
  • the matrix operation unit 422 processes one small block at a time, so it will perform a large number of multiplication and addition operations, and then accumulate the multiplication and addition results until the accumulation is completed.
  • the specific structure of the NRAM 431 and the WRAM 432 of this embodiment is a register file for temporarily storing the accumulated result.
  • the matrix operation unit 422 will continue to frequently access the NRAM 431 and the WRAM 432 to update the accumulated results, until the accumulation operation ends, and output the accumulated results to the SRAM 308.
  • as shown in FIG. 6, the register file 600 includes a register array 601, an enable logic group 602, and a read-write logic group 603.
  • the register array 601 includes a plurality of registers 604 logically arranged in a P×Q array; that is, the register array 601 has P groups and Q banks, with each row in the figure representing a group and each column a bank.
  • the enable logic group 602 includes Q enable logic gates 605, and each enable logic gate 605 is used to control all the registers 604 in one of the banks to be turned on to read and write the registers 604 of the bank.
  • the read-write logic group 603 includes P read-write logic gates 606, and each read-write logic gate 606 is used for writing data to a specific group or reading data from a specific group.
  • the register array 601 operates based on the clock signal CLK.
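As a behavioural sketch only (our own modelling assumptions, not the patent's circuit), the register file can be viewed as a P×Q array in which an enable gate selects one bank per access and the read-write gates select which groups of that bank latch data:

```python
class RegisterFile:
    """Toy model of register file 600: regs[group][bank]."""
    def __init__(self, P, Q):
        self.P, self.Q = P, Q
        self.regs = [[0] * Q for _ in range(P)]

    def write(self, bank, group_data):
        # 'bank' models the enable logic gate that turns on one bank;
        # 'group_data' (group index -> datum) models the read-write
        # logic gates selecting which groups latch the incoming data.
        for group, value in group_data.items():
            self.regs[group][bank] = value

    def read(self, bank, groups):
        return [self.regs[g][bank] for g in groups]

rf = RegisterFile(P=16, Q=4)
rf.write(bank=1, group_data={6: 0xA, 7: 0xB})   # enable bank 1, select groups 6, 7
assert rf.read(bank=1, groups=[6, 7]) == [0xA, 0xB]
```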
  • this embodiment supports multiple access modes to the register file 600, and these access modes can be roughly divided into two types: skip access modes and sequential access modes.
  • the skip access modes include a skip write mode and an update read mode. The skip write mode STRIDE M_N means that each time the register file 600 is accessed, N data are written to the register array 601 at intervals of M registers; the update read mode means that each time the register file 600 is accessed, N data are read from the register array 601 at intervals of M registers.
  • STRIDE 3_2, STRIDE 3_4, STRIDE 1_2, and STRIDE 1_4 modes in the skip write mode are exemplarily described below.
  • STRIDE 3_2 writes 2 data at intervals of 3 registers each time.
  • in the first clock cycle, with the pointer at i, the 1st datum is written to the i-th register of the specific bank and the 2nd datum to the (i+4)-th register; the pointer is then incremented by 1.
  • in the second clock cycle, with the pointer at i+1, the 1st datum is written to the (i+1)-th register of the specific bank and the 2nd datum to the (i+5)-th register; the pointer is then incremented by 1.
  • in the third clock cycle, with the pointer at i+2, the 1st datum is written to the (i+2)-th register of the specific bank and the 2nd datum to the (i+6)-th register; the pointer is then incremented by 1.
  • in the fourth clock cycle, with the pointer at i+3, the 1st datum is written to the (i+3)-th register of the specific bank and the 2nd datum to the (i+7)-th register.
  • if the pointer were now simply incremented by 1 it would become i+4, but the (i+4)-th register was already used in the first clock cycle, and pointing to it again would overwrite the data written then. Therefore, at the 4th clock cycle the pointer is incremented by 5 after writing, becoming i+8, so that the 1st datum of the next clock cycle is written into the (i+8)-th register of the specific bank.
  • STRIDE 3_4 writes or reads 4 data at intervals of 3 registers each time.
  • in the first clock cycle, with the pointer at i, the 1st datum is written to the i-th register of the specific bank, the 2nd to the (i+4)-th, the 3rd to the (i+8)-th, and the 4th to the (i+12)-th register; the pointer is then incremented by 1.
  • in the second clock cycle, with the pointer at i+1, the 1st datum is written to the (i+1)-th register, the 2nd to the (i+5)-th, the 3rd to the (i+9)-th, and the 4th to the (i+13)-th register; the pointer is then incremented by 1.
  • in the third clock cycle, with the pointer at i+2, the 1st datum is written to the (i+2)-th register, the 2nd to the (i+6)-th, the 3rd to the (i+10)-th, and the 4th to the (i+14)-th register; the pointer is then incremented by 1.
  • in the fourth clock cycle, with the pointer at i+3, the 1st datum is written to the (i+3)-th register, the 2nd to the (i+7)-th, the 3rd to the (i+11)-th, and the 4th to the (i+15)-th register.
  • more generally, at the S×(M+1)-th clock cycle the pointer is incremented after writing so that it points to the (i+S×N×(M+1))-th register.
  • if the pointer were now simply incremented by 1 it would become i+4, but the (i+4)-th register was already used in the first clock cycle, and pointing to it again would overwrite the data written then. Therefore, at the 4th clock cycle the pointer is incremented by 13 after writing, becoming i+16, so that the 1st datum of the next clock cycle is written into the (i+16)-th register of the specific bank.
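The pointer arithmetic of the skip write modes can be condensed into a short address-generator sketch (our reconstruction of the rules above; the end-of-round jump of N×(M+1)-M reproduces the increments of 5, 13, 3 and 7 quoted for the four modes):

```python
def stride_addresses(M, N, start=0, rounds=1):
    # STRIDE M_N: each clock cycle writes N data spaced (M+1) registers
    # apart; the pointer advances by 1 within a round of (M+1) cycles
    # and jumps to the next free block at the end of the round.
    pointer = start
    for _ in range(rounds):
        for cycle in range(M + 1):
            yield [pointer + k * (M + 1) for k in range(N)]
            if cycle < M:
                pointer += 1                    # ordinary increment
            else:
                pointer += N * (M + 1) - M      # end-of-round jump

# STRIDE 3_2 as walked through above: (i, i+4), (i+1, i+5), ..., then i+8.
assert list(stride_addresses(3, 2, rounds=2))[:5] == [
    [0, 4], [1, 5], [2, 6], [3, 7], [8, 12]]
```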
  • STRIDE 1_2 writes 2 data at intervals of 1 register each time.
  • in the first clock cycle, with the pointer at i, the 1st datum is written to the i-th register of the specific bank and the 2nd datum to the (i+2)-th register; the pointer is then incremented by 1.
  • in the second clock cycle, with the pointer at i+1, the 1st datum is written to the (i+1)-th register of the specific bank and the 2nd datum to the (i+3)-th register; at this 2nd clock cycle the pointer is incremented by 3 after writing, becoming i+4.
  • STRIDE 1_4 writes 4 data at intervals of 1 register each time.
  • in the first clock cycle, with the pointer at i, the 1st datum is written to the i-th register of the specific bank, the 2nd to the (i+2)-th, the 3rd to the (i+4)-th, and the 4th to the (i+6)-th register; the pointer is then incremented by 1.
  • in the second clock cycle, with the pointer at i+1, the 1st datum is written to the (i+1)-th register, the 2nd to the (i+3)-th, the 3rd to the (i+5)-th, and the 4th to the (i+7)-th register.
  • if the pointer were now simply incremented by 1, the data written in the first clock cycle would be overwritten. Therefore, at the 2nd clock cycle the pointer is incremented by 7 after writing, becoming i+8, so that the 1st datum of the next clock cycle is written into the (i+8)-th register of the specific bank.
  • in the update read mode, the register array 601 is used during the accumulation process to temporarily store the intermediate results of the partial sums: an intermediate result is read out, updated, and written back to the original register.
  • the accumulation process repeatedly uses the update read mode to update the intermediate results in the registers.
  • when fixed-point numbers are accumulated, the addition operation in this embodiment can be completed within 1 clock cycle, and the register from which the intermediate result is read and the register into which the updated intermediate result is written are the same. The operation of the update read mode is therefore similar to that of the skip write mode: each time, N data are read from the register array 601 at intervals of M registers, and the pointer is incremented in the same way as in the skip write mode, so the details are not repeated.
  • when floating-point numbers are accumulated, the addition operation takes 2 clock cycles to complete; that is, the intermediate result of the partial sum is read in the current cycle, and the updated intermediate result is written back in the second clock cycle. The operation is therefore roughly the same as for fixed-point numbers, except that in the update read mode involving floating-point numbers the pointers must be set 1 clock cycle earlier than the pointers of the skip write mode.
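Under the same assumptions, the update read mode is a read-modify-write over exactly these addresses; reusing the stride_addresses generator from the sketch above:

```python
regs = [0] * 16
for addrs in stride_addresses(3, 2, rounds=2):  # 8 cycles, 2 data per cycle
    for a in addrs:
        # Read the partial sum, update it, write it back to the same
        # register. For floating point, where the add takes 2 cycles,
        # the read pointer would run 1 clock cycle ahead of the write
        # pointer, as noted above.
        regs[a] = regs[a] + 1
assert regs == [1] * 16                         # every register updated once
```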
  • the sequential access mode means that each time the register file 600 is accessed, multiple data are read from or written to consecutive registers of the register array 601 along the bank direction. The sequential access mode can be further divided into a sequential write mode and a sequential read mode.
  • the sequential write mode is used to write S pieces of data into consecutive registers of the register array 601 at a time, and S is not greater than P.
  • S is not greater than (M+1)×N.
  • the sequential write modes exemplarily include the ORDER1, ORDER2, and ORDER4 modes: ORDER1 writes 1 datum sequentially and the pointer is incremented by 1 after the access; ORDER2 writes 2 data sequentially and the pointer is incremented by 2; ORDER4 writes 4 data sequentially and the pointer is incremented by 4.
  • the sequential read mode is used to read T pieces of data from consecutive registers of the register array 601 each time, and T is not greater than P.
  • T is not greater than (M+1)×N.
  • the sequential read mode is commonly used in the convolution calculation in the neural network. After the convolution calculation is completed, the calculation result is read out from the NRAM 431 and sent to the SRAM 308. At this time, the sequential read mode is adopted.
  • the sequential read modes exemplarily include ROUT1, ROUT2, and ROUT4: ROUT1 reads the data of 1 register sequentially each time and the pointer is incremented by 1; ROUT2 reads the data of 2 registers and the pointer is incremented by 2; ROUT4 reads the data of 4 registers and the pointer is incremented by 4.
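For comparison, the sequential modes reduce to a pointer that simply advances by the access width; a sketch under the same conventions as above:

```python
def order_addresses(S, start=0, accesses=4):
    # ORDER-S / ROUT-S: touch S consecutive registers per access and
    # advance the pointer by S afterwards.
    pointer = start
    for _ in range(accesses):
        yield list(range(pointer, pointer + S))
        pointer += S

assert list(order_addresses(2, accesses=2)) == [[0, 1], [2, 3]]   # ORDER2
```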
  • the computing device 201 may involve multiple access modes when performing neural network inference.
  • the computing device 201 of this embodiment needs to simultaneously support the aforementioned 7 access modes including ORDER1, ORDER2, ORDER4, STRIDE 3_2, STRIDE 3_4, STRIDE 1_2, and STRIDE 1_4.
  • before inference, the control module 41 first identifies, for each of these access modes, the minimum unit of the number of groups required per bank.
  • ORDER1 writes 1 datum in sequence each time, ORDER2 writes 2, and ORDER4 writes 4; for these three modes the minimum unit of each bank is only 4 registers, with which ORDER1 can write 4 rounds, ORDER2 2 rounds, and ORDER4 1 round.
  • STRIDE 3_2 writes 2 data at intervals of 3 registers each time, so one round of writing (where the final pointer increment is not 1 but 5) requires 8 registers; STRIDE 3_4 writes 4 data at intervals of 3 registers, so one round (final increment 13) requires 16 registers; STRIDE 1_2 writes 2 data at intervals of 1 register, so one round (final increment 3) requires 4 registers; and STRIDE 1_4 writes 4 data at intervals of 1 register, so one round (final increment 7) requires 8 registers.
  • for these 4 modes, the minimum unit of each bank is therefore 16 registers, with which STRIDE 3_2 can write 2 rounds, STRIDE 3_4 1 round, STRIDE 1_2 4 rounds, and STRIDE 1_4 2 rounds.
  • in the sequential write modes the minimum unit of each bank is 4 registers, while in the skip write modes it is 16 registers. Since 16 is exactly an integer multiple of 4, a bank of 16 registers allows both the sequential write modes and the skip write modes to write an integer number of rounds.
  • in addition to all the sequential write modes and skip write modes, the control module 41 also takes the sequential read and update read modes into account; however, the number of registers occupied per round by the sequential read and update read modes is the same as that of the sequential write and skip write modes, respectively. Furthermore, under normal circumstances the number of registers occupied per round in the sequential write modes is not large, so the minimum number of registers per bank is determined by the skip write modes, and the numbers of registers occupied per round by the various skip write modes are usually integer multiples of one another.
  • to determine the number of registers required per bank, in theory the control module 41 takes the least common multiple of the numbers of registers occupied per round by all access modes as the number of registers in each bank (that is, the number of groups); in practice it suffices to take the maximum value (M_q+1)×N_q of (M_i+1)×N_i over all skip write modes STRIDE M_i_N_i to ensure that each mode accesses an integer number of rounds in each bank. Having each mode access an integer number of rounds per bank avoids the problem that, as the operation continues for a period of time, the free space of the register file becomes scattered into irregular fragments.
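The sizing rule can be checked numerically; in this sketch (ours), the maximum of (M_i+1)×N_i over the four skip write modes coincides with the least common multiple of the registers each mode consumes per round:

```python
from math import lcm

skip_modes = [(3, 2), (3, 4), (1, 2), (1, 4)]     # (M_i, N_i) for STRIDE M_i_N_i
per_round = [(m + 1) * n for m, n in skip_modes]  # 8, 16, 4, 8 registers/round
assert max(per_round) == lcm(*per_round) == 16    # 16 groups per bank suffice
```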
  • moreover, since the sub-array 607 only needs (M_q+1)×N_q read-write logic gates 606 instead of all P read-write logic gates 606 participating in the operation, the number of read-write logic gates 606 that the NRAM 431 needs to control is reduced, which simplifies control.
  • based on the above, the control module 41 logically sets the sub-array 607 to 16×4, that is, an array with 16 groups and 4 banks, corresponding to 4 enable logic gates 701-704 and 16 read-write logic gates 705-720, as shown in FIG. 7.
  • the enable logic gates 701-704 are each used to synchronously enable the registers of one bank, and the 16 read-write logic gates 705-720 are used to select at most 4 data at a time to be input into a certain 4 registers among the 16 groups. R x,y denotes the register of the y-th bank of the x-th group.
  • FIG. 8 shows the mechanism for accessing the sub-array in the ORDER1 mode. Assume that all the registers of bank 0 (R 0,0 to R 15,0) and the registers of groups 0 to 5 of bank 1 (R 0,1 to R 5,1) are already occupied, as indicated by the shaded registers in the figure, so that the pointer points to R 6,1. Taking 8 data to be written in the ORDER1 mode in the queue of the control module 41 as an example, the enable logic gate 702 enables all the registers of bank 1 to receive data, while the other enable logic gates disable all the registers of the remaining banks.
  • since ORDER1 writes one datum in sequence at a time, when the 1st datum is written in the first write it is transmitted to all read-write logic gates 705-720, but according to the pointer only the read-write logic gate 711 allows it to be written, into R 6,1; the pointer is then incremented by 1 to point to R 7,1. When the 2nd datum is written in the second write, it is likewise transmitted to all read-write logic gates 705-720, and according to the pointer only the read-write logic gate 712 allows it to be written, into R 7,1; the pointer is then incremented by 1 to point to R 8,1. By analogy, the 8 data are written into R 6,1 to R 13,1, respectively. The number i in the figure represents the i-th write.
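Using the RegisterFile sketch introduced earlier (still our illustration, not the patent's logic), the FIG. 8 scenario plays out as follows:

```python
rf8 = RegisterFile(P=16, Q=4)
pointer = 6                                  # R 6,1 is the first free register
for i in range(1, 9):                        # the i-th write carries datum i
    rf8.write(bank=1, group_data={pointer: i})
    pointer += 1                             # ORDER1: increment by 1 per write
# The 8 data land in R 6,1 .. R 13,1, matching the figure.
assert rf8.read(bank=1, groups=list(range(6, 14))) == list(range(1, 9))
```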
  • FIG. 9 shows the mechanism for accessing the sub-array in the ORDER2 mode. Assuming that no registers are occupied, the pointer points to R 0,0. Taking 16 data to be written in the ORDER2 mode in the queue of the control module 41 as an example, the enable logic gate 701 enables all the registers of bank 0 to receive data, while the other enable logic gates disable all the registers of the remaining banks.
  • since ORDER2 writes 2 data in sequence each time, in the first write, according to the pointer, the read-write logic gate 705 allows the 1st datum to be written into R 0,0 and the read-write logic gate 706 allows the 2nd datum into R 1,0; the pointer is incremented by 2 to point to R 2,0. In the second write, the read-write logic gate 707 allows the 3rd datum to be written into R 2,0 and the read-write logic gate 708 allows the 4th datum into R 3,0; the pointer is incremented by 2 to point to R 4,0.
  • by analogy, the 16 data are written into R 0,0 to R 15,0, exactly filling bank 0, and the pointer finally points to R 0,1.
  • FIG. 10 shows the mechanism for accessing the sub-array in the ORDER4 mode. Assuming that all the registers of bank 0 and bank 1 and the registers of groups 0 to 11 of bank 2 (R 0,2 to R 11,2) are already occupied, as indicated by the shaded registers in the figure, the pointer points to R 12,2. Taking 8 data to be written in the ORDER4 mode in the queue of the control module 41 as an example, the enable logic gate 703 enables all the registers of bank 2 to receive data, while the other enable logic gates disable all the registers of the remaining banks.
  • since ORDER4 writes 4 data in sequence each time, in the first write, according to the pointer, the read-write logic gate 717 allows the 1st datum to be written into R 12,2, the read-write logic gate 718 allows the 2nd datum into R 13,2, the read-write logic gate 719 allows the 3rd datum into R 14,2, and the read-write logic gate 720 allows the 4th datum into R 15,2; the pointer is incremented by 4 to point to R 0,3.
  • at this point the enable logic gate 704 enables all the registers of bank 3 to receive data, while the other enable logic gates disable all the registers of the remaining banks. In the second write, according to the pointer, the read-write logic gate 705 allows the 5th datum to be written into R 0,3, the read-write logic gate 706 allows the 6th datum into R 1,3, the read-write logic gate 707 allows the 7th datum into R 2,3, and the read-write logic gate 708 allows the 8th datum into R 3,3.
  • FIG. 11 shows the mechanism for accessing the sub-array in the STRIDE 3_2 mode. Assuming that the registers R 0,0 to R 7,0 are occupied, as indicated by the shaded registers in the figure, the pointer points to R 8,0. Taking 16 data to be written in the STRIDE 3_2 mode in the queue of the control module 41 as an example, the enable logic gate 701 enables all the registers of bank 0 to receive data, while the other enable logic gates disable all the registers of the remaining banks.
  • in the first write, according to the pointer, the read-write logic gate 713 allows the 1st datum to be written into R 8,0 and the read-write logic gate 717 allows the 2nd datum into R 12,0; the pointer is incremented by 1 to point to R 9,0. In the second write, the read-write logic gate 714 allows the 3rd datum to be written into R 9,0 and the read-write logic gate 718 allows the 4th datum into R 13,0; the pointer is incremented by 1 to point to R 10,0. The remaining writes proceed by analogy until the writing of the 16 data is completed as shown in the figure.
  • FIG. 12 shows the mechanism for accessing the sub-array in the STRIDE 3_4 mode. Assuming that no registers are occupied, the pointer points to R 0,0. Taking 16 data to be written in the STRIDE 3_4 mode in the queue of the control module 41 as an example, the enable logic gate 701 enables all the registers of bank 0 to receive data, while the other enable logic gates disable all the registers of the remaining banks.
  • in the first write, according to the pointer, the read-write logic gate 705 allows the 1st datum to be written into R 0,0, the read-write logic gate 709 allows the 2nd datum into R 4,0, the read-write logic gate 713 allows the 3rd datum into R 8,0, and the read-write logic gate 717 allows the 4th datum into R 12,0; the pointer is incremented by 1 to point to R 1,0. In the second write, the read-write logic gate 706 allows the 5th datum to be written into R 1,0, the read-write logic gate 710 allows the 6th datum into R 5,0, the read-write logic gate 714 allows the 7th datum into R 9,0, and the read-write logic gate 718 allows the 8th datum into R 13,0; the pointer is incremented by 1 to point to R 2,0.
  • by analogy, the writing of the 16 data is completed as shown in the figure.
  • FIG. 13 shows the mechanism for accessing the sub-array in the STRIDE 1_2 mode. Assuming that all the registers of bank 0, bank 1 and bank 2 and the registers of groups 0 to 3 of bank 3 (R 0,3 to R 3,3) are already occupied, as indicated by the shaded registers in the figure, the pointer points to R 4,3. Taking 8 data to be written in the STRIDE 1_2 mode in the queue of the control module 41 as an example, the enable logic gate 704 enables all the registers of bank 3 to receive data, while the other enable logic gates disable all the registers of the remaining banks.
  • in the first write, according to the pointer, the read-write logic gate 709 allows the 1st datum to be written into R 4,3 and the read-write logic gate 711 allows the 2nd datum into R 6,3; the pointer is incremented by 1 to point to R 5,3. In the second write, the read-write logic gate 710 allows the 3rd datum to be written into R 5,3 and the read-write logic gate 712 allows the 4th datum into R 7,3; at this point the pointer is incremented by 3 to point to R 8,3. In the third write, the read-write logic gate 713 allows the 5th datum to be written into R 8,3 and the read-write logic gate 715 allows the 6th datum into R 10,3.
  • by analogy, the writing of the 8 data is completed as shown in the figure.
  • FIG. 14 shows the mechanism for accessing the sub-array in the STRIDE 1_4 mode. Assuming that the registers of groups 0 to 11 of bank 0 (R 0,0 to R 11,0) are already occupied, as indicated by the shaded registers in the figure, the pointer points to R 12,0. Taking 8 data to be written in the STRIDE 1_4 mode in the queue of the control module 41 as an example, the enable logic gate 701 first enables all the registers of bank 0 to receive data, while the other enable logic gates disable all the registers of the remaining banks.
  • in the first write, according to the pointer, the read-write logic gate 717 allows the 1st datum to be written into R 12,0 and the read-write logic gate 719 allows the 2nd datum into R 14,0. The enable logic gate 702 then enables all the registers of bank 1 to receive data, while the other enable logic gates disable all the registers of the remaining banks; the read-write logic gate 705 allows the 3rd datum to be written into R 0,1 and the read-write logic gate 707 allows the 4th datum into R 2,1. The pointer is incremented by 1 to point to R 13,0.
  • in the second write, the enable logic gate 701 re-enables all the registers of bank 0 to receive data, while the other enable logic gates disable all the registers of the remaining banks; the read-write logic gate 718 allows the 5th datum to be written into R 13,0 and the read-write logic gate 720 allows the 6th datum into R 15,0. The enable logic gate 702 then re-enables all the registers of bank 1, while the other enable logic gates disable all the registers of the remaining banks; the read-write logic gate 706 allows the 7th datum to be written into R 1,1 and the read-write logic gate 708 allows the 8th datum into R 3,1.
  • the writing of the 8 data is thus completed as shown in the figure.
  • the register access methods of the sequential read mode and the update read mode are basically the same as those of the sequential write and skip write modes; the difference is that data are read from the registers rather than written to them. This can be easily understood by those skilled in the art based on the aforementioned write modes, so the register access methods of the sequential read mode and the update read mode are not described in detail.
  • the enable logic group 602 and the read-write logic group 603 are a combination of logic gates to implement the aforementioned control method.
  • the combination of logic gates for realizing this kind of control mode is well known to those skilled in the art, so it is not repeated here.
  • by appropriately planning the groups and banks of the register array, this embodiment reduces the read and write ports of the registers and, together with the access-mode selection logic, effectively reduces the power consumption of the register array.
  • Another embodiment of the present invention is a method for using the aforementioned register array to support multiple access modes.
  • the method is executed by the computing device 201 , as shown in FIG. 15 , and includes the following steps.
  • in step 1501, (M+1)×N groups and R banks of the register array are set as a sub-array, wherein (M+1)×N is not greater than P and R is not greater than Q.
  • the computing device 201 of this embodiment likewise needs to support the aforementioned 7 access modes: ORDER1, ORDER2, ORDER4, STRIDE 3_2, STRIDE 3_4, STRIDE 1_2, and STRIDE 1_4.
  • before inference, the control module 41 first identifies, for each of these access modes, the minimum unit of the number of groups required per bank.
  • ORDER1 writes 1 datum in sequence each time, ORDER2 writes 2, and ORDER4 writes 4; for these three modes the minimum unit of each bank is only 4 registers, with which ORDER1 can write 4 rounds, ORDER2 2 rounds, and ORDER4 1 round.
  • when the access mode is STRIDE M_N, the register array 601 only needs a sub-array 607 of (M+1)×N groups and R banks to meet the storage requirement. Since the register array 601 is P×Q, (M+1)×N cannot be greater than P, and R is any number not greater than Q. In other words, the P×Q register array 601 can logically carve out a ((M+1)×N)×R sub-array 607 to serve the STRIDE M_N access pattern.
  • STRIDE 3_2 writes 2 data at intervals of 3 registers each time, so one round of writing (where the final pointer increment is not 1 but 5) requires 8 registers; STRIDE 3_4 writes 4 data at intervals of 3 registers, so one round (final increment 13) requires 16 registers; STRIDE 1_2 writes 2 data at intervals of 1 register, so one round (final increment 3) requires 4 registers; and STRIDE 1_4 writes 4 data at intervals of 1 register, so one round (final increment 7) requires 8 registers. For these 4 modes, the minimum unit of each bank is 16 registers, with which STRIDE 3_2 can write 2 rounds, STRIDE 3_4 1 round, STRIDE 1_2 4 rounds, and STRIDE 1_4 2 rounds.
  • in the sequential write modes the minimum unit of each bank is 4 registers, while in the skip write modes it is 16 registers. Since 16 is exactly an integer multiple of 4, a bank of 16 registers allows both the sequential write modes and the skip write modes to write an integer number of rounds.
  • in addition to all the sequential write modes and skip write modes, the control module 41 also takes the sequential read and update read modes into account; however, the number of registers occupied per round by the sequential read and update read modes is the same as that of the sequential write and skip write modes, respectively. Furthermore, under normal circumstances the number of registers occupied per round in the sequential write modes is not large, so the minimum number of registers per bank is determined by the skip write modes, and the numbers of registers occupied per round by the various skip write modes are usually integer multiples of one another.
  • to determine the number of registers required per bank, in theory the control module 41 takes the least common multiple of the numbers of registers occupied per round by all access modes as the number of registers in each bank (that is, the number of groups); in practice it suffices to take the maximum value (M_q+1)×N_q of (M_i+1)×N_i over all skip write modes STRIDE M_i_N_i to ensure that each mode accesses an integer number of rounds in each bank. Having each mode access an integer number of rounds per bank avoids the problem that, as the operation continues for a period of time, the free space of the register file becomes scattered into irregular fragments.
  • moreover, since the sub-array 607 only needs (M_q+1)×N_q read-write logic gates 606 instead of all P read-write logic gates 606 participating in the operation, the number of read-write logic gates 606 that the NRAM 431 needs to control is reduced, which simplifies control.
  • in step 1502, the registers of one of the banks of the sub-array are synchronously enabled.
  • in step 1503, the N data are selected to be input into the corresponding N registers among the groups, respectively.
  • in this embodiment, the control module 41 logically sets the sub-array 607 to 16×4, that is, an array with 16 groups and 4 banks, corresponding to 4 enable logic gates 701-704, and the 16 read-write logic gates 705-720 are used to select at most 4 data at a time to be input into a certain 4 registers among the 16 groups.
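Steps 1501 to 1503 can be strung together on the RegisterFile sketch from earlier (our illustration; the helper skip_write and its checks are assumptions, not claim language):

```python
def skip_write(rf, M, N, R, bank, pointer, data):
    assert (M + 1) * N <= rf.P and R <= rf.Q   # step 1501: sub-array fits
    assert bank < R and len(data) == N
    targets = [pointer + k * (M + 1) for k in range(N)]
    rf.write(bank, dict(zip(targets, data)))   # step 1502 enables the bank,
    return targets                             # step 1503 selects the N groups

rf15 = RegisterFile(P=16, Q=4)
# Matches the first write of FIG. 11: data go to R 8,0 and R 12,0.
assert skip_write(rf15, M=3, N=2, R=4, bank=0, pointer=8, data=[7, 9]) == [8, 12]
```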
  • another embodiment of the present invention is a computer-readable storage medium having stored thereon computer program code for supporting an access mode using a register array; when the computer program code is executed by a processor, it executes the method of the aforementioned embodiment.
  • the above integrated units may be implemented in the form of software program modules. If implemented in the form of a software program module and sold or used as a stand-alone product, the integrated unit may be stored in a computer-readable memory.
  • the software product may be stored in a memory, which may include several instructions to cause a computer device (eg, a personal computer, a server or network equipment, etc.) to execute some or all of the steps of the method described in the embodiments of the present invention.
  • the aforementioned memory may include, but is not limited to, a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or other media that can store program code.
  • in summary, by appropriately planning the groups and banks of the register array, the invention reduces the read and write ports of the registers and, together with the access-mode selection logic, effectively reduces the power consumption of the register array.
  • the electronic device or apparatus of the present invention may include servers, cloud servers, server clusters, data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, PC equipment, IoT terminals, mobile terminals, mobile phones, driving recorders, navigators, sensors, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, visual terminals, autonomous driving terminals, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound machines, and/or electrocardiographs.
  • the electronic device or device of the present invention can also be applied to the Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical care and other fields. Further, the electronic device or device of the present invention can also be used in application scenarios related to artificial intelligence, big data and/or cloud computing, such as cloud, edge terminal, and terminal.
  • the electronic device or apparatus with high computing power according to the solution of the present invention can be applied to cloud devices (such as cloud servers), while the electronic device or apparatus with low power consumption can be applied to terminal devices and/or edge devices (such as smartphones or cameras).
  • the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that the hardware resources of the cloud device can be retrieved from the hardware information of the terminal device and/or the edge device according to the hardware information of the terminal device and/or the edge device. Match the appropriate hardware resources to simulate the hardware resources of terminal devices and/or edge devices, so as to complete the unified management, scheduling and collaborative work of device-cloud integration or cloud-edge-device integration.
  • the present invention expresses some methods and their embodiments as a series of actions and their combinations, but those skilled in the art can understand that the solution of the present invention is not limited by the sequence of the described actions . Accordingly, based on the disclosure or teachings of the present invention, those skilled in the art will understand that some of the steps may be performed in other orders or simultaneously. Further, those skilled in the art can understand that the embodiments described in the present invention may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present invention. In addition, according to different solutions, the present invention also has different emphases in the description of some embodiments. In view of this, those skilled in the art can understand the parts that are not described in detail in a certain embodiment of the present invention, and can also refer to the related descriptions of other embodiments.
  • Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units.
  • The aforementioned components or units may be co-located or distributed over multiple network elements.
  • Some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention.
  • Multiple units in the embodiments of the present invention may be integrated into one unit, or each unit may physically exist separately.
  • The above integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits.
  • The physical realization of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors.
  • The various apparatuses described herein (e.g., computing apparatuses or other processing apparatuses) may be realized by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs.
  • The aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic or magneto-optical storage medium, etc.), for example a resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, and the like.
  • Clause A1: A register file supporting multiple access modes, including a register array having P groups and Q banks, the multiple access modes including a skip write mode that writes N data into the register array at a time, each datum being stored M registers apart, the multiple access modes using (M+1)×N groups and R banks of the register array for storage, where (M+1)×N is not greater than P and R is not greater than Q.
  • Clause A2: The register file of clause A1, further including R enable logic gates, each enable logic gate synchronously enabling the registers of one bank.
  • Clause A4: The register file of clause A1, wherein the multiple access modes further include a sequential write mode that writes S data at a time into consecutive registers of the register array, where S is not greater than P.
  • Clause A5: The register file of clause A1, wherein the multiple access modes further include a sequential read mode that reads T data at a time from consecutive registers of the register array, where T is not greater than P.
  • Clause A6: The register file of clause A1, wherein the multiple access modes further include an update read mode that reads N data at a time from the register array at intervals of M registers.
  • Clause A7: A computing apparatus supporting multiple access modes, the multiple access modes including a plurality of skip write modes, the i-th skip write mode writing N_i data at a time, each datum being stored M_i registers apart. The computing apparatus includes a register array having P groups and Q banks; the multiple access modes use (M_q+1)×N_q groups and R banks for storage, where (M_q+1)×N_q is not greater than P and R is not greater than Q, and (M_q+1)×N_q is the maximum of (M_i+1)×N_i over the plurality of skip write modes.
  • Clause A8: The computing apparatus of clause A7, further including R enable logic gates, each enable logic gate synchronously enabling the registers of one bank.
  • Clause A9: The computing apparatus of clause A7, further including (M_q+1)×N_q write logic gates for selecting to input the N_i data into N_i registers of the (M_q+1)×N_q groups.
  • Clause A10: The computing apparatus of clause A7, wherein the multiple access modes further include a plurality of sequential write modes, the i-th sequential write mode writing S_i data at a time into consecutive registers of the register array, where S_i is not greater than P.
  • Clause A11: The computing apparatus of clause A7, wherein the multiple access modes further include a plurality of sequential read modes, the i-th sequential read mode reading T_i data at a time from consecutive registers of the register array, where T_i is not greater than P.
  • Clause A12: The computing apparatus of clause A7, wherein the multiple access modes further include a plurality of update read modes, the i-th update read mode reading N_i data at a time from the register array at intervals of M_i registers.
  • Clause A13: The computing apparatus of clause A7, further including a control module for identifying (M_q+1)×N_q.
  • Clause A16: A method of using a register array to support a skip write mode, the register array having P groups and Q banks, the skip write mode writing N data into the register array at a time, each datum being stored M registers apart. The method includes: setting (M+1)×N groups and R banks of the register array as a sub-array, where (M+1)×N is not greater than P and R is not greater than Q; synchronously enabling the registers of one bank of the sub-array; and selecting to input the N data into the N registers of the groups, respectively.
  • Clause A17: A method of using a register array to support multiple access modes, the register array having P groups and Q banks, the multiple access modes including a plurality of skip write modes, the i-th skip write mode writing N_i data at a time, each datum being stored M_i registers apart. The method includes: setting (M_q+1)×N_q groups and R banks of the register array as a sub-array, where (M_q+1)×N_q is not greater than P and R is not greater than Q; synchronously enabling the registers of one bank of the sub-array; and selecting to input the N_i data into N_i registers of the groups, respectively; where (M_q+1)×N_q is the maximum of (M_i+1)×N_i over the plurality of skip write modes.
  • Clause A18: A computer-readable storage medium on which computer program code for using a register array to support access modes is stored; when the computer program code is executed by a processing apparatus, the method of any one of clauses A16 to A17 is performed.

Abstract

A device, method, and readable storage medium supporting multiple access modes, in which a computing apparatus (201) is included in an integrated circuit device (20); the integrated circuit device (20) includes a universal interconnect interface and another processing apparatus (203). The computing apparatus (201) interacts with the other processing apparatus (203) to jointly complete computing operations specified by a user. The integrated circuit device (20) may further include a storage apparatus (204), which is connected to the computing apparatus (201) and the other processing apparatus (203), respectively, and is used for data storage of the computing apparatus (201) and the other processing apparatus (203).

Description

Device, Method, and Readable Storage Medium Supporting Multiple Access Modes
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 2020113600447, filed on November 27, 2020, and entitled "Device, Method, and Readable Storage Medium Supporting Multiple Access Modes".
Technical Field
The present invention relates generally to the field of computers, and more particularly to a device, method, and readable storage medium supporting multiple access modes.
Background
A neural network is a computational model composed of a large number of interconnected nodes (or neurons). Each node represents a specific output function, called an activation function, and each connection between two nodes represents a weighting value for the signal passing through that connection, called a weight, which serves as the memory of the artificial neural network. The output of the network depends on how the nodes are connected and on the values of the weights and the activation functions. The network itself is usually an approximation of some algorithm or function in nature, or an expression of a logical strategy.
The core operation of a convolutional neural network is the convolution operation: a convolution kernel slides across an image matrix to extract various feature values, which requires substantial hardware resources. Because the multipliers and adders in an artificial intelligence processing chip are limited, in practice the convolution is partitioned into many small regions that are accumulated piece by piece; each partial accumulation must be temporarily stored in a register file, and the accumulated results are merged after every region has been computed. Such an operation requires a large number of register accesses.
Moreover, in some convolution operations (for example, depthwise separable convolution), data is written in the order of the depth dimension but may be read in the order of another dimension. After a large number of such accesses, the free space of the register file becomes scattered irregularly throughout the file, and register file accesses become very complex and inefficient.
Furthermore, current convolution operations require several access modes, including sequential access and skip access: sequential access reads or writes several results in order at a time, while skip access reads or writes several data at intervals of several registers. These access modes alternate during neural network inference, and after the computation has run for some time they likewise cause register file accesses to become very complex and inefficient.
A technical solution that supports multiple access modes is therefore urgently needed.
Summary
To at least partially solve the technical problems mentioned in the background, the present invention provides a device, method, and readable storage medium supporting multiple access modes.
In one aspect, the present invention discloses a register file supporting multiple access modes, including a register array having P groups and Q banks. The multiple access modes include a skip write mode that writes N data into the register array at a time, each datum being stored M registers apart. The multiple access modes use (M+1)×N groups and R banks of the register array for storage, where (M+1)×N is not greater than P and R is not greater than Q.
In another aspect, the present invention discloses a computing apparatus supporting multiple access modes, the multiple access modes including a plurality of skip write modes, the i-th skip write mode writing N_i data at a time, each datum being stored M_i registers apart. The computing apparatus includes a register array having P groups and Q banks; the multiple access modes use (M_q+1)×N_q groups and R banks for storage, where (M_q+1)×N_q is not greater than P and R is not greater than Q, and (M_q+1)×N_q is the maximum of (M_i+1)×N_i over the plurality of skip write modes.
In another aspect, the present invention discloses an integrated circuit device including the aforementioned computing apparatus, and also discloses a board card including the aforementioned integrated circuit device.
In another aspect, the present invention discloses a method of using a register array to support a skip write mode, the register array having P groups and Q banks, the skip write mode writing N data into the register array at a time, each datum being stored M registers apart. The method includes: setting (M+1)×N groups and R banks of the register array as a sub-array, where (M+1)×N is not greater than P and R is not greater than Q; synchronously enabling the registers of one bank of the sub-array; and selecting to input the N data into the N registers of the groups, respectively.
In another aspect, the present invention discloses a method of using a register array to support multiple access modes, the register array having P groups and Q banks, the multiple access modes including a plurality of skip write modes, the i-th skip write mode writing N_i data at a time, each datum being stored M_i registers apart. The method includes: setting (M_q+1)×N_q groups and R banks of the register array as a sub-array, where (M_q+1)×N_q is not greater than P and R is not greater than Q; synchronously enabling the registers of one bank of the sub-array; and selecting to input the N_i data into N_i registers of the groups, respectively; where (M_q+1)×N_q is the maximum of (M_i+1)×N_i over the plurality of skip write modes.
In another aspect, the present invention discloses a computer-readable storage medium on which computer program code for using a register array to support access modes is stored; when the computer program code is executed by a processing apparatus, the aforementioned methods are performed.
By appropriately planning the groups and banks of the register array, the present invention reduces the number of read/write ports of the registers, that is, the amount of data at the read/write ports; combined with the access mode selection logic, this effectively reduces the power consumption of the register array.
Brief Description of the Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of example and not limitation, and identical or corresponding reference numerals denote identical or corresponding parts, in which:
Fig. 1 is a structural diagram showing a board card according to an embodiment of the present invention;
Fig. 2 is a structural diagram showing an integrated circuit device according to an embodiment of the present invention;
Fig. 3 is a schematic diagram showing the internal structure of a computing apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic diagram showing the internal structure of a processor core according to an embodiment of the present invention;
Fig. 5 is a schematic diagram showing a processor core writing data to a processor core of another cluster;
Fig. 6 is a schematic diagram showing the register file in the NRAM according to an embodiment of the present invention;
Fig. 7 is a schematic diagram showing a 16×4 sub-array according to an embodiment of the present invention;
Fig. 8 is a schematic diagram showing access to the sub-array in ORDER1 mode according to an embodiment of the present invention;
Fig. 9 is a schematic diagram showing access to the sub-array in ORDER2 mode according to an embodiment of the present invention;
Fig. 10 is a schematic diagram showing access to the sub-array in ORDER4 mode according to an embodiment of the present invention;
Fig. 11 is a schematic diagram showing access to the sub-array in STRIDE 3_2 mode according to an embodiment of the present invention;
Fig. 12 is a schematic diagram showing access to the sub-array in STRIDE 3_4 mode according to an embodiment of the present invention;
Fig. 13 is a schematic diagram showing access to the sub-array in STRIDE 1_2 mode according to an embodiment of the present invention;
Fig. 14 is a schematic diagram showing access to the sub-array in STRIDE 1_4 mode according to an embodiment of the present invention; and
Fig. 15 is a flowchart showing a method of using the aforementioned register array to support multiple access modes according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
It should be understood that the terms "first", "second", "third", "fourth", and the like in the claims, description, and drawings of the present invention are used to distinguish different objects rather than to describe a particular order. The terms "include" and "comprise" used in the description and claims indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present invention. As used in the description and claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the description and claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in the description and claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting".
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the present invention. As shown in Fig. 1, the board card 10 includes a chip 101, which is a system on chip (SoC) integrating one or more combined processing devices. A combined processing device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms and meets the intelligent processing needs of complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep learning technology in particular is widely applied in cloud intelligence; a notable feature of cloud intelligence applications is the large volume of input data, which places high demands on the storage and computing capabilities of the platform. The board card 10 of this embodiment is suitable for cloud intelligence applications, having large off-chip storage, large on-chip storage, and strong computing power.
The chip 101 is connected to an external device 103 through an external interface apparatus 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a wifi interface. Data to be processed can be transferred from the external device 103 to the chip 101 through the external interface apparatus 102, and the computation results of the chip 101 can be transmitted back to the external device 103 via the external interface apparatus 102. Depending on the application scenario, the external interface apparatus 102 may take different interface forms, such as a PCIe interface.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to and exchanges data with a control device 106 and the chip 101 through a bus. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a micro controller unit (MCU).
Fig. 2 is a structural diagram showing the combined processing device in the chip 101 of this embodiment. As shown in Fig. 2, the combined processing device 20 includes a computing apparatus 201, an interface apparatus 202, a processing apparatus 203, and a DRAM 204.
The computing apparatus 201 is configured to perform operations specified by the user and is mainly implemented as a single-core or multi-core intelligent processor for performing deep learning or machine learning computations. It can interact with the processing apparatus 203 through the interface apparatus 202 to jointly complete the operations specified by the user.
The interface apparatus 202 transfers data and control instructions between the computing apparatus 201 and the processing apparatus 203. For example, the computing apparatus 201 may obtain input data from the processing apparatus 203 via the interface apparatus 202 and write it into an on-chip storage device of the computing apparatus 201. Further, the computing apparatus 201 may obtain control instructions from the processing apparatus 203 via the interface apparatus 202 and write them into an on-chip control cache of the computing apparatus 201. Alternatively or optionally, the interface apparatus 202 may also read the data in the storage device of the computing apparatus 201 and transmit it to the processing apparatus 203.
The processing apparatus 203, as a general-purpose processing apparatus, performs basic controls including but not limited to data transfer and starting and/or stopping the computing apparatus 201. Depending on the implementation, the processing apparatus 203 may be one or more types of processors among a central processing unit (CPU), a graphics processing unit (GPU), or other general-purpose and/or special-purpose processors, including but not limited to a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on, and the number of processors may be determined according to actual needs. As mentioned above, the computing apparatus 201 of the present invention taken alone may be regarded as having a single-core structure or a homogeneous multi-core structure; however, when the computing apparatus 201 and the processing apparatus 203 are considered together, the two are regarded as forming a heterogeneous multi-core structure.
The DRAM 204 stores the data to be processed. It is DDR memory, typically 16 GB or larger, and holds data for the computing apparatus 201 and/or the processing apparatus 203.
Fig. 3 shows a schematic diagram of the internal structure of the computing apparatus 201. The computing apparatus 201 processes input data for computer vision, speech, natural language, data mining, and the like. The computing apparatus 201 in the figure adopts a multi-core hierarchical design: as a system on chip it includes multiple clusters, and each cluster in turn includes multiple processor cores. In other words, the computing apparatus 201 is organized in a system-on-chip / cluster / processor-core hierarchy.
At the system-on-chip level, as shown in Fig. 3, the computing apparatus 201 includes an external storage controller 301, a peripheral communication module 302, an on-chip interconnect module 303, a synchronization module 304, and multiple clusters 305.
There may be multiple external storage controllers 301, two of which are shown in the figure by way of example. They respond to access requests issued by the processor cores to access an external storage device, such as the DRAM 204 in Fig. 2, thereby reading data from or writing data to off-chip memory. The peripheral communication module 302 receives control signals from the processing apparatus 203 through the interface apparatus 202 and starts the computing apparatus 201 to perform tasks. The on-chip interconnect module 303 connects the external storage controllers 301, the peripheral communication module 302, and the multiple clusters 305, and transfers data and control signals between the modules. The synchronization module 304 is a global barrier controller (GBC) that coordinates the work progress of the clusters and ensures synchronization of information. The multiple clusters 305 are the computing cores of the computing apparatus 201; four are shown in the figure by way of example. With the development of hardware, the computing apparatus 201 of the present invention may also include 8, 16, 64, or even more clusters 305. The clusters 305 efficiently execute deep learning algorithms.
At the cluster level, as shown in Fig. 3, each cluster 305 includes multiple processor cores (IPU cores) 306 and one memory core (MEM core) 307.
Four processor cores 306 are shown in the figure by way of example; the present invention does not limit their number. The internal architecture of a processor core is shown in Fig. 4. Each processor core 306 includes three main modules: a control module 41, a computation module 42, and a storage module 43.
The control module 41 coordinates and controls the work of the computation module 42 and the storage module 43 to complete deep learning tasks. It includes an instruction fetch unit (IFU) 411 and an instruction decode unit (IDU) 412. The instruction fetch unit 411 obtains instructions from the processing apparatus 203, and the instruction decode unit 412 decodes the obtained instructions and sends the decoding results to the computation module 42 and the storage module 43 as control information.
The computation module 42 includes a vector computation unit 421 and a matrix computation unit 422. The vector computation unit 421 performs vector operations and supports complex operations such as vector multiplication, addition, and nonlinear transformations; the matrix computation unit 422 is responsible for the core computations of deep learning algorithms, namely matrix multiplication and convolution.
The storage module 43 stores or transfers related data and includes a neuron storage unit (neuron RAM, NRAM) 431, a weight storage unit (weight RAM, WRAM) 432, an input/output direct memory access module (IODMA) 433, and a move direct memory access module (MVDMA) 434. The NRAM 431 stores the feature maps to be computed by the processor core 306 and the intermediate results after computation; the WRAM 432 stores the weights of the deep learning network; the IODMA 433 controls memory accesses between the NRAM 431/WRAM 432 and the DRAM 204 through a broadcast bus 309; and the MVDMA 434 controls memory accesses between the NRAM 431/WRAM 432 and an SRAM 308.
Returning to Fig. 3, the memory core 307 is mainly used for storage and communication, that is, storing shared data or intermediate results among the processor cores 306, as well as performing communication between the cluster 305 and the DRAM 204, communication among the clusters 305, communication among the processor cores 306, and so on. In other embodiments, the memory core 307 has scalar computation capability and is used to perform scalar operations.
The memory core 307 includes a shared storage unit (SRAM) 308, the broadcast bus 309, a cluster direct memory access module (CDMA) 310, and a global direct memory access module (GDMA) 311. The SRAM 308 plays the role of a high-performance data relay station: data reused among different processor cores 306 within the same cluster 305 does not need to be obtained individually from the DRAM 204 by each processor core 306, but is relayed among the processor cores 306 via the SRAM 308. The memory core 307 only needs to quickly distribute the reused data from the SRAM 308 to the multiple processor cores 306, which improves inter-core communication efficiency and greatly reduces on-chip/off-chip input/output accesses.
The broadcast bus 309, the CDMA 310, and the GDMA 311 are used respectively for communication among the processor cores 306, communication among the clusters 305, and data transfer between a cluster 305 and the DRAM 204. These are described below in turn.
The broadcast bus 309 is used for high-speed communication among the processor cores 306 within a cluster 305. The broadcast bus 309 of this embodiment supports inter-core communication methods including unicast, multicast, and broadcast. Unicast is point-to-point data transfer (i.e., from a single processor core to a single processor core); multicast transfers one piece of data from the SRAM 308 to several specific processor cores 306; and broadcast, a special case of multicast, transfers one piece of data from the SRAM 308 to all processor cores 306.
The CDMA 310 controls SRAM 308 accesses between different clusters 305 within the same computing apparatus 201. Fig. 5 shows a schematic diagram of a processor core writing data to a processor core of another cluster, illustrating the working principle of the CDMA 310. In this application scenario, the same computing apparatus includes multiple clusters; for convenience of description, only cluster 0 and cluster 1 are shown in the figure, each including multiple processor cores. Likewise for convenience, only processor core 0 of cluster 0 and processor core 1 of cluster 1 are shown. Processor core 0 wants to write data to processor core 1.
First, processor core 0 sends a unicast write request to write the data into its local SRAM 0. CDMA 0 acts as the master and CDMA 1 acts as the slave; the master pushes the write request to the slave, that is, the master sends the write address AW and the write data W, transferring the data into SRAM 1 of cluster 1. The slave then sends a write response B in reply, and finally processor core 1 of cluster 1 sends a unicast read request to read the data out of SRAM 1.
Returning to Fig. 3, the GDMA 311 cooperates with the external storage controller 301 to control memory accesses from the SRAM 308 of a cluster 305 to the DRAM 204, or to read data from the DRAM 204 into the SRAM 308. As can be seen from the foregoing, communication between the DRAM 204 and the NRAM 431 or WRAM 432 can be realized via two channels. The first channel connects the DRAM 204 with the NRAM 431 or WRAM 432 directly through the IODMA 433. The second channel first transfers data between the DRAM 204 and the SRAM 308 via the GDMA 311, and then between the SRAM 308 and the NRAM 431 or WRAM 432 via the MVDMA 434. Although the second channel appears to involve more components and a longer data path, in some embodiments its bandwidth is actually much greater than that of the first channel, so communication between the DRAM 204 and the NRAM 431 or WRAM 432 may be more efficient through the second channel. Embodiments of the present invention may select a data transfer channel according to their own hardware conditions.
In other embodiments, the functions of the GDMA 311 and the IODMA 433 may be integrated into the same component. For convenience of description, the present invention treats the GDMA 311 and the IODMA 433 as different components; for those skilled in the art, as long as the functions realized and the technical effects achieved are similar to those of the present invention, they fall within the protection scope of the present invention. Further, the functions of the GDMA 311, the IODMA 433, the CDMA 310, and the MVDMA 434 may also be realized by the same component.
Since the number of multipliers and adders in the matrix computation unit 422 is limited, when the matrix computation unit 422 performs a convolution computation, the memory core 307 partitions the image to be computed into multiple small blocks and distributes them to the processor cores 306. The matrix computation unit 422 processes one small block at a time, performing a large number of multiply-accumulate operations and accumulating the results until the accumulation is complete. To support this mode of operation, the NRAM 431 and WRAM 432 of this embodiment are concretely structured as register files used to buffer the accumulation results. The matrix computation unit 422 continuously accesses the NRAM 431 and WRAM 432 at high frequency to update the accumulation results until the accumulation ends, and then outputs the accumulation results to the SRAM 308.
Fig. 6 shows a schematic diagram of the register file in the NRAM 431. As shown, the register file 600 includes a register array 601, an enable logic group 602, and a read/write logic group 603. The register array 601 includes multiple registers 604 logically arranged as a P×Q array, that is, the register array 601 has P groups and Q banks; each row in the figure represents a group and each column represents a bank. The enable logic group 602 includes Q enable logic gates 605, each of which turns on all registers 604 of one bank so that the registers 604 of that bank can be read or written. The read/write logic group 603 includes P read/write logic gates 606, each of which writes data to or reads data from a particular group. The register array 601 operates based on a clock signal CLK.
This embodiment supports multiple modes of accessing the register file 600, which can be broadly divided into skip access modes and sequential access modes.
The skip access modes include a skip write mode and an update read mode. The skip write mode STRIDE M_N means that each access to the register file 600 writes N data into the register array 601 at intervals of M registers; the update read mode means that each access reads N data from the register array 601 at intervals of M registers.
The STRIDE 3_2, STRIDE 3_4, STRIDE 1_2, and STRIDE 1_4 skip write modes are described below by way of example.
STRIDE 3_2 writes 2 data at intervals of 3 registers each time. In the 1st clock cycle, with the pointer at i, the 1st datum is written to the i-th register of a particular bank and the 2nd datum is written to the (i+4)-th register of that bank; after the write, the pointer increments by 1. In the 2nd clock cycle, with the pointer at i+1, the 1st datum is written to the (i+1)-th register and the 2nd datum to the (i+5)-th register; the pointer then increments by 1. In the 3rd clock cycle, with the pointer at i+2, the 1st datum is written to the (i+2)-th register and the 2nd datum to the (i+6)-th register; the pointer then increments by 1. In the 4th clock cycle, with the pointer at i+3, the 1st datum is written to the (i+3)-th register and the 2nd datum to the (i+7)-th register. At this point, incrementing the pointer by 1 would make it i+4, but the (i+4)-th register was already used in the 1st clock cycle; pointing to it again would overwrite the data written then. Therefore, in the 4th clock cycle the pointer increments by 5 after the write, becoming i+8, so that in the next clock cycle the 1st datum is written to the (i+8)-th register of the bank.
STRIDE 3_4 writes or reads 4 data at intervals of 3 registers each time. In the 1st clock cycle, with the pointer at i, the 1st datum is written to the i-th register of a particular bank, the 2nd datum to the (i+4)-th register, the 3rd datum to the (i+8)-th register, and the 4th datum to the (i+12)-th register; after the write, the pointer increments by 1. In the 2nd clock cycle, with the pointer at i+1, the four data are written to the (i+1)-th, (i+5)-th, (i+9)-th, and (i+13)-th registers; the pointer then increments by 1. In the 3rd clock cycle, with the pointer at i+2, the four data are written to the (i+2)-th, (i+6)-th, (i+10)-th, and (i+14)-th registers; the pointer then increments by 1. In the 4th clock cycle, with the pointer at i+3, the four data are written to the (i+3)-th, (i+7)-th, (i+11)-th, and (i+15)-th registers.
At this point, incrementing the pointer by 1 would make it i+4, but the (i+4)-th register was already used in the 1st clock cycle; pointing to it again would overwrite the data written then. Therefore, in the 4th clock cycle the pointer increments by 13 after the write, becoming i+16, so that in the next clock cycle the 1st datum is written to the (i+16)-th register of the bank.
STRIDE 1_2 writes 2 data at intervals of 1 register each time. In the 1st clock cycle, with the pointer at i, the 1st datum is written to the i-th register of a particular bank and the 2nd datum to the (i+2)-th register; after the write, the pointer increments by 1. In the 2nd clock cycle, with the pointer at i+1, the 1st datum is written to the (i+1)-th register and the 2nd datum to the (i+3)-th register. At this point, incrementing the pointer by 1 would overwrite the data written in the 1st clock cycle, so in the 2nd clock cycle the pointer increments by 3 after the write, becoming i+4, so that in the next clock cycle the 1st datum is written to the (i+4)-th register of the bank.
STRIDE 1_4 writes 4 data at intervals of 1 register each time. In the 1st clock cycle, with the pointer at i, the four data are written to the i-th, (i+2)-th, (i+4)-th, and (i+6)-th registers of a particular bank; after the write, the pointer increments by 1. In the 2nd clock cycle, with the pointer at i+1, the four data are written to the (i+1)-th, (i+3)-th, (i+5)-th, and (i+7)-th registers. At this point, incrementing the pointer by 1 would overwrite the data written in the 1st clock cycle, so in the 2nd clock cycle the pointer increments by 7 after the write, becoming i+8, so that in the next clock cycle the 1st datum is written to the (i+8)-th register of the bank.
In summary, at every integer multiple of (M+1) clock cycles the pointer does not increment by 1 but must instead increment by (M+1)×N−M.
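To make the pointer rule concrete, the following minimal Python sketch (an illustration only, not part of the patented hardware; the function name and interface are assumptions chosen for the example) generates the register indices touched by a STRIDE M_N skip write each cycle and checks them against the walk-throughs above.

    def skip_write_addresses(M, N, start=0, cycles=8):
        # Register indices touched per cycle by a STRIDE M_N skip write.
        # After every (M+1)-th cycle the pointer jumps by (M+1)*N - M
        # instead of 1, so the next round starts just past the block of
        # (M+1)*N registers that the previous round filled.
        p = start
        for c in range(1, cycles + 1):
            yield [p + k * (M + 1) for k in range(N)]
            p += (M + 1) * N - M if c % (M + 1) == 0 else 1

    # STRIDE 3_2: four cycles fill registers i..i+7, then restart at i+8.
    assert list(skip_write_addresses(3, 2, cycles=5)) == [
        [0, 4], [1, 5], [2, 6], [3, 7], [8, 12]]
    # STRIDE 1_4: two cycles fill i..i+7, then the pointer jumps to i+8.
    assert list(skip_write_addresses(1, 4, cycles=3)) == [
        [0, 2, 4, 6], [1, 3, 5, 7], [8, 10, 12, 14]]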
When the access mode is STRIDE M_N, the register array 601 needs only a sub-array 607 of (M+1)×N groups and R banks to satisfy the storage requirement. Since the register array 601 is P×Q, (M+1)×N cannot be greater than P, and R is any number not greater than Q. In other words, a ((M+1)×N)×R sub-array 607 can be logically carved out of the P×Q register array 601 for the STRIDE M_N access mode. It will be appreciated that when (M+1)×N=P and R=Q, the entire register array 601 is used for STRIDE M_N storage.
In the update read mode, the register array 601 is used during accumulation to buffer the intermediate results of partial sums. The vector computation unit 421 or the matrix computation unit 422 reads an intermediate result, adds it to a datum, and stores the updated intermediate result back into the original register. The accumulation process repeatedly uses the update read mode to update the intermediate results in the registers. The operation of the update read mode for fixed-point and floating-point numbers is detailed below.
If the operation involves fixed-point numbers, the addition of this embodiment completes within 1 clock cycle, and the register from which the intermediate result is read is the same register into which the updated intermediate result is written. The update read mode therefore operates similarly to the skip write mode, reading N data from the register array 601 at intervals of M registers each time, and its pointer increments in the same way as the skip write pointer, so the details are not repeated.
If the operation involves floating-point numbers, taking 32-bit floating point (float32) as an example, the addition takes 2 clock cycles to complete: the intermediate result of the partial sum is read in the current cycle, and the updated intermediate result is written back only in the 2nd clock cycle. The operation is therefore substantially the same as for fixed-point numbers, except that the pointer of the floating-point update read mode must be set 1 clock cycle earlier than the skip write pointer.
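The effect of the one-cycle-early pointer can be pictured with a toy two-stage model. The sketch below only illustrates the ordering of reads and write-backs, not the actual hardware timing, and every name in it is invented for the illustration.

    def update_read_float(regs, addrs, addends):
        # Toy model of the float32 update read: the add retires one cycle
        # after its read, so while address a(t+1) is being read, the
        # updated partial sum for a(t) is written back; the read pointer
        # therefore runs one cycle ahead of the write-back pointer.
        pending = None  # (address, updated partial sum) still in flight
        for addr, x in zip(addrs, addends):
            partial = regs[addr]            # cycle t: read the partial sum
            if pending is not None:         # cycle t: retire cycle t-1's add
                regs[pending[0]] = pending[1]
            pending = (addr, partial + x)
        if pending is not None:             # drain the last in-flight add
            regs[pending[0]] = pending[1]

    regs = [0.0] * 8
    update_read_float(regs, addrs=[0, 2, 4, 6], addends=[1.5, 2.5, 3.5, 4.5])
    assert regs == [1.5, 0.0, 2.5, 0.0, 3.5, 0.0, 4.5, 0.0]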
The sequential access modes mean that each access to the register file 600 reads or writes multiple data from/into consecutive registers of the register array 601 along the bank direction. The sequential access modes can be further divided into a sequential write mode and a sequential read mode.
In this embodiment, the sequential write mode writes S data at a time into consecutive registers of the register array 601, with S not greater than P. In one application scenario where the multiple access modes include both skip write and sequential write, S is not greater than (M+1)×N. The sequential write modes exemplarily include ORDER1, ORDER2, and ORDER4: ORDER1 writes 1 datum sequentially with the pointer incrementing by 1 after the access, ORDER2 writes 2 data sequentially with the pointer incrementing by 2, and ORDER4 writes 4 data sequentially with the pointer incrementing by 4.
The sequential read mode reads T data at a time from consecutive registers of the register array 601, with T not greater than P. In one application scenario where the multiple access modes include both update read and sequential read, T is not greater than (M+1)×N. The sequential read mode is commonly used in convolution computations in neural networks: when a convolution computation finishes, the computation results are read out of the NRAM 431 and sent to the SRAM 308, at which point the sequential read mode is used. The sequential read modes exemplarily include ROUT1, ROUT2, and ROUT4: ROUT1 reads the data of 1 register sequentially each time with the pointer incrementing by 1, ROUT2 reads the data of 2 registers with the pointer incrementing by 2, and ROUT4 reads the data of 4 registers with the pointer incrementing by 4.
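Compared with the skip modes, a sequential access is simply a pointer plus a block transfer. The short sketch below (names assumed for illustration) shows an ORDER2 write followed by an ORDER4 write on one bank; ROUT reads behave the same way with the transfer direction reversed.

    def order_write(bank, pointer, data):
        # ORDER-S sequential write: S data go into consecutive registers,
        # after which the pointer advances by S (here S = 1, 2, or 4).
        for k, v in enumerate(data):
            bank[pointer + k] = v
        return pointer + len(data)

    bank = [None] * 16
    p = order_write(bank, 0, ["x", "y"])            # ORDER2
    p = order_write(bank, p, ["u", "v", "w", "z"])  # ORDER4
    assert bank[:6] == ["x", "y", "u", "v", "w", "z"] and p == 6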
As neural network models become more and more complex, the register access operations required by various operators may differ, so the computing apparatus 201 is involved in multiple access modes when performing neural network inference. In one application scenario, the computing apparatus 201 of this embodiment must simultaneously support the 7 access modes ORDER1, ORDER2, ORDER4, STRIDE 3_2, STRIDE 3_4, STRIDE 1_2, and STRIDE 1_4. Before inference, the control module 41 first identifies, across these access modes, the minimum unit of the number of groups each bank requires.
ORDER1 writes 1 datum sequentially each time, ORDER2 writes 2, and ORDER4 writes 4. For these 3 modes, the minimum unit per bank is only 4 registers, which allows ORDER1 to write 4 rounds, ORDER2 to write 2 rounds, and ORDER4 to write 1 round.
Considering the skip write modes: STRIDE 3_2 writes 2 data at intervals of 3 registers, so one round of writes (i.e., until the pointer increments not by 1 but by 5) requires 8 registers; STRIDE 3_4 writes 4 data at intervals of 3 registers, so one round (pointer increment 13) requires 16 registers; STRIDE 1_2 writes 2 data at intervals of 1 register, so one round (pointer increment 3) requires 4 registers; and STRIDE 1_4 writes 4 data at intervals of 1 register, so one round (pointer increment 7) requires 8 registers. For these 4 modes, the minimum unit per bank is 16 registers, which allows STRIDE 3_2 to write 2 rounds, STRIDE 3_4 to write 1 round, STRIDE 1_2 to write 4 rounds, and STRIDE 1_4 to write 2 rounds.
In the sequential write modes the minimum unit per bank is 4 registers, and in the skip write modes it is 16 registers. Since 16 is exactly an integer multiple of 4, 16 registers allow both the sequential write modes and the skip write modes to write an integer number of rounds in each bank.
Besides evaluating all sequential write and skip write modes, the control module 41 also takes the sequential read and update read modes into account, but the number of registers occupied per round by the sequential read and update read modes is identical to that of the sequential write and skip write modes, respectively. Moreover, in general the sequential write modes occupy few registers per round, so the minimum number of registers per bank is usually determined by the skip write modes, and the per-round register counts of the skip write modes are mostly integer multiples of one another. In summary, when the control module 41 determines the number of registers each mode needs per round, in theory it takes the least common multiple of the per-round register counts of all access modes as the number of registers (i.e., the number of groups) per bank; in practice it only needs to determine the maximum value (M_q+1)×N_q of (M_i+1)×N_i over all skip write modes STRIDE M_i_N_i to ensure that every mode accesses an integer number of rounds in each bank. Having every mode access an integer number of rounds in each bank avoids the problem of the free space of the register file being scattered irregularly throughout the file after the computation has run for some time. Furthermore, since the sub-array 607 needs only (M_q+1)×N_q read/write logic gates 606 rather than all P read/write logic gates 606, the number of read/write logic gates 606 that the NRAM 431 has to control is reduced, which simplifies control.
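The control module's sizing decision can be reproduced in a few lines. The sketch below (the mode table is taken from this embodiment; the variable names are assumptions) takes the maximum round size over the skip write modes and checks that every mode then completes a whole number of rounds within a 16-register bank.

    skip_modes = {"STRIDE 3_2": (3, 2), "STRIDE 3_4": (3, 4),
                  "STRIDE 1_2": (1, 2), "STRIDE 1_4": (1, 4)}
    order_rounds = {"ORDER1": 1, "ORDER2": 2, "ORDER4": 4}

    # One round of STRIDE M_N spans (M+1)*N consecutive registers of a bank.
    round_regs = {name: (m + 1) * n for name, (m, n) in skip_modes.items()}
    groups_per_bank = max(round_regs.values())  # (Mq+1)*Nq

    assert groups_per_bank == 16
    # Every mode fits an integer number of rounds into the 16 registers:
    assert all(groups_per_bank % r == 0 for r in round_regs.values())
    assert all(groups_per_bank % s == 0 for s in order_rounds.values())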
To support the aforementioned 7 access modes simultaneously, as shown in Fig. 7, the control module 41 logically sets the sub-array 607 to 16×4, that is, an array with 16 groups and 4 banks, with 4 corresponding enable logic gates 701-704 that each synchronously enable the registers of one bank, and 16 read/write logic gates 705-720 that select at most 4 data at a time to be input into some 4 registers of the 16 groups. For convenience of description, R(x,y) denotes the register in group x of bank y.
Fig. 8 shows the mechanism of accessing the sub-array in ORDER1 mode. Assume all registers of bank 0 (R(0,0) to R(15,0)) and the registers of groups 0 to 5 of bank 1 (R(0,1) to R(5,1)) are already occupied, as indicated by the shaded registers in the figure; the pointer then points to R(6,1). Taking as an example a write of 8 data in ORDER1 mode queued in the control module 41: enable logic gate 702 enables all registers of bank 1 to receive data, while the remaining enable logic gates disable all registers of the other banks. Since ORDER1 writes 1 datum sequentially each time, when the 1st datum is written it is transmitted to all read/write logic gates 705-720 but, according to the pointer, only read/write logic gate 711 lets the 1st datum through to be written to R(6,1); the pointer increments by 1 to point to R(7,1). When the 2nd datum is written it is likewise transmitted to all read/write logic gates 705-720 and, according to the pointer, only read/write logic gate 712 lets the 2nd datum be written to R(7,1); the pointer increments by 1 to point to R(8,1). And so on: the 8 data are written to R(6,1) through R(13,1). The number i in the figure denotes the i-th write.
Fig. 9 shows the mechanism of accessing the sub-array in ORDER2 mode. Assume no register is occupied; the pointer points to R(0,0). Taking as an example a write of 16 data in ORDER2 mode queued in the control module 41: enable logic gate 701 enables all registers of bank 0 to receive data, while the remaining enable logic gates disable all registers of the other banks. Since ORDER2 writes 2 data sequentially each time, in the 1st write, according to the pointer, read/write logic gate 705 writes the 1st datum to R(0,0) and gate 706 writes the 2nd datum to R(1,0); the pointer increments by 2 to point to R(2,0). In the 2nd write, according to the pointer, gate 707 writes the 3rd datum to R(2,0) and gate 708 writes the 4th datum to R(3,0); the pointer increments by 2 to point to R(4,0). And so on: the 16 data are written to R(0,0) through R(15,0), exactly filling bank 0, and the pointer finally points to R(0,1).
Fig. 10 shows the mechanism of accessing the sub-array in ORDER4 mode. Assume all registers of banks 0 and 1 and the registers of groups 0 to 11 of bank 2 (R(0,2) to R(11,2)) are already occupied, as indicated by the shaded registers in the figure; the pointer points to R(12,2). Taking as an example a write of 8 data in ORDER4 mode queued in the control module 41: enable logic gate 703 enables all registers of bank 2 to receive data, while the remaining enable logic gates disable all registers of the other banks. Since ORDER4 writes 4 data sequentially each time, in the 1st write, according to the pointer, read/write logic gate 717 writes the 1st datum to R(12,2), gate 718 writes the 2nd datum to R(13,2), gate 719 writes the 3rd datum to R(14,2), and gate 720 writes the 4th datum to R(15,2); the pointer increments by 4 to point to R(0,3). Enable logic gate 704 then enables all registers of bank 3 to receive data, while the remaining enable logic gates disable all registers of the other banks. In the 2nd write, according to the pointer, gate 705 writes the 5th datum to R(0,3), gate 706 writes the 6th datum to R(1,3), gate 707 writes the 7th datum to R(2,3), and gate 708 writes the 8th datum to R(3,3); the pointer increments by 4 to point to R(4,3).
Fig. 11 shows the mechanism of accessing the sub-array in STRIDE 3_2 mode. Assume the registers R(0,0) to R(7,0) are already occupied, as indicated by the shaded registers in the figure; the pointer points to R(8,0). Taking as an example a write of 16 data in STRIDE 3_2 mode queued in the control module 41: enable logic gate 701 enables all registers of bank 0 to receive data, while the remaining enable logic gates disable all registers of the other banks. In the 1st write, according to the pointer, read/write logic gate 713 writes the 1st datum to R(8,0) and gate 717 writes the 2nd datum to R(12,0); the pointer increments by 1 to point to R(9,0). In the 2nd write, gate 714 writes the 3rd datum to R(9,0) and gate 718 writes the 4th datum to R(13,0); the pointer increments by 1 to point to R(10,0). And so on, completing the write of 16 data as shown in the figure.
Fig. 12 shows the mechanism of accessing the sub-array in STRIDE 3_4 mode. Assume no register is occupied; the pointer points to R(0,0). Taking as an example a write of 16 data in STRIDE 3_4 mode queued in the control module 41: enable logic gate 701 enables all registers of bank 0 to receive data, while the remaining enable logic gates disable all registers of the other banks. In the 1st write, according to the pointer, read/write logic gate 705 writes the 1st datum to R(0,0), gate 709 writes the 2nd datum to R(4,0), gate 713 writes the 3rd datum to R(8,0), and gate 717 writes the 4th datum to R(12,0); the pointer increments by 1 to point to R(1,0). In the 2nd write, gate 706 writes the 5th datum to R(1,0), gate 710 writes the 6th datum to R(5,0), gate 714 writes the 7th datum to R(9,0), and gate 718 writes the 8th datum to R(13,0); the pointer increments by 1 to point to R(2,0). And so on, completing the write of 16 data as shown in the figure.
Fig. 13 shows the mechanism of accessing the sub-array in STRIDE 1_2 mode. Assume all registers of banks 0, 1, and 2 and the registers of groups 0 to 3 of bank 3 (R(0,3) to R(3,3)) are already occupied, as indicated by the shaded registers in the figure; the pointer points to R(4,3). Taking as an example a write of 8 data in STRIDE 1_2 mode queued in the control module 41: enable logic gate 704 enables all registers of bank 3 to receive data, while the remaining enable logic gates disable all registers of the other banks. In the 1st write, according to the pointer, read/write logic gate 709 writes the 1st datum to R(4,3) and gate 711 writes the 2nd datum to R(6,3); the pointer increments by 1 to point to R(5,3). In the 2nd write, gate 710 writes the 3rd datum to R(5,3) and gate 712 writes the 4th datum to R(7,3); the pointer then increments by 3 to point to R(8,3), and gate 713 writes the 5th datum to R(8,3) and gate 715 writes the 6th datum to R(10,3). And so on, completing the write of 8 data as shown in the figure.
Fig. 14 shows the mechanism of accessing the sub-array in STRIDE 1_4 mode. Assume the registers of groups 0 to 11 of bank 0 (R(0,0) to R(11,0)) are already occupied, as indicated by the shaded registers in the figure; the pointer points to R(12,0). Taking as an example a write of 8 data in STRIDE 1_4 mode queued in the control module 41: first, enable logic gate 701 enables all registers of bank 0 to receive data, while the remaining enable logic gates disable all registers of the other banks. In the 1st write, according to the pointer, read/write logic gate 717 writes the 1st datum to R(12,0) and gate 719 writes the 2nd datum to R(14,0); then enable logic gate 702 enables all registers of bank 1 to receive data while the remaining enable logic gates disable all registers of the other banks, and gate 705 writes the 3rd datum to R(0,1) and gate 707 writes the 4th datum to R(2,1); the pointer increments by 1 to point to R(13,0). Enable logic gate 701 then re-enables all registers of bank 0 to receive data while the remaining enable logic gates disable all registers of the other banks. In the 2nd write, gate 718 writes the 5th datum to R(13,0) and gate 720 writes the 6th datum to R(15,0); then enable logic gate 702 re-enables all registers of bank 1 while the remaining enable logic gates disable all registers of the other banks, and gate 706 writes the 7th datum to R(1,1) and gate 708 writes the 8th datum to R(3,1). The write of 8 data is thus completed as shown in the figure.
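The interplay of the enable gates and the read/write gates in Figs. 8-14 can be mimicked in software. In the following sketch the class and function names are assumptions made for the illustration, not the patent's terminology; it reproduces the first four pulses of the Fig. 11 STRIDE 3_2 walk-through.

    class SubArray:
        # 16 groups x 4 banks; one bank is enabled per write pulse, and at
        # most 4 of the group-level read/write gates pass data in a pulse.
        def __init__(self, groups=16, banks=4):
            self.groups = groups
            self.regs = [[None] * banks for _ in range(groups)]

        def write(self, bank, pulse):
            # pulse maps group index -> datum; gates 705-720 let at most
            # 4 data through, and only the enabled bank latches them.
            assert len(pulse) <= 4
            for g, v in pulse.items():
                self.regs[g][bank] = v

    def stride_write(sub, M, N, start_group, bank, payload):
        # Drive one bank with the STRIDE M_N pattern described above.
        p, it = start_group, iter(payload)
        for c in range(1, len(payload) // N + 1):
            pulse = {p + k * (M + 1): next(it) for k in range(N)}
            assert all(g < sub.groups for g in pulse)  # stay inside the bank
            sub.write(bank, pulse)
            p += (M + 1) * N - M if c % (M + 1) == 0 else 1

    # Fig. 11: pointer at R(8,0), STRIDE 3_2 on bank 0; data 1..8 land as
    # R(8,0)=1, R(12,0)=2, R(9,0)=3, R(13,0)=4, ... exactly as in the figure.
    sub = SubArray()
    stride_write(sub, M=3, N=2, start_group=8, bank=0, payload=list(range(1, 9)))
    assert [sub.regs[g][0] for g in range(8, 16)] == [1, 3, 5, 7, 2, 4, 6, 8]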
The register access methods of the sequential read mode and the update read mode are basically the same as those of sequential write and skip write; the difference is that data is read from the registers rather than written into them. Those skilled in the art can readily understand the register access methods of the sequential read and update read modes from the foregoing write modes, so they are not repeated here.
The enable logic group 602 and the read/write logic group 603 are combinations of logic gates that realize the aforementioned control methods. Logic gate combinations realizing such control methods are well known to those skilled in the art and are not described further.
By appropriately planning the groups and banks of the register array, this embodiment reduces the number of read/write ports of the registers, that is, the amount of data at the read/write ports; combined with the access mode selection logic, this effectively reduces the power consumption of the register array.
Another embodiment of the present invention is a method of using the aforementioned register array to support multiple access modes, performed by the computing apparatus 201 and, as shown in Fig. 15, including the following steps.
In step 1501, (M+1)×N groups and R banks of the register array are set as a sub-array, where (M+1)×N is not greater than P and R is not greater than Q.
The computing apparatus 201 of this embodiment likewise needs to support the 7 access modes ORDER1, ORDER2, ORDER4, STRIDE 3_2, STRIDE 3_4, STRIDE 1_2, and STRIDE 1_4. Before inference, the control module 41 first identifies, across these access modes, the minimum unit of the number of groups each bank requires.
ORDER1 writes 1 datum sequentially each time, ORDER2 writes 2, and ORDER4 writes 4. For these 3 modes, the minimum unit per bank is only 4 registers, which allows ORDER1 to write 4 rounds, ORDER2 to write 2 rounds, and ORDER4 to write 1 round.
Considering the skip write modes: when the access mode is STRIDE M_N, the register array 601 needs only a sub-array 607 of (M+1)×N groups and R banks to satisfy the storage requirement. Since the register array 601 is P×Q, (M+1)×N cannot be greater than P, and R is any number not greater than Q. In other words, a ((M+1)×N)×R sub-array 607 can be logically carved out of the P×Q register array 601 for the STRIDE M_N access mode.
STRIDE 3_2 writes 2 data at intervals of 3 registers, so one round of writes (i.e., until the pointer increments not by 1 but by 5) requires 8 registers; STRIDE 3_4 writes 4 data at intervals of 3 registers, so one round (pointer increment 13) requires 16 registers; STRIDE 1_2 writes 2 data at intervals of 1 register, so one round (pointer increment 3) requires 4 registers; and STRIDE 1_4 writes 4 data at intervals of 1 register, so one round (pointer increment 7) requires 8 registers. For these 4 modes, the minimum unit per bank is 16 registers, which allows STRIDE 3_2 to write 2 rounds, STRIDE 3_4 to write 1 round, STRIDE 1_2 to write 4 rounds, and STRIDE 1_4 to write 2 rounds.
In the sequential write modes the minimum unit per bank is 4 registers, and in the skip write modes it is 16 registers. Since 16 is exactly an integer multiple of 4, 16 registers allow both the sequential write modes and the skip write modes to write an integer number of rounds in each bank.
Besides evaluating all sequential write and skip write modes, the control module 41 also takes the sequential read and update read modes into account, but the number of registers occupied per round by the sequential read and update read modes is identical to that of the sequential write and skip write modes, respectively. Moreover, in general the sequential write modes occupy few registers per round, so the minimum number of registers per bank is usually determined by the skip write modes, and the per-round register counts of the skip write modes are mostly integer multiples of one another. In summary, when the control module 41 determines the number of registers each mode needs per round, in theory it takes the least common multiple of the per-round register counts of all access modes as the number of registers (i.e., the number of groups) per bank; in practice it only needs to determine the maximum value (M_q+1)×N_q of (M_i+1)×N_i over all skip write modes STRIDE M_i_N_i to ensure that every mode accesses an integer number of rounds in each bank. Having every mode access an integer number of rounds in each bank avoids the problem of the free space of the register file being scattered irregularly throughout the file after the computation has run for some time. Furthermore, since the sub-array 607 needs only (M_q+1)×N_q read/write logic gates 606 rather than all P read/write logic gates 606, the number of read/write logic gates 606 that the NRAM 431 has to control is reduced, which simplifies control.
In step 1502, the registers of one bank of the sub-array are synchronously enabled. In step 1503, the N data are selected to be input into the N registers of the groups, respectively. To support the aforementioned 7 access modes simultaneously, the control module 41 logically sets the sub-array 607 to 16×4, that is, an array with 16 groups and 4 banks, with 4 corresponding enable logic gates 701-704 that each synchronously enable the registers of one bank, and 16 read/write logic gates 705-720 that select at most 4 data at a time to be input into some 4 registers of the 16 groups.
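Read as software, steps 1501 to 1503 amount to carving out the sub-array, enabling one bank, and steering each cycle's N data. The sketch below is a schematic rendering of the method under those assumptions (names invented for the example), not a hardware description; it handles one round within one bank.

    def skip_write_method(P, Q, M, N, data, bank=0):
        # Step 1501: the sub-array uses (M+1)*N groups of R banks.
        sub_groups = (M + 1) * N
        assert sub_groups <= P and bank < Q
        column = [None] * sub_groups     # step 1502: the one enabled bank
        p = 0
        for c in range(1, len(data) // N + 1):   # step 1503, cycle by cycle
            for k in range(N):
                idx = p + k * (M + 1)
                assert idx < sub_groups  # this sketch covers one round
                column[idx] = data[(c - 1) * N + k]
            p += (M + 1) * N - M if c % (M + 1) == 0 else 1
        return column

    # STRIDE 1_2 on a 64x4 register array: the sub-array is 4 groups deep.
    assert skip_write_method(64, 4, 1, 2, ["a", "b", "c", "d"]) == ["a", "c", "b", "d"]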
Another embodiment of the present invention is a computer-readable storage medium on which computer program code for using a register array to support access modes is stored; when the computer program code is executed by a processor, the methods of the foregoing embodiments are performed. In some implementation scenarios, the above integrated units may be implemented in the form of software program modules. If implemented as a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. On this basis, when the solution of the present invention is embodied as a software product (for example, a computer-readable storage medium), the software product may be stored in a memory and may include several instructions that cause a computer device (for example, a personal computer, a server, or a network device) to execute some or all of the steps of the methods described in the embodiments of the present invention. The aforementioned memory may include, but is not limited to, various media that can store program code, such as a USB flash drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
By appropriately planning the groups and banks of the register array, the present invention reduces the number of read/write ports of the registers, that is, the amount of data at the read/write ports; combined with the access mode selection logic, this effectively reduces the power consumption of the register array.
Depending on the application scenario, the electronic device or apparatus of the present invention may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a PC device, an Internet-of-Things terminal, a mobile terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, earphones, mobile storage, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph. The electronic device or apparatus of the present invention may also be applied to fields such as the Internet, the Internet of Things, data centers, energy, transportation, public administration, manufacturing, education, power grids, telecommunications, finance, retail, construction sites, and medical care. Further, the electronic device or apparatus of the present invention may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as the cloud, the edge, and the terminal. In one or more embodiments, an electronic device or apparatus with high computing power according to the solution of the present invention may be applied to a cloud device (for example, a cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (for example, a smartphone or a webcam). In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or edge device are mutually compatible, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or edge device to simulate the hardware resources of the terminal device and/or edge device, thereby completing unified management, scheduling, and collaborative work of device-cloud integration or cloud-edge-device integration.
It should be noted that, for the sake of brevity, the present invention expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the solution of the present invention is not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present invention, those skilled in the art will understand that some of the steps may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the embodiments described in the present invention may be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily required for realizing one or more solutions of the present invention. In addition, depending on the solution, the descriptions of some embodiments in the present invention have different emphases. In view of this, those skilled in the art will understand that, for parts not described in detail in a certain embodiment of the present invention, reference may also be made to the related descriptions of other embodiments.
In terms of specific implementation, based on the disclosure and teaching of the present invention, those skilled in the art will understand that several embodiments disclosed in the present invention may also be realized in other ways not disclosed herein. For example, regarding the units in the foregoing electronic device or apparatus embodiments, they are split herein on the basis of logical functions, but there may be other ways of splitting in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions in a unit or component may be selectively disabled. As for the connection relationships between different units or components, the connections discussed above in conjunction with the drawings may be direct or indirect couplings between the units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections utilizing interfaces, where the communication interfaces may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present invention, units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed over multiple network units. In addition, according to actual needs, some or all of the units may be selected to achieve the purpose of the solutions described in the embodiments of the present invention. In addition, in some scenarios, multiple units in the embodiments of the present invention may be integrated into one unit, or each unit may physically exist separately.
In other implementation scenarios, the above integrated units may also be implemented in the form of hardware, that is, as specific hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the hardware structure of a circuit may include, but is not limited to, physical devices, and the physical devices may include, but are not limited to, devices such as transistors or memristors. In view of this, the various apparatuses described herein (for example, computing apparatuses or other processing apparatuses) may be realized by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage unit or storage apparatus may be any suitable storage medium (including magnetic storage media, magneto-optical storage media, and the like), which may be, for example, a resistive random access memory (RRAM), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
Clause A1. A register file supporting multiple access modes, including a register array having P groups and Q banks, the multiple access modes including a skip write mode that writes N data into the register array at a time, each datum being stored M registers apart, the multiple access modes using (M+1)×N groups and R banks of the register array for storage, where (M+1)×N is not greater than P and R is not greater than Q.
Clause A2. The register file of clause A1, further including R enable logic gates, each enable logic gate synchronously enabling the registers of one bank.
Clause A3. The register file of clause A1, further including (M+1)×N write logic gates for selecting to input the N data into N registers of the (M+1)×N groups.
Clause A4. The register file of clause A1, wherein the multiple access modes further include a sequential write mode that writes S data at a time into consecutive registers of the register array, where S is not greater than P.
Clause A5. The register file of clause A1, wherein the multiple access modes further include a sequential read mode that reads T data at a time from consecutive registers of the register array, where T is not greater than P.
Clause A6. The register file of clause A1, wherein the multiple access modes further include an update read mode that reads N data at a time from the register array at intervals of M registers.
Clause A7. A computing apparatus supporting multiple access modes, the multiple access modes including a plurality of skip write modes, the i-th skip write mode writing N_i data at a time, each datum being stored M_i registers apart, the computing apparatus including: a register array having P groups and Q banks, the multiple access modes using (M_q+1)×N_q groups and R banks for storage, where (M_q+1)×N_q is not greater than P and R is not greater than Q; where (M_q+1)×N_q is the maximum of (M_i+1)×N_i over the plurality of skip write modes.
Clause A8. The computing apparatus of clause A7, further including R enable logic gates, each enable logic gate synchronously enabling the registers of one bank.
Clause A9. The computing apparatus of clause A7, further including (M_q+1)×N_q write logic gates for selecting to input the N_i data into N_i registers of the (M_q+1)×N_q groups.
Clause A10. The computing apparatus of clause A7, wherein the multiple access modes further include a plurality of sequential write modes, the i-th sequential write mode writing S_i data at a time into consecutive registers of the register array, where S_i is not greater than P.
Clause A11. The computing apparatus of clause A7, wherein the multiple access modes further include a plurality of sequential read modes, the i-th sequential read mode reading T_i data at a time from consecutive registers of the register array, where T_i is not greater than P.
Clause A12. The computing apparatus of clause A7, wherein the multiple access modes further include a plurality of update read modes, the i-th update read mode reading N_i data at a time from the register array at intervals of M_i registers.
Clause A13. The computing apparatus of clause A7, further including a control module for identifying (M_q+1)×N_q.
Clause A14. An integrated circuit device including the computing apparatus of any one of clauses A7 to A13.
Clause A15. A board card including the integrated circuit device of clause A14.
Clause A16. A method of using a register array to support a skip write mode, the register array having P groups and Q banks, the skip write mode writing N data into the register array at a time, each datum being stored M registers apart, the method including: setting (M+1)×N groups and R banks of the register array as a sub-array, where (M+1)×N is not greater than P and R is not greater than Q; synchronously enabling the registers of one bank of the sub-array; and selecting to input the N data into the N registers of the groups, respectively.
Clause A17. A method of using a register array to support multiple access modes, the register array having P groups and Q banks, the multiple access modes including a plurality of skip write modes, the i-th skip write mode writing N_i data at a time, each datum being stored M_i registers apart, the method including: setting (M_q+1)×N_q groups and R banks of the register array as a sub-array, where (M_q+1)×N_q is not greater than P and R is not greater than Q; synchronously enabling the registers of one bank of the sub-array; and selecting to input the N_i data into N_i registers of the groups, respectively; where (M_q+1)×N_q is the maximum of (M_i+1)×N_i over the plurality of skip write modes.
Clause A18. A computer-readable storage medium on which computer program code for using a register array to support access modes is stored; when the computer program code is executed by a processing apparatus, the method of any one of clauses A16 to A17 is performed.
The embodiments of the present invention have been described in detail above. Specific examples have been used herein to explain the principles and implementations of the present invention, and the descriptions of the above embodiments are only intended to help understand the method and core idea of the present invention. At the same time, those of ordinary skill in the art, based on the idea of the present invention, may make changes to the specific implementations and the scope of application. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (18)

  1. A register file supporting multiple access modes, including a register array having P groups and Q banks, the multiple access modes including a skip write mode that writes N data into the register array at a time, each datum being stored M registers apart, the multiple access modes using (M+1)×N groups and R banks of the register array for storage, where (M+1)×N is not greater than P and R is not greater than Q.
  2. The register file of claim 1, further including R enable logic gates, each enable logic gate synchronously enabling the registers of one bank.
  3. The register file of claim 1, further including (M+1)×N write logic gates for selecting to input the N data into N registers of the (M+1)×N groups.
  4. The register file of claim 1, wherein the multiple access modes further include a sequential write mode that writes S data at a time into consecutive registers of the register array, where S is not greater than P.
  5. The register file of claim 1, wherein the multiple access modes further include a sequential read mode that reads T data at a time from consecutive registers of the register array, where T is not greater than P.
  6. The register file of claim 1, wherein the multiple access modes further include an update read mode that reads N data at a time from the register array at intervals of M registers.
  7. A computing apparatus supporting multiple access modes, the multiple access modes including a plurality of skip write modes, the i-th skip write mode writing N_i data at a time, each datum being stored M_i registers apart, the computing apparatus including:
    a register array having P groups and Q banks, the multiple access modes using (M_q+1)×N_q groups and R banks for storage, where (M_q+1)×N_q is not greater than P and R is not greater than Q;
    where (M_q+1)×N_q is the maximum of (M_i+1)×N_i over the plurality of skip write modes.
  8. The computing apparatus of claim 7, further including R enable logic gates, each enable logic gate synchronously enabling the registers of one bank.
  9. The computing apparatus of claim 7, further including (M_q+1)×N_q write logic gates for selecting to input the N_i data into N_i registers of the (M_q+1)×N_q groups.
  10. The computing apparatus of claim 7, wherein the multiple access modes further include a plurality of sequential write modes, the i-th sequential write mode writing S_i data at a time into consecutive registers of the register array, where S_i is not greater than P.
  11. The computing apparatus of claim 7, wherein the multiple access modes further include a plurality of sequential read modes, the i-th sequential read mode reading T_i data at a time from consecutive registers of the register array, where T_i is not greater than P.
  12. The computing apparatus of claim 7, wherein the multiple access modes further include a plurality of update read modes, the i-th update read mode reading N_i data at a time from the register array at intervals of M_i registers.
  13. The computing apparatus of claim 7, further including a control module for identifying (M_q+1)×N_q.
  14. An integrated circuit device including the computing apparatus of any one of claims 7 to 13.
  15. A board card including the integrated circuit device of claim 14.
  16. A method of using a register array to support a skip write mode, the register array having P groups and Q banks, the skip write mode writing N data into the register array at a time, each datum being stored M registers apart, the method including:
    setting (M+1)×N groups and R banks of the register array as a sub-array, where (M+1)×N is not greater than P and R is not greater than Q;
    synchronously enabling the registers of one bank of the sub-array; and
    selecting to input the N data into the N registers of the groups, respectively.
  17. A method of using a register array to support multiple access modes, the register array having P groups and Q banks, the multiple access modes including a plurality of skip write modes, the i-th skip write mode writing N_i data at a time, each datum being stored M_i registers apart, the method including:
    setting (M_q+1)×N_q groups and R banks of the register array as a sub-array, where (M_q+1)×N_q is not greater than P and R is not greater than Q;
    synchronously enabling the registers of one bank of the sub-array; and
    selecting to input the N_i data into N_i registers of the groups, respectively;
    where (M_q+1)×N_q is the maximum of (M_i+1)×N_i over the plurality of skip write modes.
  18. A computer-readable storage medium on which computer program code for using a register array to support access modes is stored; when the computer program code is executed by a processing apparatus, the method of any one of claims 16 to 17 is performed.
PCT/CN2021/119945 2020-11-27 2021-09-23 Device, method and readable storage medium supporting multiple access modes WO2022111013A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011360044.7A 2020-11-27 2020-11-27 Device, method and readable storage medium supporting multiple access modes
CN202011360044.7 2020-11-27

Publications (1)

Publication Number Publication Date
WO2022111013A1 true WO2022111013A1 (zh) 2022-06-02

Family

ID=81712217

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119945 WO2022111013A1 (zh) Device, method and readable storage medium supporting multiple access modes 2020-11-27 2021-09-23

Country Status (2)

Country Link
CN (1) CN114565075A (zh)
WO (1) WO2022111013A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050108503A1 (en) * 2003-11-18 2005-05-19 International Business Machines Corporation Two dimensional addressing of a matrix-vector register array
CN101123113A (zh) * 2007-09-20 2008-02-13 Shanghai Jiao Tong University Access method and control device for synchronous dynamic random access memory
CN101443731A (zh) * 2006-05-26 2009-05-27 VNS Portfolio LLC Circular register arrays of a computer
CN101620524A (zh) * 2009-07-03 2010-01-06 National University of Defense Technology of the Chinese People's Liberation Army Matrix register file supporting whole-matrix read and write operations
CN101667453A (zh) * 2008-09-05 2010-03-10 Atmel Corporation Method and system for accessing memory
CN102012803A (zh) * 2010-11-25 2011-04-13 National University of Defense Technology of the Chinese People's Liberation Army Configurable matrix register unit supporting multi-width SIMD and multi-granularity SIMT
US20190250915A1 (en) * 2016-04-26 2019-08-15 Onnivation, LLC Computing Machine Using a Matrix Space For Matrix and Array Processing
CN110176260A (zh) * 2018-02-21 2019-08-27 Samsung Electronics Co., Ltd. Memory device supporting skip calculation mode and method of operating the same

Also Published As

Publication number Publication date
CN114565075A (zh) 2022-05-31

Similar Documents

Publication Publication Date Title
WO2022161318A1 (zh) Data processing apparatus and method, and related products
WO2023071238A1 (zh) Computational graph compilation and scheduling methods, and related products
CN110059797B (zh) Computing apparatus and related product
CN111047022A (zh) Computing apparatus and related product
CN114580606A (zh) Data processing method and apparatus, computer device, and storage medium
CN113469336A (zh) Compilation method and execution method for optimizing a neural network model, and related products
WO2022111013A1 (zh) Device and method supporting multiple access modes, and readable storage medium
WO2022134873A1 (zh) Data processing apparatus, data processing method, and related products
WO2022095675A1 (zh) Neural network sparsification apparatus and method, and related products
CN113469337B (zh) Compilation method for optimizing a neural network model, and related products
CN116185378A (zh) Computational graph optimization method, data processing method, and related products
CN116185377A (zh) Computational graph optimization method, computing apparatus, and related products
CN115437602A (zh) Arbitrary-precision computation accelerator, integrated circuit device, board card, and method
CN112667227A (zh) Method for visually designing a pipeline, and readable storage medium
CN113791996B (zh) Integrated circuit device, electronic apparatus, board card, and computing method
CN113742266B (zh) Integrated circuit device, electronic apparatus, board card, and computing method
WO2022001457A1 (zh) Computing apparatus, chip, board card, electronic device, and computing method
WO2022001454A1 (zh) Integrated computing apparatus, integrated circuit chip, board card, and computing method
WO2022001499A1 (zh) Computing apparatus, chip, board card, electronic device, and computing method
CN113469365B (zh) Inference and compilation methods based on a neural network model, and related products
WO2022134872A1 (zh) Data processing apparatus, data processing method, and related products
WO2022001498A1 (zh) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
WO2022135599A1 (zh) Apparatus, board card, and method for fusing branch structures, and readable storage medium
WO2023241478A1 (zh) Method and device for analyzing pipeline performance of an artificial intelligence accelerator
WO2023016382A1 (zh) Method for a system on chip, and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896517

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21896517

Country of ref document: EP

Kind code of ref document: A1