WO2017181419A1 - Submatrix operation device and method - Google Patents

Submatrix operation device and method

Info

Publication number
WO2017181419A1
WO2017181419A1 (application PCT/CN2016/080023)
Authority
WO
WIPO (PCT)
Prior art keywords
sub-matrix
data
matrix operation
instruction
Prior art date
Application number
PCT/CN2016/080023
Other languages
English (en)
French (fr)
Inventor
刘少礼
张潇
陈云霁
陈天石
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司 filed Critical 北京中科寒武纪科技有限公司
Priority to EP16899006.7A priority Critical patent/EP3447653A4/en
Priority to PCT/CN2016/080023 priority patent/WO2017181419A1/zh
Publication of WO2017181419A1 publication Critical patent/WO2017181419A1/zh
Priority to US16/167,425 priority patent/US10534841B2/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F17/17: Function evaluation by approximation methods, e.g. inter- or extrapolation, smoothing, least mean square method
    • G06F17/175: Function evaluation by approximation methods of multidimensional data
    • G06F30/00: Computer-aided design [CAD]

Definitions

  • the present invention belongs to the field of computers, and in particular, to a sub-matrix operation device and method for acquiring sub-matrix data from matrix data according to a sub-matrix operation instruction, and performing sub-matrix operations according to the sub-matrix data.
  • More and more algorithms in the computer field involve matrix operations, including artificial neural network algorithms and graphics rendering algorithms.
  • At the same time, as an important component of matrix operations, sub-matrix operations appear more and more frequently in various computing tasks. Solutions aimed at the matrix operation problem must therefore consider both the efficiency and the difficulty of implementing sub-matrix operations.
  • One known prior-art solution for performing sub-matrix operations is to use a general-purpose processor, which executes general-purpose instructions through a general-purpose register file and general-purpose functional units.
  • One of its drawbacks is that a single general-purpose processor is mostly used for scalar computation, so its performance is low when performing sub-matrix operations.
  • When multiple general-purpose processors execute in parallel, the communication between them may become a performance bottleneck.
  • In addition, the amount of code needed to implement sub-matrix operations is larger than that of ordinary matrix operations.
  • In another prior-art solution, sub-matrix calculations are performed on a graphics processing unit (GPU), where sub-matrix operations are carried out by executing general SIMD instructions using a general-purpose register file and general-purpose stream processing units.
  • However, the GPU on-chip cache is too small, so large-scale sub-matrix operations require continuous off-chip data transfers, and the off-chip bandwidth becomes the main performance bottleneck.
  • In yet another prior-art solution, sub-matrix calculations are performed on specially tailored matrix operation devices, using custom register files and custom processing units.
  • However, the existing dedicated matrix operation devices are limited by their register files and cannot flexibly support sub-matrix operations of different lengths.
  • the present invention provides a sub-matrix operation device and method capable of efficiently implementing various sub-matrix operations in conjunction with a sub-matrix operation instruction set.
  • the present invention provides a sub-matrix operation device for acquiring sub-matrix data from matrix data according to a sub-matrix operation instruction, and performing sub-matrix operations according to the sub-matrix data, the apparatus comprising:
  • a storage unit for storing matrix data
  • a register unit for storing sub-matrix information
  • a sub-matrix operation unit configured to acquire a sub-matrix operation instruction, obtain sub-matrix information from the register unit according to the sub-matrix operation instruction, then obtain sub-matrix data from the matrix data in the storage unit according to the sub-matrix information, and then perform a sub-matrix operation on the acquired sub-matrix data to obtain a sub-matrix operation result.
  • the present invention also provides a sub-matrix operation method for obtaining sub-matrix data from matrix data according to a sub-matrix operation instruction, and performing sub-matrix operations according to the sub-matrix data, the method comprising:
  • The sub-matrix operation device provided by the invention temporarily stores the sub-matrix data participating in the calculation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during sub-matrix operations, improving the execution performance of tasks that include a large number of matrix calculations. The instructions adopted by the present invention have a compact format, which makes the instruction set convenient to use and the supported matrix lengths flexible.
  • FIG. 1 is a schematic diagram of a sub-matrix operation device provided by the present invention.
  • FIG. 2 is a schematic diagram of a format of an instruction set provided by the present invention.
  • Figure 3 is a schematic illustration of a submatrix of the present invention.
  • FIG. 4 is a schematic diagram of a sub-matrix operation device according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of a sub-matrix-multiply-vector instruction executed by a sub-matrix operation device according to an embodiment of the present invention.
  • Figure 6 is a schematic diagram of matrix data and sub-matrix data in an embodiment of the present invention.
  • FIG. 7 is a flowchart of performing a convolutional neural network operation by a sub-matrix operation apparatus according to an embodiment of the present invention.
  • the invention provides a sub-matrix operation device and method, comprising a storage unit, a register unit and a sub-matrix operation unit.
  • the storage unit stores sub-matrix data
  • the register unit stores sub-matrix information
  • The sub-matrix operation unit obtains the sub-matrix information from the register unit according to the sub-matrix operation instruction, then obtains the corresponding sub-matrix data from the storage unit according to that information, and then performs the sub-matrix operation on the obtained sub-matrix data to produce the sub-matrix operation result.
  • The invention temporarily stores the sub-matrix data participating in the calculation in the scratchpad memory, so that the sub-matrix operation process can support data of different widths more flexibly and effectively, improving the execution performance of tasks that include a large number of sub-matrix calculations.
  • The scratchpad memory can be implemented with a variety of storage devices (SRAM, DRAM, eDRAM, memristors, 3D-DRAM, non-volatile memory, etc.).
  • FIG. 1 is a schematic diagram of a sub-matrix operation device provided by the present invention. As shown in FIG. 1, the device includes:
  • a storage unit for storing matrix data
  • a register unit for storing sub-matrix information.
  • In a practical application, a register file may be composed of a plurality of register units, each storing different sub-matrix information; it should be noted that the sub-matrix information is all scalar data;
  • a sub-matrix operation unit configured to acquire a sub-matrix operation instruction, obtain sub-matrix information from the register unit according to the instruction, then obtain sub-matrix data from the matrix data in the storage unit according to that information, and then perform a sub-matrix operation on the obtained data to produce the sub-matrix operation result.
  • FIG. 2 is a schematic diagram of the format of the instruction set provided by the present invention. As shown in FIG. 2, the instruction set adopts a Load/Store architecture, so the sub-matrix operation unit does not operate on data in memory.
  • The sub-matrix instruction set adopts a Very Long Instruction Word (VLIW) architecture, and the instructions have a fixed length, which allows the sub-matrix operation device to fetch the next sub-matrix operation instruction during the decoding stage of the previous one.
  • A sub-matrix operation instruction includes an operation code, which indicates the function of the instruction, and operation fields, which indicate the data information of the instruction; the data information is either the number of a register unit or an immediate value. The sub-matrix operation unit accesses the corresponding register unit according to the register-unit number to obtain the sub-matrix information, or it may directly use the immediate value as sub-matrix data in the operation.
  • Sub-matrix-multiply-vector instruction (SMMV): according to this instruction, the device fetches the specified sub-matrix data starting from the specified start address of the scratchpad memory according to the row width, column width and row stride of the sub-matrix given in the instruction, fetches the vector data at the same time, performs the matrix-multiply-vector operation in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory. It is worth noting that a vector can be stored in the scratchpad memory as a special-form matrix (a matrix with only one row of elements).
  • Vector-multiply-sub-matrix instruction (VMSM): according to this instruction, the device fetches vector data from a specified address of the scratchpad memory, fetches the specified sub-matrix according to the sub-matrix start address, row width, column width and row stride given in the instruction, performs the vector-multiply-matrix operation in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory. Again, a vector can be stored in the scratchpad memory as a special-form matrix (a matrix with only one row of elements).
  • Sub-matrix-multiply-scalar instruction (SMMS): according to this instruction, the device fetches the specified sub-matrix data from the specified address of the scratchpad memory according to the row width, column width and row stride of the sub-matrix given in the instruction, fetches the specified scalar data from the specified address of the scalar register file, performs the sub-matrix-multiply-scalar operation in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory. It should be noted that the scalar register file stores not only the sub-matrix information (including the start address, row width, column width and row stride) but also the scalar data itself.
  • Tensor operation instruction (TENS): according to this instruction, the device fetches the two specified blocks of sub-matrix data from the scratchpad memory, performs a tensor operation on them in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory.
  • Sub-matrix addition instruction (SMA): according to this instruction, the device fetches the two specified blocks of sub-matrix data from the scratchpad memory, adds them in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory.
  • Sub-matrix subtraction instruction (SMS): according to this instruction, the device fetches the two specified blocks of sub-matrix data from the scratchpad memory, subtracts one from the other in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory.
  • Sub-matrix multiplication instruction (SMM): according to this instruction, the device fetches the two specified blocks of sub-matrix data from the scratchpad memory, performs an element-wise multiplication on them in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory.
  • Convolution instruction (CONV): according to this instruction, a matrix is convolution-filtered with a convolution kernel. The device fetches the specified convolution kernel matrix from the scratchpad memory and, starting from the start address of the matrix to be convolved, filters the sub-matrix data covered by the kernel at the current position: the kernel and the sub-matrix are multiplied element-wise in the matrix operation unit, the elements of the resulting matrix are summed to obtain the filtering result for the current position, and the result is written back to the designated address of the scratchpad memory. Then, according to the displacement parameter given in the instruction, the device moves to the next position on the matrix to be convolved and repeats the above operation until it reaches the end position (a minimal model of this loop is sketched below).
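To make the CONV semantics concrete, the following is a minimal Python sketch of the filtering loop just described. It is an illustrative software model only, not the hardware implementation; the function name conv_filter and the use of NumPy arrays are assumptions made for the example.

```python
import numpy as np

def conv_filter(src: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Model of the CONV instruction: slide `kernel` over `src`, and at each
    position element-wise multiply the covered sub-matrix by the kernel and
    sum the products to produce one output element."""
    kh, kw = kernel.shape
    sh, sw = src.shape
    out_h = (sh - kh) // stride + 1
    out_w = (sw - kw) // stride + 1
    out = np.empty((out_h, out_w), dtype=src.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = src[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)  # element-wise multiply, then sum
    return out

# Example: a 3x3 kernel filtering a 5x5 matrix with displacement parameter 1.
result = conv_filter(np.arange(25.0).reshape(5, 5), np.ones((3, 3)), stride=1)
print(result.shape)  # (3, 3)
```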
  • Sub-matrix move instruction (SMMOVE): according to this instruction, the device stores the designated sub-matrix held in the scratchpad memory to another address of the scratchpad memory.
  • The sub-matrix information stored in a register unit includes the start address (start_addr) of the sub-matrix data in the storage unit, the row width (iter1) of the sub-matrix data, the column width (iter2) of the sub-matrix data, and the row stride (stride1), where the row stride refers to the interval between the last element of one row of the sub-matrix data and the first element of the next row.
  • As shown in FIG. 3, the matrix data is actually stored in the storage unit in a one-dimensional manner. The start address of the sub-matrix is the address of its top-left element in FIG. 3, the row width of the sub-matrix is the number of elements in each of its rows, the column width is the number of elements in each of its columns, and the row stride is the address distance between the last element of one row of the sub-matrix and the first element of its next row.
  • When actually reading the sub-matrix data, the device therefore only needs to start from start_addr, read iter1 elements, skip stride1 elements, read the next iter1 elements, and repeat this iter2 times to obtain the complete sub-matrix data, as illustrated in the sketch below.
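As a concrete illustration of this addressing scheme, the sketch below reads a sub-matrix out of a flat one-dimensional buffer using the four scalars held in a register unit. The helper name read_submatrix is invented for the example; the patent specifies only the addressing pattern, not any software API.

```python
def read_submatrix(storage, start_addr, iter1, iter2, stride1):
    """Read a sub-matrix stored row-by-row inside a flat 1-D buffer.

    start_addr: offset of the sub-matrix's top-left element
    iter1:      row width (elements read per row)
    iter2:      column width (number of rows)
    stride1:    elements skipped between the end of one row and the
                start of the next
    """
    rows, addr = [], start_addr
    for _ in range(iter2):
        rows.append(storage[addr:addr + iter1])  # read iter1 elements
        addr += iter1 + stride1                  # skip stride1 elements
    return rows

# A 4x4 matrix stored flat; extract the 2x2 sub-matrix at its centre.
flat = list(range(16))  # matrix rows: [0..3], [4..7], [8..11], [12..15]
print(read_submatrix(flat, start_addr=5, iter1=2, iter2=2, stride1=2))
# [[5, 6], [9, 10]]
```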
  • The sub-matrix operation device further includes an instruction processing unit configured to acquire a sub-matrix operation instruction, process it, and provide it to the sub-matrix operation unit.
  • Specifically, the instruction processing unit includes an instruction fetch module, a decoding module, an instruction queue and a dependency processing unit. The fetch module acquires the sub-matrix operation instruction, the decoding module decodes it, and the instruction queue stores the decoded sub-matrix operation instructions in order. Before the sub-matrix operation unit receives a sub-matrix operation instruction, the dependency processing unit determines whether that instruction accesses the same sub-matrix data as the previous sub-matrix operation instruction; if so, the instruction is stored in a storage queue and is provided to the sub-matrix operation unit only after the previous instruction has finished executing; otherwise, it is provided to the sub-matrix operation unit directly (one way to model this check is sketched below).
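One simple way to model the dependency check is to compare the sub-matrix operands of consecutive instructions. The patent only states that instructions accessing the same sub-matrix data must wait, so the footprint-overlap test below (derived from the start address and addressing scalars) is an assumption made for illustration, as are the names SubMatrixRef and conflicts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubMatrixRef:
    start_addr: int
    iter1: int   # row width
    iter2: int   # column width
    stride1: int # inter-row skip

    def footprint(self):
        """Half-open address range touched in the flat storage."""
        end = self.start_addr + self.iter2 * (self.iter1 + self.stride1) - self.stride1
        return self.start_addr, end

def conflicts(a: SubMatrixRef, b: SubMatrixRef) -> bool:
    """Conservative test: treat any overlap of the two footprints as a
    dependency, so the later instruction waits in the storage queue."""
    a_lo, a_hi = a.footprint()
    b_lo, b_hi = b.footprint()
    return a_lo < b_hi and b_lo < a_hi

prev = SubMatrixRef(start_addr=0, iter1=4, iter2=4, stride1=0)
curr = SubMatrixRef(start_addr=10, iter1=2, iter2=2, stride1=2)
print(conflicts(prev, curr))  # True: ranges [0, 16) and [10, 16) overlap
```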
  • The storage unit is further configured to store the sub-matrix operation result; preferably, a scratchpad memory may be used as the storage unit.
  • The present invention further includes an input/output unit directly connected to the storage unit; the input/output unit is configured to store matrix data into the storage unit, or to obtain the sub-matrix operation result from the storage unit.
  • the sub-matrix operation unit includes a sub-matrix addition unit, a sub-matrix multiplication unit, a size comparison unit, a nonlinear operation unit, and a sub-matrix scalar multiplication unit.
  • The sub-matrix operation unit has a multi-pipeline-stage structure comprising a first, a second and a third pipeline stage, where the sub-matrix addition component and the sub-matrix multiplication component are at the first stage, the size comparison component is at the second stage, and the nonlinear operation component and the sub-matrix scalar multiplication component are at the third stage.
  • The invention also provides a sub-matrix operation method, comprising: S1, storing matrix data; S2, storing sub-matrix information; S3, acquiring a sub-matrix operation instruction, obtaining sub-matrix information according to the instruction, then obtaining sub-matrix data from the stored matrix data according to that information, and then performing a sub-matrix operation on the obtained data to produce the sub-matrix operation result.
  • Before step S3, the method further includes: acquiring a sub-matrix operation instruction; decoding the acquired sub-matrix operation instruction; and determining whether the sub-matrix operation instruction accesses the same sub-matrix data as the previous sub-matrix operation instruction. If so, the sub-matrix operation instruction is stored in a storage queue, and step S3 is performed only after the previous sub-matrix operation instruction has finished executing; otherwise, step S3 is performed directly.
  • step S3 further includes storing the sub-matrix operation result.
  • the method further includes: Step S4, acquiring the stored sub-matrix operation result.
  • sub-matrix operations include sub-matrix addition, sub-matrix multiplication, size comparison, nonlinear operation, and sub-matrix scalar multiplication.
  • A multi-pipeline-stage structure is used for the sub-matrix operations; it comprises a first, a second and a third pipeline stage, where sub-matrix addition and sub-matrix multiplication are performed at the first stage, size comparison at the second stage, and nonlinear operations and sub-matrix scalar multiplication at the third stage.
  • The device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a storage queue, a matrix operation unit, a scratchpad memory, and an IO memory access module, where:
  • the instruction fetch module is responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;
  • the decoding module is responsible for decoding the instruction and passing the decoded instruction to the instruction queue;
  • the instruction queue buffers decoded instructions, considering that different instructions may have dependencies on the scalar registers they reference, and issues an instruction once its dependencies are satisfied;
  • a scalar register file that provides multiple scalar registers required by the device during operation
  • the dependency processing unit handles the storage dependencies an instruction may have on the previous instruction. A matrix operation instruction accesses the scratchpad memory, and consecutive instructions may access the same block of memory. To guarantee the correctness of the execution results, if the current instruction is detected to have a data dependency on a previous instruction, it must wait in the storage queue until the dependency is eliminated;
  • the storage queue is an ordered queue in which instructions that have a data dependency on a previous instruction are held until the dependency is eliminated;
  • the matrix operation unit is responsible for all sub-matrix operations of the device, including but not limited to sub-matrix addition, sub-matrix plus scalar, sub-matrix subtraction, sub-matrix minus scalar, sub-matrix multiplication, sub-matrix times scalar, sub-matrix division (element-wise division), sub-matrix AND, and sub-matrix OR; sub-matrix operation instructions are sent to this unit for execution;
  • the scratchpad memory is a temporary storage device dedicated to matrix data, capable of supporting matrix data of different sizes;
  • the IO memory access module is used to access the scratchpad memory directly and is responsible for reading data from and writing data to it.
  • FIG. 5 is a flowchart of a sub-matrix-multiply-vector instruction executed by the matrix operation device according to an embodiment of the present invention. As shown in FIG. 5, the process of executing the instruction includes:
  • S1: the fetch module fetches the sub-matrix-multiply-vector instruction and sends it to the decoding module.
  • S2: the decoding module decodes the instruction and sends it to the instruction queue.
  • S3: in the instruction queue, the sub-matrix-multiply-vector instruction needs to obtain, from the scalar register file, the data in the scalar registers corresponding to the operation fields of the instruction, including the input vector address, input vector length, input sub-matrix address, input sub-matrix row width, input sub-matrix column width, input sub-matrix row stride, output vector address, and output vector length.
  • S4: after the required scalar data is obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any previous instruction that has not yet finished executing. If so, the instruction waits in the storage queue until the dependency no longer exists.
  • S5: once no dependency exists, the sub-matrix-multiply-vector instruction is sent to the matrix operation unit, which fetches the required sub-matrix and vector data from the scratchpad memory according to the position information of the required data and completes the multiplication.
  • S6: after the operation completes, the result is written back to the specified address of the scratchpad memory.
  • FIG. 7 is a flowchart of a method for performing a convolutional neural network operation with the matrix operation unit according to an embodiment of the present invention; the method is implemented mainly by sub-matrix operation instructions.
  • The computational characteristic of a convolutional neural network is as follows: for a feature-image input of scale n×y×x (where n is the number of input feature images, y is the feature-image height and x is the feature-image width), there is a convolution kernel of scale n×h×w. The kernel moves continuously over the input image, and at each position it is convolved with the input-image data it covers to obtain the value of one corresponding point of the output image.
  • Given this characteristic, the convolutional neural network can be implemented by looping over a single sub-matrix convolution instruction.
  • In the actual storage, the data is expanded along the image-count dimension: the input data image changes from a three-dimensional n×y×x array into a two-dimensional y×(x×n) matrix, and likewise the convolution kernel data becomes a two-dimensional h×(w×n) matrix, as sketched below.
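The re-layout can be expressed with a transpose and reshape. The NumPy sketch below is one plausible reading of the described n×y×x to y×(x×n) expansion (interleaving the n feature maps along the row direction); the exact element order is an assumption, since the patent does not mandate a specific layout.

```python
import numpy as np

n, y, x = 3, 5, 5  # feature maps, height, width
h, w = 2, 2        # kernel height and width

inputs = np.random.rand(n, y, x)
kernels = np.random.rand(n, h, w)

# Expand along the image-count dimension: n*y*x -> y*(x*n), n*h*w -> h*(w*n).
# Channels are interleaved along the row direction (one plausible layout).
in2d = inputs.transpose(1, 2, 0).reshape(y, x * n)
k2d = kernels.transpose(1, 2, 0).reshape(h, w * n)

# With this layout, one CONV position covers all n channels at once, and a
# horizontal displacement of n elements corresponds to moving by one pixel.
window = in2d[0:h, 0:w * n]  # kernel coverage at the origin
out_00 = np.sum(window * k2d)  # one output value
print(np.isclose(out_00, np.sum(inputs[:, 0:h, 0:w] * kernels)))  # True
```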
  • The process of implementing a convolutional neural network then includes:
  • S1: the matrix data to be convolved and the convolution kernel matrix data are stored at specified addresses of the matrix-dedicated scratchpad memory through IO instructions.
  • S2: the decoder fetches a CONV operation instruction, and according to this instruction the matrix operation unit reads from the scratchpad memory the convolution kernel matrix data and the sub-matrix data covered by the kernel at the start position of the input image.
  • S3: the two blocks of matrix data are multiplied element-wise and the elements accumulated in the matrix operation unit, and the result is written back. The matrix operation unit then continues to read in the convolution kernel and, at the same time, reads the data at the start address of the next sub-matrix to be convolved, obtained from the displacement parameter in the instruction.
  • S4: during the execution of the CONV instruction, the above process loops until the convolution at the last position of the matrix to be convolved is completed.
  • S5: the resulting convolved matrix is stored off-chip by an IO instruction.
  • It should be stated that this embodiment adopts a particularly efficient way of implementing the convolution operation, namely expanding both the three-dimensional input image and the convolution kernel into two-dimensional form. In fact, this is not the only way in which the apparatus and method of the present invention can implement a convolution operation.
  • A more general method is to convolve each two-dimensional input image with the corresponding plane of the convolution kernel via sub-matrix instructions, obtaining one partial sum of the output; the final convolution result is the accumulation of the partial sums obtained by convolving all the two-dimensional images with their corresponding kernel planes. Sub-matrix operation instructions can therefore implement convolution in a variety of ways, as sketched below.
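The more general scheme reads naturally as an accumulation of per-plane 2-D convolutions. The sketch below models it in plain Python with NumPy; it is an illustrative model under the assumption that each plane is processed by one sub-matrix convolution, and the names conv2d and conv_by_partial_sums are invented for the example.

```python
import numpy as np

def conv2d(img: np.ndarray, ker: np.ndarray) -> np.ndarray:
    """Plain 2-D valid convolution (element-wise multiply + sum per window)."""
    kh, kw = ker.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * ker)
    return out

def conv_by_partial_sums(inputs: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Convolve each 2-D input plane with its corresponding kernel plane,
    then accumulate the partial sums into the final output."""
    acc = None
    for img, ker in zip(inputs, kernels):  # one plane at a time
        partial = conv2d(img, ker)         # partial sum for this plane
        acc = partial if acc is None else acc + partial
    return acc

inputs = np.random.rand(3, 5, 5)   # n = 3 feature maps
kernels = np.random.rand(3, 2, 2)  # matching kernel planes
print(conv_by_partial_sums(inputs, kernels).shape)  # (4, 4)
```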
  • In summary, the present invention provides a matrix operation device and, together with the corresponding sub-matrix operation instruction set, can well address the fact that more and more algorithms in the computer field contain large numbers of sub-matrix operations. Compared with existing traditional solutions, the invention offers a streamlined instruction set, convenient use, flexible sub-matrix sizes, and sufficient on-chip buffering.
  • The present invention can be applied to a variety of computational tasks involving large numbers of sub-matrix operations, including the back-propagation training and forward prediction of artificial neural network algorithms, which currently perform remarkably well.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Architecture (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

A submatrix operation device and method. The device comprises a storage unit, a register unit and a sub-matrix operation unit; the storage unit stores sub-matrix data, the register unit stores sub-matrix information, and the sub-matrix operation unit obtains sub-matrix information from the register unit according to a sub-matrix operation instruction, then obtains the corresponding sub-matrix data from the storage unit according to that information, and then performs a sub-matrix operation on the obtained sub-matrix data to produce the sub-matrix operation result. The method temporarily stores the sub-matrix data participating in the calculation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during sub-matrix operations, improving the execution performance of tasks that include a large number of sub-matrix calculations.

Description

Submatrix operation device and method

Technical field

The present invention belongs to the field of computers, and in particular relates to a sub-matrix operation device and method for obtaining sub-matrix data from matrix data according to a sub-matrix operation instruction and performing sub-matrix operations on that sub-matrix data.

Background

More and more algorithms in the computer field involve matrix operations, including artificial neural network algorithms and graphics rendering algorithms. At the same time, as an important component of matrix operations, sub-matrix operations appear more and more frequently in various computing tasks, so solutions aimed at the matrix operation problem must also consider the efficiency and difficulty of implementing sub-matrix operations.

One known prior-art solution for performing sub-matrix operations is to use a general-purpose processor, which executes general-purpose instructions through a general-purpose register file and general-purpose functional units. However, one drawback of this method is that a single general-purpose processor is mostly used for scalar computation, so its performance is low when performing sub-matrix operations. When multiple general-purpose processors execute in parallel, the communication between them may become a performance bottleneck, and the amount of code needed to implement sub-matrix operations is also larger than that of ordinary matrix operations.

In another prior-art solution, a graphics processing unit (GPU) is used for sub-matrix calculations, where sub-matrix operations are performed by executing general SIMD instructions using a general-purpose register file and general-purpose stream processing units. However, the GPU on-chip cache is too small, so large-scale sub-matrix operations require continuous off-chip data transfers, and the off-chip bandwidth becomes the main performance bottleneck.

In yet another prior-art solution, specially tailored matrix operation devices are used for sub-matrix calculations, with custom register files and custom processing units. However, the existing dedicated matrix operation devices are limited by their register files and cannot flexibly support sub-matrix operations of different lengths.

In summary, neither on-chip multi-core general-purpose processors, nor inter-chip interconnected general-purpose processors (single-core or multi-core), nor inter-chip interconnected graphics processors can perform sub-matrix operations efficiently, and when handling sub-matrix operation problems these existing technologies suffer from large code size, limitation by inter-chip communication, insufficient on-chip cache, and insufficiently flexible supported sub-matrix sizes.
Summary of the invention

(1) Technical problem to be solved

The present invention provides a sub-matrix operation device and method that, together with a sub-matrix operation instruction set, can efficiently implement various sub-matrix operations.

(2) Technical solution

The present invention provides a sub-matrix operation device for obtaining sub-matrix data from matrix data according to a sub-matrix operation instruction and performing sub-matrix operations on that data. The device comprises:

a storage unit for storing matrix data;

a register unit for storing sub-matrix information;

a sub-matrix operation unit for acquiring a sub-matrix operation instruction, obtaining sub-matrix information from the register unit according to the instruction, then obtaining sub-matrix data from the matrix data in the storage unit according to that information, and then performing a sub-matrix operation on the obtained data to produce the sub-matrix operation result.

The present invention also provides a sub-matrix operation method for obtaining sub-matrix data from matrix data according to a sub-matrix operation instruction and performing sub-matrix operations on that data. The method comprises:

S1: storing matrix data;

S2: storing sub-matrix information;

S3: acquiring a sub-matrix operation instruction, obtaining sub-matrix information according to the instruction, then obtaining sub-matrix data from the stored matrix data according to that information, and then performing a sub-matrix operation on the obtained data to produce the sub-matrix operation result.

(3) Beneficial effects

The sub-matrix operation device provided by the invention temporarily stores the sub-matrix data participating in the calculation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during sub-matrix operations, improving the execution performance of tasks that include a large number of matrix calculations. The instructions adopted by the present invention have a compact format, which makes the instruction set convenient to use and the supported matrix lengths flexible.
Brief description of the drawings

FIG. 1 is a schematic diagram of the sub-matrix operation device provided by the present invention.

FIG. 2 is a schematic diagram of the instruction set format provided by the present invention.

FIG. 3 is a schematic diagram of a sub-matrix of the present invention.

FIG. 4 is a schematic diagram of a sub-matrix operation device according to an embodiment of the present invention.

FIG. 5 is a flowchart of a sub-matrix-multiply-vector instruction executed by the sub-matrix operation device according to an embodiment of the present invention.

FIG. 6 is a schematic diagram of matrix data and sub-matrix data in an embodiment of the present invention.

FIG. 7 is a flowchart of a convolutional neural network operation executed by the sub-matrix operation device according to an embodiment of the present invention.
Detailed description

The present invention provides a sub-matrix operation device and method comprising a storage unit, a register unit and a sub-matrix operation unit. The storage unit stores sub-matrix data, the register unit stores sub-matrix information, and the sub-matrix operation unit obtains sub-matrix information from the register unit according to a sub-matrix operation instruction, then obtains the corresponding sub-matrix data from the storage unit according to that information, and then performs a sub-matrix operation on the obtained sub-matrix data to produce the sub-matrix operation result. The present invention temporarily stores the sub-matrix data participating in the calculation in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during sub-matrix operations, improving the execution performance of tasks that include a large number of sub-matrix calculations. The scratchpad memory can be implemented with a variety of storage devices (SRAM, DRAM, eDRAM, memristors, 3D-DRAM, non-volatile memory, etc.).
FIG. 1 is a schematic diagram of the sub-matrix operation device provided by the present invention. As shown in FIG. 1, the device includes:

a storage unit for storing matrix data;

a register unit for storing sub-matrix information; in a practical application, a register file may be composed of a plurality of register units, each storing different sub-matrix information; it should be noted that the sub-matrix information is all scalar data;

a sub-matrix operation unit for acquiring a sub-matrix operation instruction, obtaining sub-matrix information from the register unit according to the instruction, then obtaining sub-matrix data from the matrix data in the storage unit according to that information, and then performing a sub-matrix operation on the obtained data to produce the sub-matrix operation result.

FIG. 2 is a schematic diagram of the instruction set format provided by the present invention. As shown in FIG. 2, the instruction set adopts a Load/Store architecture, so the sub-matrix operation unit does not operate on data in memory. The sub-matrix instruction set adopts a Very Long Instruction Word (VLIW) architecture, and the instructions have a fixed length, so that the sub-matrix operation device can fetch the next sub-matrix operation instruction during the decoding stage of the previous one. A sub-matrix operation instruction includes an operation code and several operation fields, where the operation code indicates the function of the instruction and the operation fields indicate its data information; the data information is either the number of a register unit or an immediate value. The sub-matrix operation unit accesses the corresponding register unit according to the register-unit number to obtain the sub-matrix information, or it may directly use the immediate value as sub-matrix data in the operation.
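As an illustration of the opcode/operation-field split, the sketch below decodes a toy fixed-length instruction word whose fields hold register-unit numbers. The field widths, opcode values and bit layout are invented for the example; the patent fixes only the general format, not a bit-level encoding.

```python
# Toy fixed-length encoding: an 8-bit opcode followed by four 8-bit operation
# fields, each holding a register-unit number (or an immediate).  The widths
# are illustrative assumptions; the patent does not fix a bit-level layout.
OPCODES = {0x01: "SMMV", 0x02: "VMSM", 0x03: "SMMS", 0x08: "CONV"}

def decode(word: int):
    opcode = (word >> 32) & 0xFF
    fields = [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return OPCODES[opcode], fields

# Register file mapping register-unit numbers to sub-matrix information.
register_file = {5: {"start_addr": 0, "iter1": 4, "iter2": 4, "stride1": 0}}

name, fields = decode((0x01 << 32) | (5 << 24) | (9 << 16) | (2 << 8) | 7)
print(name, fields)              # SMMV [5, 9, 2, 7]
print(register_file[fields[0]])  # sub-matrix info read via register number
```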
It should be noted that operation instructions with different functions have different operation codes. Specifically, the instruction set provided by the present invention includes sub-matrix operation instructions with the following functions:

Sub-matrix-multiply-vector instruction (SMMV): according to this instruction, the device fetches the specified sub-matrix data starting from the specified start address of the scratchpad memory according to the row width, column width and row stride of the sub-matrix given in the instruction, fetches the vector data at the same time, performs the matrix-multiply-vector operation in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory. It is worth noting that a vector can be stored in the scratchpad memory as a special-form matrix (a matrix with only one row of elements).
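A compact functional model of SMMV is shown below: the sub-matrix is gathered from a flat scratchpad using the register-held addressing scalars and multiplied by the vector. The function name smmv and the list-based scratchpad are assumptions made for the sketch; the real instruction operates on the hardware scratchpad memory.

```python
import numpy as np

def smmv(scratchpad, start_addr, iter1, iter2, stride1, vec_addr, vec_len, out_addr):
    """Functional model of the sub-matrix-multiply-vector instruction:
    gather the sub-matrix, multiply it by the vector, write the result back."""
    rows, addr = [], start_addr
    for _ in range(iter2):                # gather iter2 rows
        rows.append(scratchpad[addr:addr + iter1])
        addr += iter1 + stride1           # hop over stride1 elements
    sub = np.array(rows)                  # iter2 x iter1 sub-matrix
    vec = np.array(scratchpad[vec_addr:vec_addr + vec_len])
    result = sub @ vec                    # matrix-vector product
    scratchpad[out_addr:out_addr + iter2] = result.tolist()
    return result

pad = list(range(16)) + [1.0, 2.0] + [0.0] * 4  # matrix, vector, output space
print(smmv(pad, start_addr=5, iter1=2, iter2=2, stride1=2,
           vec_addr=16, vec_len=2, out_addr=18))  # [17. 29.]
```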
Vector-multiply-sub-matrix instruction (VMSM): according to this instruction, the device fetches vector data from a specified address of the scratchpad memory, fetches the specified sub-matrix according to the sub-matrix start address, row width, column width and row stride given in the instruction, performs the vector-multiply-matrix operation in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory. Again, a vector can be stored in the scratchpad memory as a special-form matrix (a matrix with only one row of elements).

Sub-matrix-multiply-scalar instruction (SMMS): according to this instruction, the device fetches the specified sub-matrix data from the specified address of the scratchpad memory according to the row width, column width and row stride of the sub-matrix given in the instruction, fetches the specified scalar data from the specified address of the scalar register file, performs the sub-matrix-multiply-scalar operation in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory. It should be noted that the scalar register file stores not only the sub-matrix information (including the start address, row width, column width and row stride) but also the scalar data itself.

Tensor operation instruction (TENS): according to this instruction, the device fetches the two specified blocks of sub-matrix data from the scratchpad memory, performs a tensor operation on them in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory.

Sub-matrix addition instruction (SMA): according to this instruction, the device fetches the two specified blocks of sub-matrix data from the scratchpad memory, adds them in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory.

Sub-matrix subtraction instruction (SMS): according to this instruction, the device fetches the two specified blocks of sub-matrix data from the scratchpad memory, subtracts one from the other in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory.

Sub-matrix multiplication instruction (SMM): according to this instruction, the device fetches the two specified blocks of sub-matrix data from the scratchpad memory, performs an element-wise multiplication on them in the matrix operation unit, and writes the result back to the specified address of the scratchpad memory.

Convolution instruction (CONV): according to this instruction, a matrix is convolution-filtered with a convolution kernel. The device fetches the specified convolution kernel matrix from the scratchpad memory and, starting from the start address of the matrix to be convolved, filters the sub-matrix data covered by the kernel at the current position: the kernel and the sub-matrix are multiplied element-wise in the matrix operation unit, the elements of the resulting matrix are summed to obtain the filtering result for the current position, and the result is written back to the specified address of the scratchpad memory. Then, according to the displacement parameter given in the instruction, the device moves to the next position on the matrix to be convolved and repeats the above operation until it reaches the end position.

Sub-matrix move instruction (SMMOVE): according to this instruction, the device stores the designated sub-matrix held in the scratchpad memory to another address of the scratchpad memory.

In addition, the sub-matrix information stored in a register unit includes the start address (start_addr) of the sub-matrix data in the storage unit, the row width (iter1) of the sub-matrix data, the column width (iter2) of the sub-matrix data, and the row stride (stride1), where the row stride refers to the interval, between two adjacent rows of the sub-matrix data, from the last element of the previous row to the first element of the next row. As shown in FIG. 3, the matrix data is actually stored in the storage unit in a one-dimensional manner; the start address of the sub-matrix is the address of its top-left element in FIG. 3, the row width is the number of elements in each of its rows, the column width is the number of elements in each of its columns, and the row stride is the address distance between the last element of one row and the first element of the next. When actually reading the sub-matrix data, the device therefore only needs to start from start_addr, read iter1 elements, skip stride1 elements, read the next iter1 elements, and repeat this iter2 times to obtain the complete sub-matrix data.

Further, the sub-matrix operation device also includes an instruction processing unit for acquiring a sub-matrix operation instruction, processing it, and providing it to the sub-matrix operation unit. Specifically, the instruction processing unit includes an instruction fetch module, a decoding module, an instruction queue and a dependency processing unit. The fetch module acquires the sub-matrix operation instruction, the decoding module decodes it, and the instruction queue stores the decoded sub-matrix operation instructions in order. Before the sub-matrix operation unit receives a sub-matrix operation instruction, the dependency processing unit determines whether that instruction accesses the same sub-matrix data as the previous sub-matrix operation instruction; if so, the instruction is stored in a storage queue and is provided to the sub-matrix operation unit only after the previous instruction has finished executing; otherwise, it is provided to the sub-matrix operation unit directly.

Further, the storage unit is also used to store the sub-matrix operation result; preferably, a scratchpad memory may be used as the storage unit. In addition, the present invention also includes an input/output unit directly connected to the storage unit; the input/output unit is used to store matrix data into the storage unit, or to obtain the sub-matrix operation result from the storage unit.

Further, the sub-matrix operation unit includes a sub-matrix addition component, a sub-matrix multiplication component, a size comparison component, a nonlinear operation component, and a sub-matrix scalar multiplication component. Moreover, the sub-matrix operation unit has a multi-pipeline-stage structure comprising a first, a second and a third pipeline stage, where the sub-matrix addition component and the sub-matrix multiplication component are at the first stage, the size comparison component is at the second stage, and the nonlinear operation component and the sub-matrix scalar multiplication component are at the third stage.
The present invention also provides a sub-matrix operation method, comprising:

S1: storing matrix data;

S2: storing sub-matrix information;

S3: acquiring a sub-matrix operation instruction, obtaining sub-matrix information according to the instruction, then obtaining sub-matrix data from the stored matrix data according to that information, and then performing a sub-matrix operation on the obtained data to produce the sub-matrix operation result.

Further, before step S3, the method also includes:

acquiring a sub-matrix operation instruction;

decoding the acquired sub-matrix operation instruction;

determining whether the sub-matrix operation instruction accesses the same sub-matrix data as the previous sub-matrix operation instruction; if so, the sub-matrix operation instruction is stored in a storage queue, and step S3 is performed only after the previous sub-matrix operation instruction has finished executing; otherwise, step S3 is performed directly.

Further, step S3 also includes storing the sub-matrix operation result.

Further, the method also includes step S4: obtaining the stored sub-matrix operation result.

Further, the sub-matrix operations include sub-matrix addition, sub-matrix multiplication, size comparison, nonlinear operations and sub-matrix scalar multiplication. Moreover, a multi-pipeline-stage structure is used for the sub-matrix operations; it comprises a first, a second and a third pipeline stage, where sub-matrix addition and sub-matrix multiplication are performed at the first stage, size comparison at the second stage, and nonlinear operations and sub-matrix scalar multiplication at the third stage.

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further described in detail below with reference to specific embodiments and the accompanying drawings.
FIG. 4 is a schematic diagram of a sub-matrix operation device according to an embodiment of the present invention. As shown in FIG. 4, the device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a storage queue, a matrix operation unit, a scratchpad memory, and an IO memory access module, where:

the instruction fetch module is responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;

the decoding module is responsible for decoding the instruction and passing the decoded instruction to the instruction queue;

the instruction queue buffers decoded instructions, considering that different instructions may have dependencies on the scalar registers they reference, and issues an instruction once its dependencies are satisfied;

the scalar register file provides the scalar registers required by the device during operation;

the dependency processing unit handles the storage dependencies an instruction may have on the previous instruction. A matrix operation instruction accesses the scratchpad memory, and consecutive instructions may access the same block of memory. To guarantee the correctness of the execution results, if the current instruction is detected to have a data dependency on a previous instruction, it must wait in the storage queue until the dependency is eliminated;

the storage queue is an ordered queue in which instructions that have a data dependency on a previous instruction are held until the dependency is eliminated;

the matrix operation unit is responsible for all sub-matrix operations of the device, including but not limited to sub-matrix addition, sub-matrix plus scalar, sub-matrix subtraction, sub-matrix minus scalar, sub-matrix multiplication, sub-matrix times scalar, sub-matrix division (element-wise division), sub-matrix AND, and sub-matrix OR; sub-matrix operation instructions are sent to this unit for execution;

the scratchpad memory is a temporary storage device dedicated to matrix data, capable of supporting matrix data of different sizes;

the IO memory access module is used to access the scratchpad memory directly and is responsible for reading data from and writing data to it.
FIG. 5 is a flowchart of a sub-matrix-multiply-vector instruction executed by the matrix operation device according to an embodiment of the present invention. As shown in FIG. 5, the process of executing the sub-matrix-multiply-vector instruction includes:

S1: the fetch module fetches the sub-matrix-multiply-vector instruction and sends it to the decoding module.

S2: the decoding module decodes the instruction and sends it to the instruction queue.

S3: in the instruction queue, the sub-matrix-multiply-vector instruction needs to obtain, from the scalar register file, the data in the scalar registers corresponding to the operation fields of the instruction, including the input vector address, input vector length, input sub-matrix address, input sub-matrix row width, input sub-matrix column width, input sub-matrix row stride, output vector address, and output vector length.

S4: after the required scalar data is obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any previous instruction that has not yet finished executing. If so, the instruction waits in the storage queue until the dependency no longer exists.

S5: once no dependency exists, the sub-matrix-multiply-vector instruction is sent to the matrix operation unit, which fetches the required sub-matrix and vector data from the scratchpad memory according to the position information of the required data and completes the multiplication in the matrix operation unit.

S6: after the operation completes, the result is written back to the specified address of the scratchpad memory.
FIG. 7 is a flowchart of a method for performing a convolutional neural network operation with the matrix operation unit according to an embodiment of the present invention; the method is implemented mainly by sub-matrix operation instructions. The computational characteristic of a convolutional neural network is as follows: for a feature-image input of scale n×y×x (where n is the number of input feature images, y is the feature-image height and x is the feature-image width), there is a convolution kernel of scale n×h×w; the kernel moves continuously over the input image, and at each position it is convolved with the input-image data it covers to obtain the value of one corresponding point of the output image. Given this characteristic, the convolutional neural network can be implemented by looping over a single sub-matrix convolution instruction. In the actual storage, as shown in FIG. 6, the data is expanded along the image-count dimension: the input data image changes from a three-dimensional n×y×x array into a two-dimensional y×(x×n) matrix, and likewise the convolution kernel data becomes a two-dimensional h×(w×n) matrix. As shown in FIG. 7, the process of implementing the convolutional neural network includes:

S1: the matrix data to be convolved and the convolution kernel matrix data are stored at the specified addresses of the matrix-dedicated scratchpad memory through IO instructions;

S2: the decoder fetches a CONV operation instruction, and according to this instruction the matrix operation unit reads from the scratchpad memory the convolution kernel matrix data and the sub-matrix data covered by the kernel at the start position of the input image.

S3: the two blocks of matrix data are multiplied element-wise and the elements accumulated in the matrix operation unit, and the result is written back. The matrix operation unit then continues to read in the convolution kernel and, at the same time, reads the data at the start address of the next sub-matrix to be convolved, obtained from the displacement parameter in the instruction.

S4: during the execution of the CONV instruction, the above process loops until the convolution at the last position of the matrix to be convolved is completed.

S5: the resulting convolved matrix is stored off-chip by an IO instruction.

It should be stated that this embodiment adopts a particularly efficient way of implementing the convolution operation, namely expanding both the three-dimensional input image and the convolution kernel into two-dimensional form. In fact, this is not the only way in which the device and method of the present invention can implement a convolution operation: a more general method is to convolve each two-dimensional input image with the corresponding plane of the convolution kernel via sub-matrix instructions, obtaining one partial sum of the output; the final convolution result is the accumulation of the partial sums obtained by convolving all the two-dimensional images with their corresponding kernel planes. Sub-matrix operation instructions can therefore implement convolution in a variety of ways.
In summary, the present invention provides a matrix operation device and, together with the corresponding sub-matrix operation instruction set, can well address the fact that more and more algorithms in the computer field contain large numbers of sub-matrix operations. Compared with existing traditional solutions, the invention offers a streamlined instruction set, convenient use, flexible sub-matrix sizes, and sufficient on-chip buffering. The invention can be applied to a variety of computational tasks involving large numbers of sub-matrix operations, including the back-propagation training and forward prediction of artificial neural network algorithms, which currently perform remarkably well.

The specific embodiments described above further explain the objectives, technical solutions and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within its scope of protection.

Claims (19)

  1. A sub-matrix operation device for obtaining sub-matrix data from matrix data according to a sub-matrix operation instruction and performing a sub-matrix operation on the sub-matrix data, characterized in that the device comprises:
    a storage unit for storing matrix data;
    a register unit for storing sub-matrix information;
    a sub-matrix operation unit for acquiring a sub-matrix operation instruction, obtaining sub-matrix information from the register unit according to the sub-matrix operation instruction, then obtaining sub-matrix data from the matrix data in the storage unit according to the sub-matrix information, and then performing a sub-matrix operation on the obtained sub-matrix data to obtain a sub-matrix operation result.
  2. The sub-matrix operation device according to claim 1, characterized in that the sub-matrix operation instruction comprises an operation code and at least one operation field, wherein the operation code indicates the function of the sub-matrix operation instruction and the operation field indicates the data information of the sub-matrix operation instruction.
  3. The sub-matrix operation device according to claim 2, characterized in that the data information is the number of a register unit, and the sub-matrix operation unit accesses the corresponding register unit according to the register-unit number to obtain the sub-matrix information.
  4. The sub-matrix operation device according to claim 1, characterized in that the sub-matrix information comprises the start address of the sub-matrix data in the storage unit, the row width of the sub-matrix data, the column width of the sub-matrix data, and the row stride, wherein the row stride refers to the interval, between two adjacent rows of the sub-matrix data, from the last element of the previous row to the first element of the next row.
  5. The sub-matrix operation device according to claim 1, characterized by further comprising:
    an instruction processing unit for acquiring a sub-matrix operation instruction, processing the sub-matrix operation instruction, and providing it to the sub-matrix operation unit.
  6. The sub-matrix operation device according to claim 5, characterized in that the instruction processing unit comprises:
    an instruction fetch module for acquiring the sub-matrix operation instruction;
    a decoding module for decoding the acquired sub-matrix operation instruction;
    an instruction queue for sequentially storing the decoded sub-matrix operation instructions;
    a dependency processing unit for determining, before the sub-matrix operation unit acquires the sub-matrix operation instruction, whether the sub-matrix operation instruction accesses the same sub-matrix data as the previous sub-matrix operation instruction; if so, the sub-matrix operation instruction is stored in a storage queue and is provided to the sub-matrix operation unit only after the previous sub-matrix operation instruction has finished executing; otherwise, the sub-matrix operation instruction is provided to the sub-matrix operation unit directly.
  7. The sub-matrix operation device according to claim 1, characterized in that the storage unit is further used to store the sub-matrix operation result.
  8. The sub-matrix operation device according to claim 7, characterized by further comprising:
    an input/output unit for storing matrix data into the storage unit, or obtaining the sub-matrix operation result from the storage unit.
  9. The sub-matrix operation device according to claim 1, characterized in that the storage unit is a scratchpad memory.
  10. The sub-matrix operation device according to claim 1, characterized in that the sub-matrix operation unit comprises a sub-matrix addition component, a sub-matrix multiplication component, a size comparison component, a nonlinear operation component, and a sub-matrix scalar multiplication component.
  11. The sub-matrix operation device according to claim 10, characterized in that the sub-matrix operation unit has a multi-pipeline-stage structure comprising a first pipeline stage, a second pipeline stage and a third pipeline stage, wherein the sub-matrix addition component and the sub-matrix multiplication component are at the first pipeline stage, the size comparison component is at the second pipeline stage, and the nonlinear operation component and the sub-matrix scalar multiplication component are at the third pipeline stage.
  12. A sub-matrix operation method for obtaining sub-matrix data from matrix data according to a sub-matrix operation instruction and performing a sub-matrix operation on the sub-matrix data, characterized in that the method comprises:
    S1: storing matrix data;
    S2: storing sub-matrix information;
    S3: acquiring a sub-matrix operation instruction, obtaining sub-matrix information according to the sub-matrix operation instruction, then obtaining sub-matrix data from the stored matrix data according to the sub-matrix information, and then performing a sub-matrix operation on the obtained sub-matrix data to obtain a sub-matrix operation result.
  13. The sub-matrix operation method according to claim 12, characterized in that the sub-matrix operation instruction comprises an operation code and at least one operation field, wherein the operation code indicates the function of the sub-matrix operation instruction and the operation field indicates the data information of the sub-matrix operation instruction.
  14. The sub-matrix operation method according to claim 12, characterized in that the sub-matrix information comprises the start address of the sub-matrix data in the storage unit, the row width of the sub-matrix data, the column width of the sub-matrix data, and the row stride, wherein the row stride refers to the interval, between two adjacent rows of the sub-matrix data, from the last element of the previous row to the first element of the next row.
  15. The sub-matrix operation method according to claim 12, characterized in that, before step S3, the method further comprises:
    acquiring a sub-matrix operation instruction;
    decoding the acquired sub-matrix operation instruction;
    determining whether the sub-matrix operation instruction accesses the same sub-matrix data as the previous sub-matrix operation instruction; if so, the sub-matrix operation instruction is stored in a storage queue, and step S3 is performed only after the previous sub-matrix operation instruction has finished executing; otherwise, step S3 is performed directly.
  16. The sub-matrix operation method according to claim 12, characterized in that step S3 further comprises storing the sub-matrix operation result.
  17. The sub-matrix operation method according to claim 16, characterized by further comprising step S4: obtaining the stored sub-matrix operation result.
  18. The sub-matrix operation method according to claim 12, characterized in that the sub-matrix operations comprise sub-matrix addition, sub-matrix multiplication, size comparison, nonlinear operations and sub-matrix scalar multiplication.
  19. The sub-matrix operation method according to claim 18, characterized in that a multi-pipeline-stage structure is used for the sub-matrix operations, the multi-pipeline-stage structure comprising a first pipeline stage, a second pipeline stage and a third pipeline stage, wherein sub-matrix addition and sub-matrix multiplication are performed at the first pipeline stage, size comparison is performed at the second pipeline stage, and nonlinear operations and sub-matrix scalar multiplication are performed at the third pipeline stage.
PCT/CN2016/080023 2016-04-22 2016-04-22 Submatrix operation device and method WO2017181419A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP16899006.7A EP3447653A4 (en) 2016-04-22 2016-04-22 SUBMATRIX OPERATING DEVICE AND METHOD
PCT/CN2016/080023 WO2017181419A1 (zh) 2016-04-22 2016-04-22 Submatrix operation device and method
US16/167,425 US10534841B2 (en) 2016-04-22 2018-10-22 Appartus and methods for submatrix operations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2016/080023 WO2017181419A1 (zh) 2016-04-22 2016-04-22 Submatrix operation device and method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/167,425 Continuation-In-Part US10534841B2 (en) 2016-04-22 2018-10-22 Appartus and methods for submatrix operations

Publications (1)

Publication Number Publication Date
WO2017181419A1 true WO2017181419A1 (zh) 2017-10-26

Family

ID=60115442

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/080023 WO2017181419A1 (zh) 2016-04-22 2016-04-22 一种子矩阵运算装置及方法

Country Status (3)

Country Link
US (1) US10534841B2 (zh)
EP (1) EP3447653A4 (zh)
WO (1) WO2017181419A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109978131A (zh) * 2017-12-28 2019-07-05 北京中科寒武纪科技有限公司 集成电路芯片装置及相关产品
US10534841B2 (en) 2016-04-22 2020-01-14 Cambricon Technologies Corporation Limited Appartus and methods for submatrix operations
KR20200137843A (ko) * 2019-05-31 2020-12-09 한국전자통신연구원 확장 명령어 생성 및 처리 방법 및 이를 이용한 장치
CN113724127A (zh) * 2021-08-02 2021-11-30 成都统信软件技术有限公司 一种图像矩阵卷积的实现方法、计算设备及储存介质

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190228809A1 (en) * 2019-03-29 2019-07-25 Intel Corporation Technologies for providing high efficiency compute architecture on cross point memory for artificial intelligence operations
CN112446007A (zh) * 2019-08-29 2021-03-05 上海华为技术有限公司 一种矩阵运算方法、运算装置以及处理器
CN112579971B (zh) * 2019-09-29 2024-04-16 广州希姆半导体科技有限公司 矩阵运算电路、矩阵运算装置及矩阵运算方法
WO2021088569A1 (en) * 2019-11-05 2021-05-14 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Convolution method and device, electronic device
CN111865804B (zh) * 2020-06-15 2022-03-11 烽火通信科技股份有限公司 一种通过硬件发包机制提升路由下发效率的方法及系统
CN116348882A (zh) * 2020-06-30 2023-06-27 华为技术有限公司 一种卷积神经网络数据处理方法及其相关设备
CN112067015B (zh) * 2020-09-03 2022-11-22 青岛歌尔智能传感器有限公司 基于卷积神经网络的计步方法、装置及可读存储介质
CN116795432B (zh) * 2023-08-18 2023-12-05 腾讯科技(深圳)有限公司 运算指令的执行方法、装置、电路、处理器及设备

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214160A (zh) * 2011-07-08 2011-10-12 中国科学技术大学 一种基于龙芯3a的单精度矩阵乘法优化方法
CN102253925A (zh) * 2010-05-18 2011-11-23 江苏芯动神州科技有限公司 一种矩阵转置的方法
CN103530276A (zh) * 2013-09-25 2014-01-22 中国科学技术大学 一种基于龙芯3b的自适应矩阵乘法优化方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4888679A (en) * 1988-01-11 1989-12-19 Digital Equipment Corporation Method and apparatus using a cache and main memory for both vector processing and scalar processing by prefetching cache blocks including vector data elements
CN100545804C 2003-08-18 2009-09-30 上海海尔集成电路有限公司 CISC-architecture-based microcontroller and method for implementing its instruction set
JP4657998B2 * 2006-07-21 2011-03-23 ルネサスエレクトロニクス株式会社 Systolic array
US8473539B1 (en) * 2009-09-01 2013-06-25 Xilinx, Inc. Modified givens rotation for matrices with complex numbers
CN103699360B 2012-09-27 2016-09-21 北京中科晶上科技有限公司 Vector processor and method for vector data access and interaction
WO2017181419A1 (zh) 2016-04-22 2017-10-26 北京中科寒武纪科技有限公司 Submatrix operation device and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102253925A (zh) * 2010-05-18 2011-11-23 江苏芯动神州科技有限公司 一种矩阵转置的方法
CN102214160A (zh) * 2011-07-08 2011-10-12 中国科学技术大学 一种基于龙芯3a的单精度矩阵乘法优化方法
CN103530276A (zh) * 2013-09-25 2014-01-22 中国科学技术大学 一种基于龙芯3b的自适应矩阵乘法优化方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3447653A4 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10534841B2 (en) 2016-04-22 2020-01-14 Cambricon Technologies Corporation Limited Appartus and methods for submatrix operations
CN109978131A (zh) * 2017-12-28 2019-07-05 北京中科寒武纪科技有限公司 集成电路芯片装置及相关产品
CN109978131B (zh) * 2017-12-28 2020-05-22 中科寒武纪科技股份有限公司 集成电路芯片装置、方法及相关产品
KR20200137843A (ko) * 2019-05-31 2020-12-09 한국전자통신연구원 확장 명령어 생성 및 처리 방법 및 이를 이용한 장치
KR102416325B1 (ko) * 2019-05-31 2022-07-04 한국전자통신연구원 확장 명령어 생성 및 처리 방법 및 이를 이용한 장치
CN113724127A (zh) * 2021-08-02 2021-11-30 成都统信软件技术有限公司 一种图像矩阵卷积的实现方法、计算设备及储存介质
CN113724127B (zh) * 2021-08-02 2023-05-05 成都统信软件技术有限公司 一种图像矩阵卷积的实现方法、计算设备及储存介质

Also Published As

Publication number Publication date
US20190057063A1 (en) 2019-02-21
US10534841B2 (en) 2020-01-14
EP3447653A4 (en) 2019-11-13
EP3447653A1 (en) 2019-02-27

Similar Documents

Publication Publication Date Title
CN108388541B (zh) Convolution operation device and method
WO2017181419A1 (zh) Submatrix operation device and method
KR102123633B1 (ko) Matrix operation device and method
CN107315574B (zh) Device and method for performing matrix multiplication operations
CN109542515B (zh) Operation device and method
US10338925B2 (en) Tensor register files
CN108427990B (zh) Neural network computing system and method
CN109313556B (zh) Interruptible and restartable matrix multiplication instructions, processors, methods, and systems
US10372456B2 (en) Tensor processor instruction set architecture
CN111580865B (zh) Vector operation device and operation method
WO2017185418A1 (zh) Device and method for performing neural network operations and matrix/vector operations
KR102486029B1 (ko) Operation unit supporting operation data of different bit widths, operation method, and operation device
JP2020533691A (ja) Efficient direct convolution using SIMD instructions
US9141386B2 (en) Vector logical reduction operation implemented using swizzling on a semiconductor chip
WO2017185393A1 (zh) Device and method for performing vector inner-product operations
CN112348182B (zh) Neural network maxout layer computing device
EP3447690A1 (en) Maxout layer operation apparatus and method
KR20230018361A (ko) Rotary accumulator for vector operations
TWI508023B (zh) Parallel and vectored Gilbert-Johnson-Keerthi graphics processing
KR20230082621A (ko) Highly parallel processing architecture with shallow pipeline
WO2018192161A1 (zh) Operation device and method
CN118277718A (zh) General matrix multiplication optimization method and device for multi-core DSP
CN118051168A (zh) Data reading method, apparatus, computer device, storage medium and program product

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016899006

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2016899006

Country of ref document: EP

Effective date: 20181122

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899006

Country of ref document: EP

Kind code of ref document: A1