WO2021036729A1 - Matrix operation method, arithmetic device, and processor - Google Patents

Matrix operation method, arithmetic device, and processor

Info

Publication number
WO2021036729A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
sub
arithmetic
address
instruction
Prior art date
Application number
PCT/CN2020/107303
Other languages
English (en)
French (fr)
Inventor
肖聪
张争争
陈铁
王平
吴正成
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2021036729A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Definitions

  • the embodiments of the present application relate to the field of communication technology, and in particular, to a matrix operation method, an operation device, and a processor.
  • Matrix operations account for an increasing proportion of all types of operations and processing algorithms. Matrix operations also have a wide range of applications in artificial intelligence, digital image processing, and radar signal and data processing.
  • the more common types of operations also include: scalar operations and vector operations.
  • the existing general technology is to process various algorithms and data calculation processing of wireless communications in a digital signal processor (digital signal processor, DSP) and a central processing unit (center process unit, CPU).
  • DSP digital signal processor
  • CPU central processing unit
  • DSP and CPU can efficiently process scalar operations and vector operations.
  • however, because matrix operations involve a large amount of data, DSPs and CPUs are extremely inefficient and consume considerable power when processing matrix operations.
  • each instruction in the DSP and CPU can only perform an operation on one scalar or vector, so a matrix operation needs to be broken down into operations on multiple scalars or vectors, which requires a large number of instructions to be read and decoded before the corresponding calculations are performed. Reading and decoding instructions occupies a lot of resources, which wastes time and increases power consumption.
  • the embodiments of the present application provide a matrix operation method, arithmetic device, and a processor, which are used to improve the calculation efficiency, reduce the calculation power consumption, and save the calculation resources in the matrix calculation process.
  • an embodiment of the present application provides a matrix operation method, which is applied to an arithmetic device. The arithmetic device includes at least one arithmetic module, each arithmetic module includes (M*N) arithmetic units, and the (M*N) arithmetic units in each arithmetic module are arranged into a two-dimensional matrix array with M rows and N columns, where M and N are integers greater than or equal to 2.
  • the method includes: the arithmetic device obtains a matrix operation instruction, where the matrix operation instruction carries the address of the matrix to be operated on; the address of a sub-matrix is obtained according to the address of the matrix to be operated on, where the sub-matrix is a two-dimensional matrix with M rows and N columns obtained by dividing the matrix to be operated on; the matrix elements of the sub-matrix are read into the arithmetic module according to the address of the sub-matrix, where one arithmetic module corresponds to one sub-matrix; and the arithmetic module is controlled, according to the matrix operation instruction, to perform a matrix operation on the sub-matrix to obtain a matrix operation result.
  • the matrix to be operated on can be obtained directly according to the matrix operation instruction, and is then split based on the granularity of the matrix operation, so that matrix operations are performed on the resulting sub-matrices separately.
  • throughout the matrix operation, a single matrix operation instruction is enough to complete the operation on the matrix to be operated on, and splitting that matrix into multiple sub-matrices reduces the number of instructions in the matrix operation process, which saves operation resources, improves operation efficiency, and reduces operation power consumption.
  • reading the matrix elements of the sub-matrix into the arithmetic module according to the address of the sub-matrix includes: fetching the matrix elements of the sub-matrix from the address of the sub-matrix; and storing the matrix element in the m-th row and n-th column of the sub-matrix into the arithmetic unit in the m-th row and n-th column of the arithmetic module, where m is a positive integer less than or equal to M and n is a positive integer less than or equal to N.
  • controlling the arithmetic module to perform a matrix operation on the sub-matrix according to the matrix operation instruction includes: controlling, according to the matrix operation instruction, each arithmetic unit in the arithmetic module to perform a multiply-accumulate operation or a complex multiply-accumulate operation on the sub-matrix, where the arithmetic unit includes at least one of the following: a multiply-accumulate operation unit and a complex multiply-accumulate operation unit.
  • the matrix operation instruction further includes at least one of the following: a pattern identifier, a loop operation indication, and a destination address.
  • if the matrix operation instruction includes the pattern identifier, the method further includes: querying a pre-stored pattern table through the pattern identifier to obtain the target pattern, where the target pattern is used to indicate the data storage form of the matrix elements of the matrix to be operated on.
  • if the matrix operation instruction includes a first indication, the first indication is used to indicate the loop count corresponding to the matrix operation instruction.
  • the first indication may also be referred to as the loop count indication.
  • the matrix operation instruction further includes a second indication, the second indication is used to indicate preprocessing and matrix transposition, and the preprocessing includes: matrix negation and/or conjugation.
  • if the matrix operation instruction further includes the destination address, the method further includes: storing the matrix operation result in the destination address, where the destination address is a memory address.
  • the address of the matrix to be calculated is a memory address.
  • an embodiment of the present application provides an arithmetic device. The arithmetic device includes at least one arithmetic module, each arithmetic module includes (M*N) arithmetic units, the (M*N) arithmetic units in each arithmetic module are arranged into a two-dimensional matrix array with M rows and N columns, and M and N are each integers greater than or equal to 2. The arithmetic device is used to execute the matrix operation method described in the first aspect and any implementation of the first aspect.
  • an embodiment of the present application provides a processor, which includes an arithmetic device configured to execute the above-mentioned first aspect and the method described in any one of the possible implementation manners of the first aspect.
  • the processor further includes: at least one of a central processing unit and a digital signal processing unit, and the central processing unit and the digital signal processing unit are configured to send a matrix operation instruction to the arithmetic device.
  • FIG. 1 is a schematic diagram of the architecture of a heterogeneous processor provided in an embodiment of the application
  • FIG. 2 is a schematic structural diagram of an arithmetic device provided in an embodiment of the application.
  • FIG. 3 is a schematic diagram of the structure of each arithmetic module in an arithmetic device provided in an embodiment of the application;
  • FIG. 4 is a schematic structural diagram of an arithmetic module composed of a processing unit PE provided in an embodiment of the application;
  • FIG. 5 is a schematic diagram of an embodiment of a matrix operation method provided in an embodiment of the application.
  • the embodiments of the present application provide a matrix operation method, arithmetic device, and a processor, which are used to improve the operation efficiency, reduce the operation power consumption, and save the operation resources in the matrix operation process.
  • the matrix operation method in the embodiment of the present application can be used in various matrix operation systems, and is especially suitable for a heterogeneous processor architecture based on matrix acceleration, giving the heterogeneous processor good flexibility and programmability while also improving its matrix operation capability.
  • Fig. 1 shows a schematic structural diagram of a heterogeneous processor provided in an embodiment of the present application.
  • the heterogeneous processor 10 includes: a first arithmetic device 101, a second arithmetic device 102, and a memory 103, where the three are connected to one another in pairs, and data is transferred between the first arithmetic device 101 and the second arithmetic device 102 through the memory 103.
  • the first arithmetic device 101 is used to control the calculation process and perform calculations on scalars and vectors.
  • the first arithmetic device 101 may be a CPU or a DSP unit.
  • optionally, the configuration of matrix operation instructions and the feedback of status between the CPU or DSP unit and the second arithmetic device can be completed through a dedicated hardware channel;
  • the second arithmetic device 102 is used to perform operations on matrices and execute the matrix operation method in the embodiment of the present application;
  • the memory 103 is used to store operation data and the corresponding operation results.
  • the first arithmetic device 101 and the second arithmetic device 102 can perform parallel processing or serial processing.
  • the second arithmetic device 102 may include: at least one arithmetic module 1021, each arithmetic module includes (M*N) arithmetic units 10211, the (M*N) arithmetic units 10211 in each arithmetic module 1021 are arranged in a two-dimensional matrix array with M rows and N columns, and M and N are integers greater than or equal to 2.
  • the arithmetic module 1021 may be a processing element system (PEs), and the arithmetic unit 10211 may be a processing element (PE).
  • PEs processing element system
  • the processing unit system PEs consists of (M*N) processing units PE arranged in a two-dimensional matrix array, and each processing unit PE consists of one multiply-accumulate (MAC) unit or one complex multiply-accumulate (CMAC) unit.
  • FIG. 4 shows a schematic diagram of the structure of the processing unit system PEs composed of (4*4) processing units PE.
  • each processing unit PE has three input ports A, B, and C and one output port D, and can perform an (A*B+C) operation or an (A*B+temp) operation, where temp is the previous operation result of the processing unit PE, which enables vertical self-accumulation of the operation results within a single processing unit system PEs.
  • optionally, after the multiple processing unit systems PEs complete their (M*N) matrix operations, their operation results may be accumulated laterally through an accumulator (such as an ACC adder tree).
  • an accumulator such as an ACC addition tree
  • FIG. 5 is a schematic diagram of an embodiment of a matrix operation method provided in an embodiment of the application.
  • an embodiment of the matrix operation method in the embodiment of the present application includes:
  • the arithmetic device acquires a matrix operation instruction, and the matrix operation instruction carries the address of the matrix to be operated.
  • the matrix operation instruction carries the address of the matrix to be operated on, and the operation device can obtain the address of the matrix to be operated on through the matrix operation instruction.
  • an instruction format of the matrix operation instruction may include: operation code, output matrix precision, matrix shaping, matrix operation dimension, loop count, A matrix address, B matrix address, C matrix address, and matrix operation pattern identifier.
  • the operation code is used to indicate the main functions of the matrix operation instructions, including but not limited to the functions of the following types of instructions: system instructions, load/store instructions, and operation instructions.
  • the system commands may include: pattern table refresh, sync processing (data-related synchronization) and other commands.
  • Load/store instructions refer to instructions that do not perform operations, only obtain data from the memory, or store data back to the memory.
  • Operation instructions can include: matrix multiplication, matrix addition, matrix multiply-add, matrix multiply-accumulate, element-wise matrix multiplication (multiplying corresponding elements), and other instructions.
  • Operational instructions can include complete data load and store functions, and do not rely on the above load/store instructions, decomposition and inversion instructions, etc.
  • the output matrix precision can include single-precision floating point (single float point, SF), double-precision floating point (double float point, DF), and so on;
  • the input matrix shaping refers to performing operations such as negation and conjugation on a matrix before the matrix operation is performed on the input matrix or after the matrix operation result is obtained;
  • the matrix operation dimension can be (M, N, P), which means that a matrix operation on an (M*N) A matrix and an (N*P) B matrix yields an (M*P) C matrix; the loop count is the number of times this matrix operation is executed;
  • the A matrix address and the B matrix address are the read addresses of the input matrices to be operated on;
  • the C matrix address is the storage address of the C matrix, and the C matrix is obtained by performing the matrix operation on the A matrix and the B matrix.
  • the matrix operation pattern identifier is used to identify the corresponding target pattern in the matrix operation pattern table.
  • the target pattern can be used to indicate the data storage form (that is, the addressing form) of the matrix elements in the matrix to be operated on, which makes it easier to fetch matrix elements for matrix operations and transposition.
  • for example, with 4D, 3D, or 2D data storage forms, the storage and arrangement of the matrices to be operated on are not necessarily regular matrices.
  • the matrix to be operated on can be a contiguously stored 4D or 3D matrix, a non-contiguously stored 2D matrix, a triangular matrix, or another irregular matrix.
  • the matrix instruction format carries only the matrix operation pattern identifier, which can reduce the length and configuration overhead of the matrix operation instruction.
  • the arithmetic device obtains the address of a sub-matrix according to the address of the matrix to be operated, and the sub-matrix is a two-dimensional matrix with M rows and N columns obtained by dividing the matrix to be operated.
  • after learning the address of the matrix to be operated on from the matrix operation instruction, the arithmetic device divides the matrix to be operated on into multiple sub-matrices based on the dimensions of the two-dimensional array of arithmetic units, and obtains the addresses of the multiple sub-matrices.
  • the arithmetic device reads the matrix elements of the sub-matrix into the arithmetic module according to the address of the sub-matrix, and one arithmetic module corresponds to one sub-matrix.
  • the arithmetic device takes out the corresponding matrix element from the address of the sub-matrix, and transfers the matrix element to the corresponding position in the arithmetic module. For example, store the matrix elements in the mth row and nth column of the sub-matrix into the arithmetic unit in the mth row and nth column in the arithmetic module, the value of m is a positive integer less than or equal to M, and the value of n It is a positive integer less than or equal to N.
  • taking FIG. 4 as an example, if each sub-matrix after division has 4 rows and 4 columns, the arithmetic device inputs the matrix elements of rows 1 to 4 of the sub-matrix, through port A, B, or C of the processing units PE, into the processing units PE in rows 1 to 4 of the processing unit system PEs, respectively, and the matrix elements are arranged in the processing unit system PEs in the same order as in the sub-matrix.
  • the arithmetic device can use a memory access technique, such as gather-scatter, to obtain multiple sub-matrices in a single memory access.
  • the arithmetic device controls the arithmetic module to perform a matrix operation on the sub-matrix according to the matrix operation instruction to obtain a matrix operation result.
  • the arithmetic unit includes at least one of the following: a multiply-accumulate arithmetic unit and a complex multiply-accumulate arithmetic unit, and the arithmetic device controls each arithmetic unit in the arithmetic module to perform a multiply-accumulate MAC operation or a complex multiply-accumulate on the sub-matrix according to the matrix operation instruction CMAC operation.
  • the aforementioned matrix operation instruction includes: a pattern identifier, a first indication (that is, a loop count indication), a second indication, and a destination address.
  • the first indication is used to indicate the loop count corresponding to the matrix operation instruction, or is used to indicate both that loop count and how the start address of the matrix to be operated on is generated for the next matrix operation.
  • the loop count is used together with the data storage form in the pattern table. Its purpose is not only to indicate how many matrix operations to perform, but also to convey how the start address of the matrix for the next matrix operation is calculated and generated.
  • the second indication is used to indicate preprocessing and matrix transposition, and the preprocessing may include but is not limited to: matrix negation and/or conjugation.
  • the arithmetic device queries the pre-stored pattern table through the pattern identifier in the matrix operation instruction to obtain the corresponding target pattern, and then performs matrix operations according to the target pattern.
  • the above pattern table can be preloaded, or it can be dynamically refreshed through a dedicated instruction channel each time the system starts and while it is running.
  • the above-mentioned transposition operation can be implemented based on the data selector MUX.
  • when the dimension n of the sub-matrix is small, a single-level MUX is used to complete the transposition.
  • when the dimension n of the sub-matrix is large, a multi-level MUX is used to complete the transposition.
  • the arithmetic device stores the result of the matrix operation through the destination address in the matrix operation instruction. For example, the arithmetic device stores the matrix operation results corresponding to the A matrix and the B matrix in the address of the C matrix.
  • the matrix to be operated on can be obtained directly according to the matrix operation instruction, and is then split based on the matrix operation granularity, so that matrix operations are performed separately on the resulting sub-matrices.
  • throughout the matrix operation, only one matrix operation instruction is needed to complete the operation on the matrix to be operated on, and dividing that matrix into multiple sub-matrices reduces the number of instructions in the matrix operation process, saves operation resources, improves operation efficiency, and reduces operation power consumption.
  • the second arithmetic device 102 includes: a plurality of arithmetic modules 1021, each arithmetic module 1021 includes (M*N) arithmetic units 10211, and the (M*N) arithmetic units 10211 in each arithmetic module 1021 are arranged in a two-dimensional matrix with M rows and N columns.
  • the second arithmetic device 102 is configured to perform the following operations: obtain a matrix operation instruction, where the matrix operation instruction carries the address of the matrix to be operated on; obtain the address of a sub-matrix according to the address of the matrix to be operated on, where the sub-matrix is a two-dimensional matrix of M rows and N columns obtained by dividing the matrix to be operated on; read the matrix elements of the sub-matrix into the arithmetic module 1021 according to the address of the sub-matrix, where one arithmetic module corresponds to one sub-matrix; and control, according to the matrix operation instruction, the arithmetic module 1021 to perform a matrix operation on the sub-matrix to obtain a matrix operation result.
  • the second arithmetic device 102 is specifically configured to: fetch the matrix elements of the sub-matrix from the address of the sub-matrix; and store the matrix element in the m-th row and n-th column of the sub-matrix into the arithmetic unit in the m-th row and n-th column of the arithmetic module, where m is a positive integer less than or equal to M and n is a positive integer less than or equal to N.
  • the second arithmetic device 102 is specifically configured to: control, according to the matrix operation instruction, each arithmetic unit 10211 in the arithmetic module to perform a multiply-accumulate operation or a complex multiply-accumulate operation on the sub-matrix.
  • the arithmetic unit 10211 includes at least one of the following: a multiply-accumulate operation unit and a complex multiply-accumulate operation unit.
  • the second arithmetic device 102 is further configured to: query a pre-stored pattern table through the pattern identifier to obtain the target pattern.
  • the target pattern is used to indicate the data storage form of the matrix elements of the matrix to be operated on.
  • the matrix operation instruction further includes a first indication.
  • the first indication is used to indicate the loop count corresponding to the matrix operation instruction.
  • the matrix operation instruction further includes a second indication, the second indication is used to indicate preprocessing and matrix transposition, and the preprocessing includes: matrix negation and/or conjugation.
  • the second arithmetic device 102 is further configured to: store the result of the matrix operation in the destination address, where the destination address is a memory address.
  • the address of the matrix to be calculated is a memory address.
  • An embodiment of the present application provides a processor, which may specifically be the heterogeneous processor 10 described in FIG. 1 above.
  • an operating system and operating instructions, executable modules or data structures, or a subset of them, or an extended set of them are stored in the memory 103.
  • the operating instructions may include various operating instructions used to implement various operations.
  • the operating system may include various system programs for implementing various basic services and processing hardware-based tasks.
  • the second arithmetic device 102 receives the matrix operation instruction sent by the first arithmetic device 101, and then executes, according to the matrix operation instruction, the matrix operation method described in the above method embodiment.
  • the device embodiments described above are only illustrative. The units described as separate parts may or may not be physically separated, and the parts displayed as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the technical solutions in this embodiment.
  • the connection relationship between the modules indicates that they have a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
  • this application can be implemented by means of software plus necessary general hardware.
  • it can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and so on.
  • all functions completed by computer programs can be easily implemented with corresponding hardware.
  • the specific hardware structures used to achieve the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits.
  • software program implementation is a better implementation.
  • the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk, and includes several instructions that cause a computer device (which can be a personal computer, server, or network device, etc.) to execute the methods described in each embodiment of this application.
  • a computer device which can be a personal computer, server, or network device, etc.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, wireless, microwave, etc.) means.
  • wired such as coaxial cable, optical fiber, digital subscriber line (DSL)
  • wireless such as infrared, wireless, microwave, etc.
  • the computer-readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Advance Control (AREA)

Abstract

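Embodiments of this application disclose a matrix operation method, an arithmetic device, and a processor, which are used to improve operation efficiency, reduce operation power consumption, and save operation resources during matrix operations. The method is applied to an arithmetic device that includes at least one arithmetic module. Each arithmetic module includes (M*N) arithmetic units arranged in a two-dimensional matrix array of M rows and N columns, where M and N are each integers greater than or equal to 2. The matrix operation method includes: obtaining the address of a sub-matrix according to the address of the matrix to be operated on, which is carried in a matrix operation instruction, where the sub-matrix is a two-dimensional matrix of M rows and N columns obtained by dividing the matrix to be operated on; reading the matrix elements of the sub-matrix into the arithmetic module according to the address of the sub-matrix, where one arithmetic module corresponds to one sub-matrix; and controlling, according to the matrix operation instruction, the arithmetic module to perform a matrix operation on the sub-matrix to obtain a matrix operation result.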

Description

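Matrix operation method, arithmetic device, and processor
This application claims priority to Chinese Patent Application No. 201910809027.8, filed with the China Patent Office on August 29, 2019 and entitled "Matrix operation method, arithmetic device, and processor", which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of this application relate to the field of communication technology, and in particular to a matrix operation method, an arithmetic device, and a processor.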
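Background
With the development of communication technology, and especially the development and application of 5G technology, matrix operations account for a growing share of all types of operation and processing algorithms. Matrix operations are also widely used in artificial intelligence, digital image processing, and radar signal and data processing.
Besides matrix operations, the more common types of operations include scalar operations and vector operations. In terms of algorithm implementation, the existing general approach is to process the various algorithms and data operations of wireless communication in a digital signal processor (DSP) and a central processing unit (CPU).
DSPs and CPUs can process scalar and vector operations efficiently, but because matrix operations involve a large amount of data, DSPs and CPUs are extremely inefficient and consume considerable power when processing them. Specifically, each instruction in a DSP or CPU can complete an operation on only one scalar or vector, so a matrix operation has to be broken down into operations on multiple scalars or vectors, and a large number of instructions must be read and decoded before the corresponding calculations are performed. Reading and decoding instructions occupies a lot of resources, which wastes time and increases power consumption.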
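Summary
To solve the above technical problem, the embodiments of this application provide a matrix operation method, an arithmetic device, and a processor, which are used to improve operation efficiency, reduce operation power consumption, and save operation resources during matrix operations.
In a first aspect, an embodiment of this application provides a matrix operation method applied to an arithmetic device. The arithmetic device includes at least one arithmetic module, each arithmetic module includes (M*N) arithmetic units, and the (M*N) arithmetic units in each arithmetic module are arranged in a two-dimensional matrix array of M rows and N columns, where M and N are each integers greater than or equal to 2. The method includes: the arithmetic device obtains a matrix operation instruction, where the matrix operation instruction carries the address of the matrix to be operated on; the address of a sub-matrix is obtained according to the address of the matrix to be operated on, where the sub-matrix is a two-dimensional matrix of M rows and N columns obtained by dividing the matrix to be operated on; the matrix elements of the sub-matrix are read into the arithmetic module according to the address of the sub-matrix, where one arithmetic module corresponds to one sub-matrix; and the arithmetic module is controlled, according to the matrix operation instruction, to perform a matrix operation on the sub-matrix to obtain a matrix operation result.
In the matrix operation method of the first aspect, the matrix to be operated on can be obtained directly according to the matrix operation instruction and then split based on the matrix operation granularity, so that matrix operations are performed separately on the resulting sub-matrices. Throughout this process, a single matrix operation instruction is enough to complete the operation on the matrix to be operated on, and splitting that matrix into multiple sub-matrices reduces the number of instructions in the matrix operation process, which saves operation resources, improves operation efficiency, and reduces operation power consumption.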
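In a possible implementation of the first aspect, reading the matrix elements of the sub-matrix into the arithmetic module according to the address of the sub-matrix includes: fetching the matrix elements of the sub-matrix from the address of the sub-matrix; and storing the matrix element in the m-th row and n-th column of the sub-matrix into the arithmetic unit in the m-th row and n-th column of the arithmetic module, where m is a positive integer less than or equal to M and n is a positive integer less than or equal to N.
In a possible implementation of the first aspect, controlling the arithmetic module to perform a matrix operation on the sub-matrix according to the matrix operation instruction includes: controlling, according to the matrix operation instruction, each arithmetic unit in the arithmetic module to perform a multiply-accumulate operation or a complex multiply-accumulate operation on the sub-matrix, where the arithmetic unit includes at least one of the following: a multiply-accumulate operation unit and a complex multiply-accumulate operation unit.
In a possible implementation of the first aspect, the matrix operation instruction further includes at least one of the following: a pattern identifier, a loop operation indication, and a destination address.
In a possible implementation of the first aspect, if the matrix operation instruction includes the pattern identifier, the method further includes: querying a pre-stored pattern table through the pattern identifier to obtain a target pattern, where the target pattern is used to indicate the data storage form of the matrix elements of the matrix to be operated on.
In a possible implementation of the first aspect, if the matrix operation instruction includes a first indication, the first indication is used to indicate the loop count corresponding to the matrix operation instruction. The first indication may also be called the loop count indication.
In a possible implementation of the first aspect, the matrix operation instruction further includes a second indication, which is used to indicate preprocessing and matrix transposition, where the preprocessing includes negation and/or conjugation of the matrix.
In a possible implementation of the first aspect, if the matrix operation instruction further includes the destination address, the method further includes: storing the matrix operation result in the destination address, where the destination address is a memory address.
In a possible implementation of the first aspect, the address of the matrix to be operated on is a memory address.
In a possible implementation of the first aspect, M and N have equal values.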
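In a second aspect, an embodiment of this application provides an arithmetic device. The arithmetic device includes at least one arithmetic module, each arithmetic module includes (M*N) arithmetic units, and the (M*N) arithmetic units in each arithmetic module are arranged in a two-dimensional matrix array of M rows and N columns, where M and N are each integers greater than or equal to 2. The arithmetic device is used to execute the matrix operation method described in the first aspect and any implementation of the first aspect.
In a third aspect, an embodiment of this application provides a processor that includes an arithmetic device, where the arithmetic device is used to execute the method described in the first aspect and any possible implementation of the first aspect.
In a possible implementation of the third aspect, the processor further includes at least one of a central processing unit and a digital signal processing unit, where the central processing unit and the digital signal processing unit are used to send a matrix operation instruction to the arithmetic device.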
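Brief Description of Drawings
FIG. 1 is a schematic architecture diagram of a heterogeneous processor provided in an embodiment of this application;
FIG. 2 is a schematic structural diagram of an arithmetic device provided in an embodiment of this application;
FIG. 3 is a schematic structural diagram of each arithmetic module in an arithmetic device provided in an embodiment of this application;
FIG. 4 is a schematic structural diagram of an arithmetic module composed of processing units PE provided in an embodiment of this application;
FIG. 5 is a schematic diagram of an embodiment of a matrix operation method provided in an embodiment of this application.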
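Detailed Description
The embodiments of this application provide a matrix operation method, an arithmetic device, and a processor, which are used to improve operation efficiency, reduce operation power consumption, and save operation resources during matrix operations.
The embodiments of this application are described below with reference to the drawings.
The terms "first", "second", and the like in the specification, claims, and drawings of this application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the manner of distinguishing objects with the same attributes when describing the embodiments of this application. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that includes a series of units is not necessarily limited to those units, but may include other units that are not clearly listed or that are inherent to the process, method, product, or device.
The matrix operation method in the embodiments of this application can be used in various matrix operation systems, and is especially suitable for a heterogeneous processor architecture based on matrix acceleration, giving the heterogeneous processor good flexibility and programmability while also improving its matrix operation capability.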
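FIG. 1 shows a schematic architecture diagram of a heterogeneous processor provided in an embodiment of this application.
As shown in FIG. 1, the heterogeneous processor 10 includes a first arithmetic device 101, a second arithmetic device 102, and a memory 103. The three are connected to one another in pairs, and data is transferred between the first arithmetic device 101 and the second arithmetic device 102 through the memory 103.
The first arithmetic device 101 controls the operation flow and performs operations on scalars and vectors. Specifically, the first arithmetic device 101 may be a CPU or a DSP unit. Optionally, the configuration of matrix operation instructions and the feedback of status between the CPU or DSP unit and the second arithmetic device may be completed through a dedicated hardware channel. The second arithmetic device 102 performs operations on matrices and executes the matrix operation method in the embodiments of this application. The memory 103 stores operation data and the corresponding operation results. Depending on the requirements of different algorithms, the first arithmetic device 101 and the second arithmetic device 102 may process in parallel or serially. As shown in FIG. 2 and FIG. 3, the second arithmetic device 102 may include at least one arithmetic module 1021; each arithmetic module includes (M*N) arithmetic units 10211, and the (M*N) arithmetic units 10211 in each arithmetic module 1021 are arranged in a two-dimensional matrix array of M rows and N columns, where M and N are each integers greater than or equal to 2.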
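For example, the arithmetic module 1021 may be a processing element system (PEs), and the arithmetic unit 10211 may be a processing element (PE).
The processing unit system PEs consists of (M*N) processing units PE arranged in a two-dimensional matrix array, and each processing unit PE consists of one multiply-accumulate (MAC) unit or one complex multiply-accumulate (CMAC) unit. FIG. 4 shows a schematic structural diagram of a processing unit system PEs composed of (4*4) processing units PE.
As shown in FIG. 4, each processing unit PE has three input ports A, B, and C and one output port D, and can perform an (A*B+C) operation or an (A*B+temp) operation, where temp is the previous operation result of the processing unit PE. This enables vertical self-accumulation of the operation results within a single processing unit system PEs.
Optionally, after the multiple processing unit systems PEs complete their (M*N) matrix operations, their operation results may be accumulated laterally through an accumulator (such as an ACC adder tree).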
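The behavior of a single processing unit PE can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `ProcessingElement` and the method `mac` are hypothetical, and `temp` is read here as the PE's own previous result, which is one interpretation of the description above rather than a statement of the hardware design.

```python
class ProcessingElement:
    """Hypothetical software model of one PE with inputs A, B, C and output D."""

    def __init__(self):
        self.temp = 0  # assumed: holds this PE's previous result

    def mac(self, a, b, c=None):
        # D = A*B + C when C is supplied, otherwise D = A*B + temp
        addend = self.temp if c is None else c
        self.temp = a * b + addend
        return self.temp


pe = ProcessingElement()
d1 = pe.mac(2, 3, c=1)   # uses the C port: 2*3 + 1
d2 = pe.mac(4, 5)        # self-accumulates: 4*5 + previous result
```

Because Python's `*` and `+` also work on `complex` values, the same model can stand in for a CMAC unit in this sketch.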
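To make the matrix operation method in the embodiments of this application easier to understand, it is described in detail below with reference to the drawings.
FIG. 5 is a schematic diagram of an embodiment of a matrix operation method provided in an embodiment of this application.
As shown in FIG. 5, an embodiment of the matrix operation method in the embodiments of this application includes:
201. The arithmetic device obtains a matrix operation instruction, where the matrix operation instruction carries the address of the matrix to be operated on.
The matrix operation instruction carries the address of the matrix to be operated on, and the arithmetic device can obtain that address through the matrix operation instruction.
Optionally, as shown in FIG. 3, one instruction format of the matrix operation instruction may include: operation code, output matrix precision, matrix shaping, matrix operation dimension, loop count, A matrix address, B matrix address, C matrix address, and matrix operation pattern identifier.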
其中操作码用于指示矩阵运算指令的主要功能,包括但不限于以下几类指令的功能:系统类指令、下载(load)/存储(store)类指令、以及运算类指令。其中系统类指令可以包括:范例(pattern)表刷新、sync处理(数据相关的同步)等指令。load/store类指令是指:不做运算,只从内存获取数据,或将数据存回到内存中等功能对应的指令。运算类指令可以包括:矩阵乘,矩阵加,矩阵乘加,矩阵乘累加,矩阵点称(对应元素相乘)等指令。运算类指令可以包含完整的数据load和store功能,不依赖于上述的load/store类指令,分解求逆类指令等。
输出矩阵精度可以包括单精度浮点(single float point,SF)或双精度浮点(double float point,DF)等;输入矩阵整型是指:对输入矩阵执行矩阵运算之前或得到矩阵运算结果之后对矩阵进行取反运算和共轭运算等;矩阵运算维度可以为(M,N,P),其含义是对(M*N)的A矩阵和N*P的B矩阵进行矩阵运算得到(M*P)的C矩阵;循环次数是此次矩阵运算的执行次数;A矩阵地址和B矩阵地址为输入的待运算矩阵的读取地址;C矩阵地址是:C矩阵的存入地址,C矩阵是对A矩阵地址和B矩阵地址进行矩阵运算后得到的。
矩阵运算范例标识是用于识别矩阵运算范例表中相应的目标范例,目标范例可以用于指示待运算矩阵中矩阵元素的数据存储形式(即寻址形式),便于矩阵运算、转置获取矩阵元素。例如,4D、3D、或2D的数据存储形式,待运算矩阵的存储和排列可能并不都是规整的矩阵。待运算矩阵可以是连续排列的4D或3D矩阵,也可以是不连续排列的2D矩阵,还可以是三角矩阵或是其它不规律的矩阵。矩阵指令格式中只指示矩阵运算范例标识,可以降低矩阵运算指令的长度和配置开销。
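As a rough illustration of the instruction format and the pattern lookup, the fields listed above can be sketched as a record. The field names, the opcode mnemonic `"matmul"`, and the contents of `PATTERN_TABLE` are all assumptions made for this sketch; the application does not specify field widths, encodings, or the layout of a pattern table entry.

```python
from dataclasses import dataclass

# Hypothetical pattern table: pattern identifier -> storage-form description.
PATTERN_TABLE = {
    0: {"layout": "2D", "contiguous": True},
    1: {"layout": "2D", "contiguous": False},
    2: {"layout": "3D", "contiguous": True},
}


@dataclass(frozen=True)
class MatrixInstruction:
    opcode: str         # for example "matmul"; the mnemonic is an assumption
    out_precision: str  # "SF" or "DF"
    shaping: str        # for example "none", "negate", "conjugate"
    dims: tuple         # (M, N, P): A is M*N, B is N*P, C is M*P
    loop_count: int
    addr_a: int
    addr_b: int
    addr_c: int
    pattern_id: int

    def target_pattern(self):
        # The instruction carries only the identifier; the storage form
        # itself is looked up in the pre-stored table.
        return PATTERN_TABLE[self.pattern_id]

    def shapes(self):
        m, n, p = self.dims
        return {"A": (m, n), "B": (n, p), "C": (m, p)}


inst = MatrixInstruction("matmul", "SF", "none", (8, 4, 8), 1,
                         0x1000, 0x2000, 0x3000, pattern_id=1)
```

Keeping the storage form out of the instruction and behind an identifier is what allows the instruction itself to stay short, as the paragraph above notes.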
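202. The operation device obtains the addresses of sub-matrices based on the address of the to-be-operated matrix, where each sub-matrix is a two-dimensional matrix of M rows and N columns obtained by splitting the to-be-operated matrix.
After learning the address of the to-be-operated matrix from the matrix operation instruction, the operation device divides the to-be-operated matrix into a plurality of sub-matrices based on the dimensions of the two-dimensional array of operation units in the operation module, and obtains the addresses of the plurality of sub-matrices.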
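Step 202 can be sketched for the simplest storage form. The function below assumes a contiguous row-major matrix whose dimensions are exact multiples of the tile size; the name `submatrix_addresses`, the `elem_size` parameter, and the byte-addressed layout are assumptions for illustration, and the irregular storage forms handled through the pattern table are not modeled.

```python
def submatrix_addresses(base, rows, cols, m, n, elem_size=4):
    """Return {(tile_row, tile_col): address of the tile's top-left element}.

    Assumes a contiguous row-major matrix whose dimensions are multiples of
    the tile size (m, n); padding of ragged edges is not modeled.
    """
    if rows % m or cols % n:
        raise ValueError("matrix dimensions must be multiples of the tile size")
    addrs = {}
    for tr in range(rows // m):
        for tc in range(cols // n):
            # Skip tr*m full rows, then tc*n elements within the row.
            offset = (tr * m * cols + tc * n) * elem_size
            addrs[(tr, tc)] = base + offset
    return addrs


# An 8*8 matrix of 4-byte elements split into four 4*4 sub-matrices.
tiles = submatrix_addresses(base=0x1000, rows=8, cols=8, m=4, n=4)
```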
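203. The operation device reads the matrix elements of the sub-matrices into the operation modules based on the addresses of the sub-matrices, where one operation module corresponds to one sub-matrix.
Optionally, the operation device fetches the matrix elements from the address of a sub-matrix and transfers them to the corresponding positions in the operation module. For example, the matrix element in row m and column n of the sub-matrix is stored in the operation unit in row m and column n of the operation module, where m is a positive integer less than or equal to M, and n is a positive integer less than or equal to N.
Using FIG. 4 as an example, if each sub-matrix obtained by the division has 4 rows and 4 columns, the operation device inputs the matrix elements in rows 1 to 4 of the sub-matrix, through port A, B, or C of the processing elements PE, into the processing elements PE in rows 1 to 4 of the processing element system PEs, and the matrix elements are arranged in the processing element system PEs in the same order as in the sub-matrix.
The operation device may use a memory access technique, for example a gather-scatter technique, to obtain a plurality of sub-matrices in a single memory access.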
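The element placement in step 203 can be sketched as a gather over a flat memory followed by a reshape into the operation module's grid. The function name `gather_submatrix`, the element-indexed `memory` list, and the `row_stride` parameter are hypothetical; a real gather-scatter access would be performed by the memory subsystem rather than a Python loop.

```python
def gather_submatrix(memory, start, row_stride, m, n):
    """Collect an m*n sub-matrix from a flat, element-indexed memory list."""
    # Build the full index list first, then fetch everything in one pass,
    # which mirrors issuing a single gather over all element addresses.
    indices = [start + r * row_stride + c for r in range(m) for c in range(n)]
    flat = [memory[i] for i in indices]
    # Element (r, c) of the sub-matrix lands in grid position (r, c),
    # preserving the element order of the sub-matrix.
    return [flat[r * n:(r + 1) * n] for r in range(m)]


memory = list(range(64))  # an 8*8 row-major matrix holding 0..63
grid = gather_submatrix(memory, start=4, row_stride=8, m=4, n=4)
```

Here `grid` is the top-right 4*4 sub-matrix of the 8*8 matrix, with each row of the sub-matrix occupying the matching row of the grid.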
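204. The operation device controls, according to the matrix operation instruction, the operation modules to perform the matrix operation on the sub-matrices, to obtain a matrix operation result.
Optionally, the operation unit includes at least one of a multiply-accumulate operation unit and a complex multiply-accumulate operation unit, and the operation device controls, according to the matrix operation instruction, each operation unit in the operation module to perform a multiply-accumulate (MAC) operation or a complex multiply-accumulate (CMAC) operation on the sub-matrix.
Optionally, the matrix operation instruction includes a pattern identifier, a first indication (that is, a loop count indication), a second indication, and a destination address.
The first indication indicates the loop count corresponding to the matrix operation instruction, or indicates both that loop count and the manner in which the start address of the to-be-operated matrix for the next matrix operation is generated. Specifically, the loop count is used together with the data storage form in the pattern table. Its purpose is not only to indicate how many matrix operations are to be performed, but also to tell the operation device how the start address of the matrix for the next matrix operation is calculated and generated.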
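One way to picture how the loop count and the storage form cooperate is a generator that derives the start addresses for each iteration. The application does not state the address-generation rule, so the linear stride used here, the `strides` tuple, and the function name `loop_start_addresses` are purely assumptions for illustration.

```python
def loop_start_addresses(base_a, base_b, base_c, loop_count, strides):
    """Yield the (A, B, C) start addresses for each iteration of one instruction.

    `strides` is a hypothetical (stride_a, stride_b, stride_c) tuple that would
    come from the pattern table entry; a plain linear stride is assumed.
    """
    stride_a, stride_b, stride_c = strides
    for i in range(loop_count):
        yield (base_a + i * stride_a,
               base_b + i * stride_b,
               base_c + i * stride_c)


# Three iterations: A and C advance by 64 bytes each time, B is reused.
addresses = list(loop_start_addresses(0x1000, 0x2000, 0x3000,
                                      loop_count=3, strides=(64, 0, 64)))
```

Under this assumption a single instruction covers a batch of operations whose operands sit at regular offsets, which is consistent with the goal of reducing the number of instructions issued.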
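The second indication indicates preprocessing and matrix transposition, and the preprocessing may include but is not limited to negation and/or conjugation of a matrix.
The operation device queries a pre-stored pattern table by using the pattern identifier in the matrix operation instruction to obtain the corresponding target pattern, and then performs the matrix operation according to the target pattern.
The pattern table may be preloaded, or may be dynamically refreshed through a dedicated instruction channel at each system startup and during operation.
The transposition may be implemented based on a data selector (MUX). When the dimension n of the sub-matrix is small, a single-level MUX is used to complete the transposition; when the dimension n of the sub-matrix is large, multiple levels of MUX are used.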
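The effect of the second indication can be sketched as a small preprocessing function. The function name `preprocess` and its boolean flags are hypothetical stand-ins for whatever encoding the second indication uses, and the transposition is done with `zip` rather than by modeling the MUX network.

```python
def preprocess(matrix, negate=False, conjugate=False, transpose=False):
    """Apply the optional negation, conjugation, and transposition steps."""
    out = [[complex(x) for x in row] for row in matrix]
    if negate:
        out = [[-x for x in row] for row in out]
    if conjugate:
        out = [[x.conjugate() for x in row] for row in out]
    if transpose:
        # Software stand-in for the MUX-based transposition.
        out = [list(col) for col in zip(*out)]
    return out


a = [[1 + 2j, 3 - 1j],
     [0 + 1j, 2 + 0j]]
result = preprocess(a, conjugate=True, transpose=True)  # conjugate transpose
```

Combining conjugation with transposition in one pass yields the conjugate transpose, an operation that is common in the wireless signal processing workloads the application targets.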
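The operation device stores the matrix operation result at the destination address in the matrix operation instruction. For example, the operation device stores the matrix operation result corresponding to the A matrix and the B matrix at the C matrix address.
In this embodiment of this application, the to-be-operated matrix can be obtained directly according to the matrix operation instruction, and is then split based on the matrix operation granularity, so that the matrix operation is performed separately on the resulting sub-matrices. Throughout this process, a single matrix operation instruction is sufficient to complete the operation on the to-be-operated matrix. Splitting the to-be-operated matrix into a plurality of sub-matrices for operation reduces the number of instructions in the matrix operation process, saves operation resources, improves operation efficiency, and reduces operation power consumption.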
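The overall flow (one call standing in for one instruction, with the splitting done internally) can be sketched as a blocked matrix multiplication. This is a plain software analogue under the assumption that all dimensions are multiples of the tile size; the function name `tiled_matmul` and the square `tile` parameter are illustrative, and the loop nest says nothing about how the hardware schedules tiles across operation modules.

```python
def tiled_matmul(a, b, tile):
    """Multiply a (M*N) by b (N*P) one tile*tile block at a time."""
    m, inner, p = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    if m % tile or inner % tile or p % tile:
        raise ValueError("dimensions must be multiples of the tile size")
    c = [[0] * p for _ in range(m)]
    for i0 in range(0, m, tile):              # tile row of C
        for j0 in range(0, p, tile):          # tile column of C
            for k0 in range(0, inner, tile):  # accumulate over inner tiles
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        acc = c[i][j]         # running sum, like A*B+temp
                        for k in range(k0, k0 + tile):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c


a = [[i + j for j in range(4)] for i in range(4)]
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
product = tiled_matmul(a, identity, tile=2)
```

The caller issues a single request for the whole product; the per-tile work and the accumulation across inner tiles happen inside, which is the instruction-count saving the paragraph above describes.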
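The following describes in detail the operation device provided in the embodiments of this application.
As shown in FIG. 2 and FIG. 3, the second operation device 102 includes a plurality of operation modules 1021, each operation module 1021 includes (M*N) operation units 10211, and the (M*N) operation units 10211 in each operation module 1021 are arranged in a two-dimensional array of M rows and N columns.
The second operation device 102 is configured to perform the following operations: obtaining a matrix operation instruction, where the matrix operation instruction carries the address of a to-be-operated matrix; obtaining the addresses of sub-matrices based on the address of the to-be-operated matrix, where each sub-matrix is a two-dimensional matrix of M rows and N columns obtained by splitting the to-be-operated matrix; reading the matrix elements of the sub-matrices into the operation modules 1021 based on the addresses of the sub-matrices, where one operation module corresponds to one sub-matrix; and controlling, according to the matrix operation instruction, the operation modules 1021 to perform the matrix operation on the sub-matrices, to obtain a matrix operation result.
In a possible implementation, the second operation device 102 is specifically configured to: fetch the matrix elements of the sub-matrix from the address of the sub-matrix; and store the matrix element in row m and column n of the sub-matrix in the operation unit in row m and column n of the operation module, where m is a positive integer less than or equal to M, and n is a positive integer less than or equal to N.
In a possible implementation, the second operation device 102 is specifically configured to control, according to the matrix operation instruction, each operation unit 10211 in the operation module to perform a multiply-accumulate operation or a complex multiply-accumulate operation on the sub-matrix, where the operation unit 10211 includes at least one of a multiply-accumulate operation unit and a complex multiply-accumulate operation unit.
In a possible implementation, if the matrix operation instruction further includes a pattern identifier, the second operation device 102 is further configured to query a pre-stored pattern table by using the pattern identifier to obtain a target pattern, where the target pattern indicates the data storage form of the matrix elements of the to-be-operated matrix.
In a possible implementation, the matrix operation instruction further includes a first indication, and the first indication indicates the loop count corresponding to the matrix operation instruction.
In a possible implementation, the matrix operation instruction further includes a second indication, the second indication indicates preprocessing and matrix transposition, and the preprocessing includes negation and/or conjugation of a matrix.
In a possible implementation, if the matrix operation instruction further includes a destination address, the second operation device 102 is further configured to store the matrix operation result at the destination address, where the destination address is a memory address.
In a possible implementation, the address of the to-be-operated matrix is a memory address.
In a possible implementation, M and N have the same value.
It should be noted that all the operations described in the method embodiment corresponding to FIG. 5 can be performed by the second operation device 102. For detailed descriptions of the related operations, refer to the descriptions in the foregoing method embodiment. Details are not repeated here.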
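An embodiment of this application provides a processor, which may specifically be the heterogeneous processor 10 shown in FIG. 1.
In the heterogeneous processor 10, the memory 103 stores an operating system and operation instructions, executable modules, or data structures, or a subset or an extended set thereof. The operation instructions may include various operation instructions used to implement various operations. The operating system may include various system programs used to implement basic services and to process hardware-based tasks.
The second operation device 102 receives the matrix operation instruction sent by the first operation device 101, and performs, according to the matrix operation instruction, the matrix operation method described in the foregoing method embodiment.
It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the technical solutions in the embodiments. In addition, in the accompanying drawings of the device embodiments provided in this application, a connection between modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application may be implemented by software plus the necessary general-purpose hardware, or by dedicated hardware including an application-specific integrated circuit, a dedicated CPU, a dedicated memory, dedicated components, and the like. Generally, any function completed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structure used to implement the same function may take various forms, such as an analog circuit, a digital circuit, or a dedicated circuit. For this application, however, a software program implementation is the better implementation in most cases. Based on this understanding, the technical solutions of this application, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this application.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.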

Claims (18)

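  1. A matrix operation method, wherein the method is applied to an operation device, the operation device comprises at least one operation module, each operation module comprises (M*N) operation units, the (M*N) operation units in each operation module are arranged in a two-dimensional array of M rows and N columns, M and N are each an integer greater than or equal to 2, and the method comprises:
    obtaining a matrix operation instruction, wherein the matrix operation instruction carries an address of a to-be-operated matrix;
    obtaining addresses of sub-matrices based on the address of the to-be-operated matrix, wherein each sub-matrix is a two-dimensional matrix of M rows and N columns obtained by splitting the to-be-operated matrix;
    reading matrix elements of the sub-matrices into the operation modules based on the addresses of the sub-matrices, wherein one operation module corresponds to one sub-matrix; and
    controlling, according to the matrix operation instruction, the operation modules to perform a matrix operation on the sub-matrices, to obtain a matrix operation result.
  2. The method according to claim 1, wherein the reading matrix elements of the sub-matrices into the operation modules based on the addresses of the sub-matrices comprises:
    fetching the matrix elements of the sub-matrix from the address of the sub-matrix; and
    storing the matrix element in row m and column n of the sub-matrix in the operation unit in row m and column n of the operation module, wherein m is a positive integer less than or equal to M, and n is a positive integer less than or equal to N.
  3. The method according to claim 2, wherein the controlling, according to the matrix operation instruction, the operation modules to perform a matrix operation on the sub-matrices comprises:
    controlling, according to the matrix operation instruction, each operation unit in the operation module to perform a multiply-accumulate operation or a complex multiply-accumulate operation on the sub-matrix, wherein the operation unit comprises at least one of a multiply-accumulate operation unit and a complex multiply-accumulate operation unit.
  4. The method according to any one of claims 1 to 3, wherein if the matrix operation instruction further comprises a pattern identifier, the method further comprises:
    querying a pre-stored pattern table by using the pattern identifier to obtain a target pattern, wherein the target pattern indicates a data storage form of the matrix elements of the to-be-operated matrix.
  5. The method according to claim 4, wherein the matrix operation instruction further comprises a first indication, and the first indication indicates a loop count corresponding to the matrix operation instruction.
  6. The method according to any one of claims 1 to 3, wherein the matrix operation instruction further comprises a second indication, the second indication indicates preprocessing and matrix transposition, and the preprocessing comprises negation and/or conjugation of a matrix.
  7. The method according to any one of claims 1 to 3, wherein if the matrix operation instruction further comprises a destination address, the method further comprises:
    storing the matrix operation result at the destination address, wherein the destination address is a memory address.
  8. The method according to claim 1 or 2, wherein the value of M is equal to the value of N.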
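  9. An operation device, wherein the operation device comprises at least one operation module, each operation module comprises (M*N) operation units, the (M*N) operation units in each operation module are arranged in a two-dimensional array of M rows and N columns, and M and N are each an integer greater than or equal to 2; and the operation device is configured to perform the following operations:
    obtaining a matrix operation instruction, wherein the matrix operation instruction carries an address of a to-be-operated matrix;
    obtaining addresses of sub-matrices based on the address of the to-be-operated matrix, wherein each sub-matrix is a two-dimensional matrix of M rows and N columns obtained by splitting the to-be-operated matrix;
    reading matrix elements of the sub-matrices into the operation modules based on the addresses of the sub-matrices, wherein one operation module corresponds to one sub-matrix; and
    controlling, according to the matrix operation instruction, the operation modules to perform a matrix operation on the sub-matrices, to obtain a matrix operation result.
  10. The device according to claim 9, wherein the operation device is specifically configured to:
    fetch the matrix elements of the sub-matrix from the address of the sub-matrix; and
    store the matrix element in row m and column n of the sub-matrix in the operation unit in row m and column n of the operation module, wherein m is a positive integer less than or equal to M, and n is a positive integer less than or equal to N.
  11. The device according to claim 10, wherein the operation device is specifically configured to:
    control, according to the matrix operation instruction, each operation unit in the operation module to perform a multiply-accumulate operation or a complex multiply-accumulate operation on the sub-matrix, wherein the operation unit comprises at least one of a multiply-accumulate operation unit and a complex multiply-accumulate operation unit.
  12. The device according to any one of claims 9 to 11, wherein if the matrix operation instruction further comprises a pattern identifier, the operation device is further configured to:
    query a pre-stored pattern table by using the pattern identifier to obtain a target pattern, wherein the target pattern indicates a data storage form of the matrix elements of the to-be-operated matrix.
  13. The device according to claim 12, wherein the matrix operation instruction further comprises a first indication, and the first indication indicates a loop count corresponding to the matrix operation instruction.
  14. The device according to any one of claims 9 to 11, wherein the matrix operation instruction further comprises a second indication, the second indication indicates preprocessing and matrix transposition, and the preprocessing comprises negation and/or conjugation of a matrix.
  15. The device according to claim 14, wherein if the matrix operation instruction further comprises a destination address, the operation device is further configured to store the matrix operation result at the destination address, wherein the destination address is a memory address.
  16. The device according to claim 9 or 10, wherein the value of M is equal to the value of N.
  17. A processor, comprising an operation device, wherein the operation device is configured to perform the method according to any one of claims 1 to 8.
  18. The processor according to claim 17, wherein the processor further comprises at least one of a central processing unit and a digital signal processing unit, and the central processing unit and the digital signal processing unit are configured to send the matrix operation instruction to the operation device.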
PCT/CN2020/107303 2019-08-29 2020-08-06 一种矩阵运算方法、运算装置以及处理器 WO2021036729A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910809027.8 2019-08-29
CN201910809027.8A CN112446007A (zh) 2019-08-29 2019-08-29 一种矩阵运算方法、运算装置以及处理器

Publications (1)

Publication Number Publication Date
WO2021036729A1 true WO2021036729A1 (zh) 2021-03-04

Family

ID=74685008

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/107303 WO2021036729A1 (zh) 2019-08-29 2020-08-06 一种矩阵运算方法、运算装置以及处理器

Country Status (2)

Country Link
CN (1) CN112446007A (zh)
WO (1) WO2021036729A1 (zh)

Also Published As

Publication number Publication date
CN112446007A (zh) 2021-03-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20859553

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20859553

Country of ref document: EP

Kind code of ref document: A1