WO2023078364A1 - Operation method and apparatus for matrix multiplication - Google Patents

Operation method and apparatus for matrix multiplication

Info

Publication number
WO2023078364A1
Authority
WO
WIPO (PCT)
Prior art keywords
bit
bits
precision
sign
floating
Prior art date
Application number
PCT/CN2022/129619
Other languages
French (fr)
Chinese (zh)
Inventor
雷洪
甄德根
吴桐庆
孔德辉
徐科
Original Assignee
深圳市中兴微电子技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市中兴微电子技术有限公司
Publication of WO2023078364A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/487 Multiplying; Dividing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06G ANALOGUE COMPUTERS
    • G06G7/00 Devices in which the computing operation is performed by varying electric or magnetic quantities
    • G06G7/12 Arrangements for performing computing operations, e.g. operational amplifiers
    • G06G7/16 Arrangements for performing computing operations, e.g. operational amplifiers for multiplication or division
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present invention relate to the field of matrix multiplication, and in particular, to an operation method and device for matrix multiplication.
  • Neural networks in artificial intelligence place an increasing demand on the convolution and fully connected computing capabilities of accelerators.
  • Convolution and fully connected operations can be converted into matrix multiplication operations.
  • Matrix multiplication consists of multiplications and additions, and the multiply-add computing power of existing accelerators has grown from the GOPS range to the TOPS range.
  • Increasing the multiply-add computing power of an accelerator requires more computing units.
  • Existing AI accelerators mainly support input data types such as INT8, INT16, INT32, FP16, FP32, and FP64.
  • An AI accelerator that supports all six of these input data types needs six independent computing units, one per type. The drawback is that a given neural network generally uses a single input data type, so only one kind of computing unit is active at any time, yet all of the independent units must be present, which increases chip area and cost.
  • Embodiments of the present invention provide a matrix multiplication operation method and device to at least solve the problem of increased chip area and cost caused by the need for independent operation units of various input data types in the accelerator in the related art.
  • An operation method for matrix multiplication, comprising: splitting two 2N-bit floating-point data into corresponding sign bits, precision bits and exponent bits, and splitting four N-bit integer data into corresponding sign bits and precision bits; performing a matrix multiplication operation on the two floating-point data by adding the exponent bits, XORing the sign bits and multiplying the precision bits; performing a matrix multiplication operation on the four integer data in pairs by XORing the sign bits and multiplying the precision bits; and multiplexing a multiplication unit and an addition unit between the matrix multiplication operations of the floating-point data and of the integer data.
  • Performing the matrix multiplication operation on the two floating-point data by adding exponent bits, XORing sign bits and multiplying precision bits includes: adding the exponent bits of the first floating-point data to the exponent bits of the second floating-point data, XORing the sign bit of the first floating-point data with the sign bit of the second floating-point data, and multiplying the precision bits of the first floating-point data by the precision bits of the second floating-point data.
  • Performing the pairwise matrix multiplication operation on the four integer data by XORing sign bits and multiplying precision bits includes: XORing the sign bit of the first integer data with the sign bit of the second integer data, and multiplying the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; XORing the sign bit of the third integer data with the sign bit of the fourth integer data, and multiplying the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and adding the first operation result to the second operation result.
  • Splitting two 2N-bit floating-point data into corresponding sign bits, precision bits and exponent bits, and splitting four N-bit integer data into corresponding sign bits and precision bits, includes: splitting each of two 16-bit floating-point data into a 1-bit sign bit, 11 precision bits and 4 exponent bits, and splitting each of four 8-bit integer data into a 1-bit sign bit and 7 precision bits.
  • Performing the matrix multiplication operation on the two floating-point data by adding exponent bits, XORing sign bits and multiplying precision bits includes: multiplying the first floating-point data, composed of a 1-bit sign bit and 11 precision bits, by the second floating-point data, composed of a 1-bit sign bit and 11 precision bits, to obtain a 1-bit sign bit and a 22-bit sign-magnitude (original code) value, and then converting it to two's complement (complement code) to obtain a 1-bit sign bit and a 22-bit two's-complement value.
  • Performing the matrix multiplication operation on the four integer data by XORing sign bits and multiplying precision bits includes: multiplying the first integer data, composed of a 1-bit sign bit and 7 precision bits, by the second integer data, composed of a 1-bit sign bit and 7 precision bits, to obtain a first operation result composed of a 1-bit sign bit and a 14-bit sign-magnitude (original code) value; multiplying the third integer data, composed of a 1-bit sign bit and 7 precision bits, by the fourth integer data, composed of a 1-bit sign bit and 7 precision bits, to obtain a second operation result composed of a 1-bit sign bit and a 14-bit sign-magnitude value; adding the first operation result to the second operation result to obtain a 1-bit sign bit and a 15-bit sign-magnitude value; and then converting it to two's complement (complement code) to obtain a 1-bit sign bit and a 15-bit two's-complement value.
  • The addition operation in the matrix multiplication operation of floating-point data includes: selecting the largest of the split exponents; calculating the difference between each exponent and the largest exponent; right-shifting the product data bits according to that exponent difference; and adding the shifted product data.
  • An operation device for matrix multiplication, including: a splitting module, configured to split two 2N-bit floating-point data into corresponding sign bits, precision bits and exponent bits, and to split four N-bit integer data into corresponding sign bits and precision bits; and an operation module, configured to perform a matrix multiplication operation on the two floating-point data by adding exponent bits, XORing sign bits and multiplying precision bits, to perform a matrix multiplication operation on the four integer data in pairs by XORing sign bits and multiplying precision bits, and to multiplex the multiplication unit and the addition unit between the matrix multiplication operations of the floating-point data and of the integer data.
  • The operation module includes: a first operation unit, configured to add the exponent bits of the first floating-point data to the exponent bits of the second floating-point data, to XOR the sign bit of the first floating-point data with the sign bit of the second floating-point data, and to multiply the precision bits of the first floating-point data by the precision bits of the second floating-point data.
  • The operation module further includes: a second operation unit, configured to XOR the sign bit of the first integer data with the sign bit of the second integer data, and to multiply the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; to XOR the sign bit of the third integer data with the sign bit of the fourth integer data, and to multiply the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and to add the first operation result to the second operation result.
  • A computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform, when run, the steps in any one of the above method embodiments.
  • An electronic device is also provided, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps in any one of the above method embodiments.
  • the multiplication and addition operation resources of the accelerator can be reused in the process of matrix multiplication, thereby greatly reducing the chip area and cost of the accelerator.
  • Fig. 1 is a flowchart of the operation method for matrix multiplication according to an embodiment of the present invention.
  • Fig. 2 is a structural block diagram of the operation device for matrix multiplication according to an embodiment of the present invention.
  • Fig. 3 is a structural block diagram of an operation device for matrix multiplication according to another embodiment of the present invention.
  • Fig. 4 is a schematic diagram of multiplexing 4 pairs of FP16 multipliers and 8 pairs of INT8 multipliers according to an embodiment of the present invention.
  • Fig. 5 is a schematic diagram of the preprocessing before multiplication of FP16 and INT8 according to an embodiment of the present invention.
  • Fig. 6 is a schematic diagram of FP16 and INT8 independently implementing multiplication and converting to two's-complement form according to an embodiment of the present invention.
  • Fig. 7 is a schematic diagram of multiplication splitting and multiplexing of FP16 and INT8 according to an embodiment of the present invention.
  • Fig. 8 is a schematic diagram of addition multiplexing of FP16 and INT8 according to an embodiment of the present invention.
  • For each input data type it supports, an accelerator usually contains a dedicated matrix multiplication unit. Because a neural network calculation usually uses only one input data type, only one of these matrix multiplication units is working at any given time, yet the units for all supported input data types must be present in the accelerator.
  • an embodiment of the present invention provides an operation method for matrix multiplication.
  • The core of the matrix multiplication operation is the adder and the multiplier.
  • The matrix multiplication operation method provided in this embodiment mainly reuses the multiplier and the adder in the matrix multiplication, and can greatly reduce area consumption while still realizing the required function.
  • Multiplier reuse is achieved by splitting the data before multiplication.
  • The multiplexing principle in this embodiment is that the resources for 2 multiplications and 1 addition of n-bit integer data are shared with the resources for 1 multiplication of 2n-bit floating-point data. For example: INT8*INT8+INT8*INT8 resources are multiplexed with FP16*FP16 resources, INT16*INT16+INT16*INT16 resources are multiplexed with FP32*FP32 resources, and INT32*INT32+INT32*INT32 resources are multiplexed with FP64*FP64 resources.
  • Fig. 1 is a flowchart of the operation method for matrix multiplication according to an embodiment of the present invention. As shown in Fig. 1, the flow comprises the following steps:
  • Step S102: splitting the two 2N-bit floating-point data into corresponding sign bits, precision bits and exponent bits, and splitting the four N-bit integer data into corresponding sign bits and precision bits;
  • Step S104: performing a matrix multiplication operation on the two floating-point data by adding exponent bits, XORing sign bits and multiplying precision bits, performing a matrix multiplication operation on the four integer data in pairs by XORing sign bits and multiplying precision bits, and multiplexing the multiplication unit and the addition unit between the matrix multiplication operations of the floating-point data and of the integer data.
  • In step S104, performing the matrix multiplication operation on the two split floating-point data by adding exponent bits, XORing sign bits and multiplying precision bits includes: adding the exponent bits of the first floating-point data to the exponent bits of the second floating-point data, XORing the sign bit of the first floating-point data with the sign bit of the second floating-point data, and multiplying the precision bits of the first floating-point data by the precision bits of the second floating-point data.
  • In step S104, performing the pairwise matrix multiplication operation on the four split integer data by XORing sign bits and multiplying precision bits includes: XORing the sign bit of the first integer data with the sign bit of the second integer data, and multiplying the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; XORing the sign bit of the third integer data with the sign bit of the fourth integer data, and multiplying the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and adding the first operation result to the second operation result.
  • Step S102 includes: splitting each of two 16-bit floating-point data into a 1-bit sign bit, 11 precision bits and 4 exponent bits, and splitting each of four 8-bit integer data into a 1-bit sign bit and 7 precision bits.
  • Matrix multiplication of the floating-point data includes: multiplying the first floating-point data, composed of a 1-bit sign bit and 11 precision bits, by the second floating-point data, composed of a 1-bit sign bit and 11 precision bits, to obtain a 1-bit sign bit and a 22-bit sign-magnitude (original code) value, and then converting it to two's complement (complement code) to obtain a 1-bit sign bit and a 22-bit two's-complement value.
  • Performing the matrix multiplication operation on the four INT8 integer data by XORing sign bits and multiplying precision bits includes: multiplying the first integer data, composed of a 1-bit sign bit and 7 precision bits, by the second integer data, composed of a 1-bit sign bit and 7 precision bits, to obtain a first operation result composed of a 1-bit sign bit and a 14-bit sign-magnitude (original code) value; multiplying the third integer data, composed of a 1-bit sign bit and 7 precision bits, by the fourth integer data, composed of a 1-bit sign bit and 7 precision bits, to obtain a second operation result composed of a 1-bit sign bit and a 14-bit sign-magnitude value; adding the first operation result to the second operation result to obtain a 1-bit sign bit and a 15-bit sign-magnitude value; and then converting it to two's complement (complement code) to obtain a 1-bit sign bit and a 15-bit two's-complement value.
  • The addition operation in the matrix multiplication operation of floating-point data includes: selecting the largest of the split exponents; calculating the difference between each exponent and the largest exponent; right-shifting the product data bits according to that exponent difference; and adding the shifted product data.
  • The resource multiplexing of this embodiment can be applied to, but is not limited to, multiplying a matrix A of n-bit integer data with 4 rows and 4 columns by a matrix B with 4 rows and 4 columns, or multiplying a matrix A of 2n-bit floating-point data with 4 rows and 8 columns by a matrix B with 8 columns and 4 rows.
  • a computing device for matrix multiplication is also provided, and the device is used to realize the above-mentioned embodiments and preferred implementation modes, and what has been explained will not be repeated here.
  • The term "module" may be a combination of software and/or hardware that realizes a predetermined function, for example an arithmetic unit consisting of a multiplier and an adder.
  • Fig. 2 is a structural block diagram of a computing device for matrix multiplication according to an embodiment of the present invention. As shown in Fig. 2, the computing device 100 includes a splitting module 10 and a computing module 20.
  • The splitting module 10 is configured to split two 2N-bit floating-point data into corresponding sign bits, precision bits and exponent bits, and to split four N-bit integer data into corresponding sign bits and precision bits.
  • The operation module 20 is configured to perform a matrix multiplication operation on the two floating-point data by adding exponent bits, XORing sign bits and multiplying precision bits, to perform a matrix multiplication operation on the four integer data in pairs by XORing sign bits and multiplying precision bits, and to multiplex the multiplication unit and the addition unit between the matrix multiplication operations of the floating-point data and of the integer data.
  • Fig. 3 is a structural block diagram of a computing device for matrix multiplication according to another embodiment of the present invention.
  • The computing device 100 includes all modules shown in Fig. 2, and the computing module 10 includes a first computing unit 11 and a second computing unit 12.
  • The first computing unit 11 is configured to add the exponent bits of the first floating-point data to the exponent bits of the second floating-point data, to XOR the sign bit of the first floating-point data with the sign bit of the second floating-point data, and to multiply the precision bits of the first floating-point data by the precision bits of the second floating-point data.
  • The second computing unit 12 is configured to XOR the sign bit of the first integer data with the sign bit of the second integer data, and to multiply the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; to XOR the sign bit of the third integer data with the sign bit of the fourth integer data, and to multiply the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and to add the first operation result to the second operation result.
  • the multiplication and addition operation resources of the accelerator can be reused in the process of matrix multiplication, thereby greatly reducing the chip area of the accelerator and reducing the cost.
  • The above-mentioned modules can be realized by software or hardware. For the latter, this can be done in the following ways, among others: the modules are all located in the same processor; or the modules are located in different processors in any combination.
  • The multiplexing of INT8*INT8+INT8*INT8 resources with FP16*FP16 resources is used as an example below, as shown in Fig. 4.
  • 4 multiplications and 3 additions of the FP16 data type are multiplexed with 8 multiplications and 7 additions of the INT8 data type.
  • Multiplexing in matrix multiplication mainly concerns the multiplications and the additions.
  • the operation flow of this embodiment is mainly divided into three stages: input data preprocessing, multiplication and addition.
  • First, the input data is preprocessed.
  • The input FP16 and INT8 data are first converted into a fixed format, mainly so that the subsequent multiplication can be multiplexed.
  • An FP16 value is split into a fix12 part and an exponent part, and an INT8 value is converted into a 1-bit sign bit followed by a 7-bit sign-magnitude (original code) value.
  • Each multiplication unit takes as input either 2 FP16 values to be multiplied together or 4 INT8 values to be multiplied in pairs.
  • Two FP16 values are multiplied as follows: the two fix12 values, each composed of a 1-bit sign bit and an 11-bit sign-magnitude value, are multiplied to obtain a 1-bit sign bit and a 22-bit sign-magnitude value, which is then converted to two's complement (complement code), finally giving a 1-bit sign bit and a 22-bit two's-complement value.
  • Four INT8 values are multiplied as follows: two fix8 values, each composed of a 1-bit sign bit and a 7-bit sign-magnitude value, are multiplied to obtain a 1-bit sign bit and a 14-bit sign-magnitude value; the two resulting products are then added to obtain a 1-bit sign bit and a 15-bit sign-magnitude value, which is converted to two's complement (complement code), finally giving a 1-bit sign bit and a 15-bit two's-complement value.
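  • The INT8 path described above can be illustrated with a minimal Python sketch. The function names are hypothetical, the 16-bit output width (1 sign bit plus 15 value bits) follows the bit counts in the text, and rejecting -128 is an assumption made here because a 7-bit magnitude cannot hold it; the text does not say how that value is handled.

```python
def split_int8(x):
    """Split a signed 8-bit integer into (sign bit, 7-bit magnitude)."""
    # Assumption: -128 is excluded because its magnitude needs 8 bits.
    if not -127 <= x <= 127:
        raise ValueError("magnitude must fit in 7 bits")
    return (1 if x < 0 else 0), abs(x)


def mul_sign_magnitude(a, b):
    """Multiply two INT8 values as sign XOR plus magnitude product."""
    sa, ma = split_int8(a)
    sb, mb = split_int8(b)
    return sa ^ sb, ma * mb  # magnitude is at most 127 * 127, which fits in 14 bits


def to_twos_complement(sign, magnitude, width):
    """Convert a sign-magnitude pair to a two's-complement bit pattern."""
    value = -magnitude if sign else magnitude
    return value & ((1 << width) - 1)


def int8_dual_mac(a0, b0, a1, b1):
    """Compute a0*b0 + a1*b1 and return it as a 16-bit two's-complement pattern."""
    s0, m0 = mul_sign_magnitude(a0, b0)
    s1, m1 = mul_sign_magnitude(a1, b1)
    total = (-m0 if s0 else m0) + (-m1 if s1 else m1)
    sign, magnitude = (1 if total < 0 else 0), abs(total)  # fits in 15 bits
    return to_twos_complement(sign, magnitude, 16)
```

  • For example, `int8_dual_mac(3, -4, 5, 6)` combines the products -12 and 30, and a negative sum comes back as the corresponding 16-bit two's-complement pattern.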
  • The multiplication of two FP16 values can be divided into 3 operations: exponent addition, sign-bit XOR, and 11-bit precision-bit multiplication. The pairwise multiplication of four INT8 values can be expressed as 4 operations: two sign-bit XORs and two 7-bit precision-bit multiplications.
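  • The three FP16 operations can be sketched in Python as follows. This sketch assumes the standard IEEE 754 binary16 layout (a 5-bit biased exponent and 10 stored fraction bits, with the hidden leading 1 restored to give 11 precision bits); that layout is an assumption here and its exponent width differs from the field widths quoted elsewhere in the text. The function names are hypothetical, only normal numbers are handled, and no normalization or rounding of the result is attempted.

```python
import struct


def split_fp16(x):
    """Return (sign, biased exponent, 11-bit significand) of x encoded as binary16."""
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    if exponent == 0 or exponent == 0x1F:
        raise ValueError("this sketch handles normal numbers only")
    return sign, exponent, fraction | 0x400  # restore the hidden leading 1


def fp16_mul_parts(a, b):
    """Multiply two FP16 values as sign XOR, exponent add, significand multiply."""
    sa, ea, ma = split_fp16(a)
    sb, eb, mb = split_fp16(b)
    return sa ^ sb, ea + eb, ma * mb  # the significand product fits in 22 bits


def parts_to_float(sign, exponent_sum, product):
    """Interpret the unnormalized result as a Python float for checking."""
    # Each operand equals significand * 2**(exponent - 15 - 10), so the
    # product carries a scale of 2**(exponent_sum - 50).
    return (-1.0) ** sign * product * 2.0 ** (exponent_sum - 50)
```

  • Because the product of two binary16 values is exactly representable in a Python float, `parts_to_float(*fp16_mul_parts(a, b))` can be compared directly against `a * b` for inputs that are exactly representable in binary16.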
  • The FP16 multiplication and the INT8 multiplication can be split in the manner shown in Fig. 7.
  • Operation 3 of FP16 (the 11-bit precision-bit multiplication) is divided into the smaller-granularity multiplications 7bit*7bit, 7bit*4bit, 7bit*4bit and 4bit*4bit, and operation D of INT8 is divided into 7bit*4bit and 7bit*4bit.
  • As a result, three multipliers, DSP7*4, DSP7*4 and DSP7*7, can be reused.
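  • A minimal Python sketch of this partial-product splitting is given below. The text only states the sub-multiplier sizes, so the choice of which operand bits form the 7-bit and 4-bit parts (upper 7 bits and lower 4 bits of an 11-bit operand, and a 3-bit upper part zero-extended to 4 bits for the 7-bit operand) is an assumption, as are the function names.

```python
def mul11_split(a, b):
    """11-bit x 11-bit product built from 7x7, 7x4, 4x7 and 4x4 partial products."""
    if not (0 <= a < 1 << 11 and 0 <= b < 1 << 11):
        raise ValueError("operands must fit in 11 bits")
    a_hi, a_lo = a >> 4, a & 0xF  # 7-bit upper part, 4-bit lower part
    b_hi, b_lo = b >> 4, b & 0xF
    return (
        ((a_hi * b_hi) << 8)    # 7bit * 7bit
        + ((a_hi * b_lo) << 4)  # 7bit * 4bit
        + ((a_lo * b_hi) << 4)  # 4bit * 7bit
        + a_lo * b_lo           # 4bit * 4bit
    )


def mul7_split(a, b):
    """7-bit x 7-bit product built from two 7x4 multiplications."""
    if not (0 <= a < 1 << 7 and 0 <= b < 1 << 7):
        raise ValueError("operands must fit in 7 bits")
    b_hi, b_lo = b >> 4, b & 0xF  # 3-bit upper part (zero-extended to 4), 4-bit lower part
    return ((a * b_hi) << 4) + a * b_lo
```

  • Under this reading, the 7x7 and the two 7x4 sub-multipliers of the FP16 product are the ones that can also serve the two INT8 magnitude products: one INT8 product uses the 7x7 multiplier directly and the other uses the two 7x4 multipliers.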
  • The addition resources in FP16 and INT8 matrix multiplication are also fully reused; a method for implementing the addition operations with multiplexed resources is described next.
  • The addition in the FP16 matrix multiplication operation is performed in the manner shown in Fig. 8.
  • The first step is to find the largest exponent among the four split exponents, as shown in Fig. 8.
  • The 4 exponents are compared in pairs and the larger value of each pair is selected, giving 2 exponents; these 2 exponents are then compared, and the larger value is the maximum of the 4 exponents.
  • The second step is to calculate the exponent differences: the difference between the largest exponent found in the first step and each of the 4 exponents is calculated, giving the exponent difference of each of the 4 numbers relative to the largest exponent.
  • The third step is shifting: the 4 product data values are shifted to the right, each by the exponent difference calculated in the second step.
  • The fourth step is addition. First, the 4 numbers are added in pairs to get 2 numbers; then those 2 numbers are added to get 1 number. add_0_3 is the final result.
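  • The four steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the circuit of Fig. 8: products are modelled as signed Python integers, the right shift is an arithmetic shift that simply drops the low bits, and the names `add_0_1` and `add_2_3` are hypothetical (only `add_0_3` appears in the text).

```python
def align_and_add(products, exponents):
    """Add four signed products that each carry their own exponent.

    Returns (add_0_3, e_max); the represented value is add_0_3 * 2**e_max.
    """
    if len(products) != 4 or len(exponents) != 4:
        raise ValueError("expected four products and four exponents")

    # Step 1: pairwise comparison tree to find the largest exponent.
    e_01 = max(exponents[0], exponents[1])
    e_23 = max(exponents[2], exponents[3])
    e_max = max(e_01, e_23)

    # Step 2: difference of each exponent relative to the largest one.
    diffs = [e_max - e for e in exponents]

    # Step 3: right-shift each product by its exponent difference
    # (Python's >> is an arithmetic shift; shifted-out bits are lost).
    shifted = [p >> d for p, d in zip(products, diffs)]

    # Step 4: two-level adder tree.
    add_0_1 = shifted[0] + shifted[1]
    add_2_3 = shifted[2] + shifted[3]
    add_0_3 = add_0_1 + add_2_3
    return add_0_3, e_max
```

  • With products `[8, 16, -4, 3]` and exponents `[2, 1, 3, 0]`, the aligned sum represents 4 * 2**3 = 32, whereas the exact sum is 35; the difference is the truncation introduced by the right shifts.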
  • The addition in the INT8 matrix multiplication operation is performed in the manner shown in Fig. 8.
  • The additions of FP16 and INT8 are implemented by multiplexing the following parts: the 8 adders of the first addition, the 4 adders of the second addition, the 2 adders of the third addition, and the adder of the fourth addition.
  • The matrix multiplication operation unit can thus be reused across several operation precisions, and area consumption is greatly reduced while the function is preserved. That is, with limited chip area, matrix multiplication operations of more precisions can be realized, so that artificial intelligence accelerators can support more precisions, which improves the computing power of the accelerator and broadens its application scenarios.
  • Embodiments of the present invention also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is set to execute the steps in any one of the above method embodiments when running.
  • The above-mentioned computer-readable storage medium may include, but is not limited to: a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media that can store computer programs.
  • An embodiment of the present invention also provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to perform the steps in any one of the above method embodiments.
  • the electronic device may further include a transmission device and an input and output device, wherein the transmission device is connected to the processor, and the input and output device is connected to the processor.
  • Each module or each step of the present invention described above can be realized by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. They can be implemented in program code executable by a computing device, and can therefore be stored in a storage device and executed by a computing device. In some cases, the steps shown or described can be executed in an order different from that given here, or they can be fabricated into individual integrated circuit modules, or multiple modules or steps among them can be fabricated into a single integrated circuit module. As such, the present invention is not limited to any specific combination of hardware and software.

Abstract

The embodiments of the present invention provide an operation method and apparatus for matrix multiplication. The operation method comprises: respectively splitting two pieces of floating-point-type data of 2N bits into corresponding sign bits, precision bits and index bits, and respectively splitting four pieces of integer-type data of N bits into corresponding sign bits and precision bits; and performing a matrix multiplication operation on the two pieces of floating-point-type data by means of the addition of the index bits, an XOR operation of the sign bits and the multiplication of the precision bits, performing a matrix multiplication operation on every two pieces of the four pieces of integer-type data by means of an XOR operation of the sign bits and the multiplication of the precision bits, and multiplexing a multiplication unit and an addition unit during the matrix multiplication operation of the floating-point-type data and that of the integer-type data. In the present invention, input data of different data types is split, such that multiplication and addition operation resources of an accelerator can be multiplexed during a matrix multiplication process, thereby greatly reducing the area of a chip of the accelerator, and also reducing the cost.

Description

Operation method and apparatus for matrix multiplication
Technical Field
Embodiments of the present invention relate to the field of matrix multiplication, and in particular to an operation method and apparatus for matrix multiplication.
Background
With advances in technology, neural networks in artificial intelligence place ever greater demands on the convolution and fully connected computing capability of accelerators, and both convolution and fully connected operations can be converted into matrix multiplication. Matrix multiplication consists of multiplications and additions, and the multiply and add throughput of existing accelerators has risen from the GOPS range to the TOPS range. Raising that throughput requires more arithmetic units. Chip designers, however, need to support more arithmetic units with as little area and cost as possible in order to achieve greater computing power.
Existing AI accelerators mainly support input data types such as INT8, INT16, INT32, FP16, FP32 and FP64. An AI accelerator that accepts all six of these data types as input needs six independent arithmetic units, one for each input type. The drawback is that a given neural network generally uses a single input data type, so only one kind of arithmetic unit is working at any moment, yet several independent kinds must be provided, which increases chip area and cost.
Summary of the Invention
Embodiments of the present invention provide an operation method and apparatus for matrix multiplication, to at least solve the problem in the related art that an accelerator needs independent arithmetic units for multiple input data types, which increases chip area and cost.
According to one embodiment of the present invention, an operation method for matrix multiplication is provided, comprising: splitting each of two 2N-bit floating-point data items into corresponding sign bits, precision bits and exponent bits, and splitting each of four N-bit integer data items into corresponding sign bits and precision bits; performing a matrix multiplication operation on the two floating-point data items by adding the exponent bits, XORing the sign bits and multiplying the precision bits, performing a matrix multiplication operation on the four integer data items in pairs by XORing the sign bits and multiplying the precision bits, and reusing a multiplication unit and an addition unit across the matrix multiplication operations of the floating-point data and the integer data.
In an exemplary embodiment, performing the matrix multiplication operation on the two floating-point data items by adding the exponent bits, XORing the sign bits and multiplying the precision bits comprises: adding the exponent bits of the first floating-point data to the exponent bits of the second floating-point data, XORing the sign bit of the first floating-point data with the sign bit of the second floating-point data, and multiplying the precision bits of the first floating-point data by the precision bits of the second floating-point data.
In an exemplary embodiment, performing the matrix multiplication operation on the four integer data items in pairs by XORing the sign bits and multiplying the precision bits comprises: XORing the sign bit of the first integer data with the sign bit of the second integer data, and multiplying the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; XORing the sign bit of the third integer data with the sign bit of the fourth integer data, and multiplying the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and adding the first operation result and the second operation result.
In an exemplary embodiment, splitting the two 2N-bit floating-point data items into corresponding sign bits, precision bits and exponent bits, and splitting the four N-bit integer data items into corresponding sign bits and precision bits, comprises: splitting each of two 16-bit floating-point data items into a 1-bit sign bit, 11 precision bits and 4 exponent bits, and splitting each of four 8-bit integer data items into a 1-bit sign bit and 7 precision bits.
In an exemplary embodiment, performing the matrix multiplication operation on the two floating-point data items by adding the exponent bits, XORing the sign bits and multiplying the precision bits comprises: multiplying first floating-point data consisting of a 1-bit sign bit and 11 precision bits by second floating-point data consisting of a 1-bit sign bit and 11 precision bits to obtain a 1-bit sign bit and a 22-bit sign-magnitude value, and then converting the result to two's complement to obtain a 1-bit sign bit and a 22-bit two's complement value.
In an exemplary embodiment, performing the matrix multiplication operation on the four integer data items by XORing the sign bits and multiplying the precision bits comprises: multiplying first integer data consisting of a 1-bit sign bit and 7 precision bits by second integer data consisting of a 1-bit sign bit and 7 precision bits to obtain a first operation result consisting of a 1-bit sign bit and a 14-bit sign-magnitude value; multiplying third integer data consisting of a 1-bit sign bit and 7 precision bits by fourth integer data consisting of a 1-bit sign bit and 7 precision bits to obtain a second operation result consisting of a 1-bit sign bit and a 14-bit sign-magnitude value; and adding the first operation result and the second operation result to obtain a 1-bit sign bit and a 15-bit sign-magnitude value, which is then converted from sign-magnitude form to two's complement to obtain a 1-bit sign bit and a 15-bit two's complement value.
In an exemplary embodiment, the addition in the matrix multiplication operation of the floating-point data comprises: selecting the maximum from the split-out exponents; computing the difference between each exponent and the maximum; right-shifting the product data bits by that difference; and adding the shifted product data.
According to another embodiment of the present invention, an operation apparatus for matrix multiplication is provided, comprising: a splitting module, configured to split each of two 2N-bit floating-point data items into corresponding sign bits, precision bits and exponent bits, and to split each of four N-bit integer data items into corresponding sign bits and precision bits; and an operation module, configured to perform a matrix multiplication operation on the two floating-point data items by adding the exponent bits, XORing the sign bits and multiplying the precision bits, to perform a matrix multiplication operation on the four integer data items in pairs by XORing the sign bits and multiplying the precision bits, and to reuse a multiplication unit and an addition unit across the matrix multiplication operations of the floating-point data and the integer data.
In an exemplary embodiment, the operation module comprises: a first operation unit, configured to add the exponent bits of the first floating-point data to the exponent bits of the second floating-point data, to XOR the sign bit of the first floating-point data with the sign bit of the second floating-point data, and to multiply the precision bits of the first floating-point data by the precision bits of the second floating-point data.
In an exemplary embodiment, the operation module further comprises: a second operation unit, configured to XOR the sign bit of the first integer data with the sign bit of the second integer data, and to multiply the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; to XOR the sign bit of the third integer data with the sign bit of the fourth integer data, and to multiply the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and to add the first operation result and the second operation result.
According to yet another embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps in any one of the above method embodiments.
According to yet another embodiment of the present invention, an electronic apparatus is also provided, comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
In the embodiments of the present invention, input data of different data types is split so that the multiplication and addition resources of the accelerator can be reused during matrix multiplication, which greatly reduces the chip area of the accelerator and lowers its cost.
Brief Description of the Drawings
FIG. 1 is a flowchart of an operation method for matrix multiplication according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of an operation apparatus for matrix multiplication according to an embodiment of the present invention;
FIG. 3 is a structural block diagram of an operation apparatus for matrix multiplication according to another embodiment of the present invention;
FIG. 4 is a schematic diagram of resource reuse between 4 pairs of FP16 multiply-adds and 8 pairs of INT8 multiply-adds according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the preprocessing performed before FP16 and INT8 multiplication according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of FP16 and INT8 multiplications implemented separately and converted to two's complement form according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the splitting and reuse of FP16 and INT8 multiplication according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the reuse of fp16 and int8 addition according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the drawings.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence.
For each input data type that an accelerator supports, the accelerator usually contains multiple matrix multiplication units for that type. Because a neural network computation usually uses only one input data type, only one kind of matrix multiplication unit is working at any moment, yet the units for all of the supported input data types must be present in the accelerator.
To solve the above problem, an embodiment of the present invention provides an operation method for matrix multiplication. The core of a matrix multiplication operation is its adders and multipliers. The operation method provided in this embodiment mainly reuses the multipliers and adders of the matrix multiplication and, while still implementing the required functions, can greatly reduce the area consumed.
In this embodiment, multiplication is reused by multiplying data after it has been split. The reuse principle is that 2 multiplications and 1 addition of N-bit integer data share resources with 1 multiplication of 2N-bit floating-point data. For example, the resources of INT8*INT8+INT8*INT8 are shared with those of FP16*FP16, the resources of INT16*INT16+INT16*INT16 are shared with those of FP32*FP32, and the resources of INT32*INT32+INT32*INT32 are shared with those of FP64*FP64.
FIG. 1 is a flowchart of an operation method for matrix multiplication according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:
Step S102: split each of two 2N-bit floating-point data items into corresponding sign bits, precision bits and exponent bits, and split each of four N-bit integer data items into corresponding sign bits and precision bits;
Step S104: perform a matrix multiplication operation on the two floating-point data items by adding the exponent bits, XORing the sign bits and multiplying the precision bits, perform a matrix multiplication operation on the four integer data items in pairs by XORing the sign bits and multiplying the precision bits, and reuse a multiplication unit and an addition unit across the matrix multiplication operations of the floating-point data and the integer data.
In an exemplary embodiment, performing the matrix multiplication operation in step S104 on the two split floating-point data items by adding the exponent bits, XORing the sign bits and multiplying the precision bits comprises: adding the exponent bits of the first floating-point data to the exponent bits of the second floating-point data, XORing the sign bit of the first floating-point data with the sign bit of the second floating-point data, and multiplying the precision bits of the first floating-point data by the precision bits of the second floating-point data.
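The three field-level operations of the floating-point case can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the circuit of the embodiment: the `(sign, exponent, precision)` tuple layout and the function names are hypothetical, and the bias of 15 and 10 fraction bits used to check the result are borrowed from the IEEE 754 binary16 format rather than taken from the text.

```python
def fp_multiply_fields(a, b):
    """Multiply two values given as (sign, exponent, precision) tuples.

    Only three operations are needed: XOR of the signs, addition of the
    exponents, and integer multiplication of the precision fields.
    """
    sign_a, exp_a, prec_a = a
    sign_b, exp_b, prec_b = b
    return sign_a ^ sign_b, exp_a + exp_b, prec_a * prec_b


def fields_to_float(sign, exp_sum, product, bias=15, frac_bits=10):
    """Interpret the raw product fields as a real number (for checking only).

    Assumes each input encoded (-1)**sign * precision * 2**(exp - bias - frac_bits),
    which is the binary16 convention for normal numbers (an assumption here).
    """
    scale = 2.0 ** (exp_sum - 2 * bias - 2 * frac_bits)
    return (-1.0) ** sign * product * scale


# 1.5 is (0, 15, 1536) and -2.5 is (1, 16, 1280) under the assumed encoding.
fields = fp_multiply_fields((0, 15, 1536), (1, 16, 1280))
value = fields_to_float(*fields)  # 1.5 * -2.5
```

Normalization and rounding of the product are deliberately left out; the sketch only shows that the multiplication itself reduces to an integer multiply plus two cheap side operations.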
In an exemplary embodiment, performing the matrix multiplication operation in step S104 on the four split integer data items in pairs by XORing the sign bits and multiplying the precision bits comprises: XORing the sign bit of the first integer data with the sign bit of the second integer data, and multiplying the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; XORing the sign bit of the third integer data with the sign bit of the fourth integer data, and multiplying the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and adding the first operation result and the second operation result.
In an exemplary embodiment in which the resources of INT8*INT8+INT8*INT8 are shared with the resources of FP16*FP16, step S102 comprises: splitting each of two 16-bit floating-point data items into a 1-bit sign bit, 11 precision bits and 4 exponent bits, and splitting each of four 8-bit integer data items into a 1-bit sign bit and 7 precision bits.
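The splitting step can be illustrated with the short Python sketch below. Several points are assumptions rather than statements from the text: the 16-bit pattern is decoded using the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits, 10 stored fraction bits, with the implicit leading 1 restored to give 11 precision bits), so the exponent width in the sketch differs from the 4-bit figure quoted above; the integer is taken to lie in [-127, 127] so that its magnitude fits in 7 bits; and both function names are hypothetical.

```python
def split_fp16(bits):
    """Split a 16-bit pattern into (sign, exponent, precision).

    Assumed layout: IEEE 754 binary16. The 11-bit precision field is the
    10 stored fraction bits plus the implicit leading 1 of normal numbers.
    """
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F
    fraction = bits & 0x3FF
    hidden = 1 if exponent != 0 else 0  # subnormals have no implicit 1
    precision = (hidden << 10) | fraction
    return sign, exponent, precision


def split_int8(value):
    """Split a signed integer into (sign, 7-bit magnitude).

    -128 has no 7-bit magnitude, so it is rejected in this sketch.
    """
    if not -127 <= value <= 127:
        raise ValueError("magnitude does not fit in 7 bits")
    return (1 if value < 0 else 0), abs(value)
```

For instance, the pattern `0x3C00` (the value 1.0 in binary16) splits into sign 0, exponent 15 and precision 1024, and the integer -5 splits into sign 1 and magnitude 5.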
In an exemplary embodiment in which the resources of INT8*INT8+INT8*INT8 are shared with the resources of FP16*FP16, performing the matrix multiplication operation on the two split floating-point data items by adding the exponent bits, XORing the sign bits and multiplying the precision bits comprises: multiplying first floating-point data consisting of a 1-bit sign bit and 11 precision bits by second floating-point data consisting of a 1-bit sign bit and 11 precision bits to obtain a 1-bit sign bit and a 22-bit sign-magnitude value, and then converting the result to two's complement to obtain a 1-bit sign bit and a 22-bit two's complement value.
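The bit widths in this step can be checked with a small Python sketch. It is an illustration only: the function names are hypothetical, and the two's complement result is modelled as a 23-bit unsigned pattern (1 sign bit plus 22 value bits), which is one plausible reading of the format described above.

```python
def fix12_multiply(a, b):
    """Multiply two (sign, 11-bit magnitude) values in sign-magnitude form.

    The result is (sign, magnitude) with a magnitude of at most 22 bits.
    """
    (sign_a, mag_a), (sign_b, mag_b) = a, b
    if mag_a >> 11 or mag_b >> 11:
        raise ValueError("magnitude does not fit in 11 bits")
    return sign_a ^ sign_b, mag_a * mag_b


def to_twos_complement(sign, magnitude, magnitude_bits):
    """Encode a sign-magnitude value as a (magnitude_bits + 1)-bit pattern."""
    width = magnitude_bits + 1
    value = -magnitude if sign else magnitude
    return value & ((1 << width) - 1)


sign, magnitude = fix12_multiply((0, 1536), (1, 1280))
pattern = to_twos_complement(sign, magnitude, 22)
```

The largest possible product, 2047 * 2047, is below 2**22, which is why 22 magnitude bits are enough for the product of two 11-bit precision fields.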
In an exemplary embodiment, performing the matrix multiplication operation on the four INT8 integer data items by XORing the sign bits and multiplying the precision bits comprises: multiplying first integer data consisting of a 1-bit sign bit and 7 precision bits by second integer data consisting of a 1-bit sign bit and 7 precision bits to obtain a first operation result consisting of a 1-bit sign bit and a 14-bit sign-magnitude value; multiplying third integer data consisting of a 1-bit sign bit and 7 precision bits by fourth integer data consisting of a 1-bit sign bit and 7 precision bits to obtain a second operation result consisting of a 1-bit sign bit and a 14-bit sign-magnitude value; and adding the first operation result and the second operation result to obtain a 1-bit sign bit and a 15-bit sign-magnitude value, which is then converted from sign-magnitude form to two's complement to obtain a 1-bit sign bit and a 15-bit two's complement value.
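A Python sketch of the integer path follows, assuming inputs already split into `(sign, 7-bit magnitude)` pairs. The helper names are hypothetical, and the sign-magnitude addition is modelled by going through ordinary signed integers, which hardware would do differently; the point is only to show the widths at each stage (14-bit products, a 15-bit sum, a 16-bit two's complement output).

```python
def sm_multiply(a, b):
    """(sign, 7-bit magnitude) x (sign, 7-bit magnitude) -> (sign, 14-bit magnitude)."""
    return a[0] ^ b[0], a[1] * b[1]


def sm_add(x, y):
    """Add two sign-magnitude values; the result needs one more magnitude bit."""
    vx = -x[1] if x[0] else x[1]
    vy = -y[1] if y[0] else y[1]
    total = vx + vy
    return (1 if total < 0 else 0), abs(total)


def to_twos_complement(sm, width):
    """Encode a (sign, magnitude) pair as a width-bit two's complement pattern."""
    value = -sm[1] if sm[0] else sm[1]
    return value & ((1 << width) - 1)


def int8_dual_multiply_add(a, b, c, d):
    """Compute a*b + c*d and return it as a 16-bit two's complement pattern."""
    first = sm_multiply(a, b)    # 1 sign bit + 14 magnitude bits
    second = sm_multiply(c, d)   # 1 sign bit + 14 magnitude bits
    return to_twos_complement(sm_add(first, second), 16)
```

With `a = (1, 3)`, `b = (0, 5)`, `c = (0, 2)`, `d = (0, 7)` the sum is -15 + 14 = -1, which is the pattern `0xFFFF`. The extreme case 2 * 127 * 127 = 32258 still fits in 15 magnitude bits.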
In an exemplary embodiment, the addition in the matrix multiplication operation of the floating-point data comprises: selecting the maximum from the split-out exponents; computing the difference between each exponent and the maximum; right-shifting the product data bits by that difference; and adding the shifted product data.
The resource reuse method of this embodiment can be applied to, but is not limited to, N-bit integer data with a matrix A of 4 rows and 4 columns multiplied by a matrix B of 4 rows and 4 columns, or 2N-bit floating-point data with a matrix A of 4 rows and 8 columns multiplied by a matrix B of 8 columns and 4 rows.
This embodiment also provides an operation apparatus for matrix multiplication. The apparatus is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function, for example an arithmetic unit composed of multipliers and adders.
FIG. 2 is a structural block diagram of an operation apparatus for matrix multiplication according to an embodiment of the present invention. As shown in FIG. 2, the operation apparatus 100 includes a splitting module 10 and an operation module 20.
The splitting module 10 is configured to split each of two 2N-bit floating-point data items into corresponding sign bits, precision bits and exponent bits, and to split each of four N-bit integer data items into corresponding sign bits and precision bits.
The operation module 20 is configured to perform a matrix multiplication operation on the two floating-point data items by adding the exponent bits, XORing the sign bits and multiplying the precision bits, to perform a matrix multiplication operation on the four integer data items in pairs by XORing the sign bits and multiplying the precision bits, and to reuse a multiplication unit and an addition unit across the matrix multiplication operations of the floating-point data and the integer data.
FIG. 3 is a structural block diagram of an operation apparatus for matrix multiplication according to an embodiment of the present invention. As shown in FIG. 3, the operation apparatus 100 includes all of the modules shown in FIG. 2, and in addition the operation module 20 includes a first operation unit 11 and a second operation unit 12.
The first operation unit 11 is configured to add the exponent bits of the first floating-point data to the exponent bits of the second floating-point data, to XOR the sign bit of the first floating-point data with the sign bit of the second floating-point data, and to multiply the precision bits of the first floating-point data by the precision bits of the second floating-point data.
The second operation unit 12 is configured to XOR the sign bit of the first integer data with the sign bit of the second integer data, and to multiply the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; to XOR the sign bit of the third integer data with the sign bit of the fourth integer data, and to multiply the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and to add the first operation result and the second operation result.
In the operation apparatus provided in this embodiment, input data of different data types is split so that the multiplication and addition resources of the accelerator can be reused during matrix multiplication, which greatly reduces the chip area of the accelerator and lowers its cost.
It should be noted that each of the above modules can be implemented in software or hardware. In the latter case, this can be done in, but is not limited to, the following ways: all of the modules are located in the same processor; or the modules are located in different processors in any combination.
To aid understanding of the present invention, the sharing of INT8*INT8+INT8*INT8 resources with FP16*FP16 resources is described below as an example, as shown in FIG. 4. In the figure, the 4 multiplications and 3 additions of the FP16 data type share resources with the 8 multiplications and 7 additions of the INT8 data type.
In this embodiment, the parts of the matrix multiplication that are reused are mainly the multiplications and additions. The operation flow has three stages: input data preprocessing, multiplication, and addition.
First, the input data is preprocessed.
Specifically, in this embodiment, the input fp16 and int8 data are first converted into fixed formats, mainly so that the subsequent multiplications can share hardware. As shown in FIG. 5, each fp16 value is split into a fix12 part and an exponent part, and each int8 value is converted into a format consisting of a 1-bit sign bit and a 7-bit sign-magnitude value.
Second, each multiplication unit receives either 2 fp16 values to be multiplied together or 4 int8 values to be multiplied in pairs. The two fp16 values are multiplied as follows: the two fix12 values, each consisting of a 1-bit sign bit and an 11-bit sign-magnitude value, are multiplied to obtain a 1-bit sign bit and a 22-bit sign-magnitude value, which is then converted to two's complement, finally giving a 1-bit sign bit and a 22-bit two's complement value. The four int8 values are multiplied as follows: two fix8 values, each consisting of a 1-bit sign bit and a 7-bit sign-magnitude value, are multiplied to obtain a 1-bit sign bit and a 14-bit sign-magnitude value; the two resulting products, each a 1-bit sign bit and a 14-bit sign-magnitude value, are added to obtain a 1-bit sign bit and a 15-bit sign-magnitude value, which is then converted from sign-magnitude form to two's complement, finally giving a 1-bit sign bit and a 15-bit two's complement value.
In another embodiment, if the FP16 and INT8 multiplications are processed separately, they are implemented as shown in FIG. 6, where the products of two INT8 multiplications are finally added to reduce the number of output values.
As shown in FIG. 6, multiplying two FP16 values can be split into 3 operations: an exponent addition, a sign-bit XOR, and an 11-bit precision multiplication. Multiplying four INT8 values in pairs can be expressed as 4 operations: 2 sign-bit XORs and 2 7-bit precision multiplications.
In this embodiment, in order to fully reuse the multiplication resources of fp16 and int8 at this point, the following method of implementing multiplication with shared resources is proposed.
In this embodiment, the fp16 multiplication and the int8 multiplication can be split in the manner shown in FIG. 7. Specifically, referring to the operations labelled in FIG. 6, operation 3 of fp16 is split into finer-grained multiplications of 7bit*7bit, 7bit*4bit, 7bit*4bit and 4bit*4bit, and operation D of int8 is split into the form 7bit*4bit and 7bit*4bit. In this way, three multipliers, DSP7*4, DSP7*4 and DSP7*7, can ultimately be reused.
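The arithmetic behind this decomposition can be sketched in Python. How FIG. 7 actually assigns bits to each partial product is not visible in the text, so the split used here (high 4 bits and low 7 bits of an 11-bit operand, and high 3 bits and low 4 bits of a 7-bit operand) is an assumption, as are the function names. The sketch only demonstrates that an 11-bit multiplication can be rebuilt from one 7x7, two 7x4 and one 4x4 partial product, and a 7-bit multiplication from two partial products that each fit a 7x4 multiplier.

```python
def mul11_by_parts(a, b):
    """11-bit x 11-bit product from 7x7, 7x4, 4x7 and 4x4 partial products."""
    a_hi, a_lo = a >> 7, a & 0x7F   # assumed split: 4 high bits, 7 low bits
    b_hi, b_lo = b >> 7, b & 0x7F
    p77 = a_lo * b_lo               # 7bit*7bit
    p74 = a_lo * b_hi               # 7bit*4bit
    p47 = a_hi * b_lo               # 4bit*7bit
    p44 = a_hi * b_hi               # 4bit*4bit
    return (p44 << 14) + ((p74 + p47) << 7) + p77


def mul7_by_parts(a, b):
    """7-bit x 7-bit product from two partial products that fit a 7x4 multiplier."""
    b_hi, b_lo = b >> 4, b & 0xF    # assumed split: 3 high bits, 4 low bits
    return ((a * b_hi) << 4) + a * b_lo
```

Both functions reproduce the ordinary product for every operand in range, for example `mul11_by_parts(1536, 1280) == 1536 * 1280`.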
In this embodiment, the resources that implement addition in the fp16 and int8 matrix multiplications are also fully reused, and the following method of implementing addition with shared resources is therefore proposed.
In this embodiment, the addition in the fp16 matrix multiplication operation is performed in the manner shown in FIG. 8.
Step 1, find the maximum exponent: the maximum is found among the 4 split-out exponents, as shown in FIG. 8. The 4 exponents are compared in pairs and the larger of each pair is kept, giving 2 exponents; these 2 exponents are then compared, and the larger is the maximum of the 4 exponents.
Step 2, compute the exponent differences: the difference between the maximum exponent from step 1 and each of the 4 exponents is computed, giving the difference of each of the 4 numbers relative to the maximum exponent;
Step 3, shift: the 4 product data values are shifted right, each by the difference computed in step 2;
Step 4, add: the 4 numbers are first added in pairs to give 2 numbers, and these 2 numbers are then added to give 1 number; add_0_3 is the final result.
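The four steps above can be sketched in Python as follows. This is an illustration under assumptions: the products are modelled as ordinary signed integers (standing in for the two's complement values), the function name and the intermediate names `add_0_1` and `add_2_3` are hypothetical, and only `add_0_3` comes from the text. Python's `>>` on a negative integer is an arithmetic shift, which matches shifting a two's complement value; the bits shifted out are simply discarded.

```python
def aligned_sum4(exponents, products):
    """Align 4 products to the largest exponent and sum them pairwise."""
    # Step 1: pairwise comparison tree for the maximum exponent.
    e_01 = max(exponents[0], exponents[1])
    e_23 = max(exponents[2], exponents[3])
    e_max = max(e_01, e_23)
    # Step 2: difference of each exponent from the maximum.
    diffs = [e_max - e for e in exponents]
    # Step 3: right-shift each product by its difference.
    shifted = [p >> d for p, d in zip(products, diffs)]
    # Step 4: two levels of pairwise addition.
    add_0_1 = shifted[0] + shifted[1]
    add_2_3 = shifted[2] + shifted[3]
    add_0_3 = add_0_1 + add_2_3
    return e_max, add_0_3
```

For exponents `[3, 1, 2, 3]` and products `[8, 8, -4, 16]`, the shifted products are `[8, 2, -2, 16]` and the function returns `(3, 24)`.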
In this embodiment, the addition in the int8 matrix multiplication operation is performed in the manner shown in FIG. 8.
That is, the four numbers are first added in pairs to give two numbers, and these two numbers are then added to give one number; add_0_7 is the final result.
In this embodiment, the fp16 and int8 additions reuse the following parts: the 8 adders of the first addition stage, the 4 adders of the second addition stage, the 2 adders of the third addition stage, and the adder of the fourth addition stage.
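The shared pairwise addition can be pictured as a generic adder tree. The sketch below is an assumption-laden illustration: it accepts any power-of-two number of inputs, because the text describes four inputs per addition while the listed stages (8, 4, 2, and 1 adders) correspond to a deeper tree. The name adder_tree is hypothetical.

```python
def adder_tree(values):
    """Sum a power-of-two number of inputs by repeated pairwise addition."""
    level = list(values)
    n = len(level)
    assert n > 0 and n & (n - 1) == 0, "input count must be a power of two"
    while len(level) > 1:
        # Each pass models one adder stage: adjacent pairs are summed.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]
```

With 16 inputs the loop runs four passes using 8, 4, 2, and 1 additions, matching the stage sizes listed above; with 4 inputs it reduces to the two-step addition described for each precision.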
In this embodiment, multiple operation precisions can share the matrix multiplication unit, which greatly reduces area consumption while preserving functionality. That is, with limited chip area, matrix multiplication can be implemented at more precisions, so that the artificial intelligence accelerator can support more precisions. This improves the computing power of the artificial intelligence accelerator and broadens its application scenarios.
An embodiment of the present invention further provides a computer-readable storage medium storing a computer program, wherein the computer program is configured to perform, when run, the steps of any one of the above method embodiments.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing a computer program.
An embodiment of the present invention further provides an electronic device comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps of any one of the above method embodiments.
In an exemplary embodiment, the electronic device may further comprise a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.
For specific examples of this embodiment, reference may be made to the examples described in the above embodiments and exemplary implementations; they are not repeated here.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. They can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described can be performed in an order different from that given here. Alternatively, they can each be fabricated as an individual integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

  1. An operation method for matrix multiplication, comprising:
    splitting each of two 2N-bit floating-point data into corresponding sign bits, precision bits, and exponent bits, and splitting each of four N-bit integer data into corresponding sign bits and precision bits;
    performing a matrix multiplication operation on the two floating-point data by adding the exponent bits, XORing the sign bits, and multiplying the precision bits, performing a matrix multiplication operation on the four integer data in pairs by XORing the sign bits and multiplying the precision bits, and reusing a multiplication unit and an addition unit in the matrix multiplication operations of the floating-point data and the integer data.
  2. The method according to claim 1, wherein performing the matrix multiplication operation on the two floating-point data by adding the exponent bits, XORing the sign bits, and multiplying the precision bits comprises:
    adding the exponent bits of first floating-point data to the exponent bits of second floating-point data, XORing the sign bit of the first floating-point data with the sign bit of the second floating-point data, and multiplying the precision bits of the first floating-point data by the precision bits of the second floating-point data.
  3. The method according to claim 1, wherein performing the matrix multiplication operation on the four integer data in pairs by XORing the sign bits and multiplying the precision bits comprises:
    XORing the sign bit of first integer data with the sign bit of second integer data, and multiplying the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits;
    XORing the sign bit of third integer data with the sign bit of fourth integer data, and multiplying the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits;
    adding the first operation result and the second operation result.
  4. The method according to claim 1, wherein splitting the two 2N-bit floating-point data into corresponding sign bits, precision bits, and exponent bits, and splitting the four N-bit integer data into corresponding sign bits and precision bits, comprises:
    splitting each of two 16-bit floating-point data into a 1-bit sign bit, 11 precision bits, and 4 exponent bits, and splitting each of four 8-bit integer data into a 1-bit sign bit and 7 precision bits.
  5. The method according to claim 4, wherein performing the matrix multiplication operation on the two floating-point data by adding the exponent bits, XORing the sign bits, and multiplying the precision bits comprises:
    multiplying first floating-point data consisting of a 1-bit sign bit and 11 precision bits by second floating-point data consisting of a 1-bit sign bit and 11 precision bits to obtain a 1-bit sign bit and a 22-bit sign-magnitude code, and then converting it to two's complement to obtain a 1-bit sign bit and a 22-bit two's complement code.
  6. The method according to claim 4, wherein performing the matrix multiplication operation on the four integer data by XORing the sign bits and multiplying the precision bits comprises:
    multiplying first integer data consisting of a 1-bit sign bit and 7 precision bits by second integer data consisting of a 1-bit sign bit and 7 precision bits to obtain a first operation result consisting of a 1-bit sign bit and a 14-bit sign-magnitude code;
    multiplying third integer data consisting of a 1-bit sign bit and 7 precision bits by fourth integer data consisting of a 1-bit sign bit and 7 precision bits to obtain a second operation result consisting of a 1-bit sign bit and a 14-bit sign-magnitude code;
    adding the results of the first multiplication operation and the second multiplication operation to obtain a 1-bit sign bit and a 15-bit sign-magnitude code, and then converting it from sign-magnitude to two's complement to obtain a 1-bit sign bit and a 15-bit two's complement code.
  7. The method according to claim 4, wherein the addition operation in the matrix multiplication operation of the floating-point data comprises:
    selecting the maximum from the split exponents;
    calculating the exponent difference of each exponent relative to the maximum;
    right-shifting the product data bits according to the exponent differences;
    adding the shifted product data.
  8. An operation apparatus for matrix multiplication, comprising:
    a splitting module, configured to split each of two 2N-bit floating-point data into corresponding sign bits, precision bits, and exponent bits, and to split each of four N-bit integer data into corresponding sign bits and precision bits;
    an operation module, configured to perform a matrix multiplication operation on the two floating-point data by adding the exponent bits, XORing the sign bits, and multiplying the precision bits, to perform a matrix multiplication operation on the four integer data in pairs by XORing the sign bits and multiplying the precision bits, and to reuse a multiplication unit and an addition unit in the matrix multiplication operations of the floating-point data and the integer data.
  9. The apparatus according to claim 8, wherein the operation module comprises:
    a first operation unit, configured to add the exponent bits of first floating-point data to the exponent bits of second floating-point data, to XOR the sign bit of the first floating-point data with the sign bit of the second floating-point data, and to multiply the precision bits of the first floating-point data by the precision bits of the second floating-point data.
  10. The apparatus according to claim 8, wherein the operation module further comprises:
    a second operation unit, configured to XOR the sign bit of first integer data with the sign bit of second integer data and multiply the precision bits of the first integer data by the precision bits of the second integer data, to obtain a first operation result comprising a sign bit and precision bits; to XOR the sign bit of third integer data with the sign bit of fourth integer data and multiply the precision bits of the third integer data by the precision bits of the fourth integer data, to obtain a second operation result comprising a sign bit and precision bits; and to add the first operation result and the second operation result.
  11. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
  12. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2022/129619 2021-11-03 2022-11-03 Operation method and apparatus for matrix multiplication WO2023078364A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111294888.0 2021-11-03
CN202111294888.0A CN116090513A (en) 2021-11-03 2021-11-03 Operation method and device for matrix multiplication

Publications (1)

Publication Number Publication Date
WO2023078364A1 true WO2023078364A1 (en) 2023-05-11

Family

ID=86208771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/129619 WO2023078364A1 (en) 2021-11-03 2022-11-03 Operation method and apparatus for matrix multiplication

Country Status (2)

Country Link
CN (1) CN116090513A (en)
WO (1) WO2023078364A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986264A (en) * 2010-11-25 2011-03-16 中国人民解放军国防科学技术大学 Multifunctional floating-point multiply and add calculation device for single instruction multiple data (SIMD) vector microprocessor
CN108287681A (en) * 2018-02-14 2018-07-17 中国科学院电子学研究所 A kind of single-precision floating point fusion point multiplication operation unit
CN113157247A (en) * 2021-04-23 2021-07-23 西安交通大学 Reconfigurable integer-floating point multiplier

Also Published As

Publication number Publication date
CN116090513A (en) 2023-05-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22889385

Country of ref document: EP

Kind code of ref document: A1