WO2019057093A1 - Multiplication circuit, system on chip, and electronic device - Google Patents

Multiplication circuit, system on chip, and electronic device

Publication number
WO2019057093A1
Authority
WO
WIPO (PCT)
Prior art keywords
bit
circuit
data
value
bits
Prior art date
Application number
PCT/CN2018/106559
Other languages
English (en)
French (fr)
Inventor
徐斌
王开兴
田清霖
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP18858914.7A (patent EP3674883B1)
Publication of WO2019057093A1
Priority to US16/822,720 (patent US11249721B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/4806 Computations with complex numbers
    • G06F7/4812 Complex multiplication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G06F7/5235 Multiplying only using indirect methods, e.g. quarter square method, via logarithmic domain
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/487 Multiplying; Dividing

Definitions

  • the present application relates to the field of data processing and, more particularly, to multiplying circuits, system on a chip, and electronic devices.
  • the present application provides a multiplying circuit, a system on chip, and an electronic device, which can reduce the overhead of data conversion between a linear domain and a logarithmic domain, and can improve the speed of various operations based on multiplication.
  • the present application provides a multiplication circuit for multiplying two data A and B, including: an addition sub-circuit for acquiring the log domain data a and the log domain data b corresponding to A and B respectively, and adding a and b to obtain c, where c includes an integer part and a fractional part; an exponential operation sub-circuit for performing an exponential operation with base 2 and with the fractional part of c as the exponent, to obtain an exponential operation result; a shift sub-circuit for shifting the exponential operation result according to the integer part of c to obtain a shift result; and an output sub-circuit for outputting the product of A and B according to the signs of a and b in combination with the shift result.
  • the log domain data a and the log domain data b are obtained by taking the base-2 logarithm of the absolute values of A and B, respectively, and combining the results with their sign bits; each includes 1+m+n binary bits, where m and n are positive integers, the first bit is a sign bit, m bits are the integer part, and n bits are the fractional part.
  • the integer part of c is the sum of the integer part of a and the integer part of b; the fractional part of c is the sum of the fractional part of a and the fractional part of b.
  • the logarithmic domain data corresponding to the value 0 is defined as: the sign bit has a value of 1, and the integer and fractional parts are both 0.
  • A and B both comprise 1+j+k binary bits, where j and k are positive integers, the first bit is a sign bit, j bits are the integer part, and k bits are the fractional part.
  • the integer part includes j binary bits, and the fractional part includes k binary bits; when the left-shift amount x is less than 0, shifting left by x bits is equivalent to shifting right by the absolute value of x bits.
  • the exponential operation sub-circuit is a decoding circuit, and the decoding circuit is configured to obtain the result of the exponential operation by decoding the fractional part of c; or
  • the exponential operation sub-circuit is a look-up table circuit, and the look-up table circuit is configured to obtain the result of the exponential operation by looking up a table according to the fractional part of c.
  • the multiplication circuit further includes an accumulator for accumulating the product of the data A and B with another data from the multiplication circuit; or the accumulator is used to accumulate the product of the data A and B with the product from another multiplication circuit.
  • the multiplication circuit of the first aspect realizes multiplication by means of an addition sub-circuit, an exponential operation sub-circuit, a shift sub-circuit, and an output sub-circuit, and does not require a complicated exponential operation circuit; these sub-circuits are more resource-efficient to implement than a conventional multiplication circuit and occupy fewer logic resources, which reduces device area and power consumption.
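The four sub-circuits of the first aspect can be modeled in software. The following is a minimal Python sketch, not the patented circuit: the names `encode` and `log_multiply`, the choice of n = 8 fractional bits, the convention that sign bit 1 means negative, and the storage of the logarithm as a signed fixed-point integer are all assumptions made for illustration. Floating-point calls stand in for the decoder or look-up table and for the shifter, and the special encoding of the value 0 is not handled.

```python
import math

def encode(value, n=8):
    """Hypothetical log domain encoding: (sign bit, log2|value| scaled by 2**n)."""
    if value == 0:
        raise ValueError("zero needs the special encoding, not modeled here")
    sign = 1 if value < 0 else 0          # assumption: 1 means negative
    return sign, round(math.log2(abs(value)) * (1 << n))

def log_multiply(a, b, n=8):
    """Multiply two encoded operands using add, exponent, shift, and sign steps."""
    sign_a, log_a = a
    sign_b, log_b = b
    c = log_a + log_b                     # addition sub-circuit
    integer_part = c >> n                 # floor(c / 2**n), also for negative c
    fraction = c & ((1 << n) - 1)         # n-bit fractional part, in [0, 1)
    mantissa = 2.0 ** (fraction / (1 << n))          # exponential sub-circuit, in [1, 2)
    magnitude = math.ldexp(mantissa, integer_part)   # shift sub-circuit
    return -magnitude if sign_a ^ sign_b else magnitude  # output sub-circuit

print(log_multiply(encode(-8.0), encode(4.0)))
print(log_multiply(encode(3.0), encode(5.0)))
```

Operands that are exact powers of two multiply exactly in this model; other operands carry a quantization error that shrinks as n grows.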
  • a system-on-a-chip comprising: a processor core, a multiplication hardware circuit array formed by one or more multiplication hardware circuits as described in the first aspect or any possible implementation, a data input buffer, a data output buffer, and a control circuit; the control circuit is coupled to the processor core, the data input circuit, and the data output circuit; the data input circuit is configured to acquire data from the processor core through the control circuit;
  • the multiplication hardware circuit array is configured to acquire data in the data input buffer for processing, obtain a processed result, and output the result to the data output buffer; the control circuit is further configured to interact with the processor core so that the processor core acquires the data in the data output buffer.
  • the system-on-a-chip further includes a logarithmic conversion circuit configured to perform the log domain conversion on an output of the multiplication hardware circuit array and to input the converted result to the data input buffer.
  • the logarithmic domain conversion circuit includes an integer calculation sub-circuit, a fractional calculation sub-circuit, and a second sign bit determining sub-circuit, wherein the linear domain array output data is a binary number consisting of 1+j+k bits, j and k are both positive integers, 1 bit is a second sign bit indicating the sign S, and j bits indicate the integer part of the absolute value of the linear domain data.
  • the integer calculation sub-circuit is configured to determine, from the j+k bits of the linear domain array output data, the bit position h1 of the highest non-zero bit of the binary number, and to calculate the difference between h1 and k; the difference represents the integer part of the base-2 logarithm of the absolute value of A1.
  • the fractional calculation sub-circuit is configured to obtain, from a predetermined number s of bits that follow the highest non-zero bit of the linear domain array output data (taken from high to low), the fractional part of the base-2 logarithm of the absolute value of the linear domain array output data; the second sign bit determining sub-circuit is configured to determine the sign of the log domain array output data according to the sign of the linear domain array output data, thereby obtaining the log domain array output data.
  • the fractional calculation sub-circuit is specifically configured to: obtain, by looking up a table or decoding, a value N1 corresponding to the s bits after the non-zero highest bit of A1, taken from high to low;
  • obtain, by looking up the table or decoding, a value N2 corresponding to the s bits after the non-zero highest bit of A2, taken from high to low, wherein the table stores the value N corresponding to every possible value of the s bits.
  • the fractional calculation sub-circuit is specifically configured to: compare a value corresponding to the s bit of the non-zero highest bit of the A1 from a high bit to a low bit with a preset 2 n Comparison value comparison, wherein the ith comparison value is smaller than the i+1th comparison value, and the ith comparison value corresponds to a value N i ; the s bit corresponding to the non-zero highest bit of the A1 from the high to the low If the value is greater than or equal to the T1 comparison value and less than the T1+1 comparison value, it is determined that the N1 is N T1 ; the value corresponding to the s bit after the non-zero highest bit of the A2 from the high to the low is 2 n comparison value comparisons, wherein the ith comparison value is smaller than the i+1th comparison value, and the ith comparison value corresponds to a value N i ; when the A2 is from the high to the low nonzero highest level When the value
  • the fractional calculation sub-circuit is specifically configured to: use a value corresponding to a high x bit of the s bit after the non-zero highest bit of the A1 from a high bit to a low bit 2 n interval comparisons, wherein the ith interval corresponds to a pair of values ⁇ i and ⁇ i, x is greater than 0 and less than s, corresponding to the high x bits of the s bits after the non-zero highest bit of the A1 from the high to the low
  • the pair of values ⁇ 1 and ⁇ 1 corresponding to the first interval are found, and the result of x* ⁇ 1+ ⁇ 1 is calculated, and the N1 is obtained according to the result of the x* ⁇ 1+ ⁇ 1;
  • the value corresponding to the high x bit of the s bit after the non-zero highest bit of the high bit to the low bit is compared with the preset 2 n interval, wherein the s bit after the non-
  • the present application provides a multiplication hardware circuit for multiplication operations.
  • the hardware circuit herein refers to a circuit implemented based on an ASIC, an FPGA, or the like, rather than a general-purpose processor (e.g., one based on the x86 or ARM architecture), that is, a processor that needs to read instructions to perform specific operations.
  • the multiplication hardware circuit in this embodiment refers to a hardware circuit capable of realizing multiplication; this does not exclude performing other operations on the basis of multiplication, for example, an accumulation operation.
  • the multiplication hardware circuit in the present application includes: a log domain adder and a linear domain conversion circuit, and the linear domain conversion circuit includes an exponential operation sub-circuit, a shift sub-circuit, and a sign bit determining sub-circuit;
  • the log domain adder is configured to add the first log domain data a1 and the second log domain data a2 to obtain log domain data c1, wherein log domain data refers to data obtained by taking the base-2 logarithm of the absolute value of linear domain data and combining it with the sign bit of the linear domain data; a1 and a2 are obtained by performing log domain conversion on the first data A1 and the second data A2 to be multiplied, respectively.
  • a1, a2 and c1 are binary numbers composed of 1+m+n bits, m and n are both positive integers, 1 bit is the first sign bit indicating the positive or negative sign (hereinafter also referred to as the "symbol"; the application does not distinguish between the two terms), m bits indicate the value of the integer part, and n bits indicate the value of the fractional part;
  • log domain data is defined relative to the linear domain: if a piece of data undergoes log domain conversion (i.e., the base-2 logarithm of its absolute value is taken and combined with its sign bit), the data before the conversion is called linear domain data and the converted data is called log domain data.
  • for example, for -8, the base-2 logarithm of the absolute value is log2(8) = 3; combining this with the sign of -8 (negative) gives -3, so -8 is the linear domain data and -3 is the log domain data.
  • log domain data can likewise be converted to the linear domain (i.e., 2 is raised to the power of the absolute value of the data and the result is combined with the sign bit of the data). For example, for -3, raising 2 to the power 3 gives 8, and combining this with the negative sign gives -8.
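The -8 and -3 example can be checked with a short Python sketch. The function names `to_log_domain` and `to_linear_domain` and the representation of the sign as +1 or -1 are assumptions for illustration; the patent stores the sign as a bit alongside a fixed-point logarithm.

```python
import math

def to_log_domain(x):
    """Return (sign, log2|x|) for a non-zero linear domain value x."""
    sign = -1 if x < 0 else 1
    return sign, math.log2(abs(x))

def to_linear_domain(sign, log_magnitude):
    """Inverse conversion: raise 2 to the stored logarithm and reapply the sign."""
    return sign * 2.0 ** log_magnitude

sign, log_magnitude = to_log_domain(-8)
print(sign, log_magnitude)                    # sign and logarithm kept separately
print(to_linear_domain(sign, log_magnitude))  # recovers the linear domain value
```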
  • the prefixes "first”, “second”, etc. are only used to distinguish different individuals of the same type of nouns modified, and do not represent other special meanings.
  • for example, "first log domain data" and "second log domain data" denote different instances of the modified noun "log domain data": the "first log domain data" is log domain data of one specific format, and the "second log domain data" is log domain data of another format.
  • the multiplication hardware circuit further includes: an exponential operation sub-circuit for obtaining N' according to the value N1 of the fractional part of c1, wherein the value of N' is 2 to the power N1; that is, the exponential operation sub-circuit calculates 2^N1 to obtain N'. It should be noted that calculations in this application often carry some error, because in digital circuits based on ASICs and FPGAs the representation of many numbers (for example, decimals with many digits, or even irrational numbers) is limited by the hardware (different bit widths can represent different data ranges). Thus, the statement that N' is 2 to the power N1 holds under a specific hardware limitation, such as a specific bit width.
  • the multiplication hardware circuit further includes: a shift sub-circuit for shifting the N' obtained by the exponential operation sub-circuit according to the value M1 of the integer part of c1 to obtain the absolute value of C1, wherein C1 is the product of A1 and A2;
  • shifting a negative digit to the left indicates that the absolute value of the negative digit is shifted right (eg, right shift
  • the multiplication hardware circuit further includes: a sign bit determining sub-circuit for determining the sign of C1 according to the value of the sign bit of a1 and the value of the sign bit of a2, and for obtaining C1 from the absolute value of C1 produced by the shift sub-circuit and the sign of C1.
  • the sign of C1 is essentially determined by the signs of A1 and A2, and the signs of A1 and A2 determine the signs of a1 and a2, respectively; therefore, the sign of C1 can be determined from the signs of the two log domain addends a1 and a2.
  • the principle of this determination is well known to those skilled in the art: if one operand is positive and the other is negative, the product is negative; if both are positive or both are negative, the product is positive.
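The sign rule above reduces to an exclusive OR of the two sign bits. A minimal sketch, assuming (the convention is not confirmed by the text) that a sign bit of 1 denotes a negative value; the name `product_sign_bit` is hypothetical.

```python
def product_sign_bit(sign_a1, sign_a2):
    """Sign bit of the product: 1 (negative) only when exactly one operand is negative."""
    return sign_a1 ^ sign_a2

for s1 in (0, 1):
    for s2 in (0, 1):
        print(s1, s2, product_sign_bit(s1, s2))
```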
  • the multiplication is realized by means of an exponential operation sub-circuit, a shift sub-circuit, and a sign bit determining sub-circuit; complicated exponential operation sub-circuits are not required, and these sub-circuits are more resource-efficient to implement than a multiplication circuit and occupy fewer logic resources, thereby reducing device area and power consumption.
  • N' is a binary number formed by 1+w bits and represents a number greater than or equal to 1 and less than 2, wherein 1 bit represents the integer part of the number and w bits represent the fractional part of the number;
  • the shift sub-circuit is specifically configured to shift N' left by M1-(w-k) bits to obtain a final shift result; the rightmost k bits of the final shift result indicate the fractional part of the absolute value of C1, and the j bits to the left of the rightmost k bits indicate the integer part of the absolute value of C1.
  • in another implementation, N' is likewise a binary number formed by 1+w bits representing a number greater than or equal to 1 and less than 2, wherein 1 bit represents the integer part of the number and w bits represent the fractional part of the number.
  • the shift sub-circuit includes a first shift sub-circuit and a second shift sub-circuit
  • the first shift sub-circuit is used to shift N' left by M1 bits (note: shifting left by a negative number of bits is in effect shifting right by the corresponding positive number of bits).
  • the second shift sub-circuit is configured to shift the result of the first shift sub-circuit left by -(w-k) bits to obtain a final shift result; the rightmost k bits of the final shift result indicate the fractional part of the absolute value of C1, and
  • the j bits to the left of the rightmost k bits of the final shift result indicate the integer part of the absolute value of C1.
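The single-stage shift by M1-(w-k) bits can be sketched with Python integers standing in for the bit vectors. This is an illustration under assumptions: `shift_mantissa` is a hypothetical name, N' is held as an integer with w fractional bits, and bits shifted out on the right are simply truncated.

```python
def shift_mantissa(n_prime, m1, w, k):
    """Shift N' (w fractional bits) so the result has k fractional bits and is scaled by 2**m1."""
    shift = m1 - (w - k)
    if shift >= 0:
        return n_prime << shift
    return n_prime >> -shift      # a negative left shift is a right shift

# N' = 1.5 with w = 8 fractional bits is 0b1_1000_0000 = 384
result = shift_mantissa(384, m1=3, w=8, k=4)
print(result, result / (1 << 4))  # fixed-point value with k = 4 fractional bits
```

With M1 = 3 the result encodes 1.5 × 2^3 = 12 in a format with 4 fractional bits. The two-stage variant (first by M1, then by -(w-k)) yields the same value whenever no bits are lost in the first stage.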
  • the exponential operation sub-circuit is a decoding sub-circuit, and the decoding sub-circuit is used to obtain N' by decoding the N1 of c1; or
  • the exponential operation sub-circuit is a look-up sub-circuit, and the look-up sub-circuit is used to obtain N' by looking up a table according to the N1 of c1.
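A look-up implementation of the exponential operation can be sketched as a table with one entry per possible n-bit value of N1. The name `build_exp2_table`, the rounding of entries, and the example widths n = 4 and w = 8 are assumptions; the source does not specify how table entries are quantized.

```python
def build_exp2_table(n, w):
    """Table of 2**(i / 2**n) for every n-bit fraction i, stored with w fractional bits."""
    return [round(2.0 ** (i / (1 << n)) * (1 << w)) for i in range(1 << n)]

table = build_exp2_table(n=4, w=8)
n1 = 0b1000                       # fractional part 0.5
print(table[n1], table[n1] / (1 << 8))
```

Each entry lies in the range [2^w, 2^(w+1)), which matches the 1+w bit format of N' representing a number from 1 up to but not including 2.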
  • the multiplying hardware circuit further includes an accumulator for accumulating the C1 with another linear domain data C2 from the multiplying hardware circuit; or
  • the accumulator is used to accumulate the C1 with the linear domain data C3 from another multiplication hardware circuit.
  • the present application discloses a system-on-chip (SoC), including a processor core, a multiplication hardware circuit array formed by one or more of the multiplication hardware circuits of the first aspect and its various implementations, a data input buffer, a data output buffer, and a control circuit.
  • the control circuit is coupled to the processor core, the multiplying hardware circuit array, the data input circuit, and the data output circuit;
  • the data input circuit is configured to acquire data from the processor core by the control circuit;
  • the multiplication hardware circuit array is configured to acquire data in the data input buffer for processing, obtain a processed result, and output the result to the data output buffer by using the control circuit.
  • the composition of the multiplication hardware circuit array (i.e., how many multiplication hardware circuits are selected, how they are arranged, etc.) is not the focus of the present application, which concentrates on the multiplication hardware circuits constituting the array.
  • the input buffer and the output buffer may be implemented by a storage medium such as SRAM or eDRAM.
  • the SoC further includes a logarithmic conversion circuit configured to perform the log domain conversion on an output of the multiplication hardware circuit array and to input the converted result to the data input buffer. Specifically, the data is obtained from the output buffer, converted, and output to the input buffer, so that the multiplication hardware circuit array can subsequently acquire the data from the input buffer for operation.
  • the logarithmic domain conversion circuit includes an integer calculation sub-circuit, a fractional calculation sub-circuit, and a second sign bit determining sub-circuit, wherein the linear domain array output data is a binary number composed of 1+j+k bits, j and k are both positive integers, 1 bit is a second sign bit indicating the sign S, j bits indicate the value J of the integer part of the absolute value of the linear domain data, and k bits indicate the value K of the fractional part of the absolute value of the linear domain data;
  • the integer calculation sub-circuit is configured to determine the bit position h1 of the highest non-zero bit of the j+k-bit binary number of the linear domain array output data and to calculate the difference between h1 and k; the difference indicates the integer part of the base-2 logarithm of the absolute value of the linear domain array output data, wherein the lowest bit of the j+k-bit binary number is recorded as the 0th bit;
  • the fractional calculation sub-circuit is configured to obtain, from a predetermined number s of bits (greater than or equal to k; if fewer than s bits are available, zeros are appended) that follow the highest non-zero bit of the linear domain array output data, taken from high to low, the fractional part of the base-2 logarithm of the absolute value of the linear domain array output data; specifically, this can be done by table look-up or decoding;
  • the second sign bit determining sub-circuit is configured to determine the sign of the log domain array output data according to the sign of the linear domain array output data, thereby obtaining the log domain array output data.
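The integer and fractional calculations can be illustrated on the magnitude bits alone. In this sketch the name `linear_to_log`, the output width n, and the use of `math.log2` in place of a stored table are assumptions; the truncation to n bits is likewise an illustrative choice, and the sign bit is simply carried over separately.

```python
import math

def linear_to_log(magnitude, k, s, n):
    """Return (integer part, n-bit fractional part) of log2 of a fixed-point magnitude.

    magnitude holds the j+k magnitude bits as an integer with k fractional bits.
    """
    if magnitude <= 0:
        raise ValueError("magnitude must be positive")
    h1 = magnitude.bit_length() - 1          # position of highest non-zero bit, LSB is bit 0
    integer_part = h1 - k                    # integer calculation sub-circuit
    below = magnitude - (1 << h1)            # the h1 bits after the leading one
    # keep s bits after the leading one, appending zeros if fewer are available
    bits = below >> (h1 - s) if h1 >= s else below << (s - h1)
    # stand-in for the table or decoder: log2(1 + bits / 2**s), truncated to n bits
    fraction = int(math.log2(1 + bits / (1 << s)) * (1 << n))
    return integer_part, fraction

print(linear_to_log(0b1100000, k=4, s=4, n=8))   # magnitude 6.0 with k = 4
```

For the magnitude 6.0 the leading one sits at bit 6, so the integer part is 6 - 4 = 2, and the bits after it encode 1.5, whose logarithm supplies the fractional part.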
  • A1 and A2 mentioned in the above aspects are output data of the linear domain array, and the fractional calculation sub-circuit is specifically configured to:
  • the fractional calculation sub-circuit may further have another implementation manner, that is, the fractional calculation sub-circuit is specifically configured to:
  • the present application discloses an electronic device (which may be a mobile phone, a tablet, a smart watch, a smart TV, or the like), including a memory and the system-on-a-chip (SoC) of the second aspect or its various implementations (or of the fourth aspect or its various implementations);
  • SoC system-on-a-chip
  • the memory is used to store instructions required for the program to run
  • the processor core in the SoC is configured to execute the instructions to run the program, and to send data to be processed to the multiplication hardware circuit array;
  • the multiplication hardware circuit is configured to process the data, output the processed result to the data output circuit, and finally let the processor core obtain the result.
  • FIG. 1 is a schematic diagram of a multiply and accumulate operation
  • FIG. 2 is a schematic block diagram of an ARM SoC architecture of an embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a structure of a calculation engine
  • FIG. 4 is a schematic structural diagram of a multiplier in the first embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a multiply and accumulate operation in the embodiment of the present application.
  • FIG. 6 is a schematic diagram of a log domain representation format of an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a linear representation format of an embodiment of the present application.
  • FIG. 8 is a schematic diagram of shifting a shift sub-circuit provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a method for converting a data format according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of linear domain conversion performed by a linear conversion circuit according to an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of a method for converting a data format according to an embodiment of the present application.
  • FIG. 12 is a schematic diagram of logarithmic domain conversion performed by a logarithmic conversion circuit in an embodiment of the present application
  • FIG. 13 is a schematic diagram of a fractional calculation sub-circuit determining the value of a fractional portion
  • FIG. 14 is a schematic diagram of a segmentation fit of an embodiment of the present application.
  • FIG. 15 is a schematic diagram of another fractional calculation sub-circuit determining the value of a fractional part
  • FIG. 16 is a schematic diagram of a SoC structure of the present application.
  • FIG. 17 is a schematic structural view of an electronic device of the present application.
  • abstracted into a mathematical model, a matrix operation can be regarded as a multiply-accumulate operation (also called a multiply-add operation).
  • P and Q can represent a matrix or a vector.
  • P ⁇ Q can represent a matrix operation in a broad sense, including at least one of a matrix * vector, a matrix * matrix, and a convolution operation of a vector * vector.
  • the value of P ⁇ Q may be that the element p i in P is multiplied with the corresponding element q i in Q to obtain a product p i q i , and then these products are accumulated, and this process is a multiply-accumulate operation.
  • the multiply and accumulate operation yields an element of the result matrix.
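The multiply-accumulate operation described above can be written out directly. This is a plain linear domain sketch with a hypothetical function name, intended only to fix the definition before the log domain variant is discussed.

```python
def multiply_accumulate(p, q):
    """Multiply corresponding elements of p and q and accumulate the products."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0
    for p_i, q_i in zip(p, q):
        total += p_i * q_i
    return total

print(multiply_accumulate([1, 2, 3], [4, 5, 6]))
```

Each call yields one element of the result matrix, so a full matrix product repeats this operation once per output element.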
  • log domain data (data represented in the log domain) refers to data obtained by converting the absolute value of data represented in the linear domain (hereinafter referred to as "linear domain data", "linear data", "linear value", or "linear domain value") into a logarithmic value through a logarithm operation (for computer calculation, usually a base-2 logarithm).
  • f and g are logarithmic domain data.
  • F and G are linear domain data.
  • the multiplication of the linear domain data F and G can be converted into the addition of the log domain data f and g; that is, f+g is the log domain representation of the product F*G, and the value of F*G is obtained by converting the log domain result f+g into linear domain data (i.e., calculating 2^(f+g) by, for example, shifting, decoding circuitry, etc.).
  • in this way, the multiplication of data is replaced by the addition of the base-2 logarithms of the absolute values of the data, thereby avoiding the multiplication operation.
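The identity behind this substitution can be written out as follows. The split into a floor term and a remainder is added here for illustration, and f and g are taken to be the logarithms of the magnitudes only, with signs handled separately.

```latex
\begin{aligned}
f &= \log_2 |F|, \qquad g = \log_2 |G|, \\
|F \cdot G| &= 2^{f+g}
  = 2^{\lfloor f+g \rfloor} \cdot 2^{(f+g) - \lfloor f+g \rfloor},
  \qquad 1 \le 2^{(f+g) - \lfloor f+g \rfloor} < 2 .
\end{aligned}
```

The second factor depends only on the fractional part of f+g and can come from a decoder or table, while the first factor is a pure shift. For example, F = 8 and G = 4 give f = 3 and g = 2, so |F·G| = 2^5 = 32.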
  • although performing logarithmic operations incurs some overhead, in a matrix operation one piece of data may participate in multiple multiplications; its log domain representation can therefore be calculated once and then used multiple times, so the computational overhead of the entire matrix operation is still reduced.
  • IEEE Institute of Electrical and Electronics Engineers
  • Fig. 1 shows a schematic diagram of the multiplication and addition operation, the steps of which are as follows.
  • the multiplication is realized by means of an exponential operation sub-circuit, a shift sub-circuit, and a sign bit determining sub-circuit; complicated exponential operation sub-circuits are not required, and these sub-circuits are more resource-efficient to implement than a multiplication circuit and occupy fewer logic resources, which reduces device area and power consumption.
  • the result of a multiplication and addition is not the final result of the matrix operation.
  • the result of the multiply and accumulate operation may become input data of the next layer of calculation, in which case it needs to be converted into the log domain representation format; for example, log2(SUM) can be calculated using a standard floating-point logarithm operation circuit, and the result saved in a 16-bit floating-point data representation format.
  • the standard floating-point exponent arithmetic unit and the standard floating-point logarithmic arithmetic unit consume large hardware resources.
  • the accumulator resource consumption for calculating the summation is also large.
  • the present application provides a method, a circuit, a calculation engine, and a convolution calculation chip for converting a data format, which can reduce the overhead of data conversion between a linear domain and a log domain, and improve the speed of convolution calculation.
  • FIG. 2 is a schematic block diagram of a System-on-a-chip (SoC) architecture 200 of an Advanced RISC Machine (ARM) system in accordance with an embodiment of the present application.
  • SoC System-on-a-chip
  • ARM Advanced RISC Machine
  • the ARM SoC architecture 200 includes, for example, a central processing unit (CPU) 210, a double data rate (DDR) memory controller 220, an Advanced eXtensible Interface (AXI) bus 230, and a hardware calculation module 240.
  • CPU central processing unit
  • DDR double data rate
  • AXI advanced eXtensible interface
  • the hardware calculation module is used to perform some dedicated data processing, that is, the hardware calculation module is used to perform some "dedicated” processing (such as neural network-based machine learning) for data such as images, audio, and the like.
  • the hardware computing module is built from various logic circuits (such as AND gates, OR gates, NOT gates, etc.), rather than processing data by executing instructions from a certain instruction set (such as the x86 instruction set or the ARM instruction set) as a CPU does.
  • a typical hardware computing module can be implemented based on an FPGA, ASIC, or the like.
  • the CPU usually has its own dedicated instruction set for performing other data processing than the dedicated data processing performed by the hardware calculation module by executing the instructions (of course, it is theoretically not limited to using the CPU to execute the hardware calculation module. Patented data processing, but limited by the CPU hardware architecture, the efficiency is relatively low).
  • DDR memory is also known as DDR SDRAM, which is DDR Synchronous Dynamic Random Access Memory (SDRAM).
  • the convolution calculation chip 240 includes an input buffer 242, a calculation engine 244, and an output control module 246.
  • the CPU 210 starts the calculation through the AXI bus 230; the convolution calculation chip 240 acquires the data to be processed from the DDR memory 220 through the AXI bus 230 (for example, image data and training parameters for image processing) and then sends the data to the calculation engine 244, which performs the calculation on the input data, writes the calculation result back to the DDR memory 220, and notifies the CPU 210 that the calculation is complete.
  • FIG. 3 shows a schematic block diagram of the structure of a calculation engine.
  • the calculation engine 244 includes a direct memory access (DMA) control unit, a data cache, a parameter cache, a plurality of processing elements (PEs) that form a PE array, an output buffer, and a logarithmic conversion circuit.
  • a PE can be regarded as a circuit that implements a specific function.
  • the PE may include a multiplication hardware circuit and may perform various multiplication-based operations (such as multiplication, or multiplication and addition); the multiplication hardware circuit of the PE includes a linear conversion circuit (also called a linear conversion unit, and shown as a linear conversion unit in FIG. 3).
  • in the configuration of the PE array (that is, of the multiplication hardware circuits),
  • the PEs pass data, and each PE can also be connected to the input buffers (the parameter cache and the data cache).
  • the DMA control unit (which can be regarded as a control circuit) reads the required image data and training parameters from the external DDR memory 220 into the data cache and the parameter cache; the PE array performs multiply-accumulate operations on the image data and the training parameters; the operation result is output to the output buffer and is then converted from the linear domain to the log domain by the logarithmic conversion circuit; the final result can be returned to the data cache for use as input data of the next round of multiply-accumulate operations, or output directly to the DDR memory 220 for storage.
  • FIG. 4 is a schematic structural diagram of a hardware multiplier 40 (also referred to in this application as a "multiplication hardware circuit", "multiplication circuit", or "multiplier"), which can be used to multiply two data A and B; the multiplier includes:
  • an adding sub-circuit 41, configured to obtain log domain data a and log domain data b corresponding to A and B respectively, and to add a and b to obtain c, where c includes an integer part and a fractional part;
  • an exponential operation sub-circuit 42, configured to perform an exponential operation whose base is 2 and whose exponent is the fractional part of c, to obtain an exponential operation result;
  • a shift sub-circuit 43, configured to shift the exponential operation result according to the integer part of c to obtain a shift result, where the shift result indicates the product of A and B;
  • an output sub-circuit 44, configured to output the product of A and B according to the signs of a and b in combination with the shift result.
  • the multiplication circuit 40 may further include an accumulator 45, configured to accumulate the product of A and B with another product from the same multiplication circuit, or with the product from another multiplication circuit.
  • Each of the above sub-circuits can be implemented based on an ASIC or an FPGA.
  • the above hardware multiplier is implemented based on an ASIC, and at the same time, it can be packaged in one chip with other hardware such as a CPU, a GPU, and the like to form an SoC (system on a chip).
  • each sub-circuit can be implemented with a very simple circuit that occupies few resources.
  • the multiplier as a whole is therefore also simple and occupies few resources, so that, for the same resources (such as area and power consumption), a chip can integrate more multipliers and thereby improve its computing power.
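  • as an illustrative sketch only (not part of the described circuit), the following Python function mirrors the four stages of the multiplier 40 in floating-point arithmetic: add the log domain values, raise 2 to the fractional part, scale by the integer part, and fix the sign. The function name `log_domain_multiply` is hypothetical, exact (unquantized) logarithms are assumed, and A and B are assumed to be non-zero.

```python
import math

def log_domain_multiply(A: float, B: float) -> float:
    """Multiply two non-zero numbers via their base-2 logarithms (illustration only)."""
    # Log domain conversion: a and b carry the magnitudes, the signs are kept aside.
    a = math.log2(abs(A))
    b = math.log2(abs(B))

    # Adding sub-circuit 41: c = a + b, split into integer and fractional parts.
    c = a + b
    c_int = math.floor(c)          # integer part (may be negative)
    c_frac = c - c_int             # fractional part of c

    # Exponential operation sub-circuit 42: 2 ** (fractional part of c).
    mantissa = 2.0 ** c_frac

    # Shift sub-circuit 43: a shift by c_int bits is a multiplication by 2 ** c_int.
    magnitude = math.ldexp(mantissa, c_int)

    # Output sub-circuit 44: the product is negative when exactly one input is negative.
    negative = (A < 0) != (B < 0)
    return -magnitude if negative else magnitude
```

  • for example, `log_domain_multiply(-8, 0.25)` adds the logarithms 3 and -2 to obtain c = 1 and returns -2.0; in hardware, the `2.0 ** c_frac` step would be a small lookup table and `ldexp` a bit shift.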
  • the adding sub-circuit 41 is described in detail in this embodiment.
  • the adding sub-circuit 41 adds the absolute value of the log domain data a and the absolute value of the log domain data b to obtain c.
  • a and b are log domain data, obtained by log domain conversion of the linear domain data A and B respectively.
  • the log domain data includes 1+m+n binary bits (which can also be written as 1.m.n), where m and n are positive integers: the first bit is a sign bit indicating positive or negative (hereinafter also referred to as the "symbol"; this application does not distinguish between the two terms), m bits are the integer part (that is, they indicate the value of the integer part), and n bits are the fractional part (that is, they indicate the value of the fractional part).
  • the values of m and n can be determined according to the accuracy required by the system: the more bits, the greater the precision, but also the more hardware resources are consumed. Those skilled in the art can choose appropriate values of m and n by weighing the system's requirements for accuracy against the available hardware resources.
  • the addition adds the fractional part of the absolute value of a to the fractional part of the absolute value of b (a carry may occur) to obtain the fractional part of c, and adds the integer part of the absolute value of a to the integer part of the absolute value of b (plus the carry, if any) to obtain the integer part of c; the integer part and the fractional part of c are likewise represented with m + n binary bits.
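  • the carry from the fractional bits into the integer bits can be sketched as follows. This is a minimal illustration, assuming that the integer and fractional fields are held as separate unsigned integers; the function name `add_log_magnitudes` is hypothetical, and overflow of the m integer bits is not handled.

```python
def add_log_magnitudes(a_int: int, a_frac: int, b_int: int, b_frac: int, n: int):
    """Add |a| and |b|, each given as (integer bits, n fractional bits)."""
    frac_sum = a_frac + b_frac
    carry = frac_sum >> n                  # 1 if the fractional sum reached 1.0
    c_frac = frac_sum & ((1 << n) - 1)     # keep the low n bits as the fraction of c
    c_int = a_int + b_int + carry          # integer parts plus the carry
    return c_int, c_frac
```

  • with n = 2, adding 1.11 (decimal 1.75) and 10.10 (decimal 2.5) gives a fractional sum of binary 101, so the carry is 1, the fractional part of c is 01, and the integer part of c is 1 + 2 + 1 = 4, that is, c = 4.25.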
  • the purpose of log domain conversion is to convert data into a logarithmic format so that subsequent operations can be performed on the data in that format.
  • log domain data is defined relative to linear domain data: if one datum (such as A or B above) is converted to the log domain to obtain another datum (such as a or b above), the datum before the conversion (A or B) is called linear domain data, and the converted datum (a or b) is called log domain data.
  • log domain conversion can be implemented in several ways.
  • the following two modes are described in detail:
  • in the first mode, log domain conversion means taking the base-2 logarithm of the absolute value of a linear domain datum and combining it with a sign bit. For "combining with the sign bit of the data", the simplest implementation uses the sign bit of the data directly as the sign bit of the log domain data. The opposite sign can also be used as the sign bit of the log domain data; conversion between log domain data and linear domain data remains possible as long as this rule is applied consistently.
  • the log domain data is represented as binary data of 1+m+n bits, also written as 1.m.n hereinafter,
  • where m and n are both positive integers;
  • 1 bit is the first sign bit S, which indicates whether the data is positive or negative;
  • m bits are integer bits, which indicate the value M of the integer part of the base-2 logarithm of the absolute value of the data;
  • n bits are fractional bits, which indicate the value N of the fractional part of the base-2 logarithm of the absolute value of the data.
  • "<< M" denotes a left shift by M bits: when M is greater than 0, the data is shifted left by M bits; when M is less than 0, the data is shifted right by |M| bits.
  • the sign bit S represents the sign of F (positive or negative) and does not take part in operations on the data in the log domain representation format. When F is negative, log 2 of a negative number does not exist in the real numbers, so the 1.m.n format (the log domain representation format) in the embodiment of the present application expresses -log 2 (
  • for example, the base-2 logarithm of the absolute value of -8 is log2(|-8|) = 3; combining this with the sign of -8 (-) gives -3, where -8 is the linear domain data and -3 is the log domain data.
  • the log domain data can likewise be converted back to the linear domain (that is, 2 is raised to the power of the absolute value of the data and the result is combined with the sign bit of the data): 2 to the power of |-3| = 3 equals 8, and combining the sign bit (-) gives -8.
  • in one variant, one of the m bits that represent the integer part of the log domain data may be used as a sign, called the sign bit of the log domain integer part, and the value of the remaining m-1 bits equals the absolute value of the integer part of the base-2 logarithm of the absolute value of the data.
  • for example, the highest of the 3 integer bits in the log domain representation format 1.3.2 is used as the sign bit of the integer part, so the decimal value 0.25 is represented as 0 110 00 in the format 1.3.2.
  • the highest (leftmost) bit of the middle group is the sign bit of the log domain integer part; 1 indicates negative and 0 indicates positive.
  • in this way the integer part of the log domain data can be positive or negative and directly reflects the base-2 logarithm of the absolute value of the linear domain datum; however, the extra sign bit uses up one of the m integer bits (or increases the bit width), and the sign bit must be taken into account during calculation, which adds some overhead.
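  • the encoding with a sign bit for the integer part can be sketched as follows. This is a hedged illustration rather than the patented circuit: the function name `encode_log_domain` is hypothetical, and it assumes that M is the floor of the quantized logarithm, so that the fractional bits N are always non-negative.

```python
import math

def encode_log_domain(x: float, m: int = 3, n: int = 2) -> str:
    """Encode non-zero x as 'S MMM NN' in format 1.m.n (integer part in sign-magnitude)."""
    if x == 0:
        raise ValueError("zero has no logarithm")
    sign = 1 if x < 0 else 0
    # Quantize log2|x| to n fractional bits; q is a signed fixed-point integer.
    q = round(math.log2(abs(x)) * (1 << n))
    M = q >> n                      # floor of the quantized logarithm (assumption)
    N = q & ((1 << n) - 1)          # fractional bits, 0 <= N < 2**n
    if abs(M) >= 1 << (m - 1):
        raise ValueError("integer part does not fit in m-1 bits")
    int_sign = 1 if M < 0 else 0    # sign bit of the log domain integer part
    return f"{sign} {int_sign}{abs(M):0{m - 1}b} {N:0{n}b}"
```

  • under these assumptions, `encode_log_domain(0.25)` yields `0 110 00`, matching the example above, and `encode_log_domain(-8)` yields `1 011 00`.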
  • in the second mode, log domain conversion means taking the base-2 logarithm of the absolute value of a linear domain datum, converting the logarithm to a number greater than or equal to 0 by means of a reference value, and combining the result with the sign bit.
  • the value of the first integer bits may be a non-negative number equal to the difference between the value M of the integer part of the base-2 logarithm of the absolute value of the data (e.g., F or G) and the reference value (BASE).
  • the value M of the integer part of the base-2 logarithm of the absolute value of the linear domain data may be positive or negative.
  • BASE is subtracted from M so that the value M' of the integer bits of the data in the log domain representation format (M' = M - BASE) always remains non-negative.
  • the value of BASE may differ for different data (for example, different types of data, or data from different time periods).
  • BASE may be chosen so that M' is non-negative for all the data to which it applies (for example, the data of a given batch).
  • for example, BASE may be set to a number such as -8 (the corresponding M' then ranges from 1 to 8), as long as M' does not become negative.
  • the value of M' may also be range-limited to a minimum value and a maximum value: a number smaller than the minimum is replaced by the minimum, and a number greater than the maximum is replaced by the maximum.
  • the choice of BASE is configurable: BASE can be configured by a component external to the linear conversion circuit, that is, the external component passes BASE to the linear conversion circuit, and it suffices for the BASE value to be determined during software compilation for the linear conversion circuit.
  • in this way the integer bits of the data in the log domain representation format never take a negative value, so no separate sign bit is needed for the integer bits; this makes the data representation more concise and saves computational overhead (the sign bit does not need to be considered during operations).
  • both conversion modes above apply to non-zero data.
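  • the BASE offset and the range limitation can be sketched as follows. The function names are hypothetical, and the limits 0 and 2^m - 1 are an assumption chosen here so that M' fits in m unsigned bits; the text only states that a minimum and a maximum are applied.

```python
def to_offset_integer(M: int, base: int, m: int) -> int:
    """Return M' = M - BASE, limited to the range of m unsigned bits."""
    m_prime = M - base
    lowest, highest = 0, (1 << m) - 1      # assumed limits
    return max(lowest, min(m_prime, highest))

def from_offset_integer(m_prime: int, base: int) -> int:
    """Recover M from M' when no range limiting took place."""
    return m_prime + base
```

  • with BASE = -8 and m = 4, values of M from -7 to 0 map to M' from 1 to 8, as in the example above; a value such as M = -20 would be limited to M' = 0.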
  • the log domain representation format and the linear representation format of the data in the embodiments of the present application are illustrated below.
  • when the value M of the integer part of the base-2 logarithm of the absolute value of the data is non-negative, BASE can be zero.
  • BASE can also be -1 or a value less than -1, which is not limited by the embodiment of the present application.
  • the binary representation of the data in the log domain is 0 010 10, where 1 bit (0) is the sign bit (0 represents a positive number and 1 a negative number), 3 bits (010) represent the integer part, and 2 bits (10) represent the fractional part.
  • the sign bit is represented by 1 bit (0), the integer part by 3 bits (101), and the fractional part by 5 bits (10101).
  • the log domain representation format of the data in the embodiment of the present application can represent a large numerical range with a small number of bits.
  • the data to be expressed may range from a negative minimum to a positive maximum;
  • the values of m and n in the log domain representation format 1.m.n are determined accordingly. For example, the larger the base-2 logarithm of the absolute value of the image data or training parameters, the larger the value of m should be; the higher the accuracy requirement, the larger the value of n should be.
  • the values of m and n may be determined by software (for example, by a general-purpose processor such as a CPU), which is not limited by the embodiment of the present application.
  • the exponential operation sub-circuit 42 is described in detail in this embodiment.
  • the exponential operation sub-circuit 42 computes 2^(fractional part of c), where the "fractional part of c" is a fraction greater than or equal to 0 and less than 1; for example, if the fractional part of c is 0.32, the operation actually performed is 2^0.32.
  • the exponential operation sub-circuit may be a decoding sub-circuit that decodes the fractional part of c into the exponential operation result, or a lookup sub-circuit that obtains the exponential operation result from a table according to the fractional part of c.
  • in both cases, the general idea is to build in advance, for a chosen precision, a mapping between the "fractional part of c" and the "exponential operation result", and then to obtain the "exponential operation result" from the "fractional part of c" by decoding or table lookup.
  • for example, the exponential operation sub-circuit 820 may be a device (e.g., a decoder) with a two-significant-bit input and an eight-significant-bit output, as shown in FIG. 10, which performs the conversion from the "fractional part of c" to the "exponential operation result". The eight-bit output applies to this example only; the number of bits can be increased or decreased as needed.
  • the two fractional bits have 4 possible values (00, 01, 10, 11); 2^N is calculated in advance for each (N can be 0, 0.25, 0.5, or 0.75), and the results, rounded to 8 bits, are saved as a table, as shown in FIG. 10 and below:
  • 00--10000000 (represents the binary number 1.0000000, corresponding to the decimal number 1)
  • 01--10011000 (represents the binary number 1.0011000, corresponding to the decimal number 1.1875)
  • 10--10110101 (represents the binary number 1.0110101, corresponding to the decimal number 1.4140625)
  • 11--11010111 (represents the binary number 1.1010111, corresponding to the decimal number 1.6796875)
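  • the table entries can be reproduced with a few lines of Python. This is a sketch for checking the values, assuming that rounding to the nearest value with 7 fractional bits is what "rounded to 8 bits" means; the names `build_exp_table`, `TABLE`, and `ROWS` are hypothetical.

```python
def build_exp_table(n: int = 2, w: int = 7) -> list:
    """Return 2 ** (i / 2**n) for every n-bit input i, as integers in 1.w fixed point."""
    return [round(2 ** (i / (1 << n)) * (1 << w)) for i in range(1 << n)]

TABLE = build_exp_table()
# One row per input pattern, in the same "input--output" layout as the table above.
ROWS = [f"{i:02b}--{value:08b}" for i, value in enumerate(TABLE)]
```

  • dividing an entry by 2^7 gives its decimal value; for instance, `TABLE[2] / 128` is 1.4140625, the 8-bit approximation of the square root of 2.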
  • the shift sub-circuit 43 is described in detail in this embodiment.
  • the shift sub-circuit in essence shifts the exponential operation result to the left by (integer part of c) bits, which is equivalent to multiplying it by 2^(integer part of c); the shift result equals the absolute value of the product of A and B (with some error, because of the finite precision of the digital circuit). The final product of A and B is obtained once the sign bit has been determined for the shift result.
  • the way the data is shifted must match the way the data is read out (that is, which bits are taken as the integer part and which as the fractional part) for the final result to be correct.
  • if the shift result is not read out in a way that matches the shift, the value read is not the correct final result.
  • for example, shifting the binary number 1 to the left by 3 bits is equivalent to multiplying it by 2^3 in decimal, and the result is the binary number 1000 (decimal 8).
  • the final result of the shift sub-circuit is then 1000.
  • if, however, the first two digits of 1000 were taken as the integer part and the last two digits as the fractional part, the result read would be 10.00 (binary), which is obviously wrong.
  • therefore "in essence shifting the exponential operation result to the left by (integer part of c) bits" describes one shift mode that yields multiplication by 2^(integer part of c) (and that must be paired with the matching readout). In practice the number of bits actually shifted need not equal the integer part of c, as long as the matching readout makes the final result equal to the exponential operation result multiplied by 2^(integer part of c).
  • any such implementation is considered to be "in essence shifting the exponential operation result to the left by (integer part of c) bits to apply 2^(integer part of c)".
  • the linear domain data is represented in a 1+j+k format (also written as 1.j.k): 1 bit represents the sign, j bits represent the integer part, and k bits represent the fractional part.
  • the shift result above equals the product of A and B, which is linear domain data; the following shift and the corresponding readout can be used to obtain it.
  • the exponential operation result equals 2^(fractional part of c); by the properties of the exponential function, it is greater than or equal to 1 and less than 2.
  • the exponential operation result is represented with 1+w binary bits, where the first bit is the integer part (equal to 1), the w bits are the fractional part, and w is a positive integer;
  • for example, if 1.0101101 is shifted left by 2 bits, it becomes 101.01101; if it is shifted left by (-3) bits (that is, shifted right by 3 bits), it becomes 0.0010101101.
  • to shift the exponential operation result according to the integer part of c, the shift sub-circuit first places the exponential operation result in a storage of j+k bits and then shifts it;
  • the highest j bits of the shifted result are taken as the integer part of the final result, and the remaining k bits as its fractional part. When the left-shift amount X is less than 0, shifting left by X bits means shifting right by |X| bits (for example, shifting left by (-3) bits corresponds to shifting right by 3 bits).
  • the value of the sign bit of the linear domain data is not determined by the shift sub-circuit; it is determined by the output sub-circuit of the next stage according to the signs of the log domain data a and b.
  • for example, when the integer part of c is 2, the final result is 101.01101, obtained by shifting 1.0101101 left by 2 bits (written out in full with 8+8 bits, it is 0000010101101000).
  • the integer part and the fractional part can always be read out with the same rule: the high j bits are the value of the integer part and the low k bits are the value of the fractional part.
  • bits may be lost. For example, as shown in FIG. 3, if the data is shifted one bit to the right, the last 1 is lost; even in this case the final value is still read out with the rule above, that is, the high j bits as the value of the integer part and the low k bits as the value of the fractional part.
  • because the shift sub-circuit only performs a shift, it is also simple to implement and occupies few resources.
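  • the placement, shift, and readout described above can be sketched in Python with integers standing in for the j+k-bit storage. The function name `shift_to_fixed_point` and its argument order are hypothetical; the sketch assumes the exponential operation result is passed as an integer holding its 1+w bits.

```python
def shift_to_fixed_point(mantissa: int, w: int, shift: int, j: int, k: int):
    """Place a 1.w value in a (j+k)-bit word, shift it, and read out (integer, fraction)."""
    # Align the binary point: the word has k fractional bits, the mantissa has w.
    word = mantissa << (k - w) if k >= w else mantissa >> (w - k)
    # A positive amount shifts left; a negative amount shifts right (low bits are lost).
    word = word << shift if shift >= 0 else word >> -shift
    word &= (1 << (j + k)) - 1              # keep only the j+k bits of the storage
    # Readout rule: high j bits are the integer part, low k bits the fractional part.
    return word >> k, word & ((1 << k) - 1)
```

  • for 1.0101101 (integer 0b10101101, w = 7) shifted left by 2 bits with j = k = 8, the word is 0000010101101000, read out as integer part 101 and fractional part 01101000; shifting right by 3 bits instead leaves 0.00101011, with the lowest bits lost.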
  • the output sub-circuit 44 is described in detail in this embodiment.
  • the output sub-circuit outputs the product of A and B according to the signs of a and b in combination with the shift result. If the signs of a and b correspond to the signs of A and B respectively, the sign of the product of A and B can be determined from the signs of a and b. For example, when a has the same sign as A and b has the same sign as B, the sign of the product follows the usual multiplication rule (positive times positive is positive, positive times negative is negative, negative times negative is positive) applied to the signs of a and b: when one of a and b is positive and the other is negative, the product of A and B is negative.
  • because the output sub-circuit only performs a sign operation, it is also simple to implement and occupies few resources.
  • FIG. 5 is a schematic flowchart of the multiply-accumulate operation. As shown in FIG. 5, taking F1*G1+F2*G2+F3*G3+... as an example, the multiply-accumulate operation includes the following steps.
  • the input data f1 and g1 in the log domain representation format may be prepared in advance by software (that is, by a general-purpose processor such as a CPU) or by hardware (for example, a device based on a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC)).
  • the linear domain conversion is performed by the linear conversion circuit inside each PE.
  • the linear conversion circuit could be based on a conventional floating-point exponent unit, that is, 2^c1 could be calculated by the floating-point exponent unit. Given the floating-point data representation format described above, however, calculating 2^c1 in this way requires a large amount of computation and is slow.
  • alternatively, the linear conversion circuit may include the exponential operation sub-circuit, the shift sub-circuit, the output sub-circuit, and so on of FIG. 4 described above.
  • the value C1 in the linear representation format obtained in S420 is added to the existing accumulated result, giving the accumulated result SUM = C1 + C2 + C3 + ....
  • the arrow on the right side of S430 that loops back to S430 means that the Ci obtained in this pass is accumulated with the previous accumulated result (it can also be accumulated with the data in another multiplier).
  • if the next step continues the multiplication with SUM as an operand, S440 is performed; if no further operation on SUM is needed, SUM can be output directly through the output buffer to the DDR memory 220 for storage.
  • SUM log-domain representation format
  • C1 is represented as c1 in the logarithmic domain representation format 1.4.4
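  • the multiply-accumulate flow can be summarized by the following sketch, which works on (sign, logarithm) pairs and uses floating-point arithmetic in place of the hardware linear conversion circuit. The function name `log_domain_mac` and the pair layout are hypothetical, and quantization to a 1.m.n format is deliberately left out.

```python
def log_domain_mac(terms) -> float:
    """Accumulate F_i * G_i, where each operand is given as (sign bit, log2 of magnitude)."""
    total = 0.0
    for (sign_f, f), (sign_g, g) in terms:
        c = f + g                                 # multiplication becomes an addition
        C = 2.0 ** c                              # linear domain conversion of c
        total += -C if sign_f ^ sign_g else C     # accumulate with the sign of the product
    return total
```

  • for example, the terms 2*4 and 1*2, given as `[((0, 1.0), (0, 2.0)), ((0, 0.0), (0, 1.0))]`, accumulate to 10.0; converting that sum back to the log domain for a further round is the task of the logarithmic conversion circuit.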
  • FIG. 9 is a schematic flowchart of a method 700 for converting a data format according to an embodiment of the present application.
  • Method 700 can be performed by linear conversion circuit 800.
  • FIG. 10 is a schematic diagram of linear domain conversion performed by the linear conversion circuit 800 of the embodiment of the present application.
  • the linear conversion circuit 800 may include an acquisition sub-circuit 810, a decoding sub-circuit 820, a shift sub-circuit 830, and an output sub-circuit 840.
  • Each sub-circuit can be implemented based on an FPGA or ASIC.
  • the sub-circuits 810-840 are respectively used to execute S710-S740 of the method 700.
  • the method 700 for converting a data format in this embodiment includes:
  • the acquisition sub-circuit 810 acquires data in the log domain representation format 1.m.n, in which 1 bit is the first sign bit, m bits are the first integer bits, and n bits are the first fractional bits.
  • the data here may include image data and/or training parameters.
  • the linear conversion circuit in the PE may acquire image data in the log domain representation format from the data cache, and/or acquire training parameters in the log domain representation format from the parameter cache.
  • for example, the acquisition sub-circuit 810 acquires the data 1 010 10 in the log domain representation format 1.3.2, in which the first sign bit is 1 (that is, the data is negative), the first integer bits are 010, and the first fractional bits are 10.
  • the decoding sub-circuit 820 looks up a table to obtain the linear representation corresponding to the first fractional bits. Specifically, the decoding sub-circuit 820 receives the n first fractional bits from the acquisition sub-circuit 810 and decodes them. The decoding operation obtains the value of 2^N directly through combinational hardware logic, that is, by table lookup. Because the n first fractional bits can take only a finite number of values, the corresponding linear representations (at a given precision) can be enumerated in advance.
  • for example, the decoding sub-circuit 820 looks up the table and obtains the linear representation 1.0110101 corresponding to the first fractional bits 10.
  • the decoding sub-circuit 820 can be a device with a two-significant-bit input and an eight-significant-bit output, as shown in FIG. 10, which stores a table of the correspondence between the two-bit inputs and the eight-bit outputs. The eight-bit output applies to this example only; the number of bits can be increased or decreased as needed.
  • the shift sub-circuit 830 shifts the linear representation corresponding to the first fractional bits according to the value M of the integer part of the logarithm, to obtain the absolute value of the data in the linear representation format. Specifically, the shift sub-circuit 830 receives the m first integer bits from the acquisition sub-circuit 810 and the value of 2^N from the decoding sub-circuit 820, and shifts the decoded result.
  • a positive M means a left shift by M bits, and a negative M means a right shift by |M| bits. That is, S730 may include: when M is greater than 0, shifting the linear representation corresponding to the first fractional bits to the left by M bits to obtain the absolute value of the data in the linear representation format; when M is less than 0, shifting it to the right by |M| bits to obtain the absolute value of the data in the linear representation format.
  • M' may be used in the operation instead of M; if M' is non-negative, only a left shift is performed.
  • for example, the shift sub-circuit 830 shifts the linear representation 1.0110101 corresponding to the first fractional bits 10 to the left by two bits, according to the value 010 of the first integer bits, to obtain 101.10101.
  • the output sub-circuit 840 represents the data as binary data of 1+j+k bits in the linear representation format. Specifically, the output sub-circuit 840 receives the 1-bit first sign bit from the acquisition sub-circuit 810 and sets the second sign bit according to the first sign bit. For example, if the data is positive, the second sign bit is set to 0; if the data is negative, the second sign bit is set to 1; the embodiment of the present application does not limit this.
  • the output sub-circuit 840 receives the shifted result from the shift sub-circuit 830 and pads it with zeros or deletes invalid bits so that it conforms to the 1.j.k format.
  • the output sub-circuit 840 can also convert the result in the 1.j.k format into a two's-complement representation to obtain the final result, which is not limited by the embodiment of the present application.
  • for example, the output sub-circuit 840 pads with zeros, determines the second sign bit according to the first sign bit, and represents the data as binary data of 1+j+k bits in the linear representation format, for example as 1 0000101 10101000 in the format 1.7.8.
  • the method for converting a data format in the embodiment of the present application obtains data in the log domain representation format and converts it to the linear representation format with a simple table lookup and shift, without complex power operations; this reduces the overhead of data conversion between the log domain and the linear domain and improves the speed of convolution calculation.
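  • steps S710 to S740 can be traced end to end with the following sketch, which reproduces the 1.3.2 to 1.7.8 example. It is an illustration under stated assumptions rather than the circuit itself: the names `EXP_TABLE` and `log_to_linear` are hypothetical, the table is fixed at two input bits and 1.7 outputs, the integer bits are assumed non-negative (left shift only), and the output is sign-magnitude.

```python
# 2 ** (N / 4) for N = 00, 01, 10, 11, stored as 1.7 fixed-point values.
EXP_TABLE = (0b10000000, 0b10011000, 0b10110101, 0b11010111)

def log_to_linear(bits: str, m: int = 3, j: int = 7, k: int = 8) -> str:
    """Convert a 1.m.2 log domain bit string into a 1.j.k linear bit string."""
    n, w = 2, 7                                   # fixed by EXP_TABLE
    bits = bits.replace(" ", "")
    sign = bits[0]                                # S710: acquire the three fields
    M = int(bits[1:1 + m], 2)
    N = int(bits[1 + m:1 + m + n], 2)
    mantissa = EXP_TABLE[N]                       # S720: table lookup of 2 ** N
    magnitude = (mantissa << (k - w)) << M        # S730: align to k bits, shift left by M
    magnitude &= (1 << (j + k)) - 1               # S740: keep j+k bits (zero-padded below)
    return sign + format(magnitude, f"0{j + k}b")
```

  • calling `log_to_linear("1 010 10")` returns `1000010110101000`, that is, sign 1, integer part 0000101, and fractional part 10101000, which corresponds to about -5.66.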
  • For example, the acquisition sub-circuit 810 obtains the data 101010 in log-domain representation format 1.3.2, where the first sign bit is 1 (the data is negative), the first integer bits are 010, and the first fractional bits are 10.
  • The decoding sub-circuit 820 looks up the table to obtain 1.0110101, the linear-format value corresponding to the first fractional bits 10.
  • The decoding sub-circuit 820 may be a device with a two-bit input and an eight-bit output, as shown in FIG. 10, which stores a table of correspondences between the two-bit inputs and the eight-bit outputs. The eight-bit output is specific to this example; the number of bits can be increased or decreased as needed.
  • The shift sub-circuit 830 shifts 1.0110101 left by two bits, according to the value 010 indicated by the first integer bits, to obtain 101.10101.
  • The output sub-circuit 840 performs zero-padding, determines the second sign bit according to the first sign bit, and represents the data as (1+j+k)-bit binary data in the linear representation format; in format 1.7.8 the result is 1 0000101 10101000.
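  • The example above can be sketched in Python. This is only an illustrative software model, not the circuit itself: the function names `build_pow2_table` and `log_to_linear` are hypothetical, the table width `w = 7` is taken from the eight-bit output of this example, table entries are assumed to be truncated rather than rounded, and the integer bits are assumed to hold an unsigned value.

```python
def build_pow2_table(n, w):
    """Return {fraction bits: 2**(fraction / 2**n)} as integers with w fractional bits."""
    # Each entry models one row of the decoding table: 1 integer bit + w fractional bits.
    return {f: int(2 ** (f / 2 ** n) * 2 ** w) for f in range(2 ** n)}


def log_to_linear(bits, m, n, j, k, w=7):
    """Convert a 1.m.n log-domain bit string to a 1.j.k linear bit string."""
    sign = bits[0]                              # first sign bit, copied to the second sign bit
    M = int(bits[1:1 + m], 2)                   # integer bits (assumed unsigned in this sketch)
    N = int(bits[1 + m:1 + m + n], 2)           # fractional bits
    mantissa = build_pow2_table(n, w)[N]        # table lookup of 2**(0.N)
    shift = M - (w - k)                         # shift by M, then align to k fractional bits
    magnitude = mantissa << shift if shift >= 0 else mantissa >> -shift
    magnitude &= (1 << (j + k)) - 1             # drop bits that do not fit in j + k bits
    return sign + format(magnitude, f"0{j + k}b")


print(log_to_linear("101010", m=3, n=2, j=7, k=8))  # 1000010110101000
```

  • Rebuilding the table on every call keeps the sketch short; a hardware table would be fixed once, and a software version would normally cache it.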
  • FIG. 11 is a schematic flowchart of a method 900 for converting a data format according to an embodiment of the present application.
  • Method 900 can be performed by the logarithmic conversion circuit 1000.
  • FIG. 12 is a schematic diagram of logarithmic domain conversion performed by the logarithmic conversion circuit 1000 of the embodiment of the present application.
  • the logarithmic conversion circuit 1000 may include an acquisition sub-circuit 1010, an integer calculation sub-circuit 1020, a fractional calculation sub-circuit 1030, and an output sub-circuit 1040.
  • Each sub-circuit can be implemented based on an FPGA or ASIC.
  • the sub-circuits 1010-1040 are respectively used to execute S910-S940 of the method 900.
  • This embodiment describes the log-domain conversion in detail.
  • Log-domain conversion takes the base-2 logarithm of the absolute value of a linear-domain number (or it can also be combined with BASE) and combines the result with the sign of that number.
  • The conversion can be implemented in software (a CPU runs a program that outputs the converted value); to speed up processing, it can also be implemented in a dedicated hardware circuit (such as an ASIC or FPGA). This embodiment describes the hardware-circuit implementation.
  • the logarithmic domain conversion of the embodiment is performed by a logarithmic domain conversion circuit, and the logarithmic domain conversion circuit includes: an acquisition subcircuit, an integer calculation subcircuit, a fractional calculation subcircuit, and an output subcircuit; wherein, see FIG. 11 and FIG.
  • The processing performed by each sub-circuit is as follows:
  • The acquisition sub-circuit 1010 acquires data in the linear representation format 1.j.k, where 1 bit represents the second sign bit, j bits represent the second integer bits, and k bits represent the second fractional bits.
  • The data may be accumulated data output from the output buffer or data obtained in another way; this is not limited in the embodiments of this application.
  • For example, the acquisition sub-circuit 1010 obtains the data 0 0011001 11000000 (25.75 in decimal) in linear representation format 1.7.8, where the sign is positive, the second integer bits are 0011001, and the second fractional bits are 11000000.
  • The integer calculation sub-circuit 1020 determines the position h of the highest non-zero bit of the (j+k)-bit binary data (the lowest bit is counted as bit 0) and from it determines M, the integer part of the base-2 logarithm of the absolute value of the data, where M equals h minus k.
  • Taking the data 0 0011001 11000000 in format 1.7.8 as an example again, the highest non-zero bit is bit 12, so h = 12 and M = 12 - 8 = 4 (binary 100).
  • The fractional calculation sub-circuit 1030 takes the s bits that follow the highest non-zero bit (reading from high to low) and obtains from them N, the fractional part of the base-2 logarithm of the absolute value of the data.
  • Taking the data 0 0011001 11000000 in format 1.7.8 as an example again, the s bits (for example, 8 bits) after the highest non-zero bit are 10011100. Computing log2(1.10011100) gives N, approximately 0.69, which is 0.1011 in binary.
  • Under a given accuracy requirement, the log-domain values corresponding to the s bits can be enumerated exhaustively.
  • In general, the fractional calculation sub-circuit 1030 takes the value y of the s bits after the highest non-zero bit of the 1.j.k-format data obtained by the acquisition sub-circuit 1010 and computes log2(1.y) to obtain N.
  • There are many ways to obtain N; examples are given below.
  • The size of s can be chosen according to the accuracy requirement. Optionally, s is greater than n.
  • The fractional calculation sub-circuit 1030 may be a device with an eight-bit input and a five-bit output, as shown in FIG. 12. The five-bit output is specific to this example; the number of bits can be increased or decreased as needed.
  • The output sub-circuit 1040 represents the data as (1+m+n)-bit binary data in the log-domain representation format.
  • The output sub-circuit 1040 obtains the 1-bit second sign bit from the acquisition sub-circuit 1010 and assigns its value to the first sign bit; that is, it sets the first sign bit of the log-domain format according to the second sign bit. For example, the first sign bit is set to 0 for a positive number and to 1 for a negative number, although the embodiments of this application do not limit this.
  • The output sub-circuit 1040 obtains M, the integer part of the base-2 logarithm of the absolute value of the data, from the integer calculation sub-circuit 1020, and N, the fractional part, from the fractional calculation sub-circuit 1030.
  • The output sub-circuit 1040 may also convert the 1.m.n-format result into a two's-complement representation; this is not limited in the embodiments of this application.
  • The output sub-circuit 1040 adds M and N and pads the sum with zeros or removes invalid bits. It then sets the first sign bit according to the second sign bit and represents the data as (1+m+n)-bit binary data in the log-domain representation format, for example 0 100 1011 in format 1.3.4 (the first sign bit is 0, the first integer bits are the three bits 100, and the first fractional bits are the four bits 1011).
  • With the data format conversion method of the embodiments of this application, data in the linear representation format is converted to the log-domain representation format by simple bit extraction and comparison. No complex logarithm operation is needed, which reduces the overhead of converting data between the linear domain and the log domain and increases the speed of convolution computation.
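  • The whole linear-to-log conversion of the 25.75 example can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the circuit: `linear_to_log` is a hypothetical name, N is computed with `math.log2` and rounded to n bits instead of being read from a table or comparator bank, the integer part M is assumed to be non-negative, and zero is rejected because it needs a separate encoding.

```python
import math


def linear_to_log(bits, j, k, m, n, s=8):
    """Convert a 1.j.k linear bit string to a 1.m.n log-domain bit string."""
    sign = bits[0]                                  # second sign bit, copied to the first sign bit
    magnitude = int(bits[1:1 + j + k], 2)
    if magnitude == 0:
        raise ValueError("zero has no logarithm and needs a separate encoding")
    h = magnitude.bit_length() - 1                  # position of the highest non-zero bit
    M = h - k                                       # integer part of log2|x|
    rest = magnitude - (1 << h)                     # bits below the leading 1
    y = rest >> (h - s) if h >= s else rest << (s - h)   # the s bits after the leading 1
    N = round(math.log2(1 + y / 2 ** s) * 2 ** n)   # fractional part, rounded to n bits
    total = (M << n) + N                            # adding lets a rounded-up N carry into M
    if not 0 <= total < 1 << (m + n):
        raise ValueError("result does not fit this unsigned 1.m.n sketch")
    return sign + format(total, f"0{m + n}b")


print(linear_to_log("0001100111000000", j=7, k=8, m=3, n=4))  # 01001011
```

  • The printed string splits into sign 0, integer bits 100, and fractional bits 1011, which are the three fields named in the 1.3.4 example.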
  • In S930, obtaining N, the fractional part of the base-2 logarithm of the absolute value of the data, from the s bits may include:
  • looking up N in a table according to the s bits, where the table stores the value of N for every possible value of the s bits.
  • This way of determining N is called the lookup table method.
  • Continuing the example of FIG. 12, the 8 bits after the highest non-zero bit are 10011100, and the fractional calculation sub-circuit 1030 obtains the result of log2(1.10011100) by table lookup.
  • The fractional calculation sub-circuit 1030 stores a table of correspondences between eight-bit inputs and five-bit outputs.
  • The table records the result of log2(1.y) to 4 bits after the binary point.
  • Because the integer bit 1 of 1.y is fixed and the integer bit 0 of the output is also fixed, the table can be stored as an 8-bit-input, 4-bit-output table.
  • The resulting N is 0.1011 in binary.
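  • A minimal Python sketch of the lookup table method follows. The names `LOG2_TABLE` and `lookup_fraction` are hypothetical, and rounding to the nearest 4-bit value is an assumption chosen because it reproduces the 0.1011 of the example; the text does not state how table entries are rounded.

```python
import math

S_BITS = 8   # input width: the s bits after the highest non-zero bit
N_BITS = 4   # fractional bits kept in the result

# Entry y holds log2(1.y) scaled by 2**N_BITS and rounded to the nearest integer.
LOG2_TABLE = [round(math.log2(1 + y / 2 ** S_BITS) * 2 ** N_BITS)
              for y in range(2 ** S_BITS)]


def lookup_fraction(y):
    """Return N for the s-bit value y as a binary string such as '0.1011'."""
    entry = LOG2_TABLE[y]
    return f"{entry >> N_BITS}.{entry & (2 ** N_BITS - 1):0{N_BITS}b}"


print(lookup_fraction(0b10011100))  # 0.1011
```

  • With this rounding assumption, inputs close to 11111111 map to 1.0000, so a fifth output bit is needed if that case is kept in the table.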
  • Alternatively, in S930, obtaining N from the s bits may include: comparing the value of the s bits with 2^n preset comparison values, where the i-th comparison value is smaller than the (i+1)-th comparison value and the i-th comparison value corresponds to a value N_i; when the value of the s bits is greater than or equal to the T-th comparison value and smaller than the (T+1)-th comparison value, N is determined to be N_T.
  • This way of determining N is called the stepwise comparison method.
  • Continuing the example of FIG. 12, the 8 bits after the highest non-zero bit are 10011100, and the fractional calculation sub-circuit 1030 obtains N through comparison by a group of comparators.
  • FIG. 13 is a schematic diagram of the fractional calculation sub-circuit 1030 determining the value N of the fractional part.
  • A comparison value is preset in each comparator.
  • The preset comparison values are arranged in ascending order, that is, comparison value 0 < comparison value 1 < ... < comparison value 15.
  • The comparison values are also set according to the lookup table:
  • each table input at which the 4-bit output jumps is used as a comparison value.
  • Comparison value 0 corresponds to 1.00000000;
  • omitting the fixed integer bit, comparison value 0 is set directly to 00000000.
  • All values between two adjacent comparison values produce the same output, so the result of the stepwise comparison directly gives N.
  • In this example, the output of the selector is 0.1011.
  • When the value is greater than or equal to comparison value 0 of comparator 0 and smaller than comparison value 1 of comparator 1, the selector result is 0.0000.
  • When the value is greater than or equal to comparison value 1 of comparator 1 and smaller than comparison value 2 of comparator 2, the selector result is 0.0001.
  • When the value is greater than or equal to comparison value 14 of comparator 14 and smaller than comparison value 15 of comparator 15, the selector result is 0.1111.
  • In the remaining, highest range, the selector result is 1.0000.
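  • The stepwise comparison method can be sketched as follows. This is a software stand-in, not the comparator circuit of FIG. 13: `comparison_values`, `selector_outputs`, and `compare_fraction` are hypothetical names, the comparison values are derived from an assumed rounded table of log2(1.y), and `bisect` replaces the bank of parallel comparators.

```python
import bisect
import math

S_BITS, N_BITS = 8, 4
table = [round(math.log2(1 + y / 2 ** S_BITS) * 2 ** N_BITS)
         for y in range(2 ** S_BITS)]

# A comparison value is every input at which the table output jumps; input 0 is the first one.
comparison_values = [0] + [y for y in range(1, 2 ** S_BITS) if table[y] != table[y - 1]]
selector_outputs = [table[c] for c in comparison_values]


def compare_fraction(y):
    """Pick the output of the last comparison value that is <= y."""
    index = bisect.bisect_right(comparison_values, y) - 1
    return selector_outputs[index]


print(format(compare_fraction(0b10011100), "04b"))  # 1011
```

  • Because every input between two adjacent comparison values shares one output, the comparison result agrees with the full table while storing only the jump points.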
  • Alternatively, in S930, obtaining N from the s bits may include: comparing the value of the high x bits of the s bits with 2^n preset comparison values, where x is greater than 0 and less than s, the i-th comparison value is smaller than the (i+1)-th comparison value, and the i-th comparison value corresponds to a pair of values A and B; computing x*A+B; and obtaining N from the result of x*A+B.
  • The product of the high x bits and A can be shifted right by k bits, the shifted result added to B, and the sum then shifted left by k-n bits; the high n bits of the result are N.
  • Continuing the example, the 8 bits after the highest non-zero bit are 10011100, and their value is denoted S. The fractional calculation sub-circuit 1030 first looks up the table and then computes the piecewise fitting result to obtain N.
  • FIG. 14 is a schematic diagram of piecewise fitting according to an embodiment of this application.
  • FIG. 15 is a schematic diagram of the fractional calculation sub-circuit 1030 of the embodiment of this application determining the value N of the fractional part.
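  • The piecewise fitting idea can be illustrated with the following Python sketch. Several things here are assumptions rather than details from the text: four segments selected by the high two bits, a straight line through the end points of each segment as the pair (A, B), evaluation of the line on the full s-bit value, and floating-point arithmetic in place of the fixed-point shifts described above. The names `build_segments` and `fit_fraction` are hypothetical.

```python
import math

S_BITS = 8    # width of the s-bit input
X_BITS = 2    # high bits used to pick a segment (an assumption for this sketch)
N_BITS = 4    # fractional bits kept in the result


def build_segments():
    """Return one (A, B) pair per segment so that A*t + B approximates log2(1 + t)."""
    count = 2 ** X_BITS
    segments = []
    for i in range(count):
        t0, t1 = i / count, (i + 1) / count
        f0, f1 = math.log2(1 + t0), math.log2(1 + t1)
        slope = (f1 - f0) / (t1 - t0)
        segments.append((slope, f0 - slope * t0))   # line through both segment end points
    return segments


SEGMENTS = build_segments()


def fit_fraction(y):
    """Approximate N for the s-bit value y, rounded to N_BITS fractional bits."""
    a, b = SEGMENTS[y >> (S_BITS - X_BITS)]         # the high x bits select the pair (A, B)
    return round((a * (y / 2 ** S_BITS) + b) * 2 ** N_BITS)


print(format(fit_fraction(0b10011100), "04b"))  # 1011
```

  • Only one (A, B) pair per segment has to be stored, which trades one multiplication and one addition for a much smaller table.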
  • This embodiment discloses a system on chip (SoC) 14, which includes a processor core 141 (that is, a CPU core) and a PE array 142 composed of one or more processing elements (PEs) 1421; each PE may include the multiplication hardware circuit 1421 described in the above embodiments.
  • The SoC further includes an input buffer 143, an output buffer 144, a log-domain conversion circuit 145, and a control circuit 147. The components other than the CPU core can be collectively referred to as the computing engine or the data acceleration engine; their main function is to handle certain specific computations for the CPU core. Each component on the SoC is described below.
  • The CPU core mainly executes general-purpose software, for example running an operating system and various operating-system-based applications by executing instructions. When the CPU core has specific data to process (such as a large amount of image data) for which the computing engine is better suited, it can hand that data over to the computing engine for processing.
  • The input buffer 143 stores input data, which may come from the CPU core 141 or from the log-domain conversion circuit 145.
  • The type of the input data is not limited and depends on the application. For example, in a neural network system it may include the data to be computed and the parameters, and the data and parameters may be stored separately in multiple memories.
  • The output buffer 144 stores the results output by the PE array. If a result is needed again, it can be converted by the log-domain conversion circuit 145 and output to the input buffer 143 for the next round of computation.
  • The input buffer and the output buffer may be implemented with storage media such as SRAM or eDRAM.
  • The control circuit 147 is connected to the processor core 141 (that is, the CPU core), the input buffer 143, and the output buffer 144. The control circuit interacts with the processor core (for example, through DMA-based interaction, a custom protocol, or message exchange) so that the processor core can obtain the data in the output buffer.
  • The SoC of this application may also include other IP cores 146, such as a graphics processing unit (GPU) or a digital signal processor (DSP); this is not limited in this application.
  • This embodiment provides an electronic device 15, shown in a schematic structural diagram.
  • The electronic device includes an SoC 151. The SoC may contain multiple IP cores, for example a CPU core and an IP core composed of a PE array, or of a PE array, an input buffer, an output buffer, and a log-domain conversion circuit; it can also include other IP cores 146.
  • The SoC is usually packaged in a single chip, for example Huawei HiSilicon series chips (such as Kirin 950 and 960) or Qualcomm Snapdragon series SoC chips (such as Snapdragon 650 and 660).
  • Each IP core may also be packaged separately into its own chip, or several IP cores may be packaged together into one chip.
  • The electronic device 15 may also include other components, for example a memory 152 (such as main memory or flash memory), input and output devices 153 (such as a display screen, touch screen, speaker, mouse, or keyboard), and various communication modules 154 (such as WiFi, USB, Bluetooth, 4G, and 5G communication modules).
  • circuits or sub-circuits of embodiments of the present application can be implemented based on an ASIC, FPGA, or other programmable logic device, discrete gate or transistor logic device, discrete hardware component, or the like. It should be noted that an ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, discrete hardware component may be a stand-alone device or may be integrated with a memory (storage module).
  • memories described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • The computer instructions can be stored in a computer-readable storage medium or transferred from one computer-readable storage medium to another; for example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave).
  • the computer readable storage medium can be any available media that can be accessed by a computer or a data storage device such as a server, data center, or the like that includes one or more available media.
  • The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a high-density digital video disc (DVD)), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
  • The sequence numbers of the foregoing processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of this application.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • The division into units is only a division by logical function; other divisions are possible, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Nonlinear Science (AREA)
  • Complex Calculations (AREA)
  • Error Detection And Correction (AREA)

Abstract

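This application provides a multiplication circuit, a system on chip, and an electronic device. The multiplication circuit is configured to multiply two pieces of data A and B and includes: an addition sub-circuit, configured to obtain log-domain data a and log-domain data b corresponding to A and B respectively and to add a and b to obtain c, where c includes an integer part and a fractional part; an exponent operation sub-circuit, configured to perform an exponentiation with base 2 and with the fractional part of c as the exponent, to obtain an exponentiation result; a shift sub-circuit, configured to shift the exponentiation result according to the integer part of c, to obtain a shift result; and an output sub-circuit, configured to output the product of A and B according to the signs of a and b in combination with the shift result. The multiplication circuit of this application can reduce the overhead of converting data between the linear domain and the log domain and increase the speed of multiply-accumulate computation.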

Description

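Multiplication circuit, system on chip, and electronic device
This application claims priority to Chinese Patent Application No. 201710852544.4, filed with the China Patent Office on September 19, 2017 and entitled "Multiplication hardware circuit, system on chip, and electronic device", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of data processing, and more specifically, to a multiplication circuit, a system on chip, and an electronic device.
Background
In today's information age, Internet and Internet of Things applications generate large amounts of data every day, and mining and processing this data often yields valuable information. With the spread of driverless cars, drones, and smart terminals, artificial intelligence has received wide attention; it uses neural network technology to process data from various sensors in real time and thereby perceive the external environment. In these data processing algorithms, matrix operations are a core computation pattern, for example matrix multiplication, and matrix multiplication is built on multiply-accumulate operations. Processing one 224*224 image with a typical VGG16 neural network takes 29 billion multiply-accumulate operations, which poses a severe challenge to current computing hardware and computing models. In a multiply-accumulate operation, multiplication is the most expensive part; it limits the speed of matrix operations and also affects the power consumption of the device.
Summary
This application provides a multiplication circuit, a system on chip, and an electronic device that can reduce the overhead of converting data between the linear domain and the log domain and can increase the speed of various multiplication-based operations.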
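According to a first aspect, this application provides a multiplication circuit configured to multiply two pieces of data A and B, including: an addition sub-circuit, configured to obtain log-domain data a and log-domain data b corresponding to A and B respectively and to add a and b to obtain c, where c includes an integer part and a fractional part; an exponent operation sub-circuit, configured to perform an exponentiation with base 2 and with the fractional part of c as the exponent, to obtain an exponentiation result; a shift sub-circuit, configured to shift the exponentiation result according to the integer part of c, to obtain a shift result; and an output sub-circuit, configured to output the product of A and B according to the signs of a and b in combination with the shift result.
In a possible implementation of the first aspect, the log-domain data a and the log-domain data b are obtained by taking the base-2 logarithm of the absolute values of A and B respectively and combining the result with their sign bits, and each includes 1+m+n binary bits, where m and n are positive integers, the first bit is the sign bit, m bits are the integer part, and n bits are the fractional part.
In a possible implementation of the first aspect, the integer part of c is the sum of the integer part of a and the integer part of b, and the fractional part of c is the sum of the fractional part of a and the fractional part of b.
In a possible implementation of the first aspect, the log-domain data corresponding to the value 0 is defined as follows: the sign bit is 1, and the integer and fractional parts are all 0.
In a possible implementation of the first aspect, A and B each include 1+j+k binary bits, where j and k are positive integers, the first bit is the sign bit, j bits are the integer part, and k bits are the fractional part.
In a possible implementation of the first aspect, the exponentiation result is a number greater than or equal to 1 and less than 2 and includes 1+w binary bits, where the first bit is the integer part, w bits are the fractional part, and w is a positive integer greater than or equal to 1. When shifting the exponentiation result according to the integer part of c, the shift sub-circuit is specifically configured to shift the exponentiation result left by X bits, where X = (integer part of c) - (w-k). The shift result is the absolute value of the product of A and B, whose integer part includes j binary bits and whose fractional part includes k binary bits. When the left-shift amount X is less than 0, shifting left by X bits is equivalent to shifting right by |X| bits.
In a possible implementation of the first aspect, the exponent operation sub-circuit is a decoding circuit configured to decode the fractional part of c to obtain the exponentiation result; or the exponent operation sub-circuit is a lookup table circuit configured to look up the exponentiation result in a table according to the fractional part of c.
In a possible implementation of the first aspect, the multiplication circuit further includes an accumulator configured to accumulate the product of A and B with another piece of data from the multiplication circuit; or the accumulator is configured to accumulate the product of A and B with the product from another multiplication circuit.
The multiplication circuit of the first aspect implements multiplication with an addition sub-circuit, an exponent operation sub-circuit, a shift sub-circuit, and an output sub-circuit, and needs no complex exponent operation circuit. These sub-circuits cost fewer resources than a multiplier circuit and occupy fewer logic resources, which reduces the area and power consumption of the device.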
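According to a second aspect, a system on chip is provided, including: a processor core, a multiplication hardware circuit array composed of one or more multiplication hardware circuits according to the first aspect or any of its possible implementations, a data input buffer, a data output buffer, and a control circuit. The control circuit is connected to the processor core, the data input circuit, and the data output circuit. The data input circuit is configured to obtain data from the processor core through the control circuit. The multiplication hardware circuit array is configured to obtain data from the data input buffer, process it to obtain a result, and output the result to the data output buffer. The control circuit is further configured to interact with the processor core so that the processor core obtains the data in the data output buffer.
In a possible implementation of the second aspect, the system on chip further includes a logarithmic conversion circuit, configured to perform the log-domain conversion on the output of the multiplication hardware circuit array and input the converted result to the data input buffer.
In a possible implementation of the second aspect, the log-domain conversion circuit includes an integer calculation sub-circuit, a fractional calculation sub-circuit, and a second sign bit determining sub-circuit. The linear-domain array output data is a binary number of 1+j+k bits, where j and k are positive integers, 1 bit is the second sign bit and indicates the sign S, j bits indicate the value J of the integer part of the absolute value of the linear-domain data, and k bits indicate the value K of the fractional part of the absolute value of the linear-domain data. The integer calculation sub-circuit is configured to calculate, from the position h1 of the highest non-zero bit of the (j+k)-bit binary number of the linear-domain array output data, the difference between h1 and k; the difference represents the integer part of the base-2 logarithm of the absolute value of A1, where the lowest bit of the (j+k)-bit binary number of the linear-domain array output data A1 is counted as bit 0. The fractional calculation sub-circuit is configured to obtain, from a predetermined number s of bits following the highest non-zero bit (reading from high to low) of the linear-domain array output data, the fractional part of the base-2 logarithm of the absolute value of the linear-domain array output data. The second sign bit determining sub-circuit is configured to determine the sign of the log-domain array output data according to the sign of the linear-domain array output data, thereby obtaining the log-domain array output data.
In a possible implementation of the second aspect, the fractional calculation sub-circuit is specifically configured to: obtain, by table lookup or decoding, the value N1 corresponding to the s bits following the highest non-zero bit of A1 (reading from high to low), and obtain, by table lookup or decoding, the value N2 corresponding to the s bits following the highest non-zero bit of A2, where the table stores the value N corresponding to every possible value of the s bits.
In a possible implementation of the second aspect, the fractional calculation sub-circuit is specifically configured to: compare the value of the s bits following the highest non-zero bit of A1 with 2^n preset comparison values, where the i-th comparison value is smaller than the (i+1)-th comparison value and the i-th comparison value corresponds to a value N_i; when the value of those s bits is greater than or equal to the T1-th comparison value and smaller than the (T1+1)-th comparison value, determine that N1 is N_T1; compare the value of the s bits following the highest non-zero bit of A2 with the 2^n preset comparison values in the same way; and when the value of those s bits is greater than or equal to the T2-th comparison value and smaller than the (T2+1)-th comparison value, determine that N2 is N_T2.
In a possible implementation of the second aspect, the fractional calculation sub-circuit is specifically configured to: compare the value of the high x bits of the s bits following the highest non-zero bit of A1 with 2^n preset intervals, where the i-th interval corresponds to a pair of values αi and βi and x is greater than 0 and less than s; when that value falls in a first interval, find the pair of values α1 and β1 corresponding to the first interval, compute x*α1+β1, and obtain N1 from the result; compare the value of the high x bits of the s bits following the highest non-zero bit of A2 with the 2^n preset intervals; and when that value falls in a second interval, find the pair of values α2 and β2 corresponding to the second interval, compute x*α2+β2, and obtain N2 from the result.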
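According to a third aspect, this application provides a multiplication hardware circuit for multiplication. Specifically, the hardware circuit here refers to a circuit implemented on an ASIC, an FPGA, or the like, rather than on a general-purpose processor (for example, an x86- or ARM-based processor that must fetch instructions to perform specific operations). In theory a general-purpose processor could also be used, but it is not efficient; to process data more efficiently, a hardware circuit implemented on an ASIC or FPGA is needed. The multiplication hardware circuit in this embodiment refers to a hardware circuit capable of multiplication, and it is not precluded from performing other operations on top of multiplication, such as accumulation.
The multiplication hardware circuit in this application includes a log-domain adder and a linear-domain conversion circuit; the linear-domain conversion circuit includes an exponent operation sub-circuit, a shift sub-circuit, and a sign bit determining sub-circuit.
The log-domain adder is configured to add first log-domain data a1 and second log-domain data a2 to obtain log-domain data c1, where log-domain data is data obtained by taking the base-2 logarithm of the absolute value of linear-domain data and combining it with the sign bit of the linear-domain data. a1 and a2 are obtained by log-domain conversion of the two multiplication operands, first data A1 and second data A2, respectively. a1, a2, and c1 are each a binary number of 1+m+n bits, where m and n are positive integers, 1 bit is the first sign bit and indicates whether the value is positive or negative (hereinafter also simply "sign"; this application does not distinguish between the two terms), m bits indicate the value of the integer part, and n bits indicate the value of the fractional part.
In this application, log-domain data and linear-domain data are defined relative to each other: if log-domain conversion is applied to a piece of data (that is, the base-2 logarithm of its absolute value is taken and combined with its sign bit) to obtain another piece of data, the data before conversion is called linear-domain data and the converted data is called log-domain data. For example, for the data -8, the base-2 logarithm of its absolute value is log2|-8| = 3; combining this with the sign of -8 (-) gives -3, so -8 is the linear-domain data and -3 is the log-domain data. Conversely, data can also undergo linear-domain conversion (that is, 2 is raised to the power of the absolute value of the data, and the result is combined with the sign bit of the data). For example, for -3, first 2^|-3| = 8 is computed, and combining it with the sign (-) gives -8.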
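The -8 and -3 example can be checked with a short Python sketch. The function names `to_log_domain` and `to_linear_domain` are hypothetical, and the sketch keeps the sign as a separate field next to the logarithm (an assumption made for clarity) instead of writing the pair as the single number -3.

```python
import math


def to_log_domain(x):
    """Return (sign, log2|x|) for a non-zero linear-domain value x."""
    return (-1 if x < 0 else 1, math.log2(abs(x)))


def to_linear_domain(sign, log_value):
    """Invert to_log_domain: raise 2 to the stored logarithm and reapply the sign."""
    return sign * 2.0 ** log_value


print(to_log_domain(-8))          # (-1, 3.0)
print(to_linear_domain(-1, 3.0))  # -8.0
```

Keeping the sign apart from the logarithm avoids confusing the sign of the data with the sign of the logarithm, which is itself negative for values whose absolute value is below 1.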
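In this application, prefixes such as "first" and "second" are used only to distinguish different instances of the same kind of noun and carry no other special meaning. For example, "first log-domain data" and "second log-domain data" distinguish different instances of "log-domain data"; they do not mean that the "first log-domain data" is log-domain data in one particular format and the "second log-domain data" is log-domain data in another format. When this application refers to "log-domain data", its properties apply to both the "first log-domain data" and the "second log-domain data".
The multiplication hardware circuit further includes an exponent operation sub-circuit, configured to obtain N' from the value N1 of the fractional part of c1, where N' is 2 to the power N1; that is, the exponent operation sub-circuit obtains N' by computing 2^N1. Note that many computations in this application involve some error, because in a digital circuit implemented on an ASIC or FPGA the representation of many numbers (for example, fractions with many digits, or even irrational numbers) is limited by the hardware (different bit widths can represent different ranges of data). Therefore, "N' is 2 to the power N1" should be understood not as exact equality in all cases but as equality under specific hardware limits (such as a specific bit width); the final result may be exactly equal or only approximately equal. The other operations described below follow the same principle, which is not repeated.
The multiplication hardware circuit further includes a shift sub-circuit, configured to shift the N' obtained by the exponent operation sub-circuit according to the value M1 of the integer part of c1, to obtain the absolute value of C1, where C1 is the product of A1 and A2. Note that in this application, shifting left by a negative number of bits (for example, left by -3 bits) means shifting right by the absolute value of that number (right by |-3|, that is, 3 bits); conversely, shifting right by a negative number of bits means shifting left by the absolute value of that number.
The multiplication hardware circuit further includes a sign bit determining sub-circuit, configured to determine the sign of C1 from the values of the sign bits of a1 and a2, and to obtain C1 from the absolute value of C1 produced by the shift sub-circuit and the sign of C1. The sign of C1 is essentially determined by the signs of A1 and A2, which in turn determine the signs of a1 and a2 respectively; therefore the sign of C1 can be determined from the signs of a1 and a2, the two pieces of data obtained by the log-domain adder. The rule is well known to a person skilled in the art: if one operand is positive and the other negative, the product is negative; if both are positive or both are negative, the product is positive.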
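The four stages described above (log-domain addition, exponentiation of the fractional part, shifting by the integer part, and sign determination) can be modelled together in Python. This is a sketch under assumptions, not the hardware: `log_multiply` is a hypothetical name, each operand is passed as a sign bit plus its logarithm as a fixed-point integer with n fractional bits, the exponent stage is computed on the fly instead of being read from a table, and the widths n = 4, w = 8, k = 8 are illustrative.

```python
def log_multiply(a, b, n=4, w=8, k=8):
    """Multiply two values given as (sign bit, log2|value| scaled by 2**n).

    Returns (sign bit, |product| scaled by 2**k).
    """
    sign_a, log_a = a
    sign_b, log_b = b
    c = log_a + log_b                            # log-domain adder
    M1, N1 = c >> n, c & ((1 << n) - 1)          # integer and fractional parts of c1
    n_prime = int(2 ** (N1 / 2 ** n) * 2 ** w)   # exponent stage: 2**N1 with w fractional bits
    shift = M1 - (w - k)                         # shift stage: left by M1 - (w - k)
    magnitude = n_prime << shift if shift >= 0 else n_prime >> -shift
    return sign_a ^ sign_b, magnitude            # sign stage: differing signs give a negative product


# -8 * 4: log2(8) = 3 and log2(4) = 2, each scaled by 2**4.
print(log_multiply((1, 3 << 4), (0, 2 << 4)))  # (1, 8192), that is -(8192 / 2**8) = -32
```

The only arithmetic on the operands is one addition and one shift; the exponent stage depends on just n bits, which is why a small table or decoder can replace it.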
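In the prior art, to multiply two numbers of the form V = (-1)^S × Fraction × 2^Exp, the two Fraction parts are multiplied, the two Exp (exponent) values are added, and the Fraction product is then shifted by the sum of the exponents. The two Fractions can be regarded as being multiplied in the linear domain; if both are represented with 11 bits, an 11-bit × 11-bit multiplier is required, which is costly to implement: it occupies many logic resources and increases both area and power consumption.
In the third aspect, by contrast, multiplication is implemented with the exponent operation sub-circuit, the shift sub-circuit, and the sign bit determining sub-circuit, and no complex exponent operation sub-circuit is needed. These sub-circuits cost fewer resources than a multiplier circuit and occupy fewer logic resources, which reduces the area and power consumption of the device.
For addition: 16-bit floating-point (FP16) addition is expensive. For example, to add two numbers V1 = (-1)^S × Fraction1 × 2^Exp1 and V2 = (-1)^S × Fraction2 × 2^Exp2:
Because the exponents differ, the numbers cannot be added directly.
1) Align the binary points: determine the larger of exp1 and exp2, then change the smaller exponent to the larger one;
for example, exp1 = 5 and exp2 = 3;
then V2 becomes V2 = (-1)^S × (Fraction2 × 2^(Exp2-Exp1)) × 2^Exp1
2) Adding gives (-1)^S × (Fraction1 + (Fraction2 × 2^(Exp2-Exp1))) × 2^Exp1
3) Normalize (Fraction1 + (Fraction2 × 2^(Exp2-Exp1))) to the form 1.xxx; this produces a shift offset x, which is added to the exponent to give 2^(Exp1+x)
4) Express the result in the standard 1.5.10 format: 5 bits for Exp1+x, 10 bits for xxx, and 1 sign bit.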
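With reference to the third aspect, in a first possible implementation of the third aspect, N' is a binary number of 1+w bits representing a number greater than or equal to 1 and less than 2, where the 1 bit gives the integer part of the number and the w bits give its fractional part;
the shift circuit is specifically configured to shift N' left by M1-(w-k) bits to obtain a final shift result, where the rightmost k bits of the final shift result indicate the fractional part of the absolute value of C1, and the j bits to the left of the rightmost k bits indicate the integer part of the absolute value of C1.
With reference to the third aspect, in a second possible implementation of the third aspect, N' is a binary number of 1+w bits representing a number greater than or equal to 1 and less than 2, where the 1 bit gives the integer part of the number and the w bits give its fractional part;
the shift sub-circuit includes a first shift sub-circuit and a second shift sub-circuit;
the first shift sub-circuit is configured to shift N' left by M1 bits (note: shifting left by a negative number of bits is the same as shifting right by a positive number of bits);
the second shift sub-circuit is configured to shift the result of the first shift sub-circuit left by a further -(w-k) bits to obtain the final shift result, where the rightmost k bits of the final shift result indicate the fractional part of the absolute value of C1, and the j bits to the left of the rightmost k bits indicate the integer part of the absolute value of C1.
With reference to the third aspect and its possible implementations, in a third possible implementation, the exponent operation sub-circuit is a decoding sub-circuit configured to decode N1 of c1 to obtain N'; or
the exponent operation sub-circuit is a lookup table sub-circuit configured to look up N' in a table according to N1 of c1.
With reference to the third aspect and its possible implementations, in a fourth possible implementation,
the multiplication hardware circuit further includes an accumulator, configured to accumulate C1 with another piece of linear-domain data C2 from the multiplication hardware circuit; or
the accumulator is configured to accumulate C1 with linear-domain data C3 from another multiplication hardware circuit.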
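According to a fourth aspect, this application discloses a system on chip (SoC), including a processor core, a multiplication hardware circuit array composed of one or more multiplication hardware circuits according to the first aspect or any of its implementations, a data input buffer, a data output buffer, and a control circuit;
the control circuit is connected to the processor core, the multiplication hardware circuit array, the data input circuit, and the data output circuit;
the data input circuit is configured to obtain data from the processor core through the control circuit;
the multiplication hardware circuit array is configured to obtain data from the data input buffer, process it to obtain a result, and output the result to the data output buffer through the control circuit.
The composition of the multiplication hardware circuit array (how many multiplication hardware circuits are used, how they are arranged, and so on) is prior art and is not the focus of this application; the focus is the specific implementation of the multiplication hardware circuits that make up the array. In this application, the input buffer and the output buffer may be implemented with storage media such as SRAM or eDRAM.
With reference to the fourth aspect, in a first possible implementation of the fourth aspect, the SoC further includes a logarithmic conversion circuit;
the logarithmic conversion circuit is configured to perform the log-domain conversion on the output of the multiplication hardware circuit array and input the converted result to the data input buffer. Specifically, data is obtained from the output buffer, converted, and output to the input buffer, so that the multiplication hardware circuit array can subsequently obtain the data from the input buffer for computation.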
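With reference to the first possible implementation of the fourth aspect, in a second possible implementation of the fourth aspect, the log-domain conversion circuit includes an integer calculation sub-circuit, a fractional calculation sub-circuit, and a second sign bit determining sub-circuit. The linear-domain array output data is a binary number of 1+j+k bits, where j and k are positive integers, 1 bit is the second sign bit and indicates the sign S, j bits indicate the value J of the integer part of the absolute value of the linear-domain data, and k bits indicate the value K of the fractional part of the absolute value of the linear-domain data;
the integer calculation sub-circuit is configured to calculate, from the position h1 of the highest non-zero bit of the (j+k)-bit binary number of the linear-domain array output data, the difference between h1 and k; the difference represents the integer part of the base-2 logarithm of the absolute value of the linear-domain array output data, where the lowest bit of the (j+k)-bit binary number of the linear-domain array output data is counted as bit 0;
the fractional calculation sub-circuit is configured to obtain, from a predetermined number s of bits following the highest non-zero bit (reading from high to low) of the linear-domain array output data (s is greater than or equal to k; if fewer than s bits are available, zeros are appended), the fractional part of the base-2 logarithm of the absolute value of the linear-domain array output data; specifically, this can be obtained by table lookup or decoding;
the second sign bit determining sub-circuit is configured to determine the sign of the log-domain array output data according to the sign of the linear-domain array output data, thereby obtaining the log-domain array output data.
With reference to the second possible implementation of the fourth aspect, in a third possible implementation of the fourth aspect, assuming that A1 and A2 mentioned in the aspects above are the linear-domain array output data, the fractional calculation sub-circuit is specifically configured to:
compare the value of the s bits following the highest non-zero bit of A1 with 2^n preset comparison values, where the i-th comparison value is smaller than the (i+1)-th comparison value and the i-th comparison value corresponds to a value N_i; and when the value of those s bits is greater than or equal to the T1-th comparison value and smaller than the (T1+1)-th comparison value, determine that N1 is N_T1;
compare the value of the s bits following the highest non-zero bit of A2 with 2^n preset comparison values, where the i-th comparison value is smaller than the (i+1)-th comparison value and the i-th comparison value corresponds to a value N_i; and when the value of those s bits is greater than or equal to the T2-th comparison value and smaller than the (T2+1)-th comparison value, determine that N2 is N_T2.
With reference to the second possible implementation of the fourth aspect, in a fourth possible implementation of the fourth aspect, the fractional calculation sub-circuit may be implemented in another way, that is, the fractional calculation sub-circuit is specifically configured to:
compare the value of the high x bits of the s bits following the highest non-zero bit of A1 with 2^n preset intervals, where the i-th interval corresponds to a pair of values αi and βi and x is greater than 0 and less than s; and when that value falls in a first interval, find the pair of values α1 and β1 corresponding to the first interval, compute x*α1+β1, and obtain N1 from the result;
compare the value of the high x bits of the s bits following the highest non-zero bit of A2 with the 2^n preset intervals; and when that value falls in a second interval, find the pair of values α2 and β2 corresponding to the second interval, compute x*α2+β2, and obtain N2 from the result.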
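According to a fifth aspect, this application discloses an electronic device (which may be a mobile phone, tablet, smart watch, smart TV, or other electronic device), including a memory and the system on chip (SoC) according to the second aspect or any of its implementations (or the fourth aspect or any of its implementations);
the memory is configured to store the instructions needed to run a program;
the processor core in the SoC is configured to execute the instructions to run the program and to send the data to be processed to the multiplication hardware circuit array;
the multiplication hardware circuit is configured to process the data and output the result to the data output circuit, so that the processor core finally obtains the result.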
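Brief Description of Drawings
FIG. 1 is a schematic diagram of a multiply-accumulate operation;
FIG. 2 is a schematic block diagram of an ARM SoC architecture according to an embodiment of this application;
FIG. 3 is a schematic block diagram of the structure of a computing engine;
FIG. 4 is a schematic structural diagram of the multiplier in Embodiment 1 of this application;
FIG. 5 is a schematic flowchart of a multiply-accumulate operation according to an embodiment of this application;
FIG. 6 is a schematic diagram of the log-domain representation format according to an embodiment of this application;
FIG. 7 is a schematic diagram of the linear representation format according to an embodiment of this application;
FIG. 8 is a schematic diagram of shifting by the shift sub-circuit provided in an embodiment of this application;
FIG. 9 is a schematic flowchart of a method for converting a data format according to an embodiment of this application;
FIG. 10 is a schematic diagram of linear-domain conversion performed by the linear conversion circuit according to an embodiment of this application;
FIG. 11 is a schematic flowchart of a method for converting a data format according to an embodiment of this application;
FIG. 12 is a schematic diagram of log-domain conversion performed by the logarithmic conversion circuit according to an embodiment of this application;
FIG. 13 is a schematic diagram of a fractional calculation sub-circuit determining the value of the fractional part;
FIG. 14 is a schematic diagram of piecewise fitting according to an embodiment of this application;
FIG. 15 is a schematic diagram of another fractional calculation sub-circuit determining the value of the fractional part;
FIG. 16 is a schematic structural diagram of an SoC of this application;
FIG. 17 is a schematic structural diagram of an electronic device of this application.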
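Detailed Description
Based on the aspects in the Summary and their related implementations, the technical solutions of this application are described in detail below with reference to the accompanying drawings.
Abstracted into a mathematical model, a matrix operation can be a multiply-accumulate operation, also called a multiply-add operation. P and Q can each denote a matrix or a vector. P·Q can denote a matrix operation in the broad sense, including at least one of the matrix*vector, matrix*matrix, and vector*vector convolution operations. The value of P·Q can be obtained by multiplying each element p_i of P by the corresponding element q_i of Q to obtain the product p_i·q_i and then accumulating these products; this process is the multiply-accumulate operation. A multiply-accumulate operation yields one element of the result matrix.
Because multiplication accounts for much of the cost of multiply-accumulate operations, one existing solution is a logarithmic number system, in which data is converted from the linear domain (linear field) to the log domain (log field) for representation and computation. In this application, the log domain and the linear domain are two different data representation formats, and the log domain is defined relative to the linear domain: data represented in the log domain (hereinafter "log-domain data", "log data", "log value", or "log-domain value") is obtained by converting the absolute value of data represented in the linear domain (hereinafter "linear-domain data", "linear data", "linear value", or "linear-domain value") into a logarithm (usually base 2 for ease of computation, that is, log2(|x|), where x is the linear-domain data) and combining it with a sign bit.
For example, to compute the product F*G of data F and data G (taking positive F and G as an example), first compute their log-domain representations, log2(F) = f and log2(G) = g; then F*G = 2^f × 2^g = 2^(f+g). Here f and g are log-domain data and F and G are the corresponding linear-domain data. This conversion turns the multiplication of the linear-domain data F and G into the addition of the log-domain data f and g: f+g is the log-domain representation of the product F*G, and converting the log-domain result f+g back to linear-domain data (computing 2^(f+g), for example with shift and decoding circuits) gives the value of F*G. In this way, multiplication of data becomes addition of the base-2 logarithms of the absolute values of the data, and the multiplication itself is avoided. Although computing the logarithm has some cost, a piece of data may take part in many multiplications in a matrix operation, so its log-domain representation is computed once and used many times, and the overall computation cost of the matrix operation is still reduced.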
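The identity F*G = 2^(f+g) and its use in a multiply-accumulate can be illustrated with a short Python sketch. The function name `log_domain_dot` is hypothetical, only positive inputs are handled (as in the example above), and floating-point `math.log2` stands in for the fixed-point log-domain format.

```python
import math


def log_domain_dot(p, q):
    """Multiply-accumulate of positive values, with each product formed as 2**(f + g)."""
    log_p = [math.log2(v) for v in p]   # converted once, reusable across many products
    log_q = [math.log2(v) for v in q]
    return sum(2.0 ** (f + g) for f, g in zip(log_p, log_q))


print(log_domain_dot([8, 2], [4, 16]))  # 64.0
```

Each product costs one addition in the log domain; the accumulation itself still happens in the linear domain, which is why every product is converted back before summing.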
现有的一种矩阵运算流程中,采用电气和电子工程师协会(Institute of Electrical and Electronics Engineers,IEEE)IEEE-754标准浮点数据表示格式,例如采用半精度的16比特(bit)浮点数据表示格式(或者采用单精度的32比特浮点数据表示格式)。
图1示出了乘加运算的示意图,其步骤如下。
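S110: Two data items in log-domain format are input. For example, linear-domain data F and G are converted into log-domain data f and g respectively, where f and g may be data in a 16-bit floating-point format; the linear-domain multiplication F*G is thereby converted into the log-domain addition f+g.
S120: Based on the addition result in the log domain (that is, f+g), a standard floating-point exponent operation circuit computes 2^(f+g); the result is still in the 16-bit floating-point format.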
S130,将采用16比特浮点数据表示格式的S120中的数据与同为16比特浮点数据表示格式的其他数据做加法运算,得到累加和(SUM),仍采用16比特浮点数据表示格式。
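In the prior art, to multiply two numbers of the form V=(-1)^S×Fraction×2^Exp, the two Fraction parts are multiplied, the two Exp (exponent) parts are added, and the Fraction product is then shifted by the sum of the exponents. The two Fractions are in effect multiplied in the linear domain. If each Fraction is represented with 11 bits, an 11-bit × 11-bit multiplier is needed, which is costly to implement: for example, it occupies many logic resources and increases area and power consumption.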
而申请中,是通过指数运算子电路、移位子电路以及符号位确定子电路的方式来实现相乘,此时,并不需要复杂的指数运算子电路,这些子电路的实现比乘法电路更节省资源,占用逻辑资源少,从而可以减少器件占用的面积以及功耗。
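For addition: an FP16 (16-bit floating-point) addition is expensive. For example, to add two numbers V1=(-1)^S×Fraction1×2^Exp1 and V2=(-1)^S×Fraction2×2^Exp2:
Because the exponents differ, the two numbers cannot be added directly, and the following operations are needed:
1) Align the binary points: determine the larger of Exp1 and Exp2, and rewrite the number with the smaller exponent in terms of the larger exponent;
for example, Exp1=5 and Exp2=3;
then V2 is rewritten as V2=(-1)^S×(Fraction2×2^(Exp2-Exp1))×2^Exp1.
2) Add, obtaining (-1)^S×(Fraction1+(Fraction2×2^(Exp2-Exp1)))×2^Exp1.
3) Normalize (Fraction1+(Fraction2×2^(Exp2-Exp1))) into the form 1.xxx. This may produce a shift offset x (positive or negative), which must be added to the exponent, turning 2^Exp1 into 2^(Exp1+x).
4) Express the result in the standard 1.5.10 format: 5 bits for Exp1+x, 10 bits for xxx, and 1 sign bit.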
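In many scenarios, the result of one multiply-add operation is not the final result of the matrix operation. For example, in a neural network computation, the multiply-accumulate result (SUM) may become input data for the next layer and must be converted into log-domain format; for example, a standard floating-point logarithm operation circuit may compute log2(SUM), and the result is stored in the 16-bit floating-point format.
This existing matrix operation flow uses a standard floating-point format with a large bit width, for example 16 bits or 32 bits. A floating-point value V usually includes a sign bit S, exponent bits Exp, and significand bits Fraction, where V=(-1)^S×Fraction×2^Exp. This format is relatively complex and does not lend itself to fast conversion of data between the linear domain and the log domain.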
此外,标准的浮点指数运算单元和标准的浮点对数运算单元硬件资源消耗大,在浮点数据表达下,计算累加和的累加器(Accumulator)资源消耗也很大。
基于以上问题,本申请提供了一种转换数据格式的方法、电路、计算引擎和卷积计算芯片,能够减小数据在线性域和对数域之间转换时的开销,提高卷积计算的速度。
下面先介绍应用本申请实施例的硬件架构。本申请实施例以卷积神经网络(Convolutional Neural Network,CNN)计算用于移动手机芯片的场景为例进行介绍。图2是本申请实施例的进阶精简指令集机器(Advanced RISC Machine,ARM)系统单芯片(System-on-a-chip,SoC)架构200的示意性框图。
如图2所示,ARM SoC架构200例如包括主控中央处理器(Central Processing Unit,CPU)210、双倍速率(Double Data Rate,DDR)内存控制器220、先进可扩展接口(Advanced eXtensible Interface,AXI)总线230和硬件计算模块240。
其中,硬件计算模块用于进行一些专用数据处理,即硬件计算模块用于进行例如针对图像、音频等数据的一些“专用的”的处理(如基于神经网络的机器学习)。硬件计算模块与通用的处理器(如CPU)相比,最大特点是基于各种逻辑电路(如与门、或门、非门等等)实现,而不像CPU那样具有一定的指令集(如x86指令集、ARM指令集)以执行指令的方式来完成数据处理。典型的硬件计算模块可以基于FPGA、ASIC等实现。
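A CPU usually has its own dedicated instruction set and, by executing instructions, performs the data processing other than the dedicated data processing performed by the hardware computing module. (In theory the CPU could also perform the dedicated data processing performed by the hardware computing module, but, limited by the CPU hardware architecture, it would do so relatively inefficiently.)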
DDR内存又称为DDR SDRAM,即DDR同步动态随机存储器(Synchronous Dynamic Random Access Memory,SDRAM)。如图2所示,卷积计算芯片240包括输入缓存242、计算引擎244和输出控制模块246。CPU 210通过AXI总线230控制计算启动,卷积计算芯片240通过AXI总线230从DDR内存220获取需要处理的数据(如针对图像处理器时, 获取图像数据以及和训练参数),然后将数据送至计算引擎244,计算引擎244根据输入的数据内容计算并将计算结果写回DDR内存220,并通知CPU 210计算完成。
本申请实施例的改进在于计算引擎244(可以以IP core的形式存在,另外,IP core也可以包括更多的电路,例如,包括输入缓存242等)。图3示出了计算引擎的结构的示意性框图。如图3所示,计算引擎244包括直接内存存取(Direct Memory Access,DMA)控制单元、数据缓存、参数缓存、多个处理元件(Processing Element,PE)(形成PE阵列)、输出缓存和对数转换电路。数据缓存和参数缓存可以认为是输入缓存,用于缓存图像数据和训练参数,图像数据和训练参数可以认为是数据。
其中,PE可视为用于实现特定功能的电路,例如,本申请中,PE可以包括乘法硬件电路,可进行基于乘法的各种运算(比如乘法、或者乘加运算);乘法硬件电路(PE)中包括线性转换电路(也叫做线性转换单元,图3中用线性转换单元表示)。PE阵列(即乘法硬件电路)的构成是现有技术,例如,可以像图3所示的方式进行构成,此时,阵列中第一级的PE(第一列以及第一行)为后续的PE传递数据,也可以让每个PE都连接输入缓存(参数缓存及数据缓存)。
计算引擎244的计算过程中,DMA控制单元(可视为一个控制电路)将需要的图像数据和训练参数从外部的DDR内存220读取到数据缓存和参数缓存;图像数据和训练参数通过PE阵列进行乘加运算;运算结果输出至输出缓存,然后通过对数转换电路将运算结果从线性域转换为对数域;最终结果可以输回到数据缓存作为下一轮乘加运算的输入数据使用,或者直接输出至DDR内存220进行存储。
实施例一
参见图4,为本实施例硬件乘法器40(本申请中也称“乘法硬件电路”、“乘法电路”、“乘法器”)的结构示意图,该乘法器可以用于对两个数据A和B进行乘法运算,该乘法器包括:
加法子电路41,用于获取与A和B分别对应的对数域数据a和对数域数据b,对a和b执行加法运算得到c,c包括整数部分和小数部分;
指数运算子电路42,用于执行底数是2,指数是c的小数部分的指数运算,得到指数运算结果;
移位子电路43,用于根据c的整数部分对指数运算结果进行移位,得到移位结果;其中,移位结果用于指示A与B的乘积;
输出子电路44,用于根据a和b的符号,结合移位结果,输出A和B的乘积。
此外,乘法电路40还可以包括累加器45,用于将数据A和B的乘积与来自同一个乘法电路的另一个数据进行累加运算;或者用于将数据A和B的乘积与来自另一个乘法电路的乘积进行累加运算。
上述各个子电路可以基于ASIC或者FPGA来实现。在一个典型的示例中,上述硬件乘法器基于ASIC来实现,同时,可以与CPU、GPU等其他硬件封装在一个芯片中,构成一个SoC(片上系统)。本实施例中,各个子电路的实现都可以采用很简单的电路,占用资源很少,这样,整个乘法器的实现也非常简单,占用资源少,使得一个芯片可以在同样的资源(如面积、功耗)下可以集成更多的乘法器,提升芯片的运算能力。
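The cooperation of the four subcircuits can be sketched as a small software model. This is a hedged illustration only: it assumes a 1.3.2 log-domain input format, BASE=0, an exponent table with w=7 fractional bits, and a 1.7.8 linear output; the names `EXP_TABLE` and `log_multiply` are hypothetical, and the shift amount X=(integer part of c)-(w-k) follows the shift subcircuit described later in this application.

```python
N_FRAC = 2   # n: fractional bits of the log-domain format (assumed 1.3.2)
W = 7        # w: fractional bits of the exponent-table output (assumed)
K = 8        # k: fractional bits of the linear-domain result (assumed 1.7.8)

# Exponent operation subcircuit modelled as a table: 2^(i / 2^n) with 1 + w bits.
EXP_TABLE = [round(2 ** (i / 2 ** N_FRAC) * 2 ** W) for i in range(2 ** N_FRAC)]

def log_multiply(a, b):
    """a, b: (sign, magnitude) pairs; magnitude is the m+n bit log-domain code (BASE = 0)."""
    (sign_a, mag_a), (sign_b, mag_b) = a, b
    c = mag_a + mag_b                     # adder subcircuit: c = a + b
    c_int = c >> N_FRAC                   # integer part of c
    c_frac = c & (2 ** N_FRAC - 1)        # fractional part of c
    exp_result = EXP_TABLE[c_frac]        # exponent subcircuit: 2^(fractional part of c)
    shift = c_int - (W - K)               # shift subcircuit: X = c_int - (w - k)
    magnitude = exp_result << shift if shift >= 0 else exp_result >> -shift
    sign = sign_a ^ sign_b                # output subcircuit: sign of the product
    return sign, magnitude                # magnitude carries K fractional bits
```

With a=0 010 10 and b=1 011 10, the model returns sign 1 and magnitude 64 (scaled by 2^8), that is, the product -64 of approximately 5.657 and -11.314. No multiplier appears anywhere in the data path; only an adder, a table, and a shifter are used.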
以下将通过各实施例来对各子电路进行介绍。
实施例二
基于上述各实施例,本实施例对加法子电路41进行具体介绍。
加法子电路41用于对对数域数据a的绝对值和对数域数据b的绝对值进行相加,得到c。其中,a、b都是对数域数据,通过分别对线性域数据A、B进行对数域转换得到。
本申请中,对数域包括1+m+n个二进制比特位(也可表示为1.m.n),m、n都是正整数,其中第1个比特是符号位用于指示正负符号(下文也简称“符号”,本申请并不进行区分),m个比特是整数部分(或者说m比特用于指示整数部分的值),n个比特是小数部分(或者说n比特用于指示小数部分的值)。其中,m、n的取值可以根据系统需要的精度来确定,位数最多,精度也越大,但也会相应地增加一些硬件资源。本领域技术人员可以结合系统对精度以及硬件资源的要求来取合适的m、n的值。
可以理解,相加是将a的绝对值的小数部分与b的绝对值的小数部分相加(可以有进位)得到c的小数部分,指将a的绝对值的整数部分与b的绝对值的整数部分相加(可以再加上进位)得到c的整数部分,c的整数部分与小数部分也基于m+n个二进制比特位来表示。
本申请中,对数域转换的目的在于将一个数据变成对数格式,然后基于该对数格式的数据进行运算。其中,对数域数据与线性域数据是相对的,如果一个数据(如上文提到的A或B)进行了对数域转换得到另一个数据(如上文提到的a或者b),则称对数域转换前的数据(A或B)为线性域数据,称转换后的数据(a或b)为对数域数据。
具体的对数域转换可以包括多种实现方式,下面对其中两种方式进行具体介绍:
方式一
在一种实现方式中,对数域转换可以是指对一个线性域数据的绝对值取以2为底的对数值,并结合符号位来表示。可以理解的是,“结合该数据正负符号位后来进行表示”具体实现时,最简单的是可以直接使用该数据的符号位作为对数域转换后的数据的符号位。当然,也可以使用相反的符号作为对数域转换后的数据的符号位,后续只要记住这种变换规律,也能够实现对数域数据与线性域数据之间的相互转换。
本申请中,如图6所示,对数域数据被表示为1+m+n比特(bit)的二进制数据,以下也记为1.m.n。其中,m和n均为正整数,1比特为第一符号位S,用于指示数据的值的正负,m比特为整数位,用于指示数据的绝对值取以2为底的对数值的整数部分的值M,n比特为小数位,用于指示数据的绝对值取以2为底的对数值的小数部分的值N。
线性域数据F与对数域数据(基于1.m.n格式表示)之间转换的基本关系如下述公式所示:
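F = (-1)^S × 2^(M+N) = (-1)^S × (2^N << M)
Here "<<" is the left-shift operator, and "<<M" means a left shift by M bits. Specifically, when M is greater than 0 the value is shifted left by M bits, and when M is less than 0 it is shifted right by |M| bits. The sign bit S represents the sign of F (positive or negative) and does not take part in operations in the log-domain format. When F is negative, log2 of a negative number does not exist over the reals, so in the embodiments of this application the 1.m.n format (log-domain format) expresses -log2(|F|), that is, the logarithm of the absolute value of F with the negative sign carried by S.
For example, for the data -8, the base-2 logarithm of the absolute value of -8 is log2|-8|=3; combining this with the sign of -8 (-) gives -3. -8 is then called linear-domain data and -3 log-domain data. Conversely, data can also undergo linear-domain conversion (that is, 2 is raised to the power of the absolute value of the data, and the result is combined with the sign bit of the data). For example, for -3, 2 raised to the power |-3| equals 8, and combining this with the sign bit (-) gives -8.
When conversion is performed in manner 1, the base-2 logarithm of the absolute value of linear-domain data may be positive or negative (for example, log2(0.25)=-2). To indicate whether the logarithm is positive or negative, one of the m bits that represent the integer part of the log-domain data may be used as a sign; this bit is called the log-domain integer-part sign bit, and the value of the remaining m-1 bits equals the absolute value of the integer part of the base-2 logarithm of the absolute value of the data.
For example, the base-2 logarithm of the decimal data 0.25 is log2(0.25)=-2, that is, M=-2. The most significant of the 3 integer bits in log-domain format 1.3.2 is used as the log-domain integer-part sign bit, so the decimal data 0.25 is represented as 0 110 00 in log-domain format 1.3.2. The most significant (leftmost) 1 of 110 is the log-domain integer-part sign bit; 1 indicates negative (0 would indicate positive).
In manner 1, after log-domain conversion the integer part of the log-domain data can be positive or negative, which faithfully reflects the base-2 logarithm of the absolute value of the linear-domain data. However, the extra sign bit wastes a little of the m bits of the log-domain integer part (the bit width increases), and the sign bit must also be taken into account during computation, which adds a little overhead.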
方式二
在另一种实现方式中,对数域转换可以是指对一个线性域数据的绝对值取以2为底的对数值,并基于一个基准值将该对数值转成大于等于0的数,并结合符号位来表示。
第一整数位的值可选地可以为非负数,第一整数位的值等于数据(例如,F或G)的绝对值取以2为底的对数值的整数部分的值M与基准值(BASE)的差。
具体地,可以理解,线性域数据的绝对值取以2为底的对数值的整数部分的值M可以是正数也可以是负数。为了数据表达上的简便,对数域数据整数位可以设置为没有符号位,而是用对数域数据整数位的值M’=M-BASE表示真实的整数部分的原始数据M,值M’可以认为是指示值,即该值并不真实等于“线性域数据的绝对值取以2为底的对数值的整数部分的值M”,但与M存在一定的对应关系(M’=M-BASE),如果要得到数据的绝对值取以2为底的对数值的整数部分的值M,可以通过M=M’+BASE来计算。换句话说,由于对于不同的数据而言,其对应的M有可能是负的,为了防止在数据表达上出现该负值,在M的基础上减去BASE,使得对数域表示格式的对数域数据整数位的值M’始终保持为非负数。
例如,在一个简单的示例中,十进制下的数据0.25取以2为底的对数的结果log2(0.25)=-2,即M=-2。如果取BASE的值为-2,则M’=M-(BASE)=-2-(-2)=0。因此十进制下的数据0.25在对数域表示格式1.3.2下表示为0 000 00。
本申请中,针对不同的数据(例如,不同类型的数据,或者不同时间段的数据),BASE的取值也可以不同。该基准值BASE的取值原则可以是使得其适用的所有的数据(例如某一批次的数据)对应的M’均为非负数。例如,对于一批特定属性的数据的真实整数部分的原始数据M的范围为-7~0时,m为3bit,则BASE可以设置为一个让M’范围为从0开始的一个非负范围,也即可以将BASE设置为-7,由此就可以保证对数域数据整数位的值(M’=M-(-7))的范围是0~7。当然,在另一实施例中,也可以设置为-8之类的数(对应 的M’的范围是1~8),只要能够最终让M’不为负数即可。
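In addition, in some application scenarios the value of M' may be range-limited, that is, restricted to lie between a minimum and a maximum: a value below the minimum is replaced by the minimum, and a value above the maximum is replaced by the maximum. For example, again with the raw data M in the range -7 to 0 and m equal to 3 bits, if BASE is -5, the range of M' is [-2, 5]. A circuit may then limit M' to the range [0, 3]: a value less than 0 (for example -1 or -2) becomes 0, and a value greater than 3 (for example 4 or 5) becomes 3. Range limiting may be applied to M' or to M, as long as the final M' lies within the predetermined range.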
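The BASE offset and the optional range limiting can be summarized in a few lines. This is a sketch under stated assumptions: the helper names `encode_integer_part` and `decode_integer_part` and the explicit `lowest`/`highest` parameters are hypothetical and only illustrate M'=M-BASE with saturation.

```python
def encode_integer_part(M, base, lowest, highest):
    """Map the true integer part M to the stored value M' = M - BASE,
    saturating M' to the range [lowest, highest]."""
    m_prime = M - base
    return max(lowest, min(highest, m_prime))

def decode_integer_part(m_prime, base):
    """Recover the integer part from the stored value: M = M' + BASE."""
    return m_prime + base
```

With BASE=-7 and a 3-bit field (range 0 to 7), every M from -7 to 0 maps to a distinct non-negative M'. With BASE=-5 and the limited range [0, 3], values outside the range saturate, so decoding a saturated value no longer returns the original M.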
在本申请实施例中,BASE的选择是可配置的,可以通过线性转换电路的外部元件配置BASE,即通过外部元件将BASE传入线性转换电路,并在线性转换电路中软件的编译过程中确定该BASE值可用。
可以理解,由于基于BASE进行了对数域转换,因此,在后续的处理过程(需要准确的结果之前)中也需要对基于该变换所得到的结果再使用BASE进行适应性调整(由于BASE是对数域的,因此,调整时基于2^BASE来进行调整),来得到正确的结果。
通过方式二基于BASE的对数域转换方式,可以防止数据在对数域表示格式的表达上在整数位出现负值,这样整数位不用单独设置一位符号位,可以使数据表达更简洁,节省了运算开销(运算时不需要考虑符号位)。
可以理解,上述两种方式的转换都针对非0的情况下,当数据F=0时,由于log 2(F)是负无穷,采用1.m.n格式不便于表示和后续计算,所以可以采用特殊形式表示。例如,F=0的对数域表示格式可以为100000…,即第一符号位S是1,第一整数位的值和第一小数位的值全部为0。
下面举几个具体的例子来说明本申请实施例的数据的对数域表示格式和线性表示格式。为了简单起见,假设数据的绝对值取以2为底的对数值的整数部分的值M为非负的,则BASE值可以取0。当然,BASE也可以取-1或小于-1的其他值,本申请实施例对此不作限定。
以对数域表示格式为1.3.2为例,数据在对数域的二进制表示格式为0 010 10,其中,用1比特(0)表示符号位,其中0表示正数,1表示负数,用3比特(010)表示整数部分,用2比特(10)表示小数部分。该数据在二进制下的值为(+)2^(0.10)<<(2-0)=(+)101.10101,在线性表示格式1.3.5下为0 101 10101,其中,在线性域表示的格式中,用1比特(0)表示符号位,用3比特(101)表示整数部分以及用5比特(10101)表示小数部分。
数据在对数域表示格式1.3.2下为1 011 10,该数据在二进制下的值为(-)2^(0.10)<<(3-0)=(-)1011.0101,在线性表示格式1.4.4下为1 1011 0101。
对数域表示格式1.3.2下正数的最大值为011111,该数据在二进制下的值为(+)2^(0.11)<<(7-0)=(+)11010111,在线性表示格式1.8.0下为0 11010111。
对数域表示格式1.3.2下正数的最小值为000000,该数据在二进制下的值为(+)2^(0.00)<<(0-0)=(+)1.0000000,在线性表示格式1.1.7下为0 1 0000000。
The largest negative number in log-domain format 1.3.2 is 100001; its value in binary is (-)2^(0.01)<<(0-0)=(-)1.0011000, which is 1 1 0011000 in linear format 1.1.7.
对数域表示格式1.3.2下负数的最小值为111111,该数据在二进制下的值为(-)2^(0.11) <<(7-0)=(-)11010111,在线性表示格式1.8.0下为1 11010111。
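The examples above can be checked numerically with a small decoder. This is a floating-point sketch rather than a bit-exact model; the function name `decode_log_value` is hypothetical, and the all-zero magnitude with sign bit 1 is treated as the special encoding of zero described earlier.

```python
def decode_log_value(bits, m=3, n=2, base=0):
    """Decode a 1.m.n log-domain bit string such as '001010' into a float."""
    if len(bits) != 1 + m + n:
        raise ValueError("bit string does not match the 1.m.n format")
    sign = int(bits[0])
    if sign == 1 and int(bits[1:], 2) == 0:
        return 0.0                          # special pattern 100000... encodes zero
    M = int(bits[1:1 + m], 2) + base        # integer part of log2|F|
    N = int(bits[1 + m:], 2) / 2 ** n       # fractional part of log2|F|
    magnitude = 2.0 ** (M + N)
    return -magnitude if sign else magnitude
```

For instance, `decode_log_value("001010")` evaluates 2^(2+0.5), approximately 5.657, which agrees with the binary value 101.10101 given above up to the 8-bit rounding of the exponent table.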
因此,本申请实施例的数据的对数域表示格式可以用较小的位数表示很大的数值范围。在实际应用中,根据图像数据的特性、训练参数的特性和中间结果所需的精度,估计所需表达的数据可能取的负数最小值~正数最大值的范围。从而确定对数域表示格式1.m.n中m的取值和n的取值。例如,图像数据或训练参数的绝对值取以2为底的对数值越大,m的取值应越大;精度要求越高,n的取值应越大。应理解,该m的取值和n的取值的确定过程可以由软件(例如,可以通过CPU等通用处理器完成)来实现,本申请实施例对此不作限定。
实施例三
基于上述各实施例,本实施例对指数运算子电路42进行具体介绍。
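The exponent operation subcircuit 42 computes 2^(fractional part of c). The "fractional part of c" is a fraction greater than or equal to 0 and less than 1; for example, if the fractional part of c is 0.32, the exponent operation actually performed is 2^0.32.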
需要说明的是,本申请中的计算过程很多时候都是具有误差的,这是因为在基于ASIC、FPGA的数字电路实现时,很多数(例如,有很多位的小数,甚至无理数)的表示会受到硬件的限制而不能精确表达(例如,数的表达会受到位宽的限制)。因此,本申请中的进行的“2^(c的小数部分)”运算的实际结果需要理解成并非在所有情况下都是完全等于,而是基于特定硬件限制(如特定的位宽)下的等于,也即最后的结果可能是刚好等于,但也有可能是约等于。本申请下文中的其他各种运算也基于同样的原则,后续不再赘述。
具体的,指数运算子电路为译码子电路,用于根据c的小数部分译码得到指数运算结果;或者指数运算子电路也可以为查表子电路,用于根据c的小数部分查表得到指数运算结果。无论是译码子电路还是查表子电路,总的思路都是事先基于一定的精度值设计好“c的小数部分”与“指数运算结果”之间的映射关系,后续通过译码或者查表的方式来根据“c的小数部分”得到“指数运算结果”。
例如,在一个具体的例子中,指数运算子电路820可以是图10所示的两位有效位输入八位有效位输出的器件(例如,一个译码器),能够完成“c的小数部分”与“指数运算结果”之间的转换。应理解,八位有效位输出是本例中的情况,可根据实际需要增加或减少bit。
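Specifically, when n=2 there are 4 possible values of the first fractional bits (00, 01, 10, 11). The results of 2^N (where N is 0, 0.25, 0.5, or 0.75) are computed in advance, rounded to 8 bits, and stored as a table, as shown in FIG. 10 and below:
00--10000000 (binary 1.0000000, decimal 1.00)
01--10011000 (binary 1.0011000, decimal 1.1875)
10--10110101 (binary 1.0110101, decimal 1.4140625)
11--11010111 (binary 1.1010111, decimal 1.6796875)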
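The table entries can be regenerated offline. The sketch below assumes round-to-nearest and 1+w=8 output bits; `build_exp_table` is a hypothetical helper name, not part of the described circuit.

```python
def build_exp_table(n=2, w=7):
    """Return 2^(i / 2^n) for every n-bit fraction i, rounded to 1 + w bits."""
    return [round(2 ** (i / 2 ** n) * 2 ** w) for i in range(2 ** n)]

table = build_exp_table()
for i, entry in enumerate(table):
    # Print the input bits, the stored 8-bit code, and the value it represents.
    print(f"{i:02b} -- {entry:08b}  (decimal {entry / 2 ** 7})")
```

Changing `n` or `w` regenerates a table for a different precision, which is how the number of output bits could be increased or decreased as needed.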
这两种电路的具体实现都是本领域技术公知的技术,本申请并不赘述。同时,这两种电路的实现也是非常简单,占用很少的硬件资源。
实施例四
基于上述各实施例,本实施例对移位子电路43进行具体介绍。
移位子电路实质上是用于将指数运算结果左移c的整数部分,即进行2^(c的整数部分)指数运算,得到移位结果,该移位结果等于A与B的乘积的绝对值(当然,由于数字电路的关系,会存在一定的误差),后续再为该移位结果确定正负符号位,就可得到最终的A与B的乘积。
可以理解,移位子电路43进行移位时,其移位的方式是需要跟取数据时的方式(即取哪几位作为整数部分,哪几位作为小数部分)相匹配,其最终得到的移位结果如果没有通过与移位方式相匹配的取数据的方式来取时,该移位结果也不是一个最终的结果。
例如,二进制1左移3位相当于在10进制下乘以2^3,其结果等于二进制数1000(相当于10进制数8),此时,移位子电路最终结果呈现的是1000,但如果后面的电路在获取这个结果时,将1000中前2位10作为整数部分,将后两位作为小数部分,那么将会得到10.00(二进制数)这个结果,显然就出现了错误。
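Therefore, in this application, "essentially shifting the exponent operation result left by the integer part of c" means a shifting manner that is equivalent in principle to a left shift by the integer part of c, so as to multiply by 2^(integer part of c), and that must be paired with a matching manner of reading the data. In practice, the number of positions actually shifted need not be exactly the integer part of c, provided that a matching manner of reading the data makes the final result equal to the result of multiplying by 2^(integer part of c).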
例如,如果“实质上”要对一个二进制数1左移3位,但在实际中也可以将其左移5位,得到二进制数100000,但取最终结果时,取前4位作为整数部分,后2位作为小数部分,这样,也能得到正确的结果。
本申请中,如无特殊说明,可认为“指数运算结果左移c的整数部分”即为“实质上将指数运算结果左移c的整数部分以进行2^(c的整数部分)运算”。
在一个实施例中,参见图7,线性域的数据基于1+j+k(或者记为1.j.k)的格式来表示,即1个比特用于表示符号,j个比特用于表示整数部分以及k个比特用于表示小数部分。上述移位结果等于A与B的乘积,为一个线性域数据,为了更好地输出符合1+j+k格式的线性域数据,可以采用下面具体的移位及相对应的取数据的方法来实现。
具体的,本申请中,指数运算结果等于2^(c的小数部分),由指数运算规律决定了该数为大于等于1、小于2的一个数。本实施例中使用1+w个二进制比特位来表示指数运算结果,其中,第1个比特为整数部分(等于1),w个比特为小数部分,w为大于等于1的正整数;
由于指数运算结果是由1+w比特构成的大于等于1、小于2一个小数,其可表示成例如1.01010(二进制,w=5)之类的小数,在存储时,虽然不需要对小数点进行存储,但在逻辑层面,可认为实际上存储的1+w比特的数据在第1比特后面有一个小数点。如果进行移位时,可认为小数点的位置被固定不动,被移位的数“经过”该小数点。
例如,如果对1.0101101进行左移2位,则变成101.01101;如果左移(-3)位(即右移3位),则变成0.0010101101。
在一个通过具体硬件电路实现的实施例中,移位子电路用于根据c的整数部分对指数运算结果进行移位时,具体用于先将指数运算结果放在j+k比特的存储器中,其中,指数运算结果的最低位与j+k比特的最低位对齐,j+k大于等于1+w;然后对指数运算结果执行左移X位,X=c的整数部分-(w-k)。相应地,要获取最终结果时,取移位的结果的最高的j个比特作为最终结果的整数部分,取剩下的k个比特作为最终结果的小数部分;其中, 当左移位数为小于0时,左移X位等于右移X的绝对值位(例如,左移(-3)位相当于右移3位)。
例如,假设w=7,指数运算结果为10101101(在逻辑层面为二进制数1.0101101),c的整数部分为2,j=8,k=8,则:X=2-(7-8)=3,参见图8,为具体的移位方法:
1)如图8中的图(a)所示,先将指数运算结果10101101放在j+k=8+8=16比特的存储器中;
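2) The result after the left shift is shown in (b) of FIG. 8. The high j (8) bits (the integer part of the final result) are (00000)101, where the digits in parentheses are zeros padded to fill 8 bits; the low k (8) bits (the fractional part of the final result) are 01101(000), where the digits in parentheses are likewise padded zeros. The final result is therefore 00000101.01101000.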
需要说明的是,本实施例中,线性域数据的符号位的值并不通过移位子电路确认,而由下一级的输出子电路根据对数域数据a、b的符号来确定。
可以看到,通过上述移位方式,最后得到的结果就是对1.0101101左移c的整数部分位(即左移2位)后得到的结果101.01101(完整的通过8+8比特表示的为0000010101101000)。通过上述实现方式,后续需要获取移位结果时,都可以基于统一的“前高j比特为整数部分的值,低k比特为小数部分的值”的获取方式来获取整数部分和/或小数部分的值。
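It should be noted that bits may be lost in some cases. For example, with the data shown in FIG. 8, a right shift by one bit loses the trailing 1. Even so, the final value is still read according to the same principle: the high j bits are the value of the integer part and the low k bits are the value of the fractional part.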
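The shifting and reading rule can be modelled as follows. This is an illustrative sketch, assuming the register is exactly j+k bits wide and that bits shifted out of either end are simply discarded; the function name `shift_to_fixed_point` is hypothetical.

```python
def shift_to_fixed_point(exp_result, c_int, w, j, k):
    """Place a (1 + w)-bit value in a (j + k)-bit register and shift it left by
    X = c_int - (w - k); a negative X means a right shift by |X|."""
    if j + k < 1 + w:
        raise ValueError("register must hold at least 1 + w bits")
    x = c_int - (w - k)
    shifted = exp_result << x if x >= 0 else exp_result >> -x
    register = shifted & ((1 << (j + k)) - 1)         # keep only j + k bits
    return register >> k, register & ((1 << k) - 1)   # (integer bits, fraction bits)
```

For the example above (w=7, exponent result 10101101, integer part of c equal to 2, j=k=8), X=3 and the function returns the integer bits 00000101 and the fraction bits 01101000. With an integer part of -2, X=-1 and the trailing 1 is lost, as noted above.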
由于移位子电路实现的功能仅仅是移位,因此,实现起来也很简单,占用资源少。
实施例五
基于上述各实施例,本实施例对输出子电路44进行具体介绍。
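The output subcircuit is configured to output the product of A and B based on the signs of a and b in combination with the shift result. It can be understood that, if the signs of a and b correspond to the signs of A and B respectively, the sign of the product of A and B can be determined from the signs of a and b. For example, when the sign of a is the same as that of A and the sign of b is the same as that of B, the sign of the product of A and B follows directly from the rule of signs for multiplication (positive times positive is positive, positive times negative is negative, negative times negative is positive) applied to the signs of a and b. For example, when one of a and b is positive and the other is negative, the product of A and B is negative.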
本实施例中,输出子电路用于进行符号运算,实现也很简单,占用资源少。
实施例六
基于上述各实施例,本实施例对乘加运算进行具体介绍。图5是乘加运算的示意性流程图。如图5所示,以实现F1*G1+F2*G2+F3*G3……为例,乘加运算包括以下步骤。
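S410: Perform addition in log-domain format.
Data F1 and data G1 are data in linear format; in log-domain format they are f1 and g1 respectively, where f1=log2(F1) and g1=log2(G1). The two log-domain data items f1 and g1 are input and added to obtain c1, that is, c1=f1+g1, which converts the linear-format multiplication F1*G1 into the log-domain addition f1+g1. It should be understood that the input log-domain data f1 and g1 may be prepared in advance by software (that is, by a general-purpose processor such as a CPU) or by hardware (for example, a hardware device such as a field programmable gate array (Field Programmable Gate Array, FPGA) or an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC)).
S420: Perform linear-domain conversion.
c1=f1+g1 in log-domain format is converted into C1 in linear format, that is, 2^c1 is computed, where C1=2^c1=F1*G1. It should be understood that linear-domain conversion is performed by the linear conversion circuit inside each PE. In the existing floating-point format, the linear conversion circuit may be based on a conventional floating-point exponent operation unit, that is, 2^c1 is computed by a floating-point exponent operation unit. Because the floating-point format V has the structure described above, computing 2^c1 in a floating-point exponent operation unit involves a large amount of computation and is slow. In this application, the linear-domain conversion circuit may include the exponent operation subcircuit, the shift subcircuit, the output subcircuit, and so on shown in FIG. 4.
S430: Perform addition in linear format.
C1 in linear format obtained in S420 is added to the existing accumulation result. S410 and S420 are performed repeatedly to obtain Ci, where Ci=2^ci=Fi*Gi, ci=fi+gi, and i=1, 2, 3, …; the maximum value of i is determined by the scale of the multiply-add operation. The accumulation result is SUM=C1+C2+C3+…. The loop on the right of S430 in FIG. 5 that points back to S430 indicates that the Ci obtained this time is accumulated with the previous accumulation result (it may also be accumulated with data from another multiplier). After SUM is obtained, if SUM is next to be used as a multiplier in a further multiplication, S440 is performed; if no further operation on SUM is needed, SUM may be output directly through the output buffer to the DDR memory 220 for storage.
S440: Perform log-domain conversion.
After the accumulation result SUM is obtained, if SUM is next to be used as a multiplier in a further multiplication, SUM is converted into log-domain format, that is, log2(SUM) is computed, and the result is fed back to the data buffer as input data so that S410 can be performed again. It should be understood that log-domain conversion is performed by the log conversion circuit outside the PE.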
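The following uses a specific example to describe the multiply-add process in the embodiments of this application.
Two data items F and G are input.
In log-domain format 1.3.2, f=001010, which represents the binary data F: (+)2^(0.10)<<(2-0)=101.10101.
In log-domain format 1.3.2, g=101110, which represents the binary data G: (-)2^(0.10)<<(3-0)=-1011.0101.
For example, a multiply-add operation needs to compute C1=F*G, C2=F*G, and SUM=C1+C2, and then express SUM as sum in log-domain format 1.4.4 (that is, compute log2(SUM)) for use in subsequent multiplication. C1 is denoted c1 and C2 is denoted c2 in log-domain format 1.4.4, and BASE=-3.
(1) First compute f+g=-(01010+01110)=-(11000)=111000; this log-domain addition corresponds to the multiplication in the linear domain.
(2) Linear-domain conversion: C1=C2=(-)2^(0.00)<<(6-3)=(-)1000.0000, which in 1.7.8 format is C1=C2=1000 1000.0000 0000.
(3) Compute SUM=C1+C2=1001 0000.0000 0000 (-16 in decimal).
(4) Express SUM as sum in log-domain format 1.4.4: find the position h=12 of the most significant 1 of the absolute value of SUM and subtract the number of fractional bits of the linear format, k=8, to obtain the value indicated by the integer bits of the log-domain format, 12-8=4 (0100). Because BASE=-3, M'=M-BASE=4-(-3)=7, which is written as 0111. Take the s consecutive bits (for example, 8 bits) after the most significant 1, denote them S, and look up log2(1.S) in a table; looking up log2(1.00000000) gives the fractional bits 0000 of the log-domain format. Concatenate the integer bits and fractional bits of the log-domain format and set the sign bit to match the original sign bit, which gives sum=101110000 in log-domain format 1.4.4.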
实施例七
基于上述各实施例,下面详细说明本申请实施例的转换数据格式的方法700,即图5中的S420所示的线性域转换的流程。图9是本申请实施例的转换数据格式的方法700的示意性流程图。方法700可以由线性转换电路800执行。图10是本申请实施例的线性转换电路800进行线性域转换的示意图。如图10所示,线性转换电路800可以包括获取子电路810、译码子电路820、移位子电路830和输出子电路840。各子电路可以基于FPGA或ASIC实现。其中,子电路810~840分别用于执行方法700的S710~S740。
本实施例转换数据格式的方法700包括:
S710,获取子电路810获取对数域表示格式1.m.n的数据。其中,1比特为第一符号位,m比特为第一整数位,n比特为第一小数位。这里的数据可以包括图像数据和/或训练参数。具体地,PE中的线性转换电路可以从数据缓存获取对数域表示格式的图像数据;和/或从参数缓存获取对数域表示格式的训练参数。
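For example, the acquisition subcircuit 810 obtains the data 1 010 10 in log-domain format 1.3.2, where the first sign bit is 1 (the data is negative), the first integer bits are 010, and the first fractional bits are 10.
S720: The decoding subcircuit 820 looks up a table to obtain the linear format corresponding to the first fractional bits. Specifically, the decoding subcircuit 820 obtains the n first fractional bits from the acquisition subcircuit 810 and decodes them. The decoding operation obtains the result of 2^N directly through combinational hardware logic, which is equivalent to obtaining the result of 2^N directly by table lookup. Because the n first fractional bits can take only a finite number of values, the corresponding linear formats can be enumerated (for a given precision). Still using log-domain format 1.3.2 as an example, 2^0.00 corresponds to binary 1.0000000, 2^0.01 to binary 1.0011000, 2^0.10 to binary 1.0110101, and 2^0.11 to binary 1.1010111, where the exponent N is written in binary; these correspondences are computed in advance and stored in the table.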
译码子电路820查表得到第一小数位10对应的线性表示格式1.0110101。在一个具体的例子中,译码子电路820可以是图10所示的,两位有效位输入八位有效位输出的器件,其中存储有两位有效位输入与八位有效位输出的对应关系表。应理解,八位有效位输出是本例中的情况,可根据实际需要增加或减少比特。
应理解,当n=2时,第一小数位共有4组不同情况(00,01,10,11),通过事先计算2 N的结果,将结果四舍五入至8比特记录保存成表如图10以及如下:
00--10000000(代表线性表示格式的数值1.0000000)
01--10011000(代表线性表示格式的数值1.0011000)
10--10110101(代表线性表示格式的数值1.0110101)
11--11010111(代表线性表示格式的数值1.1010111)
S730,移位子电路830根据对数值的整数部分的值M,对第一小数位对应的线性表示格式进行移位,得到数据的绝对值在线性表示格式下的值。具体地,移位子电路830从获取子电路810得到m比特第一整数位,从译码子电路820得到2 N的结果,对译码子电路820译码的结果进行移位操作。m比特第一整数位的值为M’,m比特第一整数位所指示数值M,M是数据的绝对值取以2为底的对数值的整数部分的值,M=M’+BASE。如果想得到真实的数据,则应该根据M对译码子电路820译码的结果进行移位。应理解,由于对数域表示格式1.m.n是应用于计算引擎中或卷积计算芯片中的,因此可以预先设置移位 子电路830从获取子电路810获取m比特第一整数位,或者设置用户接口以方便用户设置m和/或n的值。
如前文描述的,当采用M运算时,M是正数表示左移M位,M是负数则表示右移M的绝对值位。即,S730根据对数值的整数部分的值M,对第一小数位对应的线性表示格式进行移位,得到数据的绝对值在线性表示格式下的值,可以包括:当M大于0时,对第一小数位对应的线性表示格式左移M位,得到数据的绝对值在线性表示格式下的值;当M小于0时,对第一小数位对应的线性表示格式右移M的绝对值位,得到数据的绝对值在线性表示格式下的值。
当然,本申请实施例中,也可以不采用M而是采用M’运算,M’为非负值,则只能进行左移。
移位子电路830根据第一整数位所指示的值010,对第一小数位10对应的线性表示格式1.0110101左移两位,得到101.10101。
S740,输出子电路840将数据表示为线性表示格式下的1+j+k比特的二进制数据。具体地,输出子电路840从获取子电路810得到1比特第一符号位,将第一符号位的值赋予第二符号位。换句话说,输出子电路840根据第一符号位设置第二符号位。例如,如果数据是正数,第二符号位设置为0;如果数据是负数,第二符号位设置为1,但本申请实施例对此不作限定。输出子电路840从移位子电路830得到移位后的结果,对移位后的结果进行补零或者删除无效位,使之符合1.j.k的格式。可选地,输出子电路840可以将得到的1.j.k格式的结果转换为补码表示得到最终的结果,本申请实施例对此不作限定。
输出子电路840进行补零和根据第一符号位确定第二符号位,将数据表示为线性表示格式下的1+j+k比特的二进制数据,例如,表示为线性表示格式1.7.8下的1 000010110101000。
本申请实施例的转换数据格式的方法,获取对数域表示格式的数据,对本申请实施例的对数域表示格式的数据进行简单的查表移位的方式得到其在线性表示格式下的表示,无需进行复杂的幂运算,能够减小数据在对数域和线性域之间转换时的开销,提高卷积计算的速度。
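In the example of FIG. 10, the acquisition subcircuit 810 obtains the data 101010 in log-domain format 1.3.2, where the first sign bit is 1 (the data is negative), the first integer bits are 010, and the first fractional bits are 10.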
译码子电路820查表得到第一小数位10对应的线性表示格式1.0110101。在一个具体的例子中,译码子电路820可以是图10所示的,两位有效位输入八位有效位输出的器件,其中存储有两位有效位输入与八位有效位输出的对应关系表。应理解,八位有效位输出是本例中的情况,可根据实际需要增加或减少bit。
应理解,当n=2时,第一小数位共有4组不同情况(00,01,10,11),通过事先计算2 N的结果,将结果四舍五入至8比特记录保存成表如图10以及如下:
00--10000000(代表线性表示格式的数值1.0000000)
01--10011000(代表线性表示格式的数值1.0011000)
10--10110101(代表线性表示格式的数值1.0110101)
11--11010111(代表线性表示格式的数值1.1010111)
移位子电路830根据第一整数位所指示的值010,对第一小数位10对应的线性表示 格式1.0110101左移两位,得到101.10101。
输出子电路840进行补零和根据第一符号位确定第二符号位,将数据表示为线性表示格式下的1+j+k比特的二进制数据,例如,表示为线性表示格式1.7.8下的1000010110101000。
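The following describes in detail the data format conversion method 900 in the embodiments of this application, that is, the log-domain conversion procedure. FIG. 11 is a schematic flowchart of the data format conversion method 900. The method 900 may be performed by the log conversion circuit 1000. FIG. 12 is a schematic diagram of log-domain conversion performed by the log conversion circuit 1000. As shown in FIG. 12, the log conversion circuit 1000 may include an acquisition subcircuit 1010, an integer calculation subcircuit 1020, a fraction calculation subcircuit 1030, and an output subcircuit 1040. Each subcircuit may be implemented on an FPGA or an ASIC. Subcircuits 1010 to 1040 perform S910 to S940 of the method 900 respectively.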
实施例八
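Based on the foregoing embodiments, this embodiment describes log-domain conversion in detail. As described in the foregoing embodiments (for example, Embodiment 2), log-domain conversion takes the base-2 logarithm of the absolute value of a linear-domain number (optionally also applying BASE) and combines it with a sign bit. The conversion can be implemented in software (that is, a CPU runs a software program that outputs the converted value); to increase processing speed, it can also be implemented in a dedicated hardware circuit (such as an ASIC or an FPGA). This embodiment describes the hardware-circuit implementation in detail.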
本实施例对数域转化由对数域转化电路来完成,该对数域转化电路包括:获取子电路、整数计算子电路、小数计算子电路以及输出子电路;其中,参见图11、图12,各个子电路处理的流程如下:
S910、获取子电路1010获取线性表示格式1.j.k的数据。其中,1个比特用于表示第二符号位,j个比特用于表示第二整数位,k个比特用于表示第二小数位。这里的数据可以是从输出缓存输出的经过累加运算后的数据,也可以是其他方式得到的数据,本申请实施例对此不作限定。
例如,如图12所示,获取子电路1010获取线性表示格式1.7.8的数据0 001100111000000(十进制表示为25.75),其中数据符号为正,第二整数位为0011001,第二小数位为11000000。
S920、整数计算子电路1020确定j+k比特的二进制数据的非零最高位所在的位置为第h位(j+k比特的二进制数据的最低位的位数记为第0位),确定数据在对数域表示格式下的数据的绝对值取以2为底的对数值的整数部分的值M,其中,M等于h与k的差值。
例如,仍以线性表示格式1.7.8的数据0 0011001 11000000为例,整数计算子电路1020对0 0011001 11000000首先找到最高位1的位置,从右往左从0开始编码,最高位1是在第h=12位,因此数据的绝对值取以2为底的对数值的整数部分的值M=h-k=12-8=4(二进制表示为0100)。
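Finding the most significant non-zero bit and subtracting k can be expressed very compactly. The sketch below takes the j+k magnitude bits as an integer; the function name `integer_part_of_log2` is hypothetical, and Python's `int.bit_length` stands in for the priority encoder that a hardware implementation would presumably use.

```python
def integer_part_of_log2(magnitude, k):
    """magnitude: the j + k magnitude bits of a 1.j.k value, as a non-zero integer.
    Returns (h, M) where h is the index of the highest non-zero bit (the lowest
    bit is bit 0) and M = h - k is the integer part of log2 of the value."""
    if magnitude == 0:
        raise ValueError("zero has no logarithm and needs the special encoding")
    h = magnitude.bit_length() - 1
    return h, h - k
```

For the 1.7.8 data 0011001 11000000 (decimal 25.75), the function returns h=12 and M=4, matching the example above. A value smaller than 1 gives a negative M, which is where the BASE offset becomes useful.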
为避免1.m.n格式下m比特的值可能出现负数,可以设置BASE,由M’=M-BASE得到m比特第一整数位。
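S930: The fraction calculation subcircuit 1030 extracts, from high to low, the s bits that follow the most significant non-zero bit, and from these s bits obtains the fractional part N of the base-2 logarithm of the absolute value of the data in log-domain format.
For example, still using the data 0 0011001 11000000 in linear format 1.7.8, the fraction calculation subcircuit 1030 extracts the s bits (for example, 8 bits) after the most significant non-zero bit, 10011100, and computes log2(1.10011100) to obtain the fractional part N of the base-2 logarithm of the absolute value of the data in log-domain format; N is 0.1011 in binary (0.6875 in decimal).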
由于表示一个数的比特数(如8位或者16位等)是有限的,s比特的取值也只可能是有限个,所以,在一定的精度要求下,s比特对应的对数域表示格式是可以穷举的。小数计算子电路1030截取非零最高位后的s比特是根据获取子电路1010得到线性表示格式1.j.k的数据的非零最高位后的s比特的值y,然后计算log 2(1.y),得到数据的绝对值取以2为底的对数值的小数部分的值N。得到N的方式可以有多种,将在下文中举例说明。在具体地实现中,可以根据精度需要确定s的大小。可选地,s大于n。
在一个具体的例子中,小数计算子电路1030可以是图12所示的,八位有效位输入、五位有效位输出的器件,应理解,五位有效位输出是本例中的情况,可根据实际需要增加或减少比特。
S940、输出子电路1040将数据表示为对数域表示格式下的1+m+n比特的二进制数据。
具体地,输出子电路1040从获取子电路1010得到1比特第二符号位,将第二符号位的值赋予第一符号位,即根据第二符号位设置对数域表示格式下的第一符号位。例如,如果数据是正数,第一符号位设置为0;如果数据是负数,第一符号位设置为1,但本申请实施例对此不作限定。另外,输出子电路1040将整数计算子电路1020得到的数据的绝对值取以2为底的对数值的整数部分和小数计算子电路1030得到的数据的绝对值取以2为底的对数值的小数部分相加,分别得到第一整数位和第一小数位,并进行补零或者删除无效位,使之符合1.m.n格式。可选地,输出子电路1040可以将得到的1.m.n格式的结果转换为补码表示,本申请实施例对此不作限定。
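The output subcircuit 1040 adds the integer part M of the base-2 logarithm of the absolute value of the data and the fractional part N of the base-2 logarithm of the absolute value of the data, and pads the sum with zeros or deletes invalid bits. It determines the first sign bit from the second sign bit and represents the data as 1+m+n-bit binary data in log-domain format, for example as 0 100 1011 in log-domain format 1.3.4 (the first sign bit is 0, the first integer bits are the three valid bits 100, and the first fractional bits are the four valid bits 1011).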
本申请实施例的转换数据格式的方法,获取线性表示格式的数据,通过简单的截取并比对的方式得到其在对数域表示格式下的表示,无需进行复杂的对数运算,能够减小数据在线性域和对数域之间转换时的开销,提高卷积计算的速度。
实施例九
基于以上各实施例,本实施例对实施例八中的S930中的步骤进行详细说明,具体的,至少可以通过以下三种方法来实现。
1)方法一
可选地,作为一个实施例,上述步骤S930中的得到s比特对应的对数域表示格式下的数据的绝对值取以2为底的对数值的小数部分的值N,可以包括:查表得到s比特对应的对数域表示格式下的数据的绝对值取以2为底的对数值的小数部分的值N,其中,表中存储了s比特的所有可能值对应的N。该确定数据的绝对值取以2为底的对数值的小数部分的值N的方法称之为查表法。
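Specifically, still using the example of FIG. 12, the fraction calculation subcircuit 1030 extracts the 8 bits after the most significant non-zero bit, 10011100, and looks up log2(1.10011100) in a table. The fraction calculation subcircuit 1030 stores a table that maps each eight-bit input to a five-bit output. The table records the results of log2(1.y), keeping 4 bits after the binary point, for example:
log2(1.00000000)=0.0000, output 0.0000
log2(1.00000001)=0.0000, output 0.0000
log2(1.11111111)=1.0000, output 1.0000 (the value after rounding is 1.0000)
Because the integer bit 1 of 1.y is fixed and the integer bit 0 of the output is also fixed, the table can be stored as an 8-bit-input, 4-bit-output table. In this example, looking up log2(1.10011100) gives the fractional part N of the base-2 logarithm of the absolute value of the data, which is 0.1011 in binary.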
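The lookup table of method 1 can be generated as follows. This is a sketch assuming s=8 input bits, 4 fractional output bits, and round-to-nearest; the names `build_log_table`, `LOG_TABLE`, and `fraction_by_lookup` are hypothetical.

```python
import math

def build_log_table(s=8, n=4):
    """For every s-bit value y, store log2(1.y) rounded to n fractional bits."""
    return [round(math.log2(1 + y / 2 ** s) * 2 ** n) for y in range(2 ** s)]

LOG_TABLE = build_log_table()

def fraction_by_lookup(y):
    """Return N (scaled by 2^4) for the 8 bits y that follow the leading 1."""
    return LOG_TABLE[y]
```

For y=10011100 the table holds 1011, as in the example. The largest inputs round up to 10000 (binary 1.0000), which is why a fifth output bit is needed for those entries.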
2)方法二
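Optionally, in another embodiment, obtaining, in step S930, the fractional part N of the base-2 logarithm of the absolute value of the data in log-domain format corresponding to the s bits may include: comparing the value of the s bits with 2^n preset comparison values, where the i-th comparison value is smaller than the (i+1)-th comparison value and the i-th comparison value corresponds to a value N_i; and, when the value of the s bits is greater than or equal to the T-th comparison value and smaller than the (T+1)-th comparison value, determining that N is N_T. This method of determining N is called the level-by-level comparison method.
Specifically, still using the example of FIG. 12, the fraction calculation subcircuit 1030 extracts the 8 bits after the most significant non-zero bit, 10011100, and obtains N by comparison in a comparator group. FIG. 13 is a schematic diagram of the fraction calculation subcircuit 1030 determining the fractional part N. The comparator group may include 2^n=16 comparators, for example comparator 0, comparator 1, …, comparator 15. A comparison value is preset in each comparator, and the preset comparison values are in ascending order, that is, comparison value 0 < comparison value 1 < … < comparison value 15. The comparison values are set according to the following table:
log2(1.00000000)=0.0000, output 0.0000
log2(1.00000001)=0.0000, output 0.0000
log2(1.11111111)=1.0000, output 1.0000
The table input at which the 4-bit output changes, that is, the argument of log2(), is set as a comparison value. For example, comparison value 0 is set to 1.00000000; optionally, comparison value 0 is set directly to 00000000. In other words, all values between two adjacent comparison values correspond to the same output. A selector can therefore derive N directly from the results of the level-by-level comparison. For the input 8 bits (10011100), the selector output is 0.1011.
Greater than or equal to comparison value 0 of comparator 0 and less than comparison value 1 of comparator 1: the selector output is 0.0000.
Greater than or equal to comparison value 1 of comparator 1 and less than comparison value 2 of comparator 2: the selector output is 0.0001.
Greater than or equal to comparison value 14 of comparator 14 and less than comparison value 15 of comparator 15: the selector output is 0.1111.
Greater than or equal to comparison value 15 of comparator 15: the selector output is 1.0000.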
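The level-by-level comparison can be imitated in software with a sorted list of thresholds. This sketch derives the thresholds from the rounded logarithm, so the number of thresholds it produces is an outcome of that assumption and may differ from the comparator count in the figure; the names `THRESHOLDS`, `OUTPUTS`, and `fraction_by_comparison` are hypothetical, and `bisect` replaces the parallel comparators and selector.

```python
import bisect
import math

S_BITS, N_BITS = 8, 4

def rounded_log(y):
    """log2(1.y) rounded to N_BITS fractional bits, for an S_BITS-bit input y."""
    return round(math.log2(1 + y / 2 ** S_BITS) * 2 ** N_BITS)

# Comparison values: the smallest input y at which the rounded output changes.
THRESHOLDS = [y for y in range(2 ** S_BITS)
              if y == 0 or rounded_log(y) != rounded_log(y - 1)]
# Output selected when the input lies at or above the matching threshold.
OUTPUTS = [rounded_log(t) for t in THRESHOLDS]

def fraction_by_comparison(y):
    """Pick the output of the last comparison value that is <= y."""
    t = bisect.bisect_right(THRESHOLDS, y) - 1
    return OUTPUTS[t]
```

Only the short threshold list has to be stored instead of the full 256-entry table, which is the trade-off this method makes against the lookup method.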
3)方法三
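Optionally, in yet another embodiment, obtaining, in step S930, the fractional part N of the base-2 logarithm of the absolute value of the data in log-domain format corresponding to the s bits may include: comparing the value of the high x bits of the s bits with 2^n preset comparison values, where x is greater than 0 and smaller than s, the i-th comparison value is smaller than the (i+1)-th comparison value, and the i-th comparison value corresponds to a pair of values A and B; and calculating x*A+B and obtaining N from the result of x*A+B.
Specifically, when x*A+B is calculated, the product of the high x bits and A may be shifted right by k bits, the shifted result added to B, and the sum shifted right by k-n bits; the n bits that remain are the fractional part N of the base-2 logarithm of the absolute value of the data in log-domain format corresponding to the s bits.
Still using the example of FIG. 12, the fraction calculation subcircuit 1030 extracts the 8 bits after the most significant non-zero bit, 10011100, whose value is S. A table is looked up first, and the piecewise fitting result is then calculated to obtain N. FIG. 14 is a schematic diagram of piecewise fitting in the embodiments of this application, and FIG. 15 is a schematic diagram of the fraction calculation subcircuit 1030 determining the fractional part N. As shown in FIG. 14, the value represented by the 8 bits lies in the range [0,1). This range is divided in advance into 2^n=16 segments, which correspond to 16 line segments on the curve log2(1.x). For each line segment, a linear expression y=A_i*x+B_i can be calculated and recorded in the table shown in FIG. 15, that is, A_i and B_i are recorded for i=0, 1, …, 15, giving 16 pairs of A_i and B_i stored as a table. The 8 bits after the most significant bit are 10011100; the high x=4 bits are taken and used to look up A and B. A*S is calculated and the result is shifted right by k=8 bits (>>8); B is added to the shifted result, and the sum is shifted right by k-n=4 bits (>>4). The remaining 4 bits are the first fractional bits 1011, that is, the fractional part N of the base-2 logarithm of the absolute value of the data is 0.1011.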
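The piecewise fitting can be sketched in fixed-point arithmetic as follows. Several details here are assumptions rather than statements of the described circuit: each segment is fitted by the chord through its end points, A_i and B_i are stored with 8 fractional bits, and the final shift rounds to nearest instead of truncating. The names `build_segments`, `SEGMENTS`, and `fraction_by_fitting` are hypothetical.

```python
import math

S_BITS, X_BITS, N_BITS = 8, 4, 4   # s, x (segment-select bits), n

def build_segments():
    """Fit y = A*x + B through the end points of each segment of log2(1 + x)."""
    segments = []
    count = 2 ** X_BITS
    for i in range(count):
        x0, x1 = i / count, (i + 1) / count
        y0, y1 = math.log2(1 + x0), math.log2(1 + x1)
        slope = (y1 - y0) / (x1 - x0)
        intercept = y0 - slope * x0
        # Store both coefficients with S_BITS fractional bits (assumed scaling).
        segments.append((round(slope * 2 ** S_BITS), round(intercept * 2 ** S_BITS)))
    return segments

SEGMENTS = build_segments()

def fraction_by_fitting(s_value):
    """Return N (scaled by 2^4) for the 8-bit value S that follows the leading 1."""
    A, B = SEGMENTS[s_value >> (S_BITS - X_BITS)]   # high x bits select the segment
    acc = ((A * s_value) >> S_BITS) + B             # A*S shifted right, then plus B
    drop = S_BITS - N_BITS
    return (acc + (1 << (drop - 1))) >> drop        # rounding here is an assumption
```

Under these assumptions, the input 10011100 yields 1011. Only 16 coefficient pairs are stored, at the cost of one small multiplication per conversion.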
实施例十
基于上述各实施例,本实施例公开了一种片上系统SoC(System on Chip)14,参见图16,包括处理器核141(即CPU核)、一个或多个Processing Element(PE)1421组成的PE阵列142,每个PE可以包括上述各个实施例所描述的乘法硬件电路1421;该SoC还包括输入缓存143、输出缓存144、对数域转换电路145以及控制电路147,其中,除CPU核之外的其他部件可统称为计算引擎,或者数据加速引擎,这些部件主要的功能的是替CPU核处理一些特定的计算。下面分别对SoC上的各个部件进行介绍。
CPU核主要用于执行一些通用的软件程序,例如,通过读取指令运行操作系统、基于操作系统的各种应用程序等,当CPU核需要处理一些特定的数据处理(如大量的图像数据的处理),如果计算引擎更适合处理这些数据,则CPU核可以将这部分数据的处理发送给计算引擎来进行处理。
输入缓存143用于存储输入数据,输入数据可以来自CPU core 141,或者也可以来自对数域转换电路145。输入数据的类型并不限定,可以基于各种应用来确定数据的类型,例如,针对神经网络系统时,可以包括需要计算的数据以及参数,具体可以通过多个存储器分别存储数据以及参数。
输出缓存144用于存储PE阵列输出的结果,该结果如果需要被再次使用,可以通过对数域转换电路145转换后输出给输入缓存143供下一轮计算使用。
其中,输入缓存、输出缓存可以基于SRAM,eDRAM等存储介质实现。
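The control circuit 147 is connected to the processor core 141 (the CPU core), the input buffer 143, and the output buffer 144. After interacting with the processor core (for example, interaction based on a DMA protocol, or through a custom protocol or messages), the control circuit enables the processor core to obtain the data in the output buffer.
In addition, the SoC of this application may further include other IP cores 146, for example a graphics processing unit (GPU) or a digital signal processor (DSP); this is not limited in this application.
Based on the foregoing embodiments, and referring to FIG. 17, this embodiment provides an electronic device 15; FIG. 17 is a schematic structural diagram of the device. The electronic device includes an SoC 151. The SoC may contain multiple IP cores (intellectual property cores), for example a CPU core, an IP core formed by the PE array, or an IP core formed by the PE array, the input buffer, the output buffer, and the log-domain conversion circuit; it may also include other IP cores 146. An SoC is usually packaged as a single independent chip, for example a Huawei HiSilicon Kirin series chip (such as Kirin 950 or 960) or a Qualcomm Snapdragon series SoC chip (such as Snapdragon 650 or 660). In other implementations, each IP core may be packaged separately as a chip, or several IP cores may be packaged together as a chip.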
电子设备15还可以包括其他部件,例如,可以包括存储器152(如内存、闪存等),输入输出设备153(如显示屏、触摸屏、喇叭、鼠标、键盘等)以及各种通信模块154(如WiFi,USB、蓝牙、4G、5G等通信模块)。这些部件的实现为本领域技术人员所公知的技术,本申请并不赘述。
应理解,本申请实施例的各电路或者子电路可以基于ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等来实现。需要说明的是,ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件可以为独立的器件,或者可以与存储器(存储模块)集成在一起。
应注意,本文描述的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(Digital Subscriber Line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,高密度数字视频光盘(Digital Video Disc,DVD))、或者半导体介质(例如,固态硬盘(Solid State Disk,SSD))等。
应理解,本文中涉及的第一、第二以及各种数字编号仅为描述方便进行的区分,并不用来限制本申请的范围。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程 构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
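The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.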

Claims (15)

  1. 一种乘法电路,其特征在于,用于执行两个数据A和B的乘法运算,包括:
    加法子电路,用于获取与所述A和所述B分别对应的对数域数据a和对数域数据b,对所述a和所述b执行加法运算得到c,所述c包括整数部分和小数部分;
    指数运算子电路,用于执行底数是2,指数是所述c的小数部分的指数运算,得到指数运算结果;
    移位子电路,用于根据所述c的整数部分对所述指数运算结果进行移位,得到移位结果;
    输出子电路,用于根据所述a和所述b的符号,结合所述移位结果,输出所述A和所述B的乘积。
  2. 如权利要求1所述的乘法电路,其特征在于,所述对数域数据a和所述对数域数据b是通过分别对所述A和所述B的绝对值取以2为底的对数值并结合它们的符号位得到的,包括1+m+n个二进制比特位,m、n都是正整数,其中第1个比特是符号位,m个比特是整数部分,n个比特是小数部分。
  3. 如权利要求2所述的乘法电路,其特征在于,所述c的整数部分为所述a的整数部分与所述b的整数部分的和;所述c的小数部分为所述a的小数部分与所述b的小数部分的和。
  4. 如权利要求2或者3所述的乘法电路,其特征在于,数值0对应的对数域数据定义为:符号位取值为1,整数和小数部分都是0。
  5. 如权利要求1-4任一所述的乘法电路,其特征在于,所述A和所述B都包括1+j+k个二进制比特位,j、k都是正整数,其中第1个比特是符号位,j个比特是整数部分,k个比特是小数部分。
  6. 如权利要求1-5任一所述的乘法电路,其特征在于,所述指数运算结果为大于等于1、小于2的一个数;所述运算结果包括1+w个二进制比特位,其中,第1个比特为整数部分,w个比特为小数部分,w为大于等于1的正整数;
    所述移位子电路用于根据所述c的整数部分对所述指数运算结果进行移位时,具体用于对所述指数运算结果执行左移X位,所述X=c的整数部分-(w-k),所述移位结果是所述A和所述B的乘积的绝对值,所述乘积的绝对值的整数部分包括j个二进制比特位,小数部分包括k个二进制比特位;其中,当所述左移位数为小于0时,左移所述x位等于右移所述x的绝对值位。
  7. 如权利要求1-6任一所述的乘法电路,其特征在于,
    所述指数运算子电路为译码电路,所述译码电路用于根据所述c的小数部分译码得到所述指数运算的结果;或者
    所述指数运算子电路为查表电路,所述查表电路用于根据所述c的小数部分查表得到所述指数运算的结果。
  8. 如权利要求1-7任一项所述的乘法电路,其特征在于,
    所述乘法电路还包括累加器,用于将所述数据A和B的乘积与来自所述乘法电路的 另一个数据进行累加运算;或者
    所述累加器用于将所述数据A和B的乘积与来自另一个乘法电路的所述乘积进行累加运算。
  9. 一种片上系统,其特征在于,包括:处理器核,由一个或多个如权利要求1-8任一所述的乘法硬件电路构成的乘法硬件电路阵列、数据输入缓存、数据输出缓存以及控制电路;
    所述控制电路与所述处理器核、所述数据输入电路以及所述数据输出电路连接;
    所述数据输入电路用于通过所述控制电路获取来自所述处理器核的数据;
    所述乘法硬件电路阵列用于获取所述数据输入缓存中的数据进行处理,得到处理后的结果,并输出给所述数据输出缓存;
    所述控制电路还用于与所述处理器核进行交互,使得所述处理器核获取所述数据输出缓存中的数据。
  10. 如权利要求9所述的片上系统,其特征在于,还包括对数转换电路,用于对所述乘法硬件电路阵列的输出进行所述对数域转换,并将转换后的结果输入给所述数据输入缓存。
  11. 如权利要求10所述的片上系统,其特征在于,所述对数域转换电路包括:整数计算子电路、小数计算子电路和第二符号位确定子电路,其中,所述线性域阵列输出数据为由1+j+k比特构成的二进制数,其中,j和k均为正整数,1比特为第二符号位,用于指示正负符号S,j比特用于指示所述线性域数据的绝对值的整数部分的值J,k比特用于指示所述线性域数据的绝对值的小数部分的值K;
    所述整数计算子电路,用于根据所述线性域阵列输出数据的j+k比特的二进制数的非零最高位所在的位数的值h1,计算h1与k的差值,所述差值用于表示所述A1的绝对值取以2为底的对数值的整数部分的值,其中,线性域阵列输出数据A1的j+k比特的二进制数的最低位记为第0位;
    所述小数计算子电路,用于根据所述线性域阵列输出数据从高位到低位的非零最高位后的预定个数s比特,得到所述线性域阵列输出数据的绝对值取以2为底的对数值的小数部分的值;
    第二符号位确定子电路,用于分别根据所述线性域阵列输出数据的符号确定所述对数域阵列输出数据的符号,从而得到所述对数域阵列输出数据。
  12. 如权利要求11所述的片上系统,其特征在于,所述小数计算子电路具体用于:
    通过查表或者译码得到所述A1从高位到低位的非零最高位后的s比特对应的值N1,通过查表或者译码得到所述A2从高位到低位的非零最高位后的s比特对应的值N2,其中,所述表中存储了所述s比特的所有可能值对应的值N。
  13. 如权利要求11所述的片上系统,其特征在于,所述小数计算子电路具体用于:
    将所述A1从高位到低位的非零最高位后的s比特对应的值与预设的2 n个比较值比较,其中,第i比较值比第i+1比较值小,且所述第i比较值对应一个值N i;当所述A1从高位到低位的非零最高位后的s比特对应的值大于或等于第T1比较值,且小于第T1+1比较值时,确定所述N1为N T1
    将所述A2从高位到低位的非零最高位后的s比特对应的值与预设的2 n个比较值比较, 其中,第i比较值比第i+1比较值小,且所述第i比较值对应一个值N i;当所述A2从高位到低位的非零最高位后的s比特对应的值大于或等于第T2比较值,且小于第T2+1比较值时,确定所述N2为N T2
  14. 如权利要求11所述的片上系统,其特征在于,所述小数计算子电路具体用于:
    将所述A1从高位到低位的非零最高位后的s比特的高x比特对应的值与预设的2 n个区间比较,其中,第i区间对应一对值αi和βi,x大于0并且小于s,当所述A1从高位到低位的非零最高位后的s比特的高x比特对应的值落入第一区间时,找到所述第一区间对应的一对值α1和β1,计算x*α1+β1的结果,根据所述x*α1+β1的结果得到所述N1;
    将所述A2从高位到低位的非零最高位后的s比特的高x比特对应的值与预设的2 n个区间比较,其中,当所述A2从高位到低位的非零最高位后的s比特的高x比特对应的值落入第二区间时,找到所述第二区间对应的一对值α2和β2,计算x*α2+β2的结果,根据所述x*α2+β2的结果得到所述N2。
  15. 一种电子设备,其特征在于,包括如权利要求9-14任一所述的片上系统,存储器;
    所述存储器用于存储程序运行所需的指令;
    所述片上系统中的所述处理器核用于执行所述指令运行程序,并将需要处理的数据发送给所述乘法硬件电路阵列;
    所述乘法硬件电路用于对所述数据进行处理后,将处理后得到的结果输出所述数据输出电路,并最终让所述处理器核获取所述结果。
PCT/CN2018/106559 2017-09-19 2018-09-19 乘法电路、片上系统及电子设备 WO2019057093A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18858914.7A EP3674883B1 (en) 2017-09-19 2018-09-19 Multiplication circuit, system on chip, and electronic device
US16/822,720 US11249721B2 (en) 2017-09-19 2020-03-18 Multiplication circuit, system on chip, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710852544.4A CN109521994B (zh) 2017-09-19 2017-09-19 乘法硬件电路、片上系统及电子设备
CN201710852544.4 2017-09-19

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/822,720 Continuation US11249721B2 (en) 2017-09-19 2020-03-18 Multiplication circuit, system on chip, and electronic device

Publications (1)

Publication Number Publication Date
WO2019057093A1 true WO2019057093A1 (zh) 2019-03-28

Family

ID=65768177

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/106559 WO2019057093A1 (zh) 2017-09-19 2018-09-19 乘法电路、片上系统及电子设备

Country Status (4)

Country Link
US (1) US11249721B2 (zh)
EP (1) EP3674883B1 (zh)
CN (1) CN109521994B (zh)
WO (1) WO2019057093A1 (zh)


Also Published As

Publication number Publication date
CN109521994A (zh) 2019-03-26
US20200218509A1 (en) 2020-07-09
EP3674883B1 (en) 2022-01-26
EP3674883A1 (en) 2020-07-01
CN109521994B (zh) 2020-11-10
US11249721B2 (en) 2022-02-15
EP3674883A4 (en) 2020-08-26

Similar Documents

Publication Publication Date Title
WO2019057093A1 (zh) Multiplication circuit, system on chip, and electronic device
Abed et al. VLSI implementation of a low-power antilogarithmic converter
TWI698759B (zh) Curve function device and operating method thereof
US8874630B2 (en) Apparatus and method for converting data between a floating-point number and an integer
Pagliari et al. Serial T0: Approximate bus encoding for energy-efficient transmission of sensor signals
WO2018196750A1 (zh) Apparatus for processing multiply-add operations and method for processing multiply-add operations
Subhasri et al. Hardware‐efficient approximate logarithmic division with improved accuracy
US10133552B2 (en) Data storage method, ternary inner product operation circuit, semiconductor device including the same, and ternary inner product arithmetic processing program
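CN115827555B (zh) Data processing method, computer device, storage medium, and multiplier structure
CN114201140B (zh) Exponential function processing unit, method, and neural network chip
CN115268832A (zh) Method and apparatus for rounding floating-point numbers, and electronic device
CN111258542B (zh) Multiplier, data processing method, chip, and electronic device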
US8933731B2 (en) Binary adder and multiplier circuit
US8868634B2 (en) Method and apparatus for performing multiplication in a processor
RP An efficient implementation of low‐power approximate compressor–based multiplier for cognitive communication systems
JP5366625B2 (ja) Data transmission device and data transmission method
KR20200032005A (ko) Arithmetic logic unit, data processing system, method, and module
US20140253214A1 (en) Multiplier circuit
Pathak A review of approximate adders for energy-efficient digital signal processing
Muralidharan et al. An enhanced Carry elimination adder for low power VLSI applications
ERNA1a et al. FPGA Implementation of High-Performance Truncated Rounding based Approximate Multiplier with High-Level Synchronous XOR-MUX Full Adder
Hounsinou Leading Digit Computation Circuits
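TW202036270A (zh) Computer-readable storage medium, computer-implemented method, and compute logic section
KR20230005643A (ko) Operating method of floating-point operation circuit and integrated circuit including floating-point operation circuit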
Ngo et al. Partitioning and gating technique for low-power multiplication in video processing applications

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 18858914; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2018858914; Country of ref document: EP; Effective date: 20200325)