WO2023178860A1 - Optimization method, hardware system and chip based on an exponential function and a softmax function - Google Patents

Optimization method, hardware system and chip based on an exponential function and a softmax function

Info

Publication number
WO2023178860A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
point data
calculated
floating
floating point
Prior art date
Application number
PCT/CN2022/100635
Other languages
English (en)
French (fr)
Inventor
马成勇
李冰华
袁峰
Original Assignee
奥比中光科技集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 奥比中光科技集团股份有限公司
Publication of WO2023178860A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/556 Logarithmic or exponential functions

Definitions

  • The invention belongs to the field of neural network technology, and in particular relates to an optimization method, hardware system and chip based on an exponential function and a normalized exponential function (softmax function).
  • In neural networks, the softmax function is used almost everywhere.
  • The softmax function is usually used as the activation function of the output layer in classification tasks. When implementing the softmax function, you need to
  • Embodiments of the present invention provide an optimization method, a hardware system and a chip based on exponential functions and softmax functions, which can solve one or more technical problems in the related art.
  • In a first aspect, an embodiment of the present application provides an optimization method based on an exponential function, which includes: reading data to be calculated and dequantizing each item of data to obtain the corresponding floating-point data; transforming and splitting each floating-point value according to a preset splitting formula to obtain floating-point data whose exponents are an integer part and a fractional part respectively; performing preset function calculations on the floating-point data of the integer part and of the fractional part respectively, and recombining the two calculation results to obtain the exponential operation result of the floating-point data; and quantizing the exponential operation result of each floating-point value to obtain the exponential function operation result of each item of data to be calculated.
  • In a second aspect, an embodiment of the present application provides an optimization method based on the softmax function, which includes: reading data to be calculated and dequantizing each item of data to obtain the corresponding floating-point data; transforming and splitting each floating-point value according to a preset splitting formula to obtain floating-point data whose exponents are an integer part and a fractional part respectively; performing preset function calculations on the floating-point data of the integer part and of the fractional part respectively, and recombining the two calculation results to obtain the exponential operation result of the floating-point data; accumulating the exponential operation results of all floating-point data to obtain the accumulation operation result, and calculating the ratio of each exponential operation result to the accumulation operation result; and quantizing the ratio corresponding to each floating-point value to obtain the softmax function operation result of the data to be calculated.
  • In a third aspect, an embodiment of the present application provides an exponential function hardware system, including: a data reading module for reading data to be calculated; and an exponential function calculation module for dequantizing each item of data to be calculated to obtain the corresponding floating-point data; transforming and splitting each floating-point value according to a preset splitting formula to obtain floating-point data whose exponents are an integer part and a fractional part respectively; performing preset function calculations on the floating-point data of the integer part and of the fractional part respectively to obtain the corresponding calculation results, and recombining the calculation results to obtain the exponential operation result of the floating-point data; and quantizing the exponential operation result of each floating-point value to obtain the exponential function operation result of each item of data to be calculated.
  • In a fourth aspect, an embodiment of the present application provides a hardware system for a softmax function, including: a data reading module for reading data to be calculated; and a softmax function calculation module for dequantizing each item of data to be calculated to obtain the corresponding floating-point data; transforming and splitting each floating-point value according to a preset splitting formula to obtain floating-point data whose exponents are an integer part and a fractional part respectively; performing preset function calculations on the floating-point data of the integer part and of the fractional part respectively to obtain the corresponding calculation results, and recombining the calculation results to obtain the exponential operation result of the floating-point data; accumulating the exponential operation results to obtain the accumulation operation result, and calculating the ratio of each exponential operation result to the accumulation operation result; and quantizing the ratio corresponding to each floating-point value to obtain the softmax function operation result of the data to be calculated.
  • An embodiment of the present application provides a chip, including the hardware system of an exponential function described in any embodiment of the third aspect, or the hardware system of a softmax function described in any embodiment of the fourth aspect.
  • An embodiment of the present application provides a computer storage medium that stores a computer program. When the computer program is executed, the optimization method based on an exponential function described in any embodiment of the first aspect, or the optimization method based on the softmax function described in any embodiment of the second aspect, is implemented.
  • An embodiment of the present application provides a computer program product that enables an electronic device to implement the optimization method based on an exponential function described in any embodiment of the first aspect, or the optimization method based on the softmax function described in any embodiment of the second aspect.
  • The embodiment of the present application converts the data to be calculated into floating-point data for calculation, which places no limit on the numerical range of the input data while greatly improving calculation accuracy and calculation speed.
  • Figure 1 is a schematic flow chart of the implementation of an optimization method based on the softmax function provided by an embodiment of the present application;
  • Figure 2 is a schematic process diagram of an optimization method based on the softmax function provided by an embodiment of the present application;
  • Figure 3 is a schematic diagram of the data to be calculated for each of the four channels provided by an embodiment of the present application;
  • Figure 4 is a schematic diagram of the layout format of 4 channels of data to be calculated in DDR provided by an embodiment of the present application;
  • Figure 5 is a schematic diagram of a process of reading the corresponding data to be calculated of four channels from DDR according to an embodiment of the present application;
  • Figure 6 is a schematic diagram of the fp32 data format provided by an embodiment of the present application;
  • Figure 7 is a schematic structural diagram of a single-precision floating-point calculation module provided by an embodiment of the present application;
  • Figure 8 is a schematic structural diagram of a search unit 73 provided by an embodiment of the present application;
  • Figure 9 is a schematic structural diagram of an adder module provided by an embodiment of the present application;
  • Figure 10 is a schematic structural diagram of a hardware system for a softmax function provided by an embodiment of the present application;
  • Figure 11 is a schematic structural diagram of a hardware system for a softmax function provided by an embodiment of the present application;
  • Figure 12 is a schematic structural diagram of a hardware system for a softmax function provided by an embodiment of the present application;
  • Figure 13 is a schematic structural diagram of a softmax function calculation module 1020 provided by an embodiment of the present application.
  • The term "connection" should be understood in a broad sense: it can be a fixed connection, a detachable connection, or an integral connection; it can be a direct connection or an indirect connection through an intermediate medium; and it can be an internal connection between two components or an interaction between two components.
  • An embodiment of the present application provides an optimization method based on the softmax function, which can be deployed in any chip.
  • Figure 1 is a schematic flow chart of an optimization method based on the softmax function provided by an embodiment of the present application.
  • Figure 2 is a schematic process diagram of the optimization method based on the softmax function of the embodiment shown in Figure 1.
  • The optimization method based on the softmax function may include steps S110 to S160.
  • The data to be calculated is stored in the second memory.
  • The second memory may be a dynamic random access memory (DRAM), such as double data rate synchronous dynamic random access memory (DDR SDRAM).
  • The data to be calculated is read from the second memory by the data reading operation, and the corresponding data is sent to each channel.
  • With four preset channels, the data to be calculated is read from the DDR memory and the corresponding data is sent to each of the four preset channels until the data of every channel has been read.
  • Before the data is read, the data to be calculated must be arranged in a preset arrangement format.
  • The preset arrangement format is the NC4HW format.
  • The data to be calculated is usually of an integer type such as int8 or int16; it is written into the DDR in the NC4HW format, which yields data to be calculated arranged in the NC4HW format.
  • If the number of data to be calculated is C, the data can be divided into C/4 groups; after grouping, the data to be calculated in NC4HW format is obtained by interleaving. If C is not divisible by 4, the data is zero-padded; this is not limited here.
  • In Figure 3, the data to be calculated is, from top to bottom, the data of channel 0, channel 1, channel 2 and channel 3.
  • The data of the four channels is arranged according to the NC4HW format.
  • The arrangement format is shown in Figure 4.
  • The data at the same position in the four channels is arranged adjacently.
  • When the data arranged in the NC4HW format is read from the DDR, 4 data are read at a time and each is sent to its corresponding channel for subsequent processing. Softmax is thus solved for 4 sets of data simultaneously, giving a 4x speed-up and greatly reducing the solution time.
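  • The interleaving described above can be illustrated with a minimal Python sketch. It assumes each channel is a flat list of equal length; the function names `pack_nc4hw` and `read_four` are hypothetical and are not taken from the patent.

```python
def pack_nc4hw(channels):
    """Interleave channels in groups of 4 so that the values at the same
    position of the 4 channels in a group are stored adjacently."""
    length = len(channels[0])
    padded = [list(ch) for ch in channels]
    # Zero-pad the channel count up to a multiple of 4.
    while len(padded) % 4 != 0:
        padded.append([0] * length)
    packed = []
    for g in range(0, len(padded), 4):        # one group of 4 channels
        group = padded[g:g + 4]
        for pos in range(length):             # same position in each channel
            packed.extend(ch[pos] for ch in group)
    return packed


def read_four(packed, length, group=0):
    """Yield 4 values per read, one for each of the 4 channels of a group."""
    base = group * 4 * length
    for pos in range(length):
        start = base + 4 * pos
        yield tuple(packed[start:start + 4])


channels = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # C = 3, padded to 4
packed = pack_nc4hw(channels)
for quad in read_four(packed, length=3):
    print(quad)
```

  • Each tuple produced by `read_four` corresponds to one read of 4 adjacent values, one per preset channel.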
  • Some embodiments of the present application include multiple preset channels, such as 4 preset channels.
  • Each preset channel can implement the softmax solution for one set of data.
  • The hardware implementation and hardware structure of every preset channel are the same. Therefore, only the data of one preset channel is described in the following embodiments.
  • By solving multi-channel data simultaneously, the embodiments of this application greatly shorten the solution time and thus provide an efficient and fast hardware implementation of softmax.
  • S120 Dequantize the data to be calculated through each preset channel to obtain floating point data corresponding to each preset channel.
  • The data to be calculated is dequantized in the preset channel according to preset inverse quantization parameters and converted into floating-point data; the preset inverse quantization parameters can be obtained through system configuration.
  • The int8 or int16 data read from the DDR is converted into a single-precision floating-point number (fp32) or a half-precision floating-point number (fp16).
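  • The text does not state the form of the preset inverse quantization parameters. The sketch below assumes a common affine scheme with a `scale` and a `zero_point` (both assumptions, as are the function names) to show how an integer input could be mapped to a float and back.

```python
def dequantize(q, scale, zero_point=0):
    """Integer -> float, assuming the affine form (q - zero_point) * scale."""
    return (q - zero_point) * scale


def quantize(x, scale, zero_point=0, bits=8):
    """Float -> signed integer: the inverse mapping, with rounding and
    saturation to the range of a signed `bits`-wide integer."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = round(x / scale) + zero_point
    return max(lo, min(hi, q))


x = dequantize(20, scale=0.125)       # int8 value 20 -> 2.5
q = quantize(x, scale=0.125)          # 2.5 -> 20 again
print(x, q)
```

  • Values outside the integer range saturate instead of wrapping around, which is a design choice of this sketch rather than something the source specifies.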
  • The format of single-precision floating-point numbers (fp32) is shown in Figure 6.
  • The representation and calculation of decimals can be achieved through fp32.
  • fp32 is a data type that uses 4 bytes, that is, 32 bits, for encoding and storage. The first bit is the sign bit (sig), the next 8 bits are the exponent bits (exp), and the last 23 bits are the mantissa bits (fra).
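  • As an illustration of the fp32 layout, the following sketch uses Python's standard `struct` module to split a value into the sig, exp and fra fields and to rebuild a value from them. The helper names are hypothetical.

```python
import struct


def fp32_fields(value):
    """Return (sig, exp, fra): 1 sign bit, 8 exponent bits, 23 mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sig = bits >> 31
    exp = (bits >> 23) & 0xFF
    fra = bits & 0x7FFFFF
    return sig, exp, fra


def fp32_from_fields(sig, exp, fra):
    """Assemble the three fields into a 32-bit pattern and decode it as fp32."""
    bits = (sig << 31) | (exp << 23) | fra
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(fp32_fields(1.0))                    # exponent field is biased by 127
print(fp32_from_fields(0, 126, 0x400000))  # 1.5 * 2**(126 - 127)
```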
  • The quantization process mentioned later is the inverse of inverse quantization, that is, it obtains input_0 from input_00.
  • This embodiment performs the exponential operation $e^{x}$ on the floating-point data to obtain the exponential operation result corresponding to each preset channel, that is, it solves $e$ raised to the power input_00.
  • The first memory is, for example, a random access memory (RAM).
  • Because hardware cannot directly compute $e^{x}$, preset function calculations are performed separately on the split floating-point data to obtain the corresponding calculation results, and the calculation results are recombined to obtain the exponential operation results, which are then stored.
  • The transform, split and recombination can be implemented by the single-precision floating-point calculation module shown in Figure 7.
  • $(aa)_{2}$ denotes the binary form of aa;
  • aa without a subscript is decimal by default;
  • $(aa)_{10}$ denotes the decimal form of aa;
  • $(aa)_{fp32}$ denotes the single-precision floating-point representation of aa. Since $(decim)_{10} \in [0,1)$, it follows that $(2^{decim})_{10} \in [1,2)$. According to the fp32 format described in step S120, the fp32 representation of $e^{x}$ can then be obtained, that is:
  • the hexadecimal number of $(e^{x})_{fp32}$ is {1'b0, 8'h00, ({1'b1, result_decim[22:0]} >> (-temp-126))}, where {1'b1, result_decim[22:0]} >> (-temp-126) denotes the integer {1'b1, result_decim[22:0]} shifted right by (-temp-126) bits; when temp+127 < -23, that is, when the current data to be calculated exceeds the range that fp32 can represent, the hexadecimal number of $(e^{x})_{fp32}$ defaults to 0.
  • The single-precision floating-point calculation module includes a base-changing unit 71, a splitting unit 72, a search unit 73, an exponent bit solving unit 74 and a combination unit 75.
  • The base-changing unit 71 changes the base of the floating-point data (input_00) of the data to be calculated to obtain the base-changed data, that is, $e^{x} = 2^{x \log_{2} e}$.
  • The splitting unit 72 splits the exponent of the base-changed data into an integer part temp and a fractional part decim, that is, $x \log_{2} e = temp + decim$, so that $e^{x} = 2^{temp} \times 2^{decim}$.
  • The search unit 73 finds the single-precision floating-point data whose exponent is the fractional part, that is, $(2^{decim})_{fp32}$.
  • The exponent bit solving unit 74 solves, from the integer part temp, the exponent of $e^{x}$ in the fp32 representation.
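  • The base change and split performed by units 71 and 72 can be sketched in Python as follows. The sketch assumes the standard identity $e^{x} = 2^{x \log_{2} e}$ and uses double-precision arithmetic rather than the fp32 datapath of the hardware; `split_exponent` is a hypothetical name.

```python
import math


def split_exponent(x):
    """Rewrite e**x as 2**(temp + decim) with integer temp and decim in [0, 1)."""
    y = x * math.log2(math.e)      # base change: e**x == 2**y
    temp = math.floor(y)           # integer part (may be negative)
    decim = y - temp               # fractional part, always in [0, 1)
    return temp, decim


temp, decim = split_exponent(-3.2)
# Recombine: 2**decim lies in [1, 2) and 2**temp only adjusts the exponent.
approx = math.ldexp(2.0 ** decim, temp)
print(temp, decim, approx, math.exp(-3.2))
```

  • Using the floor for temp keeps decim non-negative even for negative inputs, so $2^{decim}$ always has the form 1.f that an fp32 mantissa can hold directly.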
  • The search unit 73 finds the single-precision floating-point data whose exponent is the fractional part decim through a preset table lookup method.
  • The fractional part is split into several segments, and the single-precision floating-point data of each segment is then solved separately through the preset table lookup method. Specifically,
  • $(decim0)_{2} = 0.xxxxxx$, the first 6 bits of the binary number of decim;
  • $(decim1)_{2} = 0.000000yyyyyy$, the second 6 bits of the binary number of decim;
  • $(decim2)_{2} = 0.000000000000zzzzzz$, the third 6 bits of the binary number of decim;
  • $(decim3)_{2} = 0.000000000000000000vvvvv$, the last 5 bits of the binary number of decim.
  • Therefore $(2^{decim})_{fp32} = (2^{decim0+decim1+decim2+decim3})_{fp32} = (2^{decim0})_{fp32} \times (2^{decim1})_{fp32} \times (2^{decim2})_{fp32} \times (2^{decim3})_{fp32}$.
  • The search unit 73 includes a segmentation subunit 731, four search subunits 732 and a floating-point multiplier subunit 733. The four outputs of the segmentation subunit 731 are connected to the inputs of the four search subunits 732 respectively, and the outputs of the four search subunits 732 are connected to the inputs of the floating-point multiplier subunit 733. The segmentation subunit 731 splits the 23-bit binary number of the fractional part decim into 4 parts.
  • The four search subunits 732 obtain $(2^{decim0})_{fp32}$, $(2^{decim1})_{fp32}$, $(2^{decim2})_{fp32}$ and $(2^{decim3})_{fp32}$ by table lookup, that is, the floating-point numbers corresponding to the first 6 bits, the next 6 bits, the following 6 bits and the last 5 bits of the binary number of decim.
  • The floating-point multiplier subunit 733 solves $(2^{decim})_{fp32}$ from the table lookup results of the four search subunits 732, that is, it multiplies $(2^{decim0})_{fp32}$, $(2^{decim1})_{fp32}$, $(2^{decim2})_{fp32}$ and $(2^{decim3})_{fp32}$ to obtain result_decim.
  • The floating-point multiplier subunit 733 includes three multipliers, which operate on the table lookup results of the four search subunits 732 to obtain result_decim, that is, $(2^{decim})_{fp32}$.
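  • The four-table lookup can be sketched as follows. This is an illustration only: the table contents are computed here in double precision, whereas the hardware would hold fp32 constants, and the names `TABLE0` to `TABLE3` and `pow2_fraction` are hypothetical.

```python
# One table per segment; each entry is 2 ** (value contributed by that segment).
# Segment weights: bits 1-6, 7-12, 13-18 and 19-23 after the binary point.
TABLE0 = [2.0 ** (i / 2 ** 6) for i in range(64)]
TABLE1 = [2.0 ** (i / 2 ** 12) for i in range(64)]
TABLE2 = [2.0 ** (i / 2 ** 18) for i in range(64)]
TABLE3 = [2.0 ** (i / 2 ** 23) for i in range(32)]


def pow2_fraction(decim_bits):
    """Return 2**decim, where decim_bits is the 23-bit integer decim * 2**23."""
    i0 = (decim_bits >> 17) & 0x3F   # first 6 bits
    i1 = (decim_bits >> 11) & 0x3F   # second 6 bits
    i2 = (decim_bits >> 5) & 0x3F    # third 6 bits
    i3 = decim_bits & 0x1F           # last 5 bits
    # Three multiplications combine the four lookups.
    return (TABLE0[i0] * TABLE1[i1]) * (TABLE2[i2] * TABLE3[i3])


decim_bits = int(0.75 * 2 ** 23)
print(pow2_fraction(decim_bits), 2 ** 0.75)
```

  • Splitting into 6, 6, 6 and 5 bits needs only 64 + 64 + 64 + 32 table entries, instead of the $2^{23}$ entries a single table over all 23 bits would require.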
  • The combination unit 75 includes a comparator subunit. The comparator subunit determines which preset numerical range is satisfied by the exponent of $e^{x}$ in the fp32 representation, as obtained by the exponent bit solving unit 74 from the integer part temp. If the exponent satisfies the first or the fourth preset numerical range, the exponential operation result output for the floating-point data is the first constant value or the second constant value respectively; if it satisfies the second or the third preset numerical range, the exponential operation result of the floating-point data is obtained from the corresponding value found by the search unit 73.
  • The values in the first, second, third and fourth preset numerical ranges run from large to small in that order.
  • The comparator subunit may include one or more comparators, for example multiple cascaded comparators; this application does not limit this.
  • If the exponent obtained by the exponent bit solving unit 74 satisfies the first preset numerical range, the comparator subunit outputs the first constant value as the exponential operation result of the floating-point data. In this case the hexadecimal number of the output $(e^{x})_{fp32}$ is 32'h7f80_0000, which is positive infinity in fp32.
  • If the exponent obtained by the exponent bit solving unit 74 satisfies the second preset numerical range, the comparator subunit obtains the exponential operation result of the floating-point data from the corresponding value result_decim found by the search unit 73 and the exponent obtained by the exponent bit solving unit 74. As a non-limiting example, if 0 < temp+127 < 255, the hexadecimal number of the output $(e^{x})_{fp32}$ is {1'b0, temp+127, result_decim[22:0]}.
  • If the exponent obtained by the exponent bit solving unit 74 satisfies the third preset numerical range, the comparator subunit likewise obtains the exponential operation result of the floating-point data from the corresponding value result_decim found by the search unit 73 and the exponent obtained by the exponent bit solving unit 74. In this case the hexadecimal number of the output $(e^{x})_{fp32}$ is {1'b0, 8'h00, ({1'b1, result_decim[22:0]} >> (-temp-126))}.
  • If the exponent obtained by the exponent bit solving unit 74 satisfies the fourth preset numerical range, the comparator subunit outputs the second constant value as the exponential operation result of the floating-point data. In this case the hexadecimal number of the output $(e^{x})_{fp32}$ is 0.
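  • The four cases can be sketched as a single function that builds the 32-bit pattern. The text gives the second-range condition (0 < temp+127 < 255) and the outputs of all four cases; the exact boundaries used below for the first range (temp+127 >= 255) and the third range (-23 <= temp+127 <= 0) are assumptions of this sketch, and the function names are hypothetical.

```python
import struct


def combine(temp, frac23):
    """Build the fp32 bit pattern of 2**temp * 1.f.

    temp   -- integer part of the base-2 exponent
    frac23 -- result_decim[22:0], the 23 mantissa bits of 2**decim
    """
    exp_field = temp + 127
    if exp_field >= 255:                  # assumed first range: overflow
        return 0x7F800000                 # 32'h7f80_0000
    if exp_field > 0:                     # second range: normal number
        return (exp_field << 23) | frac23
    if exp_field >= -23:                  # assumed third range: subnormal
        # {1'b1, frac} >> (-temp-126), with an all-zero exponent field
        return ((1 << 23) | frac23) >> (-temp - 126)
    return 0                              # fourth range: too small for fp32


def bits_to_float(bits):
    """Decode a 32-bit pattern as an fp32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(hex(combine(1, 0x400000)), bits_to_float(combine(1, 0x400000)))
print(hex(combine(-127, 0)), bits_to_float(combine(-127, 0)))
```

  • In the subnormal case the implicit leading 1 is made explicit before shifting, because fp32 subnormals have no hidden bit.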
  • The above exponential operation results need to be quantized to obtain the final exponential function operation results of the data to be calculated. In practical applications, however, the above steps can be extended to accelerate the softmax function operation, which additionally includes the following steps:
  • S140 Read the exponential operation results of each floating point data and accumulate them to obtain the accumulation operation results.
  • The exponential operation results of the corresponding number of floating-point data are read sequentially from the first memory and accumulated to obtain the accumulation operation result. That is, given the corresponding number of exponential operation results $e^{x}$, $e^{y}$, $e^{z}$, ..., the sum $e^{x} + e^{y} + e^{z} + \cdots$ is solved.
  • The corresponding number is the number of input data to be calculated in any preset channel, that is, the dimension of the input array. The following example for channel 0 uses 13 data to be calculated for illustration.
  • The accumulation operation result is obtained by reading the exponential operation results of the corresponding number of floating-point data from the first memory ram0 and summing them ($\sum$).
  • The exponential operation results can be accumulated by multiple adders to obtain the accumulation operation result.
  • The structure of the adder module composed of multiple adders is shown in Figure 9.
  • The adder module includes multiple adders, which form an addition tree.
  • The adder module includes 7 floating-point adder units 91 and one accumulator unit 92. The adder module can therefore solve the cumulative sum of 8 data at a time, that is, it continuously reads 8 data from the first memory ram.
  • Suppose the input data to be calculated consists of 13 items, a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12. Solving $e^{x}$ for each item yields 13 exponential operation results stored in the first memory ram, namely $e^{a0}, e^{a1}, e^{a2}, e^{a3}, e^{a4}, e^{a5}, e^{a6}, e^{a7}, e^{a8}, e^{a9}, e^{a10}, e^{a11}, e^{a12}$.
  • The cumulative sum is found as follows. First, 8 numbers, $e^{a0}, e^{a1}, e^{a2}, e^{a3}, e^{a4}, e^{a5}, e^{a6}, e^{a7}$, are read from the first memory ram and accumulated to obtain the partial sum p_sum0. Then $e^{a8}, e^{a9}, e^{a10}, e^{a11}, e^{a12}$ are read from the first memory ram. Since fewer than 8 numbers remain, the missing 3 are replaced with 0, that is, $e^{a8}, e^{a9}, e^{a10}, e^{a11}, e^{a12}, 0, 0, 0$ are accumulated to obtain the partial sum p_sum1, which is added to the previous partial sum p_sum0 to obtain the accumulation operation result of all exponential operation results, that is, the cumulative sum p_sum.
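  • The accumulation of 8 values at a time with zero padding can be sketched as follows. The arrangement of the 7 adders into levels of 4, 2 and 1 is an assumption of this sketch (the text only states that the adders form an addition tree), and the function names are hypothetical.

```python
def tree_sum8(values):
    """Sum exactly 8 values with 7 pairwise additions in 3 levels (4 + 2 + 1)."""
    assert len(values) == 8
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def accumulate(exp_results):
    """Read 8 results at a time, zero-pad the last chunk, keep a running sum."""
    p_sum = 0.0                                  # plays the accumulator role
    for start in range(0, len(exp_results), 8):
        chunk = list(exp_results[start:start + 8])
        chunk += [0.0] * (8 - len(chunk))        # replace missing values with 0
        p_sum += tree_sum8(chunk)                # p_sum0, then p_sum0 + p_sum1
    return p_sum


print(accumulate([1.0] * 13))   # 13 inputs: one full chunk plus 5 values and 3 zeros
```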
  • the number of adders in this embodiment can be designed according to actual conditions and is not limited here.
  • The softmax function can be transformed and solved using the exponential operation results and the accumulation operation result obtained above.
  • The softmax function formula is $\mathrm{softmax}(x_i) = e^{x_i} / \sum_{j} e^{x_j}$. Based on this formula, the accumulation operation result is $p\_sum = \sum_{j} e^{x_j}$, and the corresponding exponential operation result $e^{x_i}$ is read from the first memory. The ratio of the exponential operation result to the accumulation operation result then gives the softmax result of the floating-point data.
  • The exponential operation result of each floating-point value is read from the first memory ram and divided by the accumulation operation result (div) to obtain the corresponding ratio.
  • The division div can be performed by a divider. Specifically, the $e^{x}$ results are read sequentially from the first memory as the dividend of the divider, and the cumulative sum p_sum obtained in step S140 is used as the divisor.
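  • The ratio step can be sketched as follows. Here `math.exp` stands in for the exponential operation results stored in the first memory, and `softmax_channel` is a hypothetical name for the processing of one preset channel.

```python
import math


def softmax_channel(xs):
    """Softmax of one channel: each e**x divided by the sum of all e**x."""
    exps = [math.exp(x) for x in xs]     # stand-in for the stored e**x results
    p_sum = sum(exps)                    # accumulation result (divisor)
    return [e / p_sum for e in exps]     # each e**x is a dividend


probs = softmax_channel([1.0, 2.0, 3.0])
print(probs, sum(probs))
```

  • Because every result shares the same divisor p_sum, the sum only has to be computed once per channel before the divisions start.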
  • The ratio needs to be quantized: quantization converts the floating-point data into integer data, which gives the softmax function operation result of the data to be calculated.
  • Quantization refers to converting floating-point data types into integer data types. The quantized results are written into the first memory for temporary storage until all floating-point data have been quantized, and are then output in parallel. The quantization process is the inverse of the inverse quantization process described above, to which reference can be made.
  • The ratio in the floating-point data type is quantized to obtain the ratio in the int8 or int16 data type, and the ratio is written into the first memory ram for temporary storage, as shown in Figure 2.
  • The second memory may be DDR.
  • The softmax function operation results of each of the 4 channels are read from the first memory ram and then written into the second memory DDR in the NC4HW format. The quantized ratios of the four channels are arranged in the same format as the data to be calculated of the four channels, which is not described again here.
  • The sequence number of each step in the above embodiment does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.
  • An embodiment of the present application also provides an exponential function hardware system. For details of the exponential function hardware system that are not described here, refer to the relevant description of the aforementioned method; they are not repeated here.
  • Figure 10 is a schematic structural diagram of an exponential function hardware system provided by an embodiment of the present invention.
  • The system includes a control module 1010, an exponential function calculation module 1110, a data reading module 1030, a data writing module 1040, a first memory 1050 and a second memory 1060. The control module 1010 is connected to the exponential function calculation module 1110, the data reading module 1030, the data writing module 1040, the first memory 1050 and the second memory 1060 respectively, and is used to control the timing and quantity of data reads and writes, as well as the working logic and flow of each module.
  • The control module 1010 controls the data reading module 1030 in reading the data to be calculated from the second memory 1060 (such as DDR), including the amount of data read and the time of reading. It also controls the data writing module 1040 in writing the exponential function operation result of the data to be calculated into the second memory 1060, including the time of writing.
  • The control module 1010 is also used to control the work flow of the exponential function calculation module 1110, and so on.
  • The data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the exponential function calculation module 1110.
  • The exponential function calculation module 1110 solves the exponential function operation result of the data to be calculated.
  • The system includes one or more exponential function calculation modules 1110. When there is one, the data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to that exponential function calculation module 1110. When there are several, the data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the exponential function calculation module 1110 of the corresponding channel.
  • the exponential function calculation module 1110 includes: an inverse quantization sub-module 1021, an exponential calculation sub-module 1022 and a quantization sub-module 1025, wherein the inverse quantization sub-module 1021 is used to perform the calculation on the data to be calculated. Inverse quantization is performed to obtain the floating point data corresponding to the data to be calculated; the exponent calculation sub-module 1022 is used to perform deformation splitting on each floating point data according to the preset splitting formula to obtain floating point data in which the exponent bits are respectively the integer part and the decimal part.
  • the quantization sub-module 1025 is used for Quantify the exponential operation result to obtain the exponential function operation result of the data to be calculated. It should be noted that the specific content of each sub-module included in the exponential function calculation module 1110 can be found in steps S120 to S130, which will not be described again here.
  • the exponent calculation sub-module 1022 is the exponent calculation module as shown in Figure 7, including: a base changing unit 71, a splitting unit 72, a search unit 73, an exponent bit solving unit 74 and a combination unit 75.
  • An embodiment of the present application also provides a hardware system for softmax function.
  • For details of the softmax function hardware system that are not described here, refer to the description of the aforementioned method; they are not repeated here.
  • Figure 12 is a schematic structural diagram of a hardware system for the softmax function provided by an embodiment of the present invention. Note that the embodiment shown in Figure 10 shows four softmax function calculation modules 1020. In other embodiments, the softmax function hardware system may include only one softmax function calculation module 1020, or some other number of them; the number of calculation modules is chosen according to the number of channels, and this application does not specifically limit it.
  • the hardware system of the softmax function includes: a control module 1010, a softmax function calculation module 1020, a data reading module 1030, a data writing module 1040, a first memory 1050 and a second memory 1060.
  • The control module 1010 is connected to the softmax function calculation module 1020, the data reading module 1030, the data writing module 1040, the first memory 1050 and the second memory 1060 respectively, and controls the timing and amount of data reads and writes as well as the working logic and flow of each module.
  • Specifically, the control module 1010 controls which data the data reading module 1030 reads from the second memory 1060 (for example a DDR), how much it reads and when; it also controls how and when the data writing module 1040 writes the softmax results of the data to be calculated to the second memory 1060.
  • the control module 1010 is also used to control the workflow of the softmax function calculation module 1020 and so on.
  • the data reading module 1030 is used to read the data to be calculated from the second memory 1060 under the control of the control module 1010, and send the data to be calculated to the softmax function calculation module 1020.
  • the softmax function calculation module 1020 is used to solve the softmax function calculation results of the data to be calculated.
  • There may be one or more softmax function calculation modules 1020. When there is one, the data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the softmax function calculation module 1020.
  • When there are several, the data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the softmax function calculation module 1020 of the corresponding channel.
  • The softmax function calculation module 1020 includes an inverse quantization sub-module 1021, an exponent calculation sub-module 1022, an adder sub-module 1023, a divider sub-module 1024 and a quantization sub-module 1025, where:
  • the inverse quantization sub-module 1021 inversely quantizes the data to be calculated to obtain the corresponding floating-point data;
  • the exponent calculation sub-module 1022 transforms and splits each floating-point datum according to the preset splitting formula to obtain floating-point data whose exponent is the integer part and the fractional part respectively, performs the preset function calculation on each part, and recombines the results into the exponential operation result of the floating-point datum; the adder sub-module 1023 accumulates the exponential operation results of the floating-point data to obtain the accumulation result; the divider sub-module 1024 calculates the ratio of each stored exponential operation result to the accumulation result; and the quantization sub-module 1025 quantizes the ratio to obtain the softmax function operation result.
  • the exponent calculation sub-module 1022 is the exponent calculation module as shown in Figure 7, including: a base changing unit 71, a splitting unit 72, a search unit 73, an exponent bit solving unit 74 and a combination unit 75.
  • An embodiment of the present application also provides a chip, which includes the aforementioned exponential function hardware system and/or softmax function hardware system.
  • An embodiment of the present application also provides a computer-readable storage medium.
  • The computer-readable storage medium stores a computer program.
  • When executed by a processor, the computer program can implement the aforementioned exponential function hardware system and/or the optimization method based on the softmax function.
  • An embodiment of the present application provides a computer program product.
  • When the computer program product runs on an electronic device, the terminal device can implement the steps in the embodiments of the aforementioned exponential function hardware system and/or the optimization method based on the softmax function.
  • The computer program includes computer program code.
  • The computer program code may be in the form of source code, object code, an executable file, some intermediate form, and so on.
  • Computer-readable media may include: any entity or device capable of carrying computer program code, recording media, USB flash drives, removable hard disks, magnetic disks, optical disks, computer memory, read-only memory (ROM), RAM, electrical carrier signals, telecommunications signals, software distribution media, and so on. Note that the content covered by the term computer-readable medium may be expanded or reduced as required by legislation and patent practice in a given jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

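An optimization method, hardware system and chip based on an exponential function and the softmax function, applicable to the technical field of artificial neural networks. The method comprises: reading data to be calculated and inversely quantizing each item to obtain the corresponding floating-point data; transforming and splitting each floating-point datum to obtain floating-point data whose exponent is an integer part and a fractional part respectively; performing a preset function calculation on the integer part and the fractional part respectively to obtain their calculation results, and recombining the results into the exponential operation result of the floating-point datum; and quantizing the exponential operation result of each floating-point datum to obtain the exponential function operation result of each item of data to be calculated. By converting the data to be calculated into floating-point data for the exponential calculation, the method places no restriction on the numerical range of the input data and can also greatly improve calculation accuracy.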

Description

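Optimization method, hardware system and chip based on an exponential function and the softmax function
This application claims priority to Chinese patent application No. 202210283260.9, filed with the China Patent Office on March 22, 2022 and entitled "Optimization method, hardware system and chip based on an exponential function and the softmax function", the entire contents of which are incorporated herein by reference.
Technical Field
The present invention belongs to the technical field of neural networks, and in particular relates to an optimization method, hardware system and chip based on an exponential function and the normalized exponential function (softmax function).
Background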
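The softmax function appears almost everywhere in neural networks, typically as the activation function of the output layer in classification tasks. Implementing the softmax function requires evaluating the expressions given in formula images PCTCN2022100635-appb-000001 and PCTCN2022100635-appb-000002.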
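At present, $e^{x_j}$ is usually implemented by piecewise linear fitting. Specifically, the input data $x_j$ is confined to a set range, and softmax is implemented only for data within that range. The range is divided into several intervals, and within each interval $e^{x_j}$ is approximated by a linear function of one variable, which yields $e^{x_j}$. The terms are then accumulated to obtain $\sum_{k} e^{x_k}$, and finally a divider is used to obtain $e^{x_j} / \sum_{k} e^{x_k}$.
It can be seen that the prior art cannot compute softmax over an arbitrary numerical range.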
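Summary of the Invention
In view of this, embodiments of the present invention provide an optimization method, hardware system and chip based on an exponential function and the softmax function, which can solve one or more technical problems in the related art.
In a first aspect, an embodiment of the present application provides an optimization method based on an exponential function, comprising: reading data to be calculated and inversely quantizing each item of data to be calculated to obtain the floating-point data corresponding to each item; transforming and splitting each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively; performing a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombining the calculation results to obtain the exponential operation result of the floating-point datum; and quantizing the exponential operation result corresponding to each floating-point datum to obtain the exponential function operation result of each item of data to be calculated.
In a second aspect, an embodiment of the present application provides an optimization method based on the softmax function, comprising: reading data to be calculated and inversely quantizing each item of data to be calculated to obtain the floating-point data corresponding to each item; transforming and splitting each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively; performing a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombining the calculation results to obtain the exponential operation result of the floating-point datum; accumulating the exponential operation results corresponding to the floating-point data to obtain an accumulation result, and calculating the ratio of each exponential operation result to the accumulation result; and quantizing the ratio corresponding to each floating-point datum to obtain the softmax function operation result of the data to be calculated.
In a third aspect, an embodiment of the present application provides a hardware system for an exponential function, comprising: a data reading module configured to read data to be calculated; and an exponential function calculation module configured to inversely quantize each item of data to be calculated to obtain the floating-point data corresponding to each item; to transform and split each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively; to perform a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombine the calculation results to obtain the exponential operation result of the floating-point datum; and to quantize the exponential operation result corresponding to each floating-point datum to obtain the exponential function operation result of each item of data to be calculated.
In a fourth aspect, an embodiment of the present application provides a hardware system for the softmax function, comprising: a data reading module configured to read data to be calculated; and a softmax function calculation module configured to inversely quantize each item of data to be calculated to obtain the floating-point data corresponding to each item; to transform and split each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively; to perform a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombine the calculation results to obtain the exponential operation result of the floating-point datum; to accumulate the exponential operation results corresponding to the floating-point data to obtain an accumulation result, and calculate the ratio of each exponential operation result to the accumulation result; and to quantize the ratio corresponding to each floating-point datum to obtain the softmax function operation result of the data to be calculated.
In a fifth aspect, an embodiment of the present application provides a chip comprising the hardware system for an exponential function according to any embodiment of the third aspect, or the hardware system for the softmax function according to any embodiment of the fourth aspect.
In a sixth aspect, an embodiment of the present application provides a computer storage medium storing a computer program which, when executed by a processor, implements the optimization method based on an exponential function according to any embodiment of the first aspect, or the optimization method based on the softmax function according to any embodiment of the second aspect.
In a seventh aspect, an embodiment of the present application provides a computer program product which, when run on an electronic device, enables the electronic device to implement the optimization method based on an exponential function according to any embodiment of the first aspect, or the optimization method based on the softmax function according to any embodiment of the second aspect.
By converting the data to be calculated into floating-point data for calculation, the embodiments of the present application place no restriction on the numerical range of the input data, and can also greatly improve calculation accuracy and calculation speed.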
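Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a schematic flowchart of an implementation of an optimization method based on the softmax function provided by an embodiment of the present application;
Figure 2 is a schematic process diagram of an optimization method based on the softmax function provided by an embodiment of the present application;
Figure 3 is a schematic diagram of the data to be calculated for each of 4 channels provided by an embodiment of the present application;
Figure 4 is a schematic diagram of the arrangement format in DDR of the data to be calculated for 4 channels provided by an embodiment of the present application;
Figure 5 is a schematic diagram of the process of reading the data to be calculated for 4 channels from DDR provided by an embodiment of the present application;
Figure 6 is a schematic diagram of the fp32 data format provided by an embodiment of the present application;
Figure 7 is a schematic structural diagram of a single-precision floating-point calculation module provided by an embodiment of the present application;
Figure 8 is a schematic structural diagram of a search unit 73 provided by an embodiment of the present application;
Figure 9 is a schematic structural diagram of an adder module provided by an embodiment of the present application;
Figure 10 is a schematic structural diagram of a hardware system for the softmax function provided by an embodiment of the present application;
Figure 11 is a schematic structural diagram of a hardware system for the softmax function provided by an embodiment of the present application;
Figure 12 is a schematic structural diagram of a hardware system for the softmax function provided by an embodiment of the present application;
Figure 13 is a schematic structural diagram of a softmax function calculation module 1020 provided by an embodiment of the present application.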
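Detailed Description
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, so that the embodiments of the present invention can be thoroughly understood. However, it should be clear to those skilled in the art that the present invention can also be practiced in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
The term "and/or" used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes these combinations.
References in this specification to "one embodiment", "some embodiments" and the like mean that a particular feature, structure or characteristic described in connection with that embodiment is included in one or more embodiments of the present application. Therefore, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments" and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless specifically emphasized otherwise. The terms "include", "comprise", "have" and their variants all mean "including but not limited to", unless specifically emphasized otherwise.
In addition, in the description of the present application, "multiple" means two or more. The terms "first", "second" and the like are used only to distinguish descriptions and should not be understood as indicating or implying relative importance.
It should also be understood that, unless otherwise expressly specified or limited, the term "connected" is to be understood broadly: for example, it may be a fixed connection, a detachable connection or an integral connection; it may be a direct connection or an indirect connection through an intermediate medium; and it may be internal communication between two elements or an interaction between two elements. A person of ordinary skill in the art can understand the specific meaning of the above terms in the present application according to the circumstances.
To illustrate the technical solutions of the present invention, specific embodiments are described below.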
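An embodiment of the present application provides an optimization method based on the softmax function, which can be deployed in any kind of chip. Figure 1 is a schematic flowchart of an optimization method based on the softmax function provided by an embodiment of the present application. Figure 2 is a schematic process diagram of the softmax-based optimization method of the embodiment shown in Figure 1.
In one embodiment, as shown in Figure 1, the optimization method based on the softmax function may include steps S110 to S160.
S110: Read the data to be calculated, and send the corresponding data to be calculated to each preset channel.
The data to be calculated is the data stored in the second memory. The second memory may be a dynamic random access memory (DRAM) such as a double data rate synchronous dynamic random access memory (DDR SDRAM).
In some embodiments, the data reading step reads the data to be calculated from the second memory and sends the corresponding data to be calculated to each channel. In the embodiment shown in Figure 2, the data reading step reads the data to be calculated from the DDR memory and sends the corresponding data to be calculated to each of the four preset channels, until the data of every channel has been read.
In one embodiment, before the data is read, the data that needs to be calculated must be arranged in a preset arrangement format to obtain the data to be calculated. Preferably, the preset arrangement format is the NC4HW format. More specifically, the data that needs to be calculated is usually of a type such as int8 or int16, and it is written to the DDR in NC4HW format to obtain data to be calculated arranged in NC4HW format. Specifically, if the number of data that need to be calculated is C, they can be divided into C/4 groups, and after grouping the data to be calculated in NC4HW format is obtained by interleaving. Note that if the number of data is not divisible by 4, the data is padded with zeros; no limitation is imposed here.
As a non-limiting example, as shown in Figure 3, the data to be calculated is, from top to bottom, the data of channel 0, channel 1, channel 2 and channel 3. The data of the 4 channels is arranged in the DDR in NC4HW format as shown in Figure 4: the data at the same position in the 4 channels is stored in adjacent locations in order. As shown in Figure 5, the data arranged in NC4HW format is read out of the DDR 4 values at a time and sent to the corresponding channels for subsequent processing, so that softmax is computed for 4 groups of data at the same time. This gives a 4x speed-up and greatly shortens the computation time.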
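The interleaved layout described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the patent's implementation: each channel is a flat list of equal length, a whole zero-filled channel is appended when the channel count is not a multiple of 4, and the names `pack_c4` and `read_c4` are hypothetical.

```python
def pack_c4(channels, group=4):
    """Interleave per-channel lists so same-position values are adjacent."""
    length = len(channels[0])
    padded = [list(c) for c in channels]
    # Assumption: pad with all-zero channels until the count divides by `group`.
    while len(padded) % group != 0:
        padded.append([0] * length)
    packed = []
    for g in range(0, len(padded), group):       # one block per group of 4 channels
        for pos in range(length):                # walk positions inside the block
            for c in range(g, g + group):        # emit the 4 channel values together
                packed.append(padded[c][pos])
    return packed


def read_c4(packed, group=4):
    """Yield one tuple per read: one value for each of the `group` channels."""
    for i in range(0, len(packed), group):
        yield tuple(packed[i:i + group])
```

Each tuple produced by `read_c4` corresponds to one read of 4 values, with element k going to channel k.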
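Note that some embodiments of the present application include multiple preset channels, for example 4. Each preset channel can compute softmax for one group of data, and all preset channels have the same hardware implementation and structure. The following embodiments therefore describe the data of only one preset channel. By processing the data of multiple channels simultaneously, the embodiments of the present application greatly shorten the computation time and thus provide an efficient and fast hardware implementation of softmax.
Note also that some other embodiments of the present application may include only one channel, that is, softmax is computed for a single group of data.
S120: The data to be calculated is inversely quantized in each preset channel to obtain the floating-point data corresponding to each preset channel.
The data to be calculated is inversely quantized into floating-point data according to the preset inverse quantization parameter of the preset channel; the preset inverse quantization parameter can be obtained from the system configuration.
As a non-limiting example, the int8 or int16 data read from the DDR is converted into single-precision floating-point numbers (fp32) or half-precision floating-point numbers (fp16) according to the inverse quantization parameter.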
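Note that the following embodiments take conversion of the data to be calculated into single-precision floating-point numbers (fp32) as an example; this illustration should not be construed as limiting the present application. The single-precision format is shown in Figure 6; fp32 allows fractional values to be represented and calculated. fp32 is a data type encoded and stored in 4 bytes, that is, 32 bits: the first bit is the sign bit (sig), the next 8 bits are the exponent (exp), and the last 23 bits are the mantissa (fra). Any fp32 value falls into one of the following five cases:
(1) When exp == 0 and fra == 0, the value is 0.
(2) When exp == 0 and fra != 0, the value is $(-1)^{\mathrm{sig}} \times (0.\mathrm{fra}) \times 2^{(1-127)}$.
(3) When exp == 8'b1111_1111 and fra == 23'd0, the value is positive or negative infinity: negative infinity when sig == 1, positive infinity when sig == 0. Here b means the number is written in binary, the 8' before b means the number is 8 bits wide, and the digits after b are its binary value; d means the number is written in decimal, the 23' before d means the number is 23 bits wide, and the digits after d are its decimal value.
(4) When exp == 8'b1111_1111 and fra != 23'd0, the value is not a number (NaN).
(5) All other values can be expressed as $(-1)^{\mathrm{sig}} \times (1.\mathrm{fra}) \times 2^{(\mathrm{exp}-127)}$.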
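The five cases above can be checked with a short decoder. This is an illustrative sketch only: the function names `decode_fp32` and `float_to_bits` are hypothetical, and Python's double-precision floats are used to hold the decoded value.

```python
import struct


def decode_fp32(bits):
    """Decode a 32-bit pattern using the five fp32 cases listed above."""
    sig = (bits >> 31) & 0x1        # 1 sign bit
    exp = (bits >> 23) & 0xFF       # 8 exponent bits
    fra = bits & 0x7FFFFF           # 23 mantissa bits
    sign = -1.0 if sig else 1.0
    if exp == 0 and fra == 0:       # case (1): zero
        return sign * 0.0
    if exp == 0:                    # case (2): (0.fra) * 2^(1-127)
        return sign * (fra / 2 ** 23) * 2.0 ** (1 - 127)
    if exp == 0xFF:                 # cases (3) and (4): infinity or NaN
        return sign * float("inf") if fra == 0 else float("nan")
    # case (5): (1.fra) * 2^(exp-127)
    return sign * (1 + fra / 2 ** 23) * 2.0 ** (exp - 127)


def float_to_bits(value):
    """Return the fp32 bit pattern of a Python float (helper for experiments)."""
    return struct.unpack("<I", struct.pack("<f", value))[0]
```

For instance, `decode_fp32(float_to_bits(v))` reproduces any `v` that is exactly representable in fp32.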
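In one embodiment, let the inverse quantization parameter be fp32_scale. The input data to be calculated, input_0, is then converted by inverse quantization into a single-precision floating-point number: input_00 = fp32_scale * input_0. Note that the quantization mentioned later is the inverse of this process, that is, it recovers input_0 from input_00.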
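A minimal sketch of this scale-based conversion follows. The multiplication by `fp32_scale` is taken from the text; the rounding mode and the saturation to the integer range in `quantize` are assumptions added for illustration, since the source only states that quantization is the inverse of inverse quantization.

```python
def dequantize(values, fp32_scale):
    """input_00 = fp32_scale * input_0 for every integer input."""
    return [fp32_scale * v for v in values]


def quantize(values, fp32_scale, bits=8):
    """Inverse of dequantize; rounding and clamping are illustrative assumptions."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    out = []
    for v in values:
        q = round(v / fp32_scale)          # back to the integer grid
        out.append(max(lo, min(hi, q)))    # saturate to the int8/int16 range
    return out
```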
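S130: Each floating-point datum is transformed and split to obtain floating-point data whose exponent is an integer part and a fractional part; a preset function calculation is performed on each of the split parts to obtain the corresponding calculation results; and the calculation results are recombined into the exponential operation result, which is stored.
Note that this embodiment uses $e^x$ as the example. As shown in Figure 2, the exponent calculation computes $e^x$ for the floating-point data. More specifically, an $e^x$ operation is performed on the floating-point data obtained by inverse quantization in each preset channel to obtain the exponential operation result for that channel, that is, e raised to the power input_00.
In one embodiment, $e^x$ is fitted and solved in single-precision floating-point form, and the exponential operation result is written to the first memory ram, where x in $e^x$ is the floating-point datum whose exponential is to be computed, and the first memory is, for example, a random access memory (RAM).
In another embodiment, because hardware cannot compute $e^x$ directly, before the $e^x$ calculation the present application transforms and splits the floating-point datum to obtain floating-point data whose exponent is an integer part and a fractional part, performs a preset function calculation on each of the split parts to obtain the corresponding calculation results, recombines the calculation results into the exponential operation result, and stores it. Specifically, the transformation, splitting and recombination can be implemented by the single-precision floating-point calculation module shown in Figure 7.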
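In one embodiment, the basic design principle of the single-precision floating-point calculation module is as follows. Since e is a constant approximately equal to 2.718281828459, $e^x$ can first be transformed (change of base) into an exponential function with base 2, namely
$e^x = 2^{\log_2 e^x}$
where $\log_2 e^x = x \times \log_2 e$. Since $\log_2 e$ is a constant, $\log_2 e^x$ can be understood as x multiplied by a constant. On this basis the result of $x \times \log_2 e$ is split, because any floating-point number can be split into an integer part and a fractional part. More specifically, let temp be the integer part of $x \times \log_2 e$ and decim its fractional part; combining the split data gives $x \times \log_2 e = \mathrm{temp} + \mathrm{decim}$. Therefore,
$e^x = 2^{x \times \log_2 e} = 2^{\mathrm{temp} + \mathrm{decim}} = 2^{\mathrm{temp}} \times 2^{\mathrm{decim}}$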
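The change of base and the split into temp and decim can be illustrated as follows. This is a sketch under one stated assumption: the integer part is taken with a floor, so that decim stays in [0, 1) for negative inputs as well, which is the range the subsequent derivation relies on. The name `split_exponent` is hypothetical.

```python
import math

LOG2_E = math.log2(math.e)   # the constant log2(e)


def split_exponent(x):
    """Return (temp, decim) with x * log2(e) = temp + decim and 0 <= decim < 1."""
    t = x * LOG2_E           # change of base: e^x = 2^t
    temp = math.floor(t)     # integer part (floor, an assumption for negative t)
    decim = t - temp         # fractional part
    return temp, decim
```

With these two values, `2.0 ** temp * 2.0 ** decim` agrees with `math.exp(x)` up to floating-point rounding.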
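For ease of description, in the embodiments of the present application $(aa)_2$ denotes the binary form of aa, aa is decimal by default, $(aa)_{10}$ denotes the decimal form of aa, and $(aa)_{\mathrm{fp32}}$ denotes the single-precision floating-point representation of aa. Since $(\mathrm{decim})_{10} \in [0,1)$, it follows that $(2^{\mathrm{decim}})_{10} \in [1,2)$. According to the fp32 representation given in the embodiment of step S120, the fp32 representation of $e^x$ is then:
$(e^x)_{\mathrm{fp32}} = 2^{\mathrm{temp}} \times 2^{\mathrm{decim}} = (-1)^0 \times 1.xxxxxxxxxxxxxxxxxxx \times 2^{\mathrm{temp}}$
$= (-1)^0 \times 1.xxxxxxxxxxxxxxxxxxx \times 2^{(\mathrm{temp}+127)-127}$
From the above formula, the exponent field of the current floating-point datum is temp+127. Let the single-precision result of $2^{\mathrm{decim}}$ be result_decim = $(2^{\mathrm{decim}})_{\mathrm{fp32}}$; then $(e^x)_{\mathrm{fp32}}$ is as follows.
When temp+127 > 0:
$e^x$ = temp+127 > 255 ? 32'h7f80_0000 : {1'b0, temp+127, result_decim[22:0]}
That is, when temp+127 > 255, the hexadecimal value of $(e^x)_{\mathrm{fp32}}$ is positive infinity, 32'h7f80_0000; when temp+127 ≤ 255, it is {1'b0, temp+127, result_decim[22:0]}, where h denotes hexadecimal notation and result_decim[22:0] denotes the low 23 bits of the binary form of result_decim.
When temp+127 ≤ 0, the exponent field of $(e^x)_{\mathrm{fp32}}$ is 0, and according to the representation in step S120 the current data can be expressed as $(-1)^{\mathrm{sig}} \times (0.\mathrm{fra}) \times 2^{(1-127)}$, that is:
$e^x = 2^{\mathrm{temp}} \times 2^{\mathrm{decim}} = 2^{1-127} \times 2^{\mathrm{decim}} / 2^{-\mathrm{temp}-126}$
This gives $(0.\mathrm{fra}) = 2^{\mathrm{decim}} / 2^{-\mathrm{temp}-126}$. Since result_decim = $(2^{\mathrm{decim}})_{\mathrm{fp32}}$, we have $2^{\mathrm{decim}}$ = {1.result_decim[22:0]}, that is, {1'b1, result_decim[22:0]}. Furthermore, in the fp32 format the mantissa can hold only 23 bits, so 23 is the limit that decides how $e^x$ is computed. Specifically:
$e^x$ = temp+127 > -23 ? {1'b0, 8'h00, ({1'b1, result_decim[22:0]} >> (-temp-126))} : 0
Here, when temp+127 > -23, that is, when -temp-126 does not exceed 23, the hexadecimal value of $(e^x)_{\mathrm{fp32}}$ is {1'b0, 8'h00, ({1'b1, result_decim[22:0]} >> (-temp-126))}, where {1'b1, result_decim[22:0]} >> (-temp-126) denotes the integer obtained by shifting {1'b1, result_decim[22:0]} right by (-temp-126) bits. When temp+127 ≤ -23, the current data is outside the range that fp32 can represent, and the hexadecimal value of $(e^x)_{\mathrm{fp32}}$ defaults to 0.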
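The two selection expressions above translate directly into bit operations. The sketch below follows the stated conditions literally; the names `assemble_exp_bits` and `bits_to_float` are hypothetical, and `result_decim_bits` stands for the 23-bit field result_decim[22:0].

```python
import struct


def assemble_exp_bits(temp, result_decim_bits):
    """Build the fp32 bit pattern of e^x from temp and result_decim[22:0]."""
    frac = result_decim_bits & 0x7FFFFF
    e = temp + 127                       # biased exponent field
    if e > 0:
        if e > 255:
            return 0x7F800000            # positive infinity
        # {1'b0, temp+127, result_decim[22:0]}; the source compares with "> 255",
        # so e == 255 is assembled like any other exponent value.
        return (e << 23) | frac
    if e > -23:
        # Exponent field 0: restore the hidden 1, then shift right by -temp-126.
        return ((1 << 23) | frac) >> (-temp - 126)
    return 0                             # too small to represent


def bits_to_float(bits):
    """Interpret a 32-bit pattern as an fp32 value (helper for checking)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

For example, temp = 1 with a fraction field of 0x400000 (that is, 1.5) assembles to the pattern of 3.0.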
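It can be seen that, when computing the exponential function, $e^x$ is first converted to base 2, and the exponent $x \times \log_2 e$ of the resulting $2^{x \times \log_2 e}$ is split into the integer part temp and the fractional part decim. Then, following the fp32 representation given in the embodiment of step S120, the preset function calculation evaluates the exponent field of the floating-point datum to obtain the representation of $(e^x)_{\mathrm{fp32}}$. In this representation the integer part temp is known once the split has been made, so only result_decim[22:0] is needed to obtain $(e^x)_{\mathrm{fp32}}$. In other words, only $(2^{\mathrm{decim}})_{\mathrm{fp32}}$ has to be solved, and recombining the known integer part with $(2^{\mathrm{decim}})_{\mathrm{fp32}}$ gives the exponential operation result of the floating-point datum in single-precision form.
Further, based on the above design principle, the single-precision floating-point calculation module includes a base changing unit 71, a splitting unit 72, a search unit 73, an exponent bit solving unit 74 and a combination unit 75. The base changing unit 71 changes the base of the floating-point datum (input_00) of the data to be calculated to obtain the base-changed data, that is, $e^x = 2^{x \times \log_2 e}$. The splitting unit 72 splits the exponent of the base-changed data into the integer part temp and the fractional part decim, that is, $x \times \log_2 e = \mathrm{temp} + \mathrm{decim}$ and $e^x = 2^{\mathrm{temp}} \times 2^{\mathrm{decim}}$. The search unit 73 looks up the single-precision floating-point datum whose exponent is the fractional part decim, that is, $(2^{\mathrm{decim}})_{\mathrm{fp32}}$. The exponent bit solving unit 74 solves the exponent field of $e^x$ in fp32 form from the integer part temp. The combination unit 75 recombines the single-precision datum for the fractional part decim with the fp32 exponent field solved from the integer part temp to obtain the exponential operation result of the floating-point datum corresponding to the data to be calculated. With a hardware system designed in this way, the range of the input data is not restricted, and accuracy is greatly improved; for example, with single-precision floating-point numbers the accuracy can reach $10^{-4}$.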
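In one embodiment, the search unit 73 uses a preset table lookup method to find the single-precision floating-point datum whose exponent is the fractional part decim.
In one implementation, the fractional part of the exponent is split further, and the single-precision value of each piece is then obtained by the preset table lookup method. Specifically,
$(\mathrm{decim})_2 = 0.xxxxxxyyyyyyzzzzzzvvvvv$
$= 0.xxxxxx + 0.000000yyyyyy + 0.000000000000zzzzzz + 0.000000000000000000vvvvv$
$(\mathrm{decim0})_2 = 0.xxxxxx$ is the first 6 bits of the binary form of decim;
$(\mathrm{decim1})_2 = 0.000000yyyyyy$ is the second 6 bits of the binary form of decim;
$(\mathrm{decim2})_2 = 0.000000000000zzzzzz$ is the third 6 bits of the binary form of decim;
$(\mathrm{decim3})_2 = 0.000000000000000000vvvvv$ is the last 5 bits of the binary form of decim.
Therefore, in this implementation, $(2^{\mathrm{decim}})_{\mathrm{fp32}}$ is solved as follows:
$(2^{\mathrm{decim}})_{\mathrm{fp32}} = (2^{\mathrm{decim0}+\mathrm{decim1}+\mathrm{decim2}+\mathrm{decim3}})_{\mathrm{fp32}} = (2^{\mathrm{decim0}})_{\mathrm{fp32}} \times (2^{\mathrm{decim1}})_{\mathrm{fp32}} \times (2^{\mathrm{decim2}})_{\mathrm{fp32}} \times (2^{\mathrm{decim3}})_{\mathrm{fp32}}$
As a non-limiting implementation, as shown in Figure 8, the search unit 73 includes a dividing sub-unit 731, 4 search sub-units 732 and a floating-point multiplier sub-unit 733. The 4 outputs of the dividing sub-unit 731 are connected to the inputs of the 4 search sub-units 732 respectively, and the outputs of the 4 search sub-units 732 are connected to the inputs of the floating-point multiplier sub-unit 733. The dividing sub-unit 731 splits the 23-bit binary form of the fractional part decim into 4 parts, $(\mathrm{decim0})_2$, $(\mathrm{decim1})_2$, $(\mathrm{decim2})_2$ and $(\mathrm{decim3})_2$ in order. The 4 search sub-units 732 use table lookup to obtain $(2^{\mathrm{decim0}})_{\mathrm{fp32}}$, $(2^{\mathrm{decim1}})_{\mathrm{fp32}}$, $(2^{\mathrm{decim2}})_{\mathrm{fp32}}$ and $(2^{\mathrm{decim3}})_{\mathrm{fp32}}$ respectively, that is, the floating-point values corresponding to the first 6 bits, the next 6 bits, the following 6 bits and the last 5 bits of the binary form of decim. The floating-point multiplier sub-unit 733 completes the solution of $(2^{\mathrm{decim}})_{\mathrm{fp32}}$ from the lookup results of the 4 search sub-units 732, that is, it multiplies $(2^{\mathrm{decim0}})_{\mathrm{fp32}}$, $(2^{\mathrm{decim1}})_{\mathrm{fp32}}$, $(2^{\mathrm{decim2}})_{\mathrm{fp32}}$ and $(2^{\mathrm{decim3}})_{\mathrm{fp32}}$ together to obtain result_decim. In the example shown in Figure 8, the floating-point multiplier sub-unit 733 includes 3 multipliers, which operate on the lookup results of the 4 search sub-units 732 to obtain result_decim, that is, $(2^{\mathrm{decim}})_{\mathrm{fp32}}$. By splitting the lookup table, this embodiment greatly reduces the amount of table data and therefore the storage resources required.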
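The 6/6/6/5-bit split can be sketched with four small tables. This is an illustration under stated assumptions: table entries are Python doubles rather than fp32 values, the fractional part is passed as a 23-bit integer `decim_bits` (so decim = decim_bits / 2^23), and the pairing of the three multiplications is one possible arrangement, not necessarily that of Figure 8. The names `TABLES` and `pow2_fraction` are hypothetical.

```python
# One table per bit field: 64 + 64 + 64 + 32 = 224 entries in total,
# instead of 2**23 entries for a single undivided table.
TABLES = [
    [2.0 ** (i / 2 ** 6) for i in range(64)],    # first 6 bits  (decim0)
    [2.0 ** (i / 2 ** 12) for i in range(64)],   # second 6 bits (decim1)
    [2.0 ** (i / 2 ** 18) for i in range(64)],   # third 6 bits  (decim2)
    [2.0 ** (i / 2 ** 23) for i in range(32)],   # last 5 bits   (decim3)
]


def pow2_fraction(decim_bits):
    """Compute 2**decim from four lookups and three multiplications."""
    d0 = (decim_bits >> 17) & 0x3F
    d1 = (decim_bits >> 11) & 0x3F
    d2 = (decim_bits >> 5) & 0x3F
    d3 = decim_bits & 0x1F
    # Two products in parallel, then one final product: three multipliers.
    return (TABLES[0][d0] * TABLES[1][d1]) * (TABLES[2][d2] * TABLES[3][d3])
```

The result lies in [1, 2), matching the range the derivation expects for $2^{\mathrm{decim}}$.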
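In one embodiment, the combination unit 75 includes a comparator sub-unit. If the comparator sub-unit determines that the fp32 exponent of $e^x$, obtained by the exponent bit solving unit 74 from the integer part temp, falls within the first preset numerical range or the fourth preset numerical range, it outputs the first fixed value or the second fixed value, respectively, as the exponential operation result of the floating-point datum. If it determines that this exponent falls within the second preset numerical range or the third preset numerical range, it obtains the exponential operation result of the floating-point datum from the corresponding value obtained by the search unit 73. The values in the first, second, third and fourth preset numerical ranges decrease in that order.
In the embodiments of the present application, the comparator sub-unit may include one or more comparators, for example multiple cascaded comparators; the present application does not limit this.
Specifically, if the comparator sub-unit determines that the exponent obtained by the exponent bit solving unit 74 falls within the first preset numerical range, it outputs the first fixed value as the exponential operation result of the floating-point datum. As a non-limiting example, if temp+127 > 255, the hexadecimal value of $(e^x)_{\mathrm{fp32}}$ that is output is 32'h7f80_0000.
If the comparator sub-unit determines that the exponent obtained by the exponent bit solving unit 74 falls within the second preset numerical range, it obtains the exponential operation result of the floating-point datum from the corresponding value obtained by the search unit 73, specifically from the value result_decim obtained by the search unit 73 and the exponent obtained by the exponent bit solving unit 74. As a non-limiting example, if 0 < temp+127 ≤ 255, the hexadecimal value of $(e^x)_{\mathrm{fp32}}$ that is output is {1'b0, temp+127, result_decim[22:0]}.
If the comparator sub-unit determines that the exponent obtained by the exponent bit solving unit 74 falls within the third preset numerical range, it likewise obtains the exponential operation result of the floating-point datum from the value result_decim obtained by the search unit 73 and the exponent obtained by the exponent bit solving unit 74. As a non-limiting example, if -23 < temp+127 ≤ 0, the hexadecimal value of $(e^x)_{\mathrm{fp32}}$ that is output is {1'b0, 8'h00, ({1'b1, result_decim[22:0]} >> (-temp-126))}.
If the comparator sub-unit determines that the exponent obtained by the exponent bit solving unit 74 falls within the fourth preset numerical range, it outputs the second fixed value as the exponential operation result of the floating-point datum. As a non-limiting example, if temp+127 ≤ -23, the hexadecimal value of $(e^x)_{\mathrm{fp32}}$ that is output is 0.
In one embodiment, if the present application is used only to compute the exponential function, the above exponential operation result is quantized to give the final exponential function operation result of the data to be calculated. In practical applications, however, the present application can build on the above steps to accelerate the softmax function as well, which further includes the following steps: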
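S140: Read the exponential operation result of each floating-point datum and accumulate them to obtain the accumulation result.
The exponential operation results of the floating-point data are read from the first memory in order and accumulated to obtain the accumulation result. That is, from the corresponding number of exponential operation results $e^x, e^y, e^z, \ldots$, the sum $e^x + e^y + e^z + \ldots$ is computed. The corresponding number is the number of data to be calculated input to a given preset channel, that is, the dimension of the input array. In the example for channel 0 below, the number of data to be calculated is taken to be 13.
In one embodiment, as shown in Figure 2, the corresponding number of exponential operation results are read from the first memory ram0 and accumulated (Σ) to obtain the accumulation result.
As a non-limiting example, the exponential operation results can be accumulated by multiple adders. The structure of the adder module formed by these adders is shown in Figure 9: the adder module includes multiple adders that form an adder tree. Specifically, the adder module includes 7 floating-point adder units 91 and one accumulator unit 92, so it can sum 8 data at a time. That is, 8 data (exponential operation results) are read consecutively from the first memory ram and summed to give a partial sum, denoted $\mathrm{p\_sum0} = \sum_{i=0}^{7} e^{x_i}$. The partial sum of the next 8 numbers, $\mathrm{p\_sum1} = \sum_{i=8}^{15} e^{x_i}$, is then computed and added to the previous partial sum. The same steps are repeated until every exponential operation result of the channel has been accumulated once, which gives the accumulation result for all the data.
Note that when fewer than 8 data remain, the missing values are padded with zeros. For example, suppose channel 0 receives 13 data to be calculated: a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12. After $e^x$ is computed, the first memory ram holds 13 exponential operation results: $e^{a0}, e^{a1}, e^{a2}, e^{a3}, e^{a4}, e^{a5}, e^{a6}, e^{a7}, e^{a8}, e^{a9}, e^{a10}, e^{a11}, e^{a12}$. The accumulation proceeds as follows. First, 8 numbers, $e^{a0}$ to $e^{a7}$, are read from the first memory ram and summed to give the partial sum p_sum0. Then $e^{a8}, e^{a9}, e^{a10}, e^{a11}, e^{a12}$ are read from the first memory ram. Since only 5 data remain, the 3 missing values are replaced by 0, that is, the sum of $e^{a8}, e^{a9}, e^{a10}, e^{a11}, e^{a12}, 0, 0, 0$ is computed to give the partial sum p_sum1. Adding the previous partial sum p_sum0 gives the accumulation result of all exponential operation results, that is, the sum p_sum. The number of adders in this embodiment can be chosen according to the actual situation and is not limited here.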
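The chunked accumulation with zero padding can be sketched as follows. This is a behavioural illustration, not a model of the circuit in Figure 9: `tree_sum8` mimics a 3-level tree of 7 additions, and the running total in `accumulate` plays the role of the accumulator unit. Both names are hypothetical.

```python
def tree_sum8(values):
    """Sum exactly 8 values pairwise: 4 + 2 + 1 = 7 additions."""
    if len(values) != 8:
        raise ValueError("tree_sum8 expects exactly 8 values")
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def accumulate(exp_results, width=8):
    """Accumulate results 8 at a time, padding the last chunk with zeros."""
    p_sum = 0.0
    for start in range(0, len(exp_results), width):
        chunk = list(exp_results[start:start + width])
        chunk += [0.0] * (width - len(chunk))   # zero padding, e.g. 5 values + 3 zeros
        p_sum += tree_sum8(chunk)               # add this partial sum to the total
    return p_sum
```

With 13 inputs, the loop runs twice: once for the first 8 values (p_sum0) and once for the remaining 5 values plus 3 zeros (p_sum1).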
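S150: Calculate the ratio of the exponential operation result of each floating-point datum to the accumulation result and quantize the ratio to obtain the softmax function operation result.
Based on step S140, the softmax function can be solved from the exponential operation results and the accumulation result obtained above. In one embodiment, the softmax function is
$\mathrm{softmax}(x_j) = \dfrac{e^{x_j}}{\sum_{k} e^{x_k}}$
Based on this formula, let the accumulation result be $\sum_{k} e^{x_k}$. The corresponding exponential operation result, $e^{x_j}$, is read from the first memory, and the ratio of the exponential operation result to the accumulation result then gives the softmax result of the floating-point datum.
In one embodiment, as shown in Figure 2, the exponential operation result of each floating-point datum is read from the first memory ram and divided (div) by the accumulation result to obtain the corresponding ratio.
As a non-limiting example, the division div can be performed by a divider. The process of computing $e^{x_j} / \sum_{k} e^{x_k}$ with the divider is as follows: the $e^x$ results are read from the first memory in order and used as the dividend, and the sum p_sum obtained in step S140 is used as the divisor. For example, for channel 0, the values $e^{a0}, e^{a1}, e^{a2}, e^{a3}, e^{a4}, e^{a5}, e^{a6}, e^{a7}, e^{a8}, e^{a9}, e^{a10}, e^{a11}, e^{a12}$ stored in the first memory ram in the example of step S140 are taken out in order and divided by p_sum, so the output of the divider is $e^{a_i} / \mathrm{p\_sum}$ for $i = 0, 1, \ldots, 12$.
Further, once the ratio of the exponential operation result of the floating-point datum to the accumulation result has been obtained, the ratio must be quantized. Quantization converts the floating-point data into integer data, which gives the softmax function operation result of the data to be calculated.
Here, quantization means converting a floating-point data type into an integer data type. The quantized results are written to the first memory for temporary storage, so that they can be output in parallel once all floating-point data have been quantized. Note that quantization is the inverse of inverse quantization; see the inverse quantization process described above.
In some embodiments, the floating-point ratio is quantized into a ratio of type int8 or int16 and written to the first memory ram for temporary storage, as shown in Figure 2.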
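Putting the steps for one channel together (inverse quantization, exponentiation, accumulation, division, quantization) gives the following sketch. It is a functional reference only and rests on several assumptions: `math.exp` stands in for the split-and-lookup exponent unit, a separate output scale `out_scale` is introduced because the source does not specify the output quantization parameter, and rounding with saturation is assumed. The name `softmax_int` is hypothetical.

```python
import math


def softmax_int(inputs, in_scale, out_scale, bits=8):
    """Integer-in, integer-out softmax for one channel (reference sketch)."""
    xs = [in_scale * v for v in inputs]        # S120: inverse quantization
    exps = [math.exp(x) for x in xs]           # S130: stand-in for the exponent unit
    p_sum = sum(exps)                          # S140: accumulation
    ratios = [e / p_sum for e in exps]         # S150: division
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    # S150: quantization of each ratio (rounding and clamping are assumptions)
    return [max(lo, min(hi, round(r / out_scale))) for r in ratios]
```

With `out_scale = 1 / 128`, four equal inputs each map to 32, that is, a ratio of 0.25.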
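S160: After the softmax function operation results for all data to be calculated have been computed, a write operation is performed on the softmax function operation results.
After the ratios corresponding to the data to be calculated in every channel have been quantized into softmax function operation results, the results of each channel are written to the second memory, which may be a DDR.
In one embodiment, as shown in Figure 2, when the softmax results for the data to be calculated in the 4 channels have been computed, the data reading step reads each channel's softmax function operation results from the first memory ram and then writes them to the second memory DDR in NC4HW format. It should be understood that the quantized ratios of the 4 channels are arranged in the same format as the data to be calculated of the 4 channels, which is not repeated here.
It should be understood that the step numbers in the above embodiments do not imply an order of execution; the execution order of each process is determined by its function and internal logic, and does not limit the implementation of the embodiments of the present application in any way.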
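An embodiment of the present application also provides a hardware system for an exponential function. For details of this hardware system that are not described here, refer to the description of the aforementioned method; they are not repeated here.
Figure 10 is a schematic structural diagram of a hardware system for an exponential function provided by an embodiment of the present invention. The system includes a control module 1010, an exponential function calculation module 1110, a data reading module 1030, a data writing module 1040, a first memory 1050 and a second memory 1060. The control module 1010 is connected to the exponential function calculation module 1110, the data reading module 1030, the data writing module 1040, the first memory 1050 and the second memory 1060 respectively, and controls the timing and amount of data reads and writes as well as the working logic and flow of each module. Specifically, the control module 1010 controls which data the data reading module 1030 reads from the second memory 1060 (for example a DDR), how much it reads and when; it also controls how and when the data writing module 1040 writes the exponential function operation results of the data to be calculated to the second memory 1060. The control module 1010 also controls the workflow of the exponential function calculation module 1110.
The data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the exponential function calculation module 1110. The exponential function calculation module 1110 solves the exponential function operation result of the data to be calculated. There may be one or more exponential function calculation modules 1110. When there is one, the data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the exponential function calculation module 1110; when there are several, the data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the exponential function calculation module 1110 of the corresponding channel.
In some embodiments, as shown in Figure 11, the exponential function calculation module 1110 includes an inverse quantization sub-module 1021, an exponent calculation sub-module 1022 and a quantization sub-module 1025. The inverse quantization sub-module 1021 inversely quantizes the data to be calculated to obtain the corresponding floating-point data. The exponent calculation sub-module 1022 transforms and splits each floating-point datum according to the preset splitting formula to obtain floating-point data whose exponent is the integer part and the fractional part respectively, performs the preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the corresponding calculation results, and recombines the calculation results into the exponential operation result of the floating-point datum. The quantization sub-module 1025 quantizes the exponential operation result to obtain the exponential function operation result of the data to be calculated. For the details of each sub-module of the exponential function calculation module 1110, see steps S120 to S130; they are not repeated here.
In some embodiments, the exponent calculation sub-module 1022 is the exponent calculation module shown in Figure 7, including a base changing unit 71, a splitting unit 72, a search unit 73, an exponent bit solving unit 74 and a combination unit 75; see the preceding description for details, which are not repeated here.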
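An embodiment of the present application also provides a hardware system for the softmax function. For details of this hardware system that are not described here, refer to the description of the aforementioned method; they are not repeated here.
Figure 12 is a schematic structural diagram of a hardware system for the softmax function provided by an embodiment of the present invention. Note that the embodiment shown in Figure 10 shows 4 softmax function calculation modules 1020. In other embodiments, the softmax function hardware system may include only 1 softmax function calculation module 1020, or some other number of them; the number of calculation modules is chosen according to the number of channels, and the present application does not specifically limit it.
As shown in Figure 12, the softmax function hardware system includes a control module 1010, a softmax function calculation module 1020, a data reading module 1030, a data writing module 1040, a first memory 1050 and a second memory 1060. The control module is connected to the softmax function calculation module 1020, the data reading module 1030, the data writing module 1040, the first memory 1050 and the second memory 1060 respectively, and controls the timing and amount of data reads and writes as well as the working logic and flow of each module. Specifically, the control module 1010 controls which data the data reading module 1030 reads from the second memory 1060 (for example a DDR), how much it reads and when; it also controls how and when the data writing module 1040 writes the softmax results of the data to be calculated to the second memory 1060. The control module 1010 also controls the workflow of the softmax function calculation module 1020.
The data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the softmax function calculation module 1020. The softmax function calculation module 1020 solves the softmax function calculation result of the data to be calculated. There may be one or more softmax function calculation modules 1020. When there is one, the data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the softmax function calculation module 1020; when there are several, the data reading module 1030 reads the data to be calculated from the second memory 1060 under the control of the control module 1010 and sends it to the softmax function calculation module 1020 of the corresponding channel.
In some embodiments, as shown in Figure 13, the softmax function calculation module 1020 includes an inverse quantization sub-module 1021, an exponent calculation sub-module 1022, an adder sub-module 1023, a divider sub-module 1024 and a quantization sub-module 1025. The inverse quantization sub-module 1021 inversely quantizes the data to be calculated to obtain the corresponding floating-point data. The exponent calculation sub-module 1022 transforms and splits each floating-point datum according to the preset splitting formula to obtain floating-point data whose exponent is the integer part and the fractional part respectively, performs the preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the corresponding calculation results, and recombines the calculation results into the exponential operation result of the floating-point datum. The adder sub-module 1023 accumulates the exponential operation results corresponding to the floating-point data to obtain the accumulation result. The divider sub-module 1024 calculates the ratio of each stored exponential operation result to the accumulation result. The quantization sub-module 1025 quantizes the ratio to obtain the softmax function operation result. Note that the 4 softmax function calculation modules 1020 in Figure 10 have the same structure; for the details of each sub-module of the softmax function calculation module 1020, see steps S120 to S150, which are not repeated here.
In some embodiments, the exponent calculation sub-module 1022 is the exponent calculation module shown in Figure 7, including a base changing unit 71, a splitting unit 72, a search unit 73, an exponent bit solving unit 74 and a combination unit 75; see the preceding description for details, which are not repeated here.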
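An embodiment of the present application also provides a chip, which includes the aforementioned hardware system for an exponential function and/or hardware system for the softmax function.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, can implement the aforementioned exponential function hardware system and/or the optimization method based on the softmax function.
An embodiment of the present application provides a computer program product which, when run on an electronic device, enables the terminal device to implement the steps in the embodiments of the aforementioned exponential function hardware system and/or the optimization method based on the softmax function.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed or recorded in one embodiment, refer to the relevant descriptions of the other embodiments.
The computer program includes computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, and so on. Computer-readable media may include: any entity or device capable of carrying computer program code, recording media, USB flash drives, removable hard disks, magnetic disks, optical disks, computer memory, read-only memory (ROM), RAM, electrical carrier signals, telecommunications signals, software distribution media, and so on. Note that the content covered by the term computer-readable medium may be expanded or reduced as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.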

Claims (11)

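  1. An optimization method based on an exponential function, characterized by comprising:
    reading data to be calculated and inversely quantizing each item of data to be calculated to obtain the floating-point data corresponding to each item;
    transforming and splitting each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively;
    performing a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombining the calculation results to obtain the exponential operation result of the floating-point datum;
    quantizing the exponential operation result corresponding to each floating-point datum to obtain the exponential function operation result of each item of data to be calculated.
  2. An optimization method based on the softmax function, characterized by comprising:
    reading data to be calculated and inversely quantizing each item of data to be calculated to obtain the floating-point data corresponding to each item;
    transforming and splitting each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively;
    performing a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombining the calculation results to obtain the exponential operation result of the floating-point datum;
    accumulating the exponential operation results corresponding to the floating-point data to obtain an accumulation result, and calculating the ratio of each exponential operation result to the accumulation result;
    quantizing the ratio corresponding to each floating-point datum to obtain the softmax function operation result of each item of data to be calculated.
  3. The optimization method according to claim 1 or 2, characterized in that, before the reading of the data to be calculated, the method further comprises:
    arranging the data that needs to be calculated in a preset arrangement format to obtain the data to be calculated.
  4. The optimization method according to claim 1 or 2, characterized in that the reading data to be calculated and inversely quantizing each item of data to be calculated to obtain the floating-point data corresponding to each item comprises:
    reading the data to be calculated, and sending the corresponding data to be calculated to each preset channel;
    inversely quantizing the corresponding data to be calculated in each preset channel to obtain the floating-point data corresponding to each preset channel.
  5. The optimization method according to claim 1 or 2, characterized in that the transforming and splitting each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively comprises:
    transforming the floating-point data of each item of data to be calculated according to the preset splitting formula to obtain the transformed floating-point data;
    splitting the exponent of the transformed floating-point data into an integer part and a fractional part.
  6. The optimization method according to claim 1 or 2, characterized in that the performing a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombining the calculation results to obtain the exponential operation result of the floating-point datum comprises:
    looking up, by a preset table lookup method, the single-precision floating-point datum whose exponent is the fractional part;
    solving, from the integer part, the exponent of the exponential function in single-precision floating-point representation;
    recombining the single-precision floating-point datum of the fractional part with the exponent, solved from the integer part, of the exponential function in single-precision floating-point representation, to obtain the exponential operation result of the floating-point datum.
  7. The optimization method according to claim 6, characterized in that the recombining the single-precision floating-point datum of the fractional part with the exponent, solved from the integer part, of the exponential function in single-precision floating-point representation, to obtain the exponential operation result of the floating-point datum comprises:
    if it is determined that the exponent, solved from the integer part, of the exponential function in single-precision floating-point representation falls within a first preset numerical range or a fourth preset numerical range, outputting a first fixed value or a second fixed value, respectively, as the exponential operation result of the floating-point datum;
    if it is determined that the exponent, solved from the integer part, of the exponential function in single-precision floating-point representation falls within a second preset numerical range or a third preset numerical range, recombining the single-precision floating-point datum of the fractional part with the integer part to obtain the exponential operation result of the floating-point datum; the values in the first preset numerical range, the second preset numerical range, the third preset numerical range and the fourth preset numerical range decrease in that order.
  8. A hardware system for an exponential function, characterized by comprising:
    a data reading module, configured to read data to be calculated;
    an exponential function calculation module, configured to inversely quantize each item of data to be calculated to obtain the floating-point data corresponding to each item; to transform and split each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively; and
    to perform a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombine the calculation results to obtain the exponential operation result of the floating-point datum; and to quantize the exponential operation result corresponding to each floating-point datum to obtain the exponential function operation result of each item of data to be calculated.
  9. A hardware system for the softmax function, characterized by comprising:
    a data reading module, configured to read data to be calculated;
    a softmax function calculation module, configured to inversely quantize each item of data to be calculated to obtain the floating-point data corresponding to each item; to transform and split each floating-point datum according to a preset splitting formula to obtain floating-point data whose exponent is an integer part and a fractional part respectively; and
    to perform a preset function calculation on the floating-point data of the integer part and of the fractional part respectively to obtain the calculation results corresponding to the integer part and the fractional part, and recombine the calculation results to obtain the exponential operation result of the floating-point datum; to accumulate the exponential operation results corresponding to the floating-point data to obtain an accumulation result, and calculate the ratio of each exponential operation result to the accumulation result; and to quantize the ratio corresponding to each floating-point datum to obtain the softmax function operation result of the data to be calculated.
  10. A chip, characterized by comprising the hardware system for an exponential function according to claim 8 and/or the hardware system for the softmax function according to claim 9.
  11. A computer-readable storage medium, the computer storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the optimization method according to any one of claims 1 to 7.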
PCT/CN2022/100635 2022-03-22 2022-06-23 Optimization method, hardware system and chip based on an exponential function and the softmax function WO2023178860A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210283260.9A CN114610267A (zh) 2022-03-22 2022-03-22 Optimization method, hardware system and chip based on an exponential function and the softmax function
CN202210283260.9 2022-03-22

Publications (1)

Publication Number Publication Date
WO2023178860A1 true WO2023178860A1 (zh) 2023-09-28

Family

ID=81864197

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100635 WO2023178860A1 (zh) 2022-03-22 2022-06-23 Optimization method, hardware system and chip based on an exponential function and the softmax function

Country Status (2)

Country Link
CN (1) CN114610267A (zh)
WO (1) WO2023178860A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114610267A (zh) * 2022-03-22 2022-06-10 奥比中光科技集团股份有限公司 一种基于指数函数和softmax函数的优化方法、硬件系统及芯片
CN114546330B (zh) * 2022-04-26 2022-07-12 成都登临科技有限公司 函数实现方法、逼近区间分段方法、芯片、设备及介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108021537A (zh) * 2018-01-05 2018-05-11 南京大学 一种基于硬件平台的softmax实现方式
CN111240746A (zh) * 2020-01-12 2020-06-05 苏州浪潮智能科技有限公司 一种浮点数据反量化及量化的方法和设备
US20210019116A1 (en) * 2019-07-18 2021-01-21 International Business Machines Corporation Floating point unit for exponential function implementation
CN112685693A (zh) * 2020-12-31 2021-04-20 南方电网科学研究院有限责任公司 一种实现Softmax函数的设备
CN113721884A (zh) * 2021-09-01 2021-11-30 北京百度网讯科技有限公司 运算方法、装置、芯片、电子装置及存储介质
CN114201140A (zh) * 2021-12-16 2022-03-18 千芯半导体科技(北京)有限公司 指数函数处理单元、方法和神经网络芯片
CN114610267A (zh) * 2022-03-22 2022-06-10 奥比中光科技集团股份有限公司 一种基于指数函数和softmax函数的优化方法、硬件系统及芯片

Also Published As

Publication number Publication date
CN114610267A (zh) 2022-06-10

Similar Documents

Publication Publication Date Title
WO2023178860A1 (zh) 一种基于指数函数和softmax函数的优化方法、硬件系统及芯片
US20200218509A1 (en) Multiplication Circuit, System on Chip, and Electronic Device
US10491239B1 (en) Large-scale computations using an adaptive numerical format
TWI701612B (zh) 用於神經網路中激勵函數的電路系統及其處理方法
CN103838860A (zh) 一种基于动态副本策略的文件存储系统及其存储方法
CN111966649A (zh) 一种高效去重的轻量级在线文件存储方法及装置
CN109165006B (zh) Softmax函数的设计优化及硬件实现方法及系统
CN111240746B (zh) 一种浮点数据反量化及量化的方法和设备
CN114612996A (zh) 神经网络模型的运行方法、介质、程序产品以及电子设备
CN113741858A (zh) 存内乘加计算方法、装置、芯片和计算设备
US20210004679A1 (en) Asymmetric quantization for compression and for acceleration of inference for neural networks
Vaeztourshizi et al. An energy-efficient, yet highly-accurate, approximate non-iterative divider
CN110110852B (zh) 一种深度学习网络移植到fpag平台的方法
US20230342419A1 (en) Matrix calculation apparatus, method, system, circuit, and device, and chip
WO2023116400A1 (zh) 向量运算方法、向量运算器、电子设备和存储介质
CN111258633B (zh) 乘法器、数据处理方法、芯片及电子设备
CN107015783B (zh) 一种浮点角度压缩实现方法及装置
US20210357758A1 (en) Method and device for deep neural network compression
CN115827555A (zh) 数据处理方法、计算机设备、存储介质和乘法器结构
CN209895329U (zh) 乘法器
JP2023509121A (ja) 浮動小数点数の乗算計算方法及び機器、並びに算術論理演算装置
CN111384975A (zh) 多进制ldpc解码算法的优化方法、装置及解码器
CN114207609A (zh) 信息处理装置、信息处理系统和信息处理方法
US11907680B2 (en) Multiplication and accumulation (MAC) operator
WO2023124235A1 (zh) 多输入浮点数处理方法、装置、处理器及计算机设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22932912

Country of ref document: EP

Kind code of ref document: A1