WO2021212285A1 - Floating-point accumulation device, method, and computer storage medium - Google Patents

Floating-point accumulation device, method, and computer storage medium

Info

Publication number
WO2021212285A1
WO2021212285A1 PCT/CN2020/085715 CN2020085715W WO2021212285A1 WO 2021212285 A1 WO2021212285 A1 WO 2021212285A1 CN 2020085715 W CN2020085715 W CN 2020085715W WO 2021212285 A1 WO2021212285 A1 WO 2021212285A1
Authority
WO
WIPO (PCT)
Prior art keywords
floating
point
calculation
stage
output
Prior art date
Application number
PCT/CN2020/085715
Other languages
English (en)
French (fr)
Inventor
刘子男
徐功林
韩彬
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司 filed Critical 深圳市大疆创新科技有限公司
Priority to PCT/CN2020/085715 priority Critical patent/WO2021212285A1/zh
Priority to CN202080006248.2A priority patent/CN113168308A/zh
Publication of WO2021212285A1 publication Critical patent/WO2021212285A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483: Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485: Adding; Subtracting

Definitions

  • The present invention relates to the technical field of floating-point calculation, and in particular to a floating-point accumulation device, a floating-point accumulation method, and a computer storage medium.
  • DSP: Digital Signal Processor
  • ALU (arithmetic logic unit): integer arithmetic and logic operations
  • FPU (floating-point unit): floating-point calculations
  • Floating-point numbers are widely used because of their high precision and high dynamic range.
  • A DSP has a floating-point instruction set dedicated to floating-point calculations, including basic floating-point addition, subtraction, and multiplication, as well as more complex operations such as multiply-add and multiply-accumulate. Floating-point addition is more complicated than integer addition, so implementing floating-point accumulation efficiently is an important consideration in the design of a floating-point unit (FPU).
  • Because of the complex structure of floating-point numbers, floating-point addition is usually divided into multiple steps. In hardware, multi-stage pipelines are often used to raise the operating frequency. In a pipelined floating-point adder there is a delay of several cycles between input and output. In accumulation, each floating-point input therefore lags the previous one by several cycles, so the accumulation is slow and the IO (input/output) port is occupied for a long time.
  • A first aspect of the embodiments of the present invention provides a floating-point accumulation device, which includes:
  • a loading module, configured to read N original floating-point numbers and to output them sequentially in N calculation cycles, where N is an integer greater than 3;
  • a control module, configured to determine the calculation stage of the current calculation cycle and, according to the calculation stage, to control at least one multiplexer to send the original floating-point number output by the loading module to the input terminal of a floating-point addition module, or to control the at least one multiplexer to send the intermediate result output from the output terminal of the floating-point addition module to the input terminal of the floating-point addition module;
  • the floating-point addition module, configured to obtain, in each calculation cycle, the two floating-point numbers sent by the multiplexer to the input terminal, to accumulate the two floating-point numbers using an M-stage pipeline structure, and to output the intermediate result or the final result at the output terminal, where M is an integer greater than 1;
  • the calculation stages include a calculation stage regarding the original floating-point numbers and a calculation stage regarding the intermediate results.
  • A second aspect of the embodiments of the present invention provides a floating-point accumulation method, which includes:
  • the calculation stages include a calculation stage regarding the original floating-point numbers and a calculation stage regarding the intermediate results.
  • A third aspect of the embodiments of the present invention provides a computer storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the floating-point accumulation method of the embodiments of the present invention are implemented.
  • The floating-point accumulation method, the floating-point accumulation device, and the computer storage medium of the embodiments of the present invention can increase the speed of floating-point accumulation.
  • Fig. 1 shows a timing diagram of the existing floating-point accumulation process.
  • Fig. 2 shows a structural block diagram of a floating-point accumulation device according to an embodiment of the present invention.
  • Fig. 3 shows a timing diagram of floating-point accumulation implemented by a floating-point accumulation device according to an embodiment of the present invention.
  • Fig. 4 shows a schematic diagram of a three-stage pipeline structure according to an embodiment of the present invention.
  • Fig. 5 shows a flowchart of a floating-point accumulation method according to an embodiment of the present invention.
  • the embodiment of the present invention relates to the accumulation of floating-point numbers.
  • Floating-point numbers generally follow the IEEE 754-2008 standard.
  • The single-precision floating-point number specified by the standard is 32 bits wide and consists of a 1-bit sign S, an 8-bit exponent E, and a 23-bit mantissa M.
  • The value of a floating-point number can be expressed as $(-1)^{S} \times 2^{E-\mathrm{BIAS}} \times (1+M)$, where BIAS is the exponent offset, which is 127 in the single-precision floating-point format.
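The field layout and value formula above can be checked with a short sketch. The function names `decode_float32` and `value_from_fields` are hypothetical, introduced only for illustration, and the formula is evaluated here for normal numbers only (0 < E < 255), with M interpreted as a fraction in [0, 1).

```python
import struct

BIAS = 127  # exponent offset for the single-precision format

def decode_float32(x):
    """Split x (rounded to single precision) into the fields (S, E, M)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31                      # 1-bit sign
    e = (bits >> 23) & 0xFF             # 8-bit exponent
    m = (bits & 0x7FFFFF) / 2**23       # 23-bit mantissa as a fraction in [0, 1)
    return s, e, m

def value_from_fields(s, e, m):
    """Evaluate (-1)^S * 2^(E-BIAS) * (1+M); valid for normal numbers only."""
    return (-1) ** s * 2.0 ** (e - BIAS) * (1 + m)

s, e, m = decode_float32(-6.5)
print(s, e, m, value_from_fields(s, e, m))  # 1 129 0.625 -6.5
```

Here -6.5 is -1.625 × 2², so the stored exponent is 2 + 127 = 129 and the fractional mantissa is 0.625.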
  • Because of the complex structure of floating-point numbers, floating-point addition usually includes multiple steps.
  • A multi-stage pipeline structure is often used to raise the operating frequency, with each pipeline stage occupying one cycle. For example, referring to Fig. 1, with a three-stage floating-point addition pipeline the delay from input to output is 3 cycles (i.e., clock cycles): if two floating-point numbers fp1 and fp2 are input at time T0, the result of fp1+fp2 is obtained at time T3.
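This latency behaviour can be sketched as a small cycle-level model. `AddPipeline` and `serial_accumulate` are hypothetical names introduced for illustration; the model assumes one pipeline stage per clock cycle, tracks timing only, and uses Python's `+` in place of a real adder. Under those assumptions, a pair issued at T0 emerges at T3, and accumulating N numbers one dependent addition at a time takes 3(N-1) cycles, which is the figure the description gives later for the existing scheme.

```python
from collections import deque

class AddPipeline:
    """Timing-only model of an M-stage pipelined adder (hypothetical helper)."""

    def __init__(self, stages=3):
        # regs[0] is the output register; newly issued pairs enter at the far end.
        self.regs = deque([None] * stages)

    def output(self):
        """Value visible at the output terminal in the current cycle."""
        return self.regs[0]

    def clock(self, a=None, b=None):
        """Advance one cycle, optionally issuing the pair (a, b)."""
        self.regs.popleft()
        self.regs.append(None if a is None else a + b)

def serial_accumulate(values, stages=3):
    """Existing scheme: every addition waits for the previous result."""
    pipe = AddPipeline(stages)
    acc, cycles = values[0], 0
    for v in values[1:]:
        pipe.clock(acc, v)              # issue acc + v
        cycles += 1
        while pipe.output() is None:    # stall until the sum emerges
            pipe.clock()
            cycles += 1
        acc = pipe.output()
    return acc, cycles

total, cycles = serial_accumulate([1.0, 2.0, 3.0, 4.0, 5.0])
print(total, cycles)  # 15.0 12, i.e. 3 * (5 - 1) cycles
```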
  • the embodiments of the present invention provide an improved floating point accumulation device, method, and computer storage medium.
  • the floating point accumulation device, method, and computer storage medium of the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. In the case of no conflict, the following embodiments and features in the implementation can be combined with each other.
  • Fig. 2 shows a structural block diagram of a floating point accumulation device 200 according to an embodiment of the present invention.
  • the floating point accumulation device 200 includes a loading module 210, a control module 220, and a floating point addition module 230, where:
  • the loading module 210 is configured to read N original floating-point numbers, and successively output the N original floating-point numbers in N calculation cycles respectively, where N is an integer greater than 3;
  • the control module 220 is configured to determine the calculation stage of the current calculation cycle and, according to the calculation stage, to control at least one multiplexer (for example, the multiplexers MUX1 and MUX2) to send the original floating-point number output by the loading module 210 to the input terminal of the floating-point addition module 230, or to control the at least one multiplexer to send the intermediate result output from the output terminal of the floating-point addition module 230 to its input terminal, wherein the calculation stages include a calculation stage regarding the original floating-point numbers and a calculation stage regarding the intermediate results;
  • the floating-point addition module 230 is configured to obtain, in each calculation cycle, the two floating-point numbers sent by the multiplexers to the input terminal, to accumulate them using an M-stage pipeline structure, and to output the intermediate result or the final result at the output terminal, where M is an integer greater than 1.
  • The calculation cycle is the time required to perform one floating-point operation, and it is related to the clock cycle.
  • In one embodiment, the calculation cycle is one clock cycle. However, the present invention is not limited to this; in other embodiments, the calculation cycle may be multiple clock cycles.
  • The floating-point accumulation device 200 implements floating-point accumulation in stages: according to the current calculation stage, the control module 220 chooses whether the original floating-point numbers or the intermediate results are sent to the input terminal of the floating-point addition module 230.
  • Some of the calculation stages are calculation stages regarding the original floating-point numbers. In these stages, the control module 220 controls the multiplexer to send the original floating-point numbers to the input terminal of the floating-point addition module 230, so as to obtain the intermediate results from the accumulation of the original floating-point numbers.
  • The remaining calculation stages are calculation stages regarding the intermediate results. In these stages, the control module 220 controls the multiplexer to send the intermediate results to the input terminal of the floating-point addition module 230, where they are accumulated to obtain the final result.
  • The first calculation stage is the calculation stage regarding the original floating-point numbers, and the second to Mth calculation stages are the calculation stages regarding the intermediate results.
  • In the first calculation stage, the control module 220 directs the original floating-point numbers to be sent in turn to the input terminal of the floating-point addition module, so that each group of original floating-point numbers is accumulated and M first intermediate results are obtained, one for each of the M groups. In the first calculation stage, the M first intermediate results are output as the intermediate results at the output terminal of the floating-point addition module.
  • In each of the second to Mth calculation stages, the control module 220 directs the intermediate result output by the previous calculation stage and one of the M first intermediate results to be sent to the input terminal of the floating-point addition module for accumulation, yielding a plurality of second intermediate results. In the second to Mth calculation stages, these second intermediate results are output at the output terminal of the floating-point addition module as the intermediate results or the final result.
  • the floating point accumulation device 200 of the embodiment of the present invention makes full use of the delay characteristic of floating point accumulation, which greatly reduces the calculation time and the IO occupation time.
  • the floating-point accumulating device 200 of the embodiment of the present invention has a small hardware resource overhead, which saves chip area and reduces chip cost, and can be easily expanded into a vector structure, which meets the arithmetic characteristics of DSP.
  • The floating-point accumulation device 200 performs floating-point accumulation in calculation stages. With reference to Fig. 3, the functions of the loading module 210, the control module 220, and the floating-point addition module 230 are described in detail below, following the sequence of calculation stages.
  • the control module 220 controls at least one multiplexer to send N original floating-point numbers to the floating-point addition module 230 cycle by cycle.
  • the original floating-point numbers are the floating-point numbers to be accumulated.
  • the floating-point number can be a 32-bit single-precision floating-point number specified by the IEEE 754-2008 standard.
  • the loading module 210 is responsible for reading the N original floating-point numbers.
  • the loading module 210 reads an original floating-point number in each cycle of the first calculation stage and sends it to the floating-point addition module 230 in turn.
  • In each cycle, the floating-point addition module 230 reads one original floating-point number. Once a stage of the multi-stage pipeline has processed a pair of floating-point numbers and passed it to the next stage, it immediately receives the pair input at the next moment. In previous floating-point accumulation schemes, by contrast, a pair of floating-point numbers passes through all stages and the accumulation result is output before a new pair is received, which causes a delay of several cycles between the reading of two adjacent pairs.
  • The loading module 210 sends the floating-point number fp1 to the floating-point addition module 230 at T0, sends fp2 at T1, and so on until fpN is sent at T(N-1); that is, it takes N cycles in total for the loading module 210 to send all N original floating-point numbers to be accumulated.
  • For example, the original floating-point numbers include five floating-point numbers fp1, fp2, fp3, fp4, and fp5, and the loading module 210 sends fp1 to fp5 sequentially to the floating-point addition module 230.
  • For example, the floating-point accumulation device 200 may include two multiplexers, illustrated in Fig. 2 as the first multiplexer MUX1 and the second multiplexer MUX2. At each moment, MUX1 and MUX2 each send one floating-point number to the input terminal of the floating-point addition module.
  • The control module 220 exploits the multi-stage structure of the floating-point addition module 230: by controlling the order in which floating-point numbers enter the module, it divides the N original floating-point numbers into multiple groups and accumulates each group separately, obtaining multiple first intermediate results as the grouped intermediate results of the groups.
  • The grouping and accumulation of the original floating-point numbers can be realized by the control module 220 controlling MUX1 and MUX2 to send different floating-point numbers to the floating-point addition module 230 in different calculation cycles.
  • Specifically, when the control module 220 determines that the current cycle is in the first calculation stage, it controls the first multiplexer MUX1 to send the N original floating-point numbers from the loading module 210 to the floating-point addition module 230. When the control module 220 determines that the current calculation period belongs to the first part of the calculation periods of the first calculation stage, it controls the second multiplexer MUX2 to send 0 to the input terminal; when it determines that the current calculation period belongs to the second part of the calculation periods of the first calculation stage, it controls MUX2 to send the intermediate result output at the output terminal in the current calculation period to the input terminal.
  • the first part of the calculation period of the first calculation stage is the first M calculation periods.
  • the calculation period is one clock period.
  • the present invention is not limited to this. In other embodiments, the calculation period may also be multiple clock periods.
  • The number of groups of original floating-point numbers equals the number of stages of the pipeline structure of the floating-point addition module 230.
  • That is, when the floating-point addition module 230 adopts an M-stage pipeline, the original floating-point numbers are also divided into M groups, thereby making full use of the delay characteristics of floating-point addition.
  • For example, the floating-point addition module 230 adopts a three-stage pipeline, in which case M = 3.
  • In the first M calculation cycles, the first multiplexer MUX1, under the control of the control module 220, sends the original floating-point numbers fp1 to fpM from the loading module 210 to the floating-point addition module 230 in sequence, while the second multiplexer MUX2 sends 0 to the input terminal of the floating-point addition module 230. Since the delay of the floating-point addition module 230 is M cycles, at time T(M) its output terminal outputs the result of fp1+0, that is, fp1.
  • At the same time T(M), MUX1 sends fp(M+1) to the input terminal of the floating-point addition module 230 and MUX2 sends fp1 to the input terminal, so that fp1 and fp(M+1) are accumulated. At time T(M+1), the output terminal outputs the result of fp2+0, that is, fp2; MUX1 sends fp(M+2) and MUX2 sends fp2 to the input terminal, so that fp2 and fp(M+2) are accumulated, and so on, thereby realizing the grouping and accumulation of the original floating-point numbers.
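The multiplexer schedule of the first calculation stage can be sketched as follows. This is a minimal functional model, not the circuit itself: `first_stage_group_sums` is a hypothetical name, the pipeline is modelled as a queue of M registers with one stage per cycle, Python's `+` stands in for the adder, and the input length is assumed to already be a multiple of M.

```python
from collections import deque

def first_stage_group_sums(values, stages=3):
    """Return the M grouped intermediate results left in the pipeline."""
    assert len(values) % stages == 0, "pad the input to a multiple of M first"
    regs = deque([None] * stages)      # regs[0] is the output register
    for t, fp in enumerate(values):    # MUX1: one original number per cycle
        out = regs.popleft()           # result of the pair issued M cycles ago
        # MUX2: 0 during the first M cycles, afterwards the fed-back output.
        mux2 = 0.0 if t < stages else out
        regs.append(fp + mux2)
    return list(regs)                  # in the order they will be output

# fp1..fp5 plus one placeholder 0, three-stage pipeline
print(first_stage_group_sums([1.0, 2.0, 3.0, 4.0, 5.0, 0.0]))  # [5.0, 7.0, 3.0]
```

With M = 3 the three results are fp1+fp4, fp2+fp5, and fp3+0, matching the grouping in which fp1 is paired with fp(M+1) and fp2 with fp(M+2).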
  • The number of original floating-point numbers in each group needs to be the same; that is, the total number of original floating-point numbers should be an integer multiple of M.
  • When N is not an integer multiple of M, the loading module 210 sends at least one additional floating-point number as an original floating-point number to the floating-point addition module 230, so as to pad the count of original floating-point numbers to an integer multiple of M.
  • The total number of floating-point numbers sent to the floating-point addition module 230, and the number of additional floating-point numbers, are given by expressions that are missing from this text.
  • The value of the at least one additional floating-point number is 0, so that it serves as a placeholder without affecting the accumulation result.
  • The at least one additional floating-point number used as a placeholder may be sent in at least one calculation cycle at the end of the first calculation stage.
  • The at least one additional floating-point number used as a placeholder can be sent to the floating-point addition module 230 by the second multiplexer MUX2.
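The padding step can be sketched in a few lines. The expressions for the padded count did not survive in this text, so the sketch assumes the smallest multiple of M that is not less than N; `pad_to_multiple` is a hypothetical helper name.

```python
def pad_to_multiple(values, stages):
    """Append placeholder zeros so that len(result) is a multiple of `stages`.

    Assumption: pad to the smallest such multiple (not stated in the source).
    """
    extra = -len(values) % stages      # 0 when the length already fits
    return list(values) + [0.0] * extra

print(pad_to_multiple([1.0, 2.0, 3.0, 4.0, 5.0], 3))
# [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]
```

Because the placeholders are appended at the end, they are issued in the last cycles of the first calculation stage, as the text describes.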
  • For example, the floating-point addition module 230 uses a three-stage pipeline structure to add two floating-point numbers. Specifically, referring to Fig. 4, the first pipeline stage first detects the type of each input floating-point number and then swaps the two numbers if necessary, so that the one with the larger absolute value always comes first. It then performs exponent alignment, making the exponent of the number with the smaller absolute value equal to that of the number with the larger absolute value, and at the same time determines the new operator (the effective addition or subtraction) and the sign bit of the result.
  • The second pipeline stage performs the mantissa addition (or subtraction). For mantissa subtraction, leading zero anticipation (LZA) is also required to predict the number of leading zeros in the result.
  • The third pipeline stage first performs normalization: when the new operator is subtraction, the mantissa needs to be shifted left, and when the new operator is addition, it needs to be shifted right. Exception detection and rounding are then performed, and the final result is output. Exception detection checks whether the calculation result is abnormal, for example whether it exceeds the normal value range.
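The three stages can be illustrated with a deliberately simplified sketch. All function names are hypothetical, and the sketch is far from a complete adder: it assumes normal, finite, nonzero single-precision inputs, omits type detection, LZA, rounding, and exception detection, and simply drops the bits shifted out during alignment.

```python
import struct

def unpack(x):
    """Return (sign, exponent, 24-bit mantissa with the hidden 1 restored)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, (bits & 0x7FFFFF) | 0x800000

def stage1(x, y):
    """Swap so |a| >= |b|, align exponents, pick the operator and sign."""
    a, b = unpack(x), unpack(y)
    if (a[1], a[2]) < (b[1], b[2]):
        a, b = b, a
    mant_b = b[2] >> (a[1] - b[1])     # alignment; shifted-out bits are lost
    subtract = a[0] != b[0]            # new operator: effective subtraction?
    return a[0], a[1], a[2], mant_b, subtract

def stage2(sign, exp, mant_a, mant_b, subtract):
    """Mantissa addition or subtraction."""
    return sign, exp, (mant_a - mant_b if subtract else mant_a + mant_b)

def stage3(sign, exp, mant):
    """Normalize and repack (no rounding or exception detection here)."""
    if mant == 0:
        return 0.0
    while mant >= 1 << 24:             # carry after addition: shift right
        mant >>= 1
        exp += 1
    while mant < 1 << 23:              # cancellation after subtraction: shift left
        mant <<= 1
        exp -= 1
    bits = (sign << 31) | (exp << 23) | (mant & 0x7FFFFF)
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp_add(x, y):
    return stage3(*stage2(*stage1(x, y)))

print(fp_add(1.5, 2.25), fp_add(5.0, -4.75))  # 3.75 0.25
```

The two printed cases exercise the right-shift-free addition path and the left-shift normalization after subtraction; both results are exactly representable, so the missing rounding step does not matter for them.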
  • In the first calculation stage, the first multiplexer MUX1 sequentially sends the floating-point numbers fp1 to fp5 and the placeholder floating-point number 0 to the input terminal of the floating-point addition module 230.
  • At times T0 to T2, the second multiplexer MUX2 sends 0 to the input terminal of the floating-point addition module 230. At time T3, the output terminal of the floating-point addition module 230 outputs the calculation result (regout) of fp1+0, that is, the floating-point number fp1, which MUX2 then sends back to the input terminal.
  • The fp1 and 0 input at time T0 are processed by the first pipeline stage of the three-stage pipeline during T0-T1, by the second stage during T1-T2, and by the third stage during T2-T3, and are output at T3. Likewise, the fp2 and 0 input at T1 are processed by the first stage during T1-T2, by the second stage during T2-T3, and by the third stage during T3-T4, and are output at T4, and so on.
  • Viewed stage by stage, the first pipeline stage processes fp1 and 0 during T0-T1, fp2 and 0 during T1-T2, and fp3 and 0 during T2-T3.
  • The second pipeline stage processes fp1 and 0 during T1-T2, fp2 and 0 during T2-T3, and fp3 and 0 during T3-T4, and so on. That is, each stage of the three-stage pipeline processes a different pair of floating-point numbers at each moment, so that the computing resources are fully utilized and the computing time is shortened.
  • In the first calculation stage, the floating-point addition module 230 groups and accumulates the original floating-point numbers to obtain a plurality of first intermediate results as the grouped intermediate results. Because of the delay of floating-point addition, some grouped intermediate results are output only at some time during the second to Mth calculation stages.
  • In each of the second to Mth calculation stages, the control module 220 directs the first multiplexer MUX1 and the second multiplexer MUX2 to send, respectively, the intermediate result output by the previous calculation stage and one of the M first intermediate results output by the first calculation stage to the input terminal of the floating-point addition module 230 for accumulation, yielding a plurality of second intermediate results. These second intermediate results are output at the output terminal of the floating-point addition module 230 as the intermediate results or the final result.
  • In the second calculation stage, the control module 220 controls the first multiplexer MUX1 and the second multiplexer MUX2 to send, respectively, the grouped intermediate result of a first group of the M groups and the grouped intermediate result of a second group of the M groups to the input terminal of the floating-point addition module 230 for accumulation, where the first group is different from the second group.
  • The first group and the second group may be two adjacent groups, that is, groups whose grouped intermediate results are output one after the other. Note that the two grouped intermediate results may not be output at adjacent moments; there may be a delay of several calculation cycles between them. The grouped intermediate result of the earlier group can therefore be temporarily registered after it is output.
  • When the grouped intermediate result of the later group is output, the control module 220 can control MUX1 to fetch the grouped intermediate result of the earlier group from the intermediate register module 240 and send it to the input terminal of the floating-point addition module 230, and control MUX2 to send the grouped intermediate result of the later group to the input terminal, so that the two grouped intermediate results are accumulated.
  • The intermediate register module 240 includes a plurality of intermediate registers, each used to register an intermediate result or a grouped intermediate result.
  • For example, the intermediate register module 240 may include two 32-bit registers for registering mid_a and mid_c, respectively.
  • The control module 220 controls the first multiplexer MUX1 to send mid_a, registered in the intermediate register module 240, to the input terminal of the floating-point addition module 230, and controls the second multiplexer MUX2 to send mid_b, output from the output terminal of the floating-point addition module 230, back to the input terminal, so as to calculate mid_a+mid_b, that is, fp1+fp4+fp2+fp5.
  • Since there is a delay of one calculation cycle between the output of mid_a and its use as an input, mid_a needs to be registered in the intermediate register module 240 for one cycle. There is no delay between the output and the input of mid_b, so its register can be hidden in the output register of the floating-point addition module 230, and no register for mid_b needs to be provided in the intermediate register module 240.
  • In the calculation stages after the second, the grouped intermediate results are accumulated step by step; that is, a new grouped intermediate result is accumulated onto the intermediate result output by the previous calculation stage.
  • When the control module 220 determines that the current calculation cycle is in a calculation stage after the second calculation stage, it controls the first multiplexer MUX1 and the second multiplexer MUX2 to send the intermediate result output by the previous calculation stage and a grouped intermediate result registered in the intermediate register module 240 to the input terminal of the floating-point addition module 230 for accumulation, until all grouped intermediate results have been accumulated and the final result is output.
  • For example, when mid_c is output, the calculation result of mid_a+mid_b has not yet been output, so mid_c is registered in the intermediate register module 240.
  • The calculation result of the second calculation stage is output at time T10, so mid_c is registered in the intermediate register module 240 for two cycles.
  • The control module 220 then controls the first multiplexer MUX1 and the second multiplexer MUX2 to send mid_c and mid_a+mid_b, respectively, to the input terminal of the floating-point addition module 230, and the final calculation result (mid_a+mid_b)+mid_c, that is, (fp1+fp4)+(fp2+fp5)+fp3, is output at T13.
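The order in which the additions are associated across all calculation stages can be summarized in a functional sketch. It reproduces only the arithmetic grouping, not the cycle timing or the register behaviour; `grouped_accumulate` is a hypothetical name, and padding to the smallest multiple of M is an assumption.

```python
def grouped_accumulate(values, stages=3):
    """Return (final_result, grouped_results) in the grouped association order."""
    padded = list(values) + [0.0] * (-len(values) % stages)
    # First calculation stage: group g accumulates fp(g+1), fp(g+1+M), ...
    group_sums = []
    for g in range(stages):
        acc = 0.0                      # MUX2 supplies 0 for the first element
        for fp in padded[g::stages]:
            acc = acc + fp
        group_sums.append(acc)
    # Second to Mth calculation stages: fold in one grouped result per stage.
    result = group_sums[0]
    for mid in group_sums[1:]:
        result = result + mid
    return result, group_sums

result, mids = grouped_accumulate([1.0, 2.0, 3.0, 4.0, 5.0])
print(mids, result)  # [5.0, 7.0, 3.0] 15.0
```

For the five-number example this yields mid_a = fp1+fp4, mid_b = fp2+fp5, mid_c = fp3, and the final result (mid_a+mid_b)+mid_c.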
  • The floating-point accumulation device 200 adopts a grouped accumulation method.
  • In the first calculation stage, two floating-point numbers are sent to the input terminal of the floating-point addition module 230 at each moment for accumulation, and each pipeline stage of the floating-point addition module processes a different pair of floating-point numbers at each moment, thereby making full use of computing resources. In the subsequent calculation stages, only a small number of grouped intermediate results need to be accumulated, which greatly reduces the calculation time.
  • the floating-point accumulation device 200 of the embodiment of the present invention only adds an intermediate register module 240 for storing intermediate results, and there is basically no other hardware resource overhead, which saves chip area and reduces chip cost.
  • The floating-point accumulation device of the embodiment of the present invention can shorten the calculation time and IO time for accumulating N floating-point numbers from 3(N-1) cycles to a smaller number of cycles (the expression is missing from this text).
  • When N is large, this number can be regarded as roughly N, which is close to the fastest achievable calculation speed. In particular, the IO time is reduced, so the IO requirements of other components are not affected.
  • FIG. 5 shows a flowchart of a floating point accumulation method 500 according to an embodiment of the present invention.
  • The floating-point accumulation method 500 can be implemented by the floating-point accumulation device 200 described above. Only the main steps of the floating-point accumulation method 500 are described below; for further details, see above.
  • the floating point accumulation method 500 includes the following steps:
  • Step S510: read N original floating-point numbers and output them successively in N calculation cycles, where N is an integer greater than 3;
  • Step S520: determine the calculation stage of the current calculation cycle and, according to the calculation stage, select either the original floating-point number or the intermediate result output by the floating-point addition as the input of the floating-point addition, wherein the calculation stages include a calculation stage regarding the original floating-point numbers and a calculation stage regarding the intermediate results;
  • Step S530: in each calculation cycle, obtain two floating-point numbers as the input of the floating-point addition, accumulate the two floating-point numbers with an M-stage pipeline structure, and output the intermediate result or the final result, where M is an integer greater than 1.
  • the floating point accumulation method 500 of the embodiment of the present invention divides the calculation process into several calculation stages, and step S510 is executed in the first calculation stage, that is, an original floating point number is output in each calculation cycle of the first calculation stage.
  • The number of calculation stages is equal to the number of stages of the pipeline structure; that is, the calculation process is divided into M calculation stages.
  • In step S520 and step S530, different floating-point numbers are accumulated in different calculation stages.
  • The first calculation stage is the calculation stage regarding the original floating-point numbers, and the second to Mth calculation stages are the calculation stages regarding the intermediate results.
  • In the first calculation stage, the original floating-point numbers are sent in turn to the input terminal of the floating-point addition module so that each group of original floating-point numbers is accumulated and M first intermediate results are obtained, one for each of the M groups. In the first calculation stage, the M first intermediate results are output as the intermediate results at the output terminal of the floating-point addition module.
  • In each of the second to Mth calculation stages, the intermediate result output by the previous calculation stage and one of the M first intermediate results are sent to the input terminal of the floating-point addition module for accumulation, yielding a plurality of second intermediate results. In the second to Mth calculation stages, these second intermediate results are output at the output terminal of the floating-point addition module as the intermediate results or the final result.
  • in each cycle of the first calculation stage, one original floating-point number is read and accumulated using a multi-stage pipeline structure.
  • as soon as a stage of the multi-stage pipeline has finished processing two original floating-point numbers and passed them to the next stage, it receives the two floating-point numbers input at the next moment; in previous floating-point accumulation schemes, by contrast, a new pair of floating-point numbers is accepted only after the previous pair has passed through all stages and the accumulation result has been output, which causes a delay of several cycles between the reading of two adjacent pairs of floating-point numbers.
  • because floating-point addition needs to read two inputs at each moment, in each calculation cycle of the first calculation stage another floating-point number, chosen according to the calculation cycle, is used as the second input of the floating-point addition alongside the original floating-point number supplied in sequence.
  • the N original floating-point numbers are divided into multiple groups, and each group of original floating-point numbers is accumulated separately, so that M first intermediate results are obtained as the group intermediate results of the M groups.
  • the number of groups of original floating-point numbers equals the number of stages of the pipeline structure.
  • when an M-stage pipeline is used, the original floating-point numbers are likewise divided into M groups, so as to make full use of the latency of floating-point addition.
  • when a three-stage pipeline is used, the original floating-point numbers are divided into three groups, and each group is accumulated separately to obtain its group intermediate result.
  • the number of original floating-point numbers in each group needs to be the same, that is, the total number of original floating-point numbers should be an integer multiple of M.
  • when N is not an integer multiple of M
  • at least one additional floating-point number is sent as an original floating-point number, so that the number of original floating-point numbers is padded to an integer multiple of M.
  • the value of the at least one additional floating-point number is 0, so that it fills the slot without affecting the accumulation result.
  • the at least one additional floating-point number used as padding may be sent in the last calculation cycle or cycles of the first calculation stage.
  • a three-stage pipeline structure is used to add two floating-point numbers.
  • the type of the input floating-point numbers is detected first, and the two floating-point numbers are then swapped if necessary so that the one with the larger absolute value always comes first.
  • exponent alignment is then performed so that the exponent of the floating-point number with the smaller absolute value equals that of the floating-point number with the larger absolute value, and the new operator and the sign bit of the result are calculated at the same time.
  • mantissa addition or subtraction
  • leading zero anticipation (LZA)
  • the N original floating-point numbers are divided into 3 groups.
  • the first stage of the three-stage pipeline processes the first group of floating-point numbers in the first calculation cycle, the second group in the second calculation cycle, the third group in the third calculation cycle, and so on; the second stage of the pipeline processes the first group in the second calculation cycle, the second group in the third calculation cycle, the third group in the fourth calculation cycle, and so on; that is, each stage of the three-stage pipeline processes different floating-point numbers at each moment, which makes full use of the computing resources and shortens the calculation time.
  • in the first calculation stage, multiple original floating-point numbers are accumulated in groups to obtain multiple group intermediate results. Because of the latency of floating-point addition, some of the group intermediate results are output at some moment during the second to Mth calculation stages.
  • in each of the second to Mth calculation stages, the intermediate result output by the previous calculation stage and one of the M first intermediate results are sent to the input of the floating-point addition module and accumulated to obtain a plurality of second intermediate results, wherein, in the second to Mth calculation stages, the second intermediate results are output at the output of the floating-point addition module as the intermediate results or the final result.
  • when the current calculation cycle is determined to be in the second calculation stage, the group intermediate result of a first group of the M groups and the group intermediate result of a second group of the M groups are used as the inputs of the floating-point addition and accumulated, where the first group is different from the second group.
  • the first group and the second group may be two adjacent groups, that is, their group intermediate results are output one after the other. Note that these two group intermediate results are not necessarily output at adjacent moments; there may be a delay of several calculation cycles between them. The group intermediate result of the earlier group can therefore be held temporarily in a register after it is output; when the group intermediate result of the later group is output, the earlier result is read from the register and used as one input of the floating-point addition while the later result is fed in as the other input, so that the two group intermediate results are accumulated.
  • the group intermediate results are accumulated step by step, that is, a new group intermediate result is added to the intermediate result output by the previous calculation stage. Specifically, if the current calculation cycle is determined to be in a calculation stage after the second calculation stage, the intermediate result output by the previous calculation stage and a group intermediate result held in the register are used as the inputs of the floating-point addition and accumulated, until all group intermediate results have been accumulated.
  • when M = 3, the number of calculation stages is three.
  • the third calculation stage accumulates the intermediate result output by the second calculation stage with the group intermediate result of the third group to obtain the final result.
  • the floating-point accumulation method 500 adopts a grouped accumulation scheme, which greatly reduces the calculation time.
  • when a three-stage pipeline is used for floating-point addition, the floating-point accumulation method 500 of the embodiment of the present invention can shorten the calculation time and IO time for accumulating N floating-point numbers from 3(N-1) cycles to the number of cycles given by a formula image in the description (Figure PCTCN2020085715-appb-000005).
  • when N is large, that number of cycles (Figure PCTCN2020085715-appb-000006) can be regarded as approximately N, which is essentially the fastest achievable calculation speed; in particular the IO time is reduced, so the IO needs of other components are not affected.
  • the embodiment of the present invention also provides a computer storage medium on which a computer program is stored.
  • the computer storage medium is a computer-readable storage medium.
  • the computer storage medium may include, for example, a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media.
  • the computer-readable storage medium may be any combination of one or more computer-readable storage media.
  • the computer program instructions stored on the computer storage medium cause the computer or the processor to perform the following steps when being executed by the computer or the processor:
  • the calculation stage includes a calculation stage regarding the original floating-point number and a calculation stage regarding the intermediate result.
  • an embodiment of the present invention also provides a computer program product, which contains instructions, which when executed by a computer, cause the computer to execute the steps of the floating point accumulation method 500 shown in FIG. 5.
  • the floating-point accumulation method, the floating-point accumulation device, and the computer storage medium of the embodiments of the present invention can increase the speed of floating-point accumulation and shorten the IO occupation time, with a small hardware overhead that saves chip area and reduces chip cost.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
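  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave).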
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)), etc.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present invention essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present invention.
  • the aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
  • the various component embodiments of the present invention may be implemented by hardware, or by software modules running on one or more processors, or by a combination of them.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules according to the embodiments of the present invention.
  • the present invention can also be implemented as a device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein.
  • Such a program for realizing the present invention may be stored on a computer-readable medium, or may have the form of one or more signals.
  • Such a signal can be downloaded from an Internet website, or provided on a carrier signal, or provided in any other form.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Nonlinear Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Advance Control (AREA)

Abstract

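A floating-point accumulation device, method, and computer storage medium. The floating-point accumulation device comprises: a loading module configured to read N original floating-point numbers and output them one after another, consecutively, in N calculation cycles, where N is an integer greater than 3 (S510); a control module configured to determine the calculation stage to which the current calculation cycle belongs and, according to the calculation stage, control at least one multiplexer to send either the original floating-point number output by the loading module or the intermediate result output at the output of the floating-point addition module to the input of the floating-point addition module (S520); and a floating-point addition module configured to obtain two floating-point numbers at its input in each calculation cycle, accumulate the two floating-point numbers with an M-stage pipeline structure, and output the intermediate result or the final result at its output, where M is an integer greater than 1 (S530). The device can increase the speed of floating-point accumulation.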

Description

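Floating-point accumulation device, method, and computer storage medium
Technical Field
The present invention relates to the technical field of floating-point calculation, and in particular to a floating-point accumulation device, method, and computer storage medium.
Background
A digital signal processor (DSP) is a chip dedicated to data-intensive computation. It contains different circuit structures for different functions, including an ALU for integer arithmetic and logic operations and an FPU for floating-point operations. Floating-point numbers are very widely used because of their high precision and high dynamic range, and a DSP has a dedicated floating-point instruction set that covers basic operations such as floating-point addition, subtraction, and multiplication as well as more complex operations such as multiply-add, add-multiply, and accumulation. Floating-point addition is more complex than integer addition, and implementing floating-point accumulation efficiently is an important topic in the design of the floating-point unit (Float Point Unit, FPU).
Because the structure of a floating-point number is complex, floating-point addition is usually divided into several steps, and hardware implementations often use a multi-stage pipeline to raise the operating frequency. In a pipelined floating-point adder there is a delay of several cycles between input and output, so each floating-point input has to lag the previous one by several cycles. As a result, accumulation is slow and the IO (input/output) ports are occupied for a long time.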
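Summary
A series of concepts in simplified form is introduced in this Summary and is described in further detail in the Detailed Description. This Summary is not intended to identify the key features or essential technical features of the claimed technical solution, nor to determine the scope of protection of the claimed technical solution.
To address the shortcomings of the prior art, a first aspect of the embodiments of the present invention provides a floating-point accumulation device, the floating-point accumulation device comprising:
a loading module configured to read N original floating-point numbers and output the N original floating-point numbers one after another, consecutively, in N calculation cycles, where N is an integer greater than 3;
a control module configured to determine the calculation stage to which the current calculation cycle belongs and, according to the calculation stage, control at least one multiplexer to send the original floating-point number output by the loading module to the input of a floating-point addition module, or control the at least one multiplexer to send the intermediate result output at the output of the floating-point addition module to the input of the floating-point addition module;
a floating-point addition module configured to obtain, at the input in each calculation cycle, the two floating-point numbers sent by the multiplexer, accumulate the two floating-point numbers with an M-stage pipeline structure, and output the intermediate result or the final result at the output, where M is an integer greater than 1;
wherein the calculation stages include a calculation stage on the original floating-point numbers and a calculation stage on the intermediate results.
A second aspect of the embodiments of the present invention provides a floating-point accumulation method, the floating-point accumulation method comprising:
reading N original floating-point numbers and outputting the N original floating-point numbers one after another, consecutively, in N calculation cycles, where N is an integer greater than 3;
determining the calculation stage to which the current calculation cycle belongs and, according to the calculation stage, selecting the original floating-point number as an input of the floating-point addition or feeding the intermediate result output by the floating-point addition back as an input of the floating-point addition;
in each calculation cycle, obtaining two floating-point numbers as the inputs of the floating-point addition, accumulating the two floating-point numbers with an M-stage pipeline structure, and outputting the intermediate result or the final result, where M is an integer greater than 1;
wherein the calculation stages include a calculation stage on the original floating-point numbers and a calculation stage on the intermediate results.
A third aspect of the embodiments of the present invention provides a computer storage medium on which a computer program is stored, the computer program implementing the steps of the floating-point accumulation method of the embodiments of the present invention when executed by a processor.
The floating-point accumulation method, floating-point accumulation device, and computer storage medium of the embodiments of the present invention can increase the speed of floating-point accumulation.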
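Brief Description of the Drawings
The following drawings form part of the present invention and are provided for understanding it. The drawings show embodiments of the present invention and their description, and serve to explain the principles of the present invention.
In the drawings:
FIG. 1 shows a timing diagram of an existing floating-point accumulation process;
FIG. 2 shows a structural block diagram of a floating-point accumulation device according to an embodiment of the present invention;
FIG. 3 shows a timing diagram of the floating-point accumulation performed by a floating-point accumulation device according to an embodiment of the present invention;
FIG. 4 shows a schematic diagram of a three-stage pipeline structure according to an embodiment of the present invention;
FIG. 5 shows a flowchart of a floating-point accumulation method according to an embodiment of the present invention.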
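Detailed Description
To make the objects, technical solutions, and advantages of the present invention more apparent, exemplary embodiments of the present invention are described in detail below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention, and it should be understood that the present invention is not limited by the exemplary embodiments described here. All other embodiments obtained by a person skilled in the art on the basis of the embodiments described herein, without inventive effort, shall fall within the scope of protection of the present invention.
In the following description, numerous specific details are given to provide a more thorough understanding of the present invention. It will be apparent to a person skilled in the art, however, that the present invention may be practiced without one or more of these details. In other instances, some technical features well known in the art are not described, to avoid obscuring the present invention.
It should be understood that the present invention can be implemented in different forms and should not be construed as limited to the embodiments set forth here. Rather, these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the present invention to those skilled in the art.
The terminology used here is for describing specific embodiments only and is not a limitation of the present invention. As used here, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "consist of" and/or "comprise", when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups. As used here, the term "and/or" includes any and all combinations of the associated listed items.
For a thorough understanding of the present invention, detailed steps and detailed structures are set out in the following description to explain the technical solution proposed by the present invention. Preferred embodiments of the present invention are described in detail below; beyond these detailed descriptions, however, the present invention may have other embodiments.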
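The embodiments of the present invention concern the accumulation of floating-point numbers. Floating-point numbers generally follow the IEEE 754-2008 standard. The single-precision format defined by that standard is 32 bits wide and consists of a 1-bit sign S, an 8-bit exponent E, and a 23-bit mantissa M. The value of a floating-point number can be expressed as (-1)^S×2^(E-BIAS)×(1+M), where the mantissa bits M are read as a binary fraction and BIAS is the exponent bias, which is 127 in the single-precision format.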
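As a hedged illustration of the single-precision layout described above, the following Python sketch splits a value into its S, E, and M fields and re-evaluates the formula. The function names `decode_float32` and `value_of_normal` are hypothetical, and the formula as written applies only to normal numbers (E neither 0 nor 255); zeros, subnormals, infinities, and NaNs are not handled.

```python
import struct

def decode_float32(x):
    """Return the (sign, biased exponent, fraction) bit fields of x stored as binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 bits
    return sign, exponent, fraction

def value_of_normal(sign, exponent, fraction, bias=127):
    """Evaluate (-1)^S * 2^(E - BIAS) * (1 + M), with M = fraction / 2^23 (normal numbers only)."""
    return (-1) ** sign * 2.0 ** (exponent - bias) * (1 + fraction / 2 ** 23)

print(decode_float32(-6.5))
print(value_of_normal(*decode_float32(-6.5)))
```

Here -6.5 equals -1.625 × 2^2, so the biased exponent is 129 and the fraction field encodes 0.625.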
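Because the structure of a floating-point number is complex, floating-point addition usually comprises several steps, and hardware implementations often use a multi-stage pipeline structure to raise the operating frequency, with each pipeline stage taking one cycle. For example, referring to FIG. 1, when a three-stage floating-point addition pipeline is used, the delay from input to output is 3 cycles (that is, clock cycles): if two floating-point numbers fp1 and fp2 are input at time T0, the result of fp1+fp2 is available only at time T3.
Therefore, when the accumulation fp1+fp2+...+fpN of N floating-point numbers is computed, because the latency of floating-point addition is 3, the input of each floating-point number has to lag the previous one by 3 cycles. That is, fp1 and fp2 are read at T0; the result of fp1+fp2 is obtained and fp3 is read at T3; and so on, until fpN can be read only at time T3(N-2), that is, in cycle 3(N-2), and the accumulated result fp1+fp2+...+fpN can be output only at time T3(N-1). Computation and IO (input/output) each take 3(N-1) cycles. Such a long operation not only slows the floating-point accumulation itself but also occupies the IO for a long time, and prolonged IO occupation interferes with the other components in the DSP and lowers the overall performance of the DSP.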
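The timing of this conventional scheme can be tabulated with a short sketch. The function name `serial_schedule` and its `latency` parameter are illustrative assumptions; the cycle indices follow the description above (fp1 and fp2 read at T0, fpk read at cycle 3(k-2), final sum at cycle 3(N-1)).

```python
def serial_schedule(n, latency=3):
    """Return ({operand index: read cycle}, cycle of the final sum) for the conventional scheme."""
    reads = {1: 0, 2: 0}                  # fp1 and fp2 enter together at T0
    for k in range(3, n + 1):
        reads[k] = latency * (k - 2)      # fpk must wait for the previous partial sum
    return reads, latency * (n - 1)

reads, done = serial_schedule(5)
print(reads, done)
```

For N = 5 the last operand is read at cycle 9 and the sum is available at cycle 12, so the input port stays busy for almost the whole computation.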
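In view of the above problems, the embodiments of the present invention propose an improved floating-point accumulation device, method, and computer storage medium, which are described in detail below with reference to the drawings. Provided there is no conflict, the features of the following embodiments and implementations may be combined with one another.
FIG. 2 shows a structural block diagram of a floating-point accumulation device 200 according to an embodiment of the present invention. As shown in FIG. 2, the floating-point accumulation device 200 includes a loading module 210, a control module 220, and a floating-point addition module 230, wherein:
the loading module 210 is configured to read N original floating-point numbers and output the N original floating-point numbers one after another, consecutively, in N calculation cycles, where N is an integer greater than 3;
the control module 220 is configured to determine the calculation stage to which the current calculation cycle belongs and, according to the calculation stage, control at least one multiplexer (for example, multiplexer MUX1 and multiplexer MUX2) to send the original floating-point number output by the loading module to the input of the floating-point addition module, or control the at least one multiplexer to send the intermediate result output at the output of the floating-point addition module to the input of the floating-point addition module, wherein the calculation stages include a calculation stage on the original floating-point numbers and a calculation stage on the intermediate results;
the floating-point addition module 230 is configured to obtain, at the input in each calculation cycle, the two floating-point numbers sent by the multiplexers, accumulate the two floating-point numbers with an M-stage pipeline structure, and output the intermediate result or the final result at the output, where M is an integer greater than 1. In one implementation, a calculation cycle is the time needed to perform one floating-point operation. The calculation cycle is related to the clock cycle. For example, in one example the calculation cycle is one clock cycle. The present invention is not limited to this, however; in another example the calculation cycle is several clock cycles.
The floating-point accumulation device 200 according to an embodiment of the present invention is configured to perform floating-point accumulation in stages, with the control module 220 choosing, according to the current calculation stage, whether original floating-point numbers or intermediate results are sent to the input of the floating-point addition module 230. Some calculation stages are calculation stages on the original floating-point numbers; in these, the control module 220 controls the multiplexers to send original floating-point numbers to the input of the floating-point addition module 230 so as to obtain intermediate results accumulated from the original floating-point numbers. The remaining calculation stages are calculation stages on the intermediate results; in these, the control module 220 controls the multiplexers to send intermediate results to the input of the floating-point addition module 230, and the intermediate results are accumulated to obtain the final result.
Further, in some embodiments, the first calculation stage is the calculation stage on the original floating-point numbers, and the second to Mth calculation stages are calculation stages on the intermediate results. In each calculation cycle of the first calculation stage, the control module 220 directs the original floating-point numbers, in turn, to the input of the floating-point addition module so that each group of original floating-point numbers is accumulated and M first intermediate results are obtained, one for each of the M groups, wherein, in the first calculation stage, the M first intermediate results are output as the intermediate results at the output of the floating-point addition module.
In each of the second to Mth calculation stages, the control module 220 directs the intermediate result output by the previous calculation stage and one of the M first intermediate results to the input of the floating-point addition module for accumulation, so that a plurality of second intermediate results is obtained, wherein, in the second to Mth calculation stages, the second intermediate results are output at the output of the floating-point addition module as the intermediate results or the final result.
The floating-point accumulation device 200 of the embodiment of the present invention makes full use of the latency of floating-point accumulation and greatly shortens the calculation time and the IO occupation time. At the same time, the device 200 has a small hardware overhead, which saves chip area and reduces chip cost, and it can easily be extended to a vector structure, which suits the computational characteristics of a DSP.
As described above, the floating-point accumulation device 200 according to an embodiment of the present invention is configured to perform the floating-point accumulation in calculation stages. The functions of the loading module 210, the control module 220, and the floating-point addition module 230 are described in detail below in the order of the calculation stages, with reference to FIG. 3.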
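First, in the first calculation stage, the control module 220 controls at least one multiplexer to send the N original floating-point numbers to the floating-point addition module 230 cycle by cycle. The original floating-point numbers are the floating-point numbers to be accumulated, and they may be 32-bit single-precision floating-point numbers as defined by the IEEE 754-2008 standard.
The loading module 210 is responsible for reading the N original floating-point numbers. In each cycle of the first calculation stage, the loading module 210 reads one original floating-point number and sends it to the floating-point addition module 230. In other words, in the first calculation stage the floating-point addition module 230 reads in one original floating-point number in every cycle. As soon as a stage of the multi-stage pipeline has finished processing two original floating-point numbers and passed them to the next stage, it receives the two floating-point numbers input at the next moment. In previous floating-point accumulation schemes, by contrast, a new pair of floating-point numbers is accepted only after the previous pair has passed through all stages and the accumulation result has been output, which causes a delay of several cycles between the reading of two adjacent pairs of floating-point numbers.
Specifically, the loading module 210 sends floating-point number fp1 to the floating-point addition module 230 at T0, sends fp2 at T1, and so on, until it sends fpN at T(N-1); that is, the loading module 210 needs N cycles in total to send all N original floating-point numbers to be accumulated. In the example shown in FIG. 3, the original floating-point numbers are the 5 numbers fp1, fp2, fp3, fp4, and fp5, and from T0 to T4 the loading module 210 sends fp1 to fp5 in turn to the floating-point addition module 230.
Because the input of the floating-point addition module 230 needs to read two inputs at each moment, the floating-point accumulation device 200 may include two multiplexers, illustrated in FIG. 2 as a first multiplexer MUX1 and a second multiplexer MUX2. At each moment, the first multiplexer MUX1 and the second multiplexer MUX2 each send one floating-point number to the input of the floating-point addition module.
Further, in the floating-point accumulation device 200 of the embodiment of the present invention, the control module 220 exploits the staged structure of the floating-point addition module 230: by controlling the order of the floating-point numbers fed to the floating-point addition module 230, it divides the N original floating-point numbers into several groups and accumulates each group separately, so that a plurality of first intermediate results is obtained, one group intermediate result per group.
The grouped accumulation of the original floating-point numbers can be achieved by having the control module 220 control the first multiplexer MUX1 and the second multiplexer MUX2 to send different floating-point numbers to the floating-point addition module 230 in different calculation cycles of the first calculation stage. Specifically, when the control module 220 determines that the current cycle is in the first calculation stage, it can control the first multiplexer MUX1 to send the N original floating-point numbers from the loading module 210 to the floating-point addition module 230. When the control module 220 determines that the current calculation cycle is in the first part of the first calculation stage, it controls the second multiplexer MUX2 to send 0 to the input. When the control module 220 determines that the current calculation cycle is in the second part of the first calculation stage, it controls the second multiplexer MUX2 to send the intermediate result output at the output in the current calculation cycle back to the input. In one implementation, if an M-stage pipeline is used, the first part of the first calculation stage consists of its first M calculation cycles. In one implementation, the calculation cycle is one clock cycle. The present invention is not limited to this, however; in other implementations the calculation cycle may be several clock cycles.
The number of groups of original floating-point numbers equals the number of stages of the pipeline structure of the floating-point addition module 230: if the floating-point addition module 230 uses an M-stage pipeline, the original floating-point numbers are likewise divided into M groups, so as to make full use of the latency of floating-point addition. For example, when the floating-point addition module 230 uses a three-stage pipeline, the original floating-point numbers fp1, fp2, fp3, ..., fpN are divided into 3 groups, and each group is accumulated separately to obtain the group intermediate results mid_a, mid_b, and mid_c, where mid_a=fp1+fp4+fp7+..., mid_b=fp2+fp5+fp8+..., and mid_c=fp3+fp6+fp9+....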
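The interleaved grouping described above can be written as a one-line sketch. The helper name `group_partial_sums` is hypothetical, and Python floats are double precision, so the numeric values only illustrate which operands land in which group, not the rounding behaviour of 32-bit hardware.

```python
def group_partial_sums(values, m):
    """Sum every m-th element starting at offsets 0..m-1 (mid_a, mid_b, mid_c for m = 3)."""
    return [sum(values[g::m]) for g in range(m)]

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(group_partial_sums(values, 3))
```

With nine operands and m = 3, the three sums correspond to fp1+fp4+fp7, fp2+fp5+fp8, and fp3+fp6+fp9.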
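Specifically, in the first M cycles of the first calculation stage, the first multiplexer MUX1, under the control of the control module 220, sends the original floating-point numbers fp1 to fpM from the loading module 210 in turn to the input of the floating-point addition module 230, while the second multiplexer sends 0 to the input of the floating-point addition module 230. Because the latency of the floating-point addition module 230 is M cycles, at time TM the output of the floating-point addition module 230 delivers the result of fp1+0, that is, fp1. At that moment the first multiplexer MUX1 sends fp(M+1) to the input of the floating-point addition module 230 and the second multiplexer MUX2 sends fp1 to the input, so that fp1 and fp(M+1) are accumulated. At time T(M+1), the output of the floating-point addition module 230 delivers the result of fp2+0, that is, fp2; the first multiplexer MUX1 then sends fp(M+2) to the input and the second multiplexer MUX2 sends fp2 to the input, so that fp2 and fp(M+2) are accumulated, and so on. In this way the original floating-point numbers are accumulated in groups.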
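The feedback behaviour of the first calculation stage can be sketched as a small cycle-level model. This is an illustrative assumption rather than the patented circuit: the adder is modelled as a delay line of length m, `phase_one` is a hypothetical name, zero padding is applied when the operand count is not a multiple of m, and Python double-precision addition stands in for the hardware adder.

```python
from collections import deque

def phase_one(values, m):
    """Simulate stage 1: one raw operand per cycle on MUX1, 0 then adder feedback on MUX2."""
    padded = list(values) + [0.0] * (-len(values) % m)
    pipe = deque([None] * m)            # pipe[0] leaves the adder in the current cycle
    log = []
    for t, a in enumerate(padded):
        out = pipe.popleft()            # sum of the pair issued m cycles earlier
        b = 0.0 if t < m else out       # MUX2: zero in the first m cycles, feedback afterwards
        pipe.append(a + b)              # issue a new pair into the first pipeline stage
        log.append((t, a, b))
    return log, list(pipe)              # the sums still in flight are the m group sums

log, group_sums = phase_one([1.0, 2.0, 3.0, 4.0, 5.0], 3)
for t, a, b in log:
    print(f"T{t}: MUX1={a} MUX2={b}")
print(group_sums)
```

In this model the pair issued at T3 is (fp4, fp1), matching the description, and the three values still in the pipeline after the last issue are the group sums.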
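Preferably, for ease of hardware implementation, each group needs to contain the same number of original floating-point numbers, that is, the total number of original floating-point numbers should be an integer multiple of M. When N is not an integer multiple of M, the loading module 210 sends at least one additional floating-point number to the floating-point addition module 230 as an original floating-point number, so that the number of original floating-point numbers is padded to an integer multiple of M. After padding, the number of floating-point numbers sent to the floating-point addition module 230 is [Figure PCTCN2020085715-appb-000001] in total, that is, the number of additional floating-point numbers is [Figure PCTCN2020085715-appb-000002].
Preferably, the value of the at least one additional floating-point number is 0, so that it fills the slot without affecting the accumulation result. For ease of implementation, the at least one additional floating-point number used as padding may be sent in the last calculation cycle or cycles of the first calculation stage. The padding numbers may also all be sent to the floating-point addition module 230 by the second multiplexer MUX2.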
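The two counts are given as formula images in the source and are not reproduced here. Assuming only what the text states, namely that the operand count is padded up to the next integer multiple of M, a hypothetical helper `padding` could compute both numbers as follows.

```python
def padding(n, m):
    """Return (padded operand count, number of zero operands added), assuming padding
    to the smallest multiple of m that is not less than n."""
    padded = -(-n // m) * m     # ceiling division without floating point
    return padded, padded - n

print(padding(5, 3))
print(padding(6, 3))
```

For the example of FIG. 3 (N = 5, M = 3) this gives six operands with one added zero, which matches the description of that figure.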
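In one embodiment, the floating-point addition module 230 uses a three-stage pipeline structure to add two floating-point numbers. Specifically, referring to FIG. 4, the first pipeline stage first detects the type of the input floating-point numbers and then swaps the two numbers if necessary so that the one with the larger absolute value always comes first. It then aligns the exponents so that the exponent of the number with the smaller absolute value equals that of the number with the larger absolute value, and at the same time computes the new operator and the sign bit of the result. The second pipeline stage adds (or subtracts) the mantissas; for a subtraction it also performs leading zero anticipation (LZA) in parallel to predict the number of leading zeros in the result. The third pipeline stage first normalizes the result: a left shift is needed when the new operator is a subtraction, and a right shift when it is an addition. It then performs exception detection and rounding, and finally obtains and outputs the correct result. Exception detection checks whether the result is abnormal, for example whether it lies outside the normal range of values.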
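The division of work among the three stages can be illustrated with a toy software model. Everything here is an assumption made for illustration: the names `to_fields`, `stage1`, `stage2`, `stage3`, and `fp_add` are hypothetical, rounding is plain truncation rather than IEEE rounding, the leading-zero count is computed after the subtraction instead of being anticipated in parallel, and special values (infinities, NaNs, subnormals) and exception detection are left out.

```python
import math

P = 24  # significand width including the hidden bit, as in binary32

def to_fields(x):
    """Encode x as (sign, exponent, significand) with x = (-1)^sign * significand * 2^exponent."""
    if x == 0:
        return (0, 0, 0)
    frac, exp = math.frexp(abs(x))                  # abs(x) = frac * 2^exp, 0.5 <= frac < 1
    return (1 if x < 0 else 0, exp - P, int(frac * (1 << P)))  # extra bits are truncated

def from_fields(fields):
    sign, exp, sig = fields
    return (-1) ** sign * math.ldexp(sig, exp)

def stage1(a, b):
    """Swap so the larger magnitude is first, align exponents, choose the effective operator."""
    key = lambda f: (f[2] != 0, f[1], f[2])
    if key(a) < key(b):
        a, b = b, a
    b_sig = b[2] >> (a[1] - b[1]) if b[2] else 0    # right-shift the smaller operand
    return a[0], a[1], a[2], b_sig, a[0] != b[0]    # subtract when the signs differ

def stage2(sign, exp, a_sig, b_sig, subtract):
    """Add or subtract the significands and count leading zeros (stand-in for LZA)."""
    sig = a_sig - b_sig if subtract else a_sig + b_sig
    return sign, exp, sig, max(P - sig.bit_length(), 0)

def stage3(sign, exp, sig, lead):
    """Normalize: right shift after a carry, left shift after cancellation."""
    if sig == 0:
        return (0, 0, 0)
    if sig.bit_length() > P:
        return (sign, exp + 1, sig >> 1)
    return (sign, exp - lead, sig << lead)

def fp_add(x, y):
    return from_fields(stage3(*stage2(*stage1(to_fields(x), to_fields(y)))))

print(fp_add(1.5, 0.25), fp_add(1.5, -1.25), fp_add(1.5, 1.5))
```

The three sample calls exercise the aligned addition, the left shift after cancellation, and the right shift after a carry, respectively.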
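FIG. 3 shows a timing diagram of floating-point accumulation with a three-stage pipeline structure. Because M is 3, the 5 original floating-point numbers are divided into 3 groups. To make the groups equal in size, the loading module 210 sends one additional floating-point number 0, so that 6 original floating-point numbers are sent to the floating-point addition module 230: fp1 and fp4 form one group, fp2 and fp5 form one group, and fp3 and 0 form one group. The group intermediate results mid_a, mid_b, and mid_c are computed for the respective groups, where mid_a=fp1+fp4, mid_b=fp2+fp5, and mid_c=fp3+0.
Specifically, from T0 to T5 the first multiplexer MUX1 sends fp1 to fp5 and the padding number 0 in turn to the input of the floating-point addition module 230, and from T0 to T2 the second multiplexer MUX2 sends 0 to the input of the floating-point addition module 230. At T3, the output of the floating-point addition module 230 delivers the result of fp1+0 (regout), that is, fp1, which the second multiplexer MUX2 sends back to the input to be accumulated with fp4, computing mid_a=fp1+fp4. At T4, the output of the floating-point addition module 230 delivers fp2+0=fp2, which the second multiplexer MUX2 sends back to the input to be accumulated with fp5, computing mid_b=fp2+fp5. At T5, the output of the floating-point addition module 230 delivers fp3+0=fp3, which the second multiplexer MUX2 sends back to the input to be accumulated with the padding 0, computing mid_c=fp3+0.
In this calculation, the pair fp1 and 0 input at T0 is processed by the first stage of the three-stage pipeline during T0-T1, by the second stage during T1-T2, and by the third stage during T2-T3, and is output at T3. The pair fp2 and 0 input at T1 is processed by the first stage during T1-T2, by the second stage during T2-T3, and by the third stage during T3-T4, and is output at T4, and so on.
In other words, the first stage of the three-stage pipeline processes fp1 and 0 during T0-T1, fp2 and 0 during T1-T2, and fp3 and 0 during T2-T3; the second stage processes fp1 and 0 during T1-T2, fp2 and 0 during T2-T3, and fp3 and 0 during T3-T4, and so on. That is, each stage of the three-stage pipeline processes different floating-point numbers at each moment, which makes full use of the computing resources and shortens the calculation time.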
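The stage occupancy described above can be listed with a few lines of Python. The function name `stage_occupancy` is hypothetical; operand pairs are identified by their issue index (0 for the pair fp1 and 0, 1 for fp2 and 0, and so on), and `None` marks an idle stage.

```python
def stage_occupancy(num_pairs, stages=3):
    """For each interval T_t..T_(t+1), list which issued pair each pipeline stage is holding."""
    table = []
    for t in range(num_pairs + stages - 1):
        row = [t - s if 0 <= t - s < num_pairs else None for s in range(stages)]
        table.append(row)
    return table

for t, row in enumerate(stage_occupancy(3)):
    print(f"T{t}-T{t + 1}: stage1={row[0]} stage2={row[1]} stage3={row[2]}")
```

During T2-T3 all three stages are busy with different pairs, which is the full utilisation the text refers to.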
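Based on the above, in the first calculation stage the floating-point addition module 230 accumulates the original floating-point numbers in groups to obtain a plurality of first intermediate results, which serve as the group intermediate results. Because of the latency of floating-point addition, some of the group intermediate results are output at some moment during the subsequent second to Mth calculation stages. Then, in the second to Mth calculation stages, the control module 220 directs the first multiplexer MUX1 and the second multiplexer MUX2 to send the intermediate result output by the previous calculation stage and one of the M first intermediate results output by the first calculation stage to the input of the floating-point addition module 230 for accumulation, so that a plurality of second intermediate results is obtained, wherein, in the second to Mth calculation stages, the second intermediate results are output at the output of the floating-point addition module 230 as the intermediate results or the final result.
Specifically, when it determines that the current calculation cycle is in the second calculation stage, the control module 220 controls the first multiplexer MUX1 and the second multiplexer MUX2 to send the group intermediate result of a first group of the M groups and the group intermediate result of a second group of the M groups, respectively, to the input of the floating-point addition module 230 for accumulation, where the first group and the second group are different.
Further, the first group and the second group may be two adjacent groups, that is, their group intermediate results are output one after the other. Note that these two group intermediate results are not necessarily output at adjacent moments; there may be a delay of several calculation cycles between them. The group intermediate result of the earlier group can therefore be held temporarily in the intermediate register module 240 after it is output. When the later group intermediate result is output, the control module 220 can control the first multiplexer MUX1 to fetch the earlier group intermediate result from the intermediate register module 240 and send it to the input of the floating-point addition module 230, and at the same time control the second multiplexer MUX2 to send the later group intermediate result to the input of the floating-point addition module 230, so that the two group intermediate results are accumulated.
Still referring to FIG. 3, the group intermediate result of the first group, mid_a=fp4+fp1, is output at T6. Because the group intermediate result of the second group, mid_b=fp5+fp2, has not yet been output at that moment, mid_a is held temporarily in the intermediate register module 240. By way of example, the intermediate register module 240 includes several intermediate registers, each of which holds one intermediate result or one group intermediate result. When M=3, the intermediate register module 240 may include two 32-bit registers, which hold mid_a and mid_c respectively.
At T7, the output of the floating-point addition module 230 delivers mid_b=fp5+fp2. The control module 220 then controls the first multiplexer MUX1 to send the mid_a held in the intermediate register module 240 to the input of the floating-point addition module 230, and controls the second multiplexer MUX2 to send the mid_b delivered at the output of the floating-point addition module 230 back to the input, so that mid_a+mid_b, that is, fp1+fp4+fp2+fp5, is computed.
It will be understood that mid_a has a delay of one calculation cycle between its output and its reuse as an input and therefore has to be held in the intermediate register module 240 for one cycle, whereas mid_b has no delay between output and input, so its register can be absorbed into the output register of the floating-point addition module 230 itself and no register for mid_b needs to be provided in the intermediate register module 240.
In each calculation stage after the second calculation stage, the group intermediate results are accumulated step by step, that is, a new group intermediate result is added to the intermediate result output by the previous calculation stage. Specifically, if the control module 220 determines that the current calculation cycle is in a calculation stage after the second calculation stage, it controls the first multiplexer MUX1 and the second multiplexer MUX2 to send the intermediate result output by the previous calculation stage and a group intermediate result held in the intermediate register module 240, respectively, to the input of the floating-point addition module 230 for accumulation, until all group intermediate results have been accumulated and the final result is output.
In the example shown in FIG. 3 there are three calculation stages. After the second calculation stage has computed mid_a+mid_b, the third calculation stage accumulates mid_c with mid_a+mid_b to obtain the final result (mid_a+mid_b)+mid_c.
Specifically, after fp3 and 0 are sent to the adder at T5 in the first calculation stage, their result mid_c=fp3+0 is output at T8. At that moment the result of mid_a+mid_b has not yet been output, so mid_c is held in the intermediate register module 240. The result of the second calculation stage is output at T10, so mid_c is held in the intermediate register module 240 for two cycles.
At T10, the control module 220 controls the first multiplexer MUX1 and the second multiplexer MUX2 to send mid_c and mid_a+mid_b, respectively, to the input of the floating-point addition module 230, and the final result (mid_a+mid_b)+mid_c, that is, (fp1+fp4)+(fp2+fp5)+fp3, is output at T13.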
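The complete sequence of calculation stages can be reproduced with a small cycle-level model. This is a sketch under stated assumptions, not the device itself: the adder is a delay line of length m, the list `held` plays the role of the intermediate register module, the function name `accumulate` is hypothetical, and Python double-precision addition replaces the 32-bit adder. Note that the grouped order of additions can round differently from a left-to-right sum.

```python
from collections import deque

def accumulate(values, m=3):
    """Cycle-level sketch of grouped accumulation; returns (sum, cycle index of the final output)."""
    if not values or m < 2:
        raise ValueError("needs at least one value and m >= 2")
    padded = [float(v) for v in values] + [0.0] * (-len(values) % m)
    n = len(padded)
    pipe = deque([None] * m)       # models the m-cycle latency of the adder
    held = []                      # group sums parked in the intermediate registers
    t = 0
    while True:
        out = pipe.popleft()       # value at the adder output in cycle t
        issue = None               # sum entering the pipeline in cycle t, if any
        if t < n:                                  # stage 1: raw operand plus 0 or feedback
            issue = padded[t] + (0.0 if t < m else out)
        elif t < n + m:                            # the m group sums leave the adder in turn
            if t == n + 1:
                issue = held.pop(0) + out          # stage 2: first group sum + second group sum
            else:
                held.append(out)
        elif out is not None:                      # a partial sum from the previous stage
            if not held:
                return out, t                      # nothing left to add: final result
            issue = out + held.pop(0)              # later stages: add the next group sum
        pipe.append(issue)
        t += 1

print(accumulate([1, 2, 3, 4, 5]))
```

For five operands and a three-stage adder the model delivers the sum at cycle 13, in line with the T13 of the FIG. 3 example.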
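The floating-point accumulation device 200 according to an embodiment of the present invention uses a grouped accumulation scheme. In the first calculation stage, two floating-point numbers are sent to the input of the floating-point addition module 230 for accumulation at every moment, and each pipeline stage of the floating-point addition module processes different floating-point numbers at different moments, which makes full use of the computing resources. In the subsequent calculation stages, only a small number of group intermediate results have to be accumulated, which greatly shortens the calculation time. At the same time, the floating-point accumulation device 200 adds only the intermediate register module 240 for storing intermediate results and has essentially no other hardware overhead, which saves chip area and reduces chip cost.
When a three-stage pipeline structure is used for floating-point addition, the floating-point accumulation device of the embodiment of the present invention can shorten the calculation time and IO time for accumulating N floating-point numbers from 3(N-1) cycles to [Figure PCTCN2020085715-appb-000003] cycles. When N is large, [Figure PCTCN2020085715-appb-000004] can be regarded as approximately N, which is essentially the fastest achievable calculation speed; in particular the IO time is reduced, so the IO needs of other components are not affected.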
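The patent's own cycle-count expression is a formula image that is not reproduced here. As a rough, assumption-laden comparison, the following sketch derives a count purely from the FIG. 3 timing (padded input phase, one cycle to wait for the second group sum, then one adder latency per remaining stage); the function names and the closed form are illustrative and may differ from the expression in the source.

```python
def serial_cycles(n, latency=3):
    """Cycle index of the final sum in the conventional scheme, as stated in the text."""
    return latency * (n - 1)

def grouped_cycles(n, m=3):
    """Cycle index of the final sum inferred from the FIG. 3 timing (an assumption)."""
    padded = -(-n // m) * m            # operands after zero padding
    return padded + 1 + m * (m - 1)    # wait one cycle, then m - 1 dependent additions

for n in (5, 16, 100, 1000):
    print(n, serial_cycles(n), grouped_cycles(n))
```

Under this model the grouped scheme gives 13 cycles against 12 for N = 5, so the gain only appears as N grows: for N = 100 the counts are 109 against 297, consistent with the statement that the count approaches N for large N.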
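FIG. 5 shows a flowchart of a floating-point accumulation method 500 according to an embodiment of the present invention. The floating-point accumulation method 500 can be implemented by the floating-point accumulation device 200 described above. Only the main steps of the floating-point accumulation method 500 are described below; further details can be found above.
As shown in FIG. 5, the floating-point accumulation method 500 includes the following steps:
Step S510: read N original floating-point numbers and output the N original floating-point numbers one after another, consecutively, in N calculation cycles, where N is an integer greater than 3;
Step S520: determine the calculation stage to which the current calculation cycle belongs and, according to the calculation stage, select the original floating-point number as an input of the floating-point addition or feed the intermediate result output by the floating-point addition back as an input of the floating-point addition, wherein the calculation stages include a calculation stage on the original floating-point numbers and a calculation stage on the intermediate results;
Step S530: in each calculation cycle, obtain two floating-point numbers as the inputs of the floating-point addition, accumulate the two floating-point numbers with an M-stage pipeline structure, and output the intermediate result or the final result, where M is an integer greater than 1.
The floating-point accumulation method 500 of the embodiment of the present invention divides the calculation process into several calculation stages. Step S510 is executed in the first calculation stage, that is, one original floating-point number is output in each calculation cycle of the first calculation stage. In one embodiment, the number of calculation stages equals the number of pipeline stages, that is, the calculation process is divided into M calculation stages.
In step S520 and step S530, different floating-point numbers are accumulated in different calculation stages. Specifically, in some embodiments, the first calculation stage is the calculation stage on the original floating-point numbers, and the second to Mth calculation stages are calculation stages on the intermediate results. In each calculation cycle of the first calculation stage, the original floating-point numbers are sent in turn to the input of the floating-point addition module so that each group of original floating-point numbers is accumulated and M first intermediate results are obtained, one for each of the M groups, wherein, in the first calculation stage, the M first intermediate results are output as the intermediate results at the output of the floating-point addition module.
In each of the second to Mth calculation stages, the intermediate result output by the previous calculation stage and one of the M first intermediate results are sent to the input of the floating-point addition module for accumulation, so that a plurality of second intermediate results is obtained, wherein, in the second to Mth calculation stages, the second intermediate results are output at the output of the floating-point addition module as the intermediate results or the final result.
In each cycle of the first calculation stage, one original floating-point number is read and accumulated using the multi-stage pipeline structure. As soon as a stage of the multi-stage pipeline has finished processing two original floating-point numbers and passed them to the next stage, it receives the two floating-point numbers input at the next moment. In previous floating-point accumulation schemes, by contrast, a new pair of floating-point numbers is accepted only after the previous pair has passed through all stages and the accumulation result has been output, which causes a delay of several cycles between the reading of two adjacent pairs of floating-point numbers.
Because floating-point addition needs to read two inputs at each moment, in each calculation cycle of the first calculation stage another floating-point number, chosen according to the calculation cycle, is used as the second input of the floating-point addition alongside the original floating-point number supplied in sequence. By controlling the order of the floating-point numbers fed to the floating-point addition, the N original floating-point numbers are divided into several groups and each group is accumulated separately, so that M first intermediate results are obtained as the group intermediate results of the M groups.
Specifically, when the current calculation cycle is determined to be in the first part of the first calculation stage, 0 is sent to the input; when the current calculation cycle is determined to be in the second part of the first calculation stage, the intermediate result output at the output in the current calculation cycle is sent back to the input.
The number of groups of original floating-point numbers equals the number of stages of the pipeline structure: when an M-stage pipeline is used, the original floating-point numbers are likewise divided into M groups, so as to make full use of the latency of floating-point addition. For example, when a three-stage pipeline is used, the original floating-point numbers are divided into 3 groups, and each group is accumulated separately to obtain its group intermediate result.
Preferably, for ease of hardware implementation, each group needs to contain the same number of original floating-point numbers, that is, the total number of original floating-point numbers should be an integer multiple of M. When N is not an integer multiple of M, at least one additional floating-point number is sent and used as an original floating-point number, so that the number of original floating-point numbers is padded to an integer multiple of M.
Preferably, the value of the at least one additional floating-point number is 0, so that it fills the slot without affecting the accumulation result. For ease of implementation, the at least one additional floating-point number used as padding may be sent in the last calculation cycle or cycles of the first calculation stage.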
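In one embodiment, a three-stage pipeline structure is used to add two floating-point numbers.
Specifically, in the first pipeline stage, the type of the input floating-point numbers is detected first, and the two numbers are then swapped if necessary so that the one with the larger absolute value always comes first. The exponents are then aligned so that the exponent of the number with the smaller absolute value equals that of the number with the larger absolute value, and the new operator and the sign bit of the result are computed at the same time.
In the second pipeline stage, the mantissas are added (or subtracted); for a subtraction, leading zero anticipation (LZA) is also performed in parallel to predict the number of leading zeros in the result.
In the third pipeline stage, the result is first normalized: a left shift is needed when the new operator is a subtraction, and a right shift when it is an addition. Exception detection and rounding follow, and finally the correct result is obtained and output.
When a three-stage pipeline structure is used, M is 3, so the N original floating-point numbers are divided into 3 groups. During the floating-point accumulation, the first stage of the three-stage pipeline processes the first group of floating-point numbers in the first calculation cycle, the second group in the second calculation cycle, the third group in the third calculation cycle, and so on; the second stage processes the first group in the second calculation cycle, the second group in the third calculation cycle, the third group in the fourth calculation cycle, and so on. That is, each stage of the three-stage pipeline processes different floating-point numbers at each moment, which makes full use of the computing resources and shortens the calculation time.
Based on the above, in the first calculation stage the original floating-point numbers are accumulated in groups to obtain a plurality of group intermediate results. Because of the latency of floating-point addition, some of the group intermediate results are output at some moment during the subsequent second to Mth calculation stages. Then, in the second to Mth calculation stages, the intermediate result output by the previous calculation stage and one of the M first intermediate results are sent to the input of the floating-point addition module for accumulation, so that a plurality of second intermediate results is obtained, wherein, in the second to Mth calculation stages, the second intermediate results are output at the output of the floating-point addition module as the intermediate results or the final result.
Specifically, when the current calculation cycle is determined to be in the second calculation stage, the group intermediate result of a first group of the M groups and the group intermediate result of a second group of the M groups are used as the inputs of the floating-point addition and accumulated, where the first group and the second group are different.
Further, the first group and the second group may be two adjacent groups, that is, their group intermediate results are output one after the other. Note that these two group intermediate results are not necessarily output at adjacent moments; there may be a delay of several calculation cycles between them. The group intermediate result of the earlier group can therefore be held temporarily in a register after it is output; when the later group intermediate result is output, the earlier result is read from the register and used as one input of the floating-point addition while the later result is fed in as the other input, so that the two group intermediate results are accumulated.
In each calculation stage after the second calculation stage, the group intermediate results are accumulated step by step, that is, a new group intermediate result is added to the intermediate result output by the previous calculation stage. Specifically, if the current calculation cycle is determined to be in a calculation stage after the second calculation stage, the intermediate result output by the previous calculation stage and a group intermediate result held in the register are used as the inputs of the floating-point addition and accumulated, until all group intermediate results have been accumulated.
When M=3 there are three calculation stages. After the second calculation stage has accumulated two group intermediate results, the third calculation stage accumulates the intermediate result output by the second calculation stage with the group intermediate result of the third group to obtain the final result.
Based on the above description, the floating-point accumulation method 500 according to an embodiment of the present invention uses a grouped accumulation scheme, which greatly shortens the calculation time. When a three-stage pipeline structure is used for floating-point addition, the floating-point accumulation method 500 can shorten the calculation time and IO time for accumulating N floating-point numbers from 3(N-1) cycles to [Figure PCTCN2020085715-appb-000005] cycles. When N is large, [Figure PCTCN2020085715-appb-000006] can be regarded as approximately N, which is essentially the fastest achievable calculation speed; in particular the IO time is reduced, so the IO needs of other components are not affected.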
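In addition, an embodiment of the present invention provides a computer storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the floating-point accumulation method 500 shown in FIG. 5 can be implemented. For example, the computer storage medium is a computer-readable storage medium. The computer storage medium may include, for example, a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
In one embodiment, the computer program instructions stored on the computer storage medium, when run by a computer or processor, cause the computer or processor to perform the following steps:
reading N original floating-point numbers and outputting the N original floating-point numbers one after another, consecutively, in N calculation cycles, where N is an integer greater than 3;
determining the calculation stage to which the current calculation cycle belongs and, according to the calculation stage, selecting the original floating-point number as an input of the floating-point addition or feeding the intermediate result output by the floating-point addition back as an input of the floating-point addition;
in each calculation cycle, obtaining two floating-point numbers as the inputs of the floating-point addition, accumulating the two floating-point numbers with an M-stage pipeline structure, and outputting the intermediate result or the final result, where M is an integer greater than 1;
wherein the calculation stages include a calculation stage on the original floating-point numbers and a calculation stage on the intermediate results.
In addition, an embodiment of the present invention provides a computer program product containing instructions which, when executed by a computer, cause the computer to perform the steps of the floating-point accumulation method 500 shown in FIG. 5.
In summary, the floating-point accumulation method, floating-point accumulation device, and computer storage medium of the embodiments of the present invention can increase the speed of floating-point accumulation and shorten the IO occupation time, with a small hardware overhead that saves chip area and reduces chip cost.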
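The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination of these. When software is used, the implementation may take the form, wholly or partly, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)), etc.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over several network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.
The above is only a specific implementation of the present invention, and the scope of protection of the present invention is not limited to it. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.
Although exemplary embodiments have been described here with reference to the drawings, it should be understood that these exemplary embodiments are merely illustrative and are not intended to limit the scope of the present invention to them. A person of ordinary skill in the art can make various changes and modifications to them without departing from the scope and spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as claimed in the appended claims.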
在本申请所提供的几个实施例中,应该理解到,所揭露的设备和方法,可以通过其它的方式实现。例如,以上所描述的设备实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个设备,或一些特征可以忽略,或不执行。
在此处所提供的说明书中,说明了大量具体细节。然而,能够理解,本发明的实施例可以在没有这些具体细节的情况下实践。在一些实例中,并未详细示出公知的方法、结构和技术,以便不模糊对本说明书的理解。
类似地,应当理解,为了精简本发明并帮助理解各个发明方面中的一个或多个,在对本发明的示例性实施例的描述中,本发明的各个特征有时被一起分组到单个实施例、图、或者对其的描述中。然而,并不应将该本发明的方法解释成反映如下意图:即所要求保护的本发明要求比在每个权利要求中所明确记载的特征更多的特征。更确切地说,如相应的权利要求书所反映的那样,其发明点在于可以用少于某个公开的单个实施例的所有特征的特征来解决相应的技术问题。因此,遵循具体实施方式的权利要求书由此明确地并入该具体实施方式,其中每个权利要求本身都作为本发明的单独实施例。
本领域的技术人员可以理解,除了特征之间相互排斥之外,可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者设备的所有过程或单元进行组合。除非另外明确陈述,本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的替代特征来代替。
此外,本领域的技术人员能够理解,尽管在此所述的一些实施例包括其它实施例中所包括的某些特征而不是其它特征,但是不同实施例的特征的组合意味着处于本发明的范围之内并且形成不同的实施例。例如,在权利要求书中,所要求保护的实施例的任意之一都可以 以任意的组合方式来使用。
本发明的各个部件实施例可以以硬件实现,或者以在一个或者多个处理器上运行的软件模块实现,或者以它们的组合实现。本领域的技术人员应当理解,可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本发明实施例的一些模块的一些或者全部功能。本发明还可以实现为用于执行这里所描述的方法的一部分或者全部的装置程序(例如,计算机程序和计算机程序产品)。这样的实现本发明的程序可以存储在计算机可读介质上,或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到,或者在载体信号上提供,或者以任何其他形式提供。
应该注意的是上述实施例对本发明进行说明而不是对本发明进行限制,并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中,不应将位于括号之间的任何参考符号构造成对权利要求的限制。本发明可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中,这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。
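The above is merely a specific implementation of the present invention or a description of a specific implementation, and the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive of within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall be subject to the scope of protection of the claims.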

Claims (22)

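  1. A floating-point accumulation device, characterized in that the floating-point accumulation device comprises:
    a load module, configured to read N original floating-point numbers and to output the N original floating-point numbers one after another, consecutively, in N calculation cycles respectively, where N is an integer greater than 3;
    a control module, configured to determine the calculation phase to which the current calculation cycle belongs and, according to the calculation phase, to control at least one multiplexer to feed the original floating-point number output by the load module into an input of a floating-point addition module, or to control at least one multiplexer to feed an intermediate result output at an output of the floating-point addition module into the input of the floating-point addition module;
    a floating-point addition module, configured to obtain at the input, in each calculation cycle, two floating-point numbers fed by the multiplexer, to accumulate the two floating-point numbers using an M-stage pipeline structure, and to output the intermediate result or a final result at the output, where M is an integer greater than 1;
    wherein the calculation phases comprise a calculation phase concerning the original floating-point numbers and a calculation phase concerning the intermediate results.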
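  2. The floating-point accumulation device according to claim 1, characterized in that the N original floating-point numbers are divided into M groups in total, and the number of calculation phases is M.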
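  3. The floating-point accumulation device according to claim 2, characterized in that controlling, according to the calculation phase, at least one multiplexer to feed the original floating-point number output by the load module into the input of the floating-point addition module, or controlling the at least one multiplexer to feed the intermediate result output at the output of the floating-point addition module into the input of the floating-point addition module, comprises:
    in each calculation cycle of the first calculation phase, instructing in turn that the original floating-point numbers be fed into the input of the floating-point addition module, so as to accumulate each group of the original floating-point numbers and thereby obtain M group intermediate results;
    in each of the second to M-th calculation phases, instructing that the intermediate result output by the previous calculation phase and one of the M group intermediate results be fed into the input of the floating-point module for accumulation; and
    wherein the first calculation phase is the calculation phase concerning the original floating-point numbers, and the second to M-th calculation phases are the calculation phases concerning the intermediate results.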
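  4. The floating-point accumulation device according to claim 3, characterized in that the multiplexers comprise a first multiplexer and a second multiplexer, each of which is configured to input one floating-point number to the floating-point addition module.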
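  5. The floating-point accumulation device according to claim 4, characterized in that, in the first calculation phase, feeding each group of the original floating-point numbers into the input of the floating-point addition module for accumulation comprises:
    in each calculation cycle of the first calculation phase, controlling the first multiplexer to feed the original floating-point numbers output by the load module into the input in turn;
    in a first part of the calculation cycles of the first calculation phase, controlling the second multiplexer to feed 0 into the input;
    in a second part of the calculation cycles of the first calculation phase, controlling the second multiplexer to feed the intermediate result output at the output in the current calculation cycle into the input.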
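  6. The floating-point accumulation device according to any one of claims 3 to 5, characterized in that the load module is further configured to send out at least one additional floating-point number when N is not an integer multiple of M, wherein the value of the at least one additional floating-point number is 0.
  7. The floating-point accumulation device according to claim 6, characterized in that the at least one additional floating-point number is sent out in the last one or more calculation cycles of the first calculation phase.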
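  8. The floating-point accumulation device according to any one of claims 3 to 7, characterized by further comprising an intermediate register module configured to register the group intermediate result output at the output if the current calculation cycle does not use that group intermediate result for accumulation.
  9. The floating-point accumulation device according to claim 8, characterized in that the intermediate register module comprises a plurality of registers, each of which is used to register one group intermediate result.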
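  10. The floating-point accumulation device according to any one of claims 1 to 9, characterized in that the M-stage pipeline structure is a three-stage pipeline structure, and the input and the output are separated by three calculation cycles.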
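  11. The floating-point accumulation device according to claim 10, characterized in that, in the three-stage pipeline:
    the first pipeline stage is configured to perform type detection, swapping, and exponent alignment on the received floating-point numbers, and to compute the new operator and the sign bit of the result;
    the second pipeline stage is configured to perform mantissa addition or subtraction and leading-zero prediction;
    the third pipeline stage is configured to perform normalization, exception detection, and rounding, and to output the accumulation result.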
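  12. A floating-point accumulation method, characterized in that the floating-point accumulation method comprises:
    reading N original floating-point numbers and outputting the N original floating-point numbers one after another, consecutively, in N calculation cycles respectively, where N is an integer greater than 3;
    determining the calculation phase to which the current calculation cycle belongs and, according to the calculation phase, selecting either the original floating-point number as an input of a floating-point addition or an intermediate result output by the floating-point addition as an input of the floating-point addition;
    in each calculation cycle, obtaining two floating-point numbers as inputs of the floating-point addition, accumulating the two floating-point numbers using an M-stage pipeline structure, and outputting the intermediate result or a final result, where M is an integer greater than 1;
    wherein the calculation phases comprise a calculation phase concerning the original floating-point numbers and a calculation phase concerning the intermediate results.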
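  13. The floating-point accumulation method according to claim 12, characterized in that the N original floating-point numbers are divided into M groups in total, and the number of calculation phases is M.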
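  14. The floating-point accumulation method according to claim 13, characterized in that selecting, according to the calculation phase, either the original floating-point number as an input of the floating-point addition or the intermediate result output by the floating-point addition as an input of the floating-point addition comprises:
    in the first calculation phase, instructing that each group of the original floating-point numbers be accumulated separately, so as to obtain M group intermediate results;
    in each of the second to M-th calculation phases, instructing that the intermediate result output by the previous calculation phase and one of the M group intermediate results be fed into the input of the floating-point module for accumulation; and
    wherein the first calculation phase is the calculation phase concerning the original floating-point numbers, and the second to M-th calculation phases are the calculation phases concerning the intermediate results.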
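  15. The floating-point accumulation method according to claim 14, characterized in that, in the first calculation phase, accumulating each group of the original floating-point numbers separately comprises:
    in each calculation cycle of the first calculation phase, taking each original floating-point number in turn as one of the floating-point numbers input to the floating-point addition;
    in a first part of the calculation cycles of the first calculation phase, taking 0 as the other floating-point number input to the floating-point addition;
    in a second part of the calculation cycles of the first calculation phase, feeding in the intermediate result output in the current calculation cycle as the other floating-point number input to the floating-point addition.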
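  16. The floating-point accumulation method according to any one of claims 13 to 15, characterized by further comprising: sending out at least one additional floating-point number when N is not an integer multiple of M, wherein the value of the at least one additional floating-point number is 0.
  17. The floating-point accumulation method according to claim 16, characterized in that the at least one additional floating-point number is sent out in the last one or more calculation cycles of the first calculation phase.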
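  18. The floating-point accumulation method according to any one of claims 13 to 17, characterized by further comprising: if the current calculation cycle does not use an output group intermediate result for accumulation, registering the group intermediate result in a register.
  19. The floating-point accumulation method according to claim 18, characterized in that each register registers one group intermediate result.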
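  20. The floating-point accumulation method according to any one of claims 12 to 19, characterized in that the M-stage pipeline structure is a three-stage pipeline structure, and the input and the output of the floating-point addition are separated by three calculation cycles.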
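  21. The floating-point accumulation method according to claim 20, characterized in that, in the three-stage pipeline structure:
    the first pipeline stage is configured to perform type detection, swapping, and exponent alignment on the received floating-point numbers, and to compute the new operator and the sign bit of the result;
    the second pipeline stage is configured to perform mantissa addition or subtraction and leading-zero prediction;
    the third pipeline stage is configured to perform normalization, exception detection, and rounding, and to output the accumulation result.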
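  22. A computer storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 12 to 21.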
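
The accumulation schedule recited in the claims can be illustrated with a small cycle-level software model. The Python sketch below is an illustration under stated assumptions, not the claimed hardware: the names `pipelined_accumulate` and `clock` and the use of a `deque` as the adder pipeline are hypothetical, Python double-precision floats stand in for the hardware number format, and the timing of the second to M-th phases is simplified so that each addition simply waits for the previous result. It models an adder whose result appears M cycles after issue, forms M interleaved groups in the first phase by feeding the adder output back to its input, pads with zeros when N is not a multiple of M, and then folds the M group intermediate results together.

```python
from collections import deque


def pipelined_accumulate(values, stages=3):
    """Sum `values` with a simulated adder of latency `stages` (M) cycles.

    Hypothetical model for illustration; not taken from the patent text.
    """
    if stages < 2:
        raise ValueError("stages (M) must be greater than 1")
    data = [float(v) for v in values]
    if not data:
        return 0.0
    # Pad with zeros so the input count is a multiple of M.
    data += [0.0] * (-len(data) % stages)

    # pipe[0] is the result leaving the adder in the current cycle;
    # None marks a pipeline slot that holds no addition (a bubble).
    pipe = deque([None] * stages)

    def clock(a=None, b=None):
        """Advance one cycle: pop the emerging result, then issue a + b."""
        out = pipe.popleft()
        pipe.append(None if a is None else a + b)
        return out

    # Phase 1: one original number per cycle. The other operand is 0 for
    # the first M cycles, then the adder output of the current cycle,
    # which belongs to the same group (index t mod M).
    for t, x in enumerate(data):
        feedback = 0.0 if t < stages else pipe[0]
        clock(x, feedback)

    # Drain the pipeline: the last M results are the group intermediate
    # results, held here in a list that plays the role of registers.
    group_sums = [clock() for _ in range(stages)]

    # Phases 2..M: add the previous phase's result to one group result.
    total = group_sums[0]
    for partial in group_sums[1:]:
        clock(total, partial)            # issue the addition
        for _ in range(stages - 1):
            clock()                      # bubbles while the sum advances
        total = pipe[0]                  # the sum is now at the adder output
    return total


print(pipelined_accumulate(range(1, 11), stages=3))  # 55.0
```

Because grouping changes the order of the additions, the result of such a schedule can differ in the last bits from a strictly sequential sum when the inputs are not exactly representable; the example above uses values for which the sum is exact.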
PCT/CN2020/085715 2020-04-20 2020-04-20 Floating-point accumulation device, method, and computer storage medium WO2021212285A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
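PCT/CN2020/085715 WO2021212285A1 (zh) 2020-04-20 2020-04-20 Floating-point accumulation device, method, and computer storage medium
CN202080006248.2A CN113168308A (zh) 2020-04-20 2020-04-20 Floating-point accumulation device, method, and computer storage medium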

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/085715 WO2021212285A1 (zh) 2020-04-20 2020-04-20 Floating-point accumulation device, method, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2021212285A1 (zh) 2021-10-28

Family

ID=76879257

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/085715 WO2021212285A1 (zh) Floating-point accumulation device, method, and computer storage medium 2020-04-20 2020-04-20

Country Status (2)

Country Link
CN (1) CN113168308A (zh)
WO (1) WO2021212285A1 (zh)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
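CN115328436A (zh) * 2022-10-11 2022-11-11 深圳鲲云信息科技有限公司 Calculation method and apparatus for multiple accumulators, electronic device, and storage medium
CN115309363A (zh) * 2022-10-11 2022-11-08 深圳鲲云信息科技有限公司 Accumulator calculation method and apparatus, electronic device, and storage medium
CN115328437A (zh) * 2022-10-11 2022-11-11 深圳鲲云信息科技有限公司 Accumulator calculation method and apparatus, electronic device, and storage medium
CN117170622B (zh) * 2023-11-03 2024-03-01 深圳鲲云信息科技有限公司 Accumulator, method for an accumulator, chip circuit, and computing device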

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
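CN101859241A (zh) * 2010-05-22 2010-10-13 中国人民解放军国防科学技术大学 Fully pipelined 128-bit-precision floating-point accumulator based on full expansion
CN102033732A (zh) * 2010-12-17 2011-04-27 浙江大学 FPGA-based high-speed, low-latency floating-point accumulator and implementation method thereof
CN102629189A (zh) * 2012-03-15 2012-08-08 湖南大学 FPGA-based pipelined floating-point multiply-accumulate method
CN103176767A (zh) * 2013-03-01 2013-06-26 浙江大学 Implementation method of a low-power, high-throughput floating-point multiply-accumulate unit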
US20140281419A1 (en) * 2013-03-15 2014-09-18 Intel Corporation Combined floating point multiplier adder with intermediate rounding logic

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221490B (zh) * 2007-12-20 2010-11-10 清华大学 Floating-point multiply-add unit with a data forwarding structure
CN100570552C (zh) * 2007-12-20 2009-12-16 清华大学 Parallel floating-point multiply-add unit
US11061672B2 (en) * 2015-10-02 2021-07-13 Via Alliance Semiconductor Co., Ltd. Chained split execution of fused compound arithmetic operations

Also Published As

Publication number Publication date
CN113168308A (zh) 2021-07-23

Similar Documents

Publication Publication Date Title
WO2021212285A1 (zh) Floating-point accumulation device, method, and computer storage medium
US6675235B1 (en) Method for an execution unit interface protocol and apparatus therefor
CN110689125A (zh) Computing device
US6601077B1 (en) DSP unit for multi-level global accumulation
US9274802B2 (en) Data compression and decompression using SIMD instructions
KR101085810B1 (ko) Multi-stage floating-point accumulator
TWI493453B (zh) Microprocessor with improved-precision sum-of-products operation, and video decoding device, method, and computer program product thereof
US20190095175A1 (en) Arithmetic processing device and arithmetic processing method
TW201344565A (zh) Systems, apparatuses, and methods for performing delta decoding on packed data
US11880682B2 (en) Systolic array with efficient input reduction and extended array performance
CN112711738A (zh) Computing device, method, and integrated circuit chip for vector inner product
US20170017467A1 (en) Integer/floating point divider and square root logic unit and associates methods
US20230004523A1 (en) Systolic array with input reduction to multiple reduced inputs
CN112463113B (zh) Floating-point addition unit
WO2021078210A1 (zh) Computing device, method, integrated circuit, and apparatus for neural network operations
WO2021120851A1 (zh) Floating-point processing device and data processing method
WO2021232422A1 (zh) Neural network computing device and control method thereof
WO2019205064A1 (zh) Neural network acceleration device and method
US11221826B2 (en) Parallel rounding for conversion from binary floating point to binary coded decimal
CN209895329U (zh) Multiplier
US7047271B2 (en) DSP execution unit for efficient alternate modes for processing multiple data sizes
US6981012B2 (en) Method and circuit for normalization of floating point significants in a SIMD array MPP
KR100732426B1 (ko) Computer with high-speed context switching
CN111124358A (zh) Operation method and device for a sequence accumulator
WO2022141321A1 (zh) DSP processor and parallel computing method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20931810

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20931810

Country of ref document: EP

Kind code of ref document: A1