CN114296682A - Floating point number processing device, floating point number processing method, electronic equipment, storage medium and chip - Google Patents

Publication number
CN114296682A
Authority
CN
China
Prior art keywords
floating point
point number
target
processed
exponent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111667694.0A
Other languages
Chinese (zh)
Inventor
霍冠廷
王文强
徐宁仪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Power Tensors Intelligent Technology Co Ltd
Original Assignee
Shanghai Power Tensors Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Power Tensors Intelligent Technology Co Ltd
Priority to CN202111667694.0A
Publication of CN114296682A
Priority to PCT/CN2022/124517 (WO2023124372A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/06 Arrangements for sorting, selecting, merging, or comparing data on individual record carriers
    • G06F7/10 Selecting, i.e. obtaining data of one kind from those record carriers which are identifiable by data of a second kind from a mass of ordered or randomly-distributed record carriers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/483 Computations with numbers represented by a non-linear combination of denominational numbers, e.g. rational numbers, logarithmic number system or floating-point numbers
    • G06F7/485 Adding; Subtracting

Abstract

The present disclosure provides a floating point number processing apparatus, a floating point number processing method, an electronic device, a storage medium, and a chip. The floating point number processing method includes: acquiring a plurality of floating point numbers to be processed that are to be operated on in a target chip; expanding the sign bit of each floating point number to be processed to a target bit width to obtain a plurality of expanded floating point numbers, wherein the target bit width matches the total number of floating point numbers to be processed; accumulating the plurality of expanded floating point numbers in the target chip to obtain a target floating point number; and normalizing the target floating point number to obtain a target processing result, wherein the format of the target processing result matches a preset format.

Description

Floating point number processing device, floating point number processing method, electronic equipment, storage medium and chip
Technical Field
The present disclosure relates to the field of integrated circuit technologies, and in particular, to a floating-point number processing apparatus, a floating-point number processing method, an electronic device, a storage medium, and a chip.
Background
With the continuing development of semiconductor technology, computer architecture, and processor design, processors have become increasingly powerful and their structures increasingly complex. Floating point operations are among the operations in a processor that involve more steps, longer delay, and higher power consumption, and they therefore have a considerable influence on processor performance.
Therefore, it is important to provide a method for optimizing the floating-point number processing.
Disclosure of Invention
In view of the above, the present disclosure provides at least a floating-point number processing apparatus, a floating-point number processing method, an electronic device, a storage medium, and a chip.
In a first aspect, the present disclosure provides a floating point number processing apparatus, the apparatus comprising: a selector and an adder; wherein the adder is connected with the selector;
the selector is used for expanding the sign bit of each floating point number to be processed in the acquired floating point numbers to a target bit width to obtain a plurality of expanded floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
and the adder is used for accumulating the plurality of extended floating point numbers to obtain a target floating point number.
To preserve precision during floating point accumulation and to meet the bit width required by the intermediate floating point numbers produced by the addition operations, the selector can extend the sign bit of each floating point number to be processed to the target bit width, yielding a plurality of extended floating point numbers with a widened bit width. As a result, the intermediate floating point numbers obtained while accumulating the extended floating point numbers do not need to be normalized, which reduces the number of normalization operations and hence the delay and power consumption of the chip during accumulation. Reducing the number of normalization operations also alleviates the data loss caused by normalization and improves the precision of the target floating point number.
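As an illustration only, the sign-bit extension performed by the selector can be sketched in Python as follows. The sketch assumes the mantissa is held as a two's-complement bit pattern, which the disclosure does not state; the function name `sign_extend` and its parameters are hypothetical.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend a two's-complement bit pattern from from_bits to to_bits.

    Assumption: two's-complement storage; the disclosure only says that
    the sign bit is extended to a target bit width.
    """
    if to_bits < from_bits:
        raise ValueError("target bit width must not be narrower than the source")
    low_mask = (1 << from_bits) - 1
    value &= low_mask                      # keep only the original bits
    if value >> (from_bits - 1):           # sign bit is set
        # Fill every newly added high-order bit with a copy of the sign bit.
        value |= ((1 << to_bits) - 1) & ~low_mask
    return value
```

For example, the 4-bit pattern `0b1011` (which is -5) becomes the 8-bit pattern `0b11111011`, which is still -5, while a non-negative pattern is simply padded with zeros.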
In a possible embodiment, the apparatus further comprises: a comparator; the comparator is connected with the adder;
the adder is further used for inputting the target floating point number to the comparator;
and the comparator is used for normalizing the target floating point number to obtain a target processing result, wherein the format of the target processing result is matched with a preset format.
In this implementation, the intermediate floating point numbers obtained while accumulating the plurality of extended floating point numbers do not need to be normalized. After the target floating point number is obtained by accumulation, it is normalized once to obtain the target processing result. This reduces the number of normalization operations, and therefore the delay and power consumption of the target chip when accumulating a plurality of floating point numbers; it also alleviates the data loss caused by normalization and improves the precision of the target processing result.
In a possible embodiment, the apparatus further comprises: an order-matching (exponent-alignment) arithmetic unit connected to both the adder and the selector; the order-matching arithmetic unit comprises a subtractor and a shifter;
the subtractor is used for determining a target difference value between the initial exponent of each expanded floating point number and the target exponent; wherein the target exponent is determined based on the initial exponents of the plurality of expanded floating point numbers;
the shifter is used for aligning the initial exponent of each expanded floating point number to the target exponent, and right-shifting the mantissa of each expanded floating point number based on the target difference corresponding to the expanded floating point number to obtain a processed floating point number; inputting the processed floating point number to the adder;
and the adder is used for accumulating the mantissas of the processed floating point numbers to obtain a target floating point number.
In a possible implementation manner, when the selector is configured to expand a sign bit of each of the obtained multiple floating point numbers to be processed to a target bit width to obtain multiple expanded floating point numbers, the selector is configured to: selecting two floating point numbers to be processed from a plurality of floating point numbers to be processed, and expanding sign bits of the two floating point numbers to be processed to a target bit width to obtain two expanded floating point numbers;
the subtractor is used for taking an initial exponent of a first floating point number of the two expanded floating point numbers as a target exponent; determining a first difference between the target exponent and an initial exponent of a second floating point number of the two extended floating point numbers; wherein the initial exponent of the first floating point number is greater than the initial exponent of the second floating point number;
the shifter is used for aligning the initial exponent of the second floating point number to the target exponent; right shifting the mantissa of the second floating point number based on the first difference to obtain a processed second floating point number; and inputting the first floating point number and the processed second floating point number to the adder;
and the adder is used for adding the first floating point number and the processed second floating point number to obtain an intermediate floating point number.
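The cooperation of subtractor, shifter, and adder for two operands can be sketched as below. This is a minimal sketch, assuming each number is modeled as a hypothetical `(exponent, signed_mantissa)` pair whose value is `signed_mantissa * 2**exponent`; the function name `align_and_add` is not from the disclosure.

```python
def align_and_add(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Add two (exponent, signed_mantissa) pairs without normalizing the sum."""
    # The operand with the larger exponent supplies the target exponent.
    first, second = (a, b) if a[0] >= b[0] else (b, a)
    target_exp = first[0]
    # Subtractor: first difference between target exponent and smaller exponent.
    diff = target_exp - second[0]
    # Shifter: right-shift the mantissa of the smaller-exponent operand.
    # Python's >> on a negative int is an arithmetic (sign-preserving) shift.
    shifted = second[1] >> diff
    # Adder: the sum keeps the target exponent and is left unnormalized.
    return target_exp, first[1] + shifted
```

With `(3, 5)` (value 40) and `(1, 8)` (value 16), the second mantissa is shifted right by 2 to give 2, and the result `(3, 7)` has value 56. Low-order bits shifted out during alignment are dropped, as in `align_and_add((2, 1), (0, 3))`, which returns `(2, 1)`.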
In a possible embodiment, the apparatus further comprises: a first register; the first register is connected to both the adder and the subtractor of the order-matching arithmetic unit;
the adder is used for inputting the obtained intermediate floating point number to the first register;
and the first register is used for storing the intermediate floating point number and the target floating point number sent by the adder and sending the intermediate floating point number to the subtractor.
In a possible implementation, the apparatus further includes a second register; the second register is connected with the subtractor;
the second register is used for storing the target exponent and sending the target exponent to the subtractor of the order-matching arithmetic unit.
In a possible implementation, the selector is further configured to: selecting a floating point number to be processed from the floating point numbers to be processed which are not selected; expanding the sign bit of the selected floating point number to be processed to a target bit width to obtain an expanded floating point number corresponding to the selected floating point number to be processed;
in response to the initial exponent of the expanded floating point number corresponding to the selected floating point number to be processed being larger than the target exponent stored in the second register, taking that initial exponent as a new target exponent, and updating the second register with the new target exponent; the second register is also used for sending the new target exponent to the subtractor;
the subtractor is used for determining, based on the received new target exponent, a second difference value between the exponent of the intermediate floating point number and the new target exponent;
the shifter is used for aligning the exponent of the intermediate floating point number to the new target exponent and right shifting the mantissa of the intermediate floating point number based on the second difference value to obtain a processed intermediate floating point number; and sending the processed intermediate floating point number and the expanded floating point number corresponding to the selected floating point number to be processed to the adder;
the adder is used for adding the mantissa of the processed intermediate floating point number and the mantissa of the expanded floating point number corresponding to the selected floating point number to be processed to obtain a new intermediate floating point number, and sending the new intermediate floating point number to the first register; and for taking the intermediate floating point number obtained after the last addition as the target floating point number, and sending the target floating point number to the first register.
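The register-based loop described above can be sketched as follows. This is a sketch under stated assumptions, not the disclosed circuit: numbers are hypothetical `(exponent, signed_mantissa)` pairs, the variables `first_register` and `second_register` only mimic the two registers, and the branch for an incoming exponent that is not larger than the stored target exponent (shift the incoming mantissa instead) is inferred from the general order-matching rule rather than spelled out in this embodiment.

```python
def accumulate_with_registers(numbers: list[tuple[int, int]]) -> tuple[int, int]:
    """Accumulate (exponent, signed_mantissa) pairs with a single running sum."""
    if not numbers:
        raise ValueError("at least one number is required")
    # second_register holds the target exponent, first_register the
    # unnormalized intermediate mantissa.
    second_register, first_register = numbers[0]
    for exp, mantissa in numbers[1:]:
        if exp > second_register:
            # New target exponent: right-shift the intermediate mantissa.
            first_register >>= exp - second_register
            second_register = exp
        else:
            # Assumed case: right-shift the incoming mantissa instead.
            mantissa >>= second_register - exp
        # Adder: no normalization between additions.
        first_register += mantissa
    return second_register, first_register
```

For the inputs `[(1, 8), (3, 5), (2, 4)]` (values 16, 40, and 16) the loop returns `(3, 9)`, that is, 72. Seeding the registers with the first operand and folding in the rest gives the same first step as selecting two operands and using the larger exponent as the target exponent.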
In one possible implementation, when normalizing the target floating-point number to obtain a target processing result, the comparator is configured to:
and normalizing the target floating point number by one or more of shifting, truncation, and rounding to obtain a target processing result.
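A single final normalization that combines shifting with either truncation or rounding might look like the sketch below. The output layout `(sign, exponent, magnitude)`, the function name `normalize`, and the choice of round-to-nearest-even are assumptions made for illustration; the disclosure names the three techniques without fixing a rounding rule. Exponent range checks for the preset format are omitted.

```python
def normalize(exp: int, mantissa: int, mantissa_bits: int,
              rounding: bool = True) -> tuple[int, int, int]:
    """Return (sign, exponent, magnitude) with a mantissa_bits-wide magnitude."""
    if mantissa == 0:
        return 0, 0, 0
    sign = 1 if mantissa < 0 else 0
    mag = abs(mantissa)
    excess = mag.bit_length() - mantissa_bits
    if excess > 0:
        # Shift right; the dropped low bits are truncated or rounded.
        dropped = mag & ((1 << excess) - 1)
        mag >>= excess
        exp += excess
        if rounding:
            half = 1 << (excess - 1)
            if dropped > half or (dropped == half and mag & 1):
                mag += 1
                if mag.bit_length() > mantissa_bits:
                    # Rounding carried out of the top bit: shift once more.
                    mag >>= 1
                    exp += 1
    elif excess < 0:
        # Shift left so the leading 1 sits in the top magnitude bit.
        mag <<= -excess
        exp += excess
    return sign, exp, mag
```

With a 3-bit magnitude, the unnormalized value 15 at exponent 0 becomes `(0, 1, 7)` (value 14) under truncation and `(0, 2, 4)` (value 16) under rounding.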
In a possible implementation, the target bit width is determined based on an initial bit width of an initial mantissa included in the plurality of floating point numbers to be processed and a total number of the plurality of floating point numbers to be processed.
In a second aspect, the present disclosure provides a floating point number processing method, including:
acquiring a plurality of floating point numbers to be processed which are operated in a target chip;
expanding the sign bit of each floating point number to be processed to a target bit width to obtain a plurality of expanded floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
and accumulating the plurality of expanded floating point numbers in the target chip to obtain a target floating point number.
In the method, to preserve precision during floating point accumulation and to meet the bit width required by the intermediate floating point numbers produced by the addition operations, the sign bit of each floating point number to be processed can be extended to the target bit width, yielding a plurality of extended floating point numbers with a widened bit width. As a result, the intermediate floating point numbers obtained while accumulating the extended floating point numbers in the target chip do not need to be normalized, which reduces the number of normalization operations and hence the delay and power consumption of the target chip during accumulation. Reducing the number of normalization operations also alleviates the data loss caused by normalization and improves the precision of the target floating point number.
In one possible implementation, after the obtaining the target floating point number, the method further includes:
and normalizing the target floating point number to obtain a target processing result, wherein the format of the target processing result is matched with a preset format.
In this implementation, the intermediate floating point numbers obtained while accumulating the plurality of extended floating point numbers do not need to be normalized. After the target floating point number is obtained by accumulation, it is normalized once to obtain the target processing result. This reduces the number of normalization operations, and therefore the delay and power consumption of the target chip when accumulating a plurality of floating point numbers; it also alleviates the data loss caused by normalization and improves the precision of the target processing result.
In one possible embodiment, the target bit width is determined according to the following steps:
determining an adjustment bit width corresponding to the plurality of floating point numbers to be processed based on an initial bit width of initial mantissas included in the plurality of floating point numbers to be processed;
and determining the target bit width based on the adjustment bit width and the total number of the floating point numbers to be processed.
In the above embodiment, the target bit width is determined more accurately from the adjustment bit width and the total number of floating point numbers to be processed. When the sign bits of the floating point numbers to be processed are subsequently extended to this target bit width, the resulting extended floating point numbers are wide enough for the accumulation, so the intermediate floating point numbers obtained during accumulation do not need to be normalized.
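One plausible way to compute such a target bit width is sketched below. The formula is an assumption, not taken from this section: the adjustment bit width is taken as the stored mantissa width plus one implicit leading bit plus one sign bit, and the target bit width adds ceil(log2(N)) guard bits so that the sum of N operands cannot overflow. The function name `target_bit_width` is hypothetical.

```python
def target_bit_width(mantissa_bits: int, count: int) -> int:
    """Width of a signed accumulator that can hold the sum of `count` mantissas."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # Assumed adjustment bit width: stored mantissa + hidden bit + sign bit.
    adjustment = mantissa_bits + 2
    # (count - 1).bit_length() equals ceil(log2(count)) for count >= 1.
    guard_bits = (count - 1).bit_length()
    return adjustment + guard_bits
```

Under these assumptions, accumulating 8 numbers with 23-bit stored mantissas needs 28 bits, and accumulating 9 numbers with 10-bit stored mantissas needs 16 bits.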
In one possible implementation manner, the expanding the sign bit of each floating point number to be processed to a target bit width to obtain a plurality of expanded floating point numbers includes:
selecting two floating point numbers to be processed from a plurality of floating point numbers to be processed, and expanding sign bits of the two floating point numbers to be processed to a target bit width to obtain two expanded floating point numbers;
accumulating the plurality of extended floating point numbers in the target chip to obtain a target floating point number, including:
taking the two expanded floating point numbers as two current floating point numbers, and adding the two current floating point numbers in the target chip to obtain an intermediate floating point number;
selecting a floating point number to be processed from the floating point numbers to be processed which are not selected; expanding the sign bit of the selected floating point number to be processed to a target bit width to obtain an expanded floating point number corresponding to the selected floating point number to be processed;
taking the expanded floating point number corresponding to the selected floating point number to be processed and the intermediate floating point number as the two updated current floating point numbers, and returning to the step of adding the two current floating point numbers in the target chip to obtain an intermediate floating point number, until every floating point number to be processed among the plurality of floating point numbers to be processed has been added;
and taking the intermediate floating point number obtained after the last addition processing as the target floating point number.
The target floating point number is obtained by accumulating the plurality of floating point numbers to be processed over multiple additions, and the intermediate floating point numbers generated during accumulation do not need to be normalized. This avoids the data loss that normalizing the intermediate floating point numbers would cause, preserves their precision, and gives the resulting target floating point number higher precision.
In one possible implementation, the adding the two current floating point numbers in the target chip to obtain an intermediate floating point number includes:
aligning, in the target chip, an initial exponent of a first floating point number of the two current floating point numbers to an initial exponent of a second floating point number; right shifting the mantissa of the first floating point number by a target number of bits to obtain an order-matched first floating point number; wherein the initial exponent of the first floating point number is less than the initial exponent of the second floating point number of the two current floating point numbers; the target number of bits is the difference between the initial exponent of the second floating point number and the initial exponent of the first floating point number;
and adding the order-matched first floating point number and the second floating point number in the target chip to obtain an intermediate floating point number.
In a possible implementation manner, the normalizing the target floating point number to obtain a target processing result includes:
and normalizing the target floating point number by one or more of shifting, truncation, and rounding to obtain a target processing result.
Here, the target floating point number can be flexibly normalized by one or more of shifting, truncation, and rounding to obtain the target processing result.
For the effects of the apparatus, the electronic device, and the like described below, refer to the description of the method above; they are not repeated here.
In a third aspect, the present disclosure provides a chip comprising: the floating point number processing device of the first aspect or any embodiment thereof, and a memory;
the memory is used for storing a plurality of floating point numbers to be processed to be operated;
and the floating point number processing device is used for processing the floating point numbers to be processed to obtain target floating point numbers.
In a fourth aspect, the present disclosure provides a floating-point number processing apparatus, including:
the acquisition module is used for acquiring a plurality of floating point numbers to be processed which are operated in a target chip;
the expansion module is used for expanding the sign bit of each floating point number to be processed to a target bit width to obtain a plurality of expanded floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
and the first processing module is used for accumulating the plurality of extended floating point numbers in the target chip to obtain a target floating point number.
In one possible implementation, the apparatus further includes:
and the second processing module is used for normalizing the target floating point number to obtain a target processing result, wherein the format of the target processing result is matched with a preset format.
In a possible implementation, the extension module is configured to determine the target bit width according to the following steps:
determining an adjustment bit width corresponding to the plurality of floating point numbers to be processed based on an initial bit width of initial mantissas included in the plurality of floating point numbers to be processed;
and determining the target bit width based on the adjustment bit width and the total number of the floating point numbers to be processed.
In a possible implementation manner, when the sign bit of each floating point number to be processed is extended to the target bit width to obtain a plurality of extended floating point numbers, the extension module is configured to:
selecting two floating point numbers to be processed from a plurality of floating point numbers to be processed, and expanding sign bits of the two floating point numbers to be processed to a target bit width to obtain two expanded floating point numbers;
the first processing module, when accumulating the plurality of extended floating point numbers in the target chip to obtain a target floating point number, is configured to:
taking the two expanded floating point numbers as two current floating point numbers, and adding the two current floating point numbers in the target chip to obtain an intermediate floating point number;
selecting a floating point number to be processed from the floating point numbers to be processed which are not selected; expanding the sign bit of the selected floating point number to be processed to a target bit width to obtain an expanded floating point number corresponding to the selected floating point number to be processed;
taking the expanded floating point number corresponding to the selected floating point number to be processed and the intermediate floating point number as the two updated current floating point numbers, and returning to the step of adding the two current floating point numbers in the target chip to obtain an intermediate floating point number, until every floating point number to be processed among the plurality of floating point numbers to be processed has been added;
and taking the intermediate floating point number obtained after the last addition processing as the target floating point number.
In a possible implementation manner, the first processing module, when summing the two current floating point numbers in the target chip to obtain an intermediate floating point number, is configured to:
aligning, in the target chip, an initial exponent of a first floating point number of the two current floating point numbers to an initial exponent of a second floating point number; right shifting the mantissa of the first floating point number by a target number of bits to obtain an order-matched first floating point number; wherein the initial exponent of the first floating point number is less than the initial exponent of the second floating point number of the two current floating point numbers; the target number of bits is the difference between the initial exponent of the second floating point number and the initial exponent of the first floating point number;
and adding the order-matched first floating point number and the second floating point number in the target chip to obtain an intermediate floating point number.
In a possible implementation manner, when normalizing the target floating-point number to obtain a target processing result, the second processing module is configured to:
and normalizing the target floating point number by one or more of shifting, truncation, and rounding to obtain a target processing result.
In a fifth aspect, the present disclosure provides an electronic device comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, and the machine-readable instructions, when executed by the processor, performing the steps of the floating point number processing method as set forth in the second aspect or any one of its embodiments; or comprising the chip as described in the third aspect.
In a sixth aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the floating point number processing method according to the second aspect or any one of the embodiments.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required by the embodiments are briefly described below. The drawings are incorporated in and form a part of the specification; they illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be understood that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive other related drawings from them without inventive effort.
FIG. 1 is a flow chart illustrating a floating point number processing method according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating an architecture of a floating-point processing apparatus according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating an architecture of another floating-point processing apparatus provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating an architecture of a chip provided by an embodiment of the disclosure;
FIG. 5 is a block diagram illustrating an architecture of a floating-point processing apparatus according to an embodiment of the present disclosure;
FIG. 6 shows a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
With the continuing development of semiconductor technology, computer architecture, and processor design, processors have become increasingly powerful and their structures increasingly complex. Floating point operations are among the operations in a processor that involve more steps, longer delay, and higher power consumption, and they therefore have a considerable influence on processor performance.
Accumulation is a common floating point operation. Conventionally, when a plurality of floating point numbers is accumulated, two floating point numbers are added to obtain an intermediate result, and the intermediate result is normalized. The normalized intermediate result is then added to the next floating point number, and so on until the accumulation is finished and a target result is obtained. However, normalizing the intermediate result after every addition causes high power consumption and long delay in the processor performing the accumulation. Moreover, because normalization causes data loss, each normalization reduces the precision of the intermediate result, so a target result obtained through multiple accumulation steps has low precision.
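The precision cost of normalizing after every addition can be seen in a small sketch that runs the same accumulation twice, once normalizing each intermediate result and once normalizing only at the end. The numbers are hypothetical `(exponent, signed_mantissa)` pairs with value `signed_mantissa * 2**exponent`, the mantissa magnitude is limited to 4 bits, and normalization is modeled as a right shift with truncation only; all names are illustrative.

```python
def truncate_normalize(exp: int, mantissa: int, bits: int) -> tuple[int, int]:
    """Right-shift until |mantissa| fits in `bits` bits, dropping low bits."""
    sign = -1 if mantissa < 0 else 1
    mag = abs(mantissa)
    excess = max(mag.bit_length() - bits, 0)
    return exp + excess, sign * (mag >> excess)

def add_aligned(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """Align both operands to the larger exponent, then add the mantissas."""
    (exp_a, man_a), (exp_b, man_b) = a, b
    exp = max(exp_a, exp_b)
    return exp, (man_a >> (exp - exp_a)) + (man_b >> (exp - exp_b))

def accumulate(numbers: list[tuple[int, int]], bits: int,
               normalize_each_step: bool) -> int:
    """Return the accumulated value after a final normalization."""
    acc = numbers[0]
    for nxt in numbers[1:]:
        acc = add_aligned(acc, nxt)
        if normalize_each_step:
            acc = truncate_normalize(*acc, bits)
    exp, mantissa = truncate_normalize(*acc, bits)
    return mantissa * 2 ** exp

inputs = [(0, 15), (0, 14), (0, -14)]   # exact sum is 15
per_step = accumulate(inputs, bits=4, normalize_each_step=True)
deferred = accumulate(inputs, bits=4, normalize_each_step=False)
```

Here `per_step` is 14 and `deferred` is 15: normalizing the intermediate sum 29 to 4 bits drops its lowest bit, and that loss carries into the final result, whereas the wider unnormalized intermediate keeps it.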
In order to alleviate the above problem, an embodiment of the present disclosure provides a floating point number processing method.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
For the convenience of understanding the embodiments of the present disclosure, a floating-point number processing method disclosed in the embodiments of the present disclosure will be described in detail first. The execution main body of the floating point number Processing method provided by the embodiment of the present disclosure is generally a chip with certain computing capability, and the chip may be, for example, a processor, and the processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Tensor Processing Unit (TPU), an embedded Neural-Network Processing Unit (NPU), and the like. In some possible implementations, the floating point number processing method may be implemented by a processor calling computer readable instructions stored in a memory.
Referring to fig. 1, a schematic flow diagram of a floating point number processing method provided in the embodiment of the present disclosure is shown, where the method includes S101 to S103, where:
S101, acquiring a plurality of floating point numbers to be processed which are operated on in a target chip;
S102, expanding the sign bit of each floating point number to be processed to a target bit width to obtain a plurality of expanded floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
S103, accumulating the plurality of expanded floating point numbers in the target chip to obtain a target floating point number.
In the method, in order to guarantee the precision of the floating point number accumulation process and meet the bit width requirement of the intermediate floating point number obtained after each summation, the sign bit of each floating point number to be processed can be expanded to the target bit width to obtain a plurality of expanded floating point numbers, thereby expanding the bit width of the floating point numbers to be processed. As a result, when the plurality of expanded floating point numbers are accumulated in the target chip, the intermediate floating point numbers obtained during accumulation do not need to be normalized, which reduces the number of normalization operations and thus the delay and power consumption of the target chip when accumulating the plurality of expanded floating point numbers. Meanwhile, reducing the number of normalization operations alleviates the data loss caused by normalization and improves the precision of the target floating point number.
S101 to S103 will be specifically described below.
In S101, the target chip may be any chip that needs to perform floating-point operations; for example, the target chip may be a CPU, a GPU, an NPU, or the like. A floating point number generally includes a sign bit, an exponent, and a mantissa. For example, for the floating point number $(0.3)_{10} = (0011\ 1110\ 1001\ 1001\ 1001\ 1001\ 1001\ 1010)_2$, the sign bit is Sa = 0, the exponent is Ea = 01111101, and the mantissa is Ma = 1.00110011001100110011010.
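The field split in the example above can be reproduced with a short sketch. It assumes the 32-bit pattern shown is an IEEE 754 single-precision encoding, which the text does not state explicitly; the function name `decode_float32` is hypothetical and used only for illustration.

```python
import struct


def decode_float32(value):
    """Split the single-precision encoding of value into (sign, exponent, fraction).

    Assumption: IEEE 754 binary32 layout (1 sign bit, 8 exponent bits,
    23 fraction bits). The exponent and fraction are returned as bit strings.
    """
    # Pack as a big-endian float32, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, format(exponent, "08b"), format(fraction, "023b")


sign, exponent, fraction = decode_float32(0.3)
# The mantissa is written with its implicit leading 1, as in the example above.
print("Sa =", sign, "Ea =", exponent, "Ma = 1." + fraction)
```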
In S102, it is considered that, when a plurality of floating point numbers are accumulated, the floating point numbers are first subjected to order matching processing (that is, their exponents are aligned), and the mantissas, including the sign bits, of the floating point numbers after the order matching processing are then added. When the mantissas, including the sign bits, of a plurality of floating point numbers are added, the bit width of the resulting sum is greater than or equal to the bit width of the mantissa, including the sign bit, of each floating point number before the addition. For example, if the bit width of the mantissa including the sign bit of floating point number 1 is 12 and the bit width of the mantissa including the sign bit of floating point number 2 is 13, the bit width of the sum of floating point number 1 and floating point number 2 may be 14.
In order to guarantee the precision of the floating point number obtained after the summation, and meet the bit width requirement of the sum value obtained after the summation operation, the sign bit of each floating point number to be processed can be expanded. Wherein, the expanded bit width may be a target bit width. The target bit width may be a bit width value set by a user, and the target bit width is larger than an initial bit width corresponding to a mantissa of any floating point number to be processed. Or, the target bit width may also be determined according to the initial bit width corresponding to the mantissa of each floating point number to be processed and the total number of the floating point numbers to be processed.
During implementation, the target bit width may be determined, and the sign bit of each floating point number to be processed is extended to the target bit width to obtain an extended floating point number. For example, for the floating point number above, the sign bit is Sa = 0, and the sign bit after extension may be Sa = 0000.
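A minimal sketch of sign-bit extension on a bit string follows. It assumes the sign bit is the leading bit and that extension means replicating that bit on the left, as in two's-complement sign extension; the text does not fix the number representation, and `sign_extend` is a hypothetical helper name.

```python
def sign_extend(bits, target_width):
    """Replicate the leading (sign) bit of `bits` until it is target_width long.

    Assumption: `bits` is a string of '0'/'1' whose first character is the sign bit.
    """
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a non-empty string of 0s and 1s")
    if target_width < len(bits):
        raise ValueError("target width is smaller than the current width")
    return bits[0] * (target_width - len(bits)) + bits


print(sign_extend("0", 4))      # a single sign bit 0 widened to 4 bits
print(sign_extend("1011", 8))   # a negative pattern keeps its leading 1s
```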
In an alternative embodiment, the target bit width is determined according to the following steps: determining an adjustment bit width corresponding to the plurality of floating point numbers to be processed based on an initial bit width of initial mantissas included in the plurality of floating point numbers to be processed; and determining the target bit width based on the adjusting bit width and the total number of the floating point numbers to be processed.
During implementation, the initial bit width of the initial mantissa of each floating point number to be processed may be determined, and the adjustment bit width may be determined based on the initial bit widths of the plurality of initial mantissas. The adjustment bit width may be the maximum value of the plurality of initial bit widths, or may be any integer larger than the maximum value. The following description takes as an example the case in which the adjustment bit width is the maximum bit width among the plurality of initial bit widths.
For example, if the number of floating point numbers to be processed is 5, the maximum bit width among the initial bit widths of the 5 initial mantissas is determined, and the maximum bit width is taken as the adjustment bit width. The target bit width is then determined according to the adjustment bit width and the total number of floating point numbers to be processed; that is, on the basis of the maximum bit width, the number of overflow digits of the sum is determined according to the total number of floating point numbers to be processed and the number system (radix) of the floating point numbers to be processed, and the target bit width is determined according to the number of overflow digits and the maximum bit width. For example, the target bit width may be $n + \log_a m$, where n is the adjustment bit width, m is the total number of floating point numbers to be processed, and a is the radix of the floating point numbers to be processed; for example, a = 2 indicates that the floating point numbers to be processed are binary numbers.
When the value of $n + \log_a m$ is not an integer, it may be rounded up to obtain the target bit width. For example, when $n + \log_a m = 12 + \log_2 5 \approx 14.32$, the target bit width may be determined to be 15. If the floating point number to be processed is $a = (0.3)_{10}$ = 00111110110011001100110011001101, where 0 is the sign bit, 01111101 is the initial exponent, and 10011001100110011001101 is the initial mantissa, then after the sign bit of the floating point number to be processed is extended to the target bit width, the floating point number to be processed may be: 00000111110110011001100110011001101.
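The target bit width rule described above (adjustment bit width plus the rounded-up logarithm of the number of operands) can be sketched as follows. The function names are hypothetical. The number of overflow digits is computed with integer arithmetic rather than a floating-point logarithm, which is a design choice of this sketch to avoid rounding surprises at exact powers of the radix.

```python
def overflow_digits(count, radix=2):
    """Smallest k with radix**k >= count, i.e. the rounded-up log of count."""
    if count < 1 or radix < 2:
        raise ValueError("count must be >= 1 and radix must be >= 2")
    digits, capacity = 0, 1
    while capacity < count:
        capacity *= radix
        digits += 1
    return digits


def target_bit_width(initial_widths, radix=2):
    """Adjustment width (here: the maximum initial width) plus overflow digits."""
    n = max(initial_widths)   # adjustment bit width
    m = len(initial_widths)   # total number of floating point numbers
    return n + overflow_digits(m, radix)


# Five operands whose widest mantissa is 12 bits: 12 + ceil(log2 5) = 15.
print(target_bit_width([12, 12, 12, 12, 12]))
```

Because n is an integer, rounding up n + log_a m is the same as adding the rounded-up logarithm to n, which is what the sketch does.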
In the above embodiment, the target bit width is determined more accurately according to the adjustment bit width and the total number of floating point numbers to be processed. When the sign bits of the floating point numbers to be processed are subsequently extended to the target bit width to obtain a plurality of extended floating point numbers, the bit width of the extended floating point numbers can meet the bit width requirement of the accumulation processing, so that the intermediate floating point numbers obtained during accumulation do not need to be normalized.
In S103, the plurality of extended floating point numbers may be subjected to order matching processing so that their exponents are consistent, and the mantissas of the extended floating point numbers after the order matching processing are accumulated to obtain the target floating point number. In practice, the order matching processing may include: determining a target exponent corresponding to the plurality of extended floating point numbers (for example, the target exponent may be the maximum exponent among the initial exponents of the plurality of extended floating point numbers), aligning the initial exponent of each extended floating point number to the target exponent, and right-shifting the mantissa of the extended floating point number by x bits to obtain the extended floating point number after the order matching processing. Here, x is a positive integer whose value is the difference between the target exponent and the initial exponent of the extended floating point number.
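The order matching step described above can be sketched under a simplified model. The model is an assumption of this sketch, not something the text prescribes: each extended floating point number is a pair `(exponent, mantissa)` with a signed integer mantissa, representing the value `mantissa * 2**exponent`. The name `align_and_accumulate` is hypothetical.

```python
def align_and_accumulate(numbers):
    """Align every mantissa to the largest exponent, then sum the mantissas.

    numbers: list of (exponent, signed_integer_mantissa) pairs.
    Returns (target_exponent, summed_mantissa). No normalization is applied.
    """
    target_exponent = max(exponent for exponent, _ in numbers)
    total = 0
    for exponent, mantissa in numbers:
        shift = target_exponent - exponent   # x in the description above
        # Python's >> on ints is an arithmetic shift, so the sign is preserved.
        total += mantissa >> shift
    return target_exponent, total


# Exponents 3, 1 and 2 are aligned to 3; mantissas are shifted by 0, 2 and 1.
print(align_and_accumulate([(3, 0b1100), (1, 0b1000), (2, -0b0110)]))
```

Bits shifted out on the right are simply dropped here; a hardware design might keep guard bits, which this sketch does not model.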
During implementation, two extended floating point numbers in the plurality of extended floating point numbers may be accumulated, and the obtained intermediate floating point number and the next extended floating point number may be accumulated until each extended floating point number is accumulated to obtain the target floating point number.
In an optional implementation manner, in S102, the expanding the sign bit of each floating point number to be processed to a target bit width to obtain a plurality of expanded floating point numbers may include: selecting two floating point numbers to be processed from the floating point numbers to be processed, and expanding sign bits of the two floating point numbers to be processed to a target bit width to obtain two expanded floating point numbers.
During implementation, two floating point numbers to be processed can be randomly selected from the floating point numbers to be processed, or the processing sequence of the floating point numbers to be processed can be set, and the two floating point numbers to be processed are selected according to the processing sequence; and then expanding the sign bits of the two selected floating point numbers to be processed to a target bit width to obtain two expanded floating point numbers.
After obtaining the two extended floating-point numbers, S103 may include:
S1031, taking the two expanded floating point numbers as two current floating point numbers, and summing the two current floating point numbers in the target chip to obtain an intermediate floating point number;
S1032, selecting a floating point number to be processed from the floating point numbers to be processed which have not been selected; expanding the sign bit of the selected floating point number to be processed to the target bit width to obtain an expanded floating point number corresponding to the selected floating point number to be processed;
S1033, taking the expanded floating point number corresponding to the selected floating point number to be processed and the intermediate floating point number as two updated current floating point numbers, and returning to the step of summing the two current floating point numbers in the target chip to obtain an intermediate floating point number, until every floating point number to be processed among the plurality of floating point numbers to be processed has been summed;
S1034, taking the intermediate floating point number obtained after the last summation as the target floating point number.
In S1031, the two expanded floating point numbers may be used as two current floating point numbers, and the two current floating point numbers are summed in the target chip to obtain an intermediate floating point number.
In an alternative embodiment, the summing the two current floating-point numbers in the target chip to obtain an intermediate floating-point number may include:
Step A1, aligning an initial exponent of a first floating point number of the two current floating point numbers to an initial exponent of a second floating point number in the target chip; right-shifting the mantissa of the first floating point number by a target number of bits to obtain a first floating point number after order matching processing; wherein the initial exponent of the first floating point number is less than the initial exponent of the second floating point number of the two current floating point numbers; the target number of bits is the difference between the initial exponent of the second floating point number and the initial exponent of the first floating point number;
Step A2, summing the first floating point number after the order matching processing and the second floating point number in the target chip to obtain an intermediate floating point number.
In implementation, the first floating point number may be a floating point number with a smaller initial exponent of the two current floating point numbers; the second floating point number may be the floating point number of the two current floating point numbers whose initial exponent is larger. Before summing the two current floating point numbers, the first floating point number may be subjected to order matching processing to obtain the first floating point number subjected to order matching processing, so that the exponent of the first floating point number subjected to order matching processing is consistent with the initial exponent of the second floating point number.
The order matching processing of the first floating point number may include: aligning the initial exponent of the first floating point number to the initial exponent of the second floating point number in the target chip, and right-shifting the initial mantissa of the first floating point number by a target number of bits, the target number of bits being the difference between the initial exponent of the second floating point number and the initial exponent of the first floating point number. For example, if the initial exponent of the first floating point number is 10 and the initial exponent of the second floating point number is 15, the initial exponent of the first floating point number is aligned to 15, and the initial mantissa of the first floating point number is right-shifted by 5 bits to obtain the first floating point number after the order matching processing; that is, the exponent of the first floating point number after the order matching processing is 15.
The mantissa of the first floating point number after the order matching processing and the initial mantissa of the second floating point number are then added to obtain the intermediate floating point number.
In S1032, after the intermediate floating point number is obtained, one to-be-processed floating point number may be selected from the non-selected to-be-processed floating point numbers, and the selected to-be-processed floating point number may be a randomly selected to-be-processed floating point number or a to-be-processed floating point number selected according to a processing order. And then expanding the sign bit of the selected floating point number to be processed to the target bit width to obtain an expanded floating point number corresponding to the selected floating point number to be processed.
In S1033, the expanded floating point number corresponding to the selected floating point number to be processed and the intermediate floating point number may be used as the two updated current floating point numbers, and the process returns to S1031 until every floating point number to be processed among the plurality of floating point numbers to be processed has been summed.
In S1034, the intermediate floating point number obtained after the last addition processing may be used as the target floating point number.
The target floating point number is obtained by accumulating a plurality of floating point numbers to be processed for a plurality of times, and the sign bit of each floating point number to be processed is expanded to the target bit width before the floating point numbers to be processed are accumulated, so that the mantissa bit width of the intermediate floating point number obtained in the accumulation process does not exceed the target bit width, the generated intermediate floating point number does not need to be normalized in the accumulation process, the problem of data loss caused by normalization of the intermediate floating point number is solved, the precision of the intermediate floating point number is guaranteed, and the precision of the obtained target floating point number is higher.
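The loop S1031 to S1034, together with steps A1 and A2, can be sketched as follows. As a simplifying assumption, each number is modelled as `(exponent, mantissa)` with a signed integer mantissa; `add_two`, `fits_signed` and `accumulate` are hypothetical names. The sketch also checks that every intermediate mantissa fits in the target bit width, which is the property that makes intermediate normalization unnecessary.

```python
def fits_signed(value, width):
    """True if value is representable as a signed two's-complement width-bit integer."""
    return -(1 << (width - 1)) <= value < (1 << (width - 1))


def add_two(first, second):
    """Steps A1/A2: align the smaller exponent to the larger one, then add mantissas."""
    (exp_a, mant_a), (exp_b, mant_b) = first, second
    if exp_a > exp_b:  # make (exp_a, mant_a) the operand with the smaller exponent
        (exp_a, mant_a), (exp_b, mant_b) = (exp_b, mant_b), (exp_a, mant_a)
    target_bits = exp_b - exp_a
    return exp_b, (mant_a >> target_bits) + mant_b


def accumulate(numbers, target_width):
    """S1031-S1034: fold the operands one by one without normalizing intermediates."""
    pending = list(numbers)
    current = pending.pop(0)
    while pending:
        current = add_two(current, pending.pop(0))
        # The widened mantissa must hold every intermediate sum.
        assert fits_signed(current[1], target_width), "target bit width too small"
    return current


# Exponents 10 and 15: the first mantissa is shifted right by 5 bits before adding.
print(add_two((10, 0b1100000), (15, 0b101)))
# Five 12-bit signed mantissas at their maximum value need 15 bits for the sum.
print(accumulate([(0, 2047)] * 5, target_width=15))
```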
In one possible implementation, after the obtaining the target floating point number, the method further includes: and normalizing the target floating point number to obtain a target processing result, wherein the format of the target processing result is matched with a preset format.
During implementation, normalization processing is not needed to be carried out on intermediate floating point numbers obtained in the accumulation processing process of the plurality of expanded floating point numbers; after the target floating point number is obtained through accumulation, the target floating point number is normalized once to obtain a target processing result, so that the normalization processing times are reduced, and the delay and the power consumption of the accumulation process of processing a plurality of floating point numbers by the target chip are reduced; meanwhile, the problem of data loss caused by normalization processing can be relieved, and the precision of a target processing result is improved.
Floating point numbers are stored in the target chip in a normalized form in order to ensure the uniqueness of the floating point number representation in the target chip. Therefore, after the plurality of floating point numbers to be processed are accumulated to obtain the target floating point number, the obtained target floating point number is normalized to obtain a target processing result whose format matches the preset format. The preset format may be determined based on a set floating point number standard.
In an optional implementation, the normalizing the target floating-point number to obtain a target processing result may include: and normalizing the target floating point number by using one or more modes of a shifting mode, a truncation mode and a rounding mode to obtain a target processing result.
In implementation, the target floating point number may be normalized by using one or more of a shift manner, a truncate manner, and a rounding manner, so as to obtain a target processing result. The shifting mode can comprise left shifting, right shifting and the like; the truncation may be deleting data exceeding a specific number of bits, or the like. The target processing result may be a normalized floating point number.
Here, the target floating point number may be flexibly normalized by using one or more of a shift manner, a truncate manner, and a rounding manner, so as to obtain a target processing result.
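One possible combination of shifting, truncation and rounding is sketched below. The details are assumptions of this sketch rather than the method of the disclosure: the accumulated value is modelled as `mantissa * 2**exponent` with a signed integer mantissa, the result uses a sign bit plus a magnitude of exactly `width` bits, and excess low bits are rounded half up. The name `normalize` is hypothetical.

```python
def normalize(exponent, mantissa, width):
    """Return (sign, exponent, magnitude) with magnitude occupying exactly `width` bits.

    A zero mantissa is returned as (0, 0, 0).
    """
    sign = 1 if mantissa < 0 else 0
    magnitude = abs(mantissa)
    if magnitude == 0:
        return 0, 0, 0
    excess = magnitude.bit_length() - width
    if excess > 0:
        # Right shift: drop `excess` low bits, rounding half up.
        magnitude = (magnitude + (1 << (excess - 1))) >> excess
        exponent += excess
        if magnitude.bit_length() > width:
            # Rounding carried into a new top bit; shift once more.
            magnitude >>= 1
            exponent += 1
    elif excess < 0:
        # Left shift: bring the leading 1 up to the top bit position.
        magnitude <<= -excess
        exponent += excess
    return sign, exponent, magnitude


# A 14-bit accumulated mantissa squeezed back into 12 bits.
print(normalize(0, 10235, 12))
# A short negative mantissa shifted left to fill 4 bits.
print(normalize(3, -5, 4))
```

Replacing the rounding line with a plain `magnitude >>= excess` would give pure truncation instead.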
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same concept, an embodiment of the present disclosure further provides a floating point number processing apparatus. Fig. 2 is a schematic architecture diagram of the floating point number processing apparatus provided in the embodiment of the present disclosure. The floating point number processing apparatus includes: a selector 201 and an adder 202; wherein the adder 202 is connected with the selector 201;
the selector 201 is configured to expand a sign bit of each of the obtained multiple floating point numbers to be processed to a target bit width to obtain multiple expanded floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
the adder 202 is configured to perform accumulation processing on the plurality of extended floating point numbers to obtain a target floating point number.
In order to ensure the precision of the floating point number accumulation process and meet the bit width requirement of the intermediate floating point number obtained after the addition and the operation, the sign bit of each floating point number to be processed can be expanded to the target bit width by using the selector to obtain a plurality of expanded floating point numbers, and the expansion of the bit width of the floating point number to be processed is realized. And furthermore, in the process of accumulating the plurality of extended floating point numbers, the intermediate floating point numbers obtained in the process of accumulating are not required to be normalized, the times of normalization processing are reduced, and the delay and the power consumption of the chip in the process of accumulating the plurality of extended floating point numbers are reduced. Meanwhile, after the number of normalization processing is reduced, the problem of data loss caused by normalization processing can be relieved, and the precision of the target floating point number is improved.
In a possible embodiment, as shown in fig. 3, the device further comprises: a comparator 203; the comparator 203 is connected with the adder 202;
the adder 202 is further configured to input the target floating point number to the comparator;
the comparator 203 is configured to normalize the target floating point number to obtain a target processing result, where a format of the target processing result is matched with a preset format.
During implementation, normalization processing is not needed to be carried out on intermediate floating point numbers obtained in the accumulation processing process of the plurality of expanded floating point numbers; after the target floating point number is obtained through accumulation, the target floating point number is normalized once to obtain a target processing result, so that the normalization processing times are reduced, and the delay and the power consumption of the accumulation process of processing a plurality of floating point numbers by the target chip are reduced; meanwhile, the problem of data loss caused by normalization processing can be relieved, and the precision of a target processing result is improved.
In a possible embodiment, the apparatus further comprises: an order-matching operator 204, wherein the order-matching operator 204 is connected with the adder 202 and the selector 201, respectively; the order-matching operator 204 includes a subtractor 241 and a shifter 242;
the subtractor 241 is configured to determine a target difference between the initial exponent of each extended floating point number and the target exponent; wherein the target exponent is determined based on an initial exponent of a plurality of extended floating point numbers;
the shifter 242 is configured to align the initial exponent of each expanded floating point number to the target exponent, and right shift the mantissa of each expanded floating point number based on a target difference corresponding to the expanded floating point number to obtain a processed floating point number; inputting the processed floating point number to the adder;
the adder 202 is configured to add the mantissas of the processed floating point numbers to obtain a target floating point number.
In a possible implementation manner, when the sign bit of each of the obtained multiple floating point numbers to be processed is extended to a target bit width to obtain multiple extended floating point numbers, the selector 201 is configured to: selecting two floating point numbers to be processed from a plurality of floating point numbers to be processed, and expanding sign bits of the two floating point numbers to be processed to a target bit width to obtain two expanded floating point numbers;
the subtractor 241 is configured to take an initial exponent of a first floating point number of the two extended floating point numbers as a target exponent; determining a first difference between the target exponent and an initial exponent of a second floating point number of the two extended floating point numbers; wherein the initial exponent of the first floating point number is greater than the initial exponent of the second floating point number;
the shifter 242, configured to align the initial exponent of the second floating point number to the target exponent; based on the first difference, right shifting the mantissa of the second floating point number to obtain a processed second floating point number; inputting the first floating-point number and the processed second floating-point number to the adder;
the adder 202 is configured to add the first floating point number and the processed second floating point number to obtain an intermediate floating point number.
In a possible embodiment, the apparatus further comprises: a first register 205; the first register 205 is connected with the adder 202 and with the subtractor 241 of the order-matching operator 204, respectively;
the adder 202, configured to input the obtained intermediate floating-point number to the first register 205;
the first register 205 is configured to store the intermediate floating point number and the target floating point number sent by the adder, and send the intermediate floating point number to the subtractor 241.
In a possible embodiment, the apparatus further comprises a second register 206; the second register 206 is connected with the subtractor 241;
the second register 206 is configured to store a target exponent, and to send the target exponent to the subtractor 241 of the order-matching operator 204.
In a possible implementation, the selector 201 is further configured to: selecting a floating point number to be processed from the floating point numbers to be processed which are not selected; expanding the sign bit of the selected floating point number to be processed to a target bit width to obtain an expanded floating point number corresponding to the selected floating point number to be processed;
in response to the initial exponent of the extended floating point number corresponding to the selected floating point number to be processed being larger than the target exponent stored in the second register, taking the initial exponent of the extended floating point number corresponding to the selected floating point number to be processed as a new target exponent, and updating the second register with the new target exponent; the second register 206 is further configured to send the new target exponent to the subtractor;
the subtractor 241 is configured to determine a second difference between the exponent of the intermediate floating point number and the target exponent based on the received new target exponent;
the shifter 242 is configured to align the exponent of the intermediate floating point number to the target exponent, and right-shift the mantissa of the intermediate floating point number based on the second difference to obtain a processed intermediate floating point number; and to send the processed intermediate floating point number and the extended floating point number corresponding to the selected floating point number to be processed to the adder 202;
the adder 202 is configured to add the mantissa of the intermediate floating point number and the mantissa of the extended floating point number corresponding to the selected floating point number to be processed, so as to obtain a new intermediate floating point number; and sending the new intermediate floating point number to a first register; and taking the intermediate floating point number obtained after the last summation processing as a target floating point number, and sending the target floating point number to the first register 205.
In one possible implementation, when normalizing the target floating-point number to obtain a target processing result, the comparator 203 is configured to:
and normalizing the target floating point number by using one or more modes of a shifting mode, a truncation mode and a rounding mode to obtain a target processing result.
In a possible implementation, the target bit width is determined based on an initial bit width of an initial mantissa included in the plurality of floating point numbers to be processed and a total number of the plurality of floating point numbers to be processed.
The bit width of the adder and the first register is at least the target bit width. The bit width of the second register may be a maximum value among exponent bit widths of initial exponents included in the plurality of floating point numbers to be processed; that is, the exponent bit width of the initial exponent of each floating point number to be processed may be determined, and the maximum value of the exponent bit widths may be determined as the bit width of the second register.
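The interplay of the two registers, the subtractor, the shifter and the adder can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: operands are `(exponent, mantissa)` pairs with signed integer mantissas that the selector has already sign-extended, bit widths are not enforced, and the class and attribute names are hypothetical.

```python
class AccumulatorModel:
    """Behavioural model: first register holds the intermediate mantissa,
    second register holds the current target exponent."""

    def __init__(self):
        self.first_register = None    # intermediate (later: target) mantissa
        self.second_register = None   # target exponent

    def feed(self, exponent, mantissa):
        """Accept one sign-extended operand and update both registers."""
        if self.first_register is None:
            # The very first operand only initializes the registers.
            self.first_register, self.second_register = mantissa, exponent
            return
        if exponent > self.second_register:
            # New target exponent: the intermediate value is the one to shift.
            difference = exponent - self.second_register   # subtractor
            self.first_register >>= difference              # shifter
            self.second_register = exponent                 # register update
        else:
            # Target exponent unchanged: shift the incoming operand instead.
            difference = self.second_register - exponent    # subtractor
            mantissa >>= difference                         # shifter
        self.first_register += mantissa                     # adder

    def result(self):
        """Return (target_exponent, mantissa); normalization would follow."""
        return self.second_register, self.first_register


model = AccumulatorModel()
for operand in [(10, 96), (15, 5), (12, 16)]:
    model.feed(*operand)
print(model.result())
```

Feeding the first two operands one after the other is equivalent to the two-operand first step of the apparatus; the model merges them only to keep the code short.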
The floating point number processing method proposed in the above embodiments is described below by way of example with reference to Fig. 3. Assuming that the plurality of floating point numbers to be processed are a1, a2, a3, a4 and a5, the floating point number processing method may include the following steps:
Step one: the two floating point numbers to be processed that are accumulated first (a1 and a2) are input to the selector. The selector expands the sign bits of the two floating point numbers to be processed to the target bit width to obtain two expanded floating point numbers, and inputs the two expanded floating point numbers to the order-matching arithmetic unit.
Step two: the order-matching arithmetic unit performs order matching (exponent alignment) on the two expanded floating point numbers to obtain two expanded floating point numbers with aligned exponents, and inputs them to the adder. The larger of the initial exponents of the two expanded floating point numbers is taken as the target exponent and stored in the second register.
In implementation, the subtractor in the order-matching arithmetic unit determines the target difference between the initial exponent of the first floating point number and the target exponent. The shifter aligns the initial exponent of the first floating point number to the target exponent and right-shifts the mantissa of the first floating point number by the target number of bits to obtain the order-matched first floating point number. After order matching, the exponent of the first floating point number is the same as the exponent of the second floating point number. Here, the first floating point number is the one of the two expanded floating point numbers whose initial exponent is smaller than that of the second floating point number, and the target number of bits is the target difference between the initial exponent of the second floating point number (the target exponent) and the initial exponent of the first floating point number. The initial exponent of the second floating point number may be stored in the second register as the target exponent.
Step three: the adder adds the mantissas of the two exponent-aligned expanded floating point numbers to obtain an intermediate floating point number, and stores the intermediate floating point number in the first register.
Step four: the floating point number a3 to be processed is input to the selector. The selector expands the sign bit of a3 to the target bit width to obtain an expanded floating point number, and inputs the expanded floating point number to the order-matching arithmetic unit. The first register inputs the intermediate floating point number to the order-matching arithmetic unit.
Step five: the order-matching arithmetic unit performs order matching on the expanded floating point number corresponding to a3 and the intermediate floating point number to obtain two floating point numbers with aligned exponents, and inputs them to the adder.
In implementation, the initial exponent of the expanded floating point number corresponding to a3 is compared with the exponent of the intermediate floating point number (i.e., the target exponent stored in the second register). If the initial exponent of the expanded floating point number is smaller than the exponent of the intermediate floating point number, the initial exponent of the expanded floating point number is aligned to the exponent of the intermediate floating point number, and the mantissa of the expanded floating point number is right-shifted by the target number of bits (i.e., the difference between the exponent of the intermediate floating point number and the initial exponent of the expanded floating point number) to obtain the order-matched expanded floating point number. After order matching, the exponent of the expanded floating point number is the same as the exponent of the intermediate floating point number.
If the initial exponent of the expanded floating point number is larger than the exponent of the intermediate floating point number, the exponent of the intermediate floating point number is aligned to the initial exponent of the expanded floating point number, and the mantissa of the intermediate floating point number is right-shifted by the target number of bits (i.e., the difference between the initial exponent of the expanded floating point number and the exponent of the intermediate floating point number) to obtain the order-matched intermediate floating point number. The order-matched intermediate floating point number and the expanded floating point number are then two floating point numbers with aligned exponents.
If the initial exponent of the expanded floating point number is the same as the exponent of the intermediate floating point number, no order matching is needed, and the adder directly adds the mantissa of the expanded floating point number and the mantissa of the intermediate floating point number to obtain a new intermediate floating point number.
Step six: when the initial exponent of the expanded floating point number corresponding to a3 is greater than the exponent of the intermediate floating point number, the initial exponent of the expanded floating point number is taken as the updated target exponent, and the second register is updated with it.
Step seven: the adder adds the mantissas of the two exponent-aligned floating point numbers to obtain a new intermediate floating point number, and stores the new intermediate floating point number in the first register.
The floating point numbers a4 and a5 to be processed are then obtained in turn, and steps four to seven are repeated for each of them to obtain the target floating point number, which is the accumulated result of a1, a2, a3, a4 and a5.
Step eight: after the target floating point number is obtained, the first register may input the target floating point number to the comparator. The comparator normalizes the target floating point number to obtain a target processing result, where the format of the target processing result matches a preset format.
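Steps one to seven can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the circuit of the disclosure: each floating point number is represented as a hypothetical pair `(signed_mantissa, exponent)` whose value is `mantissa * 2**exponent`, Python integers stand in for the sign-extended mantissa of the target bit width, and the names `accumulate`, `first_register` and `second_register` are chosen here for illustration. The sketch seeds the registers with the first number instead of handling a1 and a2 together, which yields the same sequence of additions.

```python
def accumulate(values):
    """Accumulate (signed_mantissa, exponent) pairs without normalizing.

    Returns (mantissa, exponent) of the sum; value = mantissa * 2**exponent.
    """
    first_register = None   # models the first register: intermediate mantissa
    second_register = None  # models the second register: target exponent
    for mantissa, exponent in values:
        if first_register is None:
            # The first number initializes both registers.
            first_register, second_register = mantissa, exponent
            continue
        if exponent > second_register:
            # New number has the larger exponent: shift the intermediate
            # mantissa right and update the target exponent (step six).
            first_register >>= exponent - second_register
            second_register = exponent
        else:
            # Intermediate result has the larger (or equal) exponent:
            # shift the new mantissa right instead (a shift of 0 if equal).
            mantissa >>= second_register - exponent
        # Exponents are aligned, so the mantissas can be added (step seven).
        first_register += mantissa
    if first_register is None:
        raise ValueError("at least one floating point number is required")
    return first_register, second_register
```

For example, `accumulate([(16, 0), (4, 2), (8, 1), (1, 4), (2, 3)])` adds five numbers that each equal 16 and returns `(5, 4)`, i.e. 5 × 2^4 = 80. Normalization is deferred until all numbers have been added, matching step eight. The right shifts discard low-order bits, so inputs with significant low bits would lose precision in this simplified model.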
Based on the same concept, an embodiment of the present disclosure further provides a chip. As shown in Fig. 4, the chip includes a floating point number processing device 401 according to any of the above embodiments and a memory 402;
the memory 402 is used for storing a plurality of floating point numbers to be processed;
the floating point number processing device 401 is used for processing the plurality of floating point numbers to be processed to obtain a target floating point number.
Based on the same concept, an embodiment of the present disclosure further provides a floating point number processing apparatus. Fig. 5 is a schematic architecture diagram of the floating point number processing apparatus provided in the embodiment of the present disclosure, which includes an obtaining module 501, an extension module 502 and a first processing module 503. Specifically:
an obtaining module 501, configured to obtain a plurality of floating point numbers to be processed that are to be operated on in a target chip;
an extension module 502, configured to extend the sign bit of each to-be-processed floating point number to a target bit width to obtain a plurality of extended floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
the first processing module 503 is configured to perform accumulation processing on the multiple extended floating point numbers in the target chip to obtain a target floating point number.
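The sign-bit expansion performed by the extension module can be illustrated with a small sketch. It assumes that the mantissa is held as a two's-complement bit pattern, which this section does not state explicitly; the function name `sign_extend` and its parameters are hypothetical.

```python
def sign_extend(value: int, from_bits: int, to_bits: int) -> int:
    """Sign-extend a two's-complement bit pattern from from_bits to to_bits."""
    if from_bits < 1 or to_bits < from_bits:
        raise ValueError("require 1 <= from_bits <= to_bits")
    value &= (1 << from_bits) - 1          # keep only the original bits
    sign = value >> (from_bits - 1)        # most significant bit is the sign
    if sign:
        # Replicate the sign bit into every newly added high-order bit.
        value |= ((1 << (to_bits - from_bits)) - 1) << from_bits
    return value
```

With this sketch, `sign_extend(0b1011, 4, 8)` gives `0b11111011`, while a non-negative pattern such as `0b0011` is unchanged apart from the added leading zeros. The extra high-order bits give the later additions room to grow without overflowing.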
In one possible implementation, the apparatus further includes:
a second processing module 504, configured to normalize the target floating point number after it is obtained, to obtain a target processing result, where the format of the target processing result matches a preset format.
In a possible implementation, the extension module 502 is configured to determine the target bit width according to the following steps:
determining an adjustment bit width corresponding to the plurality of floating point numbers to be processed based on an initial bit width of initial mantissas included in the plurality of floating point numbers to be processed;
and determining the target bit width based on the adjustment bit width and the total number of the floating point numbers to be processed.
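This passage does not give a formula for the target bit width, so the following is only one plausible reading, offered as an assumption. It takes the adjustment bit width to be the stored mantissa width plus an implicit leading bit and a sign bit, and adds ceil(log2(N)) headroom bits so that summing N numbers cannot overflow. The function name `target_bit_width` is hypothetical.

```python
def target_bit_width(mantissa_bits: int, count: int) -> int:
    """Hypothetical rule: adjustment bit width plus headroom for `count` additions."""
    if mantissa_bits < 1 or count < 1:
        raise ValueError("mantissa_bits and count must be at least 1")
    # Assumed adjustment bit width: stored mantissa + implicit leading 1 + sign bit.
    adjustment_width = mantissa_bits + 2
    # ceil(log2(count)) extra bits, computed with integer arithmetic only.
    headroom = (count - 1).bit_length()
    return adjustment_width + headroom
```

Under these assumptions, five numbers with 23-bit stored mantissas would need 23 + 2 + 3 = 28 bits. An actual implementation may choose the adjustment bit width differently.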
In a possible implementation manner, when the sign bit of each floating point number to be processed is extended to the target bit width to obtain a plurality of extended floating point numbers, the extension module 502 is configured to:
selecting two floating point numbers to be processed from a plurality of floating point numbers to be processed, and expanding sign bits of the two floating point numbers to be processed to a target bit width to obtain two expanded floating point numbers;
the first processing module 503, when performing accumulation processing on the plurality of extended floating point numbers in the target chip to obtain a target floating point number, is configured to:
taking the two expanded floating point numbers as two current floating point numbers, and adding the two current floating point numbers in the target chip to obtain an intermediate floating point number;
selecting a floating point number to be processed from the floating point numbers to be processed which are not selected; expanding the sign bit of the selected floating point number to be processed to a target bit width to obtain an expanded floating point number corresponding to the selected floating point number to be processed;
taking the extended floating point number and the intermediate floating point number corresponding to the selected floating point number to be processed as two updated current floating point numbers, returning to the step of adding the two current floating point numbers in the target chip to obtain the intermediate floating point number, and adding the two current floating point numbers until each floating point number to be processed in the floating point numbers to be processed is added;
and taking the intermediate floating point number obtained after the last addition processing as the target floating point number.
In a possible implementation manner, the first processing module 503, when the two current floating point numbers are summed in the target chip to obtain an intermediate floating point number, is configured to:
aligning, in the target chip, the initial exponent of a first floating point number of the two current floating point numbers to the initial exponent of a second floating point number; right-shifting the mantissa of the first floating point number by a target number of bits to obtain the order-matched first floating point number; wherein the initial exponent of the first floating point number is less than the initial exponent of the second floating point number; the target number of bits is the difference between the initial exponent of the second floating point number and the initial exponent of the first floating point number;
and adding the first floating point number and the second floating point number subjected to the order matching processing in the target chip to obtain an intermediate floating point number.
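A single order-matching-and-add operation on two current floating point numbers might be modeled as below. The `(signed_mantissa, exponent)` representation with value `mantissa * 2**exponent` and the name `align_and_add` are assumptions made for illustration; Python's `>>` on integers is an arithmetic right shift, which stands in for the shifter.

```python
def align_and_add(a, b):
    """Add two (signed_mantissa, exponent) pairs after aligning their exponents."""
    (mant_a, exp_a), (mant_b, exp_b) = a, b
    if exp_a < exp_b:
        # a has the smaller exponent: shift its mantissa right by the difference.
        mant_a >>= exp_b - exp_a
        exp_a = exp_b
    elif exp_b < exp_a:
        # b has the smaller exponent: shift its mantissa right by the difference.
        mant_b >>= exp_a - exp_b
    # Both mantissas now refer to the same exponent and can be added directly.
    return mant_a + mant_b, exp_a
```

For instance, `align_and_add((12, 0), (3, 2))` returns `(6, 2)`, which represents 24 = 12 + 12. When the shifted-out bits are non-zero, as in `align_and_add((3, 2), (13, 0))`, the result `(6, 2)` is 24 rather than the exact 25, which shows why the mantissa width matters.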
In a possible implementation manner, when normalizing the target floating-point number to obtain a target processing result, the second processing module 504 is configured to:
and normalizing the target floating point number by using one or more of shifting, truncation and rounding to obtain a target processing result.
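As an illustration of how shifting, truncation and rounding can combine during normalization, the sketch below converts a wide signed mantissa into a sign bit plus a magnitude of `out_bits` bits. It is a simplified model rather than the normalization of the disclosure: the name `normalize`, the sign-magnitude output and the round-half-up rule are assumptions, and exponent range checks are omitted.

```python
def normalize(mantissa: int, exponent: int, out_bits: int, rounding: bool = True):
    """Return (sign, magnitude, exponent) with magnitude occupying out_bits bits."""
    if out_bits < 1:
        raise ValueError("out_bits must be at least 1")
    sign = 1 if mantissa < 0 else 0
    magnitude = abs(mantissa)
    if magnitude == 0:
        return (0, 0, 0)
    shift = magnitude.bit_length() - out_bits
    if shift > 0:
        dropped = magnitude & ((1 << shift) - 1)  # bits that will be cut off
        magnitude >>= shift                        # shift right and truncate
        if rounding and dropped >= (1 << (shift - 1)):
            magnitude += 1                         # round half up (assumed rule)
            if magnitude.bit_length() > out_bits:
                magnitude >>= 1                    # rounding carried out of range
                shift += 1
    elif shift < 0:
        magnitude <<= -shift                       # shift left to fill out_bits
    return (sign, magnitude, exponent + shift)
```

With `out_bits=4`, the mantissa `0b101111` (47) rounds to `(0, 12, 2)`, i.e. 48, whereas with `rounding=False` it truncates to `(0, 11, 2)`, i.e. 44.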
In some embodiments, the functions of the apparatus provided in the embodiments of the present disclosure, or the modules it includes, may be used to execute the method described in the above method embodiments. For the specific implementation, reference may be made to the description of the above method embodiments, which is not repeated here for brevity.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. Fig. 6 is a schematic structural diagram of the electronic device 600 provided in the embodiment of the present disclosure, which includes a processor 601, a memory 602 and a bus 603. The memory 602 is used for storing execution instructions and includes an internal memory 6021 and an external memory 6022. The internal memory 6021 temporarily stores operation data of the processor 601 and data exchanged with the external memory 6022, such as a hard disk; the processor 601 exchanges data with the external memory 6022 through the internal memory 6021. When the electronic device 600 operates, the processor 601 communicates with the memory 602 through the bus 603, so that the processor 601 executes the following instructions:
acquiring a plurality of floating point numbers to be processed that are to be operated on in a target chip;
expanding the sign bit of each floating point number to be processed to a target bit width to obtain a plurality of expanded floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
and accumulating the plurality of expanded floating point numbers in the target chip to obtain a target floating point number.
The specific processing flow of the processor 601 may refer to the description of the above method embodiment, and is not described herein again.
Furthermore, the embodiments of the present disclosure also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the steps of the floating point number processing method in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product, where the computer program product carries a program code, and instructions included in the program code may be used to execute the steps of the floating point number processing method in the foregoing method embodiments, which may be referred to specifically in the foregoing method embodiments, and are not described herein again.
The computer program product may be implemented by hardware, software or a combination thereof. In one alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the above-described apparatus and the specific working process of the apparatus may refer to the corresponding process in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another apparatus, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above are only specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A floating point number processing apparatus, the apparatus comprising: a selector and an adder; wherein the adder is connected with the selector;
the selector is used for expanding the sign bit of each floating point number to be processed in the acquired floating point numbers to a target bit width to obtain a plurality of expanded floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
and the adder is used for accumulating the plurality of extended floating point numbers to obtain a target floating point number.
2. The apparatus of claim 1, further comprising: a comparator; the comparator is connected with the adder;
the adder is further used for inputting the target floating point number to the comparator;
and the comparator is used for normalizing the target floating point number to obtain a target processing result, wherein the format of the target processing result is matched with a preset format.
3. The apparatus of claim 1 or 2, further comprising: an order-matching arithmetic unit; the order-matching arithmetic unit is connected with the adder and the selector respectively; the order-matching arithmetic unit comprises a subtractor and a shifter;
the subtracter is used for determining a target difference value between the initial exponent of each expanded floating point number and the target exponent; wherein the target exponent is determined based on an initial exponent of a plurality of extended floating point numbers;
the shifter is used for aligning the initial exponent of each expanded floating point number to the target exponent, and right-shifting the mantissa of each expanded floating point number based on the target difference corresponding to the expanded floating point number to obtain a processed floating point number; inputting the processed floating point number to the adder;
and the adder is used for accumulating the mantissas of the processed floating point numbers to obtain a target floating point number.
4. The apparatus according to claim 3, wherein the selector, when expanding a sign bit of each of the obtained plurality of floating point numbers to be processed to a target bit width to obtain a plurality of expanded floating point numbers, is configured to: selecting two floating point numbers to be processed from a plurality of floating point numbers to be processed, and expanding sign bits of the two floating point numbers to be processed to a target bit width to obtain two expanded floating point numbers;
the subtractor is used for taking an initial exponent of a first floating point number of the two expanded floating point numbers as a target exponent; determining a first difference between the target exponent and an initial exponent of a second floating point number of the two extended floating point numbers; wherein the initial exponent of the first floating point number is greater than the initial exponent of the second floating point number;
the shifter is used for aligning the initial exponent of the second floating point number to the target exponent; right-shifting the mantissa of the second floating point number based on the first difference to obtain a processed second floating point number; and inputting the first floating point number and the processed second floating point number to the adder;
and the adder is used for adding the first floating point number and the processed second floating point number to obtain an intermediate floating point number.
5. The apparatus of claim 4, further comprising: a first register; the first register is respectively connected with the adder and the subtracter of the order-matching arithmetic unit;
the adder is used for inputting the obtained intermediate floating point number to the first register;
and the first register is used for storing the intermediate floating point number and the target floating point number sent by the adder and sending the intermediate floating point number to the subtracter.
6. The apparatus of claim 5, further comprising a second register; the second register is connected with the subtracter;
the second register is used for storing the target exponent, and sending the target exponent to the subtractor of the order-matching arithmetic unit.
7. The apparatus of claim 6, wherein the selector is further configured to: selecting a floating point number to be processed from the floating point numbers to be processed which are not selected; expanding the sign bit of the selected floating point number to be processed to a target bit width to obtain an expanded floating point number corresponding to the selected floating point number to be processed;
in response to that the initial exponent of the extended floating point number corresponding to the selected floating point number to be processed is larger than the target exponent stored in the second register, taking the initial exponent of the extended floating point number corresponding to the selected floating point number to be processed as a new target exponent, and updating the second register by using the new target exponent; the second register is also used for sending the new target index to the subtracter;
the subtractor is used for determining a second difference value between the exponent of the intermediate floating point number and the target exponent based on the received new target exponent;
the shifter is used for aligning the exponent of the middle floating point number to the target exponent and right shifting the mantissa of the middle floating point number based on the second difference value to obtain a processed middle floating point number; sending the processed intermediate floating point number and the selected expanded floating point number corresponding to the floating point number to be processed to the adder;
the adder is used for adding the mantissa of the intermediate floating point number and the mantissa of the expanded floating point number corresponding to the selected floating point number to be processed to obtain a new intermediate floating point number; and sending the new intermediate floating point number to a first register; and taking the intermediate floating point number obtained after the last addition as a target floating point number, and sending the target floating point number to a first register.
8. The apparatus of claim 2, wherein the comparator, when normalizing the target floating point number to obtain a target processing result, is configured to:
and normalizing the target floating point number by using one or more modes of a shifting mode, a truncation mode and a rounding mode to obtain a target processing result.
9. The apparatus of any of claims 1 to 8, wherein the target bit width is determined based on an initial bit width of an initial mantissa included in the plurality of floating point numbers to be processed and a total number of the plurality of floating point numbers to be processed.
10. A floating point number processing method, comprising:
acquiring a plurality of floating point numbers to be processed which are operated in a target chip;
expanding the sign bit of each floating point number to be processed to a target bit width to obtain a plurality of expanded floating point numbers; wherein the target bit width matches the total number of floating point numbers to be processed;
and accumulating the plurality of expanded floating point numbers in the target chip to obtain a target floating point number.
11. A chip, comprising: the floating point number processing device of any of claims 1 to 9, and a memory;
the memory is used for storing a plurality of floating point numbers to be processed;
and the floating point number processing device is used for processing the floating point numbers to be processed to obtain target floating point numbers.
12. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the floating point number processing method of any one of claims 10 to 15; alternatively, a chip as claimed in claim 11.
13. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the floating-point number processing method as claimed in claim 10.
CN202111667694.0A 2021-12-31 2021-12-31 Floating point number processing device, floating point number processing method, electronic equipment, storage medium and chip Pending CN114296682A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111667694.0A CN114296682A (en) 2021-12-31 2021-12-31 Floating point number processing device, floating point number processing method, electronic equipment, storage medium and chip
PCT/CN2022/124517 WO2023124372A1 (en) 2021-12-31 2022-10-11 Floating-point number processing apparatus and method, electronic device, storage medium, and chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111667694.0A CN114296682A (en) 2021-12-31 2021-12-31 Floating point number processing device, floating point number processing method, electronic equipment, storage medium and chip

Publications (1)

Publication Number Publication Date
CN114296682A true CN114296682A (en) 2022-04-08

Family

ID=80974349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111667694.0A Pending CN114296682A (en) 2021-12-31 2021-12-31 Floating point number processing device, floating point number processing method, electronic equipment, storage medium and chip

Country Status (2)

Country Link
CN (1) CN114296682A (en)
WO (1) WO2023124372A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114968170A (en) * 2022-06-24 2022-08-30 北京百度网讯科技有限公司 Method for generating fixed sum of floating point number, related device and computer program product
WO2023124372A1 (en) * 2021-12-31 2023-07-06 上海商汤智能科技有限公司 Floating-point number processing apparatus and method, electronic device, storage medium, and chip

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10061561B2 (en) * 2016-09-07 2018-08-28 Arm Limited Floating point addition with early shifting
CN108255777B (en) * 2018-01-19 2021-08-06 中国科学院电子学研究所 Embedded floating point type DSP hard core structure for FPGA
CN114296682A (en) * 2021-12-31 2022-04-08 上海阵量智能科技有限公司 Floating point number processing device, floating point number processing method, electronic equipment, storage medium and chip

Also Published As

Publication number Publication date
WO2023124372A1 (en) 2023-07-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40062799

Country of ref document: HK