CN112667197B - Parameterized addition and subtraction operation circuit based on POSIT floating point number format - Google Patents



Publication number
CN112667197B
Authority
CN
China
Prior art keywords
floating point
point number
operation result
mantissa
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011601975.1A
Other languages
Chinese (zh)
Other versions
CN112667197A (en)
Inventor
廖琳
谭洪舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202011601975.1A
Publication of CN112667197A
Application granted
Publication of CN112667197B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a parameterized addition and subtraction operation circuit based on the POSIT floating point number format and relates to the technical field of computers. The circuit is composed of a data input unit, a decoding unit, a scale determining unit, a mantissa processing unit, an MSB unit, a result encoding unit and a result selecting output unit.

Description

Parameterized addition and subtraction operation circuit based on POSIT floating point number format
Technical Field
The invention relates to the technical field of computers, in particular to a parameterized addition and subtraction operation circuit based on a POSIT floating point number format.
Background
Floating-point hardware arithmetic units are an indispensable part of many CPUs and computing platforms. The most common floating-point format today is IEEE 754, which defines four formats of different precision to meet different data-precision requirements. Floating-point arithmetic units consume substantial hardware resources: taking the lowest-precision case, a 32-bit single-precision floating-point processing unit, as an example, it consumes about 50% of the chip area and energy in a simple sequential scalar processing element, whereas a double-precision unit has been observed to reach 60% to 70%. In addition, the limited precision and the high complexity of the IEEE 754 format are significant problems in floating-point computing.
Compared with the IEEE 754 format, the POSIT floating-point format has a flexible and simple encoding, allows the precision to be chosen as required, and offers higher precision than IEEE 754. If the POSIT data format is used for hardware calculation, fewer hardware resources are occupied and higher data precision can be obtained. Chinese patent application CN111538472A, disclosed on August 14, 2020, describes an arithmetic processor and an arithmetic processing system for Posit floating-point numbers. It performs addition, subtraction and multiplication directly on the intermediate data in complement form obtained from a decoding circuit, and feeds the operation result, also represented as intermediate data in complement form, directly into an encoding circuit, which converts it into a Posit floating-point number. However, it cannot support parameterized addition and subtraction operations for Posit formats with different bit widths.
Disclosure of Invention
The invention provides a parameterized addition and subtraction operation circuit based on a POSIT floating point number format, which aims to overcome the technical defect that the parameterized addition and subtraction operation of the POSIT floating point number format cannot be satisfied in the existing POSIT floating point addition and subtraction operation.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a parameterized addition and subtraction operation circuit based on POSIT floating point number format comprises a data input unit, a decoding unit, a scale determining unit, a mantissa processing unit, an MSB unit, a result encoding unit and a result selecting output unit; wherein:
the data input unit is used for inputting a first floating point number and a second floating point number, both in POSIT format; the total bit width of each of the first floating point number and the second floating point number is N, and the maximum allowed width of the exponent part is ES;
the decoding unit acquires different fields in the POSIT for the first floating point number and the second floating point number, and determines the values of a sign segment, an exponent segment and a mantissa segment of the first floating point number and the values of the sign segment, the exponent segment and the mantissa segment of the second floating point number;
the scale determining unit is used for receiving specific fields of the first floating point number and the second floating point number, obtaining the sizes of scales of floating point values of the first floating point number and the second floating point number according to the specific fields, and determining larger scales according to the calculated sizes of scales of the first floating point number and the second floating point number; sending the difference diff between the two scales to the mantissa processing unit; meanwhile, the scale determining unit receives the shift size of the mantissa part and the carry flag to determine whether to adjust the larger scale, thereby determining a first operation result;
the mantissa processing unit is used for receiving mantissa fields of the first floating point number and the second floating point number, correspondingly adjusting the mantissa field of the smaller floating point number according to the size relation of the two floating point number scales of the scale determining unit and the difference value between the two floating point number scales, and obtaining a second operation result according to the adjusted mantissa fields of the first floating point number and the second floating point number; meanwhile, normalizing the second operation result according to the first operation result to obtain a second operation result of the mantissa part meeting the POSIT format standard as a third operation result;
the MSB unit is used for determining the most significant bit of the second operation result and outputting the most significant bit of the second operation result to the scale determination unit;
the result encoding unit is used for receiving the second operation result, the third operation result and the sign field from the decoding unit, and encoding them to form a target operation result, which is output by the result selection output unit.
In one possible implementation, the decoding units include a first decoding unit and a second decoding unit; wherein:
the first decoding unit is used for acquiring the different fields of the first floating point number in the POSIT format and determining the values of its sign field, exponent field and mantissa field;
the second decoding unit is used for acquiring the different fields of the second floating point number in the POSIT format and determining the values of its sign field, exponent field and mantissa field.
In one possible implementation, the scale determination unit includes a splicer, a comparator, and an adder; the first operation result determining process is as follows:
determining the scale value of a single floating point number, wherein the specific calculation formula is as follows:
scale = {regime, exp},
wherein scale represents the scaling factor of the floating point number represented by the POSIT code, regime represents the value of the regime field in the POSIT, exp is the value of the exponent field, and { } represents the splicing device, i.e. regime and exp are concatenated by the splicing device; mathematically, this concatenation is equivalent to the following formula:
scale = 2^{ES} × regime + exp
here, ES represents the width of the exponent (exp) portion;
after the sizes of the scale of the first floating point number and the second floating point number are obtained, the sizes of the scale parts of the first floating point number and the second floating point number are obtained through a comparator, then the larger scale part is sent into an adder to serve as one operand, the adder finally performs the following operation formula, and the scale of the final result is determined in this way:
scale = scale_max + Flag - Lshift,
wherein scale_max represents the larger of the scale values of the first floating point number and the second floating point number, Flag represents the most significant bit of the second operation result output by the MSB unit, and Lshift represents the parameter from the mantissa normalizing unit, i.e. the number of bits by which the mantissa is shifted;
the scale herein is the first operation result, and represents the scale value of the target operation result.
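As an illustration only, the two scale formulas above can be sketched in Python. The function names `concat_scale` and `final_scale` are hypothetical and do not appear in the patent; integers stand in for the hardware signals, and regime is assumed to be a signed integer.

```python
def concat_scale(regime: int, exp: int, es: int) -> int:
    """scale = {regime, exp}: regime in the upper bits, exp in the lowest ES bits.

    Mathematically this is 2**ES * regime + exp (regime may be negative).
    """
    assert 0 <= exp < (1 << es), "exp is an unsigned ES-bit value"
    return regime * (1 << es) + exp


def final_scale(scale_a: int, scale_b: int, flag: int, lshift: int):
    """Return (scale of the result, diff between the two input scales).

    flag   : most significant bit reported by the MSB unit (0 or 1)
    lshift : left-shift amount reported by the mantissa normalizing unit
    """
    scale_max = max(scale_a, scale_b)      # comparator picks the larger scale
    diff = abs(scale_a - scale_b)          # sent to the mantissa processing unit
    return scale_max + flag - lshift, diff


print(concat_scale(2, 1, 2))      # regime = 2, exp = 1, ES = 2
print(final_scale(9, 7, 1, 0))    # mantissa sum overflowed: scale grows by one
```

The concatenation and the arithmetic form agree for negative regime values as well, because the exponent bits are always unsigned.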
In one possible implementation manner, the second operation result determining process specifically includes:
assuming, according to the scale comparison in the scale determination unit, that the scale value of the first floating point number is the larger one, Ma represents the mantissa portion of the first floating point number from the decoding unit and Mb represents the mantissa portion of the second floating point number; the difference diff between the scales of the two floating point numbers also comes from the scale determination unit; the mantissa portion Mb of the second floating point number, which has the smaller scale, is first processed according to the following formula:
M_b = M_b >> diff,
Ma and the processed Mb are then added or subtracted to obtain the X, Y and Z parts in the following formula:
M_a ± M_b = XY.Z,
wherein 1 is the hidden bit of the mantissa part, Ma represents the mantissa part of the first floating point number, Mb represents the mantissa part of the second floating point number, X represents the highest bit of the second operation result, Y represents the next highest bit of the second operation result, and Z represents the remaining mantissa part of the second operation result;
if X is 1, the sum of the mantissa parts of the first floating point number and the second floating point number has overflowed, and the second operation result is obtained according to the following formula:
add_M = YZ,
if X is 0, the mantissa portions of the first floating point number and the second floating point number were subtracted and no overflow occurred; in this case the second operation result is obtained according to the following formula:
add_M = XY.Z << 1,
here, add_M is the second operation result, representing the mantissa portion of the result.
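The alignment and add/subtract step can be sketched as follows. This is a minimal model under stated assumptions, not the patented circuit: mantissas are modelled as unsigned integers of `width` bits laid out as 1.F (hidden bit on top), the result register is assumed to be `width + 1` bits wide (XY.Z), and the name `add_sub_mantissas` is hypothetical.

```python
def add_sub_mantissas(ma: int, mb: int, diff: int, width: int, subtract: bool = False):
    """Align Mb to Ma's scale, add or subtract, and return (X, add_M).

    ma, mb : `width`-bit unsigned mantissas with the hidden bit included;
             ma belongs to the operand with the larger scale.
    X      : top bit of the (width + 1)-bit result XY.Z.
    add_M  : the result register handed to normalization (assumed layout).
    """
    mb_aligned = mb >> diff                       # M_b = M_b >> diff
    if subtract:
        # Subtract the smaller value from the larger one, as a comparator would.
        raw = max(ma, mb_aligned) - min(ma, mb_aligned)
    else:
        raw = ma + mb_aligned
    x = (raw >> width) & 1                        # X: weight 2**1
    if x:
        add_m = raw                               # X acts as the new hidden bit
    else:
        add_m = (raw << 1) & ((1 << (width + 1)) - 1)   # XY.Z << 1
    return x, add_m


# 1.5 + 1.0 = 2.5 overflows into X; 1.5 - 1.0 = 0.5 does not.
print(add_sub_mantissas(0b1100, 0b1000, 0, 4))
print(add_sub_mantissas(0b1100, 0b1000, 0, 4, subtract=True))
```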
In one possible implementation, the mantissa processing unit includes a mantissa normalization unit; the mantissa normalization unit normalizes the second operation result according to the first operation result to obtain the second operation result of the mantissa part meeting the POSIT format standard as a third operation result.
And in the mantissa normalization unit, normalizing the second operation result to obtain a second operation result of the mantissa part meeting the POSIT format standard, sending an intermediate parameter Lshift used in normalization to the scale determination unit to be added into the adjustment process of the first operation result, and sending the normalized result to the result encoding unit as a third operation result.
In one possible implementation manner, the third operation result is determined by the mantissa normalization unit, and is specifically expressed as:
add_ML = add_M << LOD(add_M),
wherein add_M represents the second operation result, and LOD represents the detection circuit in the mantissa normalizing unit that locates the first 1 bit in the add_M code, i.e. counts its leading zeros; the shift device in the mantissa normalizing unit then shifts the second operation result left by the value output by the LOD;
here, frac, the fraction part of add_ML, is the third operation result and represents the mantissa value of the target operation result.
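A software sketch of this normalization step is given below. It assumes a register of fixed `width` bits and uses a simple loop in place of the hardware detection circuit; the names `count_leading_zeros` and `normalize` are hypothetical.

```python
def count_leading_zeros(value: int, width: int) -> int:
    """Number of zero bits above the first 1 in a `width`-bit word."""
    for i in range(width):
        if (value >> (width - 1 - i)) & 1:
            return i
    return width                      # all bits are zero


def normalize(add_m: int, width: int):
    """Shift add_M left until its first 1 sits in the top bit.

    Returns (add_ML, Lshift); Lshift is what would be fed back to the
    scale determination unit.
    """
    lshift = count_leading_zeros(add_m, width)
    add_ml = (add_m << lshift) & ((1 << width) - 1)
    return add_ml, lshift


print(normalize(0b01000, 5))   # one leading zero is shifted out
```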
In a possible implementation, the circuit further comprises a special result processing unit; the special result processing unit determines whether the first floating point number and the second floating point number are 0 or infinity according to the decoding results of the first decoding unit and the second decoding unit so as to determine the special result of operation, and the special result directly enters the result encoding unit.
In the result selection output unit, if a special result is received, outputting the special result; otherwise, outputting the result of the result encoding unit.
In the special result processing unit, if the first floating point number and the second floating point number are both 0, the result of addition and subtraction operation is 0; if one of the first floating point number and the second floating point number is 0, the operation result of addition and subtraction is the other floating point number; if any one of the first floating point number and the second floating point number is infinite, the operation result of addition and subtraction is infinite.
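The three rules of the special result processing unit can be sketched as below. The bit patterns are an assumption taken from the usual posit convention (zero is all 0s, infinity is a 1 followed by 0s); the patent text here does not spell them out. The function name `special_result` is hypothetical, and for subtraction the second operand is assumed to have been negated beforehand.

```python
def special_result(a: int, b: int, n: int):
    """Return the special result as an N-bit code, or None if no rule applies.

    Assumed encodings: zero = 0...0, infinity = 1 followed by N-1 zeros.
    """
    mask = (1 << n) - 1
    inf = 1 << (n - 1)
    a &= mask
    b &= mask
    if a == inf or b == inf:
        return inf                    # any infinite operand gives infinity
    if a == 0 and b == 0:
        return 0                      # both zero gives zero
    if a == 0:
        return b                      # one zero: result is the other operand
    if b == 0:
        return a
    return None                       # ordinary operands: use the normal path


print(special_result(0x00, 0x40, 8))
```

Returning `None` models the result selection output unit falling through to the output of the result encoding unit.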
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
the invention provides a parameterized addition and subtraction operation circuit based on the POSIT floating point number format. It abandons the original IEEE 754 floating point format and performs addition and subtraction in the new POSIT floating point data format, whose representation of floating point numbers is more flexible and simpler. Addition and subtraction based on the POSIT format save hardware resources on the hardware platform, and the operation precision can be improved at the same bit width.
Drawings
FIG. 1 is a schematic diagram of the POSIT floating point data format of the present application;
FIG. 2 is a schematic diagram of a parameterized addition and subtraction circuit based on Posit data format in the example of the present application;
FIG. 3 is a schematic diagram showing the steps of a sub-module decoding unit of a parameterized addition and subtraction circuit based on Posit data format in the example of the present application;
FIG. 4 is a schematic diagram of another example of a parameterized addition and subtraction circuit based on Posit data format;
wherein: 1. a data input unit; 2. a decoding unit; 21. a first decoding unit; 22. a second decoding unit; 3. a scale determination unit; 4. a mantissa processing unit; 41. a mantissa normalization unit; 5. MSB units; 6. a result encoding unit; 7. a result selection output unit; 8. a special result processing unit.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;
for the purpose of better illustrating the embodiments, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the actual product dimensions;
it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical scheme of the invention is further described below with reference to the accompanying drawings and examples.
Unums (universal numbers) are an arithmetic and binary representation format for real numbers, similar to floating point numbers, proposed by John L. Gustafson to replace the now commonly used IEEE 754. POSIT is the third version of Unum, namely the Type III unum. Like IEEE 754, a posit represents a real number in scientific notation, and if the number cannot be represented exactly, it is rounded to the nearest representable value.
The Posit format is hardware friendly and easy to implement in hardware. As shown in FIG. 1, a posit is divided into a sign field, a regime field, an exponent field and a fraction field. A Posit number is specified by its total bit width N and the maximum width ES that the exponent can occupy; the widths of the remaining fields vary mainly with the width occupied by the regime field. The individual fields are analysed below:
The sign field: indicates whether the floating point number represented by the posit code is positive or negative, and is located at the most significant position of the code.
The regime field: immediately follows the sign field. It consists either of a run of consecutive 1s terminated by an inverted bit 0 in the least significant position, or of a run of consecutive 0s terminated by an inverted bit 1. A further case is a run of 1s that is terminated by the bit-width limit rather than by an inverted bit 0. A regime field with a run of consecutive 1s is a positive regime, whose value is the number of consecutive 1s minus 1; a regime field with a run of consecutive 0s is a negative regime, whose value is the negative of the number of 0s in the run. For example, the regime field 00001 represents the value -4, and the regime field 111110 represents the value 4. If a posit code with total bit width N=8 is 01111111, its regime field is 1111111, which represents the value 6; it is terminated not by an inverted bit but by the bit-width limit.
The exponent field: a maximum width ES is agreed in advance for a given Posit number. If the regime field is terminated by an inverted bit and bits remain within the total bit width, the regime field is immediately followed by the exponent field, whose width cannot exceed the agreed ES bits. This field is an unsigned integer and forms part of the exponent of the floating point number represented by the Posit.
The fraction field: the mantissa portion immediately follows the exponent field, and is present only if bits of the total bit width remain after the ES-bit exponent field has been encoded. That is, the width of the mantissa field depends entirely on how much of the total bit width remains after the other fields have been encoded. Notably, the mantissa portion has an integer hidden bit of 1, which is not stored in the Posit encoding but is implicitly present, so the mantissa portion represents a real number greater than or equal to 1.0 and less than 2. For example, when the mantissa field is 0010, the floating point value represented is:
1 + 0×2^{-1} + 0×2^{-2} + 1×2^{-3} + 0×2^{-4} = 1.125;
for a posit number whose regime value is k, whose exponent portion has the value e and whose mantissa portion has the value f (including the hidden bit 1), the floating point value it represents can be expressed by the following formula:
x = (-1)^{S} × useed^{k} × 2^{e} × f;
S represents the sign bit, i.e. the most significant bit of the Posit code, useed = 2^{2^{ES}}, and ES denotes the width of the exponent field. useed and e are two scaling factors of different magnitude for the floating point number; they can be combined into a single scale with base 2, whose size can be expressed by the following formula:
scale = 2^{ES} × k + e
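The field rules and the value formula above can be checked with a small reference decoder. This is a sketch for illustration, not the circuit of the invention: the function name `posit_to_float` is hypothetical, the zero and infinity patterns follow the usual posit convention, and exponent bits cut off by the bit-width limit are assumed to be zero.

```python
def posit_to_float(bits: int, n: int, es: int) -> float:
    """Decode an N-bit posit code with maximum exponent width ES."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("inf")                     # assumed infinity pattern
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask                   # two's complement gives |x|
    body = format(bits, f"0{n}b")[1:]           # the N-1 bits after the sign
    first = body[0]
    run = len(body) - len(body.lstrip(first))   # length of the regime run
    k = run - 1 if first == "1" else -run       # regime value
    rest = body[run + 1:]                       # skip the terminating bit
    exp_bits, frac_bits = rest[:es], rest[es:]
    # Missing low exponent bits are assumed to be zero.
    e = int(exp_bits, 2) << (es - len(exp_bits)) if exp_bits else 0
    f = 1.0 + (int(frac_bits, 2) / (1 << len(frac_bits)) if frac_bits else 0.0)
    value = 2.0 ** ((1 << es) * k + e) * f      # scale = 2**ES * k + e
    return -value if sign else value


print(posit_to_float(0b01111111, 8, 2))   # regime 1111111 has the value 6
```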
Example 1
As shown in FIG. 2, this embodiment provides a parameterized addition and subtraction circuit based on the POSIT floating point number format. Parameterization means that the bit width N and the exponent width ES of the input posit numbers can be chosen arbitrarily. The circuit comprises a data input unit 1, a decoding unit 2, a scale determining unit 3, a mantissa processing unit 4, an MSB unit 5, a result encoding unit 6 and a result selecting output unit 7; wherein:
the two operands to be added or subtracted, namely a first floating point number and a second floating point number in the Posit data format, are input through the data input unit 1 and fed to the decoding unit 2. Here, it is assumed that the first floating point number and the second floating point number are both configured as <N, ES>, i.e. Posit numbers with a total bit width of N and a maximum exponent width of ES. Let the first floating point number be IN1. The decoding unit 2 includes a splicing device, an exclusive-OR circuit, an LZD circuit, a shift device, an adder, a multiplier, and the like. After the unit receives the first floating point number, it starts the decoding operation, whose steps are shown in FIG. 3. The decoding unit first takes the most significant bit of the first floating point number code as the sign bit.
If the sign bit extracted in the previous step is 1, performing a complement operation on the first floating point number; otherwise, no operation is performed, that is, the code XIN1 of the absolute value of the Posit number is determined, with a width N. The complement is to invert each bit of IN1 and then add 1.
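The sign extraction and complement step can be sketched as follows; the function name `posit_abs` is hypothetical, and an N-bit integer stands in for the code IN1.

```python
def posit_abs(in1: int, n: int):
    """Return (sign bit, XIN1), where XIN1 is the N-bit code of |IN1|."""
    mask = (1 << n) - 1
    sign = (in1 >> (n - 1)) & 1
    if sign:
        xin1 = ((~in1) + 1) & mask    # invert every bit, then add 1
    else:
        xin1 = in1 & mask             # positive code is left unchanged
    return sign, xin1


print(posit_abs(0b11110011, 8))
```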
After the absolute-value code XIN1 of the first floating point number is obtained, N-1 copies of the second-most-significant bit of XIN1 are concatenated by the splicing device to form one input of the exclusive-OR circuit in the decoding unit, because the second-most-significant bit of XIN1 indicates whether the regime field of the posit number is positive or negative. The other input is XIN1[N-2:0], i.e. the bits from the second-most-significant down to the least significant, and the two inputs are exclusive-ORed to obtain a new N-bit number RIN1. The purpose of the exclusive-OR operation is to allow the LZD circuit (leading zero detection circuit) in the decoding unit to obtain the regime value whether the regime field is positive or negative. If the second-most-significant bit of XIN1 is 1, i.e. a positive regime field, exclusive-ORing that 1 with the run of 1s in XIN1 gives 0; if it is 0, i.e. a negative regime sequence, exclusive-ORing that 0 with the run of 0s in XIN1 also gives 0. It follows that a positive regime sequence 11…10 is converted into 00…01 while a negative regime sequence is unchanged, so only an LZD unit is needed and no LOD unit (leading one detection circuit), which saves circuit area.
After RIN1 is obtained from the output of the exclusive-OR circuit, the LZD circuit in the decoding unit detects the position of the first 1 bit in RIN1, that is, the number r of 0s preceding it in RIN1.
Whether the regime field is positive or negative, r represents the length of the run of consecutive 1s or 0s, so r+2 represents the length of the sign bit plus the regime field. Since the shortest length of the sign bit plus the regime field is 3, XIN1[N-4:0] is shifted left by r-1 bits using the shift device to obtain a temporary number temp, which holds the exponent field and the fraction field of the floating point number.
Then, according to the maximum exponent width ES of the Posit, temp[N-4:N-3-ES] is the exponent field and temp[N-4-ES:0] is the fraction field. For subsequent calculation, however, the splicing device concatenates the hidden bit 1, the fraction field of the code and grs = 3'b0, with the hidden bit in the most significant position and grs in the least significant 3 bits. The grs bits are 3 bits appended after the last bit of the mantissa field and are used for the final rounding after the mantissa operation is completed; the hidden bit, the fraction field and the grs bits together form the extended mantissa field Frac.
After each field of the first floating point number is obtained, the absolute value X1 of the floating point number is calculated. When the regime field in XIN1 is positive, the floating point value is given by:
X1 = (2^{2^{ES}})^{(r-1)} × 2^{exponent} × (1 + fraction)
when the regime field is negative, the floating point value is given by:
X1 = (2^{2^{ES}})^{(-r)} × 2^{exponent} × (1 + fraction)
the operation of the second floating point number in the decoding unit 2 is the same as the first floating point number decoding operation, and will not be described again.
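The decoding steps described above (complement, exclusive-OR with the replicated second-most-significant bit, leading zero detection, left shift, field slicing) can be modelled in Python as follows. This is a sketch under assumptions: the function name `decode_fields` is hypothetical, RIN1 is modelled with N-1 bits, N-3 is assumed to be at least ES, and zero or infinity inputs are assumed to be handled elsewhere.

```python
def decode_fields(in1: int, n: int, es: int):
    """Return (sign, regime value k, exponent, extended mantissa Frac)."""
    mask = (1 << n) - 1
    sign = (in1 >> (n - 1)) & 1
    xin = (-in1) & mask if sign else in1 & mask        # absolute-value code XIN1

    low = xin & ((1 << (n - 1)) - 1)                   # XIN1[N-2:0]
    b = (xin >> (n - 2)) & 1                           # second-most-significant bit
    rin = low ^ (((1 << (n - 1)) - 1) if b else 0)     # XOR with N-1 copies of b

    # Leading zero detection on RIN1 gives the regime run length r.
    r = 0
    while r < n - 1 and not (rin >> (n - 2 - r)) & 1:
        r += 1
    k = r - 1 if b else -r                             # positive or negative regime

    w = n - 3                                          # width of temp
    temp = ((xin & ((1 << w) - 1)) << (r - 1)) & ((1 << w) - 1)
    exp = temp >> (w - es)                             # temp[N-4 : N-3-ES]
    frac = temp & ((1 << (w - es)) - 1)                # temp[N-4-ES : 0]
    frac_ext = ((1 << (w - es)) | frac) << 3           # {hidden 1, fraction, grs = 3'b000}
    return sign, k, exp, frac_ext


print(decode_fields(0b00001101, 8, 1))
```

For the 8-bit code 00001101 with ES = 1, the sketch yields a regime value of -3, an exponent of 1 and a fraction of 0100, which corresponds to 1.25 × 2^{-5}.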
In the implementation process, the decoding unit 2 includes a first decoding unit 21 and a second decoding unit 22; wherein:
the first decoding unit 21 is configured to acquire the different fields of the first floating point number in the POSIT format and determine the values of its sign field, exponent field and mantissa field;
the second decoding unit 22 is configured to acquire the different fields of the second floating point number in the POSIT format and determine the values of its sign field, exponent field and mantissa field.
More specifically, the scale determination unit 3 includes a splicer, a comparator and an adder. It first receives the value regime of the regime field and the value exp of the exponent part of the first floating point number and the second floating point number from the two decoding units, and concatenates the two values of each number with the splicing device to obtain the scale of the first floating point number and the scale of the second floating point number. The comparator then compares the two scales, their difference diff is obtained by subtraction, and these results are sent to the mantissa processing unit. The larger scale value is not yet the scale value of the final target operation result: the unit also receives the most significant bit flag and Lshift from the mantissa processing unit 4 and uses them to adjust scale_max, which gives the first operation result, namely the final scale value. The adjustment formula has already been given above and is not repeated here.
In a specific implementation, the mantissa processing unit 4 includes a mantissa normalization unit 41; the mantissa normalization unit 41 normalizes the second operation result according to the first operation result, and the normalized mantissa, which meets the POSIT format standard, is the third operation result. The intermediate parameter Lshift used in normalization is sent to the scale determining unit 3 to take part in the adjustment of the first operation result, and the normalized result is sent to the result encoding unit 6 as the third operation result.
In a specific implementation process, the mantissa normalizing unit 41 includes an LZD circuit and a shift device. It receives the processed second operation result add_M from the mantissa processing unit 4, detects the number of leading zeros Lshift with the LZD circuit, and shifts add_M left by Lshift bits to obtain the normalized add_ML, where the bits add_ML[MSB-1:0] are the fraction field of the target operation result, that is, the third operation result, and the leading 1 is the hidden bit of the target operation result.
More specifically, the mantissa processing unit 4 includes an adder, a shifting device and a numerical comparator. It receives the Frac fields of the first floating point number and the second floating point number as the two operands of the adder and adds or subtracts them, depending on whether the required operation is addition or subtraction, to obtain the second operation result. If the operation is addition, the addition is performed directly; if it is subtraction, the numerical comparator compares the two mantissa fields Frac, and the smaller Frac value is subtracted from the larger. Note that the values represented by the two Frac fields are both greater than or equal to 1.0 and less than 2, i.e. they are unsigned numbers. Their sum is therefore less than 4 and may need one more integer bit, so the second operation result is 1 bit wider than Frac; otherwise the existing Frac width could not hold the result and overflow would occur. Their difference must be greater than or equal to 0 and less than 1.
When the most significant bit (MSB) of the second operation result add_M is 1, that bit carries the weight 2^1. add_M is passed unchanged to the mantissa normalization unit as a fraction with an explicit hidden bit. This reinterprets the bit to the left of the binary point, whose weight was originally 2^1, as having weight 2^0, which is equivalent to shifting add_M right by one bit, so the scale of the target operation result must be adjusted accordingly. When the most significant bit of the second operation result is 0, which occurs only for subtraction, the shift device shifts the result left by one bit before sending it to the mantissa normalization unit. This operation does not change the overall scale: because the bits to the left of the binary point are zero after the subtraction, the left shift only increases the number of bits available to the mantissa part.
More specifically, the MSB unit is configured to determine the most significant bit of the second operation result, and output the most significant bit of the second operation result to the scale determining unit 3;
More specifically, the result encoding unit 6 includes a splicing device, a shifting device, a numerical comparator, an AND device, and an OR device. It receives the first operation result, the second operation result, and the third operation result obtained by the different units, together with the absolute values and sign fields of the first floating point number and the second floating point number obtained by the decoding unit. The numerical comparator compares the absolute value X1 of the first floating point number with the absolute value X2 of the second floating point number, and the sign field of the floating point number with the larger absolute value is taken as the sign bit of the target operation result.
Next, the fields are derived from the scale value of the target operation result given by the first operation result. Note that if the scale is negative, the negative part of the scale must come from the regime value, because the value exp of the exponent field is unsigned. Therefore, the lowest ES bits of the scale, scale[ES-1:0], are taken directly as the exponent field. If the scale is positive, the remaining bits of the scale above the lowest ES bits give the regime value r. If the scale is negative, the absolute value r of the regime value is the value of scale[ES+RS-1:RS] plus 1, where RS denotes the coding width of the regime value, generally log2(N)+1; this width can be derived from the fact that the maximum regime value does not exceed N-2, plus the added sign bit. From the way scale is calculated in a posit number, whenever the regime value is negative the final scale must be negative, and whenever the scale is positive the regime value must be positive; that is, the sign of the scale determines the sign of the regime.
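The split of a signed scale into a regime value and an unsigned exponent can be sketched in a few lines of Python. This follows the usual posit relationship scale = k * 2^ES + e with 0 <= e < 2^ES, which is an assumption about the intended decomposition; the names `split_scale` and `regime_run_length` are hypothetical, and the run-length rule is the standard posit regime encoding rather than a transcription of the bit slices in the text.

```python
def split_scale(scale: int, es: int) -> tuple[int, int]:
    """Split a signed scale into (regime value k, unsigned exponent e)
    such that scale == k * 2**es + e and 0 <= e < 2**es."""
    e = scale & ((1 << es) - 1)   # lowest ES bits; non-negative even if scale < 0
    k = scale >> es               # arithmetic shift, i.e. floor(scale / 2**es)
    return k, e


def regime_run_length(k: int) -> int:
    """Length of the run of identical regime bits in the standard encoding:
    k >= 0 is written as k + 1 ones followed by a 0,
    k < 0 is written as -k zeros followed by a 1."""
    return k + 1 if k >= 0 else -k
```

Because the exponent is always non-negative, a negative scale always yields a negative regime value, which matches the observation that the sign of the scale determines the sign of the regime.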
All fields are concatenated by the splicing device, and the absolute value of the final target operation result is obtained by shifting according to the sign and value of the scale, as follows:
Shift_temp={!scale[MSB],scale[MSB],exponent,add_ML[MSB-1:0],{N-1{0}}}
If the scale value is positive, i.e., the regime value is positive, the shift device shifts the above expression by r positions; if the scale value is negative, the above expression is shifted by r-1 positions.
After shift_temp has been shifted by the length appropriate to each case, shift_temp[MSB:MSB-(N-1)] is the result before rounding. Posit numbers have only one rounding mode, namely rounding to the nearest value. The LSB, G, R, and S bits are taken from the shifted shift_temp: LSB is the lowest bit of the mantissa part in the target operation result, and G, R, S are the 3 bits immediately after the LSB. shift_temp[MSB-(N-1):MSB-(N+2)] gives the LSB, G, R, and S bits from high to low. If G is 1 and at least one of LSB, R, or S is 1, the result is rounded up; otherwise, the result is truncated. The absolute value of the final target operation result, with a sign bit position prepended, can be expressed as follows:
result={0,shift_temp[MSB:MSB-(N-1)]+G&&(L||R||S)}
The sign bit of the target operation result is obtained from the absolute-value comparison made by the numerical comparator. If the sign bit is 0, result is the final result; if the sign bit is 1, result is converted to its two's complement (inverted and incremented by 1) to obtain the final target operation result.
More specifically, the result selection output unit 7 receives the target operation result and outputs it.
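The rounding rule G && (L || R || S) and the final two's-complement step can be sketched in Python as follows. This is a minimal sketch under assumptions: the shifted field is modelled as an unsigned integer of `width` bits, the kept part is the top N-1 bits, and S is treated as a sticky bit (the OR of every bit below R), which is the common way to implement round-to-nearest and may be wider than the single S bit named in the text. The function names are hypothetical.

```python
def round_nearest(shift_temp: int, width: int, n: int) -> int:
    """Round the top n-1 bits of a width-bit field to nearest, ties to even."""
    keep = n - 1
    drop = width - keep                       # number of bits below the LSB
    assert drop >= 3, "need at least G, R and S below the LSB"
    body = shift_temp >> drop                 # magnitude before rounding
    l = body & 1                              # lowest kept bit
    g = (shift_temp >> (drop - 1)) & 1        # guard bit
    r = (shift_temp >> (drop - 2)) & 1        # round bit
    s = int(shift_temp & ((1 << (drop - 2)) - 1) != 0)  # sticky: any lower bit set
    return body + (g & (l | r | s))


def apply_sign(magnitude: int, sign: int, n: int) -> int:
    """Return the n-bit encoding: unchanged for sign 0, two's complement for sign 1."""
    mask = (1 << n) - 1
    return (-magnitude) & mask if sign else magnitude & mask
```

An exact tie (G = 1, R = S = 0) rounds up only when the kept LSB is 1, so the rounded result always ends in an even value in that case.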
In a specific implementation, the invention provides a parameterized addition and subtraction operation circuit based on the POSIT floating point number format: floating point calculation abandons the conventional IEEE 754 floating point format, and addition and subtraction are carried out using the new POSIT floating point data format.
Example 2
As shown in fig. 4, on the basis of the previous embodiment, this embodiment adds a special result processing unit 9 to handle the special floating point numbers that occur in calculations in the Posit data format, so that the target operation result for a special case can be obtained more quickly without passing through the other units.
The special result processing unit 9 receives the input encoded values of the first floating point number and the second floating point number from the data input unit. The Posit data format has only two special encodings: 0, in which all bits are 0, and NaR, in which the sign bit is 1 and all remaining bits are 0. The special result processing unit drives a valid signal, which is set high when either the first floating point number or the second floating point number is detected to be 0 or NaR. The signal is sent to the result selection output unit 8, so that the result selection output unit outputs the result sent from the special result processing unit. The special results are divided into the following cases: when every field of both floating point numbers is 0, i.e., both floating point values are 0, the result 0 is sent directly to the result selection output unit; when either or both of the floating point numbers is NaR, i.e., the highest bit of the encoding is 1 and all other bits are 0, NaR is sent directly to the result selection output unit; when only one of the floating point numbers is 0 and the other is an ordinary encoding, the floating point number whose encoding is not 0 is sent directly to the result selection output unit.
The result selection output unit 8 receives the valid signal of the special result processing unit 9, and then directly outputs the result transmitted together with the valid signal as the target operation result.
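The special-case selection described in this embodiment can be modelled with a short Python function. It is a behavioural sketch only, with assumptions that go beyond the text: operands are passed as raw N-bit encodings, the second operand is assumed to already carry the sign appropriate to the requested operation (so a non-zero operand can be forwarded unchanged), and the function name and return convention are hypothetical.

```python
from typing import Optional


def special_result(a: int, b: int, n: int) -> tuple[bool, Optional[int]]:
    """Return (valid, result) for two n-bit posit encodings.

    valid is True when either operand is 0 or NaR; result is then the
    encoding to forward to the output stage. Otherwise (False, None).
    """
    nar = 1 << (n - 1)        # sign bit 1, all other bits 0
    if a == nar or b == nar:
        return True, nar      # NaR dominates every other case
    if a == 0 and b == 0:
        return True, 0
    if a == 0:
        return True, b        # forward the non-zero operand
    if b == 0:
        return True, a
    return False, None        # ordinary operands: use the normal datapath
```

Checking NaR first keeps the priority unambiguous when one operand is NaR and the other is 0.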
The present invention may be used in a variety of processing platforms including floating point number computing units, such as personal computers, multiprocessor systems, and various small or large computers.
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Other variations or modifications of the above teachings will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

1. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format is characterized by comprising a data input unit (1), a decoding unit (2), a scale determining unit (3), a mantissa processing unit (4), an MSB unit (5), a result encoding unit (6) and a result selecting output unit (7); wherein:
the data input unit (1) is used for inputting data of a first floating point number in a POSIT format and a second floating point number in the POSIT format;
the decoding unit (2) acquires different fields in the POSIT for the first floating point number and the second floating point number, and determines the values of a sign segment, an exponent segment and a mantissa segment of the first floating point number and the values of the sign segment, the exponent segment and the mantissa segment of the second floating point number;
the scale determining unit (3) is configured to receive preset fields of the first floating point number and the second floating point number, obtain scales of floating point values of the first floating point number and the second floating point number according to the preset fields, and then determine a larger scale between the scales of the first floating point number and the second floating point number according to the calculated scales of the first floating point number and the second floating point number; and send the difference diff between two scales to the mantissa processing unit (4); meanwhile, the scale determining unit (3) receives the shift size of the mantissa part and the carry flag to determine whether to adjust a larger scale between scales of the first floating point number and the second floating point number, thereby determining a first operation result;
the mantissa processing unit (4) is used for receiving mantissa fields of the first floating point number and the second floating point number, correspondingly adjusting the mantissa field of the smaller floating point number between the scale of the first floating point number and the scale of the second floating point number according to the size relation of the two floating point number scales of the scale determining unit (3) and the difference value between the two floating point number scales, and obtaining a second operation result according to the adjusted mantissa fields of the first floating point number and the second floating point number; meanwhile, normalizing the second operation result according to the first operation result to obtain a second operation result of the mantissa part meeting the POSIT format standard as a third operation result;
the MSB unit (5) is used for determining the most significant bit of the second operation result and outputting the most significant bit of the second operation result to the scale determining unit (3);
the result encoding unit (6) is used for receiving the second operation result, the third operation result and the sign field from the decoding unit, and encoding them to form a target operation result; and the result selection output unit (7) is used for outputting the target operation result.
2. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to claim 1, characterized in that the decoding unit (2) comprises a first decoding unit (21) and a second decoding unit (22); wherein:
the first decoding unit (21) is used for acquiring the different fields in the POSIT for the first floating point number and determining the values of the sign segment, the exponent segment and the mantissa segment of the first floating point number;
the second decoding unit (22) is configured to acquire the different fields in the POSIT for the second floating point number, and determine the values of the sign segment, the exponent segment and the mantissa segment of the second floating point number.
3. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to claim 1, wherein the scale determination unit (3) comprises a splicer, a comparator and an adder; the first operation result determining process is as follows:
determining the scale value of a single floating point number, wherein the specific calculation formula is as follows:
scale={regime,exp},
wherein scale represents the scaling factor of the floating point number represented by the POSIT format code, regime represents the value of the regime field in POSIT, exp is the value of the exponent field, and { } denotes the splicing device, i.e., regime and exp are concatenated by the splicing device;
after the scales of the first floating point number and the second floating point number are obtained, they are compared by a comparator, and the larger of the two scales is sent into the adder as one operand; the adder finally evaluates the following formula, which determines the scale of the final result:
scale=scalemax+Flag-Lshift,
wherein scalemax represents the larger of the scale values of the first floating point number and the second floating point number, Flag represents the highest bit of the second operation result output from the MSB unit (5), and Lshift represents a parameter from the mantissa normalization unit indicating the number of bits by which the mantissa is shifted;
the scale herein is the first operation result, and represents the scale value of the target operation result.
4. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to claim 3, wherein the second operation result determining process specifically comprises:
assuming that, according to the scale comparison made by the scale determination unit (3), the first floating point number has the larger scale value, M_a represents the mantissa portion of the first floating point number from the decoding unit, and M_b represents the mantissa portion of the second floating point number; the difference diff between the scales of the two floating point numbers also comes from the scale determination unit (3); the mantissa portion M_b of the second floating point number, which has the smaller of the two scales, is first processed according to the following formula:
M_b=M_b>>diff,
then M_a and the processed M_b are added or subtracted to obtain the X, Y and Z portions of the following formula:
M_a±M_b=XY.Z,
wherein the mantissa portions include the hidden bit 1, M_a represents the mantissa portion of the first floating point number, M_b represents the mantissa portion of the second floating point number, X represents the highest bit of the second operation result, Y represents the second-highest bit of the second operation result, and Z represents the remaining mantissa part of the second operation result;
if X is 1, the value obtained by adding the mantissa parts of the first floating point number and the second floating point number overflows, and the second operation result is obtained according to the following formula:
add_M=YZ,
if X is 0, this indicates that the mantissa portions of the first floating point number and the second floating point number were subtracted and that no overflow occurred; in this case, the second operation result is obtained according to the following formula:
add_M=XY.Z<<1,
here, add_M is the second operation result, representing the mantissa portion of the result.
5. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to claim 4, characterized in that the mantissa processing unit (4) comprises a mantissa normalizing unit (41); the mantissa normalizing unit (41) normalizes the second operation result according to the first operation result to obtain a mantissa part meeting the POSIT format standard as a third operation result.
6. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to claim 5, wherein in the mantissa normalizing unit (41), the second operation result is normalized to obtain a mantissa part conforming to the POSIT format standard, the intermediate parameter Lshift used in normalization is sent to the scale determining unit (3) to take part in the adjustment of the first operation result, and the normalized result is sent to the result encoding unit (6) as the third operation result.
7. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to claim 6, wherein the third operation result is determined by the mantissa normalizing unit (41), specifically expressed by the following formula:
add_ML=add_M<<LOD(add_M),
wherein add_M represents the second operation result, and LOD represents the leading zero detection circuit in the mantissa normalizing unit (41), which detects the position of the first 1 bit in the code of add_M; the shift device in the mantissa normalizing unit then shifts the second operation result left by the value output by LOD;
here, add_ML is the third operation result, and represents the mantissa part of the target operation result.
8. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to any of claims 1-7, further comprising a special result processing unit (8); the special result processing unit (8) determines whether the first floating point number and the second floating point number are 0 or infinity according to the decoding results of the first decoding unit and the second decoding unit so as to determine the special result of the operation, and the special result directly enters the result encoding unit (6).
9. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to claim 8, wherein in the result selection output unit (7), if a special result is received, the special result is output; otherwise, the result of the result encoding unit (6) is output.
10. The parameterized addition and subtraction operation circuit based on the POSIT floating point number format according to claim 9, wherein in the special result processing unit (8), if the first floating point number and the second floating point number are both 0, the result of the addition or subtraction operation is 0; if one of the first floating point number and the second floating point number is 0, the result of the addition or subtraction operation is the other floating point number; if either of the first floating point number and the second floating point number is infinite, the result of the addition or subtraction operation is infinite.
CN202011601975.1A 2020-12-29 2020-12-29 Parameterized addition and subtraction operation circuit based on POSIT floating point number format Active CN112667197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011601975.1A CN112667197B (en) 2020-12-29 2020-12-29 Parameterized addition and subtraction operation circuit based on POSIT floating point number format

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011601975.1A CN112667197B (en) 2020-12-29 2020-12-29 Parameterized addition and subtraction operation circuit based on POSIT floating point number format

Publications (2)

Publication Number Publication Date
CN112667197A CN112667197A (en) 2021-04-16
CN112667197B true CN112667197B (en) 2023-07-14

Family

ID=75410611

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011601975.1A Active CN112667197B (en) 2020-12-29 2020-12-29 Parameterized addition and subtraction operation circuit based on POSIT floating point number format

Country Status (1)

Country Link
CN (1) CN112667197B (en)

Also Published As

Publication number Publication date
CN112667197A (en) 2021-04-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant