CN116700665A

CN116700665A - Method and device for determining floating point number square root reciprocal

Info

Publication number: CN116700665A
Application number: CN202210174708.3A
Authority: CN
Inventors: 唐志敏; 王海洋; 姜莹
Original assignee: Xiangdixian Computing Technology Chongqing Co ltd
Current assignee: Xiangdixian Computing Technology Chongqing Co ltd
Priority date: 2022-02-24
Filing date: 2022-02-24
Publication date: 2023-09-05
Anticipated expiration: 2042-02-24
Also published as: CN116700665B

Abstract

The present disclosure provides a method for determining the reciprocal square root of a floating point number, comprising: a central processing unit identifies a first precision floating point number, generates a processing instruction based on the identification result, and sends the processing instruction to a hardware accelerator; the hardware accelerator performs, based on the received processing instruction: processing the first precision floating point number to obtain a second precision floating point number, wherein the precision of the second precision floating point number is lower than that of the first precision floating point number; computing the reciprocal square root of the second precision floating point number with a reciprocal square root arithmetic logic unit (ALU) corresponding to the second precision floating point number; and determining a Newton iteration initial value from the reciprocal square root of the second precision floating point number, and calling an integer ALU to emulate the Newton iteration method to determine the reciprocal square root of the first precision floating point number.

Description

Method and device for determining floating point number square root reciprocal

Technical Field

The present disclosure relates to the field of computer technology, and in particular, to a method and apparatus for determining the reciprocal square root of floating point numbers.

Background

Currently, in order to meet various application requirements, such as pursuing more extreme rendering of pictures, a hardware accelerator is required to have the capability of calculating high-precision floating point numbers.

However, hardware accelerators currently on the market generally have only an ALU for calculating low-precision floating point numbers or an ALU for calculating integers. To compute the reciprocal square root of a high-precision floating point number, new hardware must be developed, i.e., an ALU for the reciprocal square root of high-precision floating point numbers must be designed and implemented, which increases hardware design complexity and requires a long development period.

Disclosure of Invention

Aiming at the technical problems, the disclosure provides a method and a device for determining the reciprocal square root of a floating point number, and the technical scheme is as follows.

According to a first aspect of the present disclosure, there is provided a method of determining the inverse square root of a floating point number, comprising:

the central processing unit identifies the first precision floating point number, generates a processing instruction based on an identification result, and sends the processing instruction to the hardware accelerator;

the hardware accelerator performs, based on the received processing instructions:

processing the first precision floating point number to obtain a second precision floating point number; wherein the precision of the second precision floating point number is less than the precision of the first precision floating point number;

computing the reciprocal square root of the second precision floating point number with a reciprocal square root arithmetic logic unit (ALU) corresponding to the second precision floating point number;

and determining a Newton iteration initial value according to the reciprocal square root of the second precision floating point number, and calling an integer arithmetic logic unit ALU to simulate Newton iteration method to determine the reciprocal square root of the first precision floating point number.

In one embodiment, the processing the first precision floating point number to obtain the second precision floating point number includes:

truncating the mantissa of the first precision floating point number to obtain a first mantissa meeting the mantissa bit width requirement of the second precision floating point number, wherein the first mantissa is used as the mantissa of the second precision floating point number;

splitting the exponent of the first precision floating point number to obtain a first exponent meeting the exponent range requirement of the second precision floating point number, wherein the first exponent is used as the exponent of the second precision floating point number.

In one embodiment, the splitting the exponent of the first precision floating point number includes:

splitting the exponent of the first precision floating point number into the sum of a first exponent and a second exponent; the first exponent meets the exponent range requirement of the second precision floating point number, and the second exponent is an even number.

In one embodiment, the determining a Newton iteration initial value according to the reciprocal square root of the second precision floating point number, and calling the integer arithmetic logic unit (ALU) to emulate the Newton iteration method to determine the reciprocal square root of the first precision floating point number, comprises:

taking the mantissa of the square root reciprocal of the second precision floating point number as a Newton iteration initial value, taking the mantissa of the first precision floating point number as a target value, and calling an integer arithmetic logic unit ALU to simulate Newton iteration method for iteration to obtain an output value;

taking the product of the output value, a first value, and a second value as the reciprocal square root of the first precision floating point number; wherein the first value is 2 raised to the power of the exponent of the reciprocal square root of the second precision floating point number, and the second value is 2 raised to the power of the negative of one half of the second exponent.

taking the product of the integer value corresponding to the reciprocal square root of the second precision floating point number and the second value as a Newton iteration initial value, taking the integer value corresponding to the first precision floating point number as a target value, and iterating with the Newton iteration method for calculating the reciprocal square root to obtain an output value; wherein the second value is 2 raised to the power of the negative of one half of the second exponent; the integer value corresponding to the first precision floating point number is the product of the mantissa of the first precision floating point number and a third value, the third value being 2 raised to the power of the exponent of the first precision floating point number; the integer value corresponding to the second precision floating point number is the product of the mantissa of the second precision floating point number and a fourth value, the fourth value being 2 raised to the power of the exponent of the second precision floating point number;

The output value is taken as the inverse square root of the first precision floating point number.
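The role of the even second exponent in the embodiments above can be checked numerically. The following is a minimal sketch, not the claimed implementation: the function name `rsqrt_via_split` is hypothetical, and ordinary Python floats stand in for the accelerator's ALUs. It relies on the identity 1/sqrt(M × 2^(X+Y)) = 1/sqrt(M × 2^X) × 2^(-Y/2), in which 2^(-Y/2) is an exact power of two whenever Y is even.

```python
import math


def rsqrt_via_split(mantissa, first_exponent, second_exponent):
    """Compute 1/sqrt(mantissa * 2**(X + Y)) from a value that uses only X.

    Assumption for illustration: plain Python floats replace the hardware ALUs.
    """
    if second_exponent % 2 != 0:
        raise ValueError("the second exponent must be even")
    # Reduced-range value: same mantissa, but only the first exponent X.
    reduced = mantissa * 2.0 ** first_exponent
    # The "second value" of the text: 2 to the power of minus one half of Y.
    second_value = 2.0 ** (-(second_exponent // 2))
    return (1.0 / math.sqrt(reduced)) * second_value


# An actual exponent of 200 split as X = 128 and Y = 72.
approx = rsqrt_via_split(1.5, 128, 72)
direct = 1.0 / math.sqrt(1.5 * 2.0 ** 200)
```

Because Y is even, the correction factor only shifts the exponent of the result and leaves the mantissa bits untouched.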

According to a second aspect of the present disclosure, there is provided an apparatus for determining the reciprocal square root of a floating point number, comprising: a central processing unit and a hardware accelerator;

the central processing unit is used for identifying the first precision floating point number, generating a processing instruction based on the identification result and sending the processing instruction to the hardware accelerator;

the hardware accelerator is configured to execute, based on the received processing instruction:

In one embodiment, the hardware accelerator is specifically configured to truncate the mantissa of the first precision floating point number to obtain a first mantissa that meets the mantissa bit width requirement of the second precision floating point number, the first mantissa being used as the mantissa of the second precision floating point number;

In one embodiment, the hardware accelerator is specifically configured to split the exponent of the first precision floating point number into the sum of a first exponent and a second exponent; the first exponent meets the exponent range requirement of the second precision floating point number, and the second exponent is an even number.

In one embodiment, the hardware accelerator is specifically configured to call an integer arithmetic logic unit (ALU) to emulate the Newton iteration method, iterating with the mantissa of the reciprocal square root of the second precision floating point number as the Newton iteration initial value and the mantissa of the first precision floating point number as the target value, to obtain an output value;

In one embodiment, the hardware accelerator is specifically configured to iterate with the Newton iteration method for calculating the reciprocal square root, using the product of the integer value corresponding to the reciprocal square root of the second precision floating point number and the second value as the Newton iteration initial value and the integer value corresponding to the first precision floating point number as the target value, to obtain an output value; wherein the second value is 2 raised to the power of the negative of one half of the second exponent; the integer value corresponding to the first precision floating point number is the product of the mantissa of the first precision floating point number and a third value, the third value being 2 raised to the power of the exponent of the first precision floating point number; the integer value corresponding to the second precision floating point number is the product of the mantissa of the second precision floating point number and a fourth value, the fourth value being 2 raised to the power of the exponent of the second precision floating point number;

According to a third aspect of embodiments of the present disclosure, there is provided an electronic device including the apparatus for determining the reciprocal square root of a floating point number described above.

According to a fourth aspect of embodiments of the present disclosure, there is provided a hardware accelerator comprising:

the memory is used for storing processing instructions sent by the central processing unit;

a controller for reading the processing instructions in the memory to perform:

According to a fifth aspect of embodiments of the present disclosure, there is provided a central processing unit including:

a memory for storing a processing program;

A controller for reading the processing program to execute: identifying the first precision floating point number, generating a processing instruction based on the identification result, and sending the processing instruction to a hardware accelerator so that the hardware accelerator processes the first precision floating point number to obtain a second precision floating point number; wherein the precision of the second precision floating point number is less than the precision of the first precision floating point number; calculating the second precision floating point number by adopting a square root reciprocal arithmetic logic unit ALU corresponding to the second precision floating point number to obtain the square root reciprocal of the second precision floating point number; and determining a Newton iteration initial value according to the reciprocal square root of the second precision floating point number, and calling an integer arithmetic logic unit ALU to simulate Newton iteration method to determine the reciprocal square root of the first precision floating point number.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

In order to more clearly illustrate the embodiments of the present disclosure or the prior art, the drawings, which are intended to be used in the description of the embodiments or the prior art, are briefly described below, the drawings being illustrated herein to provide a further understanding of the present disclosure, the exemplary embodiments of the present disclosure and the description thereof being intended to explain the present disclosure and not to constitute undue limitations of the present disclosure, and other drawings may be obtained from these drawings by those of ordinary skill in the art.

FIG. 1 is a schematic diagram of a single precision floating point number composition in accordance with an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a double-precision floating point number composition in accordance with an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of an apparatus for determining the reciprocal square root of a floating point number according to an embodiment of the disclosure;

FIG. 4 is a flow chart of a method of determining the reciprocal square root of a floating point number according to an embodiment of the present disclosure;

FIG. 5 is a logic diagram of a method of determining the inverse square root of a floating point number according to an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a hardware accelerator according to an embodiment of the disclosure.

Detailed Description

In order to make the technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions and advantages of the embodiments of the present disclosure will be further described in detail below with reference to the accompanying drawings, and it should be apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments, and it should be noted that the embodiments of the present disclosure and features of the embodiments may be combined with each other without conflict, and all other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present disclosure should fall within the scope of protection.

The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the scope of the application. In this disclosure, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein includes any and all possible combinations of the listed plurality of associated items.

It should be understood that although the terms "first," "second," and the like may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination," depending on the context. It will be further understood that the terms "comprises" and/or "comprising," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may in fact be executed substantially concurrently or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

At present, technologies such as graphics processing and machine learning are evolving continuously, so the computing demands on computer equipment keep rising, as does the computing load on the central processing unit (CPU) in that equipment. The industry therefore uses hardware accelerators to share the computing load of the CPU. A hardware accelerator can be understood as a hardware product dedicated to computation: it receives an instruction sent by the CPU, performs the corresponding computation according to the instruction, and returns the result to the CPU. Common hardware accelerators include the GPU (Graphics Processing Unit), the TPU (Tensor Processing Unit), and the like, which are not limited in the present disclosure.

In a hardware accelerator, a plurality of arithmetic logic units (ALUs), i.e., combinational logic circuits that implement groups of arithmetic and logic operations, are usually configured in hardware in advance to perform calculations on different kinds of data. Each type of ALU is dedicated to processing certain data only; for example, an ALU for single-precision floating point numbers cannot calculate a double-precision floating point number, and similarly, an ALU for double-precision floating point numbers cannot calculate a single-precision floating point number.

As described above, in order to meet various application requirements, such as pursuing more extreme rendering of pictures, hardware accelerators are currently required to be able to compute the reciprocal square root of high-precision floating point numbers (e.g., 64-bit floating point numbers). However, hardware accelerators currently on the market generally have only an ALU that calculates low-precision floating point numbers (e.g., fewer than 64 bits) or an ALU that calculates integers. To compute the reciprocal square root of a high-precision floating point number, new hardware must be developed, i.e., an ALU for the reciprocal square root of high-precision floating point numbers must be designed and implemented, which increases hardware design complexity and lengthens the product development period.

In order to solve the above problem, the present disclosure proposes that the central processing unit identifies a high-precision floating point number and generates a processing instruction so that the hardware accelerator can process the high-precision floating point number into a low-precision floating point number. The hardware accelerator then uses the ALU corresponding to the low-precision floating point number to compute the reciprocal square root of the obtained low-precision floating point number, determines a Newton iteration initial value from that reciprocal square root, and determines the reciprocal square root of the high-precision floating point number by the Newton iteration method.
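The overall idea can be illustrated with a minimal sketch under stated assumptions; it is not the implementation of the disclosure. The update rule y ← y × (3 − x × y²) / 2 is the standard Newton step for 1/sqrt(x). The names `to_float32`, `low_precision_rsqrt`, `newton_rsqrt_fixed`, and the 60-bit fixed-point format are hypothetical, rounding to single precision stands in for the low-precision ALU, and the input is restricted to 1 ≤ x < 4 so that exponent handling can be left out.

```python
import math
import struct

FRAC_BITS = 60            # hypothetical fixed-point format: value = integer / 2**60
ONE = 1 << FRAC_BITS


def to_float32(value):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def low_precision_rsqrt(value):
    """Stand-in for the accelerator's low-precision reciprocal square root ALU."""
    return to_float32(1.0 / math.sqrt(to_float32(value)))


def newton_rsqrt_fixed(x, iterations=2):
    """Refine 1/sqrt(x) for 1 <= x < 4 using integer operations only."""
    if not 1.0 <= x < 4.0:
        raise ValueError("this sketch only handles 1 <= x < 4")
    x_fixed = int(x * ONE)                     # target value in fixed point
    y = int(low_precision_rsqrt(x) * ONE)      # Newton iteration initial value
    for _ in range(iterations):
        # x * y^2, rescaled after each multiplication
        xy2 = (((x_fixed * y) >> FRAC_BITS) * y) >> FRAC_BITS
        # y * (3 - x * y^2) / 2
        y = (y * (3 * ONE - xy2)) >> (FRAC_BITS + 1)
    return y / ONE


refined = newton_rsqrt_fixed(2.0)
```

Each Newton step roughly squares the relative error, so a seed that is accurate to single precision should need only one or two steps to approach double precision in this sketch.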

In order to facilitate description of the technical solution of the present disclosure, the following description of floating point numbers is given:

Floating point numbers are a numerical representation that can express a wide variety of real numbers. Various methods of expressing real numbers have been proposed during the development of computer systems, such as fixed point numbers, in which, in contrast to floating point numbers, the decimal point is fixed at a certain position among the digits of the real number. Currency, for example, may be expressed in this way: 99.00 or 00.99 expresses a currency amount with two decimal places. However, because the fixed position of the decimal point makes fixed-point numbers unsuitable for expressing particularly large or particularly small numbers, most computer systems currently use floating point numbers to express real numbers.

In floating point numbers, real numbers are expressed by a mantissa (Mantissa), a radix (Base), an exponent (Exponent), and a sign representing positive or negative. For example, 121.1 may be expressed as 1.211 × 10^2, where 1.211 is the mantissa, 10 is the radix, and 2 is the exponent. A floating point number uses the exponent to achieve the effect of a floating decimal point, which allows a wider range of real numbers to be expressed.

Since numerical expressions in a computer are all binary, the radix of a floating point number defaults to 2 in a computer, and the number of digits of the mantissa is referred to as the precision of the floating point number. For example, the precision of the floating point number 1.001101 × 2^4 is 7.

The IEEE (Institute of Electrical and Electronics Engineers) specifies various floating point formats; common ones include single precision floating point numbers, double precision floating point numbers, and extended double precision floating point numbers. A single-precision floating point number is 32 bits, i.e., it occupies 32 consecutive bits, in which the sign occupies 1 bit, the exponent occupies 8 bits, and the mantissa occupies 23 bits, plus one implicit bit. A double precision floating point number is 64 bits, in which the sign occupies 1 bit, the exponent occupies 11 bits, and the mantissa occupies 52 bits, plus one implicit bit. An extended double precision floating point number is 80 bits, in which the sign occupies 1 bit, the exponent occupies 15 bits, and the mantissa occupies 64 bits. The IEEE 754 standard specifies that a real number V can be expressed as V = (-1)^S × M × 2^E, where S is the sign (0 indicates that the floating point number is positive and 1 that it is negative), M is the mantissa, and E is the exponent.

As shown in fig. 1, a schematic diagram of a single-precision floating point number (32-bit floating point number) when stored in a computer, wherein the single-precision floating point number occupies 32 bits (4 bytes) in total in the computer, and the continuous 32 bits are divided into three domains, including: a sign field, an exponent field, and a mantissa field, wherein the stored values are used to represent the sign, exponent, and mantissa, respectively, in a given single precision floating point number, so that a given value can be expressed by the mantissa and the exponent that can be adjusted.

As shown in fig. 1, the sign-field bit width is 1 bit, 0 represents positive, and 1 represents negative.

The exponent is also called a step code, and the exponent field is 8 bits wide, so the stored value ranges from 0 to 255. To handle negative exponents, the actual exponent plus a bias value (Bias) is stored in the exponent field. The bias value is 2^(k-1) - 1, where k is the number of exponent bits, so the single precision bias is 2^(8-1) - 1 = 127. The value in the exponent field is therefore the actual value of the exponent plus 127, and the actual values that the 8-bit exponent can represent range from -127 to 128. For example, a single precision actual exponent of 0 is stored in the exponent field as 127, while 64 stored in the exponent field represents the actual exponent -63.
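The bias arithmetic above can be reproduced in a few lines. This is a small illustrative sketch; the function names `exponent_bias`, `stored_to_actual`, and `actual_to_stored` are hypothetical and do not appear in the disclosure.

```python
def exponent_bias(exponent_bits):
    """Bias for an exponent field of the given width: 2^(k-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1


def stored_to_actual(stored, exponent_bits):
    """Convert a value read from the exponent field to the actual exponent."""
    return stored - exponent_bias(exponent_bits)


def actual_to_stored(actual, exponent_bits):
    """Convert an actual exponent to the value written to the exponent field."""
    return actual + exponent_bias(exponent_bits)


single_bias = exponent_bias(8)    # 8-bit exponent field (single precision)
double_bias = exponent_bias(11)   # 11-bit exponent field (double precision)
```

The same helper covers both formats discussed in the text, since only the exponent field width differs.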

The mantissa field bit width is 23 bits, including 23 decimal places to the right of the decimal point, i.e., the fractional part of the mantissa, and the mantissa also includes one hidden integer digit, i.e., the integer part of the mantissa, so that although only 23 decimal places of the mantissa are stored, the total precision of the mantissa digits is 24 bits.
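The three fields of a single-precision number can be inspected with a short sketch. It uses Python's standard `struct` module to obtain the 32-bit pattern; the function names `decode_float32` and `rebuild_normal` are hypothetical, and the rebuild step assumes a normal (not subnormal, infinite, or NaN) number.

```python
import struct


def decode_float32(value):
    """Return (sign, stored exponent, 23-bit fraction) of a single-precision value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                      # 1-bit sign field
    stored_exponent = (bits >> 23) & 0xFF  # 8-bit exponent field (biased)
    fraction = bits & 0x7FFFFF             # 23-bit mantissa field
    return sign, stored_exponent, fraction


def rebuild_normal(sign, stored_exponent, fraction):
    """Evaluate V = (-1)^S * M * 2^E for a normal number."""
    mantissa = 1 + fraction / 2 ** 23      # restore the hidden integer bit
    return (-1) ** sign * mantissa * 2.0 ** (stored_exponent - 127)


fields = decode_float32(-6.5)              # -6.5 = -1.625 * 2^2
```

Here the stored exponent is the actual exponent 2 plus the bias 127, and the fraction holds the 23 bits to the right of the binary point.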

As shown in fig. 2, a schematic diagram of a double-precision floating point number (64-bit floating point number) when stored in a computer, wherein the double-precision floating point number occupies 64 bits (8 bytes) in total in the computer, and the continuous 64 bits are divided into three domains, including: the sign takes 1 bit, the exponent takes 11 bits, and the mantissa takes 52 bits. From the above, it is apparent that floating point numbers of different accuracies are stored in different forms in a computer.

In order to make the technical solutions and advantages of the embodiments of the present disclosure more apparent, the following detailed description of exemplary embodiments of the present disclosure is provided in conjunction with the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present disclosure, not all embodiments of which are exhaustive. It should be noted that, without conflict, the embodiments of the present disclosure and features of the embodiments may be combined with each other.

As shown in fig. 3, a schematic structure of an apparatus for determining the reciprocal square root of a floating point number according to the present disclosure includes a Central Processing Unit (CPU) 310 and a hardware accelerator 320.

Based on the device, after the central processing unit acquires the first precision floating point number to be processed, the central processing unit identifies the first precision floating point number, generates a processing instruction based on an identification result, and sends the processing instruction to the hardware accelerator;

the first precision floating point number is specifically a floating point number that the hardware accelerator cannot process directly. For example, hardware accelerators currently generally have an FP16 (16-bit floating point number) ALU, an FP32 (32-bit floating point number) ALU, and an integer ALU, that is, they can perform calculations on FP16, FP32, and integer data, whereas the first precision floating point number is an FP64, i.e., a 64-bit floating point number, which the hardware accelerator cannot directly identify and calculate.

In this step, after acquiring a first precision floating point number whose reciprocal square root is to be computed, the central processing unit identifies the first precision floating point number, generates a processing instruction based on the identified first precision floating point number, and sends the generated processing instruction to the hardware accelerator. In general, unlike the hardware accelerator, the central processing unit can recognize floating point numbers of various precisions, and can therefore identify the precision type and the magnitude of the high-precision floating point number and generate a processing instruction according to the identification result.

When generating the processing instruction based on the identified first precision floating point number, the central processing unit may specifically generate a processing instruction that directs converting the first precision floating point number into the second precision floating point number and then performing Newton iterative computation based on the second precision floating point number. The second precision floating point number is specifically a floating point number that the hardware accelerator can calculate directly, so that the hardware accelerator can identify and calculate it. Continuing the above example, if the hardware accelerator has an FP16 (16-bit floating point number) ALU and an FP32 (32-bit floating point number) ALU, the second precision floating point number may be a 16-bit floating point number or a 32-bit floating point number. For ease of description, unless otherwise specified, the first precision floating point number is hereinafter taken to be a 64-bit floating point number and the second precision floating point number a 32-bit floating point number.

In this step, after the central processing unit identifies the first precision floating point number, a processing manner of the first precision floating point number may be determined based on a preset software processing logic, and the processing manner may be compiled into a hardware processing instruction that may be executed by the hardware accelerator.

Specifically, the central processing unit may identify the mantissa domain and the exponent domain of the first precision floating point number, and determine different processing modes for the mantissa and the exponent of the first precision floating point number respectively.

Specifically, the mantissa of the first precision floating point number may be processed as follows: the mantissa of the first precision floating point number is truncated to obtain a first mantissa meeting the mantissa bit width requirement of the second precision floating point number, and the first mantissa is used as the mantissa of the second precision floating point number. For example, if the first precision floating point number is a 64-bit floating point number with a mantissa bit width of 52 bits and the second precision floating point number is a 32-bit floating point number with a mantissa bit width of 23 bits, then 23 bits can be taken starting from the most significant bit of the mantissa of the first precision floating point number to obtain the first mantissa.

For example, the mantissa of a 64-bit floating point number is:

1111_0101_1010_1101_1110_1110_1110_1111_1110_0000_1111_1110_1110. It is truncated to get the first mantissa 1111_0101_1010_1101_1110_111.
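The truncation in this example amounts to a right shift that discards the low 29 bits of the 52-bit mantissa field. A minimal sketch, with the hypothetical helper name `truncate_fraction`:

```python
def truncate_fraction(fraction52, target_bits=23):
    """Keep the most significant target_bits of a 52-bit mantissa field."""
    return fraction52 >> (52 - target_bits)


# The 52-bit mantissa from the example, written with group separators.
source_bits = "1111_0101_1010_1101_1110_1110_1110_1111_1110_0000_1111_1110_1110"
fraction52 = int(source_bits.replace("_", ""), 2)

first_mantissa = truncate_fraction(fraction52)
first_mantissa_bits = format(first_mantissa, "023b")
```

The discarded low bits are simply dropped rather than rounded, matching the truncation described in the text.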

The exponent of the first precision floating point number may be processed as follows: the exponent is split to obtain a first exponent meeting the exponent range requirement of the second precision floating point number, and the first exponent is used as the exponent of the second precision floating point number;

specifically, the exponent of the first precision floating point number may be split into the sum of a first exponent and a second exponent; the first exponent meets the exponent range requirement of the second precision floating point number, and the second exponent is an even number.

Since the exponent representation range of the high-precision floating point number is larger than the exponent representation range of the low-precision floating point number, the exponent of the first precision floating point number may not be in the exponent representation range of the second precision floating point number, and it is necessary to first determine whether the exponent of the first precision floating point number is in the exponent range of the second precision floating point number.

For example, the exponent range of a 64-bit floating point number is [-1023,1024] and the exponent range of a 32-bit floating point number is [-127,128]. Since the IEEE standard specifies that the value in the exponent field is the actual exponent plus a bias, if the value in the exponent field of the first precision floating point number is e, its actual exponent is e-1023 (for a 64-bit floating point number, the bias is 1023). It is then necessary to determine whether e-1023 belongs to [-127,128]. If so, e-1023 is taken directly as the actual value of the first exponent and, again per the IEEE standard, the first exponent stored in the exponent field of the second precision floating point number is e-1023+127 (for a 32-bit floating point number, the bias is 127). In this step, if the exponent of the first precision floating point number is within the exponent range of the second precision floating point number, the exponent of the first precision floating point number can be understood as being split into a first exponent and a second exponent, where the second exponent is 0.
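The in-range case can be sketched as follows. The function name `rebias_exponent` and the use of `None` to signal "out of range, splitting needed" are assumptions made for illustration; the range [-127,128] is taken from the text as given.

```python
def rebias_exponent(e):
    """Map a 64-bit exponent field value to a 32-bit exponent field value.

    Returns None when the actual exponent e - 1023 lies outside [-127, 128],
    in which case the exponent has to be split instead.
    """
    actual = e - 1023                 # remove the double-precision bias
    if -127 <= actual <= 128:
        return actual + 127           # apply the single-precision bias
    return None


in_range = rebias_exponent(1026)      # actual exponent 3
out_of_range = rebias_exponent(1223)  # actual exponent 200
```

In the in-range case the second exponent is 0, so no correction factor is needed afterwards.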

When the exponent of the first precision floating point number is not within the exponent range of the second precision floating point number, the split must ensure that the first exponent is within the exponent range of the second precision floating point number and that the second exponent is a non-zero even number.

Continuing the above example, the actual exponent of the first precision floating point number is e-1023. If it is not in [-127, 128], the exponent is split into X+Y, where the first exponent X is a number in the range [-127, 128] that can be represented with 8 bits, X+127 is the value in the exponent field of the second precision floating point number, and the second exponent Y is an even number; the reason for splitting in this way is not detailed here.

In this step, when the CPU determines X and Y, a specific embodiment may be as follows:

that is, if the actual exponent e-1023 of the first precision floating point number is greater than 128, it is determined whether e-1023 is odd or even: if it is even, X takes 128 and Y takes e-1023-128; if it is odd, X takes 127 and Y takes e-1023-127.

In addition, if e-1023 is less than -127, it is determined whether e-1023 is odd or even: if it is even, X takes -126 and Y takes e-1023+126; if it is odd, X takes -127 and Y takes e-1023+127.

It will be appreciated that the above splitting process is only one specific implementation, and the first exponent and the second exponent may be obtained based on other splitting manners, provided that the first exponent meets the exponent range requirement of the second precision floating point number and the second exponent is an even number.
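The splitting rule described above can be sketched as follows. This is only an illustration of the one embodiment given in the text; the function name `split_exponent` is a hypothetical helper, and e denotes the value in the 64-bit exponent field as in the example.

```python
def split_exponent(e):
    """Split the actual exponent e - 1023 into (X, Y) such that
    X + Y == e - 1023, X lies in [-127, 128] and Y is even."""
    actual = e - 1023
    if -127 <= actual <= 128:
        return actual, 0  # already in range, the second exponent is 0
    is_odd = actual % 2 != 0
    if actual > 128:
        x = 127 if is_odd else 128
    else:  # actual < -127
        x = -127 if is_odd else -126
    # X is chosen with the same parity as the actual exponent, so Y is even.
    return x, actual - x


for field in (1023 + 5, 1023 + 200, 1023 + 201, 1023 - 200, 1023 - 201):
    print(field - 1023, split_exponent(field))
```

Choosing X with the same parity as the actual exponent is what keeps Y even in every branch.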

In this step, after determining the processing manner of the first precision floating point number, the central processor may generate a hardware processing instruction that may be executed by the hardware accelerator, and send the processing instruction and the first precision floating point number to a memory, for example, a RAM, of the hardware accelerator, where the hardware accelerator performs processing based on the received instruction.

As shown in fig. 5, which is a logic schematic diagram of the execution of the CPU and the hardware accelerator in the present disclosure, after the CPU recognizes the high-precision 64-bit floating point number, the CPU determines, based on software processing logic, the processing mode for the exponent and the mantissa and the subsequent processing mode for the second precision floating point number, generates the instructions to be executed by the hardware accelerator, compiles them through the compiler into the hardware instruction format supported by the hardware accelerator, and writes the compiled instructions and the first precision floating point number into the memory of the hardware accelerator, such as a RAM (Random Access Memory). After the hardware accelerator reads the instructions and data from the local RAM, the method shown in fig. 4 is performed. FIG. 4 shows a flow diagram of a method for determining the reciprocal square root of a floating point number presented by the present disclosure, the method being performed by a hardware accelerator, the method comprising:

S401, processing the first precision floating point number to obtain a second precision floating point number; wherein the precision of the second precision floating point number is less than the precision of the first precision floating point number;

in this step, the hardware accelerator processes the first precision floating point number based on the processing instruction sent by the central processing unit; the processing manner may refer to the above description, such as splitting the exponent and intercepting (truncating) the mantissa, and is not repeated here.

The hardware accelerator processes the mantissa and the exponent of the first precision floating point number separately according to the instruction to obtain the first mantissa and the first exponent, from which the second precision floating point number is obtained.

Continuing the above example, with the first mantissa 1111_0101_1010_1101_1110_111 and the exponent part $2^{X}$ obtained from the first exponent X, the second precision floating point number is 1111_0101_1010_1101_1110_111 × $2^{X}$. It will be appreciated that the second precision floating point number is stored in the computer in standard floating point form, and that the foregoing combination is presented only for ease of illustration and description.
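A minimal sketch of how a second precision floating point number could be assembled from the first mantissa and the first exponent is shown below. It makes several assumptions that are not stated in the source: the input is a positive, normalized 64-bit value, the mantissa is truncated to its top 23 fraction bits, and the first exponent is restricted to field values 1 to 254 because the field values 0 and 255 are reserved in the IEEE 754 32-bit format. The function name `build_second_precision` is hypothetical.

```python
import struct


def build_second_precision(value, first_exponent):
    """Combine the truncated mantissa of a 64-bit float with the first
    exponent X and return the resulting 32-bit float as a Python float."""
    bits64 = struct.unpack(">Q", struct.pack(">d", value))[0]
    fraction52 = bits64 & ((1 << 52) - 1)  # 52 stored fraction bits
    first_mantissa = fraction52 >> 29      # keep the top 23 bits (truncation)
    field = first_exponent + 127           # biased 32-bit exponent field
    if not 1 <= field <= 254:
        raise ValueError("first exponent outside the normal 32-bit range")
    bits32 = (field << 23) | first_mantissa  # sign bit left at 0 (positive input)
    return struct.unpack(">f", struct.pack(">I", bits32))[0]


# 1.5 * 2**201 has mantissa 1.5; combined with X = 1 it becomes 1.5 * 2**1.
print(build_second_precision(1.5 * 2.0 ** 201, 1))
```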

S402, calculating the second precision floating point number by adopting a square root reciprocal arithmetic logic unit ALU corresponding to the second precision floating point number to obtain the square root reciprocal of the second precision floating point number;

The second precision floating point number is a floating point number that the hardware accelerator can calculate directly, so the hardware accelerator can process it with the local reciprocal square root arithmetic logic unit ALU corresponding to the second precision floating point number to obtain the reciprocal square root of the second precision floating point number. For example, the 32-bit floating point number is processed by the local reciprocal square root ALU for 32-bit floating point numbers to obtain the reciprocal square root $a \times 2^{b}$, which is stored in the computer in floating point form and is written in the above format here only for ease of illustration.

S403, determining a Newton iteration initial value according to the reciprocal square root of the second precision floating point number, and calling an integer arithmetic logic unit ALU to simulate Newton iteration method to determine the reciprocal square root of the first precision floating point number.

The idea of the Newton iteration method, also known as the Newton-Raphson method, is to solve an equation by repeatedly approximating a curve with its tangent line; reference may be made to the related art for details, which are not described herein. In this disclosure, the reciprocal square root of the first precision floating point number is determined by way of Newton iteration.

In the related art, the Newton iteration formula is $x_{n+1} = x_n - f(x_n)/f'(x_n)$. For calculating the reciprocal square root, $f(x) = 1/x^{2} - y$, where y is the target value whose reciprocal square root is to be calculated and x is the reciprocal square root of the target value y. $f'(x_n)$ is the derivative of $f(x)$ evaluated at $x_n$, i.e. $f'(x_n) = -2x_n^{-3}$. Substituting $f(x_n)$ and $f'(x_n)$ into the Newton iteration formula $x_{n+1} = x_n - f(x_n)/f'(x_n)$ gives $x_{n+1} = 1.5x_n - 0.5yx_n^{3}$.

That is, an iteration initial value $x_0$ is first determined and the iteration is started; the value $x_n$ obtained after each iteration is closer to the reciprocal square root of the target value y than $x_{n-1}$, and the result is obtained once the iteration reaches a preset stopping condition. The stopping condition may be that the difference between the reciprocal of the square of the obtained $x_n$ and the target value y is smaller than a preset value, and that the difference between $x_n$ and $x_{n-1}$ after the iteration is smaller than a preset value.
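The iteration and stopping condition described above can be illustrated with ordinary Python floats. This is a sketch only: the function name `newton_rsqrt`, the tolerance `tol` and the iteration cap `max_iter` are assumptions, and the residual check is written with a division purely for readability, whereas the update step itself uses only multiplication and subtraction.

```python
def newton_rsqrt(y, x0, tol=1e-12, max_iter=50):
    """Approximate 1/sqrt(y) with x_{n+1} = 1.5*x_n - 0.5*y*x_n**3,
    starting from the initial value x0."""
    x_prev = x0
    for _ in range(max_iter):
        x = 1.5 * x_prev - 0.5 * y * x_prev ** 3
        residual = abs(1.0 / (x * x) - y)  # |1/x_n^2 - y|
        step = abs(x - x_prev)             # |x_n - x_{n-1}|
        if residual < tol and step < tol:
            return x
        x_prev = x
    return x_prev  # stopping condition not met within max_iter iterations


print(newton_rsqrt(2.0, 0.7))
```

The closer the initial value is to the true reciprocal square root, the fewer iterations are needed, which is why the choice of $x_0$ matters in the steps that follow.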

In this step, a specific embodiment may be:

taking the mantissa of the reciprocal square root of the second precision floating point number as the Newton iteration initial value and the mantissa of the first precision floating point number as the target value, the Newton iteration method for calculating the reciprocal square root is simulated to iterate and obtain an output value; the product of the output value, a first value and a second value is taken as the reciprocal square root of the first precision floating point number, wherein the first value is 2 raised to the power of the exponent of the reciprocal square root of the second precision floating point number, and the second value is 2 raised to the power of the negative of one half of the second exponent.

Continuing the above example, the reciprocal square root of the second precision floating point number is $a \times 2^{b}$, the second exponent is Y, and the second value is $2^{-Y/2}$. It will be appreciated that since Y is an integer, the hardware accelerator can calculate $-Y/2$ using an integer ALU. The calculation of the second value may be started when the second exponent is obtained, or may be started when the second value is needed for calculation, which is not limited in this embodiment. If the first precision floating point number is $c \times 2^{d}$, a is taken as the Newton iteration initial value $x_0$ and c is taken as the target value y, and the local ALUs are used to simulate the Newton iteration method for the reciprocal square root to obtain an output value, that is, the iteration $x_{n+1} = 1.5x_n - 0.5yx_n^{3}$ is simulated. For example, if the output value Z is obtained after the iteration is completed, $Z \times 2^{b} \times 2^{-Y/2}$ is determined to be the reciprocal square root of the first precision floating point number.

In this embodiment, the mantissa of the reciprocal square root of the second precision floating point number is used as the Newton iteration initial value, and the mantissa of the first precision floating point number is used as the target value. Both mantissas are integers, so the hardware accelerator can execute the iterative calculation using the local integer ALUs. For example, the multiplications and additions involved in evaluating an expression of the form $x_{n+1} = 1.5x_n - 0.5yx_n^{3}$ can be realized using an integer multiplication ALU and an integer addition ALU. In addition, using the mantissa of the reciprocal square root of the second precision floating point number as the iteration initial value can greatly reduce the number of iterations so that the iteration converges as quickly as possible, and the overall processing efficiency is higher than with a randomly selected iteration initial value.
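One way to picture an iteration carried out entirely with integer operations is a fixed-point sketch such as the following. The fixed-point format with 60 fractional bits, the use of shifts for rescaling and halving, and the names `FRAC_BITS`, `ONE` and `newton_rsqrt_fixed` are all assumptions made for illustration; the source only states that integer multiplication and integer addition ALUs are used. The initial value is assumed to already approximate the reciprocal square root of the target mantissa.

```python
FRAC_BITS = 60        # assumed number of fractional bits
ONE = 1 << FRAC_BITS  # fixed-point representation of 1.0


def newton_rsqrt_fixed(y_fix, x0_fix, iterations=5):
    """Run x_{n+1} = 1.5*x_n - 0.5*y*x_n**3 on fixed-point integers,
    using only integer multiplication, subtraction and shifts."""
    x = x0_fix
    for _ in range(iterations):
        x_sq = (x * x) >> FRAC_BITS          # x_n^2
        x_cu = (x_sq * x) >> FRAC_BITS       # x_n^3
        y_x_cu = (y_fix * x_cu) >> FRAC_BITS # y * x_n^3
        x = ((3 * x) >> 1) - (y_x_cu >> 1)   # 1.5*x_n - 0.5*y*x_n^3
    return x


c_fix = 3 * ONE // 2    # target mantissa c = 1.5
a_fix = 13 * ONE // 16  # initial value a = 0.8125, close to 1/sqrt(1.5)
z_fix = newton_rsqrt_fixed(c_fix, a_fix)
print(z_fix / ONE)      # converted to a float only for display
```

Python integers are arbitrary precision, so the intermediate products do not overflow here; a hardware implementation would need multipliers wide enough for the chosen format.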

Another specific embodiment may be:

taking the product of the integer value corresponding to the square root reciprocal of the second precision floating point number and the second value as a Newton iteration initial value, taking the integer value corresponding to the first precision floating point number as a target value, and simulating the Newton iteration method for solving the square root reciprocal to iterate to obtain an output value; the output value is taken as the inverse square root of the first precision floating point number.

Continuing the above example, if the reciprocal square root of the second precision floating point number is $a \times 2^{b}$ and the second value is $2^{-Y/2}$, then $a \times 2^{b} \times 2^{-Y/2}$ is taken as the Newton iteration initial value. When calculating the product of the reciprocal square root of the second precision floating point number and the second value, the integer value of the reciprocal square root of the second precision floating point number must first be determined, and this integer value and the second value are then processed by the integer ALU to obtain the product. The integer value of the reciprocal square root of the second precision floating point number may be determined as the product of the mantissa of the second precision floating point number and a fourth value, the fourth value being 2 raised to the power of the exponent of the second precision floating point number. All of the computations involved in obtaining the integer value of the reciprocal square root of the second precision floating point number may be processed using an integer ALU.

Taking the integer value corresponding to the first precision floating point number as the target value y, the local ALUs are used to simulate the Newton iteration method for the reciprocal square root to obtain an output value, that is, the iteration $x_{n+1} = 1.5x_n - 0.5yx_n^{3}$ is simulated. For example, if the output value Z is obtained after the iteration is completed, Z is determined to be the reciprocal square root of the first precision floating point number. In this embodiment, when determining the integer value of the first precision floating point number, a third value corresponding to the exponent of the first precision floating point number may be determined first, that is, 2 raised to the power of the exponent of the first precision floating point number; since the third value is an integer, the product of the third value and the mantissa of the first precision floating point number may be calculated using the local integer multiplication ALU to obtain the integer value corresponding to the first precision floating point number. The calculation of this integer value may be started when the hardware accelerator receives the first precision floating point number, or may be started when the integer value corresponding to the first precision floating point number is needed in this step; similarly, the calculation of the second value may be started when the second exponent is obtained, or may be started when the second value is needed for calculation, which is not limited in this embodiment.

In this embodiment, the product of the integer value corresponding to the reciprocal square root of the second precision floating point number and the second value is used as the Newton iteration initial value, and the integer value corresponding to the first precision floating point number is used as the target value. Both the product and the integer value corresponding to the first precision floating point number are integers, so the hardware accelerator can execute the iterative calculation using the local integer ALUs. For example, the multiplications and additions involved in evaluating an expression of the form $x_{n+1} = 1.5x_n - 0.5yx_n^{3}$ can be realized using an integer multiplication ALU and an integer addition ALU. In addition, using the product of the integer value corresponding to the reciprocal square root of the second precision floating point number and the second value as the iteration initial value can greatly reduce the number of iterations so that the iteration converges as quickly as possible, and the overall processing efficiency is higher than with a randomly selected iteration initial value.
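The overall data flow of this second embodiment can be sketched numerically as follows. The sketch deliberately substitutes ordinary Python floats for the integer-ALU emulation, rounds to 32 bits with `struct` instead of truncating the mantissa, and uses `math.sqrt` as a stand-in for the 32-bit reciprocal square root ALU; these substitutions, the fixed iteration count and the names `round_to_f32` and `rsqrt_first_precision` are assumptions for illustration, not the disclosed hardware behaviour.

```python
import math
import struct


def round_to_f32(x):
    """Round a Python float to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def rsqrt_first_precision(v, second_exponent, iterations=3):
    """Approximate 1/sqrt(v) for a 64-bit value v, given an even second
    exponent Y chosen so that v * 2**(-Y) fits in a 32-bit float."""
    # Second precision floating point number: v with 2**Y factored out.
    s = round_to_f32(math.ldexp(v, -second_exponent))
    # Stand-in for the 32-bit reciprocal square root ALU.
    r = round_to_f32(1.0 / math.sqrt(s))
    # Newton iteration initial value: rsqrt(s) * 2**(-Y/2).
    x = math.ldexp(r, -(second_exponent // 2))
    # Refine towards 64-bit accuracy with the full value v as the target.
    for _ in range(iterations):
        x = 1.5 * x - 0.5 * v * x * x * x
    return x


v = 1.5 * 2.0 ** 201  # actual exponent 201 is odd, so X = 127 and Y = 74
print(rsqrt_first_precision(v, 74))
```

Because the initial value is already accurate to roughly 32-bit precision, a small fixed number of iterations is enough in this sketch; a stopping condition such as the one described earlier could be used instead.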

In both embodiments, it is necessary to calculate and utilize one half of the second exponent, so that when splitting the exponent of the first precision floating point number, it is necessary to ensure that the split second exponent is even.
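The role of the even second exponent can be made explicit with a short worked identity. Writing the first precision floating point number as $c \times 2^{X+Y}$, with c the mantissa, X the first exponent and Y the second exponent as in the examples above, and assuming Y is even so that $Y/2$ is an integer:

```latex
\frac{1}{\sqrt{c \times 2^{X+Y}}}
  = \frac{1}{\sqrt{c \times 2^{X}}} \times \frac{1}{\sqrt{2^{Y}}}
  = \frac{1}{\sqrt{c \times 2^{X}}} \times 2^{-Y/2}
```

For instance, with $Y = 74$ the second value is $2^{-37}$, which only requires an integer exponent adjustment. An odd Y would leave a factor of $\sqrt{2}$ that cannot be expressed as an integer power of 2.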

As shown in fig. 5, when the hardware accelerator performs processing based on the Newton iteration method, it may call a plurality of existing local ALUs to perform the calculation, thereby simulating the Newton iteration method. Division is not used in the Newton iteration process, and most hardware accelerators currently do not have a division ALU, so the calculation method is suitable for most hardware accelerators on the market. The second precision floating point reciprocal square root ALU, integer addition ALU, and integer multiplication ALU illustrated in FIG. 5 are merely examples of ALUs in current general purpose hardware accelerators, which often have other ALUs as well.

It is to be appreciated that although the foregoing is illustrated with the first precision floating point number being a 64-bit floating point number and the second precision floating point number being a 32-bit floating point number, the second precision floating point number may be any floating point number that a given hardware accelerator can process directly, and the first precision floating point number may be any floating point number that the hardware accelerator cannot process directly and whose precision is higher than the second precision; those skilled in the art can apply this flexibly to practical problems in light of the present disclosure. For example, the first precision floating point number may be an extended double precision floating point number, the second precision floating point number may be a 16-bit floating point number, and so on. The first precision floating point number being a 64-bit floating point number and the second precision floating point number being a 32-bit floating point number should not be construed as limiting the present disclosure.

In this way, no hardware development of an ALU for high-precision floating point numbers is needed: the existing ALUs of the hardware accelerator are used to obtain the reciprocal square root of the high-precision floating point number. Meanwhile, the reciprocal square root of the second precision floating point number, rather than a random value, is used as the basis for the Newton iteration initial value, so that the initial value is very likely to be close to the final value, which greatly reduces the number of iterations and improves calculation efficiency. In addition, most ALUs do not have division capability, and determining the reciprocal square root by the Newton iteration method avoids division, so the method is suitable for most hardware accelerators currently on the market.

As shown in FIG. 3, corresponding to the foregoing method for determining the reciprocal square root of a floating point number, the present disclosure also provides an apparatus for determining the reciprocal square root of a floating point number, which is characterized by comprising a central processing unit CPU 310 and a hardware accelerator 320;

the Central Processing Unit (CPU) 310 is configured to identify a first precision floating point number, generate a processing instruction based on an identification result, and send the processing instruction to a hardware accelerator;

the hardware accelerator 320 is configured to execute, based on the received processing instruction:

In one embodiment, the hardware accelerator 320 is specifically configured to intercept the mantissa of the first precision floating point number to obtain a first mantissa that meets the mantissa bit width requirement of the second precision floating point number, and is used as the mantissa of the second precision floating point number;

In one embodiment, the hardware accelerator 320 is specifically configured to split the exponent of the first precision floating point number into a first exponent and a second exponent whose sum equals the original exponent; the first exponent meets the exponent range requirement of the second precision floating point number, and the second exponent is an even number.

In one embodiment, the hardware accelerator 320 is specifically configured to call an integer arithmetic logic unit ALU to simulate a newton iteration method to iterate with a mantissa of a square root reciprocal of the second precision floating point number as an initial newton iteration value and a mantissa of the first precision floating point number as a target value to obtain an output value;

In one embodiment, the hardware accelerator 320 is specifically configured to take the product of the integer value corresponding to the reciprocal square root of the second precision floating point number and the second value as the Newton iteration initial value, take the integer value corresponding to the first precision floating point number as the target value, and simulate the Newton iteration method for the reciprocal square root to iterate and obtain an output value; wherein the second value is 2 raised to the power of the negative of one half of the second exponent; the integer value corresponding to the first precision floating point number is the product of the mantissa of the first precision floating point number and a third value, wherein the third value is 2 raised to the power of the exponent of the first precision floating point number; the integer value corresponding to the second precision floating point number is the product of the mantissa of the second precision floating point number and a fourth value, wherein the fourth value is 2 raised to the power of the exponent of the second precision floating point number;

The embodiment of the disclosure also provides electronic equipment, which comprises the device for determining the inverse square root of the floating point number. In some use scenarios, the product form of the electronic device is a portable electronic device, such as a smart phone, a tablet computer, a VR device, etc.; in some use cases, the electronic device is in the form of a personal computer, game console, workstation, server, etc.

The disclosed embodiments also provide a hardware accelerator, comprising:

a controller for reading the processing instructions in the memory to perform:

The embodiment of the disclosure also provides a central processing unit, including:

a memory for storing a processing program;

In one particular embodiment, the hardware accelerator described in this disclosure may be a GPU, as shown in fig. 6, comprising at least:

GPU core, used for processing commands, such as drawing commands, and executing the image rendering pipeline according to the drawing commands. The GPU core mainly comprises computing units, which execute the instructions compiled from shaders, are programmable modules, and consist of a large number of ALUs; a Cache (memory) for caching data of the GPU core to reduce accesses to the memory; and a controller (not shown). The GPU core further has various functional modules such as rasterization (a fixed stage of the 3D rendering pipeline), tiling (dividing a frame into tiles in TBR and TBDR GPU architectures), clipping (a fixed stage of the 3D rendering pipeline, which clips away primitives that are outside the view range or are back-facing and not displayed), post-processing (scaling, cropping, rotating and other operations on the drawn image), etc.

a general DMA, used for moving data between the host memory and the GPU graphics card memory; for example, the vertex data for 3D drawing is moved by the general DMA from the host memory to the GPU graphics card memory;

the network on chip, used for data exchange between the masters and slaves on the SOC;

the application processor, used for scheduling the tasks of each module on the SOC; for example, the GPU notifies the application processor after rendering a frame of image, and the application processor then starts the display controller to display the image drawn by the GPU on the screen;

the PCIe controller, which serves as the interface for communicating with the host and implements the PCIe protocol, so that the GPU graphics card is connected to the host through the PCIe interface. The host runs the graphics API, the graphics card driver, etc.;

the memory controller is used for connecting memory equipment and storing data on the SOC;

a display controller for controlling the frame buffer in the memory to be output to the display by a display interface (HDMI, DP, etc.);

and the video decoder is used for decoding the coded video on the hard disk of the host into pictures which can be displayed.

And the video encoder is used for encoding the original video code stream on the hard disk of the host into a specified format and returning the encoded video code stream to the host.

As shown in the figure, the host is the central processing unit; after generating a processing instruction, the host sends the instruction to a memory in the GPU core of the GPU chip, and a controller in the GPU core executes the above processing flow according to the processing instruction to obtain the reciprocal square root of the first precision floating point number and returns it to the host.

While preferred embodiments of the present disclosure have been described above, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the appended claims be interpreted as including the preferred embodiments and all alterations and modifications that fall within the scope of this disclosure, and that those skilled in the art will recognize that the invention also includes the true scope of the embodiments of the disclosure without departing from the spirit and scope of the disclosure.

Claims

1. A method of determining the inverse square root of a floating point number, comprising:

2. The method of claim 1, the processing the first precision floating point number to obtain a second precision floating point number, comprising:

3. The method of claim 2, the splitting the exponent of the first precision floating point number comprising:

4. The method of claim 3, the determining the initial value of newton iteration from the inverse square root of the second precision floating point number, invoking an integer arithmetic logic unit ALU to simulate newton iteration to determine the inverse square root of the first precision floating point number, comprising:

5. The method of claim 3, the determining the initial value of newton iteration from the inverse square root of the second precision floating point number, invoking an integer arithmetic logic unit ALU to simulate newton iteration to determine the inverse square root of the first precision floating point number, comprising:

6. An apparatus for determining the reciprocal square root of floating point number includes a central processing unit and a hardware accelerator;

7. The device according to claim 6,

the hardware accelerator is specifically configured to intercept the mantissa of the first precision floating point number to obtain a first mantissa meeting the requirement of the second precision floating point number for use as the mantissa of the second precision floating point number;

8. The device according to claim 7,

the hardware accelerator is specifically configured to split the exponent of the first precision floating point number into a first exponent and a second exponent whose sum equals the original exponent; the first exponent meets the exponent range requirement of the second precision floating point number, and the second exponent is an even number.

9. The device according to claim 8,

the hardware accelerator is specifically configured to call an integer arithmetic logic unit ALU to simulate a newton iteration method to iterate by using a mantissa of a square root reciprocal of the second precision floating point number as a newton iteration initial value and using a mantissa of the first precision floating point number as a target value to obtain an output value;

10. The device according to claim 8,

the hardware accelerator is specifically configured to take the product of the integer value corresponding to the reciprocal square root of the second precision floating point number and the second value as the Newton iteration initial value, take the integer value corresponding to the first precision floating point number as the target value, and simulate the Newton iteration method for the reciprocal square root to iterate and obtain an output value; wherein the second value is 2 raised to the power of the negative of one half of the second exponent; the integer value corresponding to the first precision floating point number is the product of the mantissa of the first precision floating point number and a third value, wherein the third value is 2 raised to the power of the exponent of the first precision floating point number; the integer value corresponding to the second precision floating point number is the product of the mantissa of the second precision floating point number and a fourth value, wherein the fourth value is 2 raised to the power of the exponent of the second precision floating point number;

11. An electronic device comprising the apparatus of any of the preceding claims 6-10.

12. A hardware accelerator comprising:

a controller for reading the processing instructions in the memory to perform:

13. A central processing unit comprising:

a memory for storing a processing program;