CN114968368A - Sine and cosine function implementation method and system based on transcendental function acceleration instruction - Google Patents

Sine and cosine function implementation method and system based on transcendental function acceleration instruction

Info

Publication number
CN114968368A
Authority
CN
China
Prior art keywords
vector
floating point
integer
function
operand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210647106.5A
Other languages
Chinese (zh)
Inventor
沈洁
龙标
黄春
彭林
唐滔
姜浩
范小康
于恒彪
易昕
苏醒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210647106.5A
Publication of CN114968368A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations

Abstract

The invention discloses a sine and cosine function implementation method and system based on transcendental function acceleration instructions. The method comprises: reducing each element of the incoming vector operand vd to the interval [-π/4, π/4], obtaining a corresponding integer vector vql and a floating-point vector vdr lying within [-π/4, π/4]; taking the integer vector vql modulo 4 to obtain an integer vector vqln; and, according to the Taylor series expansion method, using transcendental function acceleration instructions to perform the polynomial approximation on the floating-point vector vdr and the integer vector vqln, obtaining the vector sine or vector cosine calculation result vr. On instruction set architectures with fixed-length instruction encoding, the method needs no extra address calculation instructions or vector load instructions to fetch the polynomial coefficient constants from a constant pool, so the performance of the vector sine and vector cosine functions is greatly improved.

Description

Sine and cosine function implementation method and system based on transcendental function acceleration instruction
Technical Field
The invention belongs to the field of processor data-level parallelism and vector trigonometric function computation, and particularly relates to a sine and cosine function implementation method and system based on transcendental function acceleration instructions, used to implement vector sine and vector cosine functions without memory access instructions on instruction set architectures with fixed-length instruction encoding.
Background
An instruction set architecture (ISA) defines a processor's instructions, registers, data types, and the byte-level encoding of its instructions. Different processor families have different instruction set architectures, such as the IA32 and x86-64 instruction sets of Intel processors, the ARMv8 and ARMv9 instruction sets of ARM processors, the MIPS instruction set architecture, and the SPARC 64V instruction set developed by Fujitsu based on the open-source SPARC V9 instruction set. In terms of instruction encoding, some instruction set architectures use fixed-length instruction encoding, meaning all instructions of the architecture are encoded with bytecodes of the same length (32 or 64 bits); for example, ARMv8, ARMv9, MIPS, SPARC V9 and SPARC 64V all use 32-bit fixed-length instruction encoding. Complex instruction sets such as IA32 and x86-64 use variable-length instruction encoding, with instructions varying from 8 to 120 bits. Fixed-length instruction encoding simplifies the micro-architecture's instruction decoding and reduces its complexity, but 64-bit or even 32-bit immediates and memory addresses cannot be encoded into an instruction, so immediates (constants) and constant memory addresses must instead be supplied through a constant pool. With variable-length instruction encoding, immediates and memory addresses are encoded into the instructions themselves, so instructions can operate on immediates directly, without a constant pool or the additional address calculation and load instructions it requires.
Modern processors include a vector processing unit capable of data-level parallelism, an important part of the processor. The instruction set architecture includes a subset of instructions that operate the vector processing unit, referred to as the SIMD instruction set (also called the floating-point instruction set or vector instruction set). Vector registers, also known as floating-point registers, can store multiple elements (unlike ordinary general-purpose registers) and are the core storage components on which the SIMD instruction set runs. A single SIMD instruction can operate on multiple elements stored in a vector register simultaneously. For instruction sets using 32-bit fixed-length encoding, immediate constants cannot be encoded into SIMD instructions; instead, a constant pool stores the immediates, the memory address is calculated before the corresponding SIMD instruction is issued, and a vector load instruction loads the constant-pool immediates into the elements of a vector register. Common SIMD instruction sets include the SSE2, AVX2 and AVX512 instruction sets of Intel processors and the ADVSIMD and SVE instruction sets of ARM processors; the SPARC 64V instruction set also has corresponding SIMD extensions. The industry also refers to SIMD instructions as vector operations, which essentially apply an ordinary scalar operation to each element of a vector register (or vector) and store the results in the corresponding elements of the destination vector register.
SIMD instruction set built-in functions (intrinsics) are a set of C language interfaces provided for a SIMD instruction set. Intrinsics let programmers use vector registers directly through the vector variable types they provide, and use the corresponding vector instructions directly through the vector built-in functions they provide, in C/C++. Because vector registers correspond one-to-one to intrinsics vector types, and SIMD instructions correspond one-to-one to SIMD built-in functions, in this document the intrinsics vector type is used synonymously with the vector register: "vector" refers to both the vector register and the intrinsics vector type, and "vector operation" refers to both the SIMD instruction and the vector built-in function.
The computer implementation of the sine and cosine functions generally consists of three steps: reduction, approximation, and reconstruction. Reduction maps an input from any interval into a specified interval (such as [-π/4, π/4]) using the symmetry and periodicity of the trigonometric functions. Approximation computes the trigonometric function on [-π/4, π/4] by a Taylor series expansion; e.g., sin(x) can be approximated as:
sin(x) ≈ x - x^3/3! + x^5/5! - x^7/7! + ⋯
Reconstruction expresses the trigonometric function value on an arbitrary interval through the function value on the [-π/4, π/4] interval. Vector sine and vector cosine functions are vectorized using SIMD (single instruction multiple data) instructions: one vector trigonometric function call computes multiple floating-point operands simultaneously according to the vector width of the SIMD instruction set used, yielding a several-fold performance improvement. The vector sine and vector cosine functions follow the same three steps, each vectorized with SIMD instructions.
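As an illustration only (not the patent's vectorized implementation), the three steps can be sketched per element in Python; the function name and the direct single-step reduction used here are simplifications assumed for clarity:

```python
import math

# Taylor coefficients c_i = (-1)^i/(2i+1)! for sin, (-1)^i/(2i)! for cos
SIN_COEFFS = [(-1)**i / math.factorial(2*i + 1) for i in range(8)]
COS_COEFFS = [(-1)**i / math.factorial(2*i) for i in range(8)]

def sin_impl(x):
    # 1) Reduction: x = r + q*(pi/2) with r in [-pi/4, pi/4]
    q = round(x / (math.pi / 2))
    r = x - q * (math.pi / 2)
    qn = q % 4
    # 2) Approximation on [-pi/4, pi/4]: Horner evaluation of P(r^2)
    use_cos = qn in (1, 3)          # quadrants where sin(x) = +-cos(r)
    coeffs = COS_COEFFS if use_cos else SIN_COEFFS
    r2 = r * r
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * r2 + c
    res = acc * (1.0 if use_cos else r)
    # 3) Reconstruction: negate in quadrants 2 and 3
    return -res if qn in (2, 3) else res
```

The naive `x - q*(pi/2)` subtraction here loses precision for large x, which is exactly the problem the Cody-Waite reduction discussed below addresses.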
Sine and cosine function reduction maps a function input x to the interval [0, C] or [-C/2, C/2]; the reduced result y is obtained from y = x - k×C, where k is an integer satisfying:

k = ⌊x/C⌋ (for [0, C]) or k = ⌊x/C + 1/2⌋ (for [-C/2, C/2]),

where ⌊·⌋ denotes rounding down (floor).
The IEEE 754 standard floating-point system widely used in computers today suffers rounding error during reduction when the input x is large: the product of k and the floating-point constant C loses low-order digits. The industry solves this problem with the Cody-Waite reduction algorithm, which splits the constant C into two double-precision floating-point numbers C1 and C2 holding the high and low bits of C respectively (C ≈ C1 + C2), so that the reduced result y = x - C1×k - C2×k avoids the error. The industry also uses variants of the Cody-Waite algorithm (splitting C into 3 or 4 double-precision numbers, or splitting k into two double-precision numbers) to improve precision when the input x is large. Vector functions are reduced with a vectorized Cody-Waite algorithm that adjusts each element of the input vector to the specified interval; the vector Cody-Waite algorithm is obtained by vectorizing each step of the ordinary Cody-Waite algorithm with the corresponding SIMD instruction.
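A scalar sketch of the two-constant Cody-Waite idea, for illustration; the split constants below are the commonly used high/low parts of π/2 (as found in FDLIBM-style implementations), not values taken from this patent:

```python
import math

# C = pi/2 split into a high part whose low-order bits are zero (so k*C1
# is exact for moderate k) and a low part carrying the remaining bits.
C1 = 1.5707963267341256      # high bits of pi/2
C2 = 6.077100506506192e-11   # low bits of pi/2 (C1 + C2 ~= pi/2)

def reduce_naive(x):
    # single-step reduction: loses low bits of k*(pi/2) for large x
    k = round(x / (math.pi / 2))
    return k, x - k * (math.pi / 2)

def reduce_cody_waite(x):
    # subtract k*C in two pieces so the large cancellation x - k*C1
    # happens exactly, then remove the low-order correction k*C2
    k = round(x * (2.0 / math.pi))
    return k, (x - k * C1) - k * C2
```

For example, reducing x = 100 gives k = 64 (a multiple of 4), so sin(100) should equal sin of the reduced residue.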
The polynomial approximation of the sine and cosine functions computes the trigonometric function on the specified interval by Taylor series expansion. After the function input is reduced to [-π/4, π/4], the Taylor series expansion (generally to 8 terms) gives:

sin(x) ≈ x - x^3/3! + x^5/5! - x^7/7! + x^9/9! - x^11/11! + x^13/13! - x^15/15!
because of the precision problem of the floating point system, the computer implementation will use the extracted x 2 The formula of (a):
Figure BDA0003686413070000034
Here x^2 is computed only once and then reused, so the whole sine implementation can be approximated as 8 consecutive floating-point multiply-add operations (a multiply-add, also called multiply-accumulate, computes a×b+c, accumulating the product of a and b with another number c) and 1 floating-point multiply, where the addition operands are constant coefficients. Similarly, the polynomial approximation of the cosine function is implemented as:

cos(x) ≈ 1 × (c0 + x^2×(c1 + x^2×(c2 + x^2×(c3 + x^2×(c4 + x^2×(c5 + x^2×(c6 + x^2×c7))))))), where ci = (-1)^i/(2i)!
the cosine function implementation can also be approximated as 8 consecutive floating-point multiply-add operations and 1 floating-point multiply operation, except that the constant coefficients used for the 8 floating-point multiply-add operations and the final multiply operation use constant 1 as the multiply operand. Vectorization of polynomial approximation refers to replacing the 8 floating-point multiply-add operations and 1 floating-point multiply element used by it with the corresponding vector floating-point multiply-add operations and vector multiply operations.
The polynomial approximation of the vector sine and vector cosine functions thus uses 8 consecutive floating-point multiply-add operations whose addition operands are constant immediates. On an instruction set with fixed-length encoding, these 8 constants are stored in a constant pool, and before each vector floating-point multiply-add instruction two extra steps are needed: calculating the address of the constant immediate in the constant pool, and loading the constant into a vector register with a vector load instruction. This means the whole polynomial approximation uses 8 consecutive vector floating-point multiply-add instructions but also introduces 8 address calculation instructions and 8 time-consuming vector load instructions, which greatly reduces the performance of the vector sine and vector cosine functions on these instruction set architectures.
To address this problem, some instruction set architectures provide three transcendental function acceleration instructions that avoid fetching the polynomial-approximation constants from a constant pool. Specifically: 1) The transcendental function multiply-add instruction trimad takes two vector register operands vreg1 and vreg2 and an integer immediate as inputs; the instruction uses the immediate to look up, in a hardware table, the constant coefficient needed by the polynomial approximation, and uses that coefficient as the addition operand of a vector floating-point multiply-add with the two vector registers. The immediate ranges over [0, 7], since 8 constant coefficients are used by the sine or cosine polynomial approximation. vreg1, whose initial value is 0, corresponds to one multiplication operand of each vector floating-point multiply-add and also holds the result of each step as the input of the next. vreg2 corresponds to x^2 of the polynomial approximation; since x^2 is non-negative, its sign bit (bit 63) is always 0, so the sign bit of each element of the register is instead designed to hold positive or negative x^2, allowing trimad to select the sine or cosine polynomial coefficients as addition operands based on this sign bit.
2) The transcendental function squaring instruction trimul computes the ±x^2 required by trimad. It takes two vector register operands vreg1 and vreg2 as inputs; vreg1 corresponds to the input x of the polynomial approximation, the instruction computes the square of x, and uses bit 0 of each vreg2 element as the sign bit of the corresponding result element, thereby determining whether trimad will use the sine or cosine coefficients. 3) The transcendental function selection instruction trisel selects whether the coefficient used by the final vector floating-point multiply of the polynomial approximation is x (required by the sine function) or the constant 1 (required by the cosine function). It takes vreg1 and vreg2 as inputs; vreg1 corresponds to the input x of the polynomial approximation; bit 0 of each vreg2 element selects whether x from the corresponding vreg1 element or the constant 1 is written to the output, and bit 1 of each vreg2 element sets the sign bit of the corresponding output element, deciding whether the whole polynomial-approximation result of that element must be negated. These three transcendental function acceleration instructions are implemented in ARM processors as the transcendental function square instruction ftsmul, selection instruction ftssel and multiply-add instruction ftmad, and are also implemented in the SPARC 64V instruction set developed by Fujitsu.
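The semantics described above can be modeled in scalar Python for one element (a sketch only: the real instructions operate on vector registers, and the coefficient tables here are the plain Taylor coefficients, modeled after but not necessarily identical to the hardware tables):

```python
import math

# Assumed hardware coefficient tables, indexed by the trimad immediate 0..7
SIN_TAB = [(-1)**i / math.factorial(2*i + 1) for i in range(8)]
COS_TAB = [(-1)**i / math.factorial(2*i) for i in range(8)]

def trimul(x, q):
    # square x; bit 0 of q becomes the sign bit of the result, so a
    # "negative" x^2 tells trimad to use the cosine coefficient table
    return -x * x if (q & 1) else x * x

def trisel(x, q):
    # final multiply coefficient: x for sine, 1.0 for cosine (bit 0 of q);
    # bit 1 of q negates it, deciding the sign of the whole result
    v = 1.0 if (q & 1) else x
    return -v if (q & 2) else v

def trimad(acc, x2, imm):
    # multiply-add with a table-lookup addition operand; the sign bit of
    # x2 selects the table, and the magnitude of x2 is the multiplier
    tab = COS_TAB if math.copysign(1.0, x2) < 0 else SIN_TAB
    return acc * abs(x2) + tab[imm]
```

Chaining eight `trimad` steps (immediates 7 down to 0) and one multiply by the `trisel` output reproduces the Horner evaluation without any constant-pool loads.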
Although the three transcendental function acceleration instructions are implemented in the ARM and SPARC 64V instruction sets, no implementation of the vector sine and vector cosine functions based on these instructions has been published, and the vector math libraries commonly used in industry (SLEEF, FDLIBM and Vector-libm) do not use the transcendental function acceleration instructions to improve the performance of the vector sine and vector cosine functions.
Disclosure of Invention
The technical problems to be solved by the invention are as follows: aiming at the problems in the prior art, the invention provides a method and a system for realizing a sine function and a cosine function based on a transcendental function acceleration instruction.
In order to solve the technical problems, the invention adopts the technical scheme that:
a sine and cosine function implementation method based on a transcendental function acceleration instruction comprises the following steps:
1) reducing each element of the incoming vector operand vd to the interval [-π/4, π/4], obtaining a corresponding integer vector vql and a floating-point vector vdr lying within [-π/4, π/4];
2) taking the integer vector vql modulo 4 according to vqln = vql mod 4 to obtain an integer vector vqln;
3) according to the Taylor series expansion method, using transcendental function acceleration instructions to perform the polynomial approximation on the floating-point vector vdr and the integer vector vqln, obtaining the vector sine or vector cosine calculation result vr corresponding to the vector operand vd.
Optionally, step 1) comprises:
1.1) judging whether all element values of the vector operand vd lie within the preset interval [-CD, CD], where the constant CD has the value 15; if all element values of vd lie within [-CD, CD], executing step 1.2); if any element value of vd does not lie within [-CD, CD], executing step 1.3);
1.2) using preset reduction algorithm 1 to reduce each element of the vector operand vd (all of whose element values lie within the preset interval [-CD, CD]) to the interval [-π/4, π/4], obtaining a corresponding integer vector vql and a floating-point vector vdr lying within [-π/4, π/4], where each element satisfies vd - vql × SR × 2 = vdr and the constant SR has the value π/4; then performing step 2);
1.3) using preset reduction algorithm 2 to reduce each element of the vector operand vd (at least one of whose elements does not lie within [-CD, CD]) to the interval [-π/4, π/4], obtaining a corresponding integer vector vql and a floating-point vector vdr lying within [-π/4, π/4], where each element satisfies vd - vql × SR × 2 = vdr and the constant SR has the value π/4; then performing step 2).
Optionally, determining in step 1.1) whether all element values of the vector operand vd lie within the preset interval [-CD, CD] comprises: first broadcasting the constant CD as a floating-point vector vcond; then performing a vector floating-point absolute-value operation on the vector operand vd and comparing each element of the result with the corresponding element of vcond, placing the logical yes/no comparison results in a floating-point vector vtmp; finally, if all values of vtmp are logical yes, judging that all element values of vd lie within the preset interval [-CD, CD], and otherwise judging that they do not.
Optionally, step 1.2) comprises: first broadcasting the constants CM_2_PI, CNPID2_A2 and CNPID2_B2 required by the reduction as floating-point vectors vm_2_pi, vnpid2_a2 and vnpid2_b2; then obtaining the result of the vector floating-point multiplication of the vector operand vd with vm_2_pi, rounding each element of the result to an integer value in round-to-nearest mode to obtain a floating-point vector vdql, and force-converting each element of vdql to an integer with a vector type-cast function to obtain the integer vector vql; meanwhile, performing a vector floating-point multiply-add with the vector operand vd as the addition operand and vdql and vnpid2_a2 as multiplication operands to obtain a floating-point vector vdr, then performing another vector floating-point multiply-add with vdr as the addition operand and vdql and vnpid2_b2 as multiplication operands, storing the result back into vdr. The floating-point vector vdr is then the incoming vector operand vd reduced to the interval [-SR, SR], and vql is an integer vector satisfying vd - vql × SR × 2 = vdr, where the constant SR has the value π/4.
Optionally, step 1.3) comprises: first broadcasting the constants M_2_PI_1_2_24, CM_2_PI, NPID2_A, NPID2_B, NPID2_C, NPID2_D and C16 required by the reduction as floating-point vectors vm_2_pi_1_2_24, vm_2_pi, vnpid2_a, vnpid2_b, vnpid2_c, vnpid2_d and vc16, respectively. Then performing a vector floating-point multiplication of the vector operand vd with vm_2_pi_1_2_24, rounding each element of the result to an integer value in round-toward-zero mode, and performing a vector floating-point multiplication of the rounded result with vc16 to obtain a floating-point vector vdqh; then obtaining the result of a vector floating-point multiply-add with vdqh as the addition operand and vd and vm_2_pi as multiplication operands, and rounding that result in round-to-nearest mode to obtain a floating-point vector vdql; then performing six successive vector floating-point multiply-adds on vdql and vdqh, where the first multiplication operands of the six operations are vnpid2_a, vnpid2_a, vnpid2_b, vnpid2_b, vnpid2_c and vnpid2_c, the second multiplication operands are alternately vdqh, vdql, vdqh, vdql, vdqh and vdql, the addition operand of each multiply-add is the vector operand vd, and the result of each multiply-add is written back into vd so that the six results accumulate continuously; finally, obtaining the vector floating-point addition of vdqh and vdql, and performing one more vector floating-point multiply-add with that sum and vnpid2_d as multiplication operands and the vector operand vd as the addition operand to obtain the floating-point vector vdr; meanwhile, after the floating-point vector vdql is computed, force-converting each element of vdql to an integer to obtain the integer vector vql. The floating-point vector vdr is then the incoming vector operand vd reduced to [-SR, SR], and vql is an integer vector satisfying vd - vql × SR × 2 = vdr, where the constant SR has the value π/4.
Optionally, step 2) comprises: first broadcasting the integer constants 0x3 and 0x1 as integer vectors vci3 and vci1, respectively; then judging the type of the sine/cosine implementation: if the sine function of the vector operand vd is to be computed, obtaining the integer vector vqln as the vector bitwise logical AND of the integer vector vql with vci3 (equivalent to vqln = vql mod 4); if the cosine function of vd is to be computed, first performing a vector integer addition of vql with vci1, then taking the vector bitwise logical AND of that result with vci3 to obtain the integer vector vqln.
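Step 2) can be sketched as follows (illustrative Python; note that in Python `q & 3` matches `q mod 4` even for negative integers, mirroring the bitwise AND with vci3):

```python
def quadrant_index(vql, is_cosine):
    # For cosine, shift by one quadrant first (cos(x) = sin(x + pi/2))
    # so that sine and cosine can share the same polynomial kernel.
    q = vql + 1 if is_cosine else vql
    return q & 0x3   # bitwise AND with 0x3, i.e. mod 4
```

The shift by 1 for cosine is exactly the vector integer addition with vci1 described above.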
Optionally, the transcendental function acceleration instructions used in step 3) include a transcendental function square instruction trimul, a transcendental function selection instruction trisel, and a transcendental function multiply-add instruction trimad.
Optionally, step 3) comprises: first generating a floating-point vector vt with all elements 0; then using the transcendental function square instruction trimul to perform the vector floating-point square of the floating-point vector vdr, obtaining the squared result vd2 whose sign bits are set according to the integer vector vqln; simultaneously using the transcendental function selection instruction trisel with the floating-point vector vdr and the integer vector vqln to perform a vector conditional selection, obtaining the last coefficient vls of the subsequent polynomial approximation together with its sign; then using the transcendental function multiply-add instruction trimad eight consecutive times to perform hardware-table-lookup vector floating-point multiply-adds on the floating-point vector vt and the squared result vd2, with the immediate operands of the eight trimad instructions being 7, 6, 5, 4, 3, 2, 1 and 0 in sequence; each trimad instruction performs a hardware table lookup according to its immediate operand to obtain the addition operand required by the vector floating-point multiply-add, and each trimad result is written back into the floating-point vector vt for continued accumulation; finally, obtaining the vector floating-point multiplication of the floating-point vector vt with the coefficient vls as the vector sine or vector cosine calculation result vr corresponding to the vector operand vd.
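Putting step 3) together for a single element (an illustrative Python model of the kernel; trimul, trisel and trimad are scalar stand-ins for the vector instructions, with plain Taylor coefficients as the assumed hardware tables):

```python
import math

SIN_TAB = [(-1)**i / math.factorial(2*i + 1) for i in range(8)]
COS_TAB = [(-1)**i / math.factorial(2*i) for i in range(8)]

def trimul(x, q):            # +-x^2; bit 0 of q sets the sign bit
    return -x * x if (q & 1) else x * x

def trisel(x, q):            # final coefficient: +-x or +-1.0
    v = 1.0 if (q & 1) else x
    return -v if (q & 2) else v

def trimad(acc, x2, imm):    # multiply-add with table-lookup coefficient
    tab = COS_TAB if math.copysign(1.0, x2) < 0 else SIN_TAB
    return acc * abs(x2) + tab[imm]

def trig_kernel(vdr, vqln):
    vd2 = trimul(vdr, vqln)          # squared input, sign-tagged
    vls = trisel(vdr, vqln)          # last coefficient and overall sign
    vt = 0.0                         # vt starts at all zeros
    for imm in (7, 6, 5, 4, 3, 2, 1, 0):   # eight trimad steps
        vt = trimad(vt, vd2, imm)
    return vt * vls                  # final multiply gives vr
```

For vqln = 0, 1, 2, 3 this kernel computes sin(vdr), cos(vdr), -sin(vdr) and -cos(vdr) respectively, matching the reconstruction table in the Detailed Description.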
In addition, the invention also provides a sine and cosine function implementation system based on the transcendental function acceleration instruction, which comprises a microprocessor and a memory which are connected with each other, wherein the microprocessor is programmed or configured to execute the steps of the sine and cosine function implementation method based on the transcendental function acceleration instruction.
In addition, the present invention also provides a computer readable storage medium, in which a computer program is stored, the computer program being used for being executed by a microprocessor to implement the steps of the sine and cosine function implementation method based on the transcendental function acceleration instruction.
Compared with the prior art, the invention mainly has the following advantages:
1. The invention provides, for the first time, a complete implementation of the vector sine and vector cosine functions based on transcendental function acceleration instructions. For instruction set architectures using fixed-length instruction encoding, the method avoids extra address calculation instructions and vector load instructions for loading the constant coefficients required by the polynomial approximation from a constant pool, greatly improving the performance of vector sine and vector cosine functions implemented with it.
2. The invention is not limited to a particular hardware platform; the method can be implemented on any instruction set architecture that provides the transcendental function acceleration instructions. The method therefore has good applicability.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart illustrating a branch flow for determining a vector operand vd according to the method of the embodiment of the present invention.
Detailed Description
As shown in fig. 1, the method for implementing sine and cosine functions based on transcendental function acceleration instructions in this embodiment includes:
1) reducing each element of the incoming vector operand vd to the interval [-π/4, π/4], obtaining a corresponding integer vector vql and a floating-point vector vdr lying within [-π/4, π/4];
2) taking the integer vector vql modulo 4 according to vqln = vql mod 4 to obtain an integer vector vqln;
3) according to the Taylor series expansion method, using transcendental function acceleration instructions to perform the polynomial approximation on the floating-point vector vdr and the integer vector vqln, obtaining the vector sine or vector cosine calculation result vr corresponding to the vector operand vd.
Referring to fig. 2, step 1) of the present embodiment includes:
1.1) judging whether all element values of the vector operand vd lie within the preset interval [-CD, CD], where the constant CD has the value 15; if all element values of vd lie within [-CD, CD], executing step 1.2); if any element value of vd does not lie within [-CD, CD], executing step 1.3);
1.2) reducing each element of the incoming vector operand vd to the interval [-π/4, π/4] using reduction algorithm 1, obtaining a corresponding integer vector vql and a floating-point vector vdr lying within [-π/4, π/4], where each element satisfies vd - vql × SR × 2 = vdr and the constant SR has the value π/4; then performing step 2);
1.3) reducing each element of the incoming vector operand vd to the interval [-π/4, π/4] using reduction algorithm 2, obtaining a corresponding integer vector vql and a floating-point vector vdr lying within [-π/4, π/4], where each element satisfies vd - vql × SR × 2 = vdr and the constant SR has the value π/4; then performing step 2).
In this embodiment, determining in step 1.1) whether all element values of the vector operand vd lie within the preset interval [-CD, CD] comprises: first broadcasting the constant CD as a floating-point vector vcond; then performing a vector floating-point absolute-value operation on the vector operand vd and comparing each element of the result with the corresponding element of vcond, placing the logical yes/no comparison results in a floating-point vector vtmp; finally, if all values of vtmp are logical yes, judging that all element values of vd lie within the preset interval [-CD, CD], and otherwise judging that they do not.
In this embodiment, a vector sine function is implemented by using the periodicity and symmetry of the trigonometric functions: the sine of any input vd is expressed through the value of a sine or cosine function on the [-π/4, π/4] interval. Any input vd in the real number domain may be written as vd = vdr + vql × π/2, where vdr is a real number in the range [-π/4, π/4] and vql is an integer. vql may be expressed as:
vql = 4 × N + vqln,
where N is an integer and vqln ∈ {0, 1, 2, 3}.
Sin(vd) has 4 different computational realizations corresponding to the four values of vqln:
sin(vd)=sin(vdr+N×2π)=sin(vdr); vql=4×N+vqln; vqln=0,
sin(vd)=sin(vdr+N×2π+π/2)=cos(vdr); vql=4×N+vqln; vqln=1,
sin(vd)=sin(vdr+N×2π+π)=-sin(vdr); vql=4×N+vqln; vqln=2,
sin(vd)=sin(vdr+N×2π+3π/2)=-cos(vdr); vql=4×N+vqln; vqln=3.
This means that the reduction algorithm first reduces vd into [-π/4, π/4], yielding vdr (vdr = vd - vql × π/2) and vqln (vqln = vql mod 4); according to the four formulas above, vqln then selects whether sin(vdr), cos(vdr), -sin(vdr), or -cos(vdr) is actually computed from vdr.
Similarly, in this embodiment a vector cosine function is also implemented by using the periodicity and symmetry of the trigonometric functions: the cosine of the input vd is expressed through a sine or cosine value on the [-π/4, π/4] interval, and cos(vd) has 4 different computational realizations corresponding to the four values of vqln:
cos(vd)=cos(vdr+N×2π)=cos(vdr); vql=4×N+vqln; vqln=0,
cos(vd)=cos(vdr+N×2π+π/2)=-sin(vdr); vql=4×N+vqln; vqln=1,
cos(vd)=cos(vdr+N×2π+π)=-cos(vdr); vql=4×N+vqln; vqln=2,
cos(vd)=cos(vdr+N×2π+3π/2)=sin(vdr); vql=4×N+vqln; vqln=3.
By reconstructing with the periodicity and symmetry of the sine and cosine functions, the implementation of the vector sine function or vector cosine function ultimately reduces to computing sin(vdr), cos(vdr), -sin(vdr), or -cos(vdr) on the reduced argument vdr. The sine and cosine functions differ only in which of these four computations each vqln value maps to. Therefore, in the method of this embodiment, the vector sine function and the vector cosine function share the same polynomial approximation calculation kernel, and the vqln produced by the cosine reduction is adjusted so that it maps to the correct computation. This embodiment thus first processes the input vd with a reduction algorithm to obtain vdr reduced into [-π/4, π/4] and the integer multiple vql of π/2 by which vd exceeds vdr, then takes vql modulo 4 to obtain vqln, which specifies the subsequent polynomial approximation calculation (whether sin(vdr) or cos(vdr) is evaluated, and whether the result is negated). Since the vector sine and vector cosine implementations share the same polynomial approximation kernel built on transcendental function acceleration instructions, the kernel accepts vdr and vqln as inputs, and vqln indicates whether the kernel actually computes sin(vdr), cos(vdr), -sin(vdr), or -cos(vdr). When computing the sine function, the kernel computes sin(vdr) when vqln is 0, cos(vdr) when vqln is 1, -sin(vdr) when vqln is 2, and -cos(vdr) when vqln is 3.
When calculating the cosine function, the vector cosine implementation of this example adds 1 to each element of the vql obtained by vector reduction before taking it modulo 4 to obtain vqln; this coordinate-rotation adjustment makes the cosine function's vqln map to the correct computations cos(vdr), -sin(vdr), -cos(vdr), or sin(vdr).
Based on the above principle description, the pseudo code of the vector sine function vsin of the present embodiment can be expressed as:
(Pseudo code figure: vector sine function vsin; rendered as an image in the original publication.)
In the above pseudo code, vfloat_t and vint_t represent the floating point vector type and the integer vector type, respectively. The vbcast_f() function broadcasts the constant CD (CD = 15) to each element of the vector vcond; the vabs_f() function then computes the absolute value of each element of the input vector vd; the vcmplt_f() function judges whether each element of the result vector is smaller than the value of the corresponding element of vcond and places the comparison result in the vector vtmp; and the testallones() function then judges whether every element of vtmp is logical yes, thereby determining whether every element of the input vector vd is located in the interval [-CD, CD]. Then, according to the determination result, a reduction calculation is performed using reduction code 1 or reduction code 2, and the reduction result is placed in the vectors vdr and vql (vql records the integer multiple of π/2 by which vd exceeds vdr). Reduction code 1 is a direct implementation of the Cody-Waite algorithm and can accurately reduce only a vd whose every element lies within [-CD, CD]; when the value of any element of vd falls outside [-CD, CD], the reduction must be carried out by reduction code 2, which uses a variant of the Cody-Waite algorithm that can reduce a vd whose values lie outside [-CD, CD], but at higher complexity and lower performance. This example uses the two different reduction codes to handle the two cases (all elements of the input vector vd in the [-CD, CD] interval, or any element outside it) so as to balance the performance and accuracy of the reduction algorithm of this embodiment.
After vdr and vql are obtained by the reduction of this example, the vand_i() function computes the vector bitwise logical AND of vql and 0x3 (0x3 having been broadcast into the vector vci3 by the vbcast_i() function) to obtain vqln, thereby quickly computing vql modulo 4. Finally, the polynomial approximation kernel code computes, from the reduced vdr and as indicated by vqln, the calculation result vr of the whole vector sine function.
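The bitwise AND with 0x3 relies on a standard identity: for two's-complement integers, q & 0x3 equals the non-negative residue of q modulo 4. A minimal C illustration:

```c
/* q & 0x3 extracts the low 2 bits, which on two's-complement machines
 * equals the non-negative value of q mod 4: exactly the quadrant index
 * vqln that the polynomial kernel's 4-way selection expects. */
static int mod4_fast(int q) {
    return q & 0x3;
}
```

Note that, unlike C's % operator, the AND form also yields the non-negative residue for negative multiples, which matters when the input vd is negative.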
The pseudo code implemented by the vector cosine function vcos of this embodiment can be expressed as:
(Pseudo code figure: vector cosine function vcos; rendered as an image in the original publication.)
The pseudo code of the vector cosine function vcos is similar to that of the vector sine function vsin of this embodiment, except that 1 is added to each element of the vector vql obtained by the reduction using the function vadd_i() before vql is taken modulo 4, thereby achieving the vqln adjustment.
In this embodiment, the pseudo code of step 1.2) may be expressed as:
vdql←[vd×CM_2_PI];
vdr←vd-vdql×PID2_A2;
vdr←vdr-vdql×PID2_B2;
vql←(int)vdql;
The pseudo code uses the Cody-Waite reduction algorithm to calculate vdr = vd - vql × π/2, where vql = [vd × 2/π] is the integer multiple of π/2 contained in vd.
In this embodiment, π/2 is split into 2 double-precision floating point constants PID2_A2 and PID2_B2 for storage (PID2_A2 stores an approximation of π/2 using 49 mantissa bits, and PID2_B2 stores the remaining significant bits of π/2). The reduction code calculates vdr through vd - vql × PID2_A2 - vql × PID2_B2 (lines 2 and 3 of the pseudo code), and multiplies vd by the reciprocal of π/2 and rounds the result to obtain vql (line 1), thereby avoiding a time-consuming division operation. vql ← (int)vdql denotes that the floating point number vector vdql is force-converted into the integer vector vql.
The pseudo code corresponding to step 1.2), implemented using C/C++ intrinsics, can be expressed as follows:
(Pseudo code figure: C/C++ intrinsics implementation of step 1.2); rendered as an image in the original publication.)
The above pseudo code uses vfloat_t and vint_t to represent the floating point vector type and integer vector type, respectively, and its implementation includes: the vectors v_m_2_pi, vnpid2_a2 and vnpid2_b2 are generated by broadcasting, with the vbcast_f() function, the constants CM_2_PI, CNPID2_A2 and CNPID2_B2 required for the reduction (CM_2_PI is a double-precision floating point constant representation of 2/π; CNPID2_A2 and CNPID2_B2 are the opposite numbers of the constants PID2_A2 and PID2_B2, respectively). Then the vmul_f() function obtains the result of the floating point multiplication of vd and the v_m_2_pi vector, the vroundtozero_f() function rounds each element of the result to an integer in round-toward-zero mode to obtain the floating point number vector vdql, and the vcast_ftoi() function force-converts each element of vdql to obtain the integer vector vql. Meanwhile, with the vector vd as the addition operand, the vmadd_f() function performs a vector floating point multiply-add operation on the vectors vdql and vnpid2_a2 to obtain the result vector vdr; then, with the vector vdr as the addition operand, another vector floating point multiply-add operation is performed on the vectors vdql and vnpid2_b2, and the result is stored in the floating point vector vdr. The result in the vector vdr is the input operand vd reduced into the interval [-SR, SR] (SR = π/4), and vql is an integer vector satisfying vd - vql × SR × 2 = vdr. It should be noted that the constants CNPID2_A2 and CNPID2_B2 may instead be the same as the constants PID2_A2 and PID2_B2, in which case the multiply-subtract function vmsub_f() replaces the 2 subsequent uses of the multiply-add function vmadd_f().
In this embodiment, the pseudo code of step 1.3) for performing vector reduction on an input vector vd any element of which is not in the range [-CD, CD] may be expressed as:
vdqh←[vd×M_2_PI_1_2_24]×C16;
vdql←[vd×CM_2_PI-vdqh];
vql←(int)vdql;
vqh←(int)vdqh;
vd←vd-vdqh×PID2_A;
vd←vd-vdql×PID2_A;
vd←vd-vdqh×PID2_B;
vd←vd-vdql×PID2_B;
vd←vd-vdqh×PID2_C;
vd←vd-vdql×PID2_C;
vd←vd-(vdql+vdqh)×PID2_D;
The code uses a variant of the Cody-Waite reduction algorithm to calculate vdr = vd - vq × π/2, so as to improve the reduction precision. This embodiment splits the constant π/2 into 4 double-precision floating point constants (π/2 = PID2_A + PID2_B + PID2_C + PID2_D), and also splits the variable vq into 2 double-precision floating point variables (vq = vqh + vql) to participate in the calculation of vdr, so that vdr is calculated by the following formula:
vdr = vd - (vqh + vql) × (PID2_A + PID2_B + PID2_C + PID2_D) = vd - vqh×PID2_A - vql×PID2_A - vqh×PID2_B - vql×PID2_B - vqh×PID2_C - vql×PID2_C - (vql + vqh)×PID2_D;
The calculation of vdr corresponds to the 7 multiply-add lines of the foregoing pseudo code, lines 5 to 11. vqh and vql are obtained from the code of lines 1 and 2: vdqh and vdql are computed as the multiple and the remainder, respectively, of vq with respect to C16. In this embodiment, C16 is set to 1 << 24, so that the low 24 mantissa bits of vdqh are zero; the floating point numbers vdql and vdqh are then converted into the integer variables vql and vqh by the forced conversions (int) in lines 3 and 4.
The pseudo code corresponding to step 1.3), implemented using C/C++ intrinsics, can be expressed as follows:
(Pseudo code figure: C/C++ intrinsics implementation of step 1.3); rendered as an image in the original publication.)
The above pseudo code uses vfloat_t and vint_t to represent the floating point vector type and integer vector type, respectively, and its implementation includes: first, the constants M_2_PI_1_2_24, CM_2_PI, NPID2_A, NPID2_B, NPID2_C, NPID2_D and C16 (M_2_PI_1_2_24 and CM_2_PI are double-precision floating point constant representations of (2/π)/2^24 and 2/π, respectively, and NPID2_A, NPID2_B, NPID2_C and NPID2_D are the opposite numbers of the constants PID2_A, PID2_B, PID2_C and PID2_D, respectively) are broadcast using the vbcast_f() function as the floating point vectors vm_2_pi_1_2_24, vm_2_pi, vnpid2_a, vnpid2_b, vnpid2_c, vnpid2_d and vc16, respectively. Then the vmul_f() function performs a vector floating point multiplication of vd and vm_2_pi_1_2_24, the vroundtozero_f() function rounds each element of the result to an integer in round-toward-zero mode, and the rounded result is multiplied by vc16 in another vector floating point multiplication to obtain the floating point vector vdqh. Next, the vmadd_f() function obtains the result of a vector floating point multiply-add operation with the vector vdqh as the addition operand and the vectors vd and vm_2_pi as multiplication operands, and the vroundtonearest_f() function rounds the result element-wise in round-to-nearest mode to obtain the floating point vector vdql. Then six vector floating point multiply-add operations are performed in sequence on vdql and vdqh: the first multiplication operands are, in order, vnpid2_a, vnpid2_a, vnpid2_b, vnpid2_b, vnpid2_c and vnpid2_c; the second multiplication operands are, in order, vdqh, vdql, vdqh, vdql, vdqh and vdql; the addition operand of each operation is the vector vd, and the result of each multiply-add operation is placed back into the vector vd for accumulation.
Finally, the vadd_f() function obtains the vector floating point addition result of the vectors vdqh and vdql, and one more vector floating point multiply-add operation is performed on this result and the vector vd with vnpid2_d as the multiplication operand, yielding the floating point number vector vdr. Meanwhile, after vdql is obtained, each of its elements is force-converted to an integer to obtain the integer vector vql. The result in the vector vdr is the input operand vd reduced into [-SR, SR], and vql is an integer vector satisfying vd - vql × SR × 2 = vdr. It should be noted that the constants NPID2_A, NPID2_B, NPID2_C and NPID2_D may instead be the same as the constants PID2_A, PID2_B, PID2_C and PID2_D, in which case the multiply-subtract function vmsub_f() replaces the 7 uses of the multiply-add function vmadd_f().
In this embodiment, the transcendental function acceleration instructions used in step 3) include a transcendental function square instruction trimul, a transcendental function selection instruction trisel, and a transcendental function multiply-add instruction trimad. Specifically, step 3) includes: first, a floating point vector vt with all elements 0 is generated; then the transcendental function square instruction trimul performs a vector floating point square operation on the floating point vector vdr to obtain the square result vd2, with the sign bit of vd2 set according to the integer vector vqln; at the same time, the transcendental function selection instruction trisel performs a vector condition selection according to the floating point vector vdr and the integer vector vqln to obtain the last coefficient vls, and its sign, for the subsequent polynomial approximation calculation. Next, the transcendental function multiply-add instruction trimad is used eight consecutive times to perform vector floating point multiply-add operations, supported by hardware table lookup, on the floating point vector vt and the square result vd2; the immediate operands of the eight trimad instructions are 7, 6, 5, 4, 3, 2, 1 and 0 in sequence, and each trimad instruction performs a hardware table lookup according to its immediate operand to obtain the addition operand required by the vector floating point multiply-add operation, with each trimad result accumulated back into the floating point vector vt. Finally, the vector floating point multiplication result of the floating point vector vt and the coefficient vls is obtained as the vector sine or cosine function calculation result vr corresponding to the vector operand vd.
The above step 3) may be represented by using a pseudo code implemented by C/C + + intrinsics as follows:
(Pseudo code figure: C/C++ intrinsics implementation of step 3); rendered as an image in the original publication.)
Here, vt = vbcast_f(0.0) denotes generating a floating point number vector vt with all elements 0; vd2 = vtrimul_f(vdr, vqln) denotes performing the vector floating point square operation on the floating point vector vdr using the trimul instruction to obtain the square result vd2; vls = vtrisel_f(vdr, vqln) denotes using the transcendental function selection instruction trisel to perform a vector condition selection on the floating point vector vdr and the integer vector vqln to obtain the last coefficient vls of the subsequent polynomial approximation; the subsequent vtrimad_f(vt, vd2, 7) through vtrimad_f(vt, vd2, 0) denote the eight consecutive vector floating point multiply-add operations, supported by hardware table lookup, of the trimad instruction on the floating point vector vt and the square result vd2; and vr = vmul_f(vt, vls) denotes the vector floating point multiplication of the floating point vector vt and the coefficient vls. In this embodiment, the pseudo code of step 3) implemented in assembly may be expressed as:
(Pseudo code figure: assembly implementation of step 3); rendered as an image in the original publication.)
In the assembly pseudo code, the instructions used correspond one-to-one to the built-in functions of the C/C++ intrinsics pseudo code, and they are not described individually here.
In step 3), vdr is used as the input of the polynomial approximation calculation, vqln is used to control the constant coefficient of the sine function or cosine function for the polynomial approximation calculation, and whether the final calculation result needs to take the inverse number is controlled, so that the actual calculation content of the polynomial approximation calculation is sin (vdr), cos (vdr), -sin (vdr), or-cos (vdr). vqln takes on {0,1,2,3}, and when vqln is 0 (the lower 2 bits of vqln are all 0), the trimul instruction sets the sign bit of vdr to 0, and the sign bit of vdr is 0, so that the subsequent multiply-add instruction trimad with 8 polynomial approximations uses constant coefficients of a sine function. vqln of 0 also causes trisel to select vdr required to set the multiplication operand of the polynomial approximation calculation to the sine function, so that step 6) finally calculates sin (vdr). When vqln is 1 (the lower 2 bits of vqln are 01), the trimul instruction sets the sign bit of vdr to 1, the vdr sign bit of 1 will cause 8 trimad instructions to use the constant coefficient of the cosine function, and vqln of 1 will also cause the trisel instruction to use the constant 1 required by the cosine function as the final multiplication operand, so that cos (vdr) is finally calculated. When vqln is 2 (the low 2 bit of vql is 10), the trimul instruction sets the sign bit of vdr to 0, causing the trimad instruction to use the constant coefficients of the sine function. vqln of 2 will also cause the trisel to have-vdr required by the sine function as the final multiplication operand, so that the final calculation is-sin (vdr). Similarly, when vqln is 3, it will eventually be calculated as-cos (vdr).
It should be noted that the immediates used by the 8 trimad instructions in step 3) may be any 8 different constant immediates, as long as the 8 immediates correspond, in order, to the 8 constant coefficients required by the polynomial approximation calculation. The immediates used by the 8 trimad instructions in step 3) of this embodiment are 7, 6, 5, 4, 3, 2, 1 and 0: the immediate 7 corresponds to the constant -1/15! or -1/14!, the immediate 6 corresponds to the constant 1/13! or 1/12!, and so on. If the trimad instructions of a given instruction set implementation use a different mapping, the 8 immediates used in step 3) need to be modified accordingly, so that the 1st trimad instruction of the polynomial approximation uses -1/15! or -1/14! as its constant coefficient, the 2nd trimad instruction uses 1/13! or 1/12!, and so on.
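As an illustration of the immediate-to-coefficient mapping, the following scalar C sketch emulates the trimad hardware coefficient table with two constant arrays (Taylor coefficients of the sine series up to x^15 and of the cosine series up to x^14) and emulates the trisel choice of the final multiplicand with ordinary branches; all names are illustrative, not the real ISA:

```c
#include <math.h>

/* Emulated trimad coefficient tables, indexed from immediate 7 down to 0. */
static const double SIN_COEF[8] = {
    -1.0/1307674368000.0,  /* -1/15! (immediate 7) */
     1.0/6227020800.0,     /*  1/13! */
    -1.0/39916800.0,       /* -1/11! */
     1.0/362880.0,         /*  1/9!  */
    -1.0/5040.0,           /* -1/7!  */
     1.0/120.0,            /*  1/5!  */
    -1.0/6.0,              /* -1/3!  */
     1.0                   /*  1     (immediate 0) */
};
static const double COS_COEF[8] = {
    -1.0/87178291200.0,    /* -1/14! (immediate 7) */
     1.0/479001600.0,      /*  1/12! */
    -1.0/3628800.0,        /* -1/10! */
     1.0/40320.0,          /*  1/8!  */
    -1.0/720.0,            /* -1/6!  */
     1.0/24.0,             /*  1/4!  */
    -1.0/2.0,              /* -1/2!  */
     1.0                   /*  1     (immediate 0) */
};

/* qn selects sin(xr), cos(xr), -sin(xr) or -cos(xr), xr in [-pi/4, pi/4]. */
static double poly_kernel(double xr, int qn) {
    int use_cos = qn & 1;                  /* odd quadrant: cosine table */
    const double *c = use_cos ? COS_COEF : SIN_COEF;
    double x2 = xr * xr;                   /* trimul: square the argument */
    double t = 0.0;
    for (int imm = 7; imm >= 0; imm--)     /* eight trimad steps */
        t = t * x2 + c[7 - imm];           /* coefficient "table lookup" */
    double ls = use_cos ? 1.0 : xr;        /* trisel: last multiplicand */
    if (qn & 2) ls = -ls;                  /* quadrants 2 and 3: negate result */
    return t * ls;                         /* final vmul: vr = vt * vls */
}
```

Because both polynomials are evaluated in the squared argument x2, the same eight-step Horner loop serves sine and cosine; only the coefficient table and the last multiplicand differ, which is precisely what the vd2 sign bit and trisel encode in the patent's scheme.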
It should be noted that the vectors used in the processes of step 1) to step 3) in this embodiment refer to both the SIMD instruction set vector register and various vector types in the SIMD instruction set vector intrinsics. The vector operation in the steps 1) to 3) may be a vector instruction in the SIMD instruction set or a corresponding vector intrinsics function in the SIMD instruction set vector intrinsics.
It should be further noted that the built-in functions, variable types, variable names, instruction names, and register names used in the processes of step 1) to step 3) in this embodiment are used only for exemplary illustration, and are not limited to the same names in actual use, and any built-in functions and assembly instructions capable of implementing the operations required in step 1) to step 3) in this embodiment may be used in implementing this method.
In summary, the sine and cosine function implementation method based on transcendental function acceleration instructions of this embodiment introduces a vector operand vd, selects one of two different vector reduction methods according to the numerical range of the operand to reduce each element of vd into the range [-SR, SR] (SR = π/4), and then performs a polynomial approximation calculation on the reduced result vector vdr according to the Taylor series expansion using transcendental function acceleration instructions, thereby obtaining the vector sine or vector cosine result for the input vector vd. For instruction set architectures with fixed-length instruction encoding, this method provides vector sine and cosine implementations that avoid memory access instructions in the polynomial approximation: no additional address calculation instructions or vector load instructions are needed to fetch the polynomial's constant coefficients from a constant pool, which greatly improves the performance of the vector sine and cosine functions.
In addition, the present embodiment also provides a system for implementing sine and cosine functions based on transcendental function acceleration instructions, which includes a microprocessor and a memory connected to each other, where the microprocessor is programmed or configured to execute the steps of the above-mentioned method for implementing sine and cosine functions based on transcendental function acceleration instructions.
In addition, the present embodiment also provides a computer-readable storage medium, in which a computer program is stored, where the computer program is used for being executed by a microprocessor to implement the steps of the above-mentioned sine and cosine function implementation method based on the transcendental function acceleration instruction.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. 
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A sine and cosine function implementation method based on a transcendental function acceleration instruction is characterized by comprising the following steps:
1) reducing each element of the incoming vector operand vd to an interval of [ -pi/4, pi/4 ], to obtain a corresponding integer vector vql and a floating point number vector vdr within the interval of [ -pi/4, pi/4 ];
2) taking the integer vector vql modulo 4 according to vqln = vql mod 4 to obtain the integer vector vqln;
3) according to the Taylor series expansion method, a transcendental function acceleration instruction is used for carrying out polynomial approximation calculation on a floating point number vector vdr and an integer vector vqln to obtain a vector sine function or vector cosine function calculation result vr corresponding to a vector operand vd.
2. The sine-cosine function implementation method based on the transcendental function acceleration instruction as claimed in claim 1, wherein step 1) comprises:
1.1) judging whether all element values of the vector operand vd are positioned in a preset interval [ -CD, CD ], wherein the value of the constant CD is 15, and if all element values of the vector operand vd are positioned in the preset interval [ -CD, CD ], executing the step 1.2); if any element value of the vector operand vd is not within the preset interval [ -CD, CD ], executing step 1.3);
1.2) reducing each element of a vector operand vd all of whose element values lie within the preset interval [-CD, CD] into the interval [-π/4, π/4] using a preset reduction algorithm 1 to obtain the corresponding integer vector vql and a floating point vector vdr located in the interval [-π/4, π/4], wherein each element of the integer vector vql satisfies vd - vql × SR × 2 = vdr and the constant SR takes the value π/4; performing step 2);
1.3) reducing each element of a vector operand vd any element of which is not located in the interval [-CD, CD] into the interval [-π/4, π/4] using a preset reduction algorithm 2 to obtain the corresponding integer vector vql and a floating point vector vdr located in the interval [-π/4, π/4], wherein each element of the integer vector vql satisfies vd - vql × SR × 2 = vdr and the constant SR takes the value π/4; performing step 2).
3. The method for implementing sine and cosine functions based on transcendental function acceleration instructions according to claim 2, wherein the step 1.1) of determining whether all element values of the vector operand vd are within the preset interval [-CD, CD] comprises: first broadcasting the constant CD into a floating point vector vcond; then performing a vector floating point absolute value operation on the vector operand vd, judging whether each element value in the result of the vector floating point absolute value operation is smaller than the corresponding element value of the floating point vector vcond to obtain a logical-yes or logical-no comparison result, and placing the comparison result in a floating point vector vtmp; and finally judging whether the values of the floating point vector vtmp are all logical yes; if so, judging that all element values of the vector operand vd are located in the preset interval [-CD, CD], and otherwise judging that not all element values of the vector operand vd are located in the preset interval [-CD, CD].
4. The sine-cosine function implementation method based on the transcendental function acceleration instruction of claim 2, wherein step 1.2) comprises: first broadcasting the constants CM_2_PI, CNPID2_A2 and CNPID2_B2 required by the reduction into the floating point number vectors v_m_2_pi, vnpid2_a2 and vnpid2_b2; then obtaining the result of the vector floating point multiplication of the vector operand vd and the floating point vector v_m_2_pi, rounding each element of the result to an integer to obtain the floating point number vector vdql, and force-converting each element of the floating point vector vdql to an integer with a vector type-cast function to obtain the integer vector vql; meanwhile, performing a vector floating point multiply-add operation with the vector operand vd as the addition operand on the floating point vectors vdql and vnpid2_a2 to obtain the result floating point vector vdr, then performing another vector floating point multiply-add operation with the floating point vector vdr as the addition operand on the floating point vectors vdql and vnpid2_b2, and storing the result in the floating point vector vdr; the result in the floating point vector vdr is the result of the input vector operand vd reduced into the range [-SR, SR], vql is an integer vector satisfying vd - vql × SR × 2 = vdr, and the constant SR takes the value π/4.
5. The sine and cosine function implementation method based on the transcendental function acceleration instruction according to claim 2, wherein step 1.3) comprises: first broadcasting the constants M_2_PI_1_2_24, CM_2_PI, NPID2_A, NPID2_B, NPID2_C, NPID2_D and C16 required for reduction as floating point vectors vm_2_pi_1_2_24, vm_2_pi, vnpid2_a, vnpid2_b, vnpid2_c, vnpid2_d and vc16, respectively; then performing a vector floating point multiplication of the vector operand vd and the floating point vector vm_2_pi_1_2_24, rounding each element of the product to an integer in zero-rounding mode, and multiplying the rounded result by the floating point vector vc16 to obtain a floating point vector vdqh; then performing a vector floating point multiply-add operation with the floating point vector vdqh as the addition operand and the vector operand vd and the floating point vector vm_2_pi as the multiplication operands, and rounding the result in near-rounding mode to obtain a floating point vector vdql; then performing six successive vector floating point multiply-add operations on the floating point vectors vdql and vdqh, wherein the first multiplication operands of the six operations are vnpid2_a, vnpid2_a, vnpid2_b, vnpid2_b, vnpid2_c and vnpid2_c in turn, the second multiplication operands are vdqh, vdql, vdqh, vdql, vdqh and vdql in turn, the addition operand of each operation is the vector operand vd, and the result of each multiply-add operation is placed back into the vector operand vd so that the six results accumulate; finally, computing the vector floating point addition of the floating point vectors vdqh and vdql, and performing one further vector floating point multiply-add operation with the floating point vector vnpid2_d as one multiplication operand, that sum as the other multiplication operand, and the vector operand vd as the addition operand, to obtain a floating point vector vdr; meanwhile, after the floating point vector vdql has been computed, converting each element of the floating point vector vdql to an integer to obtain an integer vector vql; the floating point vector vdr is thus the input vector operand vd reduced to the range [-SR, SR], the integer vector vql is an integer vector satisfying vd − vql × SR × 2 = vdr, and the value of the constant SR is π/4.
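A simplified sketch of this large-argument reduction. The quotient is split into a high multiple-of-16 part (the role of vdqh and vc16) and a small low part (vdql), and −π/2 is applied as a multi-term split so that each product stays accurate; a hypothetical three-term fdlibm-style split stands in for the undisclosed NPID2_A..NPID2_D constants, and without fused multiply-add this sketch does not reach the precision of the hardware path, only its structure:

```python
import numpy as np

M_2_PI = 2.0 / np.pi
# hypothetical three-term split of pi/2 (fdlibm-style); these stand in for the
# patent's NPID2_A..NPID2_D constants, whose exact values are not given here
PIO2_HI = 1.57079632673412561417e+00
PIO2_MID = 6.07710050630396597660e-11
PIO2_LO = 2.02226624879595063154e-21


def reduce_large(vd: np.ndarray):
    q = np.rint(vd * M_2_PI)           # full quotient of vd by pi/2
    vdqh = np.trunc(q / 16.0) * 16.0   # high multiple-of-16 part (zero rounding, vc16)
    vdql = q - vdqh                    # small low part: its products stay accurate
    vdr = vd.copy()
    for part in (PIO2_HI, PIO2_MID, PIO2_LO):
        vdr = vdr - vdqh * part        # pairs of multiply-adds, high part first
        vdr = vdr - vdql * part
    return vdr, q.astype(np.int64)     # vd - q * (pi/2) ~= vdr, |vdr| <= pi/4
```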
6. The sine and cosine function implementation method based on the transcendental function acceleration instruction according to claim 1, wherein step 2) comprises: first broadcasting the integer constants 0x3 and 0x1 as integer vectors vci3 and vci1, respectively; then judging the type of function to be computed: if the sine function of the vector operand vd is to be computed, obtaining the integer vector vqln as the vector bitwise logical AND of the integer vector vql and the integer vector vci3; if the cosine function of the vector operand vd is to be computed, performing a vector integer addition of the integer vector vql and the integer vector vci1, and then taking the vector bitwise logical AND of that sum and the integer vector vci3 to obtain the integer vector vqln.
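The quadrant computation above can be sketched in a few lines; the cosine case simply shifts the quadrant by one, exploiting cos(x) = sin(x + π/2):

```python
import numpy as np


def quadrant_index(vql: np.ndarray, cosine: bool) -> np.ndarray:
    # sine: n = vql & 3; cosine: n = (vql + 1) & 3
    vci1, vci3 = 1, 3      # broadcast integer constants 0x1 and 0x3
    if cosine:
        vql = vql + vci1   # vector integer addition for the cosine case
    return vql & vci3      # vector bitwise logical AND keeps the low two bits
```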
7. The sine and cosine function implementation method based on the transcendental function acceleration instruction according to claim 1, wherein the transcendental function acceleration instructions used in step 3) comprise a transcendental function square instruction trimul, a transcendental function selection instruction trisel and a transcendental function multiply-add instruction trimad.
8. The sine and cosine function implementation method based on the transcendental function acceleration instruction according to claim 7, wherein step 3) comprises: first generating a floating point vector vt with all elements equal to 0; then performing a vector floating point square operation on the floating point vector vdr with the transcendental function square instruction trimul to obtain a square result vd2, setting the sign bit of the square result vd2 according to the integer vector vqln, and at the same time performing a vector condition selection with the transcendental function selection instruction trisel according to the floating point vector vdr and the integer vector vqln to obtain the last coefficient vls of the subsequent polynomial approximation and its sign; then executing the transcendental function multiply-add instruction trimad eight times in succession to perform vector floating point multiply-add operations with hardware table lookup on the floating point vector vt and the square result vd2, the immediate operands of the eight trimad instructions being 7, 6, 5, 4, 3, 2, 1 and 0 in turn, wherein the transcendental function multiply-add instruction trimad performs a hardware table lookup according to its immediate operand to obtain the addition operand required by the vector floating point multiply-add operation, and the result of each trimad instruction is placed back into the floating point vector vt for continued accumulation; finally, computing the vector floating point multiply-add result of the floating point vector vt and the coefficient vls as the vector sine function result vr corresponding to the vector operand vd.
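Since trimul, trisel and trimad are custom hardware instructions whose coefficient tables are not disclosed, the polynomial core can only be sketched by analogy. The sketch below models the trimad table with hypothetical Taylor coefficients (even quadrants use the sine series x·P(x²), odd quadrants the cosine series Q(x²)), the trisel step as the choice of the last coefficient vls and its sign, and the accumulation as a plain Horner loop; the real accumulation order and table contents are hardware-defined:

```python
import math
import numpy as np

# hypothetical coefficient tables standing in for trimad's hardware lookup:
# index k holds the series term of degree 2k in x^2
SIN_COEF = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(9)]  # sin(x)/x
COS_COEF = [(-1.0) ** k / math.factorial(2 * k) for k in range(9)]      # cos(x)


def sincos_core(vdr: np.ndarray, vqln: np.ndarray) -> np.ndarray:
    vr = np.empty_like(vdr)
    for i, (x, n) in enumerate(zip(vdr, vqln)):
        vd2 = x * x                               # trimul: square of the reduced argument
        coef = COS_COEF if (n & 1) else SIN_COEF  # table selected by quadrant parity
        vls = 1.0 if (n & 1) else x               # trisel: last coefficient is 1 or x
        if n & 2:                                 # quadrants 2 and 3 flip the sign
            vls = -vls
        vt = 0.0
        for k in range(8, -1, -1):                # Horner accumulation (trimad steps)
            vt = vt * vd2 + coef[k]
        vr[i] = vt * vls                          # final multiply with coefficient vls
    return vr
```

With the reduced argument in [−π/4, π/4] and the quadrant index from claim 6, this yields sin(x + n·π/2), which covers both the sine and cosine cases.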
9. A sine and cosine function implementation system based on transcendental function acceleration instructions, comprising a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to perform the steps of the sine and cosine function implementation method based on transcendental function acceleration instructions according to any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a microprocessor to implement the steps of the sine and cosine function implementation method based on transcendental function acceleration instructions according to any one of claims 1 to 8.
CN202210647106.5A 2022-06-09 2022-06-09 Sine and cosine function implementation method and system based on transcendental function acceleration instruction Pending CN114968368A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210647106.5A CN114968368A (en) 2022-06-09 2022-06-09 Sine and cosine function implementation method and system based on transcendental function acceleration instruction


Publications (1)

Publication Number Publication Date
CN114968368A true CN114968368A (en) 2022-08-30

Family

ID=82961294


Country Status (1)

Country Link
CN (1) CN114968368A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination