WO2017185390A1 - Apparatus and method for performing vector transcendental function operations - Google Patents

Apparatus and method for performing vector transcendental function operations

Info

Publication number
WO2017185390A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
operation instruction
instruction
transcendental function
data
Prior art date
Application number
PCT/CN2016/081071
Other languages
English (en)
French (fr)
Inventor
韩栋
张潇
陈天石
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to EP16899901.9A (patent EP3451153B1)
Publication of WO2017185390A1
Priority to US16/171,295 (publication US20190065191A1)
Priority to US16/247,237 (publication US20190146793A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5446 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation using crossaddition algorithms, e.g. CORDIC
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Definitions

  • the present invention relates to the field of computer processing technologies, and in particular, to an apparatus and method for performing vector transcendental function operations.
  • the device can perform various transcendental function operations on a set of vector data according to the instruction, and can obtain a highly accurate transcendental function calculation result with high efficiency.
  • the apparatus and method of the present invention have significant advantages over traditional methods in performing transcendental function operations on vectors.
  • Transcendental functions include, but are not limited to, exponential, logarithmic, and trigonometric functions. Operations of this type differ from the four basic arithmetic operations: they cannot be written as finite polynomials, and the relationship between the variables cannot be reduced to a finite number of additions, subtractions, multiplications, divisions, powers, and root extractions. Their computational difficulty and cost are far greater than those of ordinary addition, subtraction, multiplication, and division.
  • in the prior art, one common way to implement vector transcendental function calculations is to use a general-purpose processor.
  • the method performs a vector operation by executing general-purpose instructions using a general-purpose register file and general-purpose functional units.
  • because the general-purpose processor has no arithmetic component dedicated to calculating transcendental functions, the result of the transcendental function is approximated by a high-order polynomial in the form of a Taylor expansion, and multiple instructions must be executed to complete the entire operation.
  • the general-purpose processor is oriented to scalar operations and must process the elements one by one when performing a transcendental function operation on vector data, which further reduces computational efficiency.
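The cost of the general-purpose approach described above can be made concrete with a small sketch. It assumes the exponential function and a degree-12 Taylor polynomial; the function names and the chosen degree are illustrative and are not taken from the patent.

```python
import math

def exp_taylor(x: float, order: int = 12) -> float:
    """Approximate e**x with a degree-`order` Taylor polynomial (Horner form)."""
    acc = 1.0
    # Horner evaluation of 1 + x/1 * (1 + x/2 * (... (1 + x/order)))
    for k in range(order, 0, -1):
        acc = 1.0 + x * acc / k
    return acc

def exp_vector_scalar_loop(xs):
    """Process the vector one element at a time, as a scalar core would."""
    return [exp_taylor(x) for x in xs]

values = exp_vector_scalar_loop([0.0, 0.5, 1.0])
```

Each element costs one multiply, one divide, and one add per polynomial term, and the whole sequence is repeated for every element of the vector, which is the inefficiency the text points to.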
  • in another prior art, a graphics processor (GPU) is used to perform transcendental function operations on vector data, wherein the operation is performed by executing general-purpose SIMD instructions using a general-purpose register file and general-purpose stream processing units.
  • the GPU likewise relies on Taylor expansion, using high-order polynomials to calculate high-precision results.
  • the GPU on-chip cache is too small, so data must be moved on and off chip continuously when performing large-scale transcendental function operations, and the off-chip bandwidth becomes the main performance bottleneck.
  • vector transcendental function calculations are performed using specially customized computing devices, where operations are performed using custom register files and custom processing units.
  • the existing dedicated transcendental function operation devices are limited by the design of the register file and cannot flexibly support vector operations of different lengths.
  • the present invention is directed to an apparatus and method for solving a vector transcendental function operation task, which is capable of quickly and flexibly executing various transcendental function calculations for vector data of different lengths according to instructions, that is, for vector X, for each of them
  • a vector transcendental function computing device comprising:
  • a storage unit configured to store vector data related to the vector operation instruction
  • a register unit for storing scalar data related to the vector operation instruction
  • a control unit for decoding a vector operation instruction and controlling an operation process of the vector operation instruction
  • a transcendental function calculation unit for performing a transcendental function calculation on a vector operation instruction
  • the transcendental function calculation unit includes a preprocessing part and an iterative calculation part, wherein the preprocessing part preprocesses the input vector data so that it falls within a range that the CORDIC algorithm can process, and the iterative calculation part iteratively operates on the preprocessed input vector data using the CORDIC algorithm to obtain result vector data.
  • the transcendental function computing unit is implemented in hardware.
  • the storage unit is a scratch pad memory.
  • the scalar data stored by the register unit includes an input vector data start address, an output vector data storage address, and an input vector data length associated with the vector operation instruction, wherein the input vector data start address and the output vector data storage address are addresses in the storage unit.
  • the transcendental function calculation unit further includes a post-processing part for post-processing the result vector data output by the iterative calculation part.
  • control unit includes:
  • the instruction queue module is configured to sequentially store the decoded vector operation instructions and obtain scalar data related to the vector operation instructions.
  • control unit includes:
  • the dependency processing unit is configured to determine whether the current vector operation instruction has a dependency relationship with the previously unexecuted operation instruction before the transcendental function calculation unit acquires the current vector operation instruction.
  • control unit includes:
  • the storage queue module is configured to temporarily store the current vector operation instruction when it has a dependency relationship with a previously unexecuted operation instruction, and to send the temporarily stored vector operation instruction to the transcendental function calculation unit when the dependency relationship is eliminated.
  • the device further includes:
  • An instruction cache unit configured to store a vector operation instruction to be executed
  • the input/output unit is configured to store the vector data related to the vector operation instruction in the storage unit, or obtain the operation result of the vector operation instruction from the storage unit.
  • the vector operation instruction includes an operation code and an operation field
  • the operation code is used to indicate which transcendental function to perform
  • the operation field includes an immediate value and/or a register number indicating scalar data associated with the vector operation, wherein the register number points to an address in the register unit.
  • a vector transcendental function computing device comprising:
  • the fetch module is configured to take out the next vector operation instruction to be executed from the instruction sequence, and transmit the vector operation instruction to the decoding module;
  • a decoding module configured to decode the vector operation instruction, and transmit the decoded vector operation instruction to the instruction queue module;
  • the instruction queue module is configured to temporarily store the decoded vector operation instruction and obtain scalar data related to the vector operation instruction from the vector operation instruction or a scalar register; after obtaining the scalar data, it sends the vector operation instruction to the dependency processing unit
  • a scalar register file including a plurality of scalar registers for storing scalar data associated with vector operation instructions
  • a dependency processing unit configured to determine whether there is a dependency between the vector operation instruction and a previously unexecuted operation instruction; if there is a dependency, the vector operation instruction is sent to the storage queue module; if there is no dependency, the vector operation instruction is sent to the transcendental function calculation unit;
  • a storage queue module configured to store a vector operation instruction having a dependency relationship with the previous operation instruction, and sending the vector operation instruction to the transcendental function calculation unit after the dependency relationship is released;
  • a transcendental function calculating unit configured to perform a transcendental function calculation on the input vector data according to the received vector operation instruction
  • a scratchpad memory for storing input vector data and output vector data
  • An input/output access module for directly accessing the scratchpad memory, responsible for reading input vector data from the scratchpad memory and writing output vector data to it.
  • the transcendental function calculation unit includes:
  • a preprocessing module for preprocessing the input vector data, converting it into a range that CORDIC can handle
  • An iterative calculation module configured to perform CORDIC calculation on the preprocessed input vector data to obtain a transcendental function operation result
  • the post-processing module is configured to post-process the operation result to obtain output vector data.
  • the transcendental function computing unit is implemented by hardware.
  • a vector transcendental function operation method comprising:
  • the instruction fetch module takes out the next vector operation instruction to be executed from the instruction sequence, and transmits the vector operation instruction to the decoding module;
  • the decoding module decodes the vector operation instruction, and transmits the decoded vector operation instruction to the instruction queue module;
  • the instruction queue module temporarily stores the decoded vector operation instruction, and obtains scalar data related to the vector instruction operation from the vector operation instruction or the scalar register; after obtaining the scalar data, the vector operation instruction is sent to the dependency processing unit;
  • the dependency processing unit determines whether there is a dependency between the vector operation instruction and a previously unexecuted operation instruction; if there is a dependency, the vector operation instruction is sent to the storage queue module; if there is no dependency, the vector operation instruction is sent to the transcendental function calculation unit;
  • the storage queue module stores a vector operation instruction having a dependency relationship with the previous operation instruction, and after the dependency relationship is released, the vector operation instruction is sent to the transcendental function calculation unit;
  • the transcendental function calculation unit, according to the received vector operation instruction, reads the input vector data from the scratchpad memory through the input/output access module, performs the transcendental function operation on the input vector data, and writes the operation result to the scratchpad memory through the input/output access module.
  • the transcendental function calculation unit preprocesses the input vector data, converting it into a range that CORDIC can process; then performs the CORDIC calculation on the preprocessed input vector data to obtain the transcendental function operation result; and finally post-processes the operation result to obtain the output vector data.
  • the vector transcendental function computing device implements streamlined transcendental function operation instructions in hardware, so that a complete vector transcendental function operation can be carried out by a single instruction.
  • the vector operation process can more flexibly and efficiently support data of different widths, and the transcendental operation unit is realized by hardware, which can be more efficient.
  • the present invention can be applied to the following scenarios (including but not limited to): data processing, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, camcorders, projectors, watches, earphones, mobile storage, wearable devices and other electronic products; aircraft, ships, vehicles and other types of transportation; televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, range hoods and other household appliances; and medical equipment including nuclear magnetic resonance instruments, B-mode ultrasound scanners, and electrocardiographs.
  • FIG. 1 is a schematic structural diagram of a vector transcendental function computing device provided by the present invention.
  • FIG. 2 is a schematic diagram showing the format of a vector operation instruction set provided by the present invention.
  • FIG. 3 is a schematic structural diagram of a vector transcendental function computing apparatus according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a device performing a transcendental function operation according to an embodiment of the present invention.
  • the invention provides a vector transcendental function computing device, comprising a storage unit, a register unit, a control unit and a transcendental function computing unit.
  • the storage unit stores a vector
  • the register unit stores a vector storage address and other scalar parameters
  • the control unit performs the decoding operation and controls each module according to the instruction
  • the transcendental function calculation unit obtains the vector address, the length, and other parameters from the instruction or the register unit according to the operation instruction, then reads the corresponding vector data from the storage unit according to the address and the length, and then performs the transcendental function operation on the vector data to obtain the operation result.
  • the invention temporarily stores the vector data participating in the calculation in the scratchpad memory, so that vector data of different widths can be supported more flexibly and effectively during the operation, and the performance of algorithms containing a large number of vector transcendental function calculation tasks is improved.
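The data flow just described (scalar parameters from registers, vector data from the storage unit, result written back) can be sketched in software. This is only an illustration under stated assumptions: `VectorOp`, the register layout, and the `SIN` and `EXP` opcodes are hypothetical, and Python's `math` functions stand in for the hardware calculation unit.

```python
import math
from dataclasses import dataclass

@dataclass
class VectorOp:
    opcode: str   # which transcendental function to apply, e.g. "COS"
    src_reg: int  # register holding the input vector start address
    len_reg: int  # register holding the input vector length
    dst_reg: int  # register holding the output vector storage address

# Assumption: math.* replaces the CORDIC-based hardware unit in this sketch.
FUNCS = {"COS": math.cos, "SIN": math.sin, "EXP": math.exp}

def execute(op, registers, scratchpad):
    """Read scalars from registers, the vector from the scratchpad, write back."""
    start = registers[op.src_reg]
    length = registers[op.len_reg]
    dst = registers[op.dst_reg]
    data = scratchpad[start:start + length]
    scratchpad[dst:dst + length] = [FUNCS[op.opcode](x) for x in data]

scratchpad = [0.0] * 16
scratchpad[0:3] = [0.0, math.pi / 2, math.pi]
registers = [0, 3, 8]  # start address 0, length 3, output address 8
execute(VectorOp("COS", 0, 1, 2), registers, scratchpad)
```

Because the length is an ordinary scalar parameter rather than a property of a fixed-width register, the same instruction handles vectors of any length that fits in the scratchpad.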
  • FIG. 1 is a schematic structural diagram of a vector transcendental function computing device provided by the present invention. As shown in FIG. 1, the device includes:
  • the storage unit may be a scratch pad memory (Scratchpad Memory) capable of supporting vector data of different sizes.
  • the present invention temporarily stores necessary calculation data in the scratchpad memory.
  • the device can more flexibly and efficiently support data of different widths in the process of performing the transcendental function operation.
  • the scratchpad memory can be implemented by a variety of different memory devices such as SRAM, DRAM, eDRAM, memristor, 3D-DRAM, and nonvolatile memory.
  • a register unit for storing scalar data related to the vector operation instruction, including the start address and length of the input vector data and the storage address of the output vector data; it may also be used for storing scalar data used in other operations, wherein the start address of the input vector data and the storage address of the output vector data are addresses in the storage unit;
  • the control unit decodes the vector operation instruction and controls the operation process of the vector operation instruction; in an embodiment, the control unit reads the prepared vector operation instruction, decodes it to generate control signals, and sends the control signals to the other units in the device, and the other units perform corresponding operations according to the control signals they receive.
  • a transcendental function calculation unit that implements a specified transcendental function calculation of specified vector data in accordance with control of the control unit.
  • This unit is a vector operation unit, and performs the same operation on all input vector data, that is, performs the same transcendental function operation on each element in the vector. It should be noted that this unit is a custom transcendental function calculation unit that implements the transcendental function calculation using a different method than the traditional Taylor expansion.
  • the hardware circuit of the customized transcendental function calculation unit uses the Coordinate Rotation Digital Computer (CORDIC) algorithm, which requires the input vector data to be pre-processed.
  • the pre-processing and post-processing of the transcendental function calculation unit are both implemented in hardware, thus providing a more complete hardware operation module, which further improves the speed of the entire operation process.
  • the transcendental function calculation unit requires three stages of operations, including preprocessing, CORDIC calculation, and post processing.
  • the first is the pre-processing module.
  • although the CORDIC method can calculate various transcendental function values very efficiently, it is only applicable to a limited input range. Therefore, in the present invention, the input data is converted by a hardware circuit into a range that CORDIC can handle.
  • the CORDIC calculation circuit then operates on the preprocessed data and outputs the calculation result to the post-processing circuit for processing and output.
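The three stages can be illustrated for sine and cosine with a minimal software sketch. It assumes circular CORDIC in rotation mode with 32 iterations and a quadrant-folding preprocessing step; these choices and all names are illustrative, since the patent does not specify them.

```python
import math

N_ITER = 32
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
GAIN = 1.0
for i in range(N_ITER):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))  # compensates CORDIC growth

def preprocess(theta):
    """Fold any angle into [-pi/2, pi/2], remembering the sign change."""
    t = math.remainder(theta, 2.0 * math.pi)  # now in [-pi, pi]
    sign = 1.0
    if t > math.pi / 2:
        t, sign = t - math.pi, -1.0
    elif t < -math.pi / 2:
        t, sign = t + math.pi, -1.0
    return t, sign

def cordic_rotate(t):
    """Rotation-mode CORDIC: returns (cos t, sin t) for |t| <= pi/2."""
    x, y, z = GAIN, 0.0, t
    for i in range(N_ITER):
        d = 1.0 if z >= 0 else -1.0
        # In hardware the multiplications by 2**-i are plain shifts.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return x, y

def postprocess(c, s, sign):
    """Undo the folding: cos(t + pi) = -cos t and sin(t + pi) = -sin t."""
    return sign * c, sign * s

def vector_cos_sin(xs):
    out = []
    for theta in xs:
        t, sign = preprocess(theta)
        out.append(postprocess(*cordic_rotate(t), sign))
    return out

results = vector_cos_sin([0.0, 1.0, 2.5, -4.0, 10.0])
```

The iteration itself uses only additions, sign tests, and shifts, which is why it maps well to a dedicated circuit; the sign flip in `postprocess` is an instance of the simple post-processing the text mentions.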
  • the transcendental function computing unit includes a preprocessing module, an iterative computing module, and a post processing module.
  • the preprocessing module converts the input vector data into a reasonable computable range, the iterative computing module uses the CORDIC algorithm to calculate the transcendental function value of the transformed data, and the post-processing module post-processes the obtained transcendental function value, wherein the pre-processing module, the iterative calculation module, and the post-processing module are all implemented by hardware.
  • the transcendental function computing unit is implemented by the following hardware circuits (including but not limited to): FPGA, CGRA, application specific integrated circuit ASIC, analog circuit, memristor, and the like.
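For a function whose input range is unbounded, preprocessing and post-processing do more work. The sketch below computes the exponential function under stated assumptions: range reduction x = k·ln 2 + r, hyperbolic CORDIC in rotation mode with the usual repeated iterations at shift indices 4 and 13, and a final scaling by 2^k. None of these specifics come from the patent; they are one common way to realize the preprocess, iterate, post-process structure.

```python
import math

def _schedule(n):
    """Shift indices 1..n, repeating 4 and 13 as hyperbolic CORDIC requires."""
    idx = []
    for i in range(1, n + 1):
        idx.append(i)
        if i in (4, 13):
            idx.append(i)
    return idx

SHIFTS = _schedule(30)
ATANH = [math.atanh(2.0 ** -i) for i in SHIFTS]
GAIN_H = 1.0
for i in SHIFTS:
    GAIN_H *= math.sqrt(1.0 - 2.0 ** (-2 * i))  # hyperbolic CORDIC gain
LN2 = math.log(2.0)

def exp_cordic(x):
    # Preprocessing: write x = k*ln2 + r with |r| <= ln2/2.
    k = round(x / LN2)
    r = x - k * LN2
    # Iteration: rotation mode drives z to 0 and yields cosh(r), sinh(r).
    c, s, z = 1.0 / GAIN_H, 0.0, r
    for i, a in zip(SHIFTS, ATANH):
        d = 1.0 if z >= 0 else -1.0
        c, s = c + d * s * 2.0 ** -i, s + d * c * 2.0 ** -i
        z -= d * a
    # Post-processing: e**r = cosh(r) + sinh(r), then scale by 2**k.
    return math.ldexp(c + s, k)

def exp_vector(xs):
    """Apply the same operation to every element of the vector."""
    return [exp_cordic(x) for x in xs]

exp_results = exp_vector([-3.0, 0.0, 0.5, 10.0])
```

Here the post-processing consists of one addition and an exponent adjustment, both cheap in hardware.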
  • the vector transcendental function computing device further includes an instruction cache unit for storing vector operation instructions to be executed. While an instruction is being executed, it remains cached in the instruction cache unit; once the instruction has finished executing, it is committed.
  • the control unit of the vector transcendental function computing device further includes: an instruction queue module for sequentially storing the decoded vector operation instructions and, after obtaining the scalar data required by a vector operation instruction, sending the instruction to the dependency processing unit.
  • the control unit of the vector transcendental function computing device further includes: a dependency processing unit, configured to determine, before the transcendental function calculation unit acquires the instruction, whether the operation instruction has a dependency relationship with a previously unexecuted operation instruction, that is, whether they access the same vector storage address; if so, the vector operation instruction is stored in the storage queue module, and after the execution of the previous vector operation instruction is completed, the vector operation instruction in the storage queue module is provided to the transcendental function calculation unit; otherwise, the vector operation instruction is provided directly to the transcendental function calculation unit.
  • consecutive instructions may access the same block of storage space. To ensure the correctness of the instruction execution result, if the current instruction is detected to have a dependency on the data of a previous instruction, the instruction must wait in the storage queue until the dependency is eliminated.
  • the control unit of the vector transcendental function computing device further includes: a storage queue module, which includes an ordered queue; an instruction that has a dependency on the data of a previous instruction is stored in the ordered queue until the dependency is eliminated, after which the module provides the operation instruction to the transcendental function calculation unit.
  • the vector transcendental function computing device further includes: an input and output unit for storing the vector operation data in the storage unit, or acquiring the vector operation result from the storage unit.
  • the input/output unit can directly access the storage unit and is responsible for reading vector data from the memory or writing vector data to it.
  • the instruction design of the present invention employs a streamlined manner in which a single instruction can perform a complete vector transcendental function calculation.
  • the vector transcendental function computing device fetches the instruction and decodes it, then sends it to the instruction queue for storage, and obtains each parameter of the instruction according to the decoding result; these parameters may be written directly in the operation field of the instruction, or may be read from the specified register according to the register number in the operation field.
  • the dependency processing unit determines whether the data actually needed by the instruction has a dependency relationship with the data of a previous operation instruction, which determines whether the instruction can be sent to the transcendental function calculation unit for execution immediately. If a dependency on the data of a previous operation instruction is found, the instruction must wait until the instruction it depends on has finished executing before it can be sent to the calculation unit. In the custom transcendental function calculation unit, the instruction is executed quickly, and the result, that is, the computed result vector, is written back to the address provided by the instruction, which completes the execution of the instruction.
  • the transcendental operation instruction includes an operation code and at least one operation field, wherein the operation code is used to indicate which transcendental function calculation is performed.
  • the operation field is used to indicate the data information of the operation instruction, wherein the data information includes an immediate value and/or a register number. For example, when a vector is to be acquired, the vector start address and the vector length can be obtained from the corresponding register according to the register number. Then, the vector stored at the corresponding address is obtained from the storage unit according to the vector start address and the vector length.
  • COS cosine operation instruction
  • COT cotangent operation instruction
  • General purpose CPUs do not provide this type of machine instructions. They are usually implemented by higher level library functions. Each library function contains a plurality of machine instructions. The present invention implements the above vector transcendental function instructions through a hardware structure.
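The instruction format described above, an operation code plus operation fields that each hold either an immediate value or a register number, can be sketched as follows. The class names and the field order (input address, input length, output address) are hypothetical; the patent text here does not fix an encoding.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Imm:
    value: int   # scalar written directly in the operation field

@dataclass(frozen=True)
class Reg:
    number: int  # index into the scalar register unit

Operand = Union[Imm, Reg]

@dataclass(frozen=True)
class Instruction:
    opcode: str    # e.g. "COS" or "COT"
    fields: tuple  # assumed order: input address, input length, output address

def resolve(operand: Operand, scalar_regs):
    """Return the scalar an operation field denotes."""
    if isinstance(operand, Imm):
        return operand.value
    return scalar_regs[operand.number]

def decode_operands(inst: Instruction, scalar_regs):
    return [resolve(f, scalar_regs) for f in inst.fields]

scalar_regs = [0] * 8
scalar_regs[3] = 128
inst = Instruction("COS", (Reg(3), Imm(64), Imm(512)))
operands = decode_operands(inst, scalar_regs)  # [128, 64, 512]
```

Allowing either form per field lets a program hard-code fixed sizes while still computing addresses at run time.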
  • FIG. 3 is a schematic structural diagram of a vector transcendental function computing apparatus according to an embodiment of the present invention.
  • the apparatus includes an instruction fetch module, a decoding module, an instruction queue module, a scalar register file, a dependency processing unit, a storage queue module, a transcendental function calculation unit, a scratchpad memory, and an IO memory access module.
  • an instruction fetch module, which is responsible for fetching the next instruction to be executed from the instruction sequence and passing the instruction to the decoding module;
  • a decoding module, which is responsible for decoding the instruction and transmitting the decoded instruction to the instruction queue module;
  • the instruction queue module is used to temporarily store the instruction obtained from the decoding module, and obtain the corresponding data of the instruction operation from the instruction or the scalar register, including the starting address and size of the vector data and some scalar constants. After obtaining the data, the instruction is sent to the dependency processing unit;
  • a dependency processing unit, which handles the storage dependencies that an instruction may have with previous instructions.
  • a vector operation instruction accesses the scratchpad memory to obtain its operand vector, and consecutive instructions may access the same block of memory.
  • if such a dependency is detected, the instruction is sent to the storage queue module until the dependency is eliminated. That is, the unit checks whether the storage range of the input data of this instruction overlaps with the storage range of the output data of a previous instruction that has not yet finished executing, where a storage range is determined by its start address and data length. If there is an overlap, this instruction actually needs the execution result of the previous instruction as input, so it must wait until that instruction has finished executing before it can start. During this time, the instruction is held in the storage queue module.
  • a storage queue module, which includes an ordered queue; instructions that have a data dependency on previous instructions are stored in the ordered queue until the dependency is eliminated, and instructions whose dependencies have been eliminated are sent to the transcendental function calculation unit;
  • a transcendental function calculation unit, which is responsible for transcendental function calculation operations, including but not limited to exponential operations, logarithmic operations, trigonometric functions, and inverse trigonometric functions.
  • essentially all common transcendental functions can be expressed through exponential, logarithmic, trigonometric, and inverse trigonometric operations and combinations of them using the four arithmetic operations.
  • the calculation of the transcendental function is realized by the CORDIC method, which is an iterative calculation method.
  • CORDIC method is an iterative calculation method.
  • the transcendental function calculation module also includes pre-processing and post-processing parts.
  • the pre-processing is to convert the input data into a reasonable calculation interval, that is, the input data is converted into the calculation range that the CORDIC algorithm can handle, and the post-processing is based on the transcendental function.
  • the difference in itself is post-processing the CORDIC calculation results, such as the transformation of the symbols of the output data and some four arithmetic operations.
  • the pre-processing and post-processing in the prior art are usually completed by software, and all of the devices are implemented by hardware circuits;
  • a high-speed temporary storage memory which is a temporary storage device dedicated to vector data, capable of supporting vector data of different sizes; it is used for storing vector data and operation results to be operated;
  • IO memory access module which is used to directly access the scratchpad memory and is responsible for reading data or writing data from the scratchpad memory.
  • FIG. 4 is a flowchart of the vector transcendental function operation device executing a vector transcendental function operation instruction according to an embodiment of the present invention. As shown in FIG. 4, the process includes:
  • S1: the fetch module fetches the vector transcendental function operation instruction and sends it to the decoding module.
  • S2: the decoding module decodes the instruction and sends the decoded instruction to the instruction queue module.
  • S3: in the instruction queue module, the required scalar data, that is, the data corresponding to the instruction's operand fields, is obtained from the instruction's immediate values or from the registers. It includes the input vector address, the input vector length, the output vector address, and any constants needed for the transcendental function operation.
  • S4: once the required scalar data has been obtained, the instruction queue module sends the instruction to the dependency processing unit.
  • S5: the dependency processing unit checks whether the instruction has a data dependency on any earlier instruction that has not finished executing. If it does, the instruction is sent to the storage queue module, where it waits until the dependency no longer exists, and the storage queue module then sends it to the transcendental function calculation unit. If there is no dependency, the instruction is sent directly to the transcendental function calculation unit.
  • S6: using the vector storage address and length, the transcendental function calculation unit fetches part of the required vector data from the scratchpad memory through the input/output unit.
  • S7: the pre-processing module in the transcendental function calculation unit transforms the input data into the convergence domain of the Coordinate Rotation Digital Computer (CORDIC) algorithm.
  • S8: the iterative calculation module in the transcendental function calculation unit computes, in parallel, the transcendental function values of the fetched part of the vector data.
  • S9: return to step S6; the transcendental function calculation unit fetches the next part of the vector data and repeats the calculation until all input vector data has been processed.
  • S10: after the operation is complete, the result vector is written back through the input/output unit to the output vector address in the scratchpad memory.
  • In summary, the present invention provides a vector transcendental function calculation device which, together with the corresponding instructions, addresses the growing number of vector transcendental function computing tasks in current computing, including artificial neural network algorithms, which currently perform very well.
  • Compared with existing conventional solutions, the present invention offers streamlined instructions, ease of use, flexible support for vector lengths, and ample on-chip buffering.
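The fetch-compute-repeat loop of steps S6 to S10 can be sketched in software as follows. This is only an illustrative model under stated assumptions: the scratchpad is a plain Python list, `run_vector_op` and the `lanes` parameter (the number of elements handled per pass) are hypothetical names, and `func` stands in for the hardware pre-processing and CORDIC stages.

```python
import math

def run_vector_op(scratchpad, src, length, dst, func, lanes=4):
    """Process `length` elements starting at `src` in chunks of `lanes`.

    Returns the number of chunks processed. `lanes` is an assumed
    parameter; the source does not state how wide each part is.
    """
    results = []
    chunks = 0
    for offset in range(0, length, lanes):
        n = min(lanes, length - offset)
        # S6: fetch the next part of the input vector from the scratchpad.
        part = scratchpad[src + offset : src + offset + n]
        # S7-S8: pre-process and evaluate the function on every element
        # (func is a stand-in for the hardware stages).
        results.extend(func(v) for v in part)
        chunks += 1  # S9: loop back for the next part
    # S10: write the whole result vector back to the output address.
    scratchpad[dst : dst + length] = results
    return chunks
```

Because the loop is driven only by the start address and length, the same routine handles vectors of any length, which is the flexibility the text attributes to the scratchpad-based design.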

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)
  • Advance Control (AREA)

Abstract

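An apparatus and method for executing vector transcendental function operations. The apparatus comprises: a storage unit that stores vector data related to a vector operation instruction; a register unit that stores scalar data related to the vector operation instruction; a control unit that decodes the vector operation instruction and controls its execution; and a transcendental function calculation unit that performs the transcendental function calculation specified by the vector operation instruction. The transcendental function calculation unit includes a pre-processing part and an iterative calculation part: the pre-processing part pre-processes the input vector data so that it lies within the range the CORDIC algorithm can handle, and the iterative calculation part applies the CORDIC algorithm iteratively to the pre-processed input vector data to obtain the result vector data. The apparatus implements transcendental function operation instructions in hardware as streamlined instructions, so that a single instruction carries out one complete vector transcendental function operation.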

Description

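Apparatus and method for executing vector transcendental function operations
Technical Field
The present invention relates to the field of computer processing technology, and in particular to an apparatus and method for executing vector transcendental function operations. According to an instruction, the apparatus can perform various transcendental function operations on a set of vector data and obtain high-precision results with high efficiency. For transcendental function operations on vectors, the apparatus and method of the present invention have significant advantages over conventional methods.
Background
Transcendental functions include, but are not limited to, exponential, logarithmic, and trigonometric functions. Unlike the four basic arithmetic operations, such functions are not finite polynomials, and the relationship between their variables cannot be expressed by a finite number of additions, subtractions, multiplications, divisions, powers, and roots. They are far harder and more costly to compute than addition, subtraction, multiplication, and division. Yet current computing has many demands for applying transcendental functions to entire vectors of data; many machine learning algorithms, for example, require exponential and logarithmic operations on large amounts of data. An apparatus and method that can efficiently compute various transcendental functions on vector data is therefore needed.
In the prior art, the most common way to compute vector transcendental functions is to use a general-purpose processor, which executes general-purpose instructions through a general-purpose register file and general-purpose functional units to perform vector operations. However, because a general-purpose processor has no functional unit dedicated to transcendental functions, it must approximate the transcendental function result with a high-order polynomial in the form of a Taylor expansion, and multiple instructions are needed to complete the whole operation. Moreover, a general-purpose processor is oriented toward scalar operations, so a transcendental function on vector data has to be evaluated element by element, which further reduces efficiency.
In another prior-art approach, a graphics processing unit (GPU) performs transcendental function operations on vector data by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units. Although this solves the serial-computation problem of general-purpose processors, it still relies on a Taylor expansion with high-order polynomials to obtain high-precision results. In addition, the GPU's on-chip cache is too small, so large-scale transcendental function operations require continual off-chip data transfers, and off-chip bandwidth becomes the main performance bottleneck.
In yet another prior-art approach, a specially customized computing device performs vector transcendental function calculations using a custom register file and custom processing units. However, existing dedicated transcendental function devices of this kind are limited by their register file design and cannot flexibly support vector operations of different lengths.
In summary, none of the existing general-purpose processors, graphics processors, or other custom computing devices can perform vector transcendental function operations flexibly and efficiently. When handling vector transcendental function operations, these prior-art solutions suffer from large code size, low speed, low efficiency, insufficient on-chip cache, and inflexible support for vector sizes.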
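Summary of the Invention
The present invention aims to provide an apparatus and method for vector transcendental function operation tasks that can, according to instructions, quickly and flexibly compute various transcendental functions on vector data of different lengths. That is, for a vector X, the corresponding transcendental function value y_i = f(x_i) is computed quickly for each element x_i, where f may be any of various transcendental functions, including but not limited to exponential, logarithmic, trigonometric, and inverse trigonometric functions.
According to one aspect of the present invention, a vector transcendental function operation apparatus is provided, comprising:
a storage unit for storing vector data related to a vector operation instruction;
a register unit for storing scalar data related to the vector operation instruction;
a control unit for decoding the vector operation instruction and controlling its execution;
a transcendental function calculation unit for performing the transcendental function calculation specified by the vector operation instruction;
wherein the transcendental function calculation unit includes a pre-processing part and an iterative calculation part; the pre-processing part pre-processes the input vector data so that it lies within the range the CORDIC algorithm can handle, and the iterative calculation part applies the CORDIC algorithm iteratively to the pre-processed input vector data to obtain the result vector data.
Optionally, the transcendental function calculation unit is implemented in hardware.
Optionally, the storage unit is a scratchpad memory.
Optionally, the scalar data stored in the register unit includes the input vector data start address, the output vector data storage address, and the input vector data length related to the vector operation instruction; the input vector data start address and the output vector data storage address are addresses in the storage unit.
Optionally, the transcendental function calculation unit further includes a post-processing part for post-processing the result vector data output by the iterative calculation part.
Optionally, the control unit includes:
an instruction queue module for storing decoded vector operation instructions in order and obtaining the scalar data related to each vector operation instruction.
Optionally, the control unit includes:
a dependency processing unit for determining, before the transcendental function calculation unit receives the current vector operation instruction, whether the current vector operation instruction depends on an earlier operation instruction that has not finished executing.
Optionally, the control unit includes:
a storage queue module for temporarily storing the current vector operation instruction when it depends on an earlier operation instruction that has not finished executing, and for sending the stored vector operation instruction to the transcendental function calculation unit once the dependency is eliminated.
Optionally, the apparatus further includes:
an instruction cache unit for storing vector operation instructions to be executed;
an input/output unit for storing vector data related to a vector operation instruction in the storage unit, or for retrieving the result of a vector operation instruction from the storage unit.
Optionally, the vector operation instruction includes an opcode and operand fields;
the opcode indicates which transcendental function to execute;
the operand fields include immediate values and/or register numbers and indicate the scalar data related to the vector operation, where a register number points to an address in the register unit.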
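According to a second aspect of the present invention, a vector transcendental function operation apparatus is provided, comprising:
a fetch module for fetching the next vector operation instruction to be executed from the instruction sequence and passing it to the decoding module;
a decoding module for decoding the vector operation instruction and passing the decoded instruction to the instruction queue module;
an instruction queue module for temporarily storing the decoded vector operation instruction and obtaining the scalar data related to it from the instruction itself or from the scalar registers; once the scalar data has been obtained, the vector operation instruction is sent to the dependency processing unit;
a scalar register file, comprising multiple scalar registers, for storing scalar data related to vector operation instructions;
a dependency processing unit for determining whether the vector operation instruction depends on an earlier operation instruction that has not finished executing; if it does, the vector operation instruction is sent to the storage queue module, and if not, it is sent to the transcendental function calculation unit;
a storage queue module for storing vector operation instructions that depend on earlier operation instructions and, once the dependency is resolved, sending them to the transcendental function calculation unit;
a transcendental function calculation unit for performing the transcendental function calculation on the input vector data according to the received vector operation instruction;
a scratchpad memory for storing input vector data and output vector data;
an input/output access module for directly accessing the scratchpad memory, responsible for reading input vector data from it and writing output vector data to it.
Optionally, the transcendental function calculation unit includes:
a pre-processing module for pre-processing the input vector data, converting it into the range that CORDIC can handle;
an iterative calculation module for performing the CORDIC calculation on the pre-processed input vector data to obtain the transcendental function result;
a post-processing module for post-processing that result to obtain the output vector data.
Optionally, the transcendental function calculation unit is implemented in hardware.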
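According to another aspect of the present invention, a vector transcendental function operation method is provided, comprising:
the fetch module fetches the next vector operation instruction to be executed from the instruction sequence and passes it to the decoding module;
the decoding module decodes the vector operation instruction and passes the decoded instruction to the instruction queue module;
the instruction queue module temporarily stores the decoded vector operation instruction and obtains the scalar data related to it from the instruction itself or from the scalar registers; once the scalar data has been obtained, the vector operation instruction is sent to the dependency processing unit;
the dependency processing unit determines whether the vector operation instruction depends on an earlier operation instruction that has not finished executing; if it does, the vector operation instruction is sent to the storage queue module, and if not, it is sent to the transcendental function calculation unit;
the storage queue module stores vector operation instructions that depend on earlier operation instructions and, once the dependency is resolved, sends them to the transcendental function calculation unit;
the transcendental function calculation unit, according to the received vector operation instruction, fetches the input vector data from the scratchpad memory through the input/output access module, performs the transcendental function operation on it, and writes the result to the scratchpad memory through the input/output access module.
Optionally, the transcendental function calculation unit pre-processes the input vector data, converting it into the range that CORDIC can handle; it then performs the CORDIC calculation on the pre-processed input vector data to obtain the transcendental function result; finally it post-processes that result to obtain the output vector data.
The vector transcendental function operation apparatus provided by the present invention implements transcendental function operation instructions in hardware as streamlined instructions, so that a single instruction carries out one complete vector transcendental function operation. By temporarily storing the vector data involved in the calculation in a scratchpad memory, the invention supports data of different widths more flexibly and effectively during vector operations. Because the transcendental function operation unit is implemented in hardware, the various transcendental function operations are performed more efficiently, improving the execution performance of algorithms that contain many transcendental function calculations.
The present invention can be applied in the following scenarios (including but not limited to): data processing; electronic products such as robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, and wearable devices; vehicles such as aircraft, ships, and automobiles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and medical equipment such as nuclear magnetic resonance instruments, B-mode ultrasound scanners, and electrocardiographs.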
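Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the vector transcendental function operation apparatus provided by the present invention.
FIG. 2 is a schematic diagram of the format of the vector operation instruction set provided by the present invention.
FIG. 3 is a schematic structural diagram of the vector transcendental function operation apparatus provided by an embodiment of the present invention.
FIG. 4 is a flowchart of the apparatus executing a transcendental function operation according to an embodiment of the present invention.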
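Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
The present invention provides a vector transcendental function calculation apparatus comprising a storage unit, a register unit, a control unit, and a transcendental function calculation unit. The storage unit stores vectors; the register unit stores vector storage addresses and other scalar parameters; the control unit performs decoding and controls each module according to the instruction. According to the operation instruction, the transcendental function calculation unit obtains the vector address, length, and other parameters from the instruction or from the register unit, fetches the corresponding vector data from the storage unit using that address and length, and then performs the transcendental function operation on the vector data to obtain the result. The invention temporarily stores the vector data involved in the calculation in a scratchpad memory, so that vector data of different widths can be supported more flexibly and effectively during the operation, improving the execution performance of algorithms that contain many vector transcendental function calculations.
FIG. 1 is a schematic structural diagram of the vector transcendental function calculation apparatus provided by the present invention. As shown in FIG. 1, the apparatus includes:
a storage unit for storing vector data. In one embodiment, the storage unit may be a scratchpad memory able to support vector data of different sizes; the invention temporarily stores the necessary calculation data in the scratchpad memory, so that the apparatus can support data of different widths more flexibly and effectively during transcendental function operations. The scratchpad memory can be implemented with various storage devices, such as SRAM, DRAM, eDRAM, memristors, 3D-DRAM, and non-volatile memory.
a register unit for storing scalar data related to the vector operation instruction, including the start address and length of the input vector data and the storage address of the output vector data; it can also store other scalar data used during the operation. The input vector data start address and the output vector data storage address are addresses in the storage unit;
a control unit that decodes the vector operation instruction and controls its execution. In one embodiment, the control unit reads a ready vector operation instruction, decodes it into control signals, and sends them to the other units of the apparatus, which perform the corresponding operations according to the control signals they receive.
a transcendental function calculation unit, which computes the specified transcendental function on the specified vector data under the control of the control unit. It is a vector operation unit: it applies the same operation to all input vector data, that is, it performs the same transcendental function operation on every element of the vector. Note that this is a custom transcendental function calculation unit that does not use the conventional Taylor expansion. Its hardware circuit uses the Coordinate Rotation Digital Computer (CORDIC) algorithm; the circuit pre-processes the input vector data, converting it into the range the CORDIC algorithm can handle, and post-processes the CORDIC result, for example by changing the sign of the output or applying the corresponding basic arithmetic operations. In the present invention both the pre-processing and the post-processing of the transcendental function calculation unit are implemented in hardware, which provides a more complete hardware operation module and further speeds up the whole operation.
The transcendental function calculation unit performs the operation in three stages: pre-processing, CORDIC calculation, and post-processing. Pre-processing comes first: although the CORDIC method can compute various transcendental function values very efficiently, it applies only to a limited input range. In the present invention, a hardware circuit therefore converts the input data into the range CORDIC can handle; the CORDIC calculation circuit then performs the corresponding calculation on the pre-processed data, and the result is passed to the post-processing circuit, processed, and output.
In one embodiment, the transcendental function calculation unit includes a pre-processing module, an iterative calculation module, and a post-processing module. The pre-processing module converts the input vector data into a suitable computable range, the iterative calculation module uses the CORDIC algorithm to compute the transcendental function values of the converted data, and the post-processing module post-processes the resulting values. All three modules are implemented in hardware.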
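To illustrate why pre-processing and post-processing are needed around a core that only works on a limited input range, here is a minimal sketch for the exponential function. It is not the patent's circuit: the decomposition x = k·ln 2 + r is a standard range-reduction technique chosen for illustration, the function names are hypothetical, and `math.exp` stands in for the iterative CORDIC core, which would only ever see the small reduced argument r.

```python
import math

LN2 = math.log(2.0)

def reduce_exp_argument(x):
    """Pre-processing: split x into k*ln(2) + r with |r| <= ln(2)/2."""
    k = round(x / LN2)   # integer number of ln(2) steps
    r = x - k * LN2      # small remainder handed to the core
    return k, r

def exp_via_reduction(x, core=math.exp):
    """Evaluate exp(x) as 2**k * core(r).

    `core` is a stand-in for the iterative stage; it is only called
    on the reduced interval.
    """
    k, r = reduce_exp_argument(x)
    y = core(r)
    # Post-processing: scale the core result by 2**k (an exponent adjust).
    return math.ldexp(y, k)
```

For example, `reduce_exp_argument(30.0)` yields k = 43 and a remainder of roughly 0.19, so the core never has to handle a large argument, while the final scaling restores the full result.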
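In one embodiment, the transcendental function calculation unit is implemented with hardware circuits including, but not limited to, FPGAs, CGRAs, application-specific integrated circuits (ASICs), analog circuits, and memristors.
According to one embodiment of the present invention, the vector transcendental function calculation apparatus further includes an instruction cache unit for storing vector operation instructions to be executed. While an instruction is executing it is also cached in the instruction cache unit; after an instruction has finished executing, it is committed.
According to one embodiment of the present invention, the control unit of the apparatus further includes an instruction queue module, which stores decoded vector operation instructions in order and, after obtaining the scalar data a vector operation instruction needs, sends the instruction to the dependency processing unit.
According to one embodiment of the present invention, the control unit of the apparatus further includes a dependency processing unit, which determines, before the transcendental function calculation unit receives an instruction, whether that operation instruction depends on an earlier operation instruction that has not finished executing, that is, whether both access the same vector storage address. If so, the vector operation instruction is stored in the storage queue module and is provided to the transcendental function calculation unit after the earlier vector operation instruction has finished; otherwise it is provided to the transcendental function calculation unit directly. Specifically, when vector operation instructions access the scratchpad memory, consecutive instructions may access the same block of storage. To guarantee correct results, an instruction that is found to depend on the data of an earlier instruction must wait in the storage queue until the dependency is eliminated.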
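The dependency test described here comes down to an interval-overlap check between the input region of the current instruction and the output regions of instructions still in flight. A minimal sketch, assuming half-open address ranges defined by a start address and a length; the `VecInstr` record and its field names are hypothetical, not taken from the source.

```python
from dataclasses import dataclass

@dataclass
class VecInstr:
    opcode: str
    src_addr: int   # start address of the input vector
    length: int     # number of elements (input and output)
    dst_addr: int   # start address of the output vector

def ranges_overlap(start_a, len_a, start_b, len_b):
    """True if [start_a, start_a+len_a) and [start_b, start_b+len_b) intersect."""
    return start_a < start_b + len_b and start_b < start_a + len_a

def has_dependency(current, in_flight):
    """True if `current` reads any region an unfinished instruction will write."""
    return any(
        ranges_overlap(current.src_addr, current.length,
                       prev.dst_addr, prev.length)
        for prev in in_flight
    )
```

An instruction for which `has_dependency` returns True would be parked in the storage queue; otherwise it can go straight to the calculation unit.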
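According to one embodiment of the present invention, the control unit of the apparatus further includes a storage queue module containing an ordered queue. Instructions that have a data dependency on earlier instructions are held in the ordered queue until the dependency is eliminated, after which the module provides them to the transcendental function calculation unit.
According to one embodiment of the present invention, the apparatus further includes an input/output unit for storing vector operation data in the storage unit, or for retrieving vector operation results from the storage unit. The input/output unit can access the storage unit directly and is responsible for reading vector data from the memory and writing vector data to it.
According to one embodiment of the present invention, the instruction design is streamlined: one instruction completes one full vector transcendental function calculation.
When the present invention executes a vector transcendental function operation, the apparatus fetches the instruction, decodes it, and places it in the instruction queue. Based on the decoding result, it obtains the instruction's parameters, which may be written directly in the instruction's operand fields or read from the registers named by register numbers in the operand fields. The advantage of keeping parameters in registers is that the instruction itself does not need to change: most loops can be implemented simply by using instructions to change the register values, which greatly reduces the number of instructions needed to solve some practical problems. After all operands have been obtained, the dependency processing unit determines whether the data the instruction actually needs depends on data of earlier operation instructions, which decides whether the instruction can be sent to the transcendental function calculation unit immediately. If a dependency on data of an earlier operation instruction is found, the instruction must wait until the instruction it depends on has finished before it can be sent to the calculation unit. In the custom transcendental function calculation unit the instruction executes quickly, and the result, that is, the computed result vector, is written back to the address given by the instruction; the instruction is then complete.
FIG. 2 is a schematic diagram of the format of the transcendental function operation instruction provided by the present invention. As shown in FIG. 2, the instruction includes one opcode and at least one operand field. The opcode indicates which transcendental function to compute, and the operand fields indicate the instruction's data information, which consists of immediate values and/or register numbers. For example, to fetch a vector, the vector start address and vector length can be read from the register given by a register number, and the vector stored at that address can then be fetched from the storage unit using the start address and length.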
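The opcode-plus-operand-field format can be modelled as below. This is a sketch only: the source does not give field widths or an operand order, so the `Imm`/`Reg` wrappers, the `VectorInstruction` record, and the (source address, length, destination address) ordering are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Imm:
    value: int      # scalar written directly in the operand field

@dataclass(frozen=True)
class Reg:
    number: int     # index into the scalar register file

Operand = Union[Imm, Reg]

@dataclass(frozen=True)
class VectorInstruction:
    opcode: str                     # e.g. "EXP", "LOG", "SIN"
    operands: Tuple[Operand, ...]   # assumed order: src address, length, dst address

def resolve(operand, scalar_regs):
    """Return the scalar an operand field denotes."""
    if isinstance(operand, Imm):
        return operand.value
    return scalar_regs[operand.number]

def read_scalars(instr, scalar_regs):
    """Collect all scalar data for an instruction (the instruction queue's job)."""
    return [resolve(op, scalar_regs) for op in instr.operands]
```

Because a `Reg` operand is resolved at issue time, the same encoded instruction can be reused inside a loop while only the register contents change, which is the saving in instruction count that the description mentions.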
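The following vector transcendental function operation instructions are provided:
Exponential instruction (EXP): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the exponential of the vector in the transcendental function calculation unit, that is, Y = exp(X), and writes the result back to the scratchpad memory address specified in the instruction.
Logarithm instruction (LOG): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the logarithm of the vector in the transcendental function calculation unit, that is, Y = log(X), and writes the result back to the scratchpad memory address specified in the instruction.
Sine instruction (SIN): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the sine of the vector in the transcendental function calculation unit, that is, Y = sin(X), and writes the result back to the scratchpad memory address specified in the instruction.
Cosine instruction (COS): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the cosine of the vector in the transcendental function calculation unit, that is, Y = cos(X), and writes the result back to the specified scratchpad memory address.
Tangent instruction (TAN): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the tangent of the vector in the transcendental function calculation unit, that is, Y = tan(X), and writes the result back to the specified scratchpad memory address.
Cotangent instruction (COT): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the cotangent of the vector in the transcendental function calculation unit, that is, Y = cot(X), and writes the result back to the specified scratchpad memory address.
Arcsine instruction (ARCSIN): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the arcsine of the vector in the transcendental function calculation unit, that is, Y = arcsin(X), and writes the result back to the specified scratchpad memory address.
Arccosine instruction (ARCCOS): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the arccosine of the vector in the transcendental function calculation unit, that is, Y = arccos(X), and writes the result back to the specified scratchpad memory address.
Arctangent instruction (ARCTAN): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the arctangent of the vector in the transcendental function calculation unit, that is, Y = arctan(X), and writes the result back to the specified scratchpad memory address.
Arccotangent instruction (ARCCOT): the apparatus fetches vector data of the specified size from the specified address of the scratchpad memory, computes the arccotangent of the vector in the transcendental function calculation unit, that is, Y = arccot(X), and writes the result back to the scratchpad memory address specified in the instruction.
General-purpose CPUs do not provide machine instructions of this kind; such functions are usually implemented by high-level library functions, each of which comprises multiple machine instructions. The present invention implements the above vector transcendental function instructions through a hardware structure.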
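As a reference model of what each instruction is expected to compute, the ten opcodes can be mapped to element-wise library functions. This is only a behavioural sketch, not the hardware method: it assumes LOG is the natural logarithm and that ARCCOT uses the convention arccot(x) = π/2 − arctan(x), neither of which the source specifies, and the `execute` helper and list-based scratchpad are hypothetical.

```python
import math

# Element-wise reference semantics for each opcode (assumptions noted above).
OPS = {
    "EXP": math.exp,
    "LOG": math.log,
    "SIN": math.sin,
    "COS": math.cos,
    "TAN": math.tan,
    "COT": lambda x: 1.0 / math.tan(x),
    "ARCSIN": math.asin,
    "ARCCOS": math.acos,
    "ARCTAN": math.atan,
    "ARCCOT": lambda x: math.pi / 2 - math.atan(x),
}

def execute(opcode, scratchpad, src, length, dst):
    """Apply the opcode's function to scratchpad[src:src+length], store at dst."""
    func = OPS[opcode]
    result = [func(v) for v in scratchpad[src : src + length]]
    scratchpad[dst : dst + length] = result
```

A model like this is mainly useful as a golden reference when checking a hardware or CORDIC-based implementation against expected values.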
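To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to a specific embodiment and the accompanying drawings.
FIG. 3 is a schematic structural diagram of the vector transcendental function calculation apparatus provided by an embodiment of the present invention. As shown in FIG. 3, the apparatus includes a fetch module, a decoding module, an instruction queue module, a scalar register file, a dependency processing unit, a storage queue module, a transcendental function calculation unit, a scratchpad memory, and an IO memory access module;
the fetch module fetches the next instruction to be executed from the instruction sequence and passes it to the decoding module;
the decoding module decodes the instruction and passes the decoded instruction to the instruction queue module;
the instruction queue module temporarily stores the instructions received from the decoding module and obtains the data each instruction operates on from the instruction itself or from the scalar registers, including the start address and size of the vector data and some scalar constants. Once the data has been obtained, the instruction is sent to the dependency processing unit;
the scalar register file provides the scalar registers needed during the operation;
the dependency processing unit handles storage dependencies that may exist between an instruction and earlier instructions. A vector operation instruction accesses the scratchpad memory to obtain its operand vector, and consecutive instructions may access the same block of storage. To guarantee correct results, an instruction that is found to depend on the data of an earlier instruction is sent to the storage queue module and waits there until the dependency is eliminated. Specifically, the unit checks whether the storage interval of this instruction's input data overlaps the storage interval of the output data of any earlier instruction that has not finished executing; a storage interval is determined by a start address and a data length. An overlap means that this instruction needs the result of the earlier instruction as its input, so it must wait until that instruction has finished before it can start. During this time the instruction is held in the storage queue module.
the storage queue module contains an ordered queue; instructions that have a data dependency on earlier instructions are held in the ordered queue until the dependency is eliminated, and instructions whose dependencies have been eliminated are sent to the transcendental function calculation unit;
the transcendental function calculation unit performs the transcendental function calculations, including but not limited to exponential, logarithmic, trigonometric, and inverse trigonometric operations. In practice, all common transcendental functions can be built from exponential, logarithmic, trigonometric, and inverse trigonometric operations combined with the four basic arithmetic operations. The calculation uses the CORDIC method, an iterative method: for a transcendental function f(x), each iteration yields one bit of precision in the result, so a result with 16-bit precision needs at most 16 iterations. The unit also includes pre-processing and post-processing parts. Pre-processing converts the input data into the range that the CORDIC algorithm can handle; post-processing adjusts the CORDIC result according to the particular transcendental function, for example by changing the sign of the output data or applying some basic arithmetic operations. In the prior art, pre-processing and post-processing are usually done in software; in this apparatus both are implemented entirely in hardware circuits;
the scratchpad memory is a temporary storage device dedicated to vector data and able to support vector data of different sizes; it stores the vector data to be operated on and the operation results;
the IO memory access module directly accesses the scratchpad memory and is responsible for reading data from it and writing data to it.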
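The three stages of the calculation unit (pre-processing, CORDIC iteration, post-processing) can be sketched for sine and cosine with the textbook circular CORDIC in rotation mode. This floating-point model is an illustration under assumptions, not the patent's circuit: real hardware would use fixed-point shifts and adds with a small table of arctangent constants, and the function name and the 16-iteration default are chosen here to mirror the 16-bit example in the text.

```python
import math

def cordic_sin_cos(theta, iterations=16):
    """Return (sin(theta), cos(theta)) via circular CORDIC, rotation mode."""
    # Pre-processing: fold theta into [-pi/2, pi/2], inside the range
    # where the micro-rotations converge (about +/-1.74 rad).
    theta = math.remainder(theta, 2.0 * math.pi)   # now in [-pi, pi]
    flip = False
    if theta > math.pi / 2:
        theta -= math.pi
        flip = True
    elif theta < -math.pi / 2:
        theta += math.pi
        flip = True

    # Constants a hardware unit would hold in a lookup table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    # Iteration: each step rotates by +/- atan(2^-i) using shift-and-add
    # style updates, driving the residual angle z toward zero.
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]

    # Post-processing: undo the fold, since sin(t + pi) = -sin(t)
    # and cos(t + pi) = -cos(t).
    if flip:
        x, y = -x, -y
    return y, x
```

With 16 iterations the residual angle is bounded by atan(2^-15), roughly 3e-5 radians, so the results agree with the library sine and cosine to about four decimal places; more iterations tighten the bound by about one bit each.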
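FIG. 4 is a flowchart of the vector transcendental function operation apparatus executing a vector transcendental function operation instruction according to an embodiment of the present invention. As shown in FIG. 4, the process includes:
S1: the fetch module fetches the vector transcendental function operation instruction and sends it to the decoding module.
S2: the decoding module decodes the instruction and sends the decoded instruction to the instruction queue module.
S3: in the instruction queue module, the required scalar data, that is, the data corresponding to the instruction's operand fields, is obtained from the instruction's immediate values or from the registers. It includes the input vector address, the input vector length, the output vector address, and any constants needed for the transcendental function operation.
S4: once the required scalar data has been obtained, the instruction queue module sends the instruction to the dependency processing unit.
S5: the dependency processing unit checks whether the instruction has a data dependency on any earlier instruction that has not finished executing. If it does, the instruction is sent to the storage queue module, where it waits until the dependency no longer exists, and the storage queue module then sends it to the transcendental function calculation unit. If there is no dependency, the instruction is sent directly to the transcendental function calculation unit.
S6: using the vector storage address and length, the transcendental function calculation unit fetches part of the required vector data from the scratchpad memory through the input/output unit.
S7: the pre-processing module in the transcendental function calculation unit transforms the input data into the convergence domain of the Coordinate Rotation Digital Computer (CORDIC) algorithm.
S8: the iterative calculation module in the transcendental function calculation unit computes, in parallel, the transcendental function values of the fetched part of the vector data.
S9: return to step S6; the transcendental function calculation unit fetches the next part of the vector data and repeats the calculation until all input vector data has been processed.
S10: after the operation is complete, the result vector is written back through the input/output unit to the output vector address in the scratchpad memory.
In summary, the present invention provides a vector transcendental function calculation apparatus which, together with the corresponding instructions, addresses the growing number of vector transcendental function computing tasks in current computing, including artificial neural network algorithms, which currently perform very well. Compared with existing conventional solutions, the present invention offers streamlined instructions, ease of use, flexible support for vector lengths, and ample on-chip buffering.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that they are merely specific embodiments of the invention and do not limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.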

Claims (15)

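  1. An apparatus for executing vector transcendental function operations, wherein the apparatus comprises:
    a storage unit for storing vector data related to a vector operation instruction;
    a register unit for storing scalar data related to the vector operation instruction;
    a control unit for decoding the vector operation instruction and controlling its execution;
    a transcendental function calculation unit for performing the transcendental function calculation specified by the vector operation instruction;
    wherein the transcendental function calculation unit includes a pre-processing part and an iterative calculation part, the pre-processing part pre-processes the input vector data so that it lies within the range the CORDIC algorithm can handle, and the iterative calculation part applies the CORDIC algorithm iteratively to the pre-processed input vector data to obtain the result vector data.
  2. The apparatus of claim 1, wherein the transcendental function calculation unit is implemented in hardware.
  3. The apparatus of claim 1, wherein the storage unit is a scratchpad memory.
  4. The apparatus of any one of claims 1-3, wherein the scalar data stored in the register unit includes the input vector data start address, the output vector data storage address, and the input vector data length related to the vector operation instruction; wherein the input vector data start address and the output vector data storage address are addresses in the storage unit.
  5. The apparatus of any one of claims 1-3, wherein the transcendental function calculation unit further includes a post-processing part for post-processing the result vector data output by the iterative calculation part.
  6. The apparatus of claim 1, wherein the control unit includes:
    an instruction queue module for storing decoded vector operation instructions in order and obtaining the scalar data related to each vector operation instruction.
  7. The apparatus of claim 1 or 6, wherein the control unit includes:
    a dependency processing unit for determining, before the transcendental function calculation unit receives the current vector operation instruction, whether the current vector operation instruction depends on an earlier operation instruction that has not finished executing.
  8. The apparatus of claim 1 or 6, wherein the control unit includes:
    a storage queue module for temporarily storing the current vector operation instruction when it depends on an earlier operation instruction that has not finished executing, and for sending the stored vector operation instruction to the transcendental function calculation unit once the dependency is eliminated.
  9. The apparatus of any one of claims 1-3 and 6, wherein the apparatus further includes:
    an instruction cache unit for storing vector operation instructions to be executed;
    an input/output unit for storing vector data related to a vector operation instruction in the storage unit, or for retrieving the result of a vector operation instruction from the storage unit.
  10. The apparatus of claim 1, wherein the vector operation instruction includes an opcode and operand fields;
    the opcode indicates which transcendental function to execute;
    the operand fields include immediate values and/or register numbers and indicate the scalar data related to the vector operation, wherein a register number points to an address in the register unit.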
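  11. An apparatus for executing vector transcendental function operations, comprising:
    a fetch module for fetching the next vector operation instruction to be executed from the instruction sequence and passing it to the decoding module;
    a decoding module for decoding the vector operation instruction and passing the decoded instruction to the instruction queue module;
    an instruction queue module for temporarily storing the decoded vector operation instruction and obtaining the scalar data related to it from the instruction itself or from the scalar registers; once the scalar data has been obtained, the vector operation instruction is sent to the dependency processing unit;
    a scalar register file, comprising multiple scalar registers, for storing scalar data related to vector operation instructions;
    a dependency processing unit for determining whether the vector operation instruction depends on an earlier operation instruction that has not finished executing; if it does, the vector operation instruction is sent to the storage queue module, and if not, it is sent to the transcendental function calculation unit;
    a storage queue module for storing vector operation instructions that depend on earlier operation instructions and, once the dependency is resolved, sending them to the transcendental function calculation unit;
    a transcendental function calculation unit for performing the transcendental function calculation on the input vector data according to the received vector operation instruction;
    a scratchpad memory for storing input vector data and output vector data;
    an input/output access module for directly accessing the scratchpad memory, responsible for reading input vector data from it and writing output vector data to it.
  12. The apparatus of claim 11, wherein the transcendental function calculation unit includes:
    a pre-processing module for pre-processing the input vector data, converting it into the range that CORDIC can handle;
    an iterative calculation module for performing the CORDIC calculation on the pre-processed input vector data to obtain the transcendental function result;
    a post-processing module for post-processing that result to obtain the output vector data.
  13. The apparatus of claim 11 or 12, wherein the transcendental function calculation unit is implemented in hardware.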
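  14. A method for executing vector transcendental function operations, wherein the method comprises:
    the fetch module fetches the next vector operation instruction to be executed from the instruction sequence and passes it to the decoding module;
    the decoding module decodes the vector operation instruction and passes the decoded instruction to the instruction queue module;
    the instruction queue module temporarily stores the decoded vector operation instruction and obtains the scalar data related to it from the instruction itself or from the scalar registers; once the scalar data has been obtained, the vector operation instruction is sent to the dependency processing unit;
    the dependency processing unit determines whether the vector operation instruction depends on an earlier operation instruction that has not finished executing; if it does, the vector operation instruction is sent to the storage queue module, and if not, it is sent to the transcendental function calculation unit;
    the storage queue module stores vector operation instructions that depend on earlier operation instructions and, once the dependency is resolved, sends them to the transcendental function calculation unit;
    the transcendental function calculation unit, according to the received vector operation instruction, fetches the input vector data from the scratchpad memory through the input/output access module, performs the transcendental function operation on it, and writes the result to the scratchpad memory through the input/output access module.
  15. The method of claim 14, wherein the transcendental function calculation unit pre-processes the input vector data, converting it into the range that CORDIC can handle; then performs the CORDIC calculation on the pre-processed input vector data to obtain the transcendental function result; and finally post-processes that result to obtain the output vector data.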
PCT/CN2016/081071 2016-04-26 2016-05-05 Apparatus and method for executing vector transcendental function operations WO2017185390A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP16899901.9A EP3451153B1 (en) 2016-04-26 2016-05-05 Apparatus and method for executing transcendental function operation of vectors
US16/171,295 US20190065191A1 (en) 2016-04-26 2018-10-25 Apparatus and Methods for Vector Based Transcendental Functions
US16/247,237 US20190146793A1 (en) 2016-04-26 2019-01-14 Apparatus and Methods for Vector Based Transcendental Functions

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610266916.0A CN107315564B (zh) 2016-04-26 2016-04-26 Apparatus and method for executing vector transcendental function operations
CN201610266916.0 2016-04-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/171,295 Continuation-In-Part US20190065191A1 (en) 2016-04-26 2018-10-25 Apparatus and Methods for Vector Based Transcendental Functions

Publications (1)

Publication Number Publication Date
WO2017185390A1 true WO2017185390A1 (zh) 2017-11-02

Family

ID=60160572

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081071 WO2017185390A1 (zh) 2016-04-26 2016-05-05 Apparatus and method for executing vector transcendental function operations

Country Status (4)

Country Link
US (2) US20190065191A1 (zh)
EP (1) EP3451153B1 (zh)
CN (2) CN107315564B (zh)
WO (1) WO2017185390A1 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
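CN109754061B (zh) * 2017-11-07 2023-11-24 上海寒武纪信息科技有限公司 Execution method for convolution extension instructions and related products
CN108388446A (zh) 2018-02-05 2018-08-10 上海寒武纪信息科技有限公司 Operation module and method
CN109271134B (zh) * 2018-12-13 2020-08-25 上海燧原科技有限公司 Transcendental function operation method and apparatus, storage medium, and electronic device
CN111260048B (zh) * 2020-01-14 2023-09-01 上海交通大学 Method for implementing activation functions in a memristor-based neural network accelerator
US20210350221A1 (en) * 2020-05-05 2021-11-11 Silicon Laboratories Inc. Neural Network Inference and Training Using A Universal Coordinate Rotation Digital Computer
CN114707110B (zh) * 2022-06-07 2022-08-30 中科亿海微电子科技(苏州)有限公司 Computing device and processor core for trigonometric and hyperbolic function extension instructions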

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040015882A1 (en) * 2001-06-05 2004-01-22 Ping Tak Peter Tang Branch-free software methodology for transcendental functions
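CN101630243A (zh) * 2009-08-14 2010-01-20 西北工业大学 Transcendental function device and method for implementing transcendental functions with the device
CN102722469A (zh) * 2012-05-28 2012-10-10 西安交通大学 Elementary transcendental function operation method based on a floating-point unit, and coprocessor thereof
CN102799412A (zh) * 2012-07-09 2012-11-28 上海大学 CORDIC accelerator based on a parallel pipeline design
CN103677738A (zh) * 2013-09-26 2014-03-26 中国人民解放军国防科学技术大学 Low-latency method and device for implementing elementary transcendental functions based on a mixed-mode CORDIC algorithm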

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0334621A3 (en) * 1988-03-23 1992-12-30 Du Pont Pixel Systems Limited System with improved instruction execution
US5537606A (en) * 1995-01-31 1996-07-16 International Business Machines Corporation Scalar pipeline replication for parallel vector element processing
US5848241A (en) * 1996-01-11 1998-12-08 Openframe Corporation Ltd. Resource sharing facility functions as a controller for secondary storage device and is accessible to all computers via inter system links
US7734581B2 (en) * 2004-05-18 2010-06-08 Oracle International Corporation Vector reads for array updates
US7707387B2 (en) * 2005-06-01 2010-04-27 Microsoft Corporation Conditional execution via content addressable memory and parallel computing execution model
US8819099B2 (en) * 2006-09-26 2014-08-26 Qualcomm Incorporated Software implementation of matrix inversion in a wireless communication system
US7827357B2 (en) * 2007-07-31 2010-11-02 Intel Corporation Providing an inclusive shared cache among multiple core-cache clusters
US20100088309A1 (en) * 2008-10-05 2010-04-08 Microsoft Corporation Efficient large-scale joining for querying of column based data encoded structures
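CN101957743B (zh) * 2010-10-12 2012-08-29 中国电子科技集团公司第三十八研究所 Parallel digital signal processor
CN102262525B (zh) * 2011-08-29 2014-11-19 孙瑞玮 Vector floating-point operation device and method based on vector operations
JP5834997B2 (ja) * 2012-02-23 2015-12-24 株式会社ソシオネクスト Vector processor and processing method for a vector processor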
US9483266B2 (en) * 2013-03-15 2016-11-01 Intel Corporation Fusible instructions and logic to provide OR-test and AND-test functionality using multiple test sources
US9813223B2 (en) * 2013-04-17 2017-11-07 Intel Corporation Non-linear modeling of a physical system using direct optimization of look-up table values
US9691034B2 (en) * 2013-05-14 2017-06-27 The Trustees Of Princeton University Machine-learning accelerator (MLA) integrated circuit for extracting features from signals and performing inference computations
US9594983B2 (en) * 2013-08-02 2017-03-14 Digimarc Corporation Learning systems and methods
US9880845B2 (en) * 2013-11-15 2018-01-30 Qualcomm Incorporated Vector processing engines (VPEs) employing format conversion circuitry in data flow paths between vector data memory and execution units to provide in-flight format-converting of input vector data to execution units for vector processing operations, and related vector processor systems and methods
US10168990B1 (en) * 2014-01-17 2019-01-01 The Mathworks, Inc. Automatic replacement of a floating-point function to facilitate fixed-point program code generation
US9846836B2 (en) * 2014-06-13 2017-12-19 Microsoft Technology Licensing, Llc Modeling interestingness with deep neural networks
CN104834502A (zh) * 2015-05-11 2015-08-12 江苏宏云技术有限公司 Method for implementing efficient CORDIC instructions in a DSP
US10586168B2 (en) * 2015-10-08 2020-03-10 Facebook, Inc. Deep translations
US10366163B2 (en) * 2016-09-07 2019-07-30 Microsoft Technology Licensing, Llc Knowledge-guided structural attention processing
US10916135B2 (en) * 2018-01-13 2021-02-09 Toyota Jidosha Kabushiki Kaisha Similarity learning and association between observations of multiple connected vehicles

Also Published As

Publication number Publication date
US20190146793A1 (en) 2019-05-16
EP3451153A4 (en) 2019-12-11
CN107315564A (zh) 2017-11-03
CN111651200B (zh) 2023-09-26
CN107315564B (zh) 2020-07-17
CN111651200A (zh) 2020-09-11
EP3451153A1 (en) 2019-03-06
US20190065191A1 (en) 2019-02-28
EP3451153B1 (en) 2020-11-18

Similar Documents

Publication Publication Date Title
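WO2017185390A1 (zh) Apparatus and method for executing vector transcendental function operations
CN107315574B (zh) Apparatus and method for executing matrix multiplication operations
CN111857819B (zh) Apparatus and method for executing matrix addition/subtraction operations
CN107315718B (zh) Apparatus and method for executing vector inner product operations
WO2017185395A1 (zh) Apparatus and method for executing vector comparison operations
WO2017185384A1 (zh) Apparatus and method for executing vector circular shift operations
CN107315568B (zh) Apparatus for executing vector logic operations
CN111651206B (zh) Apparatus and method for executing vector outer product operations
WO2017185392A1 (zh) Apparatus and method for executing the four basic arithmetic operations on vectors
WO2017185385A1 (zh) Apparatus and method for executing vector merge operations
WO2017185419A1 (zh) Apparatus and method for executing vector maximum/minimum operations
EP3451158B1 (en) Device and method for generating random vectors conforming to certain distribution
CN115328547A (zh) Data processing method, electronic device, and storage medium
KR102467544B1 (ko) Computing device and operating method thereof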

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016899901

Country of ref document: EP

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16899901

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016899901

Country of ref document: EP

Effective date: 20181126