WO2017185392A1 - Device and method for executing the four basic vector arithmetic operations - Google Patents

Device and method for executing the four basic vector arithmetic operations


Publication number
WO2017185392A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
instruction
unit
arithmetic
operation instruction
Prior art date
Application number
PCT/CN2016/081107
Inventor
陶劲桦
支天
刘少礼
陈天石
陈云霁
Original Assignee
北京中科寒武纪科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京中科寒武纪科技有限公司
Priority to EP16899903.5A priority Critical patent/EP3451185A4/en
Priority to EP21154589.2A priority patent/EP3832500B1/en
Publication of WO2017185392A1 publication Critical patent/WO2017185392A1/zh
Priority to US16/172,657 priority patent/US11341211B2/en
Priority to US16/172,592 priority patent/US10585973B2/en
Priority to US16/172,432 priority patent/US10997276B2/en
Priority to US16/172,515 priority patent/US10592582B2/en
Priority to US16/172,653 priority patent/US11507640B2/en
Priority to US16/172,629 priority patent/US11100192B2/en
Priority to US16/172,649 priority patent/US11436301B2/en
Priority to US16/172,533 priority patent/US10599745B2/en
Priority to US16/172,566 priority patent/US20190073339A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22Microcontrol or microprogram arrangements
    • G06F9/223Execution means for microinstructions irrespective of the microinstruction function, e.g. decoding of microinstructions and nanoinstructions; timing of microinstructions; programmable logic arrays; delays and fan-out problems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/3004Arrangements for executing specific machine instructions to perform operations on memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/30105Register structure
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/3013Organisation of register space, e.g. banked or distributed register file according to data content, e.g. floating-point registers, address registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30149Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34Addressing or accessing the instruction operand or the result ; Formation of operand address; Addressing modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3854Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3856Reordering of instructions, e.g. using queues or age tags
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • G06F15/8053Vector processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50Adding; Subtracting
    • G06F7/505Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination
    • G06F7/506Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination with simultaneous carry generation for, or propagation over, two or more stages
    • G06F7/507Adding; Subtracting in bit-parallel fashion, i.e. having a different digit-handling circuit for each denomination with simultaneous carry generation for, or propagation over, two or more stages using selection between two conditionally calculated carry or sum values
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • The invention relates to a device and a method for vector arithmetic (the four basic operations), used to execute vector arithmetic efficiently and flexibly according to a vector arithmetic instruction. It addresses the problem that a growing number of algorithms in the computing field contain large numbers of vector arithmetic operations.
  • The four vector arithmetic operations are addition, subtraction, multiplication, and division applied to the corresponding components of vectors.
  • For vectors a = [a1, a2, ..., an] and b = [b1, b2, ..., bn], vector multiplication is defined as: [a1*b1, a2*b2, ..., an*bn]
  • Vector division is defined as: [a1/b1, a2/b2, ..., an/bn].
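  • The element-wise definitions can be illustrated with a minimal Python sketch. The helper name `elementwise` and the sample vectors are assumptions made for illustration and do not come from the patent; addition and subtraction are taken to follow the same component-by-component pattern as the multiplication and division definitions given in the text.

```python
from operator import add, sub, mul, truediv


def elementwise(op, a, b):
    """Apply a binary operator to the corresponding components of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [op(x, y) for x, y in zip(a, b)]


# Illustrative input vectors (hypothetical values).
a = [8.0, 6.0, 4.0]
b = [2.0, 3.0, 4.0]

vec_sum = elementwise(add, a, b)           # [a1+b1, ..., an+bn]
vec_difference = elementwise(sub, a, b)    # [a1-b1, ..., an-bn]
vec_product = elementwise(mul, a, b)       # [a1*b1, ..., an*bn]
vec_quotient = elementwise(truediv, a, b)  # [a1/b1, ..., an/bn]
```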
  • One known scheme for performing vector arithmetic is to use a general-purpose processor, which carries out the operations by executing general-purpose instructions through a general-purpose register file and general-purpose functional units.
  • However, a single general-purpose processor is designed mainly for scalar computation, and its performance on vector arithmetic is low.
  • Communication between general-purpose processors may in turn become a performance bottleneck.
  • In another known scheme, vector computation is performed on a graphics processing unit (GPU), where vector arithmetic is carried out by executing general SIMD instructions using a general-purpose register file and general-purpose stream processing units.
  • However, the GPU's on-chip cache is too small, so large-scale vector arithmetic requires continuous off-chip data transfers, and off-chip bandwidth becomes the main performance bottleneck.
  • In a further known scheme, vector computation is performed on a custom-built vector arithmetic device, in which a custom register file and custom processing units carry out the vector arithmetic.
  • However, existing dedicated vector arithmetic devices are limited by their register files and cannot flexibly support vector arithmetic on vectors of different lengths.
  • The present invention provides a vector arithmetic device for performing vector arithmetic according to a vector arithmetic instruction, including:
  • a storage unit for storing vectors;
  • a register unit for storing vector addresses, where a vector address is the address at which a vector is stored in the storage unit;
  • a vector arithmetic unit configured to obtain a vector arithmetic instruction, obtain a vector address from the register unit according to the instruction, obtain the corresponding vector from the storage unit according to the vector address, and then perform the vector arithmetic operation on the obtained vector to produce the vector arithmetic result.
  • The vector arithmetic device and method provided by the present invention temporarily store the vector data participating in the computation in a scratchpad memory.
  • As a result, the vector arithmetic unit can support data of different widths more flexibly and efficiently, and dependency problems in data storage can be resolved, which improves execution performance on workloads containing many vector computing tasks.
  • The instructions used in the invention have a compact format, which makes the instruction set easy to use and the supported vector length flexible.
  • The invention can be applied in scenarios including, but not limited to: data processing; electronic products such as robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, video cameras, projectors, watches, earphones, mobile storage, and wearable devices; vehicles such as aircraft, ships, and other vehicles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and medical equipment such as nuclear magnetic resonance instruments, B-mode ultrasound scanners, and electrocardiographs.
  • FIG. 1 is a schematic structural diagram of the vector arithmetic device provided by the present invention.
  • FIG. 2 is a schematic diagram of the format of the instruction set provided by the present invention.
  • FIG. 3 is a schematic structural diagram of a vector arithmetic device according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a vector arithmetic device executing a vector arithmetic instruction according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a vector arithmetic unit according to an embodiment of the present invention.
  • The invention provides a vector arithmetic device and a matching instruction set.
  • The device comprises a storage unit, a register unit, and a vector arithmetic unit.
  • The storage unit stores vectors.
  • The register unit stores the addresses at which the vectors are stored.
  • According to a vector arithmetic instruction, the vector arithmetic unit obtains a vector address from the register unit, obtains the corresponding vector from the storage unit according to that address, and then performs the vector arithmetic operation on the obtained vector to produce the vector arithmetic result.
  • The invention temporarily stores the vector data participating in the computation in a scratchpad memory, so that vector arithmetic can support data of different widths more flexibly and effectively, improving execution performance on workloads containing many vector computing tasks.
  • The vector arithmetic device includes:
  • A storage unit, which may be a scratchpad memory capable of supporting vector data of different sizes.
  • The invention temporarily stores the necessary computation data in the scratchpad memory, so that data of different widths can be supported more flexibly and efficiently during vector arithmetic.
  • The storage unit can be implemented with a variety of memory devices (SRAM, eDRAM, DRAM, memristor, 3D-DRAM, non-volatile memory, and so on).
  • A register unit for storing vector addresses, where a vector address is the address at which a vector is stored in the storage unit.
  • The register unit may be a scalar register file that provides the scalar registers required during operation. The scalar registers store not only vector addresses but also scalar data.
  • The vector arithmetic unit therefore obtains not only vector addresses from the register unit but also the corresponding scalars.
  • There are generally multiple register units, forming a register file that stores multiple vector addresses and scalars.
  • A vector arithmetic unit configured to obtain a vector arithmetic instruction, obtain a vector address from the register unit according to the instruction, obtain the corresponding vector from the storage unit according to the vector address, perform the vector arithmetic operation on the obtained vector to produce the result, and store the result in the storage unit.
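  • The interaction of the three units can be sketched as a toy software model. This is an illustration under stated assumptions, not the patented hardware: the class name `VectorDevice`, its method names, the storage and register sizes, and the choice to pass the vector length directly are all hypothetical.

```python
from operator import add, sub, mul, truediv

OPS = {"add": add, "sub": sub, "mul": mul, "div": truediv}


class VectorDevice:
    """Toy model of a storage unit, a register unit and an arithmetic unit."""

    def __init__(self, storage_size=32, num_registers=8):
        self.storage = [0.0] * storage_size   # storage unit (e.g. a scratchpad)
        self.registers = [0] * num_registers  # register unit: holds vector addresses

    def write_vector(self, addr, values):
        if addr < 0 or addr + len(values) > len(self.storage):
            raise IndexError("vector does not fit in storage")
        self.storage[addr:addr + len(values)] = values

    def read_vector(self, addr, length):
        if addr < 0 or addr + length > len(self.storage):
            raise IndexError("vector lies outside storage")
        return self.storage[addr:addr + length]

    def execute(self, op, reg_a, reg_b, reg_out, length):
        # 1. Obtain the vector addresses from the register unit.
        addr_a = self.registers[reg_a]
        addr_b = self.registers[reg_b]
        # 2. Obtain the vectors from the storage unit at those addresses.
        a = self.read_vector(addr_a, length)
        b = self.read_vector(addr_b, length)
        # 3. Perform the element-wise operation and store the result.
        result = [OPS[op](x, y) for x, y in zip(a, b)]
        self.write_vector(self.registers[reg_out], result)
        return result


dev = VectorDevice()
dev.write_vector(0, [1.0, 2.0, 3.0])
dev.write_vector(8, [4.0, 5.0, 6.0])
dev.registers[0], dev.registers[1], dev.registers[2] = 0, 8, 16
product = dev.execute("mul", reg_a=0, reg_b=1, reg_out=2, length=3)
```

  • Because the registers hold addresses rather than vector data, the same instruction form works for any vector length that fits in the storage unit, which is the flexibility the text attributes to the scratchpad design.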
  • The vector arithmetic unit includes a vector addition component, a vector subtraction component, a vector multiplication component, and a vector division component, and is organized as a multi-stage pipeline in which the addition and subtraction components are in the first pipeline stage.
  • The multiplication and division components are in the second pipeline stage. Because these components sit in different pipeline stages, a series of vector arithmetic instructions whose order matches the order of the stages of the corresponding components can be carried out more efficiently.
  • The vector arithmetic device further includes an instruction cache unit configured to store the vector arithmetic instructions to be executed.
  • An instruction also remains cached in the instruction cache unit while it is being executed.
  • The instruction cache unit may be a reorder buffer.
  • The vector arithmetic device further includes an instruction processing unit configured to obtain a vector arithmetic instruction from the instruction cache unit, process it, and provide it to the vector arithmetic unit.
  • The instruction processing unit includes:
  • an instruction fetch module configured to obtain a vector arithmetic instruction from the instruction cache unit;
  • a decoding module configured to decode the obtained vector arithmetic instruction;
  • an instruction queue for sequentially storing the decoded vector arithmetic instructions. Because different instructions may have dependencies on the registers they use, the queue buffers the decoded instructions and issues them once the dependencies are satisfied.
  • The vector arithmetic device further includes a dependency processing unit configured to determine, before the vector arithmetic unit obtains a vector arithmetic instruction, whether that instruction accesses the same vector as the preceding vector arithmetic instruction. If so, the instruction is stored in a storage queue and is provided to the vector arithmetic unit only after the preceding instruction has finished executing; otherwise, the instruction is provided to the vector arithmetic unit directly.
  • Successive instructions may access the same block of storage space. To ensure that the execution result is correct, if the current instruction is found to have a data dependency on a preceding instruction, it must wait in the storage queue until the dependency is removed.
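  • The dependency check described above can be sketched as an address-range overlap test combined with an ordered storage queue. The names `VectorInstr`, `has_dependency`, `dispatch` and `retire` are hypothetical, and the sketch makes the conservative assumption that any overlap between the address ranges accessed by two instructions counts as "accessing the same vector"; the patent text does not specify the comparison in more detail.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class VectorInstr:
    name: str
    ranges: list  # (start_address, length) pairs the instruction accesses


def ranges_overlap(r1, r2):
    (s1, n1), (s2, n2) = r1, r2
    return s1 < s2 + n2 and s2 < s1 + n1


def has_dependency(current, previous):
    """True if the two instructions touch any common storage location."""
    return any(ranges_overlap(a, b)
               for a in current.ranges for b in previous.ranges)


def dispatch(instr, unfinished, store_queue):
    """Issue the instruction, or park it in the storage queue if it conflicts
    with an earlier instruction that is still executing or still waiting."""
    earlier = list(unfinished) + list(store_queue)
    if any(has_dependency(instr, prev) for prev in earlier):
        store_queue.append(instr)
        return "queued"
    unfinished.append(instr)
    return "issued"


def retire(finished, unfinished, store_queue):
    """Mark an instruction as finished and release queued instructions,
    in order, for as long as the head of the queue has no conflict."""
    unfinished.remove(finished)
    while store_queue and not any(
            has_dependency(store_queue[0], prev) for prev in unfinished):
        unfinished.append(store_queue.popleft())


unfinished, store_queue = [], deque()
va = VectorInstr("VA", [(0, 4), (4, 4), (8, 4)])      # writes addresses 8..11
vmv = VectorInstr("VMV", [(8, 4), (12, 4), (16, 4)])  # reads addresses 8..11
first = dispatch(va, unfinished, store_queue)
second = dispatch(vmv, unfinished, store_queue)
retire(va, unfinished, store_queue)
```

  • In this usage example the second instruction reads the range the first one writes, so it waits in the queue and is released only when the first instruction retires.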
  • The vector arithmetic device further includes an input/output unit configured to store vectors in the storage unit or to obtain vector arithmetic results from the storage unit.
  • The input/output unit can access the storage unit directly and is responsible for reading vector data from, or writing vector data to, the memory.
  • The present invention also provides a vector arithmetic method for performing vector arithmetic according to a vector arithmetic instruction. The method includes:
  • Step S2: storing a vector address, the vector address indicating the location at which the vector is stored in step S1;
  • Before step S3, the method further includes:
  • storing the decoded vector arithmetic instructions sequentially.
  • Before step S3, the method further includes:
  • determining whether the vector arithmetic instruction accesses the same vector as the preceding vector arithmetic instruction; if so, storing the instruction in a storage queue and executing step S3 only after the preceding instruction has finished executing; otherwise, executing step S3 directly.
  • The method further comprises storing the result of the vector arithmetic operation.
  • Step S1 comprises storing the vector in a scratchpad memory.
  • The vector arithmetic instruction includes an operation code and at least one operation field, where the operation code indicates the function of the instruction and the operation field indicates the data information of the instruction.
  • The four vector arithmetic operations are vector addition, vector subtraction, vector multiplication, and vector division.
  • The vector arithmetic unit is a multi-stage pipeline with a first pipeline stage and a second pipeline stage, where vector addition and vector subtraction are performed in the first stage and vector multiplication and vector division are performed in the second stage.
  • The instruction set of the present invention adopts a load/store architecture: the vector arithmetic unit does not operate on data in memory.
  • The instruction set uses a reduced instruction set architecture.
  • It provides only the most basic vector arithmetic operations.
  • Complex vector arithmetic is built by combining these simple instructions, so that instructions can execute in a single cycle at a high clock frequency.
  • The instruction set also uses fixed-length instructions, so the vector arithmetic device proposed by the present invention can fetch the next instruction during the decode stage of the previous instruction.
  • The device fetches an instruction, decodes it, and sends it to the instruction queue for storage. It then obtains each parameter of the instruction according to the decoding result. A parameter may be written directly in an operation field of the instruction, or it may be read from the register specified by a register number in the operation field.
  • The dependency processing unit determines whether the data the instruction actually needs has a dependency on a preceding instruction, which decides whether the instruction can be sent to the execution unit immediately. If a dependency on preceding data is found, the instruction must wait until the instruction it depends on has finished executing before it can be sent to the arithmetic unit. In the custom arithmetic unit the instruction executes quickly, and the result, that is, the generated vector arithmetic result, is written back to the address provided by the instruction, which completes the instruction.
  • The vector arithmetic instruction includes one operation code and multiple operation fields. The operation code indicates the function of the instruction, and the vector arithmetic unit performs the corresponding operation by identifying the operation code.
  • The operation fields indicate the data information of the instruction, which may be immediate values or register numbers. For example, to obtain a vector, the vector start address and the vector length can be read from the registers identified by the register numbers, and the vector stored at the corresponding address can then be fetched from the storage unit according to that start address and length.
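  • A fixed-length instruction with one operation code and several operation fields can be sketched as bit packing. The layout below (an 8-bit opcode followed by four 6-bit register-number fields in a 32-bit word), the opcode value, and the function names are assumptions chosen for illustration; the patent does not give field widths or encodings.

```python
# Hypothetical layout: 8-bit opcode + four 6-bit operation fields = 32 bits.
OPCODE_BITS = 8
FIELD_BITS = 6
NUM_FIELDS = 4
OPCODE_MASK = (1 << OPCODE_BITS) - 1
FIELD_MASK = (1 << FIELD_BITS) - 1


def encode(opcode, fields):
    """Pack an opcode and register-number fields into one instruction word."""
    if len(fields) != NUM_FIELDS:
        raise ValueError("expected %d operation fields" % NUM_FIELDS)
    word = opcode & OPCODE_MASK
    for field in fields:
        word = (word << FIELD_BITS) | (field & FIELD_MASK)
    return word


def decode(word):
    """Split an instruction word back into (opcode, fields)."""
    fields = []
    for _ in range(NUM_FIELDS):
        fields.append(word & FIELD_MASK)
        word >>= FIELD_BITS
    fields.reverse()
    return word & OPCODE_MASK, fields


def fetch_operand(storage, registers, addr_reg, len_reg):
    """Read the start address and length from two registers, then fetch
    the vector stored at that address in the storage unit."""
    start = registers[addr_reg]
    length = registers[len_reg]
    return storage[start:start + length]


OP_VA = 0x01  # hypothetical opcode value for a vector addition
word = encode(OP_VA, [1, 2, 3, 4])
opcode, fields = decode(word)

storage = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
registers = [0, 2, 3, 0, 0]  # register 1: start address 2, register 2: length 3
operand = fetch_operand(storage, registers, fields[0], fields[1])
```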
  • The instruction set contains vector arithmetic instructions with different functions:
  • Vector addition instruction (VA);
  • Vector-plus-scalar instruction (VAS). According to the instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, fetches scalar data from a specified address of the scalar register file, adds the scalar value to each element of the vector in the scalar operation unit, and writes the result back to a specified address of the scratchpad memory;
  • Vector subtraction instruction (VS). According to the instruction, the device fetches two pieces of vector data of a specified size from specified addresses of the scratchpad memory, performs the subtraction in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
  • Scalar-minus-vector instruction (SSV). According to the instruction, the device fetches scalar data from a specified address of the scalar register file, fetches vector data from a specified address of the scratchpad memory, subtracts each corresponding element of the vector from the scalar in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
  • Vector multiplication instruction (VMV);
  • Vector-times-scalar instruction (VMS);
  • Vector division instruction (VD);
  • Scalar-divided-by-vector instruction. According to the instruction, the device fetches scalar data from a specified location of the scalar register file, fetches vector data of a specified size from a specified location of the scratchpad memory, divides the scalar by each corresponding element of the vector, and writes the result back to a specified location of the scratchpad memory;
  • According to the instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, extracts the element at the specified position of the vector as output, and writes the result back to a specified address of the scalar register file;
  • Vector load instruction (VLOAD);
  • Vector store instruction (VS). According to the instruction, the device stores vector data of a specified size from a specified address of the scratchpad memory to an external destination address;
  • Vector move instruction (VMOVE). According to the instruction, the device copies vector data of a specified size from a specified address of the scratchpad memory to another specified address of the scratchpad memory.
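  • The arithmetic and move instructions listed above can be sketched as a small interpreter over a scratchpad list and a scalar register list. Several assumptions are made here and should be read as such: the function name `execute` and its parameters are hypothetical; operands are given as direct scratchpad addresses rather than register numbers to keep the sketch short; the mnemonic `SDV` for the scalar-divided-by-vector instruction is invented for the sketch because the text gives none; `VS` denotes subtraction only; and the behavior of VA, VMV, VMS and VD is inferred from their names, since the text does not describe them.

```python
from operator import add, sub, mul, truediv

VECTOR_VECTOR = {"VA": add, "VS": sub, "VMV": mul, "VD": truediv}
VECTOR_SCALAR = {"VAS": add, "VMS": mul}
SCALAR_VECTOR = {"SSV": sub, "SDV": truediv}  # "SDV" is a made-up mnemonic


def read(spad, addr, length):
    if addr < 0 or addr + length > len(spad):
        raise IndexError("vector lies outside the scratchpad")
    return spad[addr:addr + length]


def execute(op, spad, sregs, dst, length, src0, src1=None, sreg=None):
    """Run one instruction and write the result back to the scratchpad."""
    v0 = read(spad, src0, length)
    if op in VECTOR_VECTOR:
        v1 = read(spad, src1, length)
        out = [VECTOR_VECTOR[op](x, y) for x, y in zip(v0, v1)]
    elif op in VECTOR_SCALAR:
        s = sregs[sreg]
        out = [VECTOR_SCALAR[op](x, s) for x in v0]
    elif op in SCALAR_VECTOR:
        s = sregs[sreg]
        out = [SCALAR_VECTOR[op](s, x) for x in v0]  # scalar is the left operand
    elif op == "VMOVE":
        out = list(v0)
    else:
        raise ValueError("unsupported instruction: " + op)
    read(spad, dst, length)  # bounds check on the destination range
    spad[dst:dst + length] = out
    return out


spad = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0] + [0.0] * 8
sregs = [100.0, 12.0]
va_result = execute("VA", spad, sregs, dst=8, length=4, src0=0, src1=4)
ssv_result = execute("SSV", spad, sregs, dst=12, length=4, src0=0, sreg=0)
```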
  • FIG. 3 is a schematic structural diagram of a vector arithmetic device according to an embodiment of the present invention.
  • The device includes an instruction fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a storage queue, a reorder buffer, a vector arithmetic unit, a scratchpad memory, and an IO memory access module:
  • The instruction fetch module, which is responsible for fetching the next instruction to be executed from the instruction sequence and passing it to the decoding module;
  • The decoding module, which is responsible for decoding the instruction and passing the decoded instruction to the instruction queue;
  • The instruction queue, which buffers the decoded instructions and issues them once their dependencies are satisfied, since different instructions may have dependencies on the scalar registers they use;
  • The scalar register file, which provides the scalar registers the device requires during operation;
  • The dependency processing unit, which handles storage dependencies that may exist between an instruction and the preceding instruction.
  • Vector arithmetic instructions access the scratchpad memory, and successive instructions may access the same block of storage space.
  • To ensure that the execution result is correct, if the current instruction is found to have a data dependency on a preceding instruction, it must wait in the storage queue until the dependency is eliminated.
  • The storage queue, which is an ordered queue; instructions that have a data dependency on a preceding instruction are held in this queue until the dependency is eliminated;
  • The reorder buffer, in which instructions are also cached while they are being executed.
  • The instruction is then committed. Once committed, the changes that the instruction's operations made to the device state cannot be undone;
  • The vector arithmetic unit, which is responsible for all vector arithmetic in the device; vector arithmetic instructions are sent to this unit for execution;
  • The scratchpad memory, which is a temporary storage device dedicated to vector data and can support vector data of different sizes;
  • The IO memory access module, which accesses the scratchpad memory directly and is responsible for reading data from, or writing data to, the scratchpad memory.
  • VA: vector addition instruction.
  • The fetch module fetches the vector addition instruction (VA) and sends it to the decoding module.
  • The decoding module decodes the instruction and sends the vector addition instruction (VA) to the instruction queue.
  • In the instruction queue, the vector addition instruction (VA) obtains from the scalar register file the data in the scalar registers that correspond to the four operand fields of the instruction: the start address of vector vin0, the length of vector vin0, the start address of vector vin1, and the length of vector vin1.
  • The instruction is then sent to the dependency processing unit, which analyzes whether it has a data dependency on any preceding instruction that has not finished executing. If it does, the instruction waits in the storage queue until the dependency no longer exists.
  • Once no dependency remains, the vector addition instruction (VA) is sent to the vector four-fundamental-rule operation unit, which fetches the required vectors from the scratchpad memory according to their addresses and lengths and then performs the vector addition.
  • FIG. 5 is a schematic structural diagram of the vector four-fundamental-rule operation unit according to an embodiment of the present invention.
  • The vector four-fundamental-rule operation unit contains the vector arithmetic components (addition, subtraction, multiplication, and division), among others.
  • The vector four-fundamental-rule operation unit has a multi-stage pipeline structure.
  • The vector addition component and the vector subtraction component are in pipeline stage 1.
  • The vector multiplication component and the vector division component are in pipeline stage 2. Because these components are in different pipeline stages, when the order of a series of consecutive vector four-fundamental-rule operation instructions matches the order of the pipeline stages of the corresponding components, the operations required by the series can be carried out more efficiently.

Abstract

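A device and method for performing vector four-fundamental-rule operations, used together with a corresponding instruction set. The device comprises a storage unit, a register unit, and a vector four-fundamental-rule operation unit. The storage unit stores vectors, and the register unit stores the addresses at which the vectors are stored. According to the accompanying instruction, the vector four-fundamental-rule operation unit obtains a vector address from the register unit, then obtains the corresponding vector from the storage unit according to that address, and then performs the vector four-fundamental-rule operation on the obtained vector to produce the result. Because the vector data involved in the computation is held temporarily in a scratchpad memory, data of different widths can be supported more flexibly and effectively during the operation, which improves the execution performance of applications containing a large number of vector four-fundamental-rule operations.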

Description

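Device and method for performing vector four-fundamental-rule operations
Technical Field
The present invention relates to a device and method for vector four-fundamental-rule operations, which execute such operations efficiently and flexibly according to vector four-fundamental-rule operation instructions and address the problem that a growing number of algorithms in the computing field contain large numbers of vector four-fundamental-rule operations.
Background
Applications involving vector operations are very common in the existing computing field. Taking machine learning algorithms, the mainstream algorithms in the currently popular field of artificial intelligence, as an example, almost all existing classical algorithms contain a large number of vector four-fundamental-rule operations. A vector four-fundamental-rule operation applies one of the four operations of addition, subtraction, multiplication, and division to the corresponding components of vectors. Specifically, for two vectors a=[a1,a2,…,an] and b=[b1,b2,…,bn], vector addition is defined as a+b=[a1+b1,a2+b2,…,an+bn], vector subtraction as a-b=[a1-b1,a2-b2,…,an-bn], vector multiplication as a*b=[a1*b1,a2*b2,…,an*bn], and vector division as a/b=[a1/b1,a2/b2,…,an/bn].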
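The component-wise definitions above can be written out as a short Python sketch. This is only an illustration of the arithmetic, not of the claimed hardware; the helper name `vector_op` is an assumption introduced here.

```python
from operator import add, sub, mul, truediv


def vector_op(op, a, b):
    """Apply a binary operator to corresponding components of two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [op(x, y) for x, y in zip(a, b)]


a = [8.0, 6.0, 4.0]
b = [2.0, 3.0, 4.0]

vec_sum = vector_op(add, a, b)           # a + b, component by component
vec_diff = vector_op(sub, a, b)          # a - b
vec_product = vector_op(mul, a, b)       # a * b (component-wise, not a dot product)
vec_quotient = vector_op(truediv, a, b)  # a / b
```

Note that the multiplication here is component-wise, as in the definition above, and yields a vector rather than a scalar inner product.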
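In the prior art, one known scheme for performing vector four-fundamental-rule operations uses a general-purpose processor, which executes general-purpose instructions through a general-purpose register file and general-purpose functional units. One drawback of this approach is that a single general-purpose processor is mostly used for scalar computation and delivers low performance on vector four-fundamental-rule operations, while when multiple general-purpose processors execute in parallel, the communication among them may become a performance bottleneck. In another prior-art scheme, a graphics processing unit (GPU) performs the vector computation by executing general-purpose SIMD instructions with a general-purpose register file and general-purpose stream processing units. In this scheme, however, the GPU's on-chip cache is too small, so large-scale vector four-fundamental-rule operations require continual off-chip data transfers, and off-chip bandwidth becomes the main performance bottleneck. In yet another prior-art scheme, a specially customized vector four-fundamental-rule operation device performs the vector computation using a customized register file and customized processing units. However, existing dedicated devices of this kind are limited by the register file and cannot flexibly support vector four-fundamental-rule operations of different lengths.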
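Summary of the Invention
(1) Technical problem to be solved
The object of the present invention is to provide a device and method for vector four-fundamental-rule operations that solve the problems of the prior art, such as being limited by inter-chip communication, insufficient on-chip cache, and inflexible supported vector lengths.
(2) Technical solution
The present invention provides a vector four-fundamental-rule operation device for performing vector four-fundamental-rule operations according to vector four-fundamental-rule operation instructions, comprising:
a storage unit for storing vectors;
a register unit for storing vector addresses, where a vector address is the address at which a vector is stored in the storage unit;
a vector four-fundamental-rule operation unit for obtaining a vector four-fundamental-rule operation instruction, obtaining a vector address from the register unit according to the instruction, then obtaining the corresponding vector from the storage unit according to that address, and then performing the vector four-fundamental-rule operation on the obtained vector to produce the result.
(3) Beneficial effects
The device and method provided by the present invention hold the vector data involved in the computation temporarily in a scratchpad memory. With only a single instruction issued, the vector four-fundamental-rule operation unit can support data of different widths more flexibly and effectively and can resolve dependency problems in data storage, thereby improving the execution performance of tasks that contain a large amount of vector computation. The instructions adopted by the present invention have a compact format, which makes the instruction set easy to use and the supported vector length flexible.
The present invention can be applied in the following scenarios, among others: electronic products such as data processing equipment, robots, computers, printers, scanners, telephones, tablets, smart terminals, mobile phones, dashboard cameras, navigators, sensors, webcams, cloud servers, cameras, video cameras, projectors, watches, earphones, mobile storage, and wearable devices; vehicles such as aircraft, ships, and automobiles; household appliances such as televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods; and medical equipment such as nuclear magnetic resonance instruments, B-mode ultrasound scanners, and electrocardiographs.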
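Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of the vector four-fundamental-rule operation device provided by the present invention.
FIG. 2 is a schematic diagram of the format of the instruction set provided by the present invention.
FIG. 3 is a schematic structural diagram of the vector four-fundamental-rule operation device according to an embodiment of the present invention.
FIG. 4 is a flowchart of the vector four-fundamental-rule operation device according to an embodiment of the present invention executing a vector four-fundamental-rule operation instruction.
FIG. 5 is a schematic structural diagram of the vector four-fundamental-rule operation unit according to an embodiment of the present invention.
Detailed Description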
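The present invention provides a vector four-fundamental-rule operation device and an accompanying instruction set. The device comprises a storage unit, a register unit, and a vector four-fundamental-rule operation unit. The storage unit stores vectors, and the register unit stores the addresses at which the vectors are stored. According to a vector four-fundamental-rule operation instruction, the operation unit obtains a vector address from the register unit, then obtains the corresponding vector from the storage unit according to that address, and then performs the vector four-fundamental-rule operation on the obtained vector to produce the result. The present invention holds the vector data involved in the computation temporarily in a scratchpad memory, so that data of different widths can be supported more flexibly and effectively during the operation, improving the execution performance of tasks that contain a large amount of vector computation.
FIG. 1 is a schematic structural diagram of the vector four-fundamental-rule operation device provided by the present invention. As shown in FIG. 1, the device comprises:
a storage unit for storing vectors. In one embodiment, the storage unit may be a scratchpad memory that can support vector data of different sizes. The present invention holds the necessary computation data temporarily in the scratchpad memory, so that the device can support data of different widths more flexibly and effectively while performing vector four-fundamental-rule operations. The storage unit may be implemented with various memory devices (SRAM, eDRAM, DRAM, memristor, 3D-DRAM, non-volatile memory, and so on).
a register unit for storing vector addresses, where a vector address is the address at which a vector is stored in the storage unit. In one embodiment, the register unit may be a scalar register file that provides the scalar registers needed during operation; the scalar registers hold not only vector addresses but also scalar data. For operations between a vector and a scalar, the vector four-fundamental-rule operation unit obtains not only the vector address but also the corresponding scalar from the register unit. In addition, there are generally multiple register units, which together form a register file for storing multiple vector addresses and scalars.
a vector four-fundamental-rule operation unit for obtaining a vector four-fundamental-rule operation instruction, obtaining a vector address from the register unit according to the instruction, then obtaining the corresponding vector from the storage unit according to that address, then performing the vector four-fundamental-rule operation on the obtained vector to produce the result, and storing the result in the storage unit. The operation unit includes a vector addition component, a vector subtraction component, a vector multiplication component, and a vector division component, and has a multi-stage pipeline structure in which the addition and subtraction components are in the first pipeline stage and the multiplication and division components are in the second pipeline stage. Because these components are in different pipeline stages, when the order of a series of consecutive vector four-fundamental-rule operation instructions matches the order of the pipeline stages of the corresponding components, the operations required by the series can be carried out more efficiently.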
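The address indirection described above (register unit holds addresses, storage unit holds vectors) can be modelled with a minimal Python sketch. The class name, method names, sizes, and the use of a single shared length register are assumptions made for illustration; they are not taken from the patent.

```python
from operator import add, sub, mul, truediv

# Hypothetical operation names; the patent names the components, not these keys.
OPS = {"add": add, "sub": sub, "mul": mul, "div": truediv}


class VectorArithmeticDevice:
    def __init__(self, scratchpad_words=64, num_registers=8):
        self.scratchpad = [0.0] * scratchpad_words  # storage unit: holds vectors
        self.registers = [0] * num_registers        # register unit: addresses and scalars

    def store_vector(self, address, values):
        """Place a vector in the storage unit at the given start address."""
        self.scratchpad[address:address + len(values)] = values

    def run(self, op, addr_reg0, addr_reg1, len_reg, out_reg):
        """Read addresses from registers, fetch both vectors, compute, write back."""
        n = self.registers[len_reg]
        a0 = self.registers[addr_reg0]
        a1 = self.registers[addr_reg1]
        out = self.registers[out_reg]
        v0 = self.scratchpad[a0:a0 + n]
        v1 = self.scratchpad[a1:a1 + n]
        result = [OPS[op](x, y) for x, y in zip(v0, v1)]
        self.scratchpad[out:out + n] = result  # result goes back to the storage unit
        return result


device = VectorArithmeticDevice()
device.store_vector(0, [1.0, 2.0, 3.0])
device.store_vector(8, [4.0, 5.0, 6.0])
device.registers[0:4] = [0, 8, 3, 16]  # r0, r1: input addresses; r2: length; r3: output address
total = device.run("add", 0, 1, 2, 3)
```

Because the vector length is read from a register rather than fixed by the register width, the same `run` call handles vectors of any length that fits in the storage unit, which is the flexibility the text attributes to the scratchpad design.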
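According to one embodiment of the present invention, the vector four-fundamental-rule operation device further comprises an instruction cache unit for storing vector four-fundamental-rule operation instructions to be executed. While an instruction executes, it is also cached in the instruction cache unit. After an instruction finishes executing, if it is also the earliest uncommitted instruction in the instruction cache unit, it is committed; once it is committed, the changes its operations made to the device state cannot be undone. In one embodiment, the instruction cache unit may be a reorder buffer.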
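The commit rule above can be sketched as a small in-order-commit buffer in Python. The class and method names are hypothetical, and committing any already-finished instructions that follow the oldest one is conventional reorder-buffer behaviour assumed here rather than stated in the text.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # oldest instruction at the left end

    def issue(self, name):
        """Record an instruction when it starts executing."""
        self.entries.append({"name": name, "finished": False})

    def finish(self, name):
        """Mark an instruction as finished, then commit whatever is eligible."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["finished"] = True
        return self._commit_ready()

    def _commit_ready(self):
        # Only the oldest uncommitted instruction may commit; repeat while
        # the new head of the buffer has also finished.
        committed = []
        while self.entries and self.entries[0]["finished"]:
            committed.append(self.entries.popleft()["name"])
        return committed


rob = ReorderBuffer()
for name in ("i0", "i1", "i2"):
    rob.issue(name)

first = rob.finish("i1")   # i1 is done but i0 is older, so nothing commits
second = rob.finish("i0")  # i0 commits, and the already-finished i1 follows
third = rob.finish("i2")
```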
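According to one embodiment of the present invention, the vector four-fundamental-rule operation device further comprises an instruction processing unit for obtaining a vector four-fundamental-rule operation instruction from the instruction cache unit, processing it, and providing it to the vector four-fundamental-rule operation unit. The instruction processing unit comprises:
a fetch module for obtaining a vector four-fundamental-rule operation instruction from the instruction cache unit;
a decoding module for decoding the obtained instruction;
an instruction queue for storing decoded instructions in order. Because different instructions may depend on the same registers, the queue buffers the decoded instructions and issues an instruction once its dependencies are satisfied.
According to one embodiment of the present invention, the vector four-fundamental-rule operation device further comprises a dependency processing unit which, before the vector four-fundamental-rule operation unit obtains an instruction, determines whether that instruction and the preceding vector four-fundamental-rule operation instruction access the same vector. If so, the instruction is stored in a storage queue and is provided to the operation unit after the preceding instruction has finished executing; otherwise, the instruction is provided to the operation unit directly. Specifically, when vector four-fundamental-rule operation instructions access the scratchpad memory, successive instructions may access the same block of storage. To guarantee a correct execution result, an instruction that is found to depend on the data of a preceding instruction must wait in the storage queue until the dependency is eliminated.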
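The dependency check can be illustrated with a Python sketch that treats each instruction as a set of (start address, length) regions in the scratchpad memory. Treating any overlap between regions as a dependency, and holding later instructions behind a queued one to keep the queue ordered, are conservative assumptions made here; the names are hypothetical.

```python
from collections import deque


def ranges_overlap(start_a, len_a, start_b, len_b):
    """True if [start_a, start_a+len_a) and [start_b, start_b+len_b) intersect."""
    return start_a < start_b + len_b and start_b < start_a + len_a


class DependencyUnit:
    def __init__(self):
        self.in_flight = {}         # name -> regions of issued, unfinished instructions
        self.store_queue = deque()  # ordered queue of waiting (name, regions) pairs

    def _conflicts(self, regions):
        return any(
            ranges_overlap(s, n, s2, n2)
            for other in self.in_flight.values()
            for (s2, n2) in other
            for (s, n) in regions
        )

    def dispatch(self, name, regions):
        """Issue the instruction, or park it in the storage queue if it must wait."""
        if self.store_queue or self._conflicts(regions):
            self.store_queue.append((name, regions))
            return "queued"
        self.in_flight[name] = regions
        return "issued"

    def complete(self, name):
        """Retire an instruction and release queued ones whose conflicts are gone."""
        del self.in_flight[name]
        released = []
        while self.store_queue and not self._conflicts(self.store_queue[0][1]):
            waiting_name, regions = self.store_queue.popleft()
            self.in_flight[waiting_name] = regions
            released.append(waiting_name)
        return released


unit = DependencyUnit()
status_va = unit.dispatch("va0", [(0, 4), (8, 4), (16, 4)])    # writes 16..19
status_vm = unit.dispatch("vm1", [(16, 4), (32, 4), (40, 4)])  # reads 16..19
released = unit.complete("va0")
```

Here `vm1` reads the region that `va0` writes, so it waits in the queue until `va0` completes.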
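According to one embodiment of the present invention, the vector four-fundamental-rule operation device further comprises an input/output unit for storing vectors in the storage unit or obtaining vector four-fundamental-rule operation results from the storage unit. The input/output unit can access the storage unit directly and is responsible for reading vector data from memory or writing vector data to it.
The present invention also provides a vector four-fundamental-rule operation method for performing vector four-fundamental-rule operations according to vector four-fundamental-rule operation instructions, the method comprising:
S1, storing a vector;
S2, storing a vector address, where the vector address indicates the location at which the vector was stored in step S1;
S3, obtaining a vector four-fundamental-rule operation instruction, obtaining the vector address according to the instruction, then obtaining the stored vector according to that address, and then performing the vector four-fundamental-rule operation on the obtained vector to produce the result.
According to one embodiment of the present invention, the method further comprises, before step S3:
storing the vector four-fundamental-rule operation instruction;
obtaining the stored vector four-fundamental-rule operation instruction;
decoding the obtained vector four-fundamental-rule operation instruction;
storing the decoded vector four-fundamental-rule operation instructions in order.
According to one embodiment of the present invention, the method further comprises, before step S3:
determining whether the vector four-fundamental-rule operation instruction and the preceding vector four-fundamental-rule operation instruction access the same vector; if so, storing the instruction in a storage queue and performing step S3 after the preceding instruction has finished executing; otherwise, performing step S3 directly.
According to one embodiment of the present invention, the method further comprises storing the vector four-fundamental-rule operation result.
According to one embodiment of the present invention, step S1 comprises storing the vector in a scratchpad memory.
According to one embodiment of the present invention, the vector four-fundamental-rule operation instruction comprises an opcode and at least one operand field, where the opcode indicates the function of the vector operation instruction and the operand field indicates the data information of the vector operation instruction.
According to one embodiment of the present invention, the vector four-fundamental-rule operations comprise vector addition, vector subtraction, vector multiplication, and vector division.
According to one embodiment of the present invention, the vector operation unit has a multi-stage pipeline structure comprising a first pipeline stage and a second pipeline stage, where vector addition and vector subtraction are performed in the first pipeline stage and vector multiplication and vector division are performed in the second pipeline stage.
The instruction set of the present invention adopts a Load/Store architecture; the vector four-fundamental-rule operation unit does not operate on data in memory. The instruction set adopts a reduced instruction set architecture and provides only the most basic vector four-fundamental-rule operations; complex vector four-fundamental-rule operations are emulated by combining these simple instructions, so that instructions can execute in a single cycle at a high clock frequency. In addition, the instruction set uses fixed-length instructions, so that the device proposed by the present invention can fetch the next instruction during the decode stage of the previous instruction.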
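When the device performs a vector four-fundamental-rule operation, it fetches an instruction, decodes it, and sends it to the instruction queue for storage. According to the decoding result, it obtains the parameters of the instruction; these may be written directly in the operand fields of the instruction or read from the registers designated by the register numbers in the operand fields. The advantage of storing parameters in registers is that the instruction itself need not change: most loops can be implemented simply by using instructions to change the values in the registers, which greatly reduces the number of instructions needed to solve certain practical problems. After all operands have been obtained, the dependency processing unit determines whether the data the instruction actually needs depends on a preceding instruction, which decides whether the instruction can be sent to the operation unit for execution immediately. If a dependency on earlier data is found, the instruction must wait until the instruction it depends on has finished executing before it can be sent to the operation unit. In the customized operation unit the instruction executes quickly, and the result, that is, the generated vector four-fundamental-rule operation result, is written back to the address provided by the instruction, which completes the execution of the instruction.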
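The benefit of keeping parameters in registers can be shown with a short Python sketch: the same hypothetical vector-add instruction is executed twice, and only the register contents change between iterations. The register assignment (r0 and r1 for the input addresses, r2 for the output address, r3 for the length) is an assumption for illustration.

```python
def vector_add(scratchpad, regs):
    """Hypothetical VA whose operand fields name registers r0..r3."""
    src0, src1, dst, n = regs[0], regs[1], regs[2], regs[3]
    for i in range(n):
        scratchpad[dst + i] = scratchpad[src0 + i] + scratchpad[src1 + i]


# Layout (assumed): vector a at 0..7, vector b at 8..15, output at 16..23.
scratchpad = list(range(8)) + [10] * 8 + [0] * 8
regs = [0, 8, 16, 4]  # process 4 elements per instruction

for _ in range(2):
    vector_add(scratchpad, regs)  # the instruction itself never changes
    regs[0] += 4                  # only the register values are updated
    regs[1] += 4
    regs[2] += 4

output = scratchpad[16:24]
```

The loop body contains one unchanged vector instruction plus register updates, which is the instruction-count saving the paragraph above describes.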
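FIG. 2 is a schematic diagram of the format of the instruction set provided by the present invention. As shown in FIG. 2, a vector four-fundamental-rule operation instruction comprises one opcode and multiple operand fields. The opcode indicates the function of the instruction, such as addition, subtraction, multiplication, or division, and the vector four-fundamental-rule operation unit performs the operation by recognizing the opcode. The operand fields indicate the data information of the instruction, which may be an immediate value or a register number. For example, to obtain a vector, the vector start address and vector length can be read from the register designated by the register number, and the vector stored at that address can then be obtained from the storage unit according to the start address and length.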
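A minimal Python sketch of this instruction format follows. The patent does not give field widths or an encoding, so the `Operand` and `Instruction` types and the `resolve` helper are hypothetical; they only show how an operand field that holds a register number is resolved differently from one that holds an immediate value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Operand:
    kind: str   # "imm" for an immediate value, "reg" for a register number
    value: int


@dataclass(frozen=True)
class Instruction:
    opcode: str                   # function of the instruction, e.g. "VA"
    operands: Tuple[Operand, ...]


def resolve(instruction: Instruction, register_file: List[int]) -> List[int]:
    """Turn each operand field into a concrete value."""
    return [
        operand.value if operand.kind == "imm" else register_file[operand.value]
        for operand in instruction.operands
    ]


register_file = [0, 128, 16, 256]
va = Instruction(
    "VA",
    (
        Operand("reg", 1),   # start address of the first vector, held in r1
        Operand("reg", 2),   # vector length, held in r2
        Operand("reg", 3),   # start address of the second vector, held in r3
        Operand("imm", 16),  # a length written directly in the instruction
    ),
)
parameters = resolve(va, register_file)
```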
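The instruction set contains vector four-fundamental-rule operation instructions with different functions:
Vector addition instruction (VA). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, performs the addition in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector-add-scalar instruction (VAS). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, fetches scalar data from a specified address of the scalar register file, adds the scalar value to each element of the vector in the scalar operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector subtraction instruction (VS). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, performs the subtraction in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Scalar-subtract-vector instruction (SSV). According to this instruction, the device fetches scalar data from a specified address of the scalar register file, fetches vector data from a specified address of the scratchpad memory, subtracts each element of the vector from the scalar in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector multiplication instruction (VMV). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, multiplies the two vectors element by element in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector-multiply-scalar instruction (VMS). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, fetches scalar data of a specified size from a specified address of the scalar register file, performs the vector-by-scalar multiplication in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Vector division instruction (VD). According to this instruction, the device fetches two blocks of vector data of specified sizes from specified addresses of the scratchpad memory, divides the two vectors element by element in the vector operation unit, and writes the result back to a specified address of the scratchpad memory;
Scalar-divide-vector instruction (SDV). According to this instruction, the device fetches scalar data from a specified location of the scalar register file, fetches vector data of a specified size from a specified location of the scratchpad memory, divides the scalar by each element of the vector in the vector operation unit, and writes the result back to a specified location of the scratchpad memory;
Vector retrieval instruction (VR). According to this instruction, the device fetches vector data of a specified size from a specified address of the scratchpad memory, extracts the element at a specified position of the vector in the vector operation unit as the output, and writes the result back to a specified address of the scalar register file;
Vector load instruction (VLOAD). According to this instruction, the device loads vector data of a specified size from a specified external source address into a specified address of the scratchpad memory;
Vector store instruction (VS). According to this instruction, the device stores vector data of a specified size at a specified address of the scratchpad memory to an external destination address;
Vector move instruction (VMOVE). According to this instruction, the device copies vector data of a specified size at a specified address of the scratchpad memory to another specified address of the scratchpad memory.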
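The arithmetic performed by the computational instructions in this list can be summarized in a Python sketch. The functions below model only the element-wise semantics on plain lists; fetching operands from the scratchpad memory and scalar register file, and writing results back, are left out. The function names simply mirror the mnemonics and are not an API defined by the patent.

```python
def va(v0, v1):
    """VA: element-wise vector addition."""
    return [x + y for x, y in zip(v0, v1)]


def vas(v, s):
    """VAS: add a scalar to every element of a vector."""
    return [x + s for x in v]


def vs(v0, v1):
    """VS: element-wise vector subtraction."""
    return [x - y for x, y in zip(v0, v1)]


def ssv(s, v):
    """SSV: subtract each vector element from a scalar."""
    return [s - x for x in v]


def vmv(v0, v1):
    """VMV: element-wise vector multiplication."""
    return [x * y for x, y in zip(v0, v1)]


def vms(v, s):
    """VMS: multiply every element of a vector by a scalar."""
    return [x * s for x in v]


def vd(v0, v1):
    """VD: element-wise vector division."""
    return [x / y for x, y in zip(v0, v1)]


def sdv(s, v):
    """SDV: divide a scalar by each element of a vector."""
    return [s / x for x in v]


def vr(v, index):
    """VR: retrieve the element at a given position of a vector."""
    return v[index]
```

Note the operand order in SSV and SDV: the scalar is the left-hand operand, so `ssv(10, [1, 2])` computes `10 - 1` and `10 - 2`.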
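To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
FIG. 3 is a schematic structural diagram of the vector four-fundamental-rule operation device according to an embodiment of the present invention. As shown in FIG. 3, the device comprises a fetch module, a decoding module, an instruction queue, a scalar register file, a dependency processing unit, a storage queue, a reorder buffer, a vector four-fundamental-rule operation unit, a scratchpad memory, and an IO memory access module:
Fetch module: fetches the next instruction to be executed from the instruction sequence and passes it to the decoding module;
Decoding module: decodes the instruction and passes the decoded instruction to the instruction queue;
Instruction queue: buffers decoded instructions, since different instructions may depend on the same scalar registers, and issues an instruction once its dependencies are satisfied;
Scalar register file: provides the scalar registers the device needs during operation;
Dependency processing unit: handles storage dependencies that may exist between an instruction and the preceding instruction. Vector four-fundamental-rule operation instructions access the scratchpad memory, and successive instructions may access the same block of storage. To guarantee a correct execution result, an instruction that is found to depend on the data of a preceding instruction must wait in the storage queue until the dependency is eliminated.
Storage queue: an ordered queue in which instructions that have data dependencies on preceding instructions are held until the dependencies are eliminated;
Reorder buffer: an instruction is also cached in this module while it executes. After an instruction finishes executing, if it is also the earliest uncommitted instruction in the reorder buffer, it is committed. Once it is committed, the changes its operations made to the device state cannot be undone;
Vector four-fundamental-rule operation unit: responsible for all vector four-fundamental-rule operations of the device; vector four-fundamental-rule operation instructions are sent to this unit for execution;
Scratchpad memory: a temporary storage device dedicated to vector data, which can support vector data of different sizes;
IO memory access module: directly accesses the scratchpad memory and is responsible for reading data from it or writing data to it.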
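FIG. 4 is a flowchart of the vector four-fundamental-rule operation device according to an embodiment of the present invention executing a vector addition instruction (VA). As shown in FIG. 4, the process of executing the vector addition instruction (VA) comprises:
S1, the fetch module fetches the vector addition instruction (VA) and sends it to the decoding module.
S2, the decoding module decodes the instruction and sends the vector addition instruction (VA) to the instruction queue.
S3, in the instruction queue, the vector addition instruction (VA) obtains from the scalar register file the data in the scalar registers that correspond to the four operand fields of the instruction: the start address of vector vin0, the length of vector vin0, the start address of vector vin1, and the length of vector vin1.
S4, after the required scalar data has been obtained, the instruction is sent to the dependency processing unit, which analyzes whether the instruction has a data dependency on any preceding instruction that has not finished executing. If it does, the instruction waits in the storage queue until the dependency no longer exists.
S5, once no dependency remains, the vector addition instruction (VA) is sent to the vector four-fundamental-rule operation unit, which fetches the required vectors from the scratchpad memory according to their addresses and lengths and then completes the vector addition.
S6, after the operation completes, the result is written back to the specified address of the scratchpad memory, and the instruction is committed in the reorder buffer.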
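Steps S1 to S6 can be traced with a small Python sketch. It is a software walk-through of the flow, not a model of the hardware timing. The fifth operand field for the output address, the `pending_writes` argument standing in for unfinished earlier instructions, and all names are assumptions introduced for illustration.

```python
def execute_va(instruction, scalar_regs, scratchpad, pending_writes=()):
    """Walk one VA instruction through S1..S6 and record each step."""
    trace = ["S1 fetch"]

    opcode, reg_ids = instruction["opcode"], instruction["fields"]
    if opcode != "VA":
        raise ValueError("this sketch only handles VA")
    trace.append("S2 decode")

    # S3: operand fields name scalar registers holding addresses and lengths.
    # The fifth field (output address) is a hypothetical addition.
    a0, n0, a1, n1, out = (scalar_regs[r] for r in reg_ids)
    trace.append("S3 read scalar registers")

    # S4: wait if an unfinished earlier instruction still writes our inputs.
    if any(a0 <= p < a0 + n0 or a1 <= p < a1 + n1 for p in pending_writes):
        trace.append("S4 wait in storage queue")
        return trace, None
    trace.append("S4 no dependency")

    # S5: fetch both vectors from the scratchpad and add them.
    result = [x + y for x, y in zip(scratchpad[a0:a0 + n0], scratchpad[a1:a1 + n1])]
    trace.append("S5 vector addition")

    # S6: write the result back and commit.
    scratchpad[out:out + len(result)] = result
    trace.append("S6 write back and commit")
    return trace, result


scratchpad = [1, 2, 3, 4, 5, 6, 0, 0, 0]
scalar_regs = [0, 3, 3, 3, 6]  # vin0 at 0 (len 3), vin1 at 3 (len 3), output at 6
instruction = {"opcode": "VA", "fields": [0, 1, 2, 3, 4]}

trace, result = execute_va(instruction, scalar_regs, scratchpad)
blocked_trace, blocked_result = execute_va(
    instruction, scalar_regs, scratchpad, pending_writes=(4,)
)
```

In the second call, address 4 lies inside vin1 and is marked as still being written, so the instruction stops at S4 and produces no result.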
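FIG. 5 is a schematic structural diagram of the vector four-fundamental-rule operation unit according to an embodiment of the present invention. As shown in FIG. 5, the vector four-fundamental-rule operation unit contains the vector arithmetic components, among others, and has a multi-stage pipeline structure.
The vector addition component and the vector subtraction component are in pipeline stage 1, and the vector multiplication component and the vector division component are in pipeline stage 2. Because these components are in different pipeline stages, when the order of a series of consecutive vector four-fundamental-rule operation instructions matches the order of the pipeline stages of the corresponding components, the operations required by the series can be carried out more efficiently.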
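The stage assignment above can be captured in a short Python sketch that checks whether a sequence of operations follows the pipeline stage order. The text does not specify how the hardware exploits a matching order, so this sketch only encodes the stage mapping and the ordering condition; the names are hypothetical.

```python
# Stage assignment taken from the description: add/sub in stage 1, mul/div in stage 2.
PIPELINE_STAGE = {"add": 1, "sub": 1, "mul": 2, "div": 2}


def follows_stage_order(operations):
    """True if each operation's stage is no earlier than its predecessor's."""
    stages = [PIPELINE_STAGE[op] for op in operations]
    return all(earlier <= later for earlier, later in zip(stages, stages[1:]))


in_order = follows_stage_order(["add", "mul"])      # stage 1 then stage 2
out_of_order = follows_stage_order(["mul", "add"])  # stage 2 then stage 1
```

Under this reading, an addition followed by a multiplication matches the stage order, while a multiplication followed by an addition does not.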
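The specific embodiments described above further explain the objects, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.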

Claims (20)

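  1. A vector four-fundamental-rule operation device for performing vector four-fundamental-rule operations according to vector four-fundamental-rule operation instructions, comprising:
    a storage unit for storing vectors;
    a register unit for storing vector addresses, wherein a vector address is the address at which a vector is stored in the storage unit;
    a vector four-fundamental-rule operation unit for obtaining a vector four-fundamental-rule operation instruction, obtaining a vector address from the register unit according to the instruction, then obtaining the corresponding vector from the storage unit according to the vector address, and then performing the vector four-fundamental-rule operation on the obtained vector to obtain a vector four-fundamental-rule operation result.
  2. The vector four-fundamental-rule operation device according to claim 1, further comprising: an instruction cache unit for storing vector four-fundamental-rule operation instructions to be executed.
  3. The vector four-fundamental-rule operation device according to claim 2, further comprising: an instruction processing unit for obtaining a vector four-fundamental-rule operation instruction from the instruction cache unit, processing the instruction, and providing it to the vector four-fundamental-rule operation unit.
  4. The vector four-fundamental-rule operation device according to claim 3, wherein the instruction processing unit comprises:
    a fetch module for obtaining a vector four-fundamental-rule operation instruction from the instruction cache unit;
    a decoding module for decoding the obtained vector four-fundamental-rule operation instruction;
    an instruction queue for storing the decoded vector four-fundamental-rule operation instructions in order.
  5. The vector four-fundamental-rule operation device according to claim 1, further comprising:
    a dependency processing unit for determining, before the vector four-fundamental-rule operation unit obtains a vector four-fundamental-rule operation instruction, whether the instruction and the preceding vector four-fundamental-rule operation instruction access the same vector; if so, storing the instruction in a storage queue and, after the preceding instruction has finished executing, providing the instruction in the storage queue to the vector four-fundamental-rule operation unit; otherwise, providing the instruction directly to the vector four-fundamental-rule operation unit.
  6. The vector four-fundamental-rule operation device according to claim 1, wherein the storage unit is further used for storing the vector four-fundamental-rule operation result.
  7. The vector four-fundamental-rule operation device according to claim 6, further comprising:
    an input/output unit for storing vectors in the storage unit, or obtaining vector four-fundamental-rule operation results from the storage unit.
  8. The vector four-fundamental-rule operation device according to claim 1, wherein the storage unit is a scratchpad memory.
  9. The vector operation device according to claim 1, wherein the vector four-fundamental-rule operation instruction comprises an opcode and at least one operand field, wherein the opcode indicates the function of the vector operation instruction and the operand field indicates the data information of the vector operation instruction.
  10. The vector operation device according to claim 9, wherein the data information is a register unit number, and the vector four-fundamental-rule operation unit accesses the corresponding register unit according to the register unit number and obtains the vector address.
  11. The vector four-fundamental-rule operation device according to claim 1, wherein the vector four-fundamental-rule operation unit comprises a vector addition component, a vector subtraction component, a vector multiplication component, and a vector division component.
  12. The vector operation device according to claim 11, wherein the vector operation unit has a multi-stage pipeline structure comprising a first pipeline stage and a second pipeline stage, wherein the vector addition component and the vector subtraction component are in the first pipeline stage, and the vector multiplication component and the vector division component are in the second pipeline stage.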
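  13. A vector four-fundamental-rule operation method for performing vector four-fundamental-rule operations according to vector four-fundamental-rule operation instructions, the method comprising:
    S1, storing a vector;
    S2, storing a vector address;
    S3, obtaining a vector four-fundamental-rule operation instruction, obtaining the vector address according to the instruction, then obtaining the stored vector according to the vector address, and then performing the vector four-fundamental-rule operation on the obtained vector to obtain a vector four-fundamental-rule operation result.
  14. The vector four-fundamental-rule operation method according to claim 13, further comprising, before step S3:
    storing the vector four-fundamental-rule operation instruction;
    obtaining the stored vector four-fundamental-rule operation instruction;
    decoding the obtained vector four-fundamental-rule operation instruction;
    storing the decoded vector four-fundamental-rule operation instructions in order.
  15. The vector four-fundamental-rule operation method according to claim 13, further comprising, before step S3:
    determining whether the vector four-fundamental-rule operation instruction and the preceding vector four-fundamental-rule operation instruction access the same vector; if so, storing the instruction in a storage queue and performing step S3 after the preceding instruction has finished executing; otherwise, performing step S3 directly.
  16. The vector four-fundamental-rule operation method according to claim 13, further comprising storing the vector four-fundamental-rule operation result.
  17. The vector four-fundamental-rule operation method according to claim 13, wherein step S1 comprises storing the vector in a scratchpad memory.
  18. The vector four-fundamental-rule operation method according to claim 13, wherein the vector four-fundamental-rule operation instruction comprises an opcode and at least one operand field, wherein the opcode indicates the function of the vector operation instruction and the operand field indicates the data information of the vector operation instruction.
  19. The vector four-fundamental-rule operation method according to claim 13, wherein the vector four-fundamental-rule operations comprise vector addition, vector subtraction, vector multiplication, and vector division.
  20. The vector four-fundamental-rule operation method according to claim 19, wherein the vector operation unit has a multi-stage pipeline structure comprising a first pipeline stage and a second pipeline stage, wherein vector addition and vector subtraction are performed in the first pipeline stage, and vector multiplication and vector division are performed in the second pipeline stage.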
PCT/CN2016/081107 2016-04-26 2016-05-05 Device and method for performing vector four-fundamental-rule operations WO2017185392A1 (zh)

Priority Applications (11)

Application Number Priority Date Filing Date Title
EP16899903.5A EP3451185A4 (en) 2016-04-26 2016-05-05 DEVICE AND METHOD FOR CARRYING OUT FOUR BASIC OPERATIONS OF THE ARITHMETIC OF VECTORS
EP21154589.2A EP3832500B1 (en) 2016-04-26 2016-05-05 Device and method for performing vector four-fundamental-rule operation
US16/172,566 US20190073339A1 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,432 US10997276B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,592 US10585973B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,657 US11341211B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,515 US10592582B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,653 US11507640B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,629 US11100192B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,649 US11436301B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,533 US10599745B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610266989.X 2016-04-26
CN201610266989.XA CN107315717B (zh) 2016-04-26 2016-04-26 Device and method for performing vector four-fundamental-rule operations

Related Child Applications (9)

Application Number Title Priority Date Filing Date
US16/172,533 Continuation-In-Part US10599745B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,515 Continuation-In-Part US10592582B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,649 Continuation-In-Part US11436301B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,432 Continuation-In-Part US10997276B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,629 Continuation-In-Part US11100192B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,657 Continuation-In-Part US11341211B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,592 Continuation-In-Part US10585973B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,653 Continuation-In-Part US11507640B2 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations
US16/172,566 Continuation-In-Part US20190073339A1 (en) 2016-04-26 2018-10-26 Apparatus and methods for vector operations

Publications (1)

Publication Number Publication Date
WO2017185392A1 true WO2017185392A1 (zh) 2017-11-02

Family

ID=60161696

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/081107 WO2017185392A1 (zh) 2016-04-26 2016-05-05 Device and method for performing vector four-fundamental-rule operations

Country Status (4)

Country Link
US (9) US11100192B2 (zh)
EP (2) EP3451185A4 (zh)
CN (2) CN111651203A (zh)
WO (1) WO2017185392A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065193A1 (en) * 2016-04-26 2019-02-28 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754062A (zh) * 2017-11-07 2019-05-14 上海寒武纪信息科技有限公司 卷积扩展指令的执行方法以及相关产品
CN111353124A (zh) * 2018-12-20 2020-06-30 上海寒武纪信息科技有限公司 运算方法、装置、计算机设备和存储介质
US11056098B1 (en) 2018-11-28 2021-07-06 Amazon Technologies, Inc. Silent phonemes for tracking end of speech
CN111399905B (zh) * 2019-01-02 2022-08-16 上海寒武纪信息科技有限公司 运算方法、装置及相关产品
US10997116B2 (en) * 2019-08-06 2021-05-04 Microsoft Technology Licensing, Llc Tensor-based hardware accelerator including a scalar-processing unit
US11663056B2 (en) * 2019-12-20 2023-05-30 Intel Corporation Unified programming interface for regrained tile execution

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1349159A (zh) * 2001-11-28 2002-05-15 中国人民解放军国防科学技术大学 微处理器向量处理方法
US20030037221A1 (en) * 2001-08-14 2003-02-20 International Business Machines Corporation Processor implementation having unified scalar and SIMD datapath
CN101847093A (zh) * 2010-04-28 2010-09-29 中国科学院自动化研究所 具有可重构低功耗数据交织网络的数字信号处理器
CN102629238A (zh) * 2012-03-01 2012-08-08 中国人民解放军国防科学技术大学 支持向量条件访存的方法和装置
CN103699360A (zh) * 2012-09-27 2014-04-02 北京中科晶上科技有限公司 一种向量处理器及其进行向量数据存取、交互的方法
CN104407997A (zh) * 2014-12-18 2015-03-11 中国人民解放军国防科学技术大学 带有指令动态调度功能的与非型闪存单通道同步控制器

Family Cites Families (58)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4945479A (en) * 1985-07-31 1990-07-31 Unisys Corporation Tightly coupled scientific processing system
JPH0622035B2 (ja) * 1985-11-13 1994-03-23 株式会社日立製作所 ベクトル処理装置
US4888679A (en) * 1988-01-11 1989-12-19 Digital Equipment Corporation Method and apparatus using a cache and main memory for both vector processing and scalar processing by prefetching cache blocks including vector data elements
US5261113A (en) * 1988-01-25 1993-11-09 Digital Equipment Corporation Apparatus and method for single operand register array for vector and scalar data processing operations
US5197130A (en) * 1989-12-29 1993-03-23 Supercomputer Systems Limited Partnership Cluster architecture for a highly parallel scalar/vector multiprocessor system
WO1992020029A1 (en) * 1991-04-29 1992-11-12 Intel Corporation Neural network incorporating difference neurons
KR0132894B1 (ko) * 1992-03-13 1998-10-01 강진구 영상압축부호화 및 복호화 방법과 그 장치
US5717947A (en) * 1993-03-31 1998-02-10 Motorola, Inc. Data processing system and method thereof
US5402369A (en) * 1993-07-06 1995-03-28 The 3Do Company Method and apparatus for digital multiplication based on sums and differences of finite sets of powers of two
US6385634B1 (en) * 1995-08-31 2002-05-07 Intel Corporation Method for performing multiply-add operations on packed data
US5864690A (en) * 1997-07-30 1999-01-26 Integrated Device Technology, Inc. Apparatus and method for register specific fill-in of register generic micro instructions within an instruction queue
US6282634B1 (en) * 1998-05-27 2001-08-28 Arm Limited Apparatus and method for processing data having a mixed vector/scalar register file
US6295597B1 (en) * 1998-08-11 2001-09-25 Cray, Inc. Apparatus and method for improved vector processing to support extended-length integer arithmetic
US6192384B1 (en) * 1998-09-14 2001-02-20 The Board Of Trustees Of The Leland Stanford Junior University System and method for performing compound vector operations
US7100026B2 (en) * 2001-05-30 2006-08-29 The Massachusetts Institute Of Technology System and method for performing efficient conditional vector operations for data parallel architectures involving both input and conditional vector values
US20030128660A1 (en) * 2002-01-09 2003-07-10 Atsushi Ito OFDM communications apparatus, OFDM communications method, and OFDM communications program
US20030221086A1 (en) * 2002-02-13 2003-11-27 Simovich Slobodan A. Configurable stream processor apparatus and methods
CN2563635Y (zh) * 2002-05-24 2003-07-30 倚天资讯股份有限公司 提供四则运算训练的电子装置
JP4339245B2 (ja) * 2002-05-24 2009-10-07 エヌエックスピー ビー ヴィ スカラー/ベクトルプロセッサ
US20040193838A1 (en) * 2003-03-31 2004-09-30 Patrick Devaney Vector instructions composed from scalar instructions
US7096345B1 (en) * 2003-09-26 2006-08-22 Marvell International Ltd. Data processing system with bypass reorder buffer having non-bypassable locations and combined load/store arithmetic logic unit and processing method thereof
US20050226337A1 (en) * 2004-03-31 2005-10-13 Mikhail Dorojevets 2D block processing architecture
US7873812B1 (en) * 2004-04-05 2011-01-18 Tibet MIMAR Method and system for efficient matrix multiplication in a SIMD processor architecture
US7388588B2 (en) * 2004-09-09 2008-06-17 International Business Machines Corporation Programmable graphics processing engine
US20060259737A1 (en) * 2005-05-10 2006-11-16 Telairity Semiconductor, Inc. Vector processor with special purpose registers and high speed memory access
US7299342B2 (en) * 2005-05-24 2007-11-20 Coresonic Ab Complex vector executing clustered SIMD micro-architecture DSP with accelerator coupled complex ALU paths each further including short multiplier/accumulator using two's complement
WO2007081234A1 (fr) * 2006-01-12 2007-07-19 Otkrytoe Aktsionernoe Obschestvo 'bineuro' Procede de codage de la semantique de documents-textes
CN1829138B (zh) * 2006-04-07 2010-04-07 清华大学 自适应多输入多输出发送接收系统及其方法
US8099583B2 (en) * 2006-08-23 2012-01-17 Axis Semiconductor, Inc. Method of and apparatus and architecture for real time signal processing by switch-controlled programmable processor configuring and flexible pipeline and parallel processing
US20080154816A1 (en) * 2006-10-31 2008-06-26 Motorola, Inc. Artificial neural network with adaptable infinite-logic nodes
US8923510B2 (en) * 2007-12-28 2014-12-30 Intel Corporation Method and apparatus for efficiently implementing the advanced encryption standard
CN102047219A (zh) * 2008-05-30 2011-05-04 Nxp股份有限公司 矢量处理的方法
US20100149215A1 (en) * 2008-12-15 2010-06-17 Personal Web Systems, Inc. Media Action Script Acceleration Apparatus, System and Method
CN101763240A (zh) * 2008-12-25 2010-06-30 上海华虹集成电路有限责任公司 基于ucps协议的快速模乘方法及硬件实现方法
JP5573134B2 (ja) * 2009-12-04 2014-08-20 日本電気株式会社 ベクトル型計算機及びベクトル型計算機の命令制御方法
US8627044B2 (en) * 2010-10-06 2014-01-07 Oracle International Corporation Issuing instructions with unresolved data dependencies
US9122485B2 (en) * 2011-01-21 2015-09-01 Apple Inc. Predicting a result of a dependency-checking instruction when processing vector instructions
GB2489914B (en) * 2011-04-04 2019-12-18 Advanced Risc Mach Ltd A data processing apparatus and method for performing vector operations
CN102156637A (zh) * 2011-05-04 2011-08-17 中国人民解放军国防科学技术大学 向量交叉多线程处理方法及向量交叉多线程微处理器
CN102262525B (zh) * 2011-08-29 2014-11-19 孙瑞玮 基于矢量运算的矢量浮点运算装置及方法
CN103765837B (zh) * 2012-08-10 2016-08-10 华为技术有限公司 多cpu的报文处理方法及系统、交换单元、单板
US20140169468A1 (en) * 2012-12-17 2014-06-19 Lsi Corporation Picture refresh with constant-bit budget
US9389854B2 (en) * 2013-03-15 2016-07-12 Qualcomm Incorporated Add-compare-select instruction
US20140289498A1 (en) * 2013-03-19 2014-09-25 Apple Inc. Enhanced macroscalar vector operations
US9594983B2 (en) * 2013-08-02 2017-03-14 Digimarc Corporation Learning systems and methods
US9880845B2 (en) * 2013-11-15 2018-01-30 Qualcomm Incorporated Vector processing engines (VPEs) employing format conversion circuitry in data flow paths between vector data memory and execution units to provide in-flight format-converting of input vector data to execution units for vector processing operations, and related vector processor systems and methods
CN103744352A (zh) * 2013-12-23 2014-04-23 华中科技大学 一种基于fpga的三次b样条曲线的硬件插补器
US9846836B2 (en) * 2014-06-13 2017-12-19 Microsoft Technology Licensing, Llc Modeling interestingness with deep neural networks
US9785565B2 (en) * 2014-06-30 2017-10-10 Microunity Systems Engineering, Inc. System and methods for expandably wide processor instructions
US11544214B2 (en) * 2015-02-02 2023-01-03 Optimum Semiconductor Technologies, Inc. Monolithic vector processor configured to operate on variable length vectors using a vector length register
CN104699465B (zh) * 2015-03-26 2017-05-24 中国人民解放军国防科学技术大学 向量处理器中支持simt的向量访存装置和控制方法
CN105005465B (zh) * 2015-06-12 2017-06-16 北京理工大学 基于比特或字节并行加速的处理器
GB2540943B (en) * 2015-07-31 2018-04-11 Advanced Risc Mach Ltd Vector arithmetic instruction
US10586168B2 (en) * 2015-10-08 2020-03-10 Facebook, Inc. Deep translations
CN111651203A (zh) * 2016-04-26 2020-09-11 中科寒武纪科技股份有限公司 一种用于执行向量四则运算的装置和方法
US10366163B2 (en) * 2016-09-07 2019-07-30 Microsoft Technology Licensing, Llc Knowledge-guided structural attention processing
US10916135B2 (en) * 2018-01-13 2021-02-09 Toyota Jidosha Kabushiki Kaisha Similarity learning and association between observations of multiple connected vehicles
US11630952B2 (en) * 2019-07-22 2023-04-18 Adobe Inc. Classifying terms from source texts using implicit and explicit class-recognition-machine-learning models

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030037221A1 (en) * 2001-08-14 2003-02-20 International Business Machines Corporation Processor implementation having unified scalar and SIMD datapath
CN1349159A (zh) * 2001-11-28 2002-05-15 中国人民解放军国防科学技术大学 微处理器向量处理方法
CN101847093A (zh) * 2010-04-28 2010-09-29 中国科学院自动化研究所 具有可重构低功耗数据交织网络的数字信号处理器
CN102629238A (zh) * 2012-03-01 2012-08-08 中国人民解放军国防科学技术大学 支持向量条件访存的方法和装置
CN103699360A (zh) * 2012-09-27 2014-04-02 北京中科晶上科技有限公司 一种向量处理器及其进行向量数据存取、交互的方法
CN104407997A (zh) * 2014-12-18 2015-03-11 中国人民解放军国防科学技术大学 带有指令动态调度功能的与非型闪存单通道同步控制器

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3451185A4 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065193A1 (en) * 2016-04-26 2019-02-28 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US20190079766A1 (en) * 2016-04-26 2019-03-14 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US20190079765A1 (en) * 2016-04-26 2019-03-14 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US10585973B2 (en) 2016-04-26 2020-03-10 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US10592582B2 (en) 2016-04-26 2020-03-17 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US10599745B2 (en) 2016-04-26 2020-03-24 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US10997276B2 (en) 2016-04-26 2021-05-04 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US11100192B2 (en) 2016-04-26 2021-08-24 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US11341211B2 (en) 2016-04-26 2022-05-24 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US11436301B2 (en) 2016-04-26 2022-09-06 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations
US11507640B2 (en) 2016-04-26 2022-11-22 Cambricon Technologies Corporation Limited Apparatus and methods for vector operations

Also Published As

Publication number Publication date
US20190065194A1 (en) 2019-02-28
EP3451185A1 (en) 2019-03-06
EP3451185A4 (en) 2019-11-20
US20190095207A1 (en) 2019-03-28
US10592582B2 (en) 2020-03-17
US20190065192A1 (en) 2019-02-28
US11436301B2 (en) 2022-09-06
US10997276B2 (en) 2021-05-04
US11341211B2 (en) 2022-05-24
US10599745B2 (en) 2020-03-24
EP3832500A1 (en) 2021-06-09
US20190095401A1 (en) 2019-03-28
US10585973B2 (en) 2020-03-10
CN107315717A (zh) 2017-11-03
US11100192B2 (en) 2021-08-24
US11507640B2 (en) 2022-11-22
EP3832500B1 (en) 2023-06-21
US20190073339A1 (en) 2019-03-07
US20190079766A1 (en) 2019-03-14
US20190095206A1 (en) 2019-03-28
CN107315717B (zh) 2020-11-03
CN111651203A (zh) 2020-09-11
US20190065193A1 (en) 2019-02-28
US20190079765A1 (en) 2019-03-14

Similar Documents

Publication Publication Date Title
WO2017185396A1 (zh) 一种用于执行矩阵加/减运算的装置和方法
CN109240746B (zh) 一种用于执行矩阵乘运算的装置和方法
KR102123633B1 (ko) 행렬 연산 장치 및 방법
WO2017124648A1 (zh) 一种向量计算装置
WO2017185392A1 (zh) 一种用于执行向量四则运算的装置和方法
WO2017185393A1 (zh) 一种用于执行向量内积运算的装置和方法
WO2017185404A1 (zh) 一种用于执行向量逻辑运算的装置及方法
WO2017185405A1 (zh) 一种用于执行向量外积运算的装置和方法
WO2017185395A1 (zh) 一种用于执行向量比较运算的装置和方法
WO2017185385A1 (zh) 一种用于执行向量合并运算的装置和方法
WO2017185384A1 (zh) 一种用于执行向量循环移位运算的装置和方法
WO2017185419A1 (zh) 一种用于执行向量最大值最小值运算的装置和方法
WO2018024094A1 (zh) 一种运算装置及其操作方法

Legal Events

Date Code Title Description
NENP Non-entry into the national phase Ref country code: DE
WWE Wipo information: entry into national phase Ref document number: 2016899903; Country of ref document: EP
121 Ep: the epo has been informed by wipo that ep was designated in this application Ref document number: 16899903; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase Ref document number: 2016899903; Country of ref document: EP; Effective date: 20181126