US20190073337A1 - Apparatuses capable of providing composite instructions in the instruction set architecture of a processor - Google Patents
Apparatuses capable of providing composite instructions in the instruction set architecture of a processor
- Publication number
- US20190073337A1 (application US16/120,645)
- Authority
- US
- United States
- Prior art keywords
- unit
- functional units
- composite
- fundamental functional
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/82—Architectures of general purpose stored program computers data or demand driven
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
- G06F9/3889—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
- G06F9/3891—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute organised in groups of units sharing resources, e.g. clusters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
- G06F9/3893—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator
- G06F9/3895—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled in tandem, e.g. multiplier-accumulator for complex operations, e.g. multidimensional or interleaved address generators, macros
Definitions
- the invention relates to a novel design to implement multiple composite instructions to support the corresponding common digital signal processing algorithms in a Vector Digital Signal Processor (VDSP).
- a VDSP is a type of efficient processor for implementing complex signal processing algorithms used in applications such as wireless/wireline communication baseband processing, multi-media signal processing, etc.
- Conventional VDSPs support general purpose instructions, such as vector load, vector store, vector arithmetic (multiply, add, accumulation, min, max, etc.), and vector permutation (shift, move, etc.).
- VDSP may have multiple lanes to support parallel processing of multiple data samples in data vectors, and multiple functional units to support parallel execution of multiple instructions.
- the software (or firmware) run on a VDSP normally needs to further support some common digital signal processing algorithms, e.g., Fast Fourier Transform (FFT), Finite Impulse Response (FIR) filtering, Correlation, etc.
- these common digital signal processing algorithms are not included in the vector Instruction Set Architecture (ISA) of the current VDSP.
- in the VDSP architecture design, a set of composite instructions configured to perform common digital signal processing algorithms, such as FFT, IFFT, FHT, FIR, correlation, etc., is implemented.
- An exemplary embodiment of an apparatus comprises a plurality of signal processing lanes and a composite instruction controller.
- Each signal processing lane comprises a first fundamental functional unit, a second fundamental functional unit, and a register file unit comprising a plurality of configurable vector registers.
- the composite instruction controller is coupled to the first fundamental functional units and the second fundamental functional units in the plurality of signal processing lanes, and is configured to issue a plurality of control signals in response to a composite instruction to control the first fundamental functional units and the second fundamental functional units and thereby carry out a composite operation.
- An exemplary embodiment of an apparatus comprises a plurality of signal processing lanes and a first composite instruction controller.
- Each signal processing lane comprises a first fundamental functional unit, a second fundamental functional unit, and a register file unit.
- the first fundamental functional unit comprises a plurality of first buffers and a first computation unit.
- the second fundamental functional unit comprises a plurality of second buffers and a second computation unit.
- the register file unit comprises a plurality of configurable vector registers.
- the first composite instruction controller is configured to issue a plurality of control signals in response to a first composite instruction to control the plurality of first buffers and the first computation unit of the first fundamental functional units and the plurality of second buffers and the second computation unit of the second fundamental functional units in the plurality of signal processing lanes and thereby carry out a first composite operation.
- FIG. 1A and FIG. 1B show an exemplary block diagram of an architecture of an apparatus capable of performing complex signal processing according to an embodiment of the invention
- FIG. 2 shows exemplary pseudo-codes to describe the operation control flow for performing the FFT operation via a 4-lane VDSP according to an embodiment of the invention
- FIG. 3 shows an exemplary block diagram of a composite functional unit capable of performing the FFT, IFFT and FHT operations according to an embodiment of the invention
- FIG. 4A is an exemplary block diagram of the butterfly unit according to an embodiment of the invention.
- FIG. 4B is an exemplary block diagram of the butterfly unit according to another embodiment of the invention.
- FIG. 5A is an exemplary block diagram of the butterfly unit according to yet another embodiment of the invention.
- FIG. 5B is an exemplary block diagram of the butterfly unit according to yet another embodiment of the invention.
- FIG. 6 shows exemplary pseudo-codes to describe the operation control flow for performing the FIR filtering without ramping operation via a 4-lane VDSP according to an embodiment of the invention
- FIG. 7 shows an exemplary block diagram of a composite functional unit capable of performing the FIR filtering with ramping and FIR filtering without ramping operations according to an embodiment of the invention
- FIG. 8A is a schematic diagram showing the operation of auto-correlation according to an embodiment of the invention.
- FIG. 8B is a schematic diagram showing the operation of cross-correlation/vector multiplication according to an embodiment of the invention.
- FIG. 9 shows an exemplary block diagram of a composite functional unit capable of performing the auto-correlation, cross-correlation and vector multiplication operations according to an embodiment of the invention.
- the software solution uses software functions or micro-codes to implement the algorithms. While the software implementation is flexible, the main drawbacks include: 1-1) the performance, in terms of maximal data throughput per second when executing the algorithms, may not be optimal due to the functional limitation of general-purpose instructions and software control overhead, and 1-2) the code size may be large.
- the co-processor solution is to implement a dedicated hardware module for each algorithm, and the dedicated hardware module is used as a co-processor to the VDSP.
- the main drawback of the co-processor solution is low utilization of hardware resources. Since each algorithm is implemented by a dedicated hardware co-processor, it is difficult to share hardware resources among different co-processors and with the VDSP.
- a novel design to support common digital signal processing algorithms in a VDSP is provided.
- in the VDSP architecture, a set of composite instructions configured to perform common digital signal processing algorithms, such as FFT, IFFT, FHT, FIR, correlation, etc., is implemented.
- the software code size can be reduced.
- better performance, in terms of higher data throughput, can be achieved.
- higher utilization of the hardware resources can be achieved.
- FIG. 1A and FIG. 1B show an exemplary block diagram of an architecture of an apparatus capable of performing complex signal processing according to an embodiment of the invention. It should be noted that in order to clarify the concept of the invention, FIG. 1A and FIG. 1B present a simplified block diagram, in which only the elements relevant to the invention are shown. However, the invention should not be limited only to what is shown in FIG. 1A and FIG. 1B .
- the apparatus 100 may be a vector digital signal processor (VDSP) that can support a plurality of complex signal processing algorithms.
- the apparatus 100 comprises a plurality of signal processing lanes, such as Lane 1, Lane 2, Lane 3 and Lane 4 as shown in FIG. 1A and FIG. 1B. Note that although four signal processing lanes are shown in FIG. 1A and FIG. 1B, the invention should not be limited thereto.
- the apparatus 100 may also comprise fewer than four or more than four signal processing lanes.
- Each signal processing lane may comprise a plurality of fundamental functional units, such as one or more adder functional units 110, one or more multiplier functional units 120, one or more accumulation functional units 130, one or more permutation functional units 140, etc.
- Each fundamental functional unit is configured to support a general-purpose instruction by carrying out a corresponding fundamental operation.
- the adder functional unit 110 is configured to carry out an addition operation in response to an add (e.g., vector-add (vAdd)) instruction.
- the multiplier functional unit 120 is configured to carry out a multiplication operation in response to a multiply (e.g., vector-multiply (vMult)) instruction.
- the accumulation functional unit 130 is configured to carry out an accumulation operation in response to an accumulate (e.g., vector-accumulate (vAcc)) instruction.
- the permutation functional unit 140 is configured to carry out a permutation operation in response to a permutation (e.g., vector-permutation (vShift) for shifting the data elements of a vector) instruction.
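The four general-purpose vector instructions named above can be pictured with a small software model. The following is a minimal sketch, not the patent's hardware: it assumes one vector element per signal processing lane, and the function names (`v_add`, `v_mult`, `v_acc`, `v_shift`) and the zero-fill behavior of the shift are illustrative assumptions.

```python
from typing import List

Vector = List[float]  # assumption: one element per signal processing lane


def v_add(a: Vector, b: Vector) -> Vector:
    """vAdd: element-wise addition across lanes."""
    return [x + y for x, y in zip(a, b)]


def v_mult(a: Vector, b: Vector) -> Vector:
    """vMult: element-wise multiplication across lanes."""
    return [x * y for x, y in zip(a, b)]


def v_acc(acc: Vector, a: Vector) -> Vector:
    """vAcc: add a new vector into a running per-lane accumulator."""
    return [s + x for s, x in zip(acc, a)]


def v_shift(a: Vector, n: int) -> Vector:
    """vShift (assumed semantics): move elements n positions up, zero-filling."""
    n = max(0, min(n, len(a)))
    return [0.0] * n + a[: len(a) - n]


lane_data = [1.0, 2.0, 3.0, 4.0]
print(v_add(lane_data, lane_data))
print(v_shift(lane_data, 1))
```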
- the apparatus 100 receives the instructions and data that have been input via a corresponding interface, and then triggers the corresponding functional units to perform the corresponding operations.
- Each fundamental functional unit may comprise a plurality of buffers and a corresponding computation unit.
- the adder functional unit 110 may comprise two input buffers for receiving two operands, a computation unit ALU for performing the addition operation and an output buffer for outputting the calculation result.
- the multiplier functional unit 120 may comprise two input buffers for receiving two operands, a computation unit MULT for performing the multiplication operation and an output buffer for outputting the calculation result.
- the accumulation functional unit 130 may comprise two input buffers for receiving two operands, a computation unit ACC for performing the accumulation operation and an output buffer for outputting the calculation result.
- the permutation functional unit 140 may comprise an input buffer for receiving input data, a computation unit PERM for performing the permutation operation and an output buffer for outputting the permutation result.
- the apparatus 100 further comprises a RAM load unit 150, a RAM store unit 160, a plurality of register file units, such as the multi-port register file units 170, a plurality of lane store units 180, a plurality of lane load units 185 and a control functional unit 190.
- the RAM load unit 150 is configured to load data from an external RAM 50 in response to a corresponding load instruction.
- the RAM store unit 160 is configured to store data (the results output by the fundamental functional units) into the external RAM 50 in response to a corresponding store instruction.
- a multi-port register file unit 170 is disposed in each signal processing lane and comprises a plurality of configurable registers and vector registers provided for the fundamental functional units in the same signal processing lane to buffer data.
- a lane store unit 180 is disposed in each signal processing lane and is configured to provide data to be stored into the external RAM 50 to the RAM store unit 160 .
- a lane load unit 185 is disposed in each signal processing lane and is configured to load data from the RAM load unit 150 .
- the control functional unit 190 is configured to perform scalar operations. In contrast to the scalar operations, the vector-wise operations can be carried out via the fundamental functional units in multiple signal processing lanes.
- a fundamental functional unit is configured to carry out a corresponding fundamental operation in response to a corresponding instruction (i.e. the general-purpose instruction).
- the fundamental functional unit may access the data stored in the registers of the multi-port register file unit 170 via the read ports, so as to load the data into the input buffer(s) thereof, perform the corresponding fundamental operation on the data, and store the result into the output buffer thereof.
- the output data may be stored into the corresponding registers of the multi-port register file unit 170 via the write ports.
- Each fundamental functional unit may further comprise a dedicated controller for controlling the operation flow.
- FIG. 1A and FIG. 1B show only a portion of hardware devices for processing the data received by the apparatus 100 .
- the apparatus 100 may further comprise an instruction fetch unit (not shown) configured to fetch input instructions, an instruction memory (not shown) configured to store the received instructions, an instruction decode unit (not shown) configured to decode the received instructions, an instruction dispatch unit (not shown) configured to dispatch each instruction to the corresponding controller of each functional unit, and other control logic for processing the received instructions.
- the apparatus 100 may further comprise one or more composite functional units, such as the composite functional unit 200 shown in FIG. 1A and FIG. 1B .
- the composite functional units may use the fundamental functional units configured in multiple signal processing lanes to carry out a corresponding composite operation.
- multiple lanes of the same fundamental functional unit may be grouped to a composite functional unit to support the common digital signal processing algorithm as discussed above.
- Each composite functional unit may comprise a composite instruction controller, such as the composite instruction controller 250 shown in FIG. 1A and FIG. 1B .
- the composite instruction controller 250 may be coupled to one or more fundamental functional units in the plurality of processing lanes.
- in response to a composite instruction that is received by the apparatus 100, decoded, and dispatched to the composite instruction controller 250, the composite instruction controller 250 is configured to issue a plurality of control signals to control the fundamental functional units to perform their corresponding operations, so as to carry out the corresponding composite operation.
- the buffers and the corresponding computation units of the fundamental functional units are shared with at least one composite functional unit, such as the composite functional unit 200 .
- the buffers and the corresponding computation units of the fundamental functional units may be further shared among multiple composite functional units.
- the vector registers, control registers, other general-purpose registers such as scalar data registers, instruction decode and dispatch pipeline of the apparatus 100 may also be shared among different functional units, including the fundamental functional units and the composite functional units. Since the hardware resources of the apparatus 100 can be shared among different functional units, including the fundamental functional units and the composite functional units, higher utilization of hardware resources can be achieved.
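The sharing of buffers and computation units between general-purpose instructions and a composite functional unit can be illustrated with a toy model. This is a hedged sketch only: the classes `FunctionalUnit`, `Lane` and `CompositeController`, and the `multiply_accumulate` operation, are hypothetical names chosen for illustration and do not correspond to a specific composite instruction of the apparatus. The point is that the controller owns no arithmetic hardware of its own; it only sequences the multiplier and accumulation units that already exist in each lane.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FunctionalUnit:
    """A fundamental functional unit: two input buffers and one output buffer."""
    in_a: float = 0.0
    in_b: float = 0.0
    out: float = 0.0


@dataclass
class Lane:
    """One signal processing lane with a multiplier and an accumulation unit."""
    mult: FunctionalUnit = field(default_factory=FunctionalUnit)
    acc: FunctionalUnit = field(default_factory=FunctionalUnit)


class CompositeController:
    """Hypothetical controller that sequences the shared units in every lane."""

    def __init__(self, lanes: List[Lane]) -> None:
        self.lanes = lanes

    def multiply_accumulate(self, a: List[float], b: List[float]) -> float:
        for lane in self.lanes:
            lane.acc.out = 0.0  # clear the accumulators before starting
        width = len(self.lanes)
        for start in range(0, len(a), width):
            chunk = zip(self.lanes, a[start:start + width], b[start:start + width])
            for lane, x, y in chunk:
                # Control step 1: load the multiplier input buffers and fire it.
                lane.mult.in_a, lane.mult.in_b = x, y
                lane.mult.out = lane.mult.in_a * lane.mult.in_b
                # Control step 2: feed the product into the accumulation unit.
                lane.acc.in_a, lane.acc.in_b = lane.acc.out, lane.mult.out
                lane.acc.out = lane.acc.in_a + lane.acc.in_b
        # Cross-lane reduction of the per-lane partial sums.
        return sum(lane.acc.out for lane in self.lanes)


controller = CompositeController([Lane() for _ in range(4)])
print(controller.multiply_accumulate([1, 2, 3, 4, 5, 6], [1, 1, 1, 1, 1, 1]))
```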
- the composite operation carried out by the composite functional unit may be selected from a group comprising a Fast Fourier Transform (FFT), an inverse Fast Fourier Transform (iFFT), a Fast Hadamard Transform (FHT), a Finite Impulse Response (FIR) filtering with ramping, an FIR filtering without ramping, an auto-correlation, a cross-correlation and a vector multiplication.
- one composite functional unit may be configured to carry out multiple composite operations with similar computation procedures.
- the composite functional units and their corresponding composite operations will be illustrated in more detail in the following paragraphs.
- a composite functional unit may be configured to perform the FFT, IFFT and FHT operations.
- FIG. 2 shows exemplary pseudo-codes to describe the operation control flow for performing the FFT operation via a 4-lane VDSP according to an embodiment of the invention. It should be noted that the control flow can be easily extended to a VDSP with any number of lanes.
- FIG. 3 shows an exemplary block diagram of a composite functional unit capable of performing the FFT, IFFT and FHT operations according to an embodiment of the invention.
- the FFT/IFFT/FHT operations that can be carried out by the composite functional unit 300 are illustrated in more detail in the following paragraphs.
- the ways to utilize the FFT, IFFT, and FHT instructions based on radix-2 or radix-4 FFT/IFFT/FHT algorithm are provided below:
- the input parameter Vr_dest is the name of a destination vector register
- the input parameter Vr_src is the name of a source vector register
- the input parameter Rctrl is the name of a control register used to specify the size of vector register (i.e., the number of samples in one vector register) to be processed by FFT or IFFT or FHT.
- the destination vector register, the source vector register and the control register are registers/vector registers in the multi-port register file unit 170.
- the instruction decode and dispatch unit 60 may decode the instructions received by the apparatus 100 , and dispatch the decoded result to the corresponding functional unit.
- the instruction decode and dispatch unit 60 may provide the control signals: fft_start, op_code and vector_length to the controller (the composite instruction controller) 310 .
- the control signal fft_start indicates start of the corresponding operation.
- the control signal op_code indicates which operation of the FFT, IFFT and FHT is to be performed and further indicates the names of the registers to be accessed.
- the control signal vector_length indicates the length of the vector to be processed.
- the input data is loaded from the external RAM 50 and then stored in the vector register file (VRF) 340 via the load unit 320 .
- the output data is loaded from vector register file (VRF) 340 and then stored in the external RAM 50 via the store unit 330 .
- the load unit 320 represents a combination of the functions of the RAM load unit 150 and the lane load unit 185 for simplicity.
- the store unit 330 represents a combination of the functions of the RAM store unit 160 and the lane store unit 180 for simplicity.
- the vector register file (VRF) 340 represents the vector registers, which are configured to facilitate performance of the FFT/IFFT/FHT operation via the instruction, in the multi-port register file 170 for simplicity.
- the FFT/IFFT/FHT instructions use hardware resources in multiplier, accumulation, and permutation functional units.
- the controller 310 is configured to generate control signals to conduct the data flow of the FFT, the IFFT and the FHT instructions based on the operation procedures required by FFT/IFFT/FHT algorithms.
- the controller 310 comprises an FFT/IFFT/FHT operation control unit 311 , an input data address generation unit 312 , an output data address generation unit 313 , a twiddle look-up table address generation unit 314 and an output data permutation control unit 315 .
- the FFT/IFFT/FHT operation control unit 311 is configured to issue the control signals for controlling operations of the fundamental functional units, so as to control the multi-stage FFT/IFFT/FHT operation based on Decimation-in-Frequency (DIF) or Decimation-in-Time (DIT) or a mixed DIF/DIT FFT or FHT algorithm.
- the input data address generation unit 312 is configured to generate an input data address for fetching data from the vector register file (VRF) 340 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code and providing fetched data to the input buffers of the corresponding functional units.
- the output data address generation unit 313 is configured to generate an output data address for storing data fetched from the output buffers of the corresponding functional units to the vector register file (VRF) 340 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code. That is, the vector register file (VRF) 340 is configured to hold the source and destination data vector registers for the FFT/IFFT/FHT instructions.
- the twiddle look-up table address generation unit 314 is configured to generate the address of the Twiddle factor look-up table (LUT) 305 .
- the Twiddle factor LUT 305 is configured to store the twiddle factors.
- the output data permutation control unit 315 is configured to generate a plurality of permutation control signals to utilize the permutation functional unit for re-ordering output data of the butterfly unit 400 as required by the FFT/IFFT/FHT algorithms.
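The roles of the twiddle look-up table address generation and the output data permutation control can be sketched in software. The following is a minimal illustration under stated assumptions: a radix-2 DIF schedule, a table of N/2 entries, and bit-reversed output ordering. The function names are hypothetical, and the addressing actually used by the units 314 and 315 is not specified in this text.

```python
import cmath
from typing import List


def build_twiddle_lut(n: int) -> List[complex]:
    """Store W_n^k = exp(-2*pi*j*k/n) for k = 0 .. n/2 - 1 (n a power of two)."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def twiddle_address(stage: int, butterfly: int) -> int:
    """LUT address for a radix-2 DIF butterfly (assumed schedule).

    stage is 0 for the first stage; butterfly is the index of the butterfly
    inside its group, i.e. 0 .. n / 2**(stage + 1) - 1.
    """
    return butterfly << stage


def bit_reversed_order(n: int) -> List[int]:
    """Permutation that restores natural order after a radix-2 DIF FFT."""
    bits = n.bit_length() - 1
    return [int(format(i, "b").zfill(bits)[::-1], 2) for i in range(n)]


lut = build_twiddle_lut(8)
print([twiddle_address(1, b) for b in range(2)])
print(bit_reversed_order(8))
```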
- the butterfly unit 400 is configured to perform butterfly operations.
- the controller 310 issues the read request (e.g. the VRF Rd Req) and provides the read address (e.g. the VRF Rd Addr) to preload the input data from the vector register file (VRF) 340 to the input buffers of the accumulation functional units 130 (shown as the input buffers (ACC functional unit) 350 in FIG. 3 ) and the input buffers of the multiplier functional units 120 (shown as the input buffers (MULT functional unit) 360 in FIG. 3 ).
- the input buffers inside the ACC/MULT/PERM functional units are configured to hold the input data to FFT/IFFT/FHT butterfly unit 400 .
- the input data is fetched from the input buffers (ACC functional unit) 350 and the input buffers (MULT functional unit) 360 and provided to the butterfly unit 400 .
- the butterfly unit 400 is configured to perform radix-2 or radix-4 parallel butterfly operations. It is to be noted that the parameter numStage in the pseudo-code represents the number of stages of the FFT operation, and the parameter N represents the number of data samples for the FFT operation.
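The relationship between N and numStage can be made concrete with a software reference model. The sketch below is not the pseudo-code of FIG. 2, which is not reproduced in this text; it is a standard radix-2 DIF FFT, written under the assumption that N is a power of two so that numStage equals log2(N). The variable names other than `num_stage` and `n` are illustrative.

```python
import cmath
from typing import List, Sequence


def fft_dif_radix2(x: Sequence[complex]) -> List[complex]:
    """Radix-2 decimation-in-frequency FFT with a final output permutation."""
    n = len(x)
    num_stage = n.bit_length() - 1  # numStage = log2(N) for radix-2
    if n == 0 or (1 << num_stage) != n:
        raise ValueError("length must be a power of two")
    lut = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]
    data = list(x)
    for stage in range(num_stage):
        half = n >> (stage + 1)  # distance between the two butterfly inputs
        for group in range(0, n, 2 * half):
            for j in range(half):
                a = data[group + j]
                b = data[group + j + half]
                data[group + j] = a + b
                data[group + j + half] = (a - b) * lut[j << stage]
    # DIF leaves the results in bit-reversed order; re-order them.
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, "b").zfill(num_stage)[::-1], 2)
        out[rev] = data[i]
    return out


print(fft_dif_radix2([1, 0, 0, 0]))
```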
- the output data of the butterfly unit 400 is provided to the output buffers of the permutation functional unit 140 (shown as the output buffers (PERM functional unit) 370 in FIG. 3 ).
- the output data of the PERM functional unit is further provided to the output buffers of the multiplier functional units 120 (shown as the output buffers (MULT functional unit) 380 in FIG. 3 ) and the output buffers of the accumulation functional units 130 (shown as the output buffers (ACC functional unit) 390 in FIG. 3 ).
- the output buffers inside the ACC/MULT/PERM functional units are configured to hold the output data to be saved back to vector register file (VRF) 340 . If the output buffer is full, the controller 310 issues the write request (e.g. the VRF Wr Req) and provides the write address (e.g. the VRF Wr Addr) to write the output data to the vector register file (VRF) 340 .
- the FFT/IFFT/FHT instructions can be executed in parallel with other normal (i.e. non-composite, or named general-purpose) instructions such as load and store instructions.
- At least part of the fundamental functional units are controlled by the composite instruction controller to carry out a butterfly operation that is required for the composite operation.
- the butterfly operation may be a radix-2 butterfly operation or a radix-4 butterfly operation.
- FIG. 4A is an exemplary block diagram of the butterfly unit according to an embodiment of the invention.
- the butterfly unit 400A is configured to execute the Decimation-in-Frequency (DIF) Radix-4 butterfly operations, where x0~x3, y0~y3 and y′0~y′3 represent the input/output data, w1~w3 represent the twiddle factors, and z0~z3 represent the output data.
- the butterfly unit 400A uses 8 complex adders and 3 complex multipliers for the multiplications of twiddle factors. Note that the multipliers and twiddle factors are not used by the FHT instructions.
- the butterfly unit 400 A uses four lanes of multiplier functional units 120 and accumulation functional units 130 .
- the outputs of the accumulation functional units 130 and multiplier functional units 120 are provided to the pipeline registers of the corresponding signal processing lane.
- both the adder functional unit 110 and the accumulation functional unit 130 comprise the adders as their hardware resources. Therefore, the butterfly unit 400 A may also be designed to use four lanes of multiplier functional units 120 and adder functional units 110 , and the invention should not be limited to any specific method of implementation.
- the butterfly unit 400 A has a cross-lane architecture, that is, the output data of fundamental functional unit in one lane is provided as the input data of the fundamental functional unit in another lane.
- the output data y1 of the accumulation functional unit 130 in lane 2 is provided as the input data of the accumulation functional unit 130 in lane 3 .
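The operation count stated for the DIF radix-4 butterfly (8 complex adders, 3 complex multipliers) can be checked against a short software sketch. The assignment of the intermediate names y and y′ to particular adders below is an assumption, since FIG. 4A is not reproduced here; the arithmetic itself is the standard radix-4 DIF butterfly. The rotation by -j is written as a multiplication for brevity, on the assumption that hardware would realize it as a swap and negate rather than with a complex multiplier.

```python
from typing import Sequence, Tuple

Complex4 = Tuple[complex, complex, complex, complex]


def radix4_dif_butterfly(x: Sequence[complex], w: Sequence[complex]) -> Complex4:
    """x = (x0, x1, x2, x3), w = (w1, w2, w3); returns (z0, z1, z2, z3)."""
    x0, x1, x2, x3 = x
    w1, w2, w3 = w
    # First adder stage: 4 complex adders.
    y0 = x0 + x2
    y1 = x1 + x3
    y2 = x0 - x2
    y3 = (x1 - x3) * -1j  # trivial rotation by -j, not a twiddle multiply
    # Second adder stage: 4 more complex adders.
    yp0 = y0 + y1
    yp1 = y2 + y3
    yp2 = y0 - y1
    yp3 = y2 - y3
    # Twiddle stage: 3 complex multipliers; z0 needs none.
    return (yp0, yp1 * w1, yp2 * w2, yp3 * w3)


# With unit twiddles the butterfly is a 4-point DFT.
print(radix4_dif_butterfly((1, 2, 3, 4), (1, 1, 1)))
```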
- FIG. 4B is an exemplary block diagram of the butterfly unit according to another embodiment of the invention.
- the butterfly unit 400B is configured to execute the DIF Radix-2 butterfly operations, where x0~x1 and y0~y1 represent the input/output data, w1 represents the twiddle factor, and z0~z1 represent the output data.
- the butterfly unit 400B uses 2 complex adders and 1 complex multiplier for the multiplication of the twiddle factor. Note that the multiplier and twiddle factor are not used by the FHT instructions.
- the butterfly unit 400 B uses two lanes of multiplier functional units 120 and accumulation functional units 130 .
- the outputs of the accumulation functional units 130 and multiplier functional units 120 are provided to the pipeline registers of the corresponding signal processing lane. It should be noted that the butterfly unit 400B may also be designed to use two lanes of multiplier functional units 120 and adder functional units 110, and the invention should not be limited to any specific method of implementation.
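The DIF radix-2 butterfly, and the remark that the FHT does not use the multiplier or twiddle factor, can be illustrated together. In the sketch below, passing no twiddle factor bypasses the multiplier, and the same butterfly then drives a Fast Hadamard Transform. The function names and the stage ordering of `fht` are assumptions made for illustration, not details taken from FIG. 4B.

```python
from typing import List, Optional


def radix2_dif_butterfly(x0, x1, w1: Optional[complex] = None):
    """2 adders and, unless w1 is None (FHT mode), 1 multiplier."""
    y0 = x0 + x1
    y1 = x0 - x1
    if w1 is None:
        return y0, y1  # FHT: multiplier and twiddle factor unused
    return y0, y1 * w1


def fht(x: List[float]) -> List[float]:
    """Fast Hadamard Transform built only from twiddle-free butterflies."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(x)
    half = n // 2
    while half >= 1:
        for group in range(0, n, 2 * half):
            for j in range(half):
                i0, i1 = group + j, group + j + half
                data[i0], data[i1] = radix2_dif_butterfly(data[i0], data[i1])
        half //= 2
    return data


print(fht([1, 2, 3, 4]))
```

Applying `fht` twice returns the input scaled by its length, which is a quick sanity check on the transform.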
- FIG. 5A is an exemplary block diagram of the butterfly unit according to yet another embodiment of the invention.
- the butterfly unit 500A is configured to execute the Decimation-in-Time (DIT) Radix-4 butterfly operations, where x0~x3, x′0~x′3 and y0~y3 represent the input/output data, w1~w3 represent the twiddle factors, and z0~z3 represent the output data.
- the butterfly unit 500 A uses 8 complex adders and 3 complex multipliers for the multiplications of twiddle factors.
- the butterfly unit 500 A uses four lanes of multiplier functional units 120 and accumulation functional units 130 .
- the outputs of the accumulation functional units 130 and multiplier functional units 120 are provided to the pipeline registers of the corresponding signal processing lane. It should be noted that the butterfly unit 500A may also be designed to use four lanes of multiplier functional units 120 and adder functional units 110, and the invention should not be limited to any specific method of implementation.
- the butterfly unit 500 A has a cross-lane architecture, that is, the output data of fundamental functional unit in one lane is provided as the input data of the fundamental functional unit in another lane.
- the output data x′2 of the multiplier functional units 120 in lane 2 is provided as the input data of the accumulation functional unit 130 in lane 1 .
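In the DIT variant the twiddle multiplications come first, so each multiplier output x′ feeds an adder, possibly in another lane. The sketch below follows that ordering. The mapping of intermediate names to particular lanes is an assumption, since FIG. 5A is not reproduced here; the first adder, `y0 = xp0 + xp2`, is at least consistent with the statement that x′2 is consumed by an accumulation unit in a different lane. The rotation by -j is again written as a multiplication only for brevity.

```python
from typing import Sequence, Tuple

Complex4 = Tuple[complex, complex, complex, complex]


def radix4_dit_butterfly(x: Sequence[complex], w: Sequence[complex]) -> Complex4:
    """x = (x0, x1, x2, x3), w = (w1, w2, w3); returns (z0, z1, z2, z3)."""
    x0, x1, x2, x3 = x
    w1, w2, w3 = w
    # Twiddle stage first: 3 complex multipliers; x0 passes through.
    xp0, xp1, xp2, xp3 = x0, x1 * w1, x2 * w2, x3 * w3
    # First adder stage: 4 complex adders, with cross-lane operands.
    y0 = xp0 + xp2
    y1 = xp0 - xp2
    y2 = xp1 + xp3
    y3 = (xp1 - xp3) * -1j  # trivial rotation by -j
    # Second adder stage: 4 more complex adders.
    return (y0 + y2, y1 + y3, y0 - y2, y1 - y3)


print(radix4_dit_butterfly((1, 2, 3, 4), (1, 1, 1)))
```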
- FIG. 5B is an exemplary block diagram of the butterfly unit according to yet another embodiment of the invention.
- the butterfly unit 500B is configured to execute the DIT Radix-2 butterfly operations, where x0~x1 and y0~y1 represent the input/output data, w1 represents the twiddle factor, and z0~z1 represent the output data.
- the butterfly unit 500 B uses 2 complex adders and 1 complex multiplier for the multiplications of twiddle factor.
- the butterfly unit 500 B uses two lanes of multiplier functional units 120 and accumulation functional units 130 .
- the outputs of the accumulation functional units 130 and multiplier functional units 120 are provided to the pipeline registers of the corresponding signal processing lane.
- the butterfly unit 500 B may also be designed to use two lanes of multiplier functional units 120 and adder functional units 110 , and the invention should not be limited to any specific method of implementation.
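To show how a DIT radix-2 butterfly composes into a full transform, the following sketch wraps the butterfly in a recursive FFT. The recursion is a textbook formulation chosen for brevity and is an assumption of this example; the apparatus is described as a staged, lane-parallel controller and not as a recursive procedure.

```python
import cmath
from typing import List, Sequence


def radix2_dit_butterfly(x0: complex, x1: complex, w1: complex):
    """1 complex multiplier followed by 2 complex adders."""
    xp1 = x1 * w1
    return x0 + xp1, x0 - xp1


def fft_dit_radix2(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 decimation-in-time FFT (length a power of two)."""
    n = len(x)
    if n == 0:
        raise ValueError("empty input")
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_dit_radix2(x[0::2])
    odd = fft_dit_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)
        out[k], out[k + n // 2] = radix2_dit_butterfly(even[k], odd[k], w)
    return out


print(fft_dit_radix2([1, 1, 1, 1]))
```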
- a composite functional unit may be configured to perform the FIR filtering with ramping and FIR filtering without ramping operations.
- FIG. 6 shows exemplary pseudo-codes to describe the operation control flow for performing the FIR filtering without ramping operation via a 4-lane VDSP according to an embodiment of the invention. It should be noted that it can be easily extended to the VDSP with any number of lanes.
- FIG. 7 shows an exemplary block diagram of a composite functional unit capable of performing the FIR filtering with ramping and FIR filtering without ramping operations according to an embodiment of the invention.
- the FIR filtering with ramping and FIR filtering without ramping operations that can be carried out by the composite functional unit 700 are illustrated in more detail in the following paragraphs.
- the Fir is the instruction to support FIR with ramping
- the FirNoRamp is the instruction to support FIR without ramping.
- the input parameter Vr_dest is the name of a destination vector register that holds the output data of the FIR filter
- the input parameter Vr_src1 is the name of a source vector register that holds the input data to the FIR filter
- the input parameter Vr_src2 is the name of a source vector register that holds the coefficients of the FIR filter
- the input parameter Rctrl is the name of a control register used to specify the lengths of the input data vector and coefficient vector.
- the destination Vector register, source vector registers and control register are the register/vector registers in the multi-port register file unit 170 .
- the instruction decode and dispatch unit 60 may decode the instructions received by the apparatus 100 , and dispatch the decoded result to the corresponding functional unit.
- the instruction decode and dispatch unit 60 may provide the control signals: fir_start, op_code and vector_length to the controller (the composite instruction controller) 710 .
- the control signal fir_start indicates start of the corresponding operation.
- the control signal op_code indicates which operation of the FIR and FIR without ramping is to be performed and further indicates the names of the registers to be accessed.
- the control signal vector_length indicates the length of the vector to be processed.
- the input data is loaded from the external RAM 50 and then stored in the vector register file (VRF) 740 via the load unit 720 .
- the output data is loaded from vector register file (VRF) 740 and then stored in the external RAM 50 via the store unit 730 .
- the load unit 720 represents a combination of the functions of the RAM load unit 150 and the lane load unit 185 for simplicity.
- the store unit 730 represents a combination of the functions of the RAM store unit 160 and the lane store unit 180 for simplicity.
- the vector register file (VRF) 740 represents the vector registers, which are configured to facilitate performance of the FIR or FIR without ramping operation via the instruction, in the multi-port register file 170 for simplicity.
- the FIR related instructions use hardware resources in multiplier, accumulation, and permutation functional units.
- the controller 710 is configured to generate control signals to conduct the data flow of the Fir and FirNoRamp instructions based on the operation procedures required by FIR and FIR without ramping algorithms.
- the controller 710 comprises an Fir/FirNoRamp operation control unit 711 , an input data address generation unit 712 , an output data address generation unit 713 and an input data shift control unit 714 .
- the Fir/FirNoRamp operation control unit 711 is configured to issue the control signals for controlling operations of the fundamental functional units, so as to control the FIR operation with or without ramping.
- the input data address generation unit 712 is configured to generate an input data address for fetching data from the vector register file (VRF) 740 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code and providing fetched data to the input buffers of the corresponding functional units.
- the output data address generation unit 713 is configured to generate an output data address for storing data fetched from the output buffers of the corresponding functional units to the vector register file (VRF) 740 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code. That is, the vector register file (VRF) 740 is configured to hold the source and destination data vector registers for the Fir/FirNoRamp instructions.
- the input data shift control unit 714 is configured to generate a plurality of shift control signals to shift input data vector for supporting the FIR algorithm.
- the FIR instruction (Fir) is to calculate:
- the x(k) represents the input data
- the a(j) represents the coefficients
- the y(k) represents the FIR result
- the L represents the length of the data vector
- the N represents the length of the filter.
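- The equation itself is not reproduced in this text. As a hedged illustration, the Python sketch below assumes the standard direct-form FIR sum, y(k) = a(0)x(k) + a(1)x(k-1) + ... + a(N-1)x(k-N+1), using the symbols defined above. The interpretation of "ramping" is also an assumption: here "with ramping" keeps the first N-1 outputs, for which samples before x(0) are treated as zero, while "without ramping" keeps only outputs for which all N taps overlap the data. The function names are hypothetical.

```python
def fir_with_ramp(x, a):
    """Assumed 'with ramping' form: L outputs, samples before x(0) are zero."""
    L, N = len(x), len(a)
    return [
        sum(a[j] * x[k - j] for j in range(N) if k - j >= 0)
        for k in range(L)
    ]


def fir_no_ramp(x, a):
    """Assumed 'without ramping' form: only fully overlapped outputs."""
    L, N = len(x), len(a)
    return [
        sum(a[j] * x[k - j] for j in range(N))
        for k in range(N - 1, L)
    ]


if __name__ == "__main__":
    data = [1, 2, 3, 4]      # input data vector, L = 4
    coeffs = [1, 1]          # coefficient vector, N = 2
    print(fir_with_ramp(data, coeffs))
    print(fir_no_ramp(data, coeffs))
```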
- the controller 710 issues the read request (e.g. the VRF Rd Req) and provides the read address (e.g. the VRF Rd Addr) to load the input data and the coefficients from the vector register file (VRF) 740 to the input buffers of the permutation functional units 140 (shown as the input buffers (PERM functional unit) 750 in FIG. 7 ) and the input buffers of the multiplier functional units 120 (shown as the input buffers (MULT functional unit) 770 in FIG. 7 ).
- the input data is loaded to the input buffers (PERM functional unit) 750
- the coefficients are loaded to the input buffers (MULT functional unit) 770 .
- the input data shifter of the permutation functional units 140 (shown as the input data shifter (PERM functional unit) 760 in FIG. 7 ) is configured to perform the down-shift operation on the input data.
- the shifted input data is multiplied by the coefficients via the multiplier functional units 120 and the output data of the multiplier functional units 120 is then provided to the accumulation functional units 130 via the pipeline registers 780 .
- the accumulation functional units 130 perform the accumulation calculations on the received data and store the calculation result in the output buffers thereof (shown as the output buffers (ACC functional unit) 790 in FIG. 7 ).
- the output buffers inside the ACC functional units are configured to hold the output data to be saved back to vector register file (VRF) 740 .
- the controller 710 issues the write request (e.g. the VRF Wr Req) and provides the write address (e.g. the VRF Wr Addr) to write the output data to the vector register file (VRF) 740 .
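- The data flow described above (shift in the permutation units, multiply, then accumulate) may be sketched in Python as follows. This is a behavioral illustration rather than a model of the actual hardware: it assumes that a down-shift by j moves each element j positions toward higher indices with zero fill, and that one coefficient is applied per pass. The function names are hypothetical.

```python
def shift_down(vec, n):
    """PERM step (assumed direction): shift by n positions, zero-filled."""
    return ([0] * n + list(vec))[:len(vec)]


def fir_shift_mac(x, a):
    """FIR computed as repeated shift, multiply and accumulate passes."""
    acc = [0] * len(x)                                   # ACC output buffers
    for j, coeff in enumerate(a):
        shifted = shift_down(x, j)                       # PERM: input data shifter
        products = [coeff * s for s in shifted]          # MULT: one coefficient per pass
        acc = [p + q for p, q in zip(acc, products)]     # ACC: running sum
    return acc


if __name__ == "__main__":
    print(fir_shift_mac([1, 2, 3, 4], [2, 0, 1]))
```

- Each pass over the vector corresponds to one filter tap, so a filter of length N needs N passes in this sketch.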
- Fir/FirNoRamp instructions can be executed in parallel with other normal (non-composite) instructions such as load and store instructions.
- a composite functional unit may be configured to perform the auto-correlation, cross-correlation and vector multiplication operations.
- FIG. 8A is a schematic diagram showing the operation of auto-correlation according to an embodiment of the invention.
- FIG. 8B is a schematic diagram showing the operation of cross-correlation/vector multiplication according to an embodiment of the invention.
- the X(k) and Y(k) represent the input data
- the R(k) represents the correlation/multiplication result.
- FIG. 9 shows an exemplary block diagram of a composite functional unit capable of performing the auto-correlation, cross-correlation and vector multiplication operations according to an embodiment of the invention.
- the auto-correlation, cross-correlation and vector multiplication operations that can be carried out by the composite functional unit 900 are illustrated in more detail in the following paragraphs.
- Vr_dest, Vr_src1, Vr_src2, Rctrl
- the AutoCorr is the instruction to support auto correlation of a data vector.
- the CrossCorr is the instruction to support cross-correlation of two data vectors.
- the VecByMat is the instruction to support the multiplication of a vector with a matrix, which can be realized as a simplified form of cross-correlation of two data vectors when the matrix is stored in vector registers in either row-major or column-major format.
- the input parameter Vr_dest is the name of a destination vector register that holds the output data
- the input parameter Vr_src1 is the name of a source vector register that holds one data vector
- the input parameter Vr_src2 is the name of a source vector register that holds one data vector (or a Matrix for the VecByMat instruction)
- the input parameter Rctrl is the name of a control register used to specify the lengths of the input data vectors.
- the destination Vector register, source vector registers and control register are the register/vector registers in the multi-port register file unit 170 .
- the instruction decode and dispatch unit 60 may decode the instructions received by the apparatus 100 , and dispatch the decoded result to the corresponding functional unit.
- the instruction decode and dispatch unit 60 may provide the control signals: corr_start, op_code and vector_length to the controller (the composite instruction controller) 910 .
- the control signal corr_start indicates start of the corresponding operation.
- the control signal op_code indicates which operation of the auto-correlation, cross-correlation and vector multiplication is to be performed and further indicates the names of the registers to be accessed.
- the control signal vector_length indicates the length of the vector to be processed.
- the input data is loaded from the external RAM 50 and then stored in the vector register file (VRF) 940 via the load unit 920 .
- the output data is loaded from vector register file (VRF) 940 and then stored in the external RAM 50 via the store unit 930 .
- the load unit 920 represents a combination of the functions of the RAM load unit 150 and the lane load unit 185 for simplicity.
- the store unit 930 represents a combination of the functions of the RAM store unit 160 and the lane store unit 180 for simplicity.
- the vector register file (VRF) 940 represents the vector registers, which are configured to facilitate performance of the auto-correlation, cross-correlation or vector multiplication operation via the instruction, in the multi-port register file 170 for simplicity.
- the correlation related instructions use hardware resources in multiplier, accumulation, and permutation functional units.
- the controller 910 is configured to generate control signals to conduct the data flow of the AutoCorr, CrossCorr and VecByMat instructions based on the operation procedures required by the auto-correlation, cross-correlation and vector multiplication algorithms.
- the controller 910 comprises an AutoCorr/CrossCorr/VecByMat operation control unit 911 , an input data address generation unit 912 , an output data address generation unit 913 and an input data shift control unit 914 .
- the AutoCorr/CrossCorr/VecByMat operation control unit 911 is configured to issue the control signals for controlling the correlation operation flow for different instructions.
- the input data address generation unit 912 is configured to generate an input data address for fetching data from the vector register file (VRF) 940 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code and providing fetched data to the input buffers of the corresponding functional units.
- the output data address generation unit 913 is configured to generate an output data address for storing data fetched from the output buffers of the corresponding functional units to the vector register file (VRF) 940 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code.
- the vector register file (VRF) 940 is configured to hold the source and destination data vector registers for the AutoCorr/CrossCorr/VecByMat instructions.
- the input data shift control unit 914 is configured to generate a plurality of shift control signals to shift input data vector for supporting the correlation algorithms.
- the auto-correlation instruction (AutoCorr) is to calculate:
- the cross-correlation instruction (CrossCorr) is to calculate:
- the vector-by-matrix multiplication instruction (VecByMat) is to calculate (assuming x holds the matrix and y holds the vector):
- the x(k) and y(j) represent the input data and the R(k) represents the calculation result
- the N represents the length of the input vector (or the number of rows of the input matrix, which can be the same as the size of the input vector)
- the M represents the length of the output data vector (or the number of columns of the input matrix, which can be the same as the size of the output data vector).
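- The three equations are not reproduced in this text. The Python sketch below shows one plausible reading, using the symbols defined above and assuming real-valued data (no conjugation), zero contribution from out-of-range samples, and a row-major matrix with N rows and M columns held in x. The function names and the num_lags parameter are hypothetical.

```python
def auto_corr(x, num_lags):
    """Assumed form: R(k) = sum over j of x(j) * x(j + k)."""
    n = len(x)
    return [sum(x[j] * x[j + k] for j in range(n - k)) for k in range(num_lags)]


def cross_corr(x, y, num_lags):
    """Assumed form: R(k) = sum over j of x(j + k) * y(j)."""
    return [
        sum(x[j + k] * y[j] for j in range(len(y)) if j + k < len(x))
        for k in range(num_lags)
    ]


def vec_by_mat(x_flat, y, m):
    """Assumed form: R(k) = sum over j of x(j, k) * y(j), x stored row-major."""
    n = len(y)  # number of matrix rows equals the input vector length
    return [sum(x_flat[j * m + k] * y[j] for j in range(n)) for k in range(m)]


if __name__ == "__main__":
    print(auto_corr([1, 2, 3], 3))
    print(cross_corr([1, 2, 3], [1, 1], 2))
    print(vec_by_mat([1, 2, 3, 4, 5, 6], [1, 1], 3))
```

- Written this way, the vector-by-matrix product uses the same multiply-and-accumulate pattern as cross-correlation, with the lag index k replaced by the column index.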
- the controller 910 issues the read request (e.g. the VRF Rd Req) and provides the read address (e.g. the VRF Rd Addr) to load the input data and the coefficients from the vector register file (VRF) 940 to the input buffers of the permutation functional units 140 (shown as the input buffers (PERM functional unit) 950 in FIG. 9 ) and the input buffers of the multiplier functional units 120 (shown as the input buffers (MULT functional unit) 970 in FIG. 9 ).
- the input data shifter of the permutation functional units 140 (shown as the input data shifter (PERM functional unit) 960 in FIG. 9 ) is configured to perform shift operation on the input data.
- the shifted input data is multiplied by the input data y(j) via the multiplier functional units 120 and the output data of the multiplier functional units 120 is then provided to the accumulation functional units 130 via the pipeline registers 980 .
- the accumulation functional units 130 perform the cross-lane accumulation calculations on the received data via the cross-lane ACC registers 990 .
- the calculation result is stored in the output buffers thereof (shown as the output buffers (ACC functional unit) 995 in FIG. 9 ).
- the output buffers inside the ACC functional units are configured to hold the output data to be saved back to vector register file (VRF) 940 .
- the controller 910 issues the write request (e.g. the VRF Wr Req) and provides the write address (e.g. the VRF Wr Addr) to write the output data to the vector register file (VRF) 940 .
- AutoCorr/CrossCorr/VecByMat instructions can be executed in parallel with other normal (non-composite) instructions such as load and store instructions.
- the composite functional unit 900 has a cross-lane architecture, that is, the accumulation functional units 130 are configured to perform the cross-lane accumulation calculations.
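- Cross-lane accumulation can be illustrated with the following Python sketch. It assumes a 4-lane arrangement in which element i is processed by lane i modulo 4, each lane accumulates its own products, and the per-lane partial sums are then combined pairwise. The element-to-lane mapping, the pairwise reduction order and the function names are assumptions made for illustration.

```python
NUM_LANES = 4  # assumed lane count, matching the 4-lane examples


def lane_partial_sums(x, y, num_lanes=NUM_LANES):
    """Each lane multiplies and accumulates the elements assigned to it."""
    partial = [0] * num_lanes
    for i, (xi, yi) in enumerate(zip(x, y)):
        partial[i % num_lanes] += xi * yi
    return partial


def cross_lane_accumulate(partial):
    """Combine per-lane partial sums pairwise until one value remains."""
    vals = list(partial)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)  # pad odd lengths so every value has a partner
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    y = [1] * 8
    partial = lane_partial_sums(x, y)
    print(partial, cross_lane_accumulate(partial))
```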
- a single composite instruction (such as FFT, IFFT, FHT, Fir, FirNoRamp, AutoCorr, CrossCorr, VecByMat, etc.) can support a complex algorithm which was realized by a software subroutine or micro-codes in the software solution design.
- a single composite instruction is implemented. For a “function call”, the software control overhead is the main drawback and the code size is large.
- a composite instruction in VDSP can achieve the same performance as a dedicated co-processor while sharing the same hardware resources in VDSP with other normal (non-composite) instructions.
- the technical effects that may be achieved by this invention include: 1) reduced software code size when using the composite instruction to realize common algorithms, as compared to the software solution, 2) better performance (higher data throughput) due to reduced control overhead, as compared to the software solution and 3) higher utilization of hardware resources, as compared to the co-processor solution.
Abstract
An apparatus includes multiple signal processing lanes and a composite instruction controller. Each signal processing lane includes a first fundamental functional unit, a second fundamental functional unit and a register file unit having multiple configurable vector registers. The composite instruction controller is coupled to the first fundamental functional units and the second fundamental functional units in the plurality of signal processing lanes and is configured to issue control signals in response to a composite instruction to control the first fundamental functional units and the second fundamental functional units and thereby carry out a composite operation.
Description
- This application claims the benefit of U.S. Provisional Application No. 62/554,052, filed Sep. 5, 2017 and entitled “Composite Instructions in Vector DSP”, the entire contents of which are hereby incorporated by reference.
- The invention relates to a novel design to implement multiple composite instructions to support the corresponding common digital signal processing algorithms in a VDSP.
- A Vector Digital Signal Processor (VDSP) is a type of efficient processor for implementing complex signal processing algorithms used in applications, such as wireless/wire line communication baseband processing, multi-media signal processing, etc. Conventional VDSPs support general purpose instructions, such as vector load, vector store, vector arithmetic (multiply, add, accumulation, min, max, etc.), and vector permutation (shift, move, etc.). VDSP may have multiple lanes to support parallel processing of multiple data samples in data vectors, and multiple functional units to support parallel execution of multiple instructions.
- In applications such as the baseband signal processing in wireless or wire line communication systems, the software (or firmware) run on a VDSP normally needs to further support some common digital signal processing algorithms, e.g., Fast Fourier Transform (FFT), Finite Impulse Response (FIR) filtering, Correlation, etc. However, these common digital signal processing algorithms are not included in the vector Instruction Set Architecture (ISA) of the current VDSP.
- To solve this problem, a novel design to support these common digital signal processing algorithms in VDSP is proposed. In the proposed VDSP architecture design, a set of composite instructions configured to perform common digital signal processing algorithms, such as FFT, IFFT, FHT, FIR, correlation, etc., are implemented.
- Apparatuses capable of providing composite instructions in the vector Instruction Set Architecture (ISA) of a processor are provided. An exemplary embodiment of an apparatus comprises a plurality of signal processing lanes and a composite instruction controller. Each signal processing lane comprises a first fundamental functional unit, a second fundamental functional unit, and a register file unit comprising a plurality of configurable vector registers. The composite instruction controller is coupled to the first fundamental functional units and the second fundamental functional units in the plurality of signal processing lanes, and is configured to issue a plurality of control signals in response to a composite instruction to control the first fundamental functional units and the second fundamental functional units and thereby carry out a composite operation.
- An exemplary embodiment of an apparatus comprises a plurality of signal processing lanes and a first composite instruction controller. Each signal processing lane comprises a first fundamental functional unit, a second fundamental functional unit, and a register file unit. The first fundamental functional unit comprises a plurality of first buffers and a first computation unit. The second fundamental functional unit comprises a plurality of second buffers and a second computation unit. The register file unit comprises a plurality of configurable vector registers. The first composite instruction controller is configured to issue a plurality of control signals in response to a first composite instruction to control the plurality of first buffers and the first computation unit of the first fundamental functional units and the plurality of second buffers and the second computation unit of the second fundamental functional units in the plurality of signal processing lanes and thereby carry out a first composite operation.
- A detailed description is given in the following embodiments with reference to the accompanying drawings.
- The invention can be more fully understood by reading the subsequent detailed description and examples with references made to the accompanying drawings, wherein:
- FIG. 1A and FIG. 1B show an exemplary block diagram of an architecture of an apparatus capable of performing complex signal processing according to an embodiment of the invention;
- FIG. 2 shows exemplary pseudo-codes to describe the operation control flow for performing the FFT operation via a 4-lane VDSP according to an embodiment of the invention;
- FIG. 3 shows an exemplary block diagram of a composite functional unit capable of performing the FFT, IFFT and FHT operations according to an embodiment of the invention;
- FIG. 4A is an exemplary block diagram of the butterfly unit according to an embodiment of the invention;
- FIG. 4B is an exemplary block diagram of the butterfly unit according to another embodiment of the invention;
- FIG. 5A is an exemplary block diagram of the butterfly unit according to yet another embodiment of the invention;
- FIG. 5B is an exemplary block diagram of the butterfly unit according to yet another embodiment of the invention;
- FIG. 6 shows exemplary pseudo-codes to describe the operation control flow for performing the FIR filtering without ramping operation via a 4-lane VDSP according to an embodiment of the invention;
- FIG. 7 shows an exemplary block diagram of a composite functional unit capable of performing the FIR filtering with ramping and FIR filtering without ramping operations according to an embodiment of the invention;
- FIG. 8A is a schematic diagram showing the operation of auto-correlation according to an embodiment of the invention;
- FIG. 8B is a schematic diagram showing the operation of cross-correlation/vector multiplication according to an embodiment of the invention; and
- FIG. 9 shows an exemplary block diagram of a composite functional unit capable of performing the auto-correlation, cross-correlation and vector multiplication operations according to an embodiment of the invention.
- The following description is of the best-contemplated mode of carrying out the invention. This description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.
- Currently, there are two methods of implementing the common digital signal processing algorithms in a VDSP-based signal-processing system: 1) Software solution, and 2) Co-processor solution.
- The software solution uses software functions or micro-codes to implement the algorithms. While the software implementation is flexible, the main drawbacks include: 1-1) the performance, in terms of maximal data throughput per second when executing the algorithms, may not be optimal due to the functional limitation of general-purpose instructions and software control overhead, and 1-2) the code size may be large.
- The co-processor solution is to implement a dedicated hardware module for each algorithm, and the dedicated hardware module is used as a co-processor to the VDSP. The main drawback of the co-processor solution is low utilization of hardware resources. Since each algorithm is implemented by a dedicated hardware co-processor, it's difficult to share hardware resource among different co-processors and with the VDSP.
- In the following paragraphs, a novel design to support common digital signal processing algorithms in VDSP is proposed. In the proposed VDSP architecture, a set of composite instructions configured to perform common digital signal processing algorithms, such as FFT, IFFT, FHT, FIR, correlation, etc., are implemented. Unlike the software solution mentioned above, when using the composite instruction to realize common algorithms, the software code size can be reduced. In addition, due to reduction of the control overhead, better performance (higher data throughput) can be achieved. In addition, unlike the co-processor solution mentioned above, when using the composite instructions to realize common algorithms, higher utilization of the hardware resources can be achieved.
-
FIG. 1A andFIG. 1B show an exemplary block diagram of an architecture of an apparatus capable of performing complex signal processing according to an embodiment of the invention. It should be noted that in order to clarify the concept of the invention,FIG. 1A andFIG. 1B present a simplified block diagram, in which only the elements relevant to the invention are shown. However, the invention should not be limited only to what is shown inFIG. 1A andFIG. 1B . - According to an embodiment, the
apparatus 100 may be a vector digital signal processor (VDSP) that can support a plurality of complex signal processing algorithms. Theapparatus 100 comprises a plurality of signal processing lanes, such as theLane 1,Lane 2,Lane 3 andLane 4 as shown inFIG. 1A andFIG. 1B . Noted that although there are four signal processing lanes shown inFIG. 1A andFIG. 1B , the invention should not be limited thereto. Theapparatus 100 may also comprise less than 4 or more than 4 signal processing lanes. - Each signal processing lane may comprise a plurality of fundamental functional units, such as one or more adder
functional units 110, one or more multiplierfunctional units 120, one or more accumulationfunctional units 130, one or more permutationfunctional units 140, . . . etc. Each fundamental functional unit is configured to support a general-purpose instruction by carrying out a corresponding fundamental operation. The adderfunctional unit 110 is configured to carry out an addition operation in response to an add (e.g., vector-add (vAdd)) instruction. The multiplierfunctional unit 120 is configured to carry out a multiplication operation in response to a multiply (e.g., vector-multiply (vMult)) instruction. The accumulationfunctional unit 130 is configured to carry out an accumulation operation in response to an accumulate (e.g., vector-accumulate (vAcc)) instruction. The permutationfunctional unit 140 is configured to carry out a permutation operation in response to a permutation (e.g., vector-permutation (vShift) for shifting the data elements of a vector) instruction. As an example, theapparatus 100 receives the instructions and data that have been input via a corresponding interface, and then triggers the corresponding functional units to perform the corresponding operations. - Each fundamental functional unit may comprise a plurality of buffers and a corresponding computation unit. The adder
functional unit 110 may comprise two input buffers for receiving two operands, a computation unit ALU for performing the addition operation and an output buffer for outputting the calculation result. The multiplierfunctional unit 120 may comprise two input buffers for receiving two operands, a computation unit MULT for performing the multiplication operation and an output buffer for outputting the calculation result. The accumulationfunctional unit 130 may comprise two input buffers for receiving two operands, a computation unit ACC for performing the accumulation operation and an output buffer for outputting the calculation result. The permutationfunctional unit 140 may comprise an input buffer for receiving input data, a computation unit PERM for performing the permutation operation and an output buffer for outputting the permutation result. - The
apparatus 100 further comprise aRAM load unit 150, aRAM store unit 160, a plurality of register file units, such as the multi-portregister file units 170, a plurality oflane store units 180, a plurality oflane load units 185 and a controlfunctional unit 190. TheRAM load unit 150 is configured to load data from anexternal RAM 50 in response to a corresponding load instruction. TheRAM store unit 160 is configured to store data (the results output by the fundamental functional units) into theexternal RAM 50 in response to a corresponding store instruction. A multi-portregister file unit 170 is disposed in each signal processing lane and comprises a plurality of configurable registers and vector registers provided for the fundamental functional units in the same signal processing lane to buffer data. Alane store unit 180 is disposed in each signal processing lane and is configured to provide data to be stored into theexternal RAM 50 to theRAM store unit 160. Alane load unit 185 is disposed in each signal processing lane and is configured to load data from theRAM load unit 150. The controlfunctional unit 190 is configured to perform scalar operations. Compared to the scalar operations, the vector-wise operations can be carried out via the fundamental functional units in multiple signal processing lanes. - As discussed above, a fundamental functional unit is configured to carry out a corresponding fundamental operation in response to a corresponding instruction (i.e. the general-purpose instruction). When performing the corresponding fundamental operation, the fundamental functional unit may access the data stored in the registers of the multi-port
register file unit 170 via the read ports, so as to load the data into the input buffer(s) thereof, perform the corresponding fundamental operation on the data, and store the result into the output buffer thereof. The output data may be stored into the corresponding registers of the multi-portregister file unit 170 via the write ports. Each fundamental functional unit may further comprise a dedicated controller for controlling the operation flow. -
FIG. 1A andFIG. 1B show only a portion of hardware devices for processing the data received by theapparatus 100. Regarding the processing of received instruction, theapparatus 100 may further comprise an instruction fetch unit (not shown) configured to fetch input instruction, an instruction memory (not shown) configured to store the received instructions, an instruction decode unit (not shown) configured to decode the received instructions, an instruction dispatch unit (not shown) configured to dispatch the instruction to the corresponding controller of each functional unit, and other control logics for processing the received instruction. - According to an embodiment of the invention, besides the fundamental functional units discussed above, the
apparatus 100 may further comprise one or more composite functional units, such as the compositefunctional unit 200 shown inFIG. 1A andFIG. 1B . In the embodiments of the invention, the composite functional units may use the fundamental functional units configured in multiple signal processing lanes to carry out a corresponding composite operation. To be more specific, multiple lanes of the same fundamental functional unit may be grouped to a composite functional unit to support the common digital signal processing algorithm as discussed above. Each composite functional unit may comprise a composite instruction controller, such as thecomposite instruction controller 250 shown inFIG. 1A andFIG. 1B . Thecomposite instruction controller 250 may be coupled to one or more fundamental functional units in the plurality of processing lanes. In response to a composite instruction received by theapparatus 100 and, after being decoded, dispatched to thecomposite instruction controller 250, thecomposite instruction controller 250 is configured to issue a plurality of control signals to control the fundamental functional units to perform their corresponding operations, so as to carry out the corresponding composite operation. - It should be noted that, unlike the co-processor design, in the embodiments of the invention, the buffers and the corresponding computation units of the fundamental functional units are shared with at least one composite functional unit, such as the composite
functional unit 200. In addition, in the embodiments of the invention, the buffers and the corresponding computation units of the fundamental functional units may be further shared among multiple composite functional units. In addition, in the embodiments of the invention, the vector registers, control registers, other general-purpose registers such as scalar data registers, instruction decode and dispatch pipeline of theapparatus 100 may also be shared among different functional units, including the fundamental functional units and the composite functional units. Since the hardware resources of theapparatus 100 can be shared among different functional units, including the fundamental functional units and the composite functional units, higher utilization of hardware resources can be achieved. - According to an embodiment of the invention, the composite operation carried out by the composite functional unit, such as the composite
functional unit 200, may be selected from a group comprising a Fast Fourier Transform (FFT), an inverse Fast Fourier Transform (IFFT), a Fast Hadamard Transform (FHT), a Finite Impulse Response (FIR) filtering with ramping, an FIR filtering without ramping, an auto-correlation, a cross-correlation and a vector multiplication. Therefore, a set of composite instructions supporting common digital signal processing algorithms can be added as part of the vector Instruction Set Architecture (ISA) of the apparatus 100 (e.g. the VDSP), and can be provided for the VDSP user to use directly (that is, the VDSP user can directly input the corresponding instruction to perform the corresponding calculation). - In addition, in the embodiments of the invention, one composite functional unit may be configured to carry out multiple composite operations with similar computation procedures. The composite functional units and their corresponding composite operations will be illustrated in more detail in the following paragraphs.
- According to a first embodiment of the invention, a composite functional unit may be configured to perform the FFT, IFFT and FHT operations.
-
FIG. 2 shows exemplary pseudo-codes to describe the operation control flow for performing the FFT operation via a 4-lane VDSP according to an embodiment of the invention. It should be noted that it can be easily extended to a VDSP with any number of lanes. - 
FIG. 3 shows an exemplary block diagram of a composite functional unit capable of performing the FFT, IFFT and FHT operations according to an embodiment of the invention. With reference to FIG. 2 and FIG. 3, the FFT/IFFT/FHT operations that can be carried out by the composite functional unit 300 are illustrated in more detail in the following paragraphs. - According to an embodiment of the invention, the ways to utilize the FFT, IFFT, and FHT instructions based on a radix-2 or radix-4 FFT/IFFT/FHT algorithm are provided below:
- FFT Vr_dest, Vr_src, Rctrl
- IFFT Vr_dest, Vr_src, Rctrl
- FHT Vr_dest, Vr_src, Rctrl
- The input parameter Vr_dest is the name of a destination vector register, the input parameter Vr_src is the name of a source vector register and the input parameter Rctrl is the name of a control register used to specify the size of vector register (i.e., the number of samples in one vector register) to be processed by FFT or IFFT or FHT. The destination Vector register, source vector register and control register are the register/vector registers in the multi-port
register file unit 170. - As shown in
FIG. 3 , the instruction decode anddispatch unit 60 may decode the instructions received by theapparatus 100, and dispatch the decoded result to the corresponding functional unit. In this embodiment, the instruction decode anddispatch unit 60 may provide the control signals: fft_start, op_code and vector_length to the controller (the composite instruction controller) 310. The control signal fft_start indicates start of the corresponding operation. The control signal op_code indicates which operation of the FFT, IFFT and FHT is to be performed and further indicates the names of the registers to be accessed. The control signal vector_length indicates the length of the vector to be processed. - The input data is loaded from the
external RAM 50 and then stored in the vector register file (VRF) 340 via theload unit 320. The output data is loaded from vector register file (VRF) 340 and then stored in theexternal RAM 50 via thestore unit 330. It should be noted that inFIG. 3 , theload unit 320 represents a combination of the functions of theRAM load unit 150 and thelane load unit 185 for simplicity. Similarly, thestore unit 330 represents a combination of the functions of theRAM store unit 160 and thelane store unit 180 for simplicity. The vector register file (VRF) 340 represents the vector registers, which are configured to facilitate performance of the FFT/IFFT/FHT operation via the instruction, in themulti-port register file 170 for simplicity. - The FFT/IFFT/FHT instructions use hardware resources in multiplier, accumulation, and permutation functional units. The
controller 310 is configured to generate control signals to conduct the data flow of the FFT, the IFFT and the FHT instructions based on the operation procedures required by FFT/IFFT/FHT algorithms. Thecontroller 310 comprises an FFT/IFFT/FHToperation control unit 311, an input dataaddress generation unit 312, an output dataaddress generation unit 313, a twiddle look-up tableaddress generation unit 314 and an output datapermutation control unit 315. The FFT/IFFT/FHToperation control unit 311 is configured to issue the control signals for controlling operations of the fundamental functional units, so as to control the multi-stage FFT/IFFT/FHT operation based on Decimation-in-Frequency (DIF) or Decimation-in-Time (DIT) or a mixed DIF/DIT FFT or FHT algorithm. The input dataaddress generation unit 312 is configured to generate an input data address for fetching data from the vector register file (VRF) 340 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code and providing fetched data to the input buffers of the corresponding functional units. The output dataaddress generation unit 313 is configured to generate an output data address for storing data fetched from the output buffers of the corresponding functional units to the vector register file (VRF) 340 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code. That is, the vector register file (VRF) 340 is configured to hold the source and destination data vector registers for the FFT/IFFT/FHT instructions. - The twiddle look-up table
address generation unit 314 is configured to generate the address of the twiddle factor look-up table (LUT) 305. The twiddle factor LUT 305 is configured to store the twiddle factors. The output data permutation control unit 315 is configured to generate a plurality of permutation control signals to utilize the permutation functional unit for re-ordering output data of the butterfly unit 400 as required by the FFT/IFFT/FHT algorithms. The butterfly unit 400 is configured to perform butterfly operations. - As per the operation control flow shown by the pseudo-code in
FIG. 2 , thecontroller 310 issues the read request (e.g. the VRF Rd Req) and provides the read address (e.g. the VRF Rd Addr) to preload the input data from the vector register file (VRF) 340 to the input buffers of the accumulation functional units 130 (shown as the input buffers (ACC functional unit) 350 inFIG. 3 ) and the input buffers of the multiplier functional units 120 (shown as the input buffers (MULT functional unit) 360 inFIG. 3 ). The input buffers inside the ACC/MULT/PERM functional units are configured to hold the input data to FFT/IFFT/FHT butterfly unit 400. Next, the input data is fetched from the input buffers (ACC functional unit) 350 and the input buffers (MULT functional unit) 360 and provided to thebutterfly unit 400. Thebutterfly unit 400 is configured to perform radix-2 or radix-4 parallel butterfly operation. It is to be noted that the parameter numStage in the pseudo-code represents the number of stage of the FFT operation, and the parameter N represents the length of the data samples for the FFT operation. The output data of thebutterfly unit 400 is provided to the output buffers of the permutation functional unit 140 (shown as the output buffers (PERM functional unit) 370 inFIG. 3 ). The output data of the PERM functional unit is further provided to the output buffers of the multiplier functional units 120 (shown as the output buffers (MULT functional unit) 380 inFIG. 3 ) and the output buffers of the accumulation functional units 130 (shown as the output buffers (ACC functional unit) 390 inFIG. 3 ). The output buffers inside the ACC/MULT/PERM functional units are configured to hold the output data to be saved back to vector register file (VRF) 340. If the output buffer is full, thecontroller 310 issues the write request (e.g. the VRF Wr Req) and provides the write address (e.g. the VRF Wr Addr) to write the output data to the vector register file (VRF) 340. 
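The pseudo-code of FIG. 2 is not reproduced in this text, so the following Python sketch only illustrates the general shape of such a control flow, using a textbook radix-2 decimation-in-frequency FFT rather than the disclosed implementation. The names `make_twiddle_lut`, `bit_reverse` and `fft_dif_radix2` are hypothetical; `num_stage` plays the role of numStage and `n` the role of N. The twiddle look-up and the final re-ordering loosely correspond to the roles described for the twiddle look-up table address generation and the output data permutation control.

```python
import cmath


def make_twiddle_lut(n):
    """Hypothetical twiddle LUT: W_n^k = exp(-2*pi*i*k/n) for k < n/2."""
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_dif_radix2(samples):
    """Radix-2 DIF FFT over a power-of-two number of samples."""
    n = len(samples)
    num_stage = n.bit_length() - 1  # log2(n) when n is a power of two
    if n == 0 or (1 << num_stage) != n:
        raise ValueError("length must be a power of two")
    lut = make_twiddle_lut(n)
    data = list(samples)
    half = n // 2
    for stage in range(num_stage):
        stride = 1 << stage  # LUT address step used in this stage
        for start in range(0, n, 2 * half):
            for k in range(half):
                a = data[start + k]
                b = data[start + k + half]
                data[start + k] = a + b                           # adder path
                data[start + k + half] = (a - b) * lut[k * stride]  # twiddle multiply
        half //= 2
    # DIF leaves the results in bit-reversed order; permute them back.
    return [data[bit_reverse(i, num_stage)] for i in range(n)]
```

In this sketch each pass of the outer loop corresponds to one FFT stage, and the inner loops enumerate the butterflies that a multi-lane design could process in parallel.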
- It should be noted that in the embodiments of the invention, the FFT/IFFT/FHT instructions can be executed in parallel with other normal (i.e. non-composite, or named general-purpose) instructions such as load and store instructions.
- According to an embodiment of the invention, at least part of the fundamental functional units, either in the same lane or in different lanes, are controlled by the composite instruction controller to carry out a butterfly operation that is required for the composite operation. The butterfly operation may be a radix-2 butterfly operation or a radix-4 butterfly operation. Several exemplary designs of the butterfly unit are illustrated in the following paragraphs.
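As an illustration of the radix-2 and radix-4 butterfly operations mentioned above, the following Python sketch uses the textbook decimation-in-frequency formulation. It is not derived from the figures of this disclosure; the function names `butterfly_radix2_dif` and `butterfly_radix4_dif`, the argument order and the default twiddle values are assumptions made for the example. In this formulation the radix-4 butterfly needs eight complex additions and three twiddle multiplications, and the radix-2 butterfly needs two additions and one multiplication.

```python
def butterfly_radix2_dif(x0, x1, w1=1):
    """Radix-2 DIF butterfly: two additions, one twiddle multiplication."""
    y0 = x0 + x1
    y1 = x0 - x1
    return y0, y1 * w1


def butterfly_radix4_dif(x0, x1, x2, x3, w1=1, w2=1, w3=1):
    """Radix-4 DIF butterfly: eight additions, three twiddle multiplications."""
    # First column of four adders.
    a0 = x0 + x2
    a1 = x1 + x3
    a2 = x0 - x2
    a3 = x1 - x3
    # Second column of four adders. Multiplying by -j or +j is a rotation
    # (swap of real and imaginary parts with a sign change), so it is not
    # counted as a multiplier here.
    b0 = a0 + a1
    b1 = a2 - 1j * a3
    b2 = a0 - a1
    b3 = a2 + 1j * a3
    # Twiddle multiplications on three of the four outputs.
    return b0, b1 * w1, b2 * w2, b3 * w3
```

With all twiddle factors left at 1, `butterfly_radix4_dif` returns the 4-point DFT of its inputs in natural order.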
-
FIG. 4A is an exemplary block diagram of the butterfly unit according to an embodiment of the invention. In FIG. 4A, the butterfly unit 400A is configured to execute the Decimation-in-Frequency (DIF) radix-4 butterfly operations, where x0˜x3, y0˜y3 and y′0˜y′3 represent the input/output data, w1˜w3 represent the twiddle factors, and z0˜z3 represent the output data. As shown in FIG. 4A, the butterfly unit 400A uses 8 complex adders and 3 complex multipliers for the multiplications by the twiddle factors. Note that the multipliers and twiddle factors are not used by the FHT instructions. The butterfly unit 400A uses four lanes of multiplier functional units 120 and accumulation functional units 130. The outputs of the accumulation functional units 130 and multiplier functional units 120 are provided to the pipeline registers of the corresponding signal processing lane. It should be noted that both the adder functional unit 110 and the accumulation functional unit 130 comprise adders as their hardware resources. Therefore, the butterfly unit 400A may also be designed to use four lanes of multiplier functional units 120 and adder functional units 110, and the invention should not be limited to any specific method of implementation. - It should be noted that in the architecture shown in
FIG. 4A , thebutterfly unit 400A has a cross-lane architecture, that is, the output data of fundamental functional unit in one lane is provided as the input data of the fundamental functional unit in another lane. As an example, the output data y1 of the accumulationfunctional unit 130 inlane 2 is provided as the input data of the accumulationfunctional unit 130 inlane 3. -
FIG. 4B is an exemplary block diagram of the butterfly unit according to another embodiment of the invention. In FIG. 4B, the butterfly unit 400B is configured to execute the DIF radix-2 butterfly operations, where x0˜x1 and y0˜y1 represent the input/output data, w1 represents the twiddle factor, and z0˜z1 represent the output data. As shown in FIG. 4B, the butterfly unit 400B uses 2 complex adders and 1 complex multiplier for the multiplication by the twiddle factor. Note that the multiplier and twiddle factor are not used by the FHT instructions. The butterfly unit 400B uses two lanes of multiplier functional units 120 and accumulation functional units 130. The outputs of the accumulation functional units 130 and multiplier functional units 120 are provided to the pipeline registers of the corresponding signal processing lane. It should be noted that the butterfly unit 400B may also be designed to use two lanes of multiplier functional units 120 and adder functional units 110, and the invention should not be limited to any specific method of implementation. - 
FIG. 5A is an exemplary block diagram of the butterfly unit according to yet another embodiment of the invention. In FIG. 5A, the butterfly unit 500A is configured to execute the Decimation-in-Time (DIT) radix-4 butterfly operations, where x0˜x3, x′0˜x′3 and y0˜y3 represent the input/output data, w1˜w3 represent the twiddle factors, and z0˜z3 represent the output data. As shown in FIG. 5A, the butterfly unit 500A uses 8 complex adders and 3 complex multipliers for the multiplications by the twiddle factors. The butterfly unit 500A uses four lanes of multiplier functional units 120 and accumulation functional units 130. The outputs of the accumulation functional units 130 and multiplier functional units 120 are provided to the pipeline registers of the corresponding signal processing lane. It should be noted that the butterfly unit 500A may also be designed to use four lanes of multiplier functional units 120 and adder functional units 110, and the invention should not be limited to any specific method of implementation. - It should be noted that in the architecture shown in
FIG. 5A , thebutterfly unit 500A has a cross-lane architecture, that is, the output data of fundamental functional unit in one lane is provided as the input data of the fundamental functional unit in another lane. As an example, the output data x′2 of the multiplierfunctional units 120 inlane 2 is provided as the input data of the accumulationfunctional unit 130 inlane 1. -
FIG. 5B is an exemplary block diagram of the butterfly unit according to yet another embodiment of the invention. InFIG. 5B , thebutterfly unit 500B is configured to execute the DIT Radix-2 butterfly operations, where the x0˜x1 and y0˜y1 represent the input/output data, the w1 represents twiddle factor, and the z0˜z1 represent the output data. As shown inFIG. 5B , thebutterfly unit 500B uses 2 complex adders and 1 complex multiplier for the multiplications of twiddle factor. Thebutterfly unit 500B uses two lanes of multiplierfunctional units 120 and accumulationfunctional units 130. The output of the accumulationfunctional units 130 and multiplierfunctional units 120 are provided to the pipeline registers of the corresponding signal processing lane. It should be noted that thebutterfly unit 500B may also be designed to use two lanes of multiplierfunctional units 120 and adderfunctional units 110, and the invention should not be limited to any specific method of implementation. - According to a second embodiment of the invention, a composite functional unit may be configured to perform the FIR filtering with ramping and FIR filtering without ramping operations.
-
FIG. 6 shows exemplary pseudo-codes to describe the operation control flow for performing the FIR filtering without ramping operation via a 4-lane VDSP according to an embodiment of the invention. It should be noted that it can be easily extended to the VDSP with any number of lanes. -
FIG. 7 shows an exemplary block diagram of a composite functional unit capable of performing the FIR filtering with ramping and FIR filtering without ramping operations according to an embodiment of the invention. AccompanyingFIG. 6 withFIG. 7 , the FIR filtering with ramping and FIR filtering without ramping operations that can be carried out by the compositefunctional unit 700 are illustrated in more detail in the following paragraphs. - According to an embodiment of the invention, the ways to utilize the FIR with/without ramping instructions are provided below:
- Fir Vr_dest, Vr_src1, Vr_src2, Rctrl
- FirNoRamp Vr_dest, Vr_src1, Vr_src2, Rctrl
- The Fir is the instruction to support FIR with ramping, and the FirNoRamp is the instruction to support FIR without ramping.
- The input parameter Vr_dest is the name of a destination vector register that holds the output data of the FIR filter, the input parameter Vr_src1 is the name of a source vector register that holds the input data to the FIR filter, the input parameter Vr_src2 is the name of a source vector register that holds the coefficients of the FIR filter and the input parameter Rctrl is the name of a control register used to specify the lengths of the input data vector and coefficient vector. The destination Vector register, source vector registers and control register are the register/vector registers in the multi-port
register file unit 170. - As shown in
FIG. 7 , the instruction decode anddispatch unit 60 may decode the instructions received by theapparatus 100, and dispatch the decoded result to the corresponding functional unit. In this embodiment, the instruction decode anddispatch unit 60 may provide the control signals: fir_start, op_code and vector_length to the controller (the composite instruction controller) 710. The control signal fir_start indicates start of the corresponding operation. The control signal op_code indicates which operation of the FIR and FIR without ramping is to be performed and further indicates the names of the registers to be accessed. The control signal vector_length indicates the length of the vector to be processed. - The input data is loaded from the
external RAM 50 and then stored in the vector register file (VRF) 740 via theload unit 720. The output data is loaded from vector register file (VRF) 740 and then stored in theexternal RAM 50 via thestore unit 730. It should be noted that, inFIG. 7 , theload unit 720 represents a combination of the functions of theRAM load unit 150 and thelane load unit 185 for simplicity. Similarly, thestore unit 730 represents a combination of the functions of theRAM store unit 160 and thelane store unit 180 for simplicity. The vector register file (VRF) 740 represents the vector registers, which are configured to facilitate performance of the FIR or FIR without ramping operation via the instruction, in themulti-port register file 170 for simplicity. - The FIR related instructions use hardware resources in multiplier, accumulation, and permutation functional units. The
controller 710 is configured to generate control signals to conduct the data flow of the Fir and FirNoRamp instructions based on the operation procedures required by the FIR with ramping and FIR without ramping algorithms. The controller 710 comprises a Fir/FirNoRamp operation control unit 711, an input data address generation unit 712, an output data address generation unit 713 and an input data shift control unit 714. The Fir/FirNoRamp operation control unit 711 is configured to issue the control signals for controlling operations of the fundamental functional units, so as to control the FIR operation with or without ramping. The input data address generation unit 712 is configured to generate an input data address for fetching data from the vector register file (VRF) 740 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code and providing fetched data to the input buffers of the corresponding functional units. The output data address generation unit 713 is configured to generate an output data address for storing data fetched from the output buffers of the corresponding functional units to the vector register file (VRF) 740 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code. That is, the vector register file (VRF) 740 is configured to hold the source and destination data vector registers for the Fir/FirNoRamp instructions. The input data shift control unit 714 is configured to generate a plurality of shift control signals to shift the input data vector for supporting the FIR algorithm. - The FIR instruction (Fir) is to calculate:
-
$y(k)=\sum_{j=0}^{N-1} x(k-j)\,a(N-j-1), \quad k=0,1,\ldots,(L+N-2)$ - The FIR instruction without ramping (FirNoRamp) is to calculate:
-
$y(k)=\sum_{j=0}^{N-1} x(k+j)\,a(j), \quad k=0,1,\ldots,(L-N)$ - The x(k) represents the input data, the a(j) represents the coefficients, the y(k) represents the FIR result, the L represents the length of the data vector and the N represents the length of the filter.
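The two formulas above can be checked with a short reference model. The Python sketch below evaluates them literally; the function names `fir` and `fir_no_ramp` are hypothetical, and treating input samples outside the range 0..L-1 as zero in the ramping case is an assumption, since the text does not state how out-of-range samples are handled.

```python
def fir(x, a):
    """FIR with ramping: y(k) = sum_j x(k-j) * a(N-j-1), k = 0..L+N-2."""
    length, taps = len(x), len(a)

    def sample(i):
        # Assumption: samples outside 0..L-1 contribute zero (ramp-up/ramp-down).
        return x[i] if 0 <= i < length else 0

    return [
        sum(sample(k - j) * a[taps - j - 1] for j in range(taps))
        for k in range(length + taps - 1)
    ]


def fir_no_ramp(x, a):
    """FIR without ramping: y(k) = sum_j x(k+j) * a(j), k = 0..L-N."""
    length, taps = len(x), len(a)
    return [
        sum(x[k + j] * a[j] for j in range(taps))
        for k in range(length - taps + 1)
    ]
```

Substituting i = N-j-1 in the first formula shows that the output without ramping equals the output with ramping after dropping the first N-1 and last N-1 samples, that is, `fir(x, a)[N-1:L]` matches `fir_no_ramp(x, a)`.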
- As per operation control flow shown by the pseudo-code in
FIG. 6 , thecontroller 710 issues the read request (e.g. the VRF Rd Req) and provides the read address (e.g. the VRF Rd Addr) to load the input data and the coefficients from the vector register file (VRF) 740 to the input buffers of the permutation functional units 140 (shown as the input buffers (PERM functional unit) 750 inFIG. 7 ) and the input buffers of the multiplier functional units 120 (shown as the input buffers (MULT functional unit) 770 inFIG. 7 ). The input data is loaded to the input buffers (PERM functional unit) 750, and the coefficients are loaded to the input buffers (MULT functional unit) 770. The input data shifter of the permutation functional units 140 (shown as the input data shifter (PERM functional unit) 760 inFIG. 7 ) is configured to perform the down-shift operation on the input data. The shifted input data is multiplied by the coefficients via the multiplierfunctional units 120 and the output data of the multiplierfunctional units 120 is then provided to the accumulationfunctional units 130 via the pipeline registers 780. The accumulationfunctional units 130 perform the accumulation calculations on the received data and store the calculation result in the output buffers thereof (shown as the output buffers (ACC functional unit) 790 inFIG. 7 ). The output buffers inside the ACC functional units are configured to hold the output data to be saved back to vector register file (VRF) 740. Thecontroller 710 issues the write request (e.g. the VRF Wr Req) and provides the write address (e.g. the VRF Wr Addr) to write the output data to the vector register file (VRF) 740. - It should be noted that in the embodiments of the invention, the Fir/FirNoRamp instructions can be executed in parallel with other normal (non-composite) instructions such as load and store instructions.
- According to a third embodiment of the invention, a composite functional unit may be configured to perform the auto-correlation, cross-correlation and vector multiplication operations.
-
FIG. 8A is a schematic diagram showing the operation of auto-correlation according to an embodiment of the invention.FIG. 8B is a schematic diagram showing the operation of cross-correlation/vector multiplication according to an embodiment of the invention. The X(k) and Y(k) represent the input data, the R(k) represents the correlation/multiplication result. -
FIG. 9 shows an exemplary block diagram of a composite functional unit capable of performing the auto-correlation, cross-correlation and vector multiplication operations according to an embodiment of the invention. With reference to FIG. 8A, FIG. 8B and FIG. 9, the auto-correlation, cross-correlation and vector multiplication operations that can be carried out by the composite functional unit 900 are illustrated in more detail in the following paragraphs. - According to an embodiment of the invention, the ways to utilize the vector correlation related instructions are provided below:
- AutoCorr Vr_dest, Vr_src1, Rctrl
- CrossCorr Vr_dest, Vr_src1, Vr_src2, Rctrl
- VecByMat Vr_dest, Vr_src1, Vr_src2, Rctrl
- The AutoCorr is the instruction to support auto-correlation of a data vector. The CrossCorr is the instruction to support cross-correlation of two data vectors. The VecByMat is the instruction to support the multiplication of a vector with a matrix, which can be realized as a simplified form of cross-correlation of two data vectors when the matrix is stored in a vector register in either row-major or column-major format.
- The input parameter Vr_dest is the name of a destination vector register that holds the output data, the input parameter Vr_src1 is the name of a source vector register that holds one data vector, the input parameter Vr_src2 is the name of a source vector register that holds one data vector (or a Matrix for the VecByMat instruction) and the input parameter Rctrl is the name of a control register used to specify the lengths of the input data vectors. The destination Vector register, source vector registers and control register are the register/vector registers in the multi-port
register file unit 170. - As shown in
FIG. 9 , the instruction decode anddispatch unit 60 may decode the instructions received by theapparatus 100, and dispatch the decoded result to the corresponding functional unit. In this embodiment, the instruction decode anddispatch unit 60 may provide the control signals: corr_start, op_code and vector_length to the controller (the composite instruction controller) 910. The control signal corr_start indicates start of the corresponding operation. The control signal op_code indicates which operation of the auto-correlation, cross-correlation and vector multiplication is to be performed and further indicates the names of the registers to be accessed. The control signal vector_length indicates the length of the vector to be processed. - The input data is loaded from the
external RAM 50 and then stored in the vector register file (VRF) 940 via the load unit 920. The output data is loaded from the vector register file (VRF) 940 and then stored in the external RAM 50 via the store unit 930. It should be noted that in FIG. 9, the load unit 920 represents a combination of the functions of the RAM load unit 150 and the lane load unit 185 for simplicity. Similarly, the store unit 930 represents a combination of the functions of the RAM store unit 160 and the lane store unit 180 for simplicity. The vector register file (VRF) 940 represents the vector registers, which are configured to facilitate performance of the auto-correlation, cross-correlation or vector multiplication operation via the instruction, in the multi-port register file 170 for simplicity. - The correlation related instructions use hardware resources in multiplier, accumulation, and permutation functional units. The
controller 910 is configured to generate control signals to conduct the data flow of the AutoCorr, CrossCorr and VecByMat instructions based on the operation procedures required by the auto-correlation, cross-correlation and vector multiplication algorithms. Thecontroller 910 comprises an AutoCorr/CrossCorr/VecByMatoperation control unit 911, an input dataaddress generation unit 912, an output dataaddress generation unit 913 and an input data shiftcontrol unit 914. The AutoCorr/CrossCorr/VecByMatoperation control unit 911 is configured to issue the control signals for controlling the correlation operation flow for different instructions. The input dataaddress generation unit 912 is configured to generate an input data address for fetching data from the vector register file (VRF) 940 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code and providing fetched data to the input buffers of the corresponding functional units. The output dataaddress generation unit 913 is configured to generate an output data address for storing data fetched from the output buffers of the corresponding functional units to the vector register file (VRF) 940 (i.e. the multi-port register file unit) based on the register name carried in the control signal op_code. That is, the vector register file (VRF) 940 is configured to hold the source and destination data vector registers for the AutoCorr/CrossCorr/VecByMat instructions. The input data shiftcontrol unit 914 is configured to generate a plurality of shift control signals to shift input data vector for supporting the correlation algorithms. - The auto-correlation instruction (AutoCorr) is to calculate:
-
$R(k)=\sum_{j=0}^{N-1} x(k+j)\,x(j), \quad k=0,1,\ldots,(M-1)$ - The cross-correlation instruction (CrossCorr) is to calculate:
-
$R(k)=\sum_{j=0}^{N-1} x(k+j)\,y(j), \quad k=0,1,\ldots,(M-1)$ - The vector-by-matrix multiplication instruction (VecByMat) is to calculate (assuming x holds the matrix and y holds the vector):
-
$R(k)=\sum_{j=0}^{N-1} x(kN+j)\,y(j), \quad k=0,1,\ldots,(M-1)$ - The x(k) and y(j) represent the input data and the R(k) represents the calculation result, the N represents the length of the input vector (or the number of rows of the input matrix, which can be the same as the size of the input vector), and the M represents the length of the output data vector (or the number of columns of the input matrix, which can be the same as the size of the output data vector).
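The three formulas above share one multiply-accumulate pattern and differ only in how x is indexed. The Python sketch below evaluates them literally as a reference model; the function names `cross_corr`, `auto_corr` and `vec_by_mat` and the explicit `n` and `m` arguments are hypothetical stand-ins for the lengths that the text says are specified through the control register. It assumes x holds at least M+N-1 samples for the correlations and M*N samples for the vector-by-matrix case.

```python
def cross_corr(x, y, n, m):
    """R(k) = sum_{j<n} x(k+j) * y(j) for k = 0..m-1."""
    return [sum(x[k + j] * y[j] for j in range(n)) for k in range(m)]


def auto_corr(x, n, m):
    """R(k) = sum_{j<n} x(k+j) * x(j): cross-correlation of x with itself."""
    return cross_corr(x, x, n, m)


def vec_by_mat(x, y, n, m):
    """R(k) = sum_{j<n} x(k*n+j) * y(j), with the matrix flattened into x."""
    return [sum(x[k * n + j] * y[j] for j in range(n)) for k in range(m)]
```

The only difference between `cross_corr` and `vec_by_mat` is the step between successive windows of x (1 versus n), which is consistent with the description of VecByMat as a simplified form of cross-correlation.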
- The
controller 910 issues the read request (e.g. the VRF Rd Req) and provides the read address (e.g. the VRF Rd Addr) to load the input data and the coefficients from the vector register file (VRF) 940 to the input buffers of the permutation functional units 140 (shown as the input buffers (PERM functional unit) 950 inFIG. 9 ) and the input buffers of the multiplier functional units 120 (shown as the input buffers (MULT functional unit) 970 inFIG. 9 ). The input data shifter of the permutation functional units 140 (shown as the input data shifter (PERM functional unit) 960 inFIG. 9 ) is configured to perform shift operation on the input data. The shifted input data is multiplied by the input data y(j) via the multiplierfunctional units 120 and the output data of the multiplierfunctional units 120 is then provided to the accumulationfunctional units 130 via the pipeline registers 980. The accumulationfunctional units 130 perform the cross-lane accumulation calculations on the received data via the cross-lane ACC registers 990. The calculation result is stored in the output buffers thereof (shown as the output buffers (ACC functional unit) 995 inFIG. 9 ). The output buffers inside the ACC functional units are configured to hold the output data to be saved back to vector register file (VRF) 940. Thecontroller 910 issues the write request (e.g. the VRF Wr Req) and provides the write address (e.g. the VRF Wr Addr) to write the output data to the vector register file (VRF) 940. - It should be noted that in the embodiments of the invention, the AutoCorr/CrossCorr/VecByMat instructions can be executed in parallel with other normal (non-composite) instructions such as load and store instructions.
- It should also be noted that in the architecture shown in
FIG. 9, the composite functional unit 900 has a cross-lane architecture, that is, the accumulation functional units 130 are configured to perform the cross-lane accumulation calculations. - As discussed above, in the embodiments of the invention, a single composite instruction (such as an FFT, IFFT, FHT, Fir, FirNoRamp, AutoCorr, CrossCorr, VecByMat, etc.) can support a complex algorithm which would otherwise be realized by a software subroutine or micro-code in a software solution design. It should be noted that, unlike the software solution, in which a “function call” is created by combining multiple general-purpose instructions in a software subroutine or micro-code, in the embodiments of the invention a single composite instruction is implemented. For a “function call”, the software control overhead is the main drawback and the code size is large. By contrast, a composite instruction has no such software control overhead or code size problem, since the VDSP users do not have to create any function by themselves or perform any further software or micro-code programming, and can directly use the corresponding instruction to perform the corresponding calculation.
- In addition, a composite instruction in the VDSP can achieve the same performance as a dedicated co-processor while sharing the same hardware resources in the VDSP with other normal (non-composite) instructions.
- Therefore, the technical effects that may be achieved by this invention include: 1) reduced software code size when using the composite instruction to realize common algorithms, as compared to the software solution, 2) better performance (higher data throughput) due to reduced control overhead, as compared to the software solution and 3) higher utilization of hardware resources, as compared to the co-processor solution.
- Use of ordinal terms such as “first”, “second”, etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for use of the ordinal term) to distinguish the claim elements.
- While the invention has been described by way of example and in terms of preferred embodiment, it is to be understood that the invention is not limited thereto. Those who are skilled in this technology can still make various alterations and modifications without departing from the scope and spirit of this invention. Therefore, the scope of the present invention shall be defined and protected by the following claims and their equivalents.
Claims (17)
1. An apparatus, comprising:
a plurality of signal processing lanes, each signal processing lane comprising:
a first fundamental functional unit;
a second fundamental functional unit; and
a register file unit, comprising a plurality of configurable vector registers; and
a composite instruction controller, coupled to the first fundamental functional units and the second fundamental functional units in the plurality of signal processing lanes, and is configured to issue a plurality of control signals in response to a composite instruction to control the first fundamental functional units and the second fundamental functional units and thereby carry out a composite operation.
2. The apparatus as claimed in claim 1, wherein each of the first fundamental functional unit and the second fundamental functional unit is capable of carrying out an operation selected from a group comprising an addition, a multiplication, an accumulation and a permutation.
3. The apparatus as claimed in claim 1, wherein the composite operation is selected from a group comprising a Fast Fourier Transform (FFT), an inverse Fast Fourier Transform (iFFT), a Fast Hadamard Transform (FHT), a Finite Impulse Response (FIR) filtering with ramping, an FIR filtering without ramping, an auto-correlation, a cross-correlation and a vector-by-matrix multiplication.
4. The apparatus as claimed in claim 1, wherein the composite instruction controller comprises:
an operation control unit, configured to issue the control signals;
an input data address generation unit, configured to generate an input data address for fetching data from the register file unit; and
an output data address generation unit, configured to generate an output data address for storing data to the register file unit.
5. The apparatus as claimed in claim 1, wherein at least part of the first fundamental functional units and the second fundamental functional units are controlled to carry out a butterfly operation.
6. The apparatus as claimed in claim 5, wherein the composite instruction controller further comprises:
an output data permutation control unit, configured to generate a plurality of permutation control signals for re-ordering data outputted from the butterfly operations.
7. The apparatus as claimed in claim 5, wherein the butterfly operation is a radix-2 butterfly operation or a radix-4 butterfly operation.
8. The apparatus as claimed in claim 4, wherein the composite instruction controller further comprises:
an input data shift control unit, configured to generate a plurality of shift control signals to shift an input data vector.
9. An apparatus, comprising:
a plurality of signal processing lanes, each signal processing lane comprising:
a first fundamental functional unit, comprising a plurality of first buffers and a first computation unit;
a second fundamental functional unit, comprising a plurality of second buffers and a second computation unit; and
a register file unit, comprising a plurality of configurable vector registers; and
a first composite instruction controller, configured to issue a plurality of control signals in response to a first composite instruction to control the plurality of first buffers and the first computation unit of the first fundamental functional units and the plurality of second buffers and the second computation unit of the second fundamental functional units in the plurality of signal processing lanes and thereby carry out a first composite operation.
10. The apparatus as claimed in claim 9, further comprising:
a second composite instruction controller, configured to issue a plurality of control signals in response to a second composite instruction to control the plurality of first buffers and the first computation unit of the first fundamental functional units and the plurality of second buffers and the second computation unit of the second fundamental functional units in the plurality of signal processing lanes and thereby carrying out a second composite operation.
11. The apparatus as claimed in claim 9, wherein each of the first fundamental functional unit and the second fundamental functional unit is capable of carrying out an operation selected from a group comprising an addition, a multiplication, an accumulation and a permutation.
12. The apparatus as claimed in claim 9, wherein the first composite operation is selected from a group comprising a Fast Fourier Transform (FFT), an inverse Fast Fourier Transform (iFFT), a Fast Hadamard Transform (FHT), a Finite Impulse Response (FIR) filtering with ramping, an FIR filtering without ramping, an auto-correlation, a cross-correlation and a vector-by-matrix multiplication.
13. The apparatus as claimed in claim 9, wherein the first composite instruction controller comprises:
an operation control unit, configured to issue the control signals;
an input data address generation unit, configured to generate an input data address for fetching data from the register file unit; and
an output data address generation unit, configured to generate an output data address for storing data to the register file unit.
14. The apparatus as claimed in claim 9, wherein at least part of the first fundamental functional units and the second fundamental functional units are controlled to carry out a butterfly operation.
15. The apparatus as claimed in claim 14, wherein the composite instruction controller further comprises:
an output data permutation control unit, configured to generate a plurality of permutation control signals for re-ordering data outputted from the butterfly operations.
16. The apparatus as claimed in claim 14, wherein the butterfly operation is a radix-2 butterfly operation or a radix-4 butterfly operation.
17. The apparatus as claimed in claim 13, wherein the composite instruction controller further comprises:
an input data shift control unit, configured to generate a plurality of shift control signals to shift an input data vector.
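Claims 5 to 7 and 14 to 16 recite butterfly operations whose outputs are re-ordered by permutation control signals. As a rough software analogy, a radix-2 decimation-in-frequency FFT leaves its results in bit-reversed order, so a final permutation is needed to restore natural order. The sketch below is an illustration in Python under stated assumptions; the names `butterfly_radix2`, `bit_reverse_permutation` and `fft_radix2_dif` are hypothetical, and the claims do not state that the permutation is a bit reversal or that the decimation-in-frequency form is used.

```python
import cmath
from typing import List, Sequence, Tuple


def butterfly_radix2(a: complex, b: complex, w: complex) -> Tuple[complex, complex]:
    """One radix-2 decimation-in-frequency butterfly with twiddle factor w."""
    return a + b, (a - b) * w


def bit_reverse_permutation(n: int) -> List[int]:
    """Return perm where perm[i] is i with its log2(n) bits reversed."""
    bits = n.bit_length() - 1
    perm = []
    for i in range(n):
        r = 0
        for b in range(bits):
            if i & (1 << b):
                r |= 1 << (bits - 1 - b)
        perm.append(r)
    return perm


def fft_radix2_dif(x: Sequence[complex]) -> List[complex]:
    """In-place radix-2 DIF FFT followed by an output re-ordering step."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = [complex(v) for v in x]
    span = n // 2
    while span >= 1:
        # Each group of 2*span points is processed by span butterflies.
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))
                i0, i1 = start + j, start + j + span
                data[i0], data[i1] = butterfly_radix2(data[i0], data[i1], w)
        span //= 2
    # The butterflies leave X[k] at the bit-reversed index of k;
    # this permutation plays the role of the output re-ordering.
    perm = bit_reverse_permutation(n)
    return [data[perm[k]] for k in range(n)]


if __name__ == "__main__":
    print(fft_radix2_dif([1, 2, 3, 4]))
```

Keeping the permutation as a separate final step mirrors the separation in the claims between the butterfly operations and the unit that generates permutation control signals, although the hardware mapping itself is not modeled here.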
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/120,645 US20190073337A1 (en) | 2017-09-05 | 2018-09-04 | Apparatuses capable of providing composite instructions in the instruction set architecture of a processor |
TW108128199A TW202011184A (en) | 2017-09-05 | 2019-08-08 | Apparatuses capable of providing composite instructions in the instruction set architecture of a processor |
CN201910734826.3A CN110874240A (en) | 2017-09-05 | 2019-08-09 | Apparatus capable of providing compound instructions in an instruction set architecture of a processor |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762554052P | 2017-09-05 | 2017-09-05 | |
US16/120,645 US20190073337A1 (en) | 2017-09-05 | 2018-09-04 | Apparatuses capable of providing composite instructions in the instruction set architecture of a processor |
Publications (1)
Publication Number | Publication Date |
---|---|
US20190073337A1 (en) | 2019-03-07 |
Family
ID=65517368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/120,645 Abandoned US20190073337A1 (en) | 2017-09-05 | 2018-09-04 | Apparatuses capable of providing composite instructions in the instruction set architecture of a processor |
Country Status (3)
Country | Link |
---|---|
US (1) | US20190073337A1 (en) |
CN (1) | CN110874240A (en) |
TW (1) | TW202011184A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113111300A (en) * | 2020-01-13 | 2021-07-13 | 上海大学 | Fixed point FFT implementation architecture with optimized resource consumption |
US11237831B2 (en) * | 2013-07-15 | 2022-02-01 | Texas Instruments Incorporated | Method and apparatus for permuting streamed data elements |
US11397624B2 (en) * | 2019-01-22 | 2022-07-26 | Arm Limited | Execution of cross-lane operations in data processing systems |
US11568523B1 (en) * | 2020-03-03 | 2023-01-31 | Nvidia Corporation | Techniques to perform fast fourier transform |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112506468B (en) * | 2020-12-09 | 2023-04-28 | 上海交通大学 | RISC-V general processor supporting high throughput multi-precision multiplication operation |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5235686A (en) * | 1987-02-24 | 1993-08-10 | Texas Instruments Incorporated | Computer system having mixed macrocode and microcode |
US6366937B1 (en) * | 1999-03-11 | 2002-04-02 | Hitachi America Ltd. | System and method for performing a fast fourier transform using a matrix-vector multiply instruction |
US20030110347A1 (en) * | 1998-06-25 | 2003-06-12 | Alva Henderson | Variable word length data memory using shared address source for multiple arrays |
US6832306B1 (en) * | 1999-10-25 | 2004-12-14 | Intel Corporation | Method and apparatus for a unified RISC/DSP pipeline controller for both reduced instruction set computer (RISC) control instructions and digital signal processing (DSP) instructions |
US20050055543A1 (en) * | 2003-09-05 | 2005-03-10 | Moyer William C. | Data processing system using independent memory and register operand size specifiers and method thereof |
US20070106720A1 (en) * | 2005-11-10 | 2007-05-10 | Samsung Electronics Co., Ltd. | Reconfigurable signal processor architecture using multiple complex multiply-accumulate units |
US20080147760A1 (en) * | 2006-12-18 | 2008-06-19 | Broadcom Corporation | System and method for performing accelerated finite impulse response filtering operations in a microprocessor |
US7860177B2 (en) * | 2007-08-28 | 2010-12-28 | Mediatek Inc. | Receiver detecting signals based on spectrum characteristic and detecting method thereof |
US20130148694A1 (en) * | 2011-12-13 | 2013-06-13 | Qualcomm Incorporated | Dual Fixed Geometry Fast Fourier Transform (FFT) |
US10339095B2 (en) * | 2015-02-02 | 2019-07-02 | Optimum Semiconductor Technologies Inc. | Vector processor configured to operate on variable length vectors using digital signal processing instructions |
- 2018-09-04: US application US16/120,645, published as US20190073337A1 (en), not active (Abandoned)
- 2019-08-08: TW application TW108128199A, published as TW202011184A (en), status unknown
- 2019-08-09: CN application CN201910734826.3A, published as CN110874240A (en), not active (Withdrawn)
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11237831B2 (en) * | 2013-07-15 | 2022-02-01 | Texas Instruments Incorporated | Method and apparatus for permuting streamed data elements |
US11669463B2 (en) | 2013-07-15 | 2023-06-06 | Texas Instruments Incorporated | Method and apparatus for permuting streamed data elements |
US11397624B2 (en) * | 2019-01-22 | 2022-07-26 | Arm Limited | Execution of cross-lane operations in data processing systems |
CN113111300A (en) * | 2020-01-13 | 2021-07-13 | 上海大学 | Fixed point FFT implementation architecture with optimized resource consumption |
US11568523B1 (en) * | 2020-03-03 | 2023-01-31 | Nvidia Corporation | Techniques to perform fast fourier transform |
Also Published As
Publication number | Publication date |
---|---|
CN110874240A (en) | 2020-03-10 |
TW202011184A (en) | 2020-03-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | AS | Assignment | Owner name: MEDIATEK SINGAPORE PTE. LTD., SINGAPORE. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: XU, LIANG; KUO, MING-CHIEH; REEL/FRAME: 046777/0009. Effective date: 20180830 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |