WO2022061788A1 - Versatile systolic array for maximum likelihood MIMO detectors - Google Patents

Versatile systolic array for maximum likelihood MIMO detectors

Info

Publication number
WO2022061788A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing elements
array
spatial array
decomposition
perform
Prior art date
Application number
PCT/CN2020/117947
Other languages
French (fr)
Inventor
Hong Cheng
Xu Zhang
Original Assignee
Intel Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corporation
Priority to PCT/CN2020/117947
Publication of WO2022061788A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 25/00: Baseband systems
    • H04L 25/02: Details; arrangements for supplying electrical power along data transmission lines
    • H04L 25/03: Shaping networks in transmitter or receiver, e.g. adaptive shaping networks
    • H04L 25/03891: Spatial equalizers
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04B: TRANSMISSION
    • H04B 7/00: Radio transmission systems, i.e. using radiation field
    • H04B 7/02: Diversity systems; Multi-antenna systems, i.e. transmission or reception using multiple antennas
    • H04B 7/04: Diversity systems using two or more spaced independent antennas
    • H04B 7/06: Diversity systems using two or more spaced independent antennas at the transmitting station
    • H04B 7/0613: Diversity systems using two or more spaced independent antennas at the transmitting station using simultaneous transmission
    • H04B 7/0615: Diversity systems using simultaneous transmission of weighted versions of the same signal
    • H04B 7/0617: Diversity systems using simultaneous transmission of weighted versions of the same signal for beam forming

Definitions

  • the present disclosure relates generally to a programmable spatial array that can rapidly and efficiently support a K-best maximum likelihood detector (MLD) for multiple-input multiple-output (MIMO) wireless communication.
  • Integrated circuit devices are found in numerous electronic devices, many of which may perform wireless communication.
  • electronic devices may perform multiple-input multiple-output (MIMO) wireless communication, which may be used in wireless baseband systems for 5G wireless communication.
  • the throughput of a wireless baseband system depends heavily on the error performance of its MIMO detector. With a low-error-rate detector, an electronic device may transfer data using a higher Modulation and Coding Scheme (MCS), as well as more layers.
  • a Maximum Likelihood Detector is one solution in a stochastic sense.
  • A variant of MLD called K-best is often used.
  • K-best MLD has very low error and can fulfil the goals of many different kinds of wireless baseband systems.
  • However, the computational complexity of K-best MLD may be substantially higher than that of other linear, lower-performing detectors such as zero-forcing (ZF) and minimum mean squared error (MMSE).
  • In 5G, the baseband system may become much more complicated and power-consuming than that of 4G. Therefore, the hardware utilization and energy efficiency of a 5G MIMO detector may have an outsized impact on overall system performance. Even so, there are many other computations that may be performed in a wireless base station, such as Cholesky decomposition, matrix multiplication, and linear equation solving.
  • FIG. 1 is a block diagram of a system that includes an integrated circuit having a programmable spatial array processor, in accordance with an embodiment
  • FIG. 2 is a block diagram of another system that includes an integrated circuit having a programmable spatial array processor, in accordance with an embodiment
  • FIG. 3 is a high-level block diagram of the programmable spatial array processor, in accordance with an embodiment
  • FIG. 4 is a block diagram illustrating a manner in which a batch of matrices may be pipelined through the programmable spatial array processor, in accordance with an embodiment
  • FIG. 5 is a block diagram of a processing element array of the programmable spatial array processor, in accordance with an embodiment
  • FIG. 6 is a diagram of data flow through the processing element array, in accordance with an embodiment
  • FIG. 7 is a block diagram of an example architecture of a multiply-accumulate (M) processing element (PE) of the processing element array, in accordance with an embodiment
  • FIG. 8 is a data flow diagram of one manner of feeding data into the processing element array if the processing elements lacked a data queue
  • FIG. 9 is a data flow diagram of one manner of feeding data into the processing element array using data queues in respective processing elements, in accordance with an embodiment
  • FIG. 10 is a block diagram of an example architecture of a diagonal (D) processing element (PE) of the processing element array, in accordance with an embodiment
  • FIG. 11 is a flow diagram illustrating a method of pipelining operations, even on different matrices, using the diagonal (D) processing element (PE) , in accordance with an embodiment
  • FIG. 12 is a block diagram illustrating a data flow through the example architecture of the diagonal (D) processing element (PE) of the processing element array, in accordance with an embodiment
  • FIG. 13 is a block diagram showing a propagation of instructions through different processing elements of the processing element array, in accordance with an embodiment
  • FIG. 14 is a block diagram showing a propagation of instructions through multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment
  • FIG. 15 is a block diagram illustrating delays for propagation of instructions through the multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment
  • FIG. 16 is a block diagram illustrating the use of time-to-live (TTL) on instructions propagated through the multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment
  • FIG. 17 is a block diagram illustrating a propagation of instructions through diagonal (D) processing elements (PEs) and vector (V) processing elements (PEs) of the processing element array, in accordance with an embodiment
  • FIG. 18 is a block diagram illustrating a set of instructions that may be stored in a common instruction memory for all or several multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment
  • FIG. 19 is a block diagram of a main buffer that feeds the processing element array, in accordance with an embodiment
  • FIG. 20 is a block diagram of a delay alignment buffer that aligns results that were output by the processing element array staggered in time, in accordance with an embodiment
  • FIG. 21 is an example data structure of an instruction that may program multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment
  • FIG. 22 is an example data structure of an assembly code for multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment
  • FIG. 23 is a block diagram illustrating types of computations that may be carried out by a diagonal (D) processing element (PE) and a multiply-accumulate (M) processing element (PE) of the processing element array to perform Cholesky decomposition, in accordance with an embodiment;
  • FIG. 24 is a block diagram of computations that may be carried out by the processing element array to perform Cholesky decomposition, in accordance with an embodiment
  • FIG. 25 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform LU decomposition, in accordance with an embodiment
  • FIG. 26 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform pre-filtering for Cholesky-based minimum mean square error (MMSE) , in accordance with an embodiment
  • FIG. 27 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform back substitution and V*Z for Cholesky-based minimum mean square error (MMSE) , in accordance with an embodiment
  • FIG. 28 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform V^H * (VZ) for Cholesky-based minimum mean square error (MMSE), in accordance with an embodiment
  • FIG. 29 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform Givens-rotation QR based minimum mean square error (MMSE) (GR-QRD) , in accordance with an embodiment;
  • FIG. 30 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform back substitution for GR-QRD, in accordance with an embodiment
  • FIG. 31 is a block diagram illustrating a manner of performing interleaved batch GR-QRD using the processing element array, in accordance with an embodiment
  • FIG. 32 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform Gram-Schmidt QR decomposition, in accordance with an embodiment
  • FIG. 33 is a diagram illustrating a description of a multiple-input multiple-output (MIMO) wireless communication system on which a K-best maximum likelihood detector (MLD) is applied, in accordance with an embodiment
  • FIG. 34 is a diagram of a computation in one layer in decoding tree traverse, in accordance with an embodiment
  • FIG. 35 is an overview of a systolic array structure using two connected planes of (or one multiplexed) programmable spatial arrays of processing elements to perform a K-best maximum likelihood detector (MLD) computation for multiple-input multiple-output (MIMO) wireless communication, in accordance with an embodiment
  • FIG. 36 is a diagram showing QR decomposition implemented on a first plane of the programmable spatial arrays, in accordance with an embodiment
  • FIG. 37 is a diagram showing a function (e.g., program, configuration) of diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment
  • FIG. 38 is a diagram of an internal functional arrangement (e.g., program, configuration) of the diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment
  • FIG. 39 is a diagram showing a function (e.g., program, configuration) of off-diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment
  • FIG. 40 is a diagram of an internal functional arrangement (e.g., program, configuration) of the off-diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment
  • FIG. 41 is a diagram illustrating a rotation function that may be carried out in the diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment
  • FIG. 42 is a diagram illustrating communication of data from processing elements of the first plane to processing elements of the second plane, in accordance with an embodiment
  • FIG. 43 is a diagram showing decoding tree traverse implemented on a second plane of the programmable spatial arrays, in accordance with an embodiment
  • FIG. 44 is a diagram showing a function (e.g., program, configuration) of diagonal processing elements of the second plane of the programmable spatial arrays, in accordance with an embodiment
  • FIG. 45 is a diagram of an internal functional arrangement (e.g., program, configuration) of the diagonal processing elements of the second plane of the programmable spatial arrays, in accordance with an embodiment.
  • FIG. 46 is a diagram showing a function (e.g., program, configuration) of off-diagonal processing elements of the second plane of the programmable spatial arrays, in accordance with an embodiment.
  • the term “or” is intended to be inclusive (e.g., logical OR) and not exclusive (e.g., logical XOR) .
  • the phrase A “or” B is intended to mean A, B, or both A and B.
  • this disclosure describes various data structures, such as instructions for an instruction set architecture. These are described as having certain domains (e.g., fields) and corresponding numbers of bits. However, it should be understood that these domains and sizes in bits are meant as examples and are not intended to be exclusive. Indeed, the data structures (e.g., instructions) of this disclosure may take any suitable form.
  • An integrated circuit, such as an application specific integrated circuit (ASIC) or a programmable logic device (PLD) like a field programmable gate array (FPGA), may be part of an electronic device that performs wireless communications, machine learning, or many other tasks. These tasks may involve performing matrix decompositions.
  • matrix decomposition is widely used in wireless communication, machine learning, and other areas.
  • Multiple-input multiple-output (MIMO) detection in 5G wireless systems, multivariate linear regressions in machine learning, systems of linear equations, matrix inversions and determinant calculations, and many other applications involve performing matrix decompositions.
  • Different types of matrix decompositions include LU decomposition, QR decomposition, and Cholesky decomposition.
  • this disclosure provides a programmable spatial array processor that can be programmed to compute a variety of different types of matrix decompositions.
  • the programmable spatial array processor has a two-dimensional upper triangular Processing Element (PE) array which acts as a high throughput engine. Every PE executes under instructions that provide programmability to support different modes.
  • matrix decompositions are more complicated than matrix multiplication.
  • the latter may generally use multiplication and addition operations and may have little or no data dependency among operations.
  • Matrix decompositions may have many data dependencies. This may cause one operation to have to wait for the result of another operation to be ready, which makes it difficult to handle data in parallel.
  • matrix decomposition usually has arithmetic operations other than multiplication, such as division and square root.
  • the programmable spatial array processor of this disclosure may use a control scheme that can mitigate the challenges of the data dependency of the various PEs in solving matrix decompositions.
  • an Instruction Share and Propagation (ISP) scheme may control all PEs efficiently. Instructions may be shared by certain PEs and propagated through them. This may substantially reduce the size or complexity of the instruction memory. Indeed, instructions may flow through the array in a systolic-like way, just like the data flow. All non-diagonal PEs may share the same instructions. This may (a) reduce instruction memory from N²/2 to 2 and (b) allow instructions to transfer between adjacent PEs so that a long control path may be avoided.
  • the programmability of the programmable spatial array processor may enable a fast switch between two different types of matrix operation.
  • the array of the programmable spatial array processor may simply be fed with new instructions for new matrix operation. Additional reset or reconfiguration time may be avoided, enabling transitions to computing different types of matrix decomposition to occur rapidly and seamlessly.
  • the programmable spatial array processor may also support widely used matrix operations like back substitution, matrix-vector multiplication, multiplying a matrix by its transpose (A^T A), and so on.
  • the programmable spatial array processor may have a triangular arrangement that, compared to a square array, may cut hardware resource usage nearly in half.
  • FIG. 1 illustrates a block diagram of a system 10 that may implement a programmable spatial array processor.
  • a designer may desire to implement functionality, such as the programmable spatial array processor of this disclosure, on an integrated circuit device 12 (such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC) ) .
  • the designer may specify a high-level program to be implemented, such as an OpenCL program, which may enable the designer to more efficiently and easily provide programming instructions to configure a set of programmable logic cells for the integrated circuit device 12 without specific knowledge of low-level hardware description languages (e.g., Verilog or VHDL) .
  • Because OpenCL is quite similar to other high-level programming languages, such as C++, designers of programmable logic familiar with such programming languages may have a reduced learning curve compared to designers that are required to learn unfamiliar low-level hardware description languages to implement new functionalities in the integrated circuit device 12.
  • Design software 14 may use a compiler 16 to convert the high-level program into a lower-level description.
  • the compiler 16 may provide machine-readable instructions representative of the high-level program to a host 18 and the integrated circuit device 12.
  • the host 18 may include any suitable processing circuitry and may receive a host program 22 which may be implemented by the kernel programs 20.
  • the host 18 may communicate instructions from the host program 22 to the integrated circuit device 12 via a communications link 24, which may be, for example, direct memory access (DMA) communications or peripheral component interconnect express (PCIe) communications.
  • a designer may use the design software 14 to generate and/or to specify a low-level program, such as the low-level hardware description languages described above.
  • the system 10 may be implemented without a separate host program 22.
  • the techniques described herein may be implemented in circuitry as hardened IP that is not programmed into a programmable logic device. Thus, embodiments described herein are intended to be illustrative and not limiting.
  • the kernel programs 20 may enable configuration of a programmable spatial array processor 26 on the integrated circuit device 12.
  • the programmable spatial array processor 26 may represent a circuit design of the kernel program 20 that is configured onto the integrated circuit device 12 (e.g., formed in soft logic) .
  • the programmable spatial array processor 26 may be partially or fully formed in hardened circuitry (e.g., application-specific circuitry of the integrated circuit 12 that is not configurable as programmable logic) .
  • the host 18 may use the communication link 24 to cause the programmable spatial array processor 26 to decompose matrices according to any suitable matrix decomposition type.
  • the programmable spatial array processor 26 may be used to perform matrix decomposition to detect or transmit a signal for multiple-input multiple-output (MIMO) communication via antennas 28.
  • the programmable spatial array processor 26 may be a component included in a data processing system 40, as shown in FIG. 2.
  • the data processing system 40 may include a host processor 42 (e.g., a central-processing unit (CPU) ) , memory and/or storage circuitry 44, and a network interface 46.
  • the data processing system 40 may include more or fewer components (e.g., electronic display, user interface structures, application specific integrated circuits (ASICs) ) .
  • the host processor 42 may include any suitable processor, such as a reduced-instruction processor (e.g., a reduced instruction set computer (RISC) processor, an Advanced RISC Machine (ARM) processor), that may manage a data processing request for the data processing system 40 (e.g., to perform encryption, decryption, machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, sensing or transmitting using a phased array, communicating via a MIMO wireless system, or the like).
  • the memory and/or storage circuitry 44 may include random access memory (RAM) , read-only memory (ROM) , one or more hard drives, flash memory, or the like.
  • the memory and/or storage circuitry 44 may hold data to be processed by the data processing system 40. In some cases, the memory and/or storage circuitry 44 may also store configuration programs (bitstreams) for programming a programmable logic device that may hold the programmable spatial array processor 26. The memory and/or storage circuitry 44 may, additionally or alternatively, store instructions to program the programmable spatial array processor 26.
  • the network interface 46 may allow the data processing system 40 to communicate with other electronic devices.
  • the data processing system 40 may include several different packages or may be contained within a single package on a single package substrate.
  • the antennas 28 may be a component of the network interface 46 or may be used by the network interface 46 to receive or transmit signals in particular spatial directions.
  • the data processing system 40 may be part of a data center that processes a variety of different requests.
  • the data processing system 40 may receive a data processing request via the network interface 46 to perform encryption, decryption, machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, digital signal processing, or some other specialized task.
  • Some or all of the components of the data processing system 40 may be virtual machine components running on physical circuitry (e.g., managed by one or more hypervisors or virtual machine managers) . Whether physical components or virtual machine components, the various components of the data processing system 40 may be located in the same location or different locations (e.g., on different boards, in different rooms, at different geographic locations) .
  • the data processing system 40 may be accessible via a computing service provider (CSP) that may provide an interface to customers to use the data processing system 40 (e.g., to run programs and/or perform acceleration tasks) in a cloud computing environment.
  • FIG. 3 shows a top-level block diagram of the programmable spatial array processor 26. Data flow is shown in first hatching 60, control flow is shown in second hatching 62, instruction flow is shown in third hatching 64, and computation is shown in fourth hatching 66.
  • Input data 68 streams into a main buffer 70 first, then may flow 72 to a spatial array 74 that includes a processing element (PE) array 76 and instruction memory 78 that hold instructions to control processing elements of the PE array 76.
  • the instruction memory 78 may represent separate memories for each different type of processing element of the PE array 76.
  • When the PE array 76 is available, the input data 68 enters the PE array 76. After calculation in the PE array 76, results 80 stream into a delay alignment buffer 82 for data rearrangement.
  • The output of the delay alignment buffer 82 goes to an output port 84 as output data 86 or loops back via a feedback path 88 to the main buffer 70 as intermediate data 90.
  • the second hatching 62 shows the control signal flow.
  • Control instructions 92 may enter a control instruction decoder 94 to be distributed to the main buffer 70, the spatial array 74, and the delay alignment buffer 82.
  • the third hatching 64 shows an instruction preload flow. Instruction load commands 96 may take an instruction preload path 96 to the main buffer 70, the spatial array 74, and the delay alignment buffer 82.
  • the input data 68 may take any suitable form, including a matrix or vector format with throughput of one matrix row (column) per clock cycle.
  • a block of the input data 68 may contain a batch of matrices to utilize the pipeline capability of PE array 76 and improve average throughput.
  • Any suitable quantity of matrices or vectors may be used in a batch (e.g., 2, 3, 4, 5, 6, 7, 8, 16, 32, 64, 100, 128, 200, 256, 500, 512, 1000, 1024, or more or fewer) .
  • For example, 32 consecutive matrices may form a batch; in this case, the batch size is 32.
  • a batch of three input matrices 100A, 102A, 104A may be input to the PE array 76 through the main buffer 70.
  • the PE array 76 may compute result matrices 100B, 102B, and 104B in a pipelined manner.
  • the result matrices 100B, 102B, and 104B may overlap one another in time.
  • later parts of the result matrix 100B computed from the input matrix 100A overlap with earlier parts of the result matrix 102B computed from the input matrix 102A.
  • later parts of the result matrix 102B overlap with earlier parts of the result matrix 104B computed from the input matrix 104A.
  • the delay alignment buffer 82 removes these latencies to produce aligned output matrices 100C, 102C, and 104C.
  • the core part of the programmable spatial array processor 26 is the two-dimensional processing element (PE) array 76.
  • the PE array 76 has an upper triangle form to achieve high utilization efficiency, since most matrix decompositions lead to triangular result matrices.
  • the PE array 76 includes at least three types of processing elements: diagonal (D) processing elements (PEs) 110, multiply-accumulate (M) processing elements (PEs) 112, and vector (V) processing elements (PEs) 114.
  • the overall dataflow direction is rightward and downward.
  • Input matrices (X) and vectors (V) stream into the PE array 76 from the upper side.
  • the PE array 76 outputs the results (Y) to the right side.
  • the PEs 110, 112, and 114 accept data from an upper side or left side, perform some operations and output the results to a bottom or right side.
  • the M PEs 112 mainly perform multiplication and accumulation (MAC) operations, and the M PEs 112 form the upper triangular part of a square N-by-N array, where N may be any suitable number.
  • the M PEs 112 may be considered an internal processing element type of the processing element array 76, since they are bounded to the left and right by the D PEs 110 and the V PEs 114.
  • Multiplication and accumulation (MAC) operations are abundant in matrix operations.
  • the V PEs 114 located at the rightmost column handle vector-related operations like matrix-vector multiplication.
  • the V PEs 114 may have the same or a similar internal hardware structure as the M PEs 112.
  • the main difference between the V PEs 114 and the M PEs 112 is that they run under different instructions (with different behaviors) .
  • the D PEs 110 may include more compute resources than the M PEs 112, since the diagonal elements may perform more complicated computations than non-diagonal elements in most matrix decomposition cases.
  • the D PEs 110 may include some MAC units and other math function (such as inverse square root) units, or may include units that perform certain specific operations.
  • the PE array 76 structure may achieve a relatively high operating clock frequency, since each PE 110, 112, or 114 may only connect with adjacent PEs 110, 112, or 114. This means that there may be no long routing path or that the routing paths between PEs 110, 112, and 114 may be sufficiently similar so as to have similar (e.g., equal) latencies. And this structure may relatively easily scale up to a large array size.
  • FIG. 6 illustrates a data flow through the PE array 76.
  • FIG. 6 provides an example of an X^T X (X-transpose multiplied by X) calculation. Every column 120, 122, 124, ..., 126 of the input matrix X goes downward through each M PE 112 of that column and turns right when it meets the D PE 110. The respective M PEs 112 calculate the inner product of their upper and left inputs. In addition to the original data propagation path, there is a result data (inner product, in this case) propagation path going through the rows of the PEs 110, 112, and 114. Final results are output to the right side as the Y matrix.
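  • For illustration, a minimal Python sketch of this dataflow follows (a software emulation under simplifying assumptions, not the hardware itself): each upper-triangular PE position (i, j) accumulates the inner product of columns i and j of X to form Y = X^T X.

```python
# Minimal software emulation (illustrative only, not the hardware): each
# upper-triangular PE position (i, j) accumulates the inner product of
# column i (left input) and column j (upper input) of X, as in FIG. 6.
import numpy as np

def xtx_upper_triangular(X: np.ndarray) -> np.ndarray:
    n = X.shape[1]
    Y = np.zeros((n, n), dtype=X.dtype)
    for i in range(n):                # array row (fed from the left after the D PE turn)
        for j in range(i, n):         # only upper-triangular PEs exist
            acc = 0.0
            for x_left, x_up in zip(X[:, i], X[:, j]):
                acc += x_left * x_up  # one MAC per streamed element pair
            Y[i, j] = acc
    return Y

X = np.random.randn(6, 4)
assert np.allclose(xtx_upper_triangular(X), np.triu(X.T @ X))
```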
  • Example architectures of the PEs 110, 112, and 114 will be described below. It should be appreciated that these are intended to be illustrative and not exhaustive. Indeed, the PEs 110, 112, and 114 may take any suitable form and have any suitable architectures.
  • Multiply-accumulate (M) PE 112 architecture: One example architecture of an M PE 112 appears in FIG. 7.
  • the M PE 112 includes several main components:
  • An instruction decoder 140 which receives input instructions in_instr and translates them into control (Ctrl) signals to control the computational flow of the M PE 112.
  • a delay block 142 may hold the instructions while computations are performed before propagating the instructions to a neighboring M PE 112. Note that the instruction flow for the various PEs 110, 112, and 114 will be discussed further below.
  • Routing circuits for interface and internal signals which may include multiplexers (MUXes) 144, 146, 148, 150, 152, 154, and 156 and latches 158, 160, and 162.
  • An arithmetic logic unit (ALU) 164, which may perform arithmetic operations.
  • the ALU 164 may be a complex number ALU (e.g., CMAC or CALU) .
  • Data inverters 166, 168, and 170 may be used to invert various input data before processing in the ALU 164 or instead of processing in the ALU 164. Some data may be passed without any processing.
  • a register file (RF) 172 which may include any suitable number of registers to store data.
  • a data queue 174 which may buffer data from an upper side input.
  • the ALU 164 may perform arithmetic operations such as add, multiply, multiply-add, multiply-accumulate, and so on. It may be implemented in complex form (named CMAC or CALU) to support complex number arithmetic that is widely used in wireless communication systems.
  • the inputs of the ALU 164 can have multiple sources, such as input ports, the register file (RF) 172, or the data queue 174.
  • the input and output interfaces shown in FIG. 7 may include:
  • the data queue 174 is used to buffer upper input data, since the left input data may arrive later than the upper input data.
  • One way to handle this delay gap is to input the input data in a staggered way, as shown in FIG. 8. Each input sequence is delayed to meet the systolic propagation pattern.
  • With the data queue 174, the M PE 112 may avoid the effort of rearranging input data and provide flexibility to handle many different delay offset patterns of different algorithms.
  • the data queue method shown in FIG. 9 may involve more buffering resources compared to the staggered input scheme of FIG. 8. But the data queue method of FIG. 9 may reduce consumption of buffering resources in the main buffer 70 of the programmable spatial array processor 26 (FIG. 3) .
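  • For illustration, the following sketch models one hypothetical M PE with such a data queue: upper inputs that arrive before their matching left inputs are buffered, so the surrounding buffers do not need to pre-stagger the streams as in FIG. 8. The deque-based model and the delay parameter are assumptions made only for this example.

```python
# Hypothetical model of one M PE with a data queue (FIG. 9 style): the upper
# input arrives every cycle, the left input arrives 'delay' cycles later, and
# the queue holds upper data until its partner arrives for the MAC.
from collections import deque

def mac_with_queue(upper_stream, left_stream, delay):
    queue, acc, results = deque(), 0.0, []
    upper = list(upper_stream)
    left = [None] * delay + list(left_stream)   # left input lags by 'delay' cycles
    for t in range(max(len(upper), len(left))):
        if t < len(upper):
            queue.append(upper[t])              # buffer the early upper data
        if t < len(left) and left[t] is not None:
            acc += queue.popleft() * left[t]    # MAC once a pair is available
            results.append(acc)
    return results

print(mac_with_queue([1, 2, 3, 4], [10, 20, 30, 40], delay=2))
# running inner product [10, 50, 140, 300]; the queue depth never exceeds 'delay'
```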
  • Diagonal (D) PE 110 architecture: Since the D PEs 110 may handle more complicated calculations than an M PE 112, the D PEs 110 may have more functional units.
  • the D PE 110 may receive input instructions (in_instr) that are translated and distributed by an instruction decoder 190.
  • a delay block 192 may hold the instructions while computations are performed before propagating the instructions to a neighboring D PE 110. Note that the instruction flow for the various PEs 110, 112, and 114 will be discussed further below.
  • the instructions may represent control signals for an issue slot architecture
  • each issue slot 194, 196, 198, 200, and 202 performs one kind of operation. Any suitable number of issue slots and register files may be used, and it should be understood that the number and types shown in FIG. 10 are provided by way of example for illustrative purposes.
  • Each issue slot 194, 196, 198, 200, and 202 can receive data from an input port (U_dat) and send data to an output port (R_dat) .
  • the issue slots may operate as follows:
  • Input slot 194 stores the input data into the register files (RFs).
  • Isqrt slot 196 computes the inverse square root.
  • Other operations, like square root and division, can be calculated using the Isqrt result. The MAC slots 198 and 200 perform multiply-accumulate operations.
  • Output slot 202 generates output data from RFs or other issue slots.
  • Multiple issue slots in a D PE 110 can work in a pipelined manner to achieve high throughput. Take Cholesky decomposition, for example.
  • the process includes inverse square root (Isqrt) from the Isqrt slot 196 and multiplications in the issue slot 198, which use the result from the Isqrt slot 196.
  • the issue slot 196 may perform a first inverse square root operation 230 on a first matrix (Matrix 1) at a first time.
  • the issue slot 196 may perform a second inverse square root operation 232 on a second matrix (Matrix 2) in parallel while the issue slot 198 performs a first multiply-accumulate operation 234 on the first matrix (Matrix 1) using the results of the operation 230.
  • the issue slot 196 may perform a third inverse square root operation 236 on a third matrix (Matrix 3) in parallel while the issue slot 198 performs a second multiply-accumulate operation 238 on the second matrix (Matrix 2) using the results of the operation 232.
  • the issue slot 198 may perform a third multiply-accumulate operation 240 on the third matrix (Matrix 3) using the results of operation 236.
  • The corresponding dataflow is shown in FIG. 12, indicated by dashed lines: first, the input data goes through issue slot 1 (IS1) 194 into RF1 204 and issue slot 2 (IS2) 196; then IS2 196 performs the inverse square root and writes the result into RF2 206; and then issue slot 3 (IS3) 198 reads data from RF1 204 and RF2 206 to perform multiplication and outputs the results as R_dat.
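  • The sketch below illustrates the FIG. 11 pipelining in software terms (the schedule and the use of scalars in place of whole matrices are simplifying assumptions): at each step the Isqrt slot starts a new matrix while the MAC slot consumes the previous matrix's Isqrt result.

```python
# Assumed software view of the FIG. 11 pipelining: at step t the Isqrt slot
# (IS2) starts matrix t while the MAC slot (IS3) consumes the Isqrt result of
# matrix t-1 from the register file. Scalars stand in for whole matrices.
import math

def pipelined_dpe(values):
    rf2, outputs = {}, []                      # RF2 holds Isqrt results per matrix
    for step in range(len(values) + 1):
        if step < len(values):                 # IS2: inverse square root of matrix 'step'
            rf2[step] = 1.0 / math.sqrt(values[step])
        if step >= 1:                          # IS3: multiply using the previous Isqrt result
            outputs.append(values[step - 1] * rf2[step - 1])
        print(f"t={step}: IS2 -> matrix {step if step < len(values) else '-'}, "
              f"IS3 -> matrix {step - 1 if step >= 1 else '-'}")
    return outputs

print(pipelined_dpe([4.0, 9.0, 16.0]))         # values[k] / sqrt(values[k]) == sqrt(values[k])
```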
  • all PEs in the PE array 76 are controlled by instructions that may be stored in the instruction memory 78, which may represent separate memories for the different varieties of processing elements (PE) 110, 112, and 114.
  • a well-designed instruction set can support more general arithmetic operations.
  • An example of a suitable Instruction Set Architecture (ISA) will be discussed in the Instruction Set Architecture (ISA) section discussed further below. This section mainly focuses on how to efficiently distribute instructions to all PEs 110, 112, and 114 in the PE array 76.
  • One straightforward way would be to use a central control unit to generate all the instructions and distribute them to all PEs. However, this may lead to high fan-out on the control paths and high instruction memory utilization.
  • Instruction Share and Propagation may overcome some of the challenges mentioned above (e.g., avoiding such high fan-out and high memory utilization problems) .
  • the design of Instruction Share and Propagation (ISP) is made possible because the M PEs 112 generally execute the same or similar programs as one another, and likewise the D PEs 110, with only a time offset and slight code differences. For instance, in a Cholesky decomposition procedure, every M PE 112 may execute the same first instruction but at different start times, and almost the same remaining instructions except that some of them may be ignored, as shown in FIG. 13.
  • NOP means no operation is needed. It can be seen that one more instruction (e.g., instruction 2) is ignored (NOP) for every one step to the right in a row, and every M PE 112 in one column has the same instructions. These regularities enable the use of Instruction Share and Propagation (ISP).
  • the similarity of instruction executions among M PEs 112 may allow Instruction Share and Propagation (ISP) to use as few as one instruction memory 270 that contains the programs that all M PEs 112 share.
  • Instruction Share and Propagation (ISP) propagates each instruction to all M PEs 112. One instruction is read from instruction memory 270 and sent to all rows of the PE array 76, which propagates to all M PEs 112.
  • the start time of instruction execution of each M PE 112 is different.
  • the delay of instruction arrival to each M PE 112 will be different and varies among functions.
  • the instruction delay between two adjacent M PEs 112 in one row may be 1 or 2 cycles (or more, as desired) .
  • the instruction delay between two adjacent rows of M PEs 112 could be many more cycles.
  • instruction queues 282 for the rows of M PEs 112 may implement the delay offset (e.g., some number of cycles) between array rows of M PEs 112, and right side propagation delay can be set to 1 or 2 cycles.
  • A time-to-live (TTL) value may accompany each propagated instruction to indicate whether it should still be executed, with TTL_R applying to horizontal propagation and TTL_D applying to vertical propagation, as illustrated in FIG. 16.
  • FIG. 17 illustrates Instruction Share and Propagation (ISP) for all of the PEs of the PE array 76.
  • In addition to the instruction memory 270 and instruction queues 282 for the M PEs 112, FIG. 17 shows instruction memory 290 for the D PEs 110 with corresponding instruction queues 292 and instruction memory 294 for the V PEs 114 with corresponding instruction queues 296.
  • Each instruction is read from its respective instruction memory 290, 270, and 294 and propagated to all related PEs 110, 112, and 114.
  • the instruction queues 282, 292, and 296 insert a desired delay between two adjacent rows, referred to as the vertical delay.
  • The delay between two adjacent M PEs 112 in a row, called the horizontal delay, may be set to 1 or 2 cycles.
  • FIG. 18 illustrates example instructions stored in the instruction memory 270 for the M PEs 112.
  • a special instruction may be used to set vertical delay and horizontal delay, which may be referred to as a Propagation Delay Setting (PDS) instruction 300.
  • the PDS instruction 300 may be located in a particular place (e.g., the first place) in a program containing any suitable number N other instructions 302, 304, ..., 306.
  • the PDS instruction 300 propagates to all M PEs 112 like other instructions and may set the delay values for each M PE 112. In the example of FIG. 18, the PDS instruction 300 includes control (Ctrl) bits, some bits that indicate the vertical delay (e.g., between 1 and 20 or more bits), some bits that indicate the horizontal delay (e.g., between 1 and 10 bits), which may be fewer than the number of bits that indicate the vertical delay, and some bits that indicate the mode of the instruction (here, PDS).
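  • The sketch below illustrates the ISP idea under simple assumptions (the arrival-time formula and the TTL rule are illustrative choices, not taken verbatim from the patent): one shared instruction reaches M PE (row, col) after row vertical-delay steps plus (col - row) horizontal-delay steps, and a horizontal TTL marks far-right PEs as NOP.

```python
# Illustrative model of Instruction Share and Propagation (ISP); the arrival
# formula and the TTL rule are assumptions consistent with the description.
# One shared instruction leaves the instruction memory at cycle 0 and reaches
# M PE (row, col) after row * vertical_delay + (col - row) * horizontal_delay
# cycles; a horizontal time-to-live (TTL_R) turns it into a NOP beyond a
# certain number of rightward hops.
def isp_arrival(n, vertical_delay, horizontal_delay, ttl_r):
    schedule = {}
    for row in range(n):
        for col in range(row, n):                       # upper-triangular array
            hops_right = col - row
            arrival = row * vertical_delay + hops_right * horizontal_delay
            schedule[(row, col)] = ("NOP" if hops_right >= ttl_r else "EXEC", arrival)
    return schedule

for pe, (action, cycle) in sorted(isp_arrival(4, vertical_delay=3,
                                              horizontal_delay=1, ttl_r=3).items()):
    print(f"M PE {pe}: {action} at cycle {cycle}")
```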
  • FIG. 19 shows a block diagram of the main buffer 70.
  • the main buffer 70 uses instructions translated by an instruction decoder 304 to serve as a data buffer for input data 306 or inner loop-back intermediate data 90 (e.g., from the feedback path 88 shown in FIGS. 3 and 4) .
  • the instruction decoder 304 may decode instructions into, for example, instructions for matrix size, batch size, and function to be performed.
  • the core of the main buffer 70 includes N first-in first-out (FIFO) buffers 310 for matrix buffering and at least one FIFO buffer 312 for vector buffering.
  • the main buffer 70 supports N+1 parallel data reads and writes (where N is the size of one matrix row or column).
  • the write control blocks 314 and 316 may generate access signals to the FIFOs 310 and 312 by controlling routing circuitry (e.g., multiplexers (MUXes) ) 322, 324, 326, and 328.
  • the write control block 314 may generate access signals 330 and 332 using indications val_M, val_V, start of packet (sop) , and end of packet (eop) corresponding to the input data 306.
  • the write control block 316 may generate access signals 334 and 336 using indications val_M, val_V, start of packet (sop) , and end of packet (eop) corresponding to the loop-back intermediate data 90.
  • the read control block 318 may generate access signals 338 and 340.
  • Monitor circuitry 342 may provide error and ready signals.
  • a parallel to serial (P2S) block 344 may convert a parallel vector into serial form for storage in the FIFO 312.
  • Input data with a length of N, in the form of one row or column of a matrix, may be fed into the N FIFOs 310, and data read from the N FIFOs 310 may be sent to the PE array 76 as one matrix row or column.
  • the write and read control blocks 314, 316, and 318 are used to generate FIFO access signals (e.g., 330, 332, 334, 336, 338, and 340) . Some specific data like an identity matrix can also be generated by the read control block 318.
  • the memory 320 may store the FIFO access patterns of each operation (e.g., each type of matrix decomposition) .
  • the memory 320 may store read patterns. Table 1 provides one example of a read pattern.
  • Table 2 illustrates one example instruction structure for the instructions of Table 1.
  • FIG. 20 shows a block diagram of the delay alignment buffer 82. Similar to the main buffer 70, the delay alignment buffer 82 uses instructions translated by an instruction decoder 350 to align input data 352 that is received from the PE array 76. The delay alignment buffer 82 may output the aligned data as output data 354 or as the inner loop-back intermediate data 90 (e.g., to the feedback path 88 shown in FIGS. 3 and 4). The instruction decoder 350 may decode instructions into, for example, instructions for matrix size, batch size, and function to be performed. The core of the delay alignment buffer 82 includes N first-in first-out (FIFO) buffers 356 for matrix buffering. The pattern of input data 352 to the delay alignment buffer 82 is staggered due to the different delays of the PE array rows, which may make the write control logic of a write control block 358 more complex than that of the main buffer 70.
  • a read control block 360 is used to make sure that the outputs of all FIFOs 356 are aligned.
  • the write control block 358 may receive instructions from delay buffer write instruction memory 362 indicating access patterns for the current application and from the instruction decoder 350 indicating, for example, matrix size (size_matrix) , batch size (size_batch) , and the function that was performed (Function) .
  • the write control block 358 may generate write enable (wr_en) signals to write into the FIFOs 356 and a start read (start_rd) signal for the read control block 360.
  • the write control block 358 may trigger some or all of these signals upon receipt of a start of packet (sop) signal corresponding to the input data 352.
  • the read control block 360 may use the start_rd signal from the write control block 358 and instructions from a delay buffer read instruction memory 364 indicating access patterns for the current application.
  • the read control block 360 may also use signals from the instruction decoder 350 indicating, for example, matrix size (size_matrix), batch size (size_batch), and the function that was performed (Function).
  • Monitor circuitry 366 may provide error signals.
  • Example instructions that may be stored in the delay buffer write instruction memory 362 are shown below in Table 3. One such instruction can serve for a write process for one batch of matrices.
  • Instructions in the delay buffer read instruction memory 364 may be organized as shown below in Table 4.
  • the instructions may be described as shown below in Table 5.
  • the delay alignment buffer 82 may use the N FIFOs 356 to buffer both matrices and vectors.
  • the input data 352 from the PE array 76 arrive in a staggered pattern, which is different from that of the main buffer 70.
  • the write control block 358 is responsible for writing the data into alignment addresses of all the FIFOs 356.
  • the read control block 360 causes data to be read from the FIFOs 356 and sent to the output port as output data 354 or looped back to the main buffer 70 as the loop-back intermediate data 90.
  • FIG. 21 shows one example of a data structure for an instruction 380 for an arithmetic mode of operating the M PEs 112.
  • the data structure for the instruction 380 is shown to include a number of different possible domains 382 represented by a corresponding number of bits 384. Table 6, Table 7, and Table 8 describe each domain of the instruction.
  • FIG. 22 provides an example of ASMMPE code 390 and its mapping to a binary instruction.
  • the ASMMPE code 390 includes a first part 392 that describes the main arithmetic operation and an output destination, a second part 394 that describes routing signals (e.g., a source of the data and delay of the output data), and a third part 396 that describes the time-to-live (TTL) values, with a separator between the parts.
  • Table 9 illustrates corresponding binary instructions for the assembly code 390 of FIG. 22.
  • Table 10 provides various keywords that may be used by the ASMMPE language.
  • each instruction for a D PE 110 includes five sub-instructions, where each sub-instruction belongs to one issue slot.
  • each issue slot runs just under its corresponding sub-instruction, and time offsets among multiple issue slots may also be specified by the program.
  • Table 11, Table 12, Table 13, and Table 14 show an example instruction structure of the four kinds of issue slot discussed above with reference to FIGS. 10-12.
  • Each issue slot instruction may also include a time-to-live (TTL) domain to indicate whether that instruction should be executed or ignored (e.g., as NOP) .
  • the TTL of an instruction for a D PE 110 may have a data structure as described in Table 15.
  • the programmable spatial array processor 26 may be programmable to perform a wide variety of types of matrix decompositions. This section will describe the following types of matrix decompositions:
  • the D PE 110 and M PE 112 may have a dataflow as illustrated in FIG. 23.
  • Sequential input signals come from the upper or left side, and output signals go out the lower or right side.
  • The equation in the circle or square shows the calculation that the D PE 110 and M PE 112 perform.
  • the dashed arrows are the paths of result (output) signals.
  • For Cholesky decomposition, A is given as a positive definite Hermitian matrix and is factored as A = L L^H, where L is a lower triangular matrix.
  • FIG. 24 illustrates the dataflow of Cholesky decomposition in the PE array 76 of the programmable spatial array processor 26.
  • the result L_{i,j} is buffered in the PEs 110 and 112.
  • the horizontal instruction propagation delay may be set to 2 for Cholesky decomposition.
  • the assembly codes of Cholesky decomposition are as below. Namely, the assembly code for Cholesky decomposition for a D PE 110 is shown in Table 16 and the assembly code for Cholesky decomposition for an M PE 112 is shown in Table 17.
  • Tables 16 and 17 list the per-issue-slot assembly (IS1 input, Isqrt IS2, MAC IS3, MAC IS4, Output IS5), including operations such as mv RF1_1, U_dat; isqrt RF2_1, U_dat; cmul RF3_1, RF1_1, RF2_1; and latch 2.
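  • For reference, a plain column-oriented Cholesky factorization is sketched below (a software analogue under assumptions, not the PE-array dataflow itself): the diagonal step mirrors the D PE's inverse square root and multiply, and the inner updates mirror the M PEs' MAC operations.

```python
# Software analogue (assumed, for reference) of the Cholesky dataflow: the
# diagonal step mirrors the D PE's inverse square root and multiply, and the
# inner updates mirror the M PEs' multiply-accumulate operations. A = L L^H.
import numpy as np

def cholesky_lower(A: np.ndarray) -> np.ndarray:
    n = A.shape[0]
    L = np.zeros_like(A, dtype=complex)
    for j in range(n):
        diag = A[j, j] - np.sum(np.abs(L[j, :j]) ** 2)   # D PE: update the diagonal term
        inv_sqrt = 1.0 / np.sqrt(diag.real)              # D PE: Isqrt issue slot
        L[j, j] = diag.real * inv_sqrt                   # equals sqrt(diag)
        for i in range(j + 1, n):
            s = A[i, j] - L[i, :j] @ L[j, :j].conj()     # M PE: MAC accumulation
            L[i, j] = s * inv_sqrt                       # scale by the Isqrt result
    return L

B = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
A = B @ B.conj().T + 4 * np.eye(4)                       # positive definite Hermitian test matrix
L = cholesky_lower(A)
assert np.allclose(L @ L.conj().T, A)
```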
  • the programmable spatial array processor 26 can also be used to perform LU decomposition.
  • LU (lower-upper) decomposition factors a matrix A as the product of a lower triangular matrix L and an upper triangular matrix U: A = LU.
  • FIG. 25 illustrates the dataflow of LU decomposition through the D PEs 110 and M PEs 112 of the PE array 76.
  • the assembly code to perform LU decomposition for a D PE 110 is shown in Table 18 and the assembly code to perform LU decomposition for an M PE 112 is shown in Table 19.
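  • A software analogue of LU decomposition is sketched below (Doolittle form without pivoting, an assumption made for simplicity): the pivot reciprocal plays the role of the diagonal element's division, and the row updates are multiply-subtract (MAC-style) operations.

```python
# Software analogue (assumed): Doolittle LU factorization without pivoting,
# A = L U with unit-diagonal L. The pivot reciprocal plays the role of the
# diagonal division, and the row updates are multiply-subtract (MAC) steps.
import numpy as np

def lu_doolittle(A: np.ndarray):
    n = A.shape[0]
    L, U = np.eye(n), A.astype(float)
    for k in range(n):
        inv_pivot = 1.0 / U[k, k]                 # reciprocal of the pivot
        for i in range(k + 1, n):
            L[i, k] = U[i, k] * inv_pivot         # multiplier for row i
            U[i, k:] -= L[i, k] * U[k, k:]        # multiply-subtract row update
    return L, np.triu(U)

A = np.array([[4.0, 3.0, 2.0],
              [6.0, 3.0, 1.0],
              [8.0, 7.0, 9.0]])
L, U = lu_doolittle(A)
assert np.allclose(L @ U, A)
```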
  • Cholesky-based minimum mean square error (MMSE)
  • the programmable spatial array processor 26 can also be used to perform Cholesky-based MMSE.
  • An example procedure for performing Cholesky-based MMSE is provided below:
  • pre-filtering may take place in the PE array 76 as illustrated by FIG. 26. Sequential input signals come from upper or left side and output signals go out the lower or right side. The dashed arrows are the paths of result (output) signals.
  • the assembly code to perform pre-filtering for a D PE 110 is shown in Table 20 and the assembly code to perform pre-filtering for an M PE 112 is shown in Table 21.
  • Tables 20 and 21 list the per-issue-slot assembly for pre-filtering (IS1 input, Isqrt IS2, MAC IS3, MAC IS4, Output IS5), including operations such as cjmul acc, U_dat, U_dat; jmv R_dat, U_dat; cjmuladd acc, U_dat, U_dat, acc; and cmuladd out_dat, U_dat, 1, acc.
  • the second stage of Cholesky-based MMSE is Cholesky decomposition. This may take place in the same way described above.
  • Cholesky-based MMSE continues with back substitution and V*Z.
  • V*Z is a matrix (V) vector (Z) multiplication:
  • Here, V = L⁻¹, the inverse of the lower triangular Cholesky factor L.
  • FIG. 27 shows the dataflow of back substitution and V*Z in the PE array 76 (in this example, 4x4) .
  • the V_{i,i} and L_{i,j} are already buffered in the corresponding PEs at the Cholesky decomposition stage. Final results are output to the right side.
  • the M PEs 112 are shown as M PEs 112A or 112B depending on the operation they perform at this stage.
  • the assembly code to perform back substitution for a D PE 110 is shown in Table 22 and the assembly code to perform back substitution for an M PE 112 is shown in Table 23.
  • Tables 22 and 23 list the per-issue-slot assembly for back substitution, including repeated cmul R_dat, U_dat, RF3_1 operations.
  • the fourth stage of Cholesky-based MMSE is to calculate V^H * (VZ), which yields the MMSE estimate.
  • FIG. 28 shows the dataflow for calculating V^H * (VZ) in the PE array 76. Final results are output to the right side.
  • the M PEs 112 are shown as M PEs 112A or 112B depending on the operation they perform at this stage.
  • the horizontal instruction propagation delay may be set to 2 to calculate V^H (the Hermitian transpose of V).
  • the assembly code to perform V^H * (VZ) for a D PE 110 is shown in Table 24 and the assembly code to perform V^H * (VZ) for an M PE 112 is shown in Table 25.
  • Tables 24 and 25 list the per-issue-slot assembly, including mv R_dat, U_dat operations.
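  • Putting the four stages together, the sketch below shows a software analogue of Cholesky-based MMSE; the sigma-squared regularization term follows the standard MMSE formulation and is an assumption here, as are the helper routines forward_sub and back_sub.

```python
# Assumed end-to-end software analogue of the Cholesky-based MMSE stages:
# (1) pre-filtering A = H^H H + sigma^2 I and z = H^H y, (2) Cholesky A = L L^H,
# (3) forward substitution giving V z with V = L^-1, (4) back substitution
# giving x_hat = V^H (V z). The sigma^2 I term is an assumption taken from the
# standard MMSE formulation.
import numpy as np

def forward_sub(L, b):                       # solve L a = b (L lower triangular)
    a = np.zeros_like(b, dtype=complex)
    for i in range(len(b)):
        a[i] = (b[i] - L[i, :i] @ a[:i]) / L[i, i]
    return a

def back_sub(U, b):                          # solve U x = b (U upper triangular)
    x = np.zeros_like(b, dtype=complex)
    for i in reversed(range(len(b))):
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

def mmse_cholesky(H, y, sigma2):
    A = H.conj().T @ H + sigma2 * np.eye(H.shape[1])   # stage 1: pre-filtering
    z = H.conj().T @ y
    L = np.linalg.cholesky(A)                          # stage 2: Cholesky decomposition
    vz = forward_sub(L, z)                             # stage 3: V*Z with V = L^-1
    return back_sub(L.conj().T, vz)                    # stage 4: V^H * (V Z)

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
y = rng.standard_normal(4) + 1j * rng.standard_normal(4)
sigma2 = 0.1
x_hat = mmse_cholesky(H, y, sigma2)
A = H.conj().T @ H + sigma2 * np.eye(4)
assert np.allclose(x_hat, np.linalg.solve(A, H.conj().T @ y))
```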
  • Givens-rotation QR based MMSE: Givens-rotation-based QR decomposition (GR-QRD) uses a series of Givens rotation operations to eliminate the entries of the lower triangular part, so that the result R is upper triangular.
  • Givens rotation can zero the lower element of a 2x1 vector:
  • the rotation coefficients may be calculated from the two vector elements (a standard formulation appears in the sketch below):
  • the last step is to rotate the (N-1)th and Nth rows of A to zero A(N, N-1).
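  • The sketch below shows a complex Givens rotation that zeroes the lower element of a 2x1 vector; the particular c and s formulas are the standard textbook formulation and are an assumption here, not necessarily the patent's exact expressions.

```python
# Standard (assumed) complex Givens rotation that zeroes the lower element of
# a 2x1 vector [a, b]^T; the resulting upper element has magnitude
# r = sqrt(|a|^2 + |b|^2). The exact symbols in the patent may differ.
import numpy as np

def givens(a, b):
    r = np.hypot(abs(a), abs(b))
    c = abs(a) / r
    s = np.exp(1j * np.angle(a)) * np.conj(b) / r
    return np.array([[c, s],
                     [-np.conj(s), c]])       # unitary 2x2 rotation matrix

a, b = 1.0 + 2.0j, 3.0 - 1.0j
G = givens(a, b)
v = G @ np.array([a, b])
assert np.isclose(v[1], 0.0)                  # lower element eliminated
assert np.isclose(abs(v[0]), np.hypot(abs(a), abs(b)))
```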
  • the QRD based MMSE may include the following:
  • FIG. 29 shows the dataflow for GR-QRD in the PE array 76.
  • Back substitution of R and the calculation of R -1 (Q H Y) are shown in FIG. 30. Final results output to the right side.
  • FIG. 31 shows the dataflow of interleaved batch GR-QRD in the PE array 76 (4x4) .
  • the input data is not matrix by matrix, but rather an interleaved pattern of matrices. For example, the first row of matrix 1 may be followed by the first row of matrix 2, followed by the second row of matrix 1.
  • the assembly code to perform interleaved batch GR-QRD for a D PE 110 is shown in Table 26 and the assembly code to perform interleaved batch GR-QRD for an M PE 112 is shown in Table 27.
  • Gram-Schmidt QR decomposition is a canonical and widely used matrix decomposition algorithm. The procedure is shown below:
  • A = [a_1, a_2, ..., a_N]
  • u_3 = a_3 - <q_1, a_3> q_1 - <q_2, a_3> q_2
  • ... u_N = a_N - <q_1, a_N> q_1 - ... - <q_{N-1}, a_N> q_{N-1}
  • FIG. 32 is a diagram of GS QR decomposition dataflow on the PE array 76.
  • the terms a k and q k are vectors representing the columns of A and Q.
  • Inner-product and multiply-subtract operations are used in each M PE 112 and D PE 110, with a reciprocal operation in the D PE 110.
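  • A software analogue of the classical Gram-Schmidt procedure above is sketched below (an illustrative assumption): the inner products and multiply-subtract updates correspond to the M PE operations, and the reciprocal of the norm corresponds to the D PE.

```python
# Software analogue (assumed) of classical Gram-Schmidt QR: inner products and
# multiply-subtract updates (M PE style) orthogonalize each column, and the
# reciprocal of the norm (D PE style) normalizes it, so that A = Q R.
import numpy as np

def gram_schmidt_qr(A: np.ndarray):
    m, n = A.shape
    Q = np.zeros((m, n), dtype=complex)
    R = np.zeros((n, n), dtype=complex)
    for j in range(n):
        u = A[:, j].astype(complex)
        for k in range(j):
            R[k, j] = Q[:, k].conj() @ A[:, j]    # inner product <q_k, a_j>
            u -= R[k, j] * Q[:, k]                # multiply-subtract update
        R[j, j] = np.linalg.norm(u)
        Q[:, j] = u * (1.0 / R[j, j])             # reciprocal (D PE) times u
    return Q, R

A = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
Q, R = gram_schmidt_qr(A)
assert np.allclose(Q @ R, A)
assert np.allclose(Q.conj().T @ Q, np.eye(4), atol=1e-8)
```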
  • the assembly code to perform GS QR for a D PE 110 is shown in Table 28 and the assembly code to perform GS QR for an M PE 112 is shown in Table 29.
  • Tables 30 and 31 provide a rough estimate of the average throughput of the matrix decomposition examples discussed above.
  • the parameters are defined as: matrix size of N x N, the size of one batch (number of matrices in one batch) is LenB, the gap between two consecutive batches is LenG clock cycles, and the delay of multiply-accumulate operations in each M PE 112 is DAcc.
  • the programmable spatial array processor 26 may be used to perform K-best maximum likelihood computations for multiple-input multiple-output (MIMO) detection.
  • a single programmable spatial array processor 26 may be time-multiplexed to carry out alternating computations (as will be discussed below, these are QR decomposition and decoding tree traverse) to perform K-best maximum likelihood computations for multiple-input multiple-output (MIMO) detection.
  • multiple programmable spatial array processors 26 may be connected together to perform K-best maximum likelihood computations for multiple-input multiple-output (MIMO) detection.
  • Any suitable number of programmable spatial array processors 26 may be connected (some number M total programmable spatial array processors 26), or a single programmable spatial array processor 26 may be multiplexed any suitable number of times.
  • the programmable spatial array processors 26 may be connected above and below.
  • PEs from one programmable spatial array processor 26 may be connected to PEs from other programmable spatial array processors 26 in a one-to-one, one-to-many, or many-to-many manner.
  • the programmable spatial array processor 26 of this disclosure may be multiplexed by holding data in a register file and performing various time-multiplexed, but related, operations at different times instead of connecting multiple separate programmable spatial array processors 26.
  • the example that follows is meant to represent a non-limiting arrangement that may be performed with a single time-multiplexed programmable spatial array processor 26 or multiple programmable spatial array processors 26 connected as described below.
  • FIG. 33 illustrates an example MIMO system 450 having four transmitter antennas 452 and four receiver antennas 454.
  • the MIMO system 450 may be referred to as a 4x4 MIMO system.
  • many more or fewer antennas may be used as the transmitter antennas 452 and the receiver antennas 454.
  • a data symbol vector x is transmitted.
  • the received vector y is a linear combination of components of the transmitted symbols plus additive noise, which may be denoted as y = Hx + n, where:
  • H is an N×N channel matrix known at the receiver
  • n is an N×1 Gaussian noise vector with covariance matrix σ²I
  • a MIMO detector may be used to estimate the transmitted data vector x using received vector y and channel matrix H.
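  • For concreteness, the sketch below simulates the y = Hx + n model for a 4x4 link; the QPSK constellation, the random channel, and the noise level are arbitrary choices made only for this example.

```python
# Assumed illustration of the system model y = H x + n for a 4x4 MIMO link.
# The QPSK constellation, the random channel, and the noise level are
# arbitrary choices made only for this example.
import numpy as np

rng = np.random.default_rng(7)
N = 4
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

H = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
x = rng.choice(qpsk, size=N)                        # transmitted symbol vector
sigma = 0.1
n = sigma * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
y = H @ x + n                                       # received vector at the antennas

print("transmitted x:", x)
print("received    y:", y)
```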
  • One type of MIMO detector is a maximum likelihood detector (MLD) .
  • An MLD chooses from among all possible candidates to select the one with the least Euclidean distance between y and Hx. This may be expressed as finding the candidate x that minimizes ||y - Hx||^2.
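  • For illustration only, an exhaustive (brute-force) maximum likelihood search over all candidate vectors may be sketched as below; this is not the hardware method of this disclosure, but it defines the metric that K-best MLD approximates.

```python
import itertools
import numpy as np

def ml_detect(y, H, constellation):
    """Exhaustive ML detection: argmin over all candidate vectors of ||y - Hx||^2."""
    best_x, best_d = None, np.inf
    for cand in itertools.product(constellation, repeat=H.shape[1]):
        x = np.array(cand)
        d = np.linalg.norm(y - H @ x) ** 2
        if d < best_d:
            best_x, best_d = x, d
    return best_x, best_d
```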
  • A hardware-friendly variant of MLD is called K-best MLD.
  • the procedure of K-best MLD is described below.
  • step 1: QR factorization of the channel matrix H, i.e., H = QR, where Q is a unitary matrix and R is an upper triangular matrix.
  • step 2: traverse the decoding tree in a breadth-first manner.
  • By substituting equation (3) into equation (2) , it becomes:
  • the decoding tree is shown to be traversed with four layers. For K-best MLD, the K most likely paths are retained as the traverse moves one layer deeper to detect one more transmitted symbol.
  • step 3: similarly, in layer 3, each of the K survival candidates (previously decoded partial vectors) may be plugged into the third term in equation (6) so that only x_2 is unknown in the squared error. Based on each previously decoded partial vector, the K most likely values of x_2 may be chosen according to the squared error term.
  • K candidates may be selected from the K^2 expanded candidates with the least partial Euclidean distances (PEDs) .
  • K candidate data vectors may be obtained at the last stage, which corresponds to the leaf layer of the decoding tree.
  • the value with the smallest total Euclidean distance is the hard output result of a K-best MLD detector.
  • the final K survival candidates may be used to compute log-likelihood ratio (LLR) of transmitted bits that are soft output result of a K-best MLD detector.
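  • The K-best procedure described above may be sketched in Python as follows (illustration only, not the disclosed hardware mapping). For simplicity, this sketch expands each survivor over the full constellation and keeps the K best, rather than pre-selecting K nearest points per survivor as in the enumeration module described later.

```python
import numpy as np

def k_best_mld(y, H, constellation, K):
    """K-best MLD sketch: QR factorization followed by breadth-first tree traverse."""
    N = H.shape[1]
    Q, R = np.linalg.qr(H)
    y_p = Q.conj().T @ y                        # y' = Q^H y
    # Each survivor is (partial symbol vector x_i..x_N, accumulated PED).
    survivors = [((), 0.0)]
    for i in range(N - 1, -1, -1):              # layers N, N-1, ..., 1
        expanded = []
        for partial, ped in survivors:
            # Inter-stream interference from already-decoded symbols x_{i+1}..x_N.
            interference = sum(R[i, j] * s for j, s in zip(range(i + 1, N), partial))
            for s in constellation:
                e = abs(y_p[i] - interference - R[i, i] * s) ** 2
                expanded.append(((s,) + partial, ped + e))
        # Keep the K candidates with the smallest PEDs.
        expanded.sort(key=lambda t: t[1])
        survivors = expanded[:K]
    best, best_ped = survivors[0]               # hard output; survivors feed soft output
    return np.array(best), best_ped, survivors
```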
  • the general computation of one layer in the decoding tree traverse procedure 460 is illustrated in FIG. 34. Note that, when traversing layer i, the first N-i+1 symbols in the input and the output that share the same superscript p (denoting survival path indices) may not be the same.
  • the K-best MLD detector 470 includes two planar triangular PE arrays 472 and 474. It is shown in FIG. 35 as an example for a 4 × 4 MIMO system.
  • the two triangular arrays 472 and 474, in Plane 1 and Plane 2, are used for QR decomposition and decoding tree traverse, respectively.
  • the PEs 110, 112 in the same position of the two arrays are connected to transfer data.
  • the triangular arrays 472 and 474 may operate as discussed above with reference to the PE array 76, but may be able to communicate from Plane 1 to Plane 2, as will be discussed further below.
  • the D PEs 110 and the M PEs 112 may operate with different instructions in the different planes. For ease of explanations, these are referred to as D PEs 110-1 and M PEs 112-1 in the triangular array 472 of Plane 1, and D PEs 110-2 and M PEs 112-2 in the triangular array 474 of Plane 2.
  • the K-best MLD detector 470 may use only one triangular systolic array by combining every two PEs in the same location of the two planes.
  • the systolic array structure shown in FIG. 35 is meant to represent a logical arrangement and that the physical location of the triangular arrays 472 and 474 may take any suitable positioning that permits communication between the various PEs as provided in this disclosure.
  • the K-best MLD detector 470 may be used for other matrix operations such as matrix multiplication, Cholesky and LU decomposition, and linear equation solving, as well.
  • the array 472 in Plane 1 may be used to perform QR decomposition.
  • the data flow between D PEs 110-1 and M PEs 112-1 is demonstrated in FIG. 36.
  • the diagonal PEs 110-1 denoted in circles are used to compute a series of sines and cosines according to Givens rotation.
  • the off-diagonal M PEs 112-1 denoted in squares may operate as Complex Multiply Accumulate units (CMACs) that may be used to apply rotations computed by the diagonal D PE 110-1 in the same row to other entries of the channel matrix.
  • Each M PE 112-1 can get input data from the D PE 110-1 on the left in the same row and the M PE 112-1 above, and output results to the two PEs on its right and below, respectively.
  • data flow of the array 472 is from left to right and from top to bottom.
  • the D PEs 110-1 and M PEs 112-1 in the same row zero the lower off-diagonal entries in the i-th column of a channel matrix.
  • When calculating a pair of sine and cosine indicating a certain rotation, the rotation not only zeros an off-diagonal entry, but also makes the diagonal entries of the result R real numbers.
  • the singleton diagonal PE 110-1 in the last row applies a rotation to a complex number, which makes R_{N,N} real.
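  • For illustration only, one standard complex Givens rotation that zeros an entry while producing a real, non-negative result is sketched below; the exact convention (conjugation and sign placement) used by the D PEs 110-1 is not specified here and may differ.

```python
import numpy as np

def complex_givens(a, b):
    """Compute a complex Givens rotation G such that G @ [a, b] = [rho, 0],
    with rho real and non-negative (the new diagonal entry)."""
    rho = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
    if rho == 0.0:
        return np.eye(2, dtype=complex), 0.0
    c = np.conj(a) / rho                        # "cosine" (complex in general)
    s = np.conj(b) / rho                        # "sine"
    G = np.array([[c, s], [-np.conj(s), np.conj(c)]])
    return G, rho

# Example: zeroing one entry of a 2-element complex vector
a, b = 1 + 2j, 3 - 1j
G, rho = complex_givens(a, b)
print(G @ np.array([a, b]))   # approximately [rho, 0], with rho real
```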
  • the channel matrix H is fed into the array 472 from the top column-wise.
  • the i-th column of H feeds into the i-th D PE 110-1 or M PE 112-1 in the first row.
  • Each PE 110-1 or 112-1 in a cycle may read data from an input port.
  • the throughput of QR decomposition is N cycles per matrix.
  • the time difference between input elements of different columns, for example between the inputs of H_11 and H_12, depends on the processing latency of the D PEs 110-1 and M PEs 112-1.
  • An example of an internal functional arrangement (e.g., program, configuration) of the diagonal D PE 110-1 of Plane 1 is illustrated in FIG. 38.
  • the D PE 110-1 uses three basic arithmetic modules.
  • a first module 480 is a squared accumulator.
  • the first module 480 calculates the squared magnitudes of the input data and then outputs the accumulated result of the squared magnitudes.
  • a second module 482 is used to compute both square roots and reciprocal square roots of the output of the first module 480.
  • a third module 484 includes a complex multiply-accumulate (CMAC) operation, in which the output of the second module 482 and the initial inputs h_1, ..., h_M are multiplied together to obtain the final sines and cosines.
  • the output L_ii is taken from the last output of the second module 482.
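  • A plausible, illustration-only interpretation of the three modules 480, 482, and 484 is sketched below; in particular, the assumption that L_ii is the reciprocal output of the second module 482 is an inference, not a statement of the disclosure.

```python
import numpy as np

def dpe_plane1_sketch(h):
    """Illustrative interpretation of the Plane-1 D PE 110-1 modules (assumption).

    h : complex inputs h_1, ..., h_M arriving at the diagonal PE (NumPy array).
    """
    # First module 480: squared accumulator of the input magnitudes.
    acc = np.sum(np.abs(h) ** 2)
    # Second module 482: square root and reciprocal square root of the accumulation.
    root = np.sqrt(acc)
    rroot = 1.0 / root
    # Third module 484: CMAC combining the second module's output with the
    # original inputs to form normalized rotation coefficients ("sines and cosines").
    coeffs = np.conj(h) * rroot
    # L_ii is taken here from the second module's outputs (assumed to be the
    # reciprocal value, consistent with its later use as a multiplier in Plane 2).
    L_ii = rroot
    return coeffs, L_ii
```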
  • the rotation matrix is:
  • the results h′_2, ..., h′_M are output to the bottom.
  • the final internal result r_M is passed to the corresponding M PE 112-2 on Plane 2.
  • if the off-diagonal M PE 112-1 is at row i and column j, then its r_M is equal to R_ij after QR decomposition.
  • An off-diagonal M PE 112-1 may include four identical CMACs 490, 492, 494, and 496 (e.g., the M PE 112-1 may be programmed to perform four identical CMAC operations using the ALU 164 as a CALU shown in FIG. 7) .
  • the CMACs 490 and 492 on the left compute the downward output h′_i values, and the related data paths are marked by solid lines.
  • the CMACs 494 and 496 on the right compute the internal values r_i, and the related data paths are marked by dashed lines.
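  • For illustration only, a classic internal-cell Givens update is shown below as an analogue of the four-CMAC structure; each output is a sum of two complex products, which matches the pairing of CMACs 490/492 and 494/496, though the actual convention of the M PE 112-1 may differ.

```python
import numpy as np

def internal_cell_update(r, h, c, s):
    """Classic internal-cell Givens update, shown as a plausible analogue only.

    r    : internally held value (accumulating toward R_ij)
    h    : value arriving from the PE above
    c, s : rotation coefficients received from the diagonal D PE 110-1 in the same row
    """
    r_new = c * r + s * h                       # two complex multiplies (cf. CMACs 494, 496)
    h_out = -np.conj(s) * r + np.conj(c) * h    # two complex multiplies (cf. CMACs 490, 492)
    return r_new, h_out
```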
  • the sines and cosines that are output to the off-diagonal M PEs 112-1 on the right also feed into this rotation module. Rotations represented by the sines and cosines are applied to the input y_i values along the diagonal.
  • the final internal value of the module, analogous to r_M in the off-diagonal M PEs 112-1, is y′_i, which is transferred to a corresponding diagonal D PE 110-2 in Plane 2.
  • the result may be transferred to the PE array 474 in Plane 2 for decoding tree traverse.
  • the data that is passed is shown in FIG. 42.
  • the PE 110-1 or 112-1 in i-th row and j-th column of Plane 1 may transmit data to a corresponding PE 110-2 or 112-2 also in i-th row and j-th column of Plane 2.
  • the i-th diagonal PE 110-1 transmits two values, L_ii and y′_i, to a corresponding i-th diagonal PE 110-2 in Plane 2.
  • the off-diagonal M PE 112-1 located in the i-th row and j-th column may transmit R_ij downwards to a corresponding off-diagonal M PE 112-2 in Plane 2.
  • the array 474 of Plane 2 may be used to traverse a decoding tree.
  • the data flow between PEs 110-2 and 112-2 is demonstrated in FIG. 43. Similar to the array 472 of Plane 1, there are two types of PEs 110-2 and 112-2 in Plane 2.
  • the diagonal PEs 110-2 denoted in circles are used to compute K candidate partial data vectors based on the inputted partial Euclidean distances (PEDs) and partial data vectors (a shorthand notation denotes the partial vector of previously decoded symbols) .
  • the off-diagonal M PEs 112-2 denoted in squares also perform CMAC operations to calculate inter-stream interferences from the previously decoded layers. The inter-stream interferences are subtracted away at the diagonal D PE 110-2 before detecting symbol x_i.
  • the direction of data flow of the array 474 of Plane 2 is opposite to that of the array 472 of Plane 1.
  • the PEs 110-2 and 112-2 receive input data from neighboring PEs 110-2 and 112-2 on the right and below, and output results to the two PEs 110-2 and 112-2 on the left and above. In other words, data flow of the array 474 is from right to left and from bottom to top.
  • the PEs 110-2 and 112-2 in the i-th row traverse the i-th layer of the decoding tree, which detects the transmitted symbol x_i with K possible outcomes.
  • the tree traverse starts from the last diagonal D PE 110-2 to decode x_N.
  • K candidates of x_N are propagated upward to construct inter-stream interferences to the remaining layers.
  • the diagonal PE 110-2 one above the last one then starts to decode x_{N-1}.
  • the decoding proceeds in this manner until the first diagonal PE 110-2 in the uppermost row is reached.
  • The value x_1 is the last one to be decoded.
  • a function (e.g., program, configuration) of diagonal D PEs 110-2 in Plane 2 is shown in FIG. 44. It receives K previously decoded partial data vectors from the diagonal D PE 110-2 below. For the diagonal D PE 110-2 in the i-th row, K different possible sets of symbols x_{i+1}, ..., x_N may already be determined, as well as the PEDs accumulated over coordinates i+1 to N. From the right, the D PE 110-2 may also receive inter-stream interferences from previously decoded symbols (represented with a shorthand notation) . Additionally, L_ii and y′_i are from the D PE 110-1 in the same position in Plane 1. A diagonal D PE 110-2 traverses the decoding tree one layer deeper as illustrated in FIG.
  • the D PE 110-2 sends to the diagonal D PE 110-2 above it the updated PEDs and the partial decoded vectors with new symbols prepended to the input partial decoded vectors. Meanwhile, K candidates and indices indicating which input partial vector they correspond to are also output to the off-diagonal M PE 112-2 above.
  • An example of an internal functional arrangement (e.g., program, configuration) of a diagonal PE 110-2 of Plane 2 is illustrated in FIG. 45.
  • the diagonal PE 110-2 may include modules 500, 502, 504, and 506 to perform various computations.
  • In the module 500, which operates as a CMAC, the inter-stream interferences are subtracted away from y′_i. Since there are K different possible interferences, this yields K results. After that, each result is multiplied by L_ii to obtain K least squares (LS) estimates of x_i.
  • the module 502 may operate as an enumeration (enum) module.
  • For each LS estimate, K constellation points are selected which have the minimum Euclidean distances to the LS estimate.
  • the K possible constellation points chosen are the candidate values of x_i.
  • K constellation points are transferred to the module 504, which may operate as a CMAC, on the lower right.
  • the module 504 may represent not a single CMAC, but rather K CMAC complex multipliers.
  • the module 504 calculates the squared magnitudes of the inputted values and adds them to the input PEDs; the results are the updated PEDs.
  • K^2 PEDs are received by the fourth module 506, which represents a sorter.
  • the module 506 may be a partially or fully pipelined insertion sorter that outputs the K smallest PEDs among the K^2 distances, along with their indices. Based on the indices, K corresponding candidates of x_i are selected from the K^2 constellation points and are transferred upwards along with the indices, for the off-diagonal M PEs 112-2 to construct inter-stream interferences. These K corresponding candidates of x_i are also appended to the input partial decoded data vectors and passed to the diagonal PE 110-2 above. The K smallest PEDs from the module 506 may also be transferred upwards along the diagonal.
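  • For illustration only, the chain of modules 500, 502, 504, and 506 may be sketched as below; the division by R_ii (in place of multiplication by L_ii) and the variable names are assumptions, and the constellation is assumed to be a NumPy array.

```python
import numpy as np

def plane2_dpe_layer(y_res, peds_in, R_ii, constellation, K):
    """Illustrative sketch of the Plane-2 diagonal D PE modules 500/502/504/506.

    y_res   : K residuals, y'_i minus the inter-stream interference of each survivor
    peds_in : K accumulated PEDs of the survivors
    """
    expanded = []                               # (updated PED, survivor index idx, symbol)
    for idx, (res, ped) in enumerate(zip(y_res, peds_in)):
        ls = res / R_ii                         # module 500: least-squares estimate of x_i
        # Module 502: enumerate the K constellation points nearest to the LS estimate.
        nearest = constellation[np.argsort(np.abs(constellation - ls) ** 2)[:K]]
        for sym in nearest:
            # Module 504: squared-magnitude PED increment added to the input PED.
            expanded.append((ped + abs(res - R_ii * sym) ** 2, idx, sym))
    # Module 506: sorter keeps the K smallest of the K^2 updated PEDs.
    expanded.sort(key=lambda t: t[0])
    return expanded[:K]
```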
  • a function (e.g., program, configuration) for the off-diagonal M PEs 112-2 is shown in FIG. 46.
  • the M PEs 112-2 operate as a CMAC that multiplies the K candidates of x_j from below by R_ij from Plane 1 to get the interference of these candidates onto layer i.
  • the computed interferences are added to the other interferences to layer i that are input from the right, and the sums are output to the left. Note that each interference is accumulated to the idx_k-th input from the right.
  • the off-diagonal M PEs 112-2 also directly forward inputs from below to the M PEs 112-2 above.
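  • One plausible, illustration-only reading of this accumulation rule is sketched below; whether the k-th output extends the idx_k-th input, as assumed here, is an interpretation rather than a statement of the disclosure.

```python
import numpy as np

def plane2_mpe(interference_in, x_candidates, idx_k, R_ij):
    """Sketch of the Plane-2 off-diagonal M PE at row i, column j (assumed semantics):
    the k-th output interference extends the idx_k[k]-th interference arriving from
    the right by R_ij times the k-th candidate of x_j from below."""
    interference_in = np.asarray(interference_in, dtype=complex)
    out = np.empty_like(interference_in)
    for k, (sym, p) in enumerate(zip(x_candidates, idx_k)):
        out[k] = interference_in[p] + R_ij * sym
    return out
```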
  • EXAMPLE EMBODIMENT 1 A system comprising:
  • a first spatial array of processing elements that perform QR decomposition; and
  • a second spatial array of processing elements in communication with the first spatial array of processing elements, that perform decoding tree traverse in parallel using input data from the first spatial array of processing elements.
  • EXAMPLE EMBODIMENT 2 The system of example embodiment 1, wherein the first spatial array of processing elements and the second spatial array of processing elements comprise the same respective number of processing elements.
  • EXAMPLE EMBODIMENT 3 The system of example embodiment 1, wherein the first spatial array of processing elements and the second spatial array of processing elements comprise a triangular arrangement, wherein data flow through the first spatial array of processing elements is opposite to data flow through the second spatial array of processing elements.
  • EXAMPLE EMBODIMENT 4 The system of example embodiment 1, wherein a plurality of processing elements of the first spatial array of processing elements provide input data to a corresponding plurality of processing elements of the second spatial array of processing elements.
  • the processing elements of the first array of processing elements comprise processing elements of a diagonal processing element type and processing elements of an off-diagonal processing element type;
  • the processing elements of the second array of processing elements comprise processing elements of the diagonal processing element type and processing elements of the off-diagonal processing element type but having different configurations with respect to those of the first array of processing elements.
  • EXAMPLE EMBODIMENT 6 The system of any of example embodiments 1–5, wherein a first plurality of the processing elements of the first array of processing elements perform squared accumulate, square root and reciprocal square root, and complex multiply-accumulate operations.
  • EXAMPLE EMBODIMENT 7 The system of any of example embodiments 1–5, wherein a second plurality of the processing elements of the first array of processing elements perform four complex multiply-accumulate operations.
  • EXAMPLE EMBODIMENT 8 The system of any of example embodiments 1–5, wherein a first plurality of the processing elements of the second array of processing elements perform complex multiply-accumulate, enumeration, and sorting operations.
  • EXAMPLE EMBODIMENT 9 The system of any of example embodiments 1–5, wherein a second plurality of the processing elements of the second array of processing elements perform a complex multiply-accumulate operation.
  • EXAMPLE EMBODIMENT 10 The system of any of example embodiments 1–5, comprising a plurality of antennas, wherein the first array of processing elements and the second array of processing elements perform a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication using the antennas.
  • EXAMPLE EMBODIMENT 11 An article of manufacture comprising one or more tangible, non-transitory, machine readable media comprising instructions that, when executed by processing circuitry, cause the processing circuitry to:
  • EXAMPLE EMBODIMENT 12 The article of manufacture of example embodiment 11, wherein the instructions cause the processing circuitry to instruct the triangular spatial array of processing elements to time multiplex between performing partial QR decomposition and partial decoding tree traverse.
  • EXAMPLE EMBODIMENT 13 The article of manufacture of example embodiments 11 or 12, wherein the instructions cause the processing circuitry to instruct the triangular spatial array of processing elements to perform the partial QR decomposition and partial decoding tree traverse to carry out a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication.
  • EXAMPLE EMBODIMENT 14 An electronic device comprising:
  • a plurality of antennas; and
  • one or more configurable triangular spatial array of processing elements configurable to carry out a K-best maximum likelihood detector (MLD) for multiple-input multiple-output (MIMO) wireless communication via the plurality of antennas.
  • EXAMPLE EMBODIMENT 15 The electronic device of example embodiment 14, wherein the one or more configurable triangular spatial array of processing elements comprises a single configurable triangular spatial array of processing elements that is time multiplexed to alternate between QR decomposition and decoding tree traverse.
  • EXAMPLE EMBODIMENT 16 The electronic device of example embodiment 15, wherein the single configurable triangular spatial array of processing elements performs decoding tree traverse using inputs obtained during performance of QR decomposition.
  • EXAMPLE EMBODIMENT 17 The electronic device of example embodiment 15, wherein:
  • when the single configurable triangular spatial array of processing elements performs QR decomposition, the single configurable triangular spatial array of processing elements has a first data flow through the processing elements; and
  • when the single configurable triangular spatial array of processing elements performs decoding tree traverse, the single configurable triangular spatial array of processing elements has a second data flow through the processing elements that is different from the first data flow.
  • EXAMPLE EMBODIMENT 18 The electronic device of example embodiment 17, wherein the second data flow is at least partially opposite the first data flow.
  • EXAMPLE EMBODIMENT 19 The electronic device of any of example embodiments 14-18, wherein the one or more configurable triangular spatial array of processing elements is configurable to perform Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE) , Givens-Rotation QR based MMSE, and Gram-Schmidt QR decomposition.
  • EXAMPLE EMBODIMENT 20 The electronic device of example embodiment 14, wherein the one or more configurable triangular spatial array of processing elements comprise:
  • a first spatial array of processing elements that perform QR decomposition; and
  • a second spatial array of processing elements in communication with the first spatial array of processing elements, that perform decoding tree traverse in parallel using input data from the first spatial array of processing elements.
  • EXAMPLE EMBODIMENT 21 A method comprising:
  • EXAMPLE EMBODIMENT 22 The method of example embodiment 21, wherein the triangular spatial array of processing elements is time-multiplexed between performing the QR decomposition and the decoding tree traverse.
  • EXAMPLE EMBODIMENT 23 The method of example embodiments 21 or 22, wherein using the triangular spatial array of processing elements to perform the QR decomposition and the decoding tree traverse comprises carrying out a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication.
  • MLD K-best maximum likelihood detector
  • EXAMPLE EMBODIMENT 24 A method comprising:
  • EXAMPLE EMBODIMENT 25 The method of example embodiment 24, comprising providing the input data from a plurality of processing elements of the first spatial array of processing elements to a corresponding respective plurality of processing elements of the second spatial array of processing elements.

Abstract

Systems, methods, and devices are provided to implement a K-best maximum likelihood detector (MLD) for multiple-input multiple-output (MIMO) wireless communication. An electronic device may have multiple antennas and one or more configurable triangular spatial arrays of processing elements configurable to carry out a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication via the antennas.

Description

[Title established by the ISA under Rule 37.2] VERSATILE SYSTOLIC ARRAY FOR MAXIMUM LIKELIHOOD MIMO DETECTORS
BACKGROUND
The present disclosure relates generally to a programmable spatial array that can rapidly and efficiently support a K-best maximum likelihood detector (MLD) for multiple-input multiple-output (MIMO) wireless communication.
This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present disclosure, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it may be understood that these statements are to be read in this light, and not as admissions of prior art.
Integrated circuit devices are found in numerous electronic devices, many of which may perform wireless communication. For instance, electronic devices may perform multiple-input multiple-output (MIMO) wireless communication, which may be used in wireless baseband systems for 5G wireless communication. The throughput of a wireless baseband system highly depends on the error performance of a MIMO detector. With a detector with a low error rate, an electronic device may transfer data using a higher Modulation and Coding Scheme (MCS) , as well as more layers. A Maximum Likelihood Detector (MLD) is one solution in a stochastic sense. To make the MLD algorithm more suitable to implement on fixed hardware like a programmable logic device (e.g., a field programmable gate array (FPGA) or an application-specific integrated circuit (ASIC) ) , a variant of MLD called K-best is often used. K-best MLD has very low error and can fulfil the goals of many different kinds of wireless baseband systems.
Yet the computational complexity of K-best MLD is quite considerable. In fact, it may be substantially higher than that of linear, lower-performing detectors such as zero-forcing (ZF) and minimum mean squared error (MMSE) . As more use cases are supported in 5G wireless communication, the baseband system may become much more complicated and power consuming than that of 4G. Therefore, the hardware utilization and energy efficiency of a 5G MIMO detector may have an outsized impact on overall system performance. Even so, there are many other computations that may be performed in a wireless base station, such as Cholesky decomposition, matrix multiplication, and linear equation solving.
BRIEF DESCRIPTION OF THE DRAWINGS
Various aspects of this disclosure may be better understood upon reading the following detailed description and upon reference to the drawings in which:
FIG. 1 is a block diagram of a system that includes an integrated circuit having a programmable spatial array processor, in accordance with an embodiment;
FIG. 2 is a block diagram of another system that includes an integrated circuit having a programmable spatial array processor, in accordance with an embodiment;
FIG. 3 is a high-level block diagram of the programmable spatial array processor, in accordance with an embodiment;
FIG. 4 is a block diagram illustrating a manner in which a batch of matrices may be pipelined through the programmable spatial array processor, in accordance with an embodiment;
FIG. 5 is a block diagram of a processing element array of the programmable spatial array processor, in accordance with an embodiment;
FIG. 6 is a diagram of data flow through the processing element array, in accordance with an embodiment;
FIG. 7 is a block diagram of an example architecture of a multiply-accumulate (M) processing element (PE) of the processing element array, in accordance with an embodiment;
FIG. 8 is a data flow diagram of one manner of feeding data into the processing element array if the processing elements lacked a data queue;
FIG. 9 is a data flow diagram of one manner of feeding data into the processing element array using data queues in respective processing elements, in accordance with an embodiment;
FIG. 10 is a block diagram of an example architecture of a diagonal (D) processing element (PE) of the processing element array, in accordance with an embodiment;
FIG. 11 is a flow diagram illustrating a method of pipelining operations, even on different matrices, using the diagonal (D) processing element (PE) , in accordance with an embodiment;
FIG. 12 is a block diagram illustrating a data flow through the example architecture of the diagonal (D) processing element (PE) of the processing element array, in accordance with an embodiment;
FIG. 13 is a block diagram showing a propagation of instructions through different processing elements of the processing element array, in accordance with an embodiment;
FIG. 14 is a block diagram showing a propagation of instructions through multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment;
FIG. 15 is a block diagram illustrating delays for propagation of instructions through the multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment;
FIG. 16 is a block diagram illustrating the use of time-to-live (TTL) on instructions propagated through the multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment;
FIG. 17 is a block diagram illustrating a propagation of instructions through diagonal (D) processing elements (PEs) and vector (V) processing elements (PEs) of the processing element array, in accordance with an embodiment;
FIG. 18 is a block diagram illustrating a set of instructions that may be stored in a common instruction memory for all or several multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment;
FIG. 19 is a block diagram of a main buffer that feeds the processing element array, in accordance with an embodiment;
FIG. 20 is a block diagram of a delay alignment buffer that aligns results that were output by the processing element array staggered in time, in accordance with an embodiment;
FIG. 21 is an example data structure of an instruction that may program multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment;
FIG. 22 is an example data structure of an assembly code for multiply-accumulate (M) processing elements (PEs) of the processing element array, in accordance with an embodiment;
FIG. 23 is a block diagram illustrating types of computations that may be carried out by a diagonal (D) processing element (PE) and a multiply-accumulate (M) processing element (PE) of the processing element array to perform Cholesky decomposition, in accordance with an embodiment;
FIG. 24 is a block diagram of computations that may be carried out by the processing element array to perform Cholesky decomposition, in accordance with an embodiment;
FIG. 25 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform LU decomposition, in accordance with an embodiment;
FIG. 26 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform pre-filtering for Cholesky-based minimum mean square error (MMSE) , in accordance with an embodiment;
FIG. 27 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform back substitution and V*Z for Cholesky-based minimum mean square error (MMSE) , in accordance with an embodiment;
FIG. 28 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform V^H* (VZ) for Cholesky-based minimum mean square error (MMSE) , in accordance with an embodiment;
FIG. 29 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform Givens-rotation QR based minimum mean square error (MMSE) (GR-QRD) , in accordance with an embodiment;
FIG. 30 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform back substitution for GR-QRD, in accordance with an embodiment;
FIG. 31 is a block diagram illustrating a manner of performing interleaved batch GR-QRD using the processing element array, in accordance with an embodiment;
FIG. 32 is a block diagram illustrating types of computations that may be carried out by the processing element array to perform Gram-Schmidt QR decomposition, in accordance with an embodiment;
FIG. 33 is a diagram illustrating a description of a multiple-input multiple-output (MIMO) wireless communication system on which a K-best maximum likelihood detector (MLD) is applied, in accordance with an embodiment;
FIG. 34 is a diagram of a computation in one layer in decoding tree traverse, in accordance with an embodiment;
FIG. 35 is an overview of a systolic array structure using two connected planes of (or one multiplexed) programmable spatial arrays of processing elements to perform a K-best maximum likelihood detector (MLD) computation for multiple-input multiple-output (MIMO) wireless communication, in accordance with an embodiment;
FIG. 36 is a diagram showing QR decomposition implemented on a first plane of the programmable spatial arrays, in accordance with an embodiment;
FIG. 37 is a diagram showing a function (e.g., program, configuration) of diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment;
FIG. 38 is a diagram of an internal functional arrangement (e.g., program, configuration) of the diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment;
FIG. 39 is a diagram showing a function (e.g., program, configuration) of off-diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment;
FIG. 40 is a diagram of an internal functional arrangement (e.g., program, configuration) of the off-diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment;
FIG. 41 is a diagram illustrating a rotation function that may be carried out in the diagonal processing elements of the first plane of the programmable spatial arrays, in accordance with an embodiment;
FIG. 42 is a diagram illustrating communication of data from processing elements of the first plane to processing elements of the second plane, in accordance with an embodiment;
FIG. 43 is a diagram showing decoding tree traverse implemented on a second plane of the programmable spatial arrays, in accordance with an embodiment;
FIG. 44 is a diagram showing a function (e.g., program, configuration) of diagonal processing elements of the second plane of the programmable spatial arrays, in accordance with an embodiment;
FIG. 45 is a diagram of an internal functional arrangement (e.g., program, configuration) of the diagonal processing elements of the second plane of the programmable spatial arrays, in accordance with an embodiment; and
FIG. 46 is a diagram showing a function (e.g., program, configuration) of off-diagonal processing elements of the second plane of the programmable spatial arrays, in accordance with an embodiment.
DETAILED DESCRIPTION OF SPECIFIC EMBODIMENTS
One or more specific embodiments will be described below. In an effort to provide a concise description of these embodiments, not all features of an actual implementation are described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific  decisions must be made to achieve the developers’ specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
When introducing elements of various embodiments of the present disclosure, the articles “a, ” “an, ” and “the” are intended to mean that there are one or more of the elements. The terms “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. Additionally, it should be understood that references to “some embodiments, ” “embodiments, ” “one embodiment, ” or “an embodiment” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features. Furthermore, the phrase A “based on” B is intended to mean that A is at least partially based on B. Moreover, the term “or” is intended to be inclusive (e.g., logical OR) and not exclusive (e.g., logical XOR) . In other words, the phrase A “or” B is intended to mean A, B, or both A and B. Moreover, this disclosure describes various data structures, such as instructions for an instruction set architecture. These are described as having certain domains (e.g., fields) and corresponding numbers of bits. However, it should be understood that these domains and sizes in bits are meant as examples and are not intended to be exclusive. Indeed, the data structures (e.g., instructions) of this disclosure may take any suitable form.
An integrated circuit, such as an application specific integrated circuit (ASIC) or a programmable logic device (PLD) like a field programmable gate array (FPGA) , may be part of an electronic device that performs wireless communications, machine learning, or many other tasks. These tasks may involve performing matrix decompositions. Indeed, matrix decomposition is widely used in wireless communication, machine learning, and other areas. For instance, multiple-input multiple-output (MIMO) wireless communication in 5G wireless systems, multivariate linear regressions in machine learning, systems of linear equations, matrix inversions and determinant calculations, and many others involve performing matrix decompositions. Different types of matrix decompositions include LU decomposition, QR decomposition, and Cholesky decomposition.
In contrast to single-purpose architectures that may support only one type of matrix decomposition, this disclosure provides a programmable spatial array processor that can be programmed to compute a variety of different types of matrix decompositions. The programmable spatial array processor has a two-dimensional upper triangular Processing Element (PE) array which acts as a high throughput engine. Every PE executes under instructions that provide programmability to support different modes.
As noted above, matrix decompositions are more complicated than matrix multiplication. The latter may generally use multiplication and addition operations and may have little or no data dependency among operations. Matrix decompositions, on the other hand, may have many data dependencies. This may cause one operation to have to wait for the result of another operation to be ready, which makes it difficult to handle data in parallel. Moreover, matrix decomposition usually has arithmetic operations other than multiplication, such as division and square root.
The programmable spatial array processor of this disclosure may use a control scheme that can mitigate the challenges of the data dependency of the various PEs in solving matrix decompositions. To solve this problem, an Instruction Share and Propagation (ISP) scheme may control all PEs efficiently. Instructions may be shared by certain PEs and propagated through them. This may substantially reduce the size or complexity of the instruction memory. Indeed, instructions may flow through the array in a systolic-like way, just like the data flow. All non-diagonal PEs may share the same instructions. This may (a) reduce instruction memory from N^2/2 to 2 and (b) allow instructions to transfer between adjacent PEs so that a long control path may be avoided. Furthermore, the programmability of the programmable spatial array processor may enable a fast switch between two different types of matrix operation. The array of the programmable spatial array processor may simply be fed with new instructions for a new matrix operation. Additional reset or reconfiguration time may be avoided, enabling transitions to computing different types of matrix decomposition to occur rapidly and seamlessly.
In addition to matrix decompositions, the programmable spatial array processor may also support widely used matrix operations like back substitution, matrix-vector multiplication, multiplying a matrix by its transpose (A^T A) , and so on. The programmability even empowers it to perform customized functions. What is more, the programmable spatial array processor may have a triangular arrangement that, compared to a square array, may cut hardware resource usage nearly in half.
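As a point of reference only (not taken from this disclosure), back substitution on an upper-triangular system R x = b, one of the operations mentioned above, may be written as:

```python
import numpy as np

def back_substitution(R, b):
    """Solve R x = b for an upper-triangular matrix R."""
    N = R.shape[0]
    x = np.zeros(N, dtype=complex)
    for i in range(N - 1, -1, -1):
        # Subtract the already-solved components, then divide by the diagonal entry.
        x[i] = (b[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x
```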
With this in mind, FIG. 1 illustrates a block diagram of a system 10 that may implement a programmable spatial array processor. A designer may desire to implement functionality, such as the programmable spatial array processor of this disclosure, on an integrated circuit device 12 (such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC) ) . In some cases, the designer may specify a high-level program to be implemented, such as an OpenCL program, which may enable the designer to more efficiently and easily provide programming instructions to configure a set of programmable logic cells for the integrated circuit  device 12 without specific knowledge of low-level hardware description languages (e.g., Verilog or VHDL) . For example, because OpenCL is quite similar to other high-level programming languages, such as C++, designers of programmable logic familiar with such programming languages may have a reduced learning curve than designers that are required to learn unfamiliar low-level hardware description languages to implement new functionalities in the integrated circuit device 12.
Designers may implement their high-level designs using design software 14, such as a version of QUARTUS
Prime by INTEL CORPORATION. The design software 14 may use a compiler 16 to convert the high-level program into a lower-level description. The compiler 16 may provide machine-readable instructions representative of the high-level program to a host 18 and the integrated circuit device 12. The host 18 may include any suitable processing circuitry and may receive a host program 22 which may be implemented by the kernel programs 20. To implement the host program 22, the host 18 may communicate instructions from the host program 22 to the integrated circuit device 12 via a communications link 24, which may be, for example, direct memory access (DMA) communications or peripheral component interconnect express (PCIe) communications. While the techniques described above refer to the application of a high-level program, in some embodiments, a designer may use the design software 14 to generate and/or to specify a low-level program, such as the low-level hardware description languages described above. Further, in some embodiments, the system 10 may be implemented without a separate host program 22. Moreover, in some embodiments, the techniques described herein may be implemented in circuitry as hardened IP that is not programmed into a programmable logic device. Thus, embodiments described herein are intended to be illustrative and not limiting.
In some embodiments, the kernel programs 20 may enable configuration of a programmable spatial array processor 26 on the integrated circuit device 12. Indeed, the programmable spatial array processor 26 may represent a circuit design of the kernel program 20 that is configured onto the integrated circuit device 12 (e.g., formed in soft logic) . In some embodiments, the programmable spatial array processor 26 may be partially or fully formed in hardened circuitry (e.g., application-specific circuitry of the integrated circuit 12 that is not configurable as programmable logic) . The host 18 may use the communication link 24 to cause the programmable spatial array processor 26 to decompose matrices according to any suitable matrix decomposition type. For example, the programmable spatial array processor 26 may be used to perform matrix decomposition to detect or transmit a signal for multiple-input multiple-output (MIMO) communication via antennas 28.
The programmable spatial array processor 26 may be a component included in a data processing system 40, as shown in FIG. 2. The data processing system 40 may include a host processor 42 (e.g., a central-processing unit (CPU) ) , memory and/or storage circuitry 44, and a network interface 46. The data processing system 40 may include more or fewer components (e.g., electronic display, user interface structures, application specific integrated circuits (ASICs) ) . The host processor 42 may include any suitable processor, such as an INTEL
processor or a reduced-instruction processor (e.g., a reduced instruction set computer (RISC) , an Advanced RISC Machine (ARM) processor) that may manage a data processing request for the data processing system 40 (e.g., to perform encryption, decryption, machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, sensing or transmitting using a phased array, communicating via a MIMO wireless system, or the like) . The memory and/or storage  circuitry 44 may include random access memory (RAM) , read-only memory (ROM) , one or more hard drives, flash memory, or the like. The memory and/or storage circuitry 44 may hold data to be processed by the data processing system 40. In some cases, the memory and/or storage circuitry 44 may also store configuration programs (bitstreams) for programming a programmable logic device that may hold the programmable spatial array processor 26. The memory and/or storage circuitry 44 may, additionally or alternatively, store instructions to program the programmable spatial array processor 26. The network interface 46 may allow the data processing system 40 to communicate with other electronic devices. The data processing system 40 may include several different packages or may be contained within a single package on a single package substrate. In some cases, the antennas 28 may be a component of the network interface 46 or may be used by the network interface 46 to receive or transmit signals in particular spatial directions.
In one example, the data processing system 40 may be part of a data center that processes a variety of different requests. For instance, the data processing system 40 may receive a data processing request via the network interface 46 to perform encryption, decryption, machine learning, video processing, voice recognition, image recognition, data compression, database search ranking, bioinformatics, network security pattern identification, spatial navigation, digital signal processing, or some other specialized task. Some or all of the components of the data processing system 40 may be virtual machine components running on physical circuitry (e.g., managed by one or more hypervisors or virtual machine managers) . Whether physical components or virtual machine components, the various components of the data processing system 40 may be located in the same location or different locations (e.g., on different boards, in different rooms, at different geographic locations) . Indeed, the data processing system 40 may be accessible via a computing service provider (CSP) that may provide an interface to customers to use the data  processing system 40 (e.g., to run programs and/or perform acceleration tasks) in a cloud computing environment.
High-Level Architecture of Programmable Spatial Array Processor
FIG. 3 shows a top block diagram of the programmable spatial array processor 26. Control flow is shown in first hatching 60, data flow is shown in second hatching 62, computation is shown in third hatching 64, and instruction flow is shown in fourth hatching 66. Input data 68 streams into a main buffer 70 first, then may flow 72 to a spatial array 74 that includes a processing element (PE) array 76 and instruction memory 78 that holds instructions to control processing elements of the PE array 76. The instruction memory 78 may represent separate memories for each different type of processing element of the PE array 76. When the PE array 76 is available, the input data 68 enters the PE array 76. After calculation in the PE array 76, results 80 stream into a delay alignment buffer 82 for data rearrangement. The output of the delay alignment buffer 82 goes to an output port 84 as output data 86 or loops back via a feedback path 88 to the main buffer 70 as intermediate data 90. The second hatching 62 shows the control signal flow. Control instructions 92 may enter a control instruction decoder 94 to be distributed to the main buffer 70, the spatial array 74, and the delay alignment buffer 82. The third hatching 64 shows an instruction preload flow. Instruction load commands 96 may take an instruction preload path 96 to the main buffer 70, the spatial array 74, and the delay alignment buffer 82.
The input data 68 may take any suitable form, including a matrix or vector format with throughput of one matrix row (column) per clock cycle. A block of the input data 68 may contain a batch of matrices to utilize the pipeline capability of PE array 76 and improve average throughput. Any suitable quantity of matrices or vectors may be used in a batch (e.g., 2, 3, 4, 5, 6, 7, 8, 16, 32,  64, 100, 128, 200, 256, 500, 512, 1000, 1024, or more or fewer) . For instance, 32 consecutive matrices may form a batch, in this case the batch size is 32.
For example, as shown in FIG. 4, a batch of three  input matrices  100A, 102A, 104A may be input to the PE array 76 through the main buffer 70. The PE array 76 may compute  result matrices  100B, 102B, and 104B in a pipelined manner. As a consequence, the  result matrices  100B, 102B, and 104B may overlap one another in time. In the example shown in FIG. 4, later parts of the result matrix 100B computed from the input matrix 100A overlap with earlier parts of the result matrix 102B computed from the input matrix 102A. Likewise, later parts of the result matrix 102B overlap with earlier parts of the result matrix 104B computed from the input matrix 104A. The delay alignment buffer 82 removes these latencies to produce aligned  output matrices  100C, 102C, and 104C.
Processing Element (PE) Array
The core part of the programmable spatial array processor 26 is the two-dimensional processing element (PE) array 76. As shown in FIG. 5, the PE array 76 has an upper triangle form to achieve high utilization efficiency, since most matrix decompositions lead to triangular result matrices. The PE array 76 includes at least three types of processing elements: diagonal (D) processing elements (PEs) 110, multiply-accumulate (M) processing elements (PEs) 112, and vector (V) processing elements (PEs) 114. The overall dataflow direction is rightward and downward. Input matrices (X) and vectors (V) stream into the PE array 76 from the upper side. The PE array 76 outputs the results (Y) to the right side. The  PEs  110, 112, and 114 accept data from an upper side or left side, perform some operations and output the results to a bottom or right side.
The M PEs 112 mainly perform multiplication and accumulation (MAC) operations, and the M PEs 112 form the upper triangular part of a square N-by-N array, of which any suitable number N may define the array. The M PEs 112 may be considered an internal processing element type of the processing element array 76, since they are bounded to the left and right by the D PEs 110 and the V PEs 114. Multiplication and accumulation (MAC) operations are abundant in matrix operations. The V PEs 114 located at the rightmost column handle vector-related operations like matrix-vector multiplication. The V PEs 114 may have the same or a similar internal hardware structure as the M PEs 112. The main difference between the V PEs 114 and the M PEs 112 is that they run under different instructions (with different behaviors) . The D PEs 110 may include more compute resources than the M PEs 112, since the diagonal elements may perform more complicated computations than non-diagonal elements in most matrix decomposition cases. As discussed further below, the D PEs 110 may include some MAC units and other math function (such as inverse square root) units, or may include units that perform certain specific operations.
The PE array 76 structure may achieve a relatively high operating clock frequency, since each  PE  110, 112, or 114 may only connect with  adjacent PEs  110, 112, or 114. This means that there may be no long routing path or that the routing paths between  PEs  110, 112, and 114 may be sufficiently similar so as to have similar (e.g., equal) latencies. And this structure may relatively easily scale up to a large array size.
FIG. 6 illustrates a data flow through the PE array 76. FIG. 6 provides an example of an X TX (X transpose multiply X) calculation. Every  column  120, 122, 124, …, 126 of input matrix X goes downward through each M PE 112 of that column and turns right when it meets the D PE 110. The respective M PEs 112 calculate the inner product of its upper input and left input. In  addition to the original data propagation path, there is a result data (inner product in this case) propagation path going through the rows of the  PEs  110, 112, and 114. Final results output to the right side as Y matrix.
Example architectures of the  PEs  110, 112, and 114 will be described below. It should be appreciated that these are intended to be illustrative and not exhaustive. Indeed, the  PEs  110, 112, and 114 may take any suitable form and have any suitable architectures.
Multiply-accumulate (M) PE 112 Architecture. One example architecture of an M PE 112 appears in FIG. 7. The M PE 112 includes several main components:
● An instruction decoder 140, which receives input instructions in_instr and translates them into control (Ctrl) signals to control the computational flow of the M PE 112. A delay block 142 may hold the instructions while computations are performed before propagating the instructions to a neighboring M PE 112. Note that the instruction flow for the  various PEs  110, 112, and 114 will be discussed further below.
● Routing circuits for interface and internal signals, which may include multiplexers (MUXes) 144, 146, 148, 150, 152, 154, and 156 and latches 158, 160, and 162.
● An arithmetic Logic Unit (ALU) 164, which may perform arithmetic operations. For certain applications, such as for a multiple-input multiple-output (MIMO) receiver, the ALU 164 may be a complex number ALU (e.g., CMAC or CALU) .  Data inverters  166, 168, and 170 may be used to invert various input data before processing in the ALU 164 or instead of processing in the ALU 164. Some data may be passed without any processing.
● A register file (RF) 172, which may include any suitable number of registers to store data.
● A data queue 174, which may buffer data from an upper side input.
The ALU 164 may perform arithmetic operations such as add, multiply, multiply-add, multiply-accumulate, and so on. It may be implemented in complex form (named CMAC or CALU) to support complex number arithmetic that is widely used in wireless communication systems. The inputs of the ALU 164 can have multiple sources, such as input ports, the register file (RF) 172, or the data queue 174. The input and output interfaces shown in FIG. 7 may include:
Input:
■ in_instr: input instruction
■ L_at: left data in (path of original data)
■ in_data: input data (path of result data)
■ U_dat: data from upper side
■ U_val: validation of U_dat
Output:
■ out_instr: output instruction (propagates in_instr to next PE)
■ R_dat: right data out (path of original data)
■ out_dat: output data (path of result data)
■ D_dat: data to downwards
■ D_val: validation of D_dat
The data queue 174 is used to buffer upper input data, since the left input data may be delayed after upper input data. One way to handle this delay gap is to input the input data in a staggered way, as shown in FIG. 8. Each input sequence is delayed to meet the systolic propagation pattern. Using the data queue 174, however, the M PE 112 may save the effort of rearranging input data, and provide flexibility to handle many different delay offset patterns of different algorithms.
It can be observed that the data queue method shown in FIG. 9 may involve more buffering resources compared to the staggered input scheme of FIG. 8. But the data queue method of FIG. 9 may reduce consumption of buffering resources in the main buffer 70 of the programmable spatial array processor 26 (FIG. 3) .
Diagonal (D) PE 110 Architecture. Since the D PEs 110 may handle more complicated calculations than an M PE 112, the D PEs 110 may have more functional units. In an example, shown in FIG. 10, the D PE 110 may receive input instructions (in_instr) that are translated and distributed by an instruction decoder 190. A delay block 192 may hold the instructions while computations are performed before propagating the instructions to a neighboring D PE 110. Note that the instruction flow for the  various PEs  110, 112, and 114 will be discussed further below. Among other things, the instructions may represent control signals for an issue slot architecture
In the example architecture of the D PE 110 shown in FIG. 10, there are five  issue slots  194, 196, 198, 200, and 202 and three  register files  204, 206, and 208 connected by a crossbar 210. Routing circuitry may include several multiplexers (MUXes) 212, 214, 216, 218, 220, and 222 to selectively route data through the D PE 110 according to the received instructions. Each  issue slot  194, 196, 198, 200, and 202 performs one kind of operation. Any suitable number of issue slots and register files may be used, and it should be understood that the number and types shown in FIG. 10 are provided by way of example for illustrative purposes. Each  issue slot  194, 196, 198, 200, and 202 can receive data from an input port (U_dat) and send data to an output port (R_dat) . The input slots may operate as follows:
● Input slot 194: store the input data into RFs.
● Isqrt slot 196: inverse square root, 1/√x. Other operations like square root and division can be calculated using the Isqrt result, for example √x = x · (1/√x) and 1/x = (1/√x) · (1/√x) .
● MAC slot 198, 200: multiplier-accumulator.
● Output slot 202: generates output data from RFs or other issue slots.
Multiple issues in a D PE 110 can work in a pipelined manner to achieve high throughput. Take Cholesky decomposition, for example. The process includes inverse square root  (Isqrt) from the Isqrt slot 196 and multiplications in the issue slot 198, which use the result from the Isqrt slot 196. Using this pipeline scheme, the two  issue slots  196 and 198 can work in parallel. An example is shown in FIG. 11. Here, the issue slot 196 may perform a first square root operation 230 on a first matrix (Matrix 1) at a first time. At a second time, the issue slot 196 may perform a second square root operation 232 on a second matrix (Matrix 2) in parallel while the issue slot 198 performs a first multiply-accumulate operation 234 on the first matrix (Matrix 1) using the results of the operation 230. At a third time, the issue slot 196 may perform a third square root operation 236 on a third matrix (Matrix 3) in parallel while the issue slot 198 performs a second multiply-accumulate operation 238 on the second matrix (Matrix 2) using the results of the operation 232. At a fourth time, the issue slot 198 may perform a third multiply-accumulate operation 240 on the third matrix (Matrix 3) using the results of operation 236. The corresponding dataflow is shown in FIG. 12, which is indicated by dashed lines: first, the input data goes through the issue slot 1 (IS1) 194 into RF1 204 and issue slot 2 (IS2) 196, then IS2 196 performs inverse square root and write result into RF2 206, and then issue slot 3 (IS3) 198 reads data from RF1 204 and RF2 206 to perform multiplication and outputs the results as R_dat.
Instruction Share and Propagation
As previously discussed with respect to FIG. 3, all PEs in the PE array 76 are controlled by instructions that may be stored in the instruction memory 78, which may represent separate memories for the different varieties of processing elements (PE) 110, 112, and 114. To support a variety of matrix decomposition functions and provide flexibility for customized design, a few control bits are not enough. A well-designed instruction set can support more general arithmetic operations. An example of a suitable Instruction Set Architecture (ISA) is provided in the Instruction Set Architecture (ISA) section further below. This section mainly focuses on how to efficiently distribute instructions to all PEs 110, 112, and 114 in the PE array 76. One straightforward way would be to use a central control unit to generate all the instructions and distribute them to all PEs. This could cause an extremely high fan-out from such a control unit, however, which could heavily deteriorate the performance of the circuit. In addition to this high fan-out problem, such a central control unit would be complicated and could involve much higher development resources and much more hardware logic than the system discussed below. Another way to distribute instructions to all PEs 110, 112, and 114 in the PE array 76 may involve using an instruction memory in each PE. In such a case, each PE 110, 112, or 114 may maintain a Program Counter (PC) to read a particular instruction from its instruction memory. This, however, may involve a tremendous amount of memory. Moreover, the design of the PCs would involve taking great care to ensure the coordination of all of the PCs. The content reload for all instruction memories could also cause either high fan-out challenges (e.g., with parallel reload) or long latency (e.g., with serial reload) .
Accordingly, a scheme referred to as Instruction Share and Propagation (ISP) may overcome some of the challenges mentioned above (e.g., avoiding such high fan-out and high memory utilization problems) . The design of Instruction Share and Propagation (ISP) is made possible because the M PEs 112 and the D PEs 110 generally execute the same or similar programs, differing only by a time offset and slight code differences. For instance, in a Cholesky decomposition procedure, every M PE 112 may execute the same first instruction but at different start times, and almost the same remaining instructions except that some of them may be ignored, as shown in FIG. 13. Here, there are three different instructions depicted by rectangles numbered 1, 2, and 3, each with a “T + number” at its left to indicate the time at which that instruction is to be executed. These are schematically shown as instructions 260 amid the M PEs 112. The term “NOP” means no operation is needed. Note that one additional instruction (instruction 2) is ignored (NOP) for each step to the right within a row, and every M PE 112 in one column has the same instructions. These regularities enable the use of Instruction Share and Propagation (ISP) .
As shown in FIG. 14, the similarity of instruction executions among the M PEs 112 may allow Instruction Share and Propagation (ISP) to use as few as one instruction memory 270 that contains the programs that all M PEs 112 share. Instruction Share and Propagation (ISP) propagates each instruction to all M PEs 112. One instruction is read from the instruction memory 270 and sent to all rows of the PE array 76, from which it propagates to all M PEs 112.
As can be seen, the start time of instruction execution of each M PE 112 is different. As such, the delay of instruction arrival at each M PE 112 will be different and varies among functions. For example, the instruction delay between two adjacent M PEs 112 in one row may be 1 or 2 cycles (or more, as desired) . The instruction delay between two adjacent rows of M PEs 112 could be many more cycles. As shown in FIG. 15, instruction queues 282 for the rows of M PEs 112 may implement the delay offset (e.g., some number of cycles) between array rows of M PEs 112, and the rightward propagation delay can be set to 1 or 2 cycles.
There may also be a Time to Live (TTL) domain in each instruction indicating whether the instruction should be executed, as shown in FIG. 16. The value of the TTL may be reduced by 1 at each hop. When it becomes less than or equal to 0, the instruction is thereafter ignored (e.g., becomes a NOP) . Specifically, the TTL domain is divided into 2 parts: TTL_R (horizontal) and TTL_D (vertical) .
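A minimal Python sketch of this TTL behavior is shown below. The field names (op, ttl_r, ttl_d) are illustrative and are not the patent's actual instruction encoding.

def propagate(instr, hop):
    # Decrement TTL_R for a horizontal hop or TTL_D for a vertical hop; an
    # instruction whose TTL reaches 0 or below is treated as a NOP thereafter.
    instr = dict(instr)
    instr['ttl_r' if hop == 'R' else 'ttl_d'] -= 1
    if instr['ttl_r'] <= 0 or instr['ttl_d'] <= 0:
        instr['op'] = 'NOP'
    return instr

i = {'op': 'ncjmulsub', 'ttl_r': 2, 'ttl_d': 4}
i = propagate(i, 'R')    # TTL_R becomes 1; still executed
i = propagate(i, 'R')    # TTL_R becomes 0; ignored from here on
print(i['op'])           # prints: NOP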
FIG. 17 illustrates Instruction Share and Propagation (ISP) for all of the PEs of the PE array 76. In addition to the instruction memory 270, there is also an instruction memory 290 for the D PEs 110 with corresponding instruction queues 292, as well as an instruction memory 294 for the V PEs 114 with corresponding instruction queues 296. Each instruction is read from its respective instruction memory 290, 270, and 294 and propagated to all related PEs 110, 112, and 114. The instruction queues 282, 292, and 296 insert a desired delay between two adjacent rows, which is referred to as the vertical delay. The delay between two adjacent M PEs 112 in a row, which is called the horizontal delay, may be set to 1 or 2 cycles.
FIG. 18 illustrates example instructions stored in the instruction memory 270 for the M PEs 112. A special instruction may be used to set vertical delay and horizontal delay, which may be referred to as a Propagation Delay Setting (PDS) instruction 300. The PDS instruction 300 may be located in a particular place (e.g., the first place) in a program containing any suitable number N  other instructions  302, 304, …, 306. The PDS instruction 300 propagates to all M PEs 112 like other instructions and may set the value of delay for M PE 112. In the example of FIG. 18, the PDS instruction 300 includes control (Ctrl) bits, some bits that indicate vertical delay (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, or more bits) , some bits that indicate horizontal delay (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 bits) that may be fewer than the number of bits that indicate vertical delay, and some bits that indicate the mode of the instruction (here, PDS) .
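For illustration only, the following sketch packs a PDS-style instruction word in Python. The field widths used here (2-bit control, 8-bit vertical delay, 4-bit horizontal delay, 2-bit mode) are assumptions for the example and do not reflect the actual bit allocation of the PDS instruction 300.

def pack_pds(vertical_delay, horizontal_delay, ctrl=0b10, mode=0b01):
    # Assumed layout: [ctrl(2) | vertical delay(8) | horizontal delay(4) | mode(2)].
    assert 0 <= vertical_delay < (1 << 8) and 0 <= horizontal_delay < (1 << 4)
    return (ctrl << 14) | (vertical_delay << 6) | (horizontal_delay << 2) | mode

# Example: a vertical delay of 12 cycles and a horizontal delay of 2 cycles.
print(format(pack_pds(12, 2), '016b'))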
FIG. 19 shows a block diagram of the main buffer 70. The main buffer 70 uses instructions translated by an instruction decoder 304 to serve as a data buffer for input data 306 or inner loop-back intermediate data 90 (e.g., from the feedback path 88 shown in FIGS. 3 and 4) . The instruction decoder 304 may decode instructions into, for example, instructions for matrix size, batch size, and function to be performed. The core of the main buffer 70 includes N first-in first-out (FIFO) buffers 310 for matrix buffering and at least one FIFO buffer 312 for vector buffering. The main buffer 70 supports N+1 data reads and writes in parallel (where N is the size of one matrix row or column) . There are two write control blocks 314 and 316, which relate to the input data 306 and the loop-back intermediate data 90, respectively. The write control blocks 314 and 316, as well as a read control block 318, may generate access signals to the FIFOs 310 and 312 by controlling routing circuitry (e.g., multiplexers (MUXes) ) 322, 324, 326, and 328. For example, the write control block 314 may generate access signals 330 and 332 using indications val_M, val_V, start of packet (sop) , and end of packet (eop) corresponding to the input data 306. The write control block 316 may generate access signals 334 and 336 using indications val_M, val_V, start of packet (sop) , and end of packet (eop) corresponding to the loop-back intermediate data 90. Likewise, the read control block 318 may generate access signals 338 and 340. Monitor circuitry 342 may provide error and ready signals. A parallel to serial (P2S) block 344 may convert a parallel vector into serial form for storage in the FIFO 312.
Thus, input data with a length of N in the form of one row or column of a matrix may be fed into the N FIFOs 310, and data read from the N FIFOs 310 may be sent to the PE array 76 as one matrix row or column. The write and read control blocks 314, 316, and 318 are used to generate the FIFO access signals (e.g., 330, 332, 334, 336, 338, and 340) . Some specific data, such as an identity matrix, can also be generated by the read control block 318. The memory 320 may store the FIFO access patterns (e.g., read patterns) of each operation (e.g., each type of matrix decomposition) . Table 1 provides one example of a read pattern.
Figure PCTCN2020117947-appb-000005
Figure PCTCN2020117947-appb-000006
Table 1
Table 2 illustrates one example instruction structure for the instructions of Table 1.
Figure PCTCN2020117947-appb-000007
Table 2
FIG. 20 shows a block diagram of the delay alignment buffer 82. Similar to the main buffer 70, the delay alignment buffer 82 uses instructions translated by an instruction decoder 350 to align input data 352 that is received from the PE array 76. The delay alignment buffer 82 may output the aligned data as output data 354 or as the inner loop-back intermediate data 90 (e.g., to the feedback path 88 shown in FIGS. 3 and 4) . The instruction decoder 350 may decode instructions into, for example, instructions for matrix size, batch size, and function to be performed. The core of the delay alignment buffer 82 includes N first-in first-out (FIFO) buffers 356 for matrix buffering. The pattern of the input data 352 to the delay alignment buffer 82 is staggered due to the different delays of the PE array rows, which may make the write control logic of a write control block 358 more complex than that of the main buffer 70.
A read control block 360 is used to make sure that the outputs of all FIFOs 356 are aligned. For example, the write control block 358 may receive instructions from a delay buffer write instruction memory 362 indicating access patterns for the current application and from the instruction decoder 350 indicating, for example, matrix size (size_matrix) , batch size (size_batch) , and the function that was performed (Function) . The write control block 358 may generate write enable (wr_en) signals to write into the FIFOs 356 and a start read (start_rd) signal for the read control block 360. The write control block 358 may trigger some or all of these signals upon receipt of a start of packet (sop) signal corresponding to the input data 352. The read control block 360 may use the start_rd signal from the write control block 358 and instructions from a delay buffer read instruction memory 364 indicating access patterns for the current application. The read control block 360 may also use outputs of the instruction decoder 350 indicating, for example, matrix size (size_matrix) , batch size (size_batch) , and the function that was performed (Function) . Monitor circuitry 366 may provide error signals.
Example instructions that may be stored in the delay buffer write instruction memory 362 are shown below in Table 3. One such instruction can serve for a write process for one batch of matrices.
Figure PCTCN2020117947-appb-000008
Table 3
Instructions in the delay buffer read instruction memory 364 may be organized as shown below in Table 4.
Figure PCTCN2020117947-appb-000009
Figure PCTCN2020117947-appb-000010
Table 4
The instructions may be described as shown below in Table 5.
Figure PCTCN2020117947-appb-000011
Table 5
In this way, the delay alignment buffer 82 may use the N FIFOs 356 to buffer both matrices and vectors. The input data 352 from the PE array 76 arrives in a staggered pattern, which is different from that of the main buffer 70. The write control block 358 is responsible for writing the data into aligned addresses of all the FIFOs 356. The read control block 360 causes data to be read from the FIFOs 356 and sent to the output port as output data 354 or looped back to the main buffer 70 as the loop-back intermediate data 90.
Instruction Set Architecture (ISA) of PE Array
ISA for an M PE 112. The behavior of each M PE 112 is controlled by the instruction it receives. An instruction contains the arithmetic operation to be performed and the routing selection of each signal. FIG. 21 shows one example of a data structure for an instruction 380 for an arithmetic mode of operating the M PEs 112. The data structure for the instruction 380 is shown to include a number of different possible domains 382 represented by a corresponding number of bits 384. Table 6, Table 7, and Table 8 describe each domain of the instruction.
Figure PCTCN2020117947-appb-000012
Table 6
Figure PCTCN2020117947-appb-000013
Table 7
Figure PCTCN2020117947-appb-000014
Figure PCTCN2020117947-appb-000015
Table 8
Assembly language for an M PE 112. To display instructions in a more readable way, the instructions may be visualized in an assembly-like language: Assembly for M PE (ASMMPE) . This is an assembly language designed for matrix decomposition using the PE array 76. FIG. 22 provides an example of ASMMPE code 390 and its mapping to a binary instruction. The ASMMPE code 390 includes a first part 392 that describes the main arithmetic operation and an output destination, a second part 394 that describes routing signals (e.g., a source of the data and a delay of the output data) , and a third part 396 that describes the time-to-live (TTL) values. A separator ‘|’ divides them. Table 9 illustrates corresponding binary instructions for the assembly code 390 of FIG. 22.
ctrl func mul1 mul2 add3 raddr dest wraddr
10 001000 100 001 10 00101 011 00000
latch muxD muxR muxO Odly muxRF TTLD TTLR
000 10 00 10 0 0 00100 00100
Table 9
Below, Table 10 provides various keywords that may be used by the ASMMPE language.
Figure PCTCN2020117947-appb-000016
Figure PCTCN2020117947-appb-000017
Table 10
ISA for a D PE 110. The behavior of each D PE 110 is also controlled by the instruction it receives. An instruction for the D PEs 110 includes five sub-instructions, where each sub-instruction belongs to one issue slot. As may be appreciated, when the D PEs 110 include more or fewer issue slots, there may be correspondingly more or fewer sub-instructions. As mentioned above, multiple issue slots work simultaneously to achieve a pipeline effect. Each issue slot runs only its corresponding sub-instruction, and time offsets among the multiple issue slots may also be specified by the program. Table 11, Table 12, Table 13, and Table 14 show an example instruction structure for the four kinds of issue slot discussed above with reference to FIGS. 10-12.
Figure PCTCN2020117947-appb-000018
Table 11
Figure PCTCN2020117947-appb-000019
Table 12
Figure PCTCN2020117947-appb-000020
Table 13
Figure PCTCN2020117947-appb-000021
Table 14
Each issue slot instruction may also include a time-to-live (TTL) domain to indicate whether that instruction should be executed or ignored (e.g., as NOP) . For example, the TTL of an instruction for a D PE 110 may have a data structure as described in Table 15.
Figure PCTCN2020117947-appb-000022
Table 15
Processes of Matrix Decomposition on the Programmable Spatial Array Processor
The programmable spatial array processor 26 may be programmable to perform a wide variety of types of matrix decompositions. This section will describe the following types of matrix decompositions:
● Cholesky decomposition
● LU decomposition
● Cholesky based MMSE
● Givens-Rotation QR based MMSE
● Gram-Schmidt QR decomposition
The D PE 110 and M PE 112 may have a dataflow as illustrated in FIG. 23. In FIG. 23, sequential input signals come from the upper or left side and output signals go out the lower or right side. The equation in each circle or square shows the calculation that the D PE 110 or M PE 112 performs. The dashed arrows are the paths of the result (output) signals.
Cholesky decomposition. Cholesky decomposition aims to find a lower triangular matrix L that satisfies L*L’ = A. A is given as a positive definite Hermitian matrix:
A = L·L^H
The procedure of Cholesky decomposition is (R=A) :
Figure PCTCN2020117947-appb-000023
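As a point of reference, the following NumPy sketch implements a column-by-column Cholesky procedure; the split between an inverse-square-root step and multiply-subtract updates loosely mirrors the D PE 110 and M PE 112 roles, although the exact scheduling on the PE array 76 is not asserted here.

import numpy as np

def cholesky(A):
    # Column-by-column Cholesky: the diagonal step uses an inverse square
    # root (as in the D PE), and the trailing update uses multiply-subtract
    # operations (as in the M PEs).  Illustrative sketch only.
    N = A.shape[0]
    R = A.astype(complex).copy()
    L = np.zeros_like(R)
    for j in range(N):
        inv = 1.0 / np.sqrt(R[j, j].real)      # inverse square root
        L[j, j] = R[j, j].real * inv           # equals sqrt(R[j, j])
        L[j + 1:, j] = R[j + 1:, j] * inv      # scale the rest of the column
        R[j + 1:, j + 1:] -= np.outer(L[j + 1:, j], np.conj(L[j + 1:, j]))
    return L

A = np.array([[4, 2], [2, 3]], dtype=complex)
L = cholesky(A)
print(np.allclose(L @ L.conj().T, A))   # True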
FIG. 24 illustrates the dataflow of Cholesky decomposition in the PE array 76 of the programmable spatial array processor 26. As seen in FIG. 24, the result L_i,j is buffered in the PEs 110 and 112. The horizontal instruction propagation delay may be set to 2 for Cholesky decomposition. The assembly code for Cholesky decomposition is provided below. Namely, the assembly code for Cholesky decomposition for a D PE 110 is shown in Table 16 and the assembly code for Cholesky decomposition for an M PE 112 is shown in Table 17.
Input (IS1) Isqrt (IS 2) MAC (IS3) MAC (IS4) Output (IS5)
mv RF1_1, U_dat isqrt RF2_1, U_dat      
mv RF1_2, U_dat        
mv RF1_3, U_dat        
mv RF1_4, U_dat        
mv RF1_5, U_dat isqrt RF2_2, U_dat      
mv RF1_6, U_dat        
mv RF1_7, U_dat        
mv RF1_8, U_dat        
     
       
    cmul RF3_1, RF1_1, RF2_1 | latch=2    
    cmul R_dat, RF1_2, latch    
    cmul R_dat, RF1_3, latch    
       
Table 16
ncjmulsub D_dat, L_dat, L_dat, U_dat | latch=2, RF1=L
ncjmulsub D_dat, L_dat, latch, U_dat | TTL=max, 2
ncjmulsub D_dat, L_dat, latch, U_dat | TTL=max, 1
NOP
Table 17
LU decomposition. The programmable spatial array processor 26 can also be used to perform LU decomposition. LU (lower-upper) decomposition factors a matrix A as the product of a lower triangular matrix L and an upper triangular matrix U:
A=L*U
Example Matlab code of LU decomposition is shown below:
Figure PCTCN2020117947-appb-000024
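Since the MATLAB listing appears only as an appendix figure, a minimal NumPy equivalent is sketched below (Doolittle LU without pivoting, so L has a unit diagonal); whether the PE array's variant normalizes L or U in this way is not asserted here.

import numpy as np

def lu(A):
    # Doolittle LU without pivoting: eliminate below the diagonal column by
    # column, recording the multipliers in L so that A = L @ U.
    N = A.shape[0]
    U = A.astype(complex).copy()
    L = np.eye(N, dtype=complex)
    for j in range(N):
        for i in range(j + 1, N):
            L[i, j] = U[i, j] / U[j, j]        # multiplier (ratio to the pivot)
            U[i, j:] -= L[i, j] * U[j, j:]     # multiply-subtract update
    return L, np.triu(U)

A = np.array([[2, 1], [4, 5]], dtype=complex)
L, U = lu(A)
print(np.allclose(L @ U, A))   # True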
FIG. 25 illustrates the dataflow of LU decomposition through the D PEs 110 and M PEs 112 of the PE array 76. The assembly code to perform LU decomposition for a D PE 110 is shown in Table 18 and the assembly code to perform LU decomposition for an M PE 112 is shown in Table 19.
Figure PCTCN2020117947-appb-000025
Table 18
nop  0, 0, U_dat | latch=2, RF1=U
ncmulsub D_dat, L_dat, latch, U_dat | TTL=3, max
ncmulsub D_dat, L_dat, latch, U_dat | TTL=2, max
ncmulsub D_dat, L_dat, latch, U_dat | TTL=1, max
Table 19
Cholesky-based minimum mean square error (MMSE) . The programmable spatial array processor 26 can also be used to perform Cholesky-based MMSE. An example procedure for performing Cholesky-based MMSE is provided below:
Description of input signals:
● MIMO channel coefficients H: N×N complex matrix,
● Noise power σ 2: real scalar,
● Received signal Y: N×1 complex vector
The final result is:
x = (H^H·H + σ²·I)^(-1)·H^H·Y
To implement it on the PE array 76, the procedure is divided into 4 stages:
Pre-filtering
● A = H^H·H,
● R = A + σ²·I
● Z = H^H·Y.
Cholesky decomposition
● R = L·L^H
Back substitution & V·Z
● V = L^(-1)
● VZ = V·Z
V^H·(VZ)
● x = V^H·(VZ)
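A compact numerical sketch of these four stages is shown below, using library routines in place of the PE-array dataflow; it is intended only to check the algebra of the staging, not to model the hardware.

import numpy as np

def cholesky_mmse(H, y, sigma2):
    A = H.conj().T @ H                      # stage 1: pre-filtering
    R = A + sigma2 * np.eye(H.shape[1])
    Z = H.conj().T @ y
    L = np.linalg.cholesky(R)               # stage 2: R = L * L^H
    V = np.linalg.inv(L)                    # stage 3: back substitution, V = L^-1
    VZ = V @ Z
    return V.conj().T @ VZ                  # stage 4: x = V^H * (V * Z)

H = np.array([[1.0, 0.5], [0.2, 1.0]], dtype=complex)
y = np.array([1.0, 2.0], dtype=complex)
x = cholesky_mmse(H, y, sigma2=0.1)
ref = np.linalg.solve(H.conj().T @ H + 0.1 * np.eye(2), H.conj().T @ y)
print(np.allclose(x, ref))   # True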
Reviewing each stage of Cholesky-based MMSE, pre-filtering may take place in the PE array 76 as illustrated by FIG. 26. Sequential input signals come from the upper or left side and output signals go out the lower or right side. The dashed arrows are the paths of the result (output) signals. The assembly code to perform pre-filtering for a D PE 110 is shown in Table 20 and the assembly code to perform pre-filtering for an M PE 112 is shown in Table 21.
Input (IS1) Isqrt (IS 2) MAC (IS3) MAC (IS4) Output (IS5)
    cjmul acc, U_dat, U_dat   jmv R_dat, U_dat
    cjmuladd acc, U_dat, U_dat, acc   jmv R_dat, U_dat
    cjmuladd acc, U_dat, U_dat, acc   jmv R_dat, U_dat
    cjmuladd acc, U_dat, U_dat, acc   jmv R_dat, U_dat
    cmuladd out_dat, U_dat, 1, acc  
       
Table 20
cmulacc acc, L_dat, U_dat | D=U
cmulacc acc, L_dat, U_dat, acc | D=U
cmulacc acc, L_dat, U_dat, acc | D=U
cmulacc O_dat, L_dat, U_dat, acc | D=U, Odly=idxM
Table 21
After pre-filtering, the second stage of Cholesky-based MMSE is Cholesky decomposition. This may take place in the same way as described above. After Cholesky decomposition, Cholesky-based MMSE continues with back substitution and V·Z. Back substitution is used to solve V = L^(-1) , in which L is the lower triangular matrix from the Cholesky decomposition. V·Z is a matrix (V) vector (Z) multiplication:
V = L^(-1)
V·Z
The procedure of back substitution may be described as:
Figure PCTCN2020117947-appb-000026
Figure PCTCN2020117947-appb-000027
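A minimal sketch of the substitution step is shown below, assuming L is the lower triangular factor produced by the Cholesky stage; the per-element schedule used by the PE array 76 is not reproduced.

import numpy as np

def invert_triangular(L):
    # Substitution-based inverse of a lower triangular matrix:
    # V[i, i] = 1 / L[i, i]; each remaining entry follows from the rows above.
    N = L.shape[0]
    V = np.zeros_like(L)
    for i in range(N):
        V[i, i] = 1.0 / L[i, i]
        for j in range(i):
            V[i, j] = -V[i, i] * (L[i, j:i] @ V[j:i, j])
    return V

L = np.array([[2, 0], [1, 3]], dtype=complex)
V = invert_triangular(L)
print(np.allclose(V @ L, np.eye(2)))   # True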
FIG. 27 shows the dataflow of back substitution and V·Z in the PE array 76 (in this example, 4x4) . The V_i,i and L_i,j values are already buffered in the corresponding PEs at the Cholesky decomposition stage. Final results are output to the right side. In FIG. 27, the M PEs 112 are shown as M PEs 112A or 112B depending on the operation they perform at this stage. The assembly code to perform back substitution for a D PE 110 is shown in Table 22 and the assembly code to perform back substitution for an M PE 112 is shown in Table 23.
Input (IS1) Isqrt (IS 2) MAC (IS3) MAC (IS4) Output (IS5)
    cmul R_dat, U_dat, RF3_1    
    cmul R_dat, U_dat, RF3_1    
    cmul R_dat, U_dat, RF3_1    
    cmul R_dat, U_dat, RF3_1    
       
Table 22
ncmulsub D_dat, L_dat, RF1, U_dat
ncmulsub D_dat, L_dat, RF1, U_dat
ncmulsub D_dat, L_dat, RF1, U_dat
ncmulsub D_dat, L_dat, RF1, U_dat
Table 23
The fourth stage of Cholesky-based MMSE is to calculate V^H·(VZ) :
V^H·(VZ)
FIG. 28 shows the dataflow for calculating V^H·(VZ) in the PE array 76. Final results are output to the right side. In FIG. 28, the M PEs 112 are shown as M PEs 112A or 112B depending on the operation they perform at this stage. The horizontal instruction propagation delay may be set to 2 to calculate V^H (the conjugate transpose of V) . The assembly code to perform V^H·(VZ) for a D PE 110 is shown in Table 24 and the assembly code to perform V^H·(VZ) for an M PE 112 is shown in Table 25.
Input (IS1) Isqrt (IS 2) MAC (IS3) MAC (IS4) Output (IS5)
        mv R_dat, U_dat
        mv R_dat, U_dat
       
Table 24
nop | R=U
nop | D=U | TTL=3, max
nop | D=U | TTL=2, max
nop | D=U | TTL=1, max
Table 25
Givens-Rotation QR based MMSE. Givens Rotation based QR decomposition (GR-QRD) uses a series of Givens rotation operations to eliminate the entries of the lower triangular part of the matrix being decomposed, leaving the upper triangular matrix R. One Givens rotation can zero the lower element of a 2x1 vector:
Figure PCTCN2020117947-appb-000028
The α and β may be calculated as:
Figure PCTCN2020117947-appb-000029
Figure PCTCN2020117947-appb-000030
The entire procedure of QR decomposition may be described as:
Rotate the 1st and 2nd row of A to zero A (2, 1) .
Figure PCTCN2020117947-appb-000031
Then rotate the 1st and 3rd row of A to zero A (3, 1) .
Figure PCTCN2020117947-appb-000032
……
The last step is to rotate the N-1th and Nth row of A to zero A (N, N-1) .
Figure PCTCN2020117947-appb-000033
Finally, we get
A = QR (Q = Q_1·Q_2·…·Q_m)
The MATLAB code of the above procedure is:
Figure PCTCN2020117947-appb-000034
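Because the MATLAB listing appears only as an appendix figure, a reference NumPy version of the rotation-by-rotation procedure is sketched below. The complex rotation convention used here (which makes the rotated top element real) is one common choice and is an assumption for this example, as is the explicit accumulation of Q, which the text notes is often unnecessary.

import numpy as np

def givens(a, b):
    # One complex Givens rotation that zeros b and leaves a real value on top.
    r = np.sqrt(abs(a) ** 2 + abs(b) ** 2)
    c, s = np.conj(a) / r, np.conj(b) / r
    return np.array([[c, s], [-np.conj(s), np.conj(c)]])

def gr_qrd(A):
    # Zero the subdiagonal of A column by column with row-pair rotations.
    N = A.shape[0]
    R = A.astype(complex).copy()
    Q = np.eye(N, dtype=complex)
    for j in range(N - 1):
        for i in range(j + 1, N):
            G = givens(R[j, j], R[i, j])
            R[[j, i], :] = G @ R[[j, i], :]            # rotate rows j and i
            Q[:, [j, i]] = Q[:, [j, i]] @ G.conj().T   # accumulate Q
    return Q, np.triu(R)

A = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
Q, R = gr_qrd(A)
print(np.allclose(Q @ R, A), np.allclose(Q.conj().T @ Q, np.eye(4)))   # True True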
In many cases, there is no need to obtain the Q matrix explicitly. For instance, the QRD based MMSE may include the following:
● Calculate H^H·H + σ²·I and H^H·Y.
● Perform QRD on H^H·H + σ²·I to get R.
● Perform back substitution to get R^(-1).
● Get x = (QR)^(-1)·H^H·Y = R^(-1)·(Q^H·H^H·Y)
One question about QRD is how to get Q^H if there is no explicit Q calculation. The answer is that when the Givens rotations are performed on H^H·H + σ²·I, they should also be performed on H^H·Y simultaneously, as the equation below shows:
[QR, H^H·Y] ----Givens rotations----> [R, Q^H·H^H·Y]
FIG. 29 shows the dataflow for GR-QRD in the PE array 76. Back substitution of R and the calculation of R^(-1)·(Q^H·Y) are shown in FIG. 30. Final results are output to the right side.
To increase the utilization rate of MAC resources and data throughput, GR-QRD may be performed using an interleaved batch mode. FIG. 31 shows the dataflow of interleaved batch GR-QRD in the PE array 76 (4x4) . The input data is not matrix by matrix, but rather an interleaved pattern of matrices. For example, the first row of matrix 1 may be followed by the first row of matrix 2, followed by the second row of matrix 1. The assembly code to perform interleaved batch GR-QRD for a D PE 110 is shown in Table 26 and the assembly code to perform interleaved batch GR-QRD for an M PE 112 is shown in Table 27.
Figure PCTCN2020117947-appb-000035
Figure PCTCN2020117947-appb-000036
Table 26
nop | RF1_B1=U
nop
nop
nop
nop | RF1_B2=U
nop
nop
nop
jcmulacc acc, L_dat, RF1_B1; c*. a1
jcmulacc RF1_B1, L_dat, U_dat, acc | latch=1, 2 ;
a1=c*. a1+s*. ak
cmulacc acc, L_dat, latch2; c. ak
ncmulacc D_dat, latch1, RF1_B1, acc ; b=c. ak-s. a1
jcmulacc acc, L_dat, RF1_B2
jcmulacc RF1_B2, L_dat, U_dat, acc | latch=1, 2
cmulacc acc, L_dat, latch2
ncmulacc D_dat, latch1, RF1_B2, acc
Table 27
Gram-Schmidt QR decomposition. GS (Gram-Schmidt) QR decomposition is a canonical and widely used matrix decomposition algorithm. The procedure is shown below:
A = [a_1, a_2, …, a_N]
u_1 = a_1,  q_1 = u_1 / ‖u_1‖
u_2 = a_2 − <q_1, a_2>·q_1,  q_2 = u_2 / ‖u_2‖
u_3 = a_3 − <q_1, a_3>·q_1 − <q_2, a_3>·q_2,  q_3 = u_3 / ‖u_3‖
…
u_N = a_N − <q_1, a_N>·q_1 − … − <q_{N-1}, a_N>·q_{N-1},  q_N = u_N / ‖u_N‖
Q = [q_1, q_2, …, q_N]
A = Q·R, where R is upper triangular with R(i, j) = <q_i, a_j> for j ≥ i
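A compact NumPy rendering of this classical Gram-Schmidt procedure is given below for reference; it follows the equations above rather than the PE-array scheduling.

import numpy as np

def gs_qrd(A):
    # Classical Gram-Schmidt: project each column a_k onto the q_i already
    # found, subtract the projections, and normalize the remainder.
    N = A.shape[1]
    Q = A.astype(complex).copy()
    R = np.zeros((N, N), dtype=complex)
    for k in range(N):
        for i in range(k):
            R[i, k] = np.vdot(Q[:, i], A[:, k])   # <q_i, a_k>
            Q[:, k] -= R[i, k] * Q[:, i]
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]                        # q_k = u_k / ||u_k||
    return Q, R

A = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
Q, R = gs_qrd(A)
print(np.allclose(Q @ R, A), np.allclose(Q.conj().T @ Q, np.eye(4)))   # True True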
FIG. 32 is a diagram of the GS QR decomposition dataflow on the PE array 76. The terms a_k and q_k are vectors representing the columns of A and Q. Inner product and multiply-subtract operations are used in each M PE 112 and D PE 110, with a reciprocal computed in the D PE 110. The assembly code to perform GS QR for a D PE 110 is shown in Table 28 and the assembly code to perform GS QR for an M PE 112 is shown in Table 29.
Figure PCTCN2020117947-appb-000043
Figure PCTCN2020117947-appb-000044
Table 28
; calculate <q, a>
cmulacc acc, L_dat, U_dat | RF1=U_dat
cmulacc acc, L_dat, U_dat | RF2=U_dat
cmulacc acc, L_dat, U_dat | RF3=U_dat
cmulacc RF5, L_dat, U_dat | RF4=U_dat
NOP
...
NOP
; calculate a- <q, a> q
nop, 0, 0, RF5 | latch=2
ncmulsub D_dat, L_dat, latch, RF1
ncmulsub D_dat, L_dat, latch, RF2
ncmulsub D_dat, L_dat, latch, RF3
ncmulsub D_dat, L_dat, latch, RF4
Table 29
Average throughput estimation. Tables 30 and 31 provide a rough estimate of the average throughput of the matrix decomposition examples discussed above. The parameters are defined as: matrix size of N x N, the size of one batch (number of matrices in one batch) is LenB, the gap between two consecutive batches is LenG clock cycles, and the delay of multiply-accumulate operations in each M PE 112 is DAcc.
Figure PCTCN2020117947-appb-000045
Figure PCTCN2020117947-appb-000046
Table 30
Figure PCTCN2020117947-appb-000047
Table 31
K-Best Maximum Likelihood Detector (MLD) for Multiple-Input Multiple-Output (MIMO)
The programmable spatial array processor 26 may be used to perform K-best maximum likelihood computations for multiple-input multiple-output (MIMO) detection. For example, a single programmable spatial array processor 26 may be time-multiplexed to carry out alternating computations (as will be discussed below, these are QR decomposition and decoding tree traverse) to perform K-best maximum likelihood computations for multiple-input multiple-output (MIMO) detection. Additionally or alternatively, multiple programmable spatial array processors 26 may be connected together to perform K-best maximum likelihood computations for multiple-input multiple-output (MIMO) detection. While this disclosure provides two connected programmable spatial array processors 26 by way of example, it should be appreciated that any number of programmable spatial array processors 26 may be connected (some number M total programmable spatial array processors 26) or a single programmable spatial array processor 26 multiplexed any suitable number of times. In the examples that will follow, the programmable spatial array processors 26 may be connected above and below. Moreover, PEs from one programmable spatial array processor 26 may be connected to PEs from other programmable spatial array processors 26 in a one-to-one, one-to-many, or many-to-many manner. However, it should be understood that  the programmable spatial array processor 26 of this disclosure may be multiplexed by holding data in a register file and performing various time-multiplexed, but related, operations at different times instead of connecting multiple separate programmable spatial array processors 26. In other words, the example that follows is meant to represent a non-limiting arrangement that may be performed with a single time-multiplexed programmable spatial array processor 26 or multiple programmable spatial array processors 26 connected as described below.
In modern wireless systems, multiple antennas (e.g., antennas 28 shown in FIGS. 1 and 2) may be placed both at transmitters and receivers to increase the number of parallel streams or transmission reliability. FIG. 33 illustrates an example MIMO system 450 having four transmitter antennas 452 and four receiver antennas 454. Thus, the MIMO system 450 may be referred to as a 4x4 MIMO system. However, many more or fewer antennas may be used as the transmitter antennas 452 and the receiver antennas 454. Here, a data symbol vector x is transmitted. After propagation in a MIMO channel, the received vector y is a linear combination of components of transmitted symbols and additive noise, which may be denoted as:
y=Hx+n                     (1)
where H is an N×N channel matrix known at the receiver, and n is an N×1 Gaussian noise vector with covariance matrix σ²·I.
A MIMO detector may be used to estimate the transmitted data vector x using received vector y and channel matrix H. One form of MIMO detector is a maximum likelihood detector (MLD) . An MLD chooses from among all possible candidates to select one with the least Euclidean distance between y and Hx. This may be expressed as follows:
x̂ = argmin_x ‖y − H·x‖²                     (2)
A hardware-friendly variant of MLD is called K-best MLD. The procedure of K-best MLD is described below.
First (step 1) , QR factorization of channel matrix H:
H=QR                    (3)
where Q is a unitary matrix that has Q^H·Q = I, and R is an upper triangular matrix.
Second (step 2) , traverse the decoding tree in a breadth-first manner. By substituting equation (3) into equation (2) , it becomes:
x̂ = argmin_x ‖y′ − R·x‖²                     (4)
where y′ = Q^H·y.
The squared Euclidean distance can be rewritten as the sum of the squared distances on each dimension:
‖y′ − R·x‖² = Σ_{i=1…N} |y′_i − Σ_{j=i…N} R_ij·x_j|²                     (5)
Taking the 4×4 MIMO system of FIG. 33 for example, the above equation is expanded to four terms as:
|y′_4 − R_44·x_4|² + |y′_3 − R_34·x_4 − R_33·x_3|² + |y′_2 − R_24·x_4 − R_23·x_3 − R_22·x_2|² + |y′_1 − R_14·x_4 − R_13·x_3 − R_12·x_2 − R_11·x_1|²    (6)
Here, the decoding tree is shown to be traversed with four layers. For K-best MLD, the K most likely paths are reserved while traversing one layer deeper to detect one more transmitted symbol. First, among all possible symbols of x_4, choose the K of them with the least squared errors |y′_4 − R_44·x_4|². They are denoted as
Figure PCTCN2020117947-appb-000051
Second, for each of the K possible previously decoded data symbols, substitute it into |y′_3 − R_34·x_4 − R_33·x_3|², which has only one unknown, x_3. From all possible symbols of x_3, choose the K of them with the least squared errors |y′_3 − R_34·x_4 − R_33·x_3|². Now there are K² candidates of the symbols x_4 and x_3. From all K² candidates of x_4 and x_3 at hand, only the K with the least partial Euclidean distance (PED) are selected:
PED_3 = |y′_4 − R_44·x_4|² + |y′_3 − R_34·x_4 − R_33·x_3|²
These may be denoted as
Figure PCTCN2020117947-appb-000052
Third (step 3) , similarly, in the next layer, for each of the K survival candidates that are previously decoded partial vectors, those may be plugged into the third term in equation (6) to make only x_2 unknown in the squared error. Based on each previously decoded partial vector, the K most likely x_2 may be chosen according to the squared error term.
Next, K may be selected from the K² expanded candidates with the least PED:
PED_2 = |y′_4 − R_44·x_4|² + |y′_3 − R_34·x_4 − R_33·x_3|² + |y′_2 − R_24·x_4 − R_23·x_3 − R_22·x_2|²
The results may be denoted as
Figure PCTCN2020117947-appb-000053
Fourth (step 4) , following the same process as step 2 or step 3, K candidate data vectors may be obtained at the last stage, which corresponds to the leaf layer of the decoding tree. An example follows:
Figure PCTCN2020117947-appb-000054
The value with the smallest total Euclidean distance
Figure PCTCN2020117947-appb-000055
Figure PCTCN2020117947-appb-000056
is the hard output result of a K-best MLD detector. Additionally or alternatively, the final K survival candidates may be used to compute log-likelihood ratios (LLRs) of the transmitted bits, which are the soft output result of a K-best MLD detector.
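As an end-to-end sketch of steps 1-4, the following Python routine runs a breadth-first K-best traverse after a library QR decomposition. It keeps the K smallest partial Euclidean distances at each layer and returns only the hard decision; the example constellation, the use of numpy.linalg.qr, and the direct selection of K survivors from all expanded candidates are simplifications for this illustration.

import numpy as np

def k_best_mld(y, H, constellation, K):
    N = H.shape[1]
    Q, R = np.linalg.qr(H)
    yp = Q.conj().T @ y                              # y' = Q^H * y
    paths = [((), 0.0)]                              # (symbols x_N..x_{i+1}, PED)
    for i in range(N - 1, -1, -1):                   # layers N, N-1, ..., 1
        cand = []
        for symbols, ped in paths:
            interference = sum(R[i, i + 1 + k] * s for k, s in enumerate(symbols))
            for x in constellation:
                err = abs(yp[i] - interference - R[i, i] * x) ** 2
                cand.append(((x,) + symbols, ped + err))
        paths = sorted(cand, key=lambda p: p[1])[:K]  # keep K smallest PEDs
    return np.array(paths[0][0])                      # hard output

constellation = np.array([-3, -1, 1, 3], dtype=complex)   # e.g., 4-PAM per dimension
H = np.random.randn(4, 4)
x = np.random.choice(constellation, 4)
y = H @ x
print(np.allclose(k_best_mld(y, H, constellation, K=4), x))   # True in the noise-free case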
The general computation of one layer in the decoding tree traverse procedure 460 is illustrated in FIG. 34. Note that, when traversing layer i, the first N−i+1 symbols
Figure PCTCN2020117947-appb-000057
may not be the same as in the input and output with the same superscript p denoting survival path indices.
One systolic structure, shown in FIG. 35, may operate as a K-best MLD detector 470. The K-best MLD detector 470 includes two planar  triangular PE arrays  472 and 474. It is shown in FIG. 35 as an example for a 4×4 MIMO system. The two  triangular arrays  472 and 474, in Plane 1 and Plane 2 respectively, are used for QR decomposition and decoding tree traverse, respectively. The  PEs  110, 112 in the same position of the two arrays are connected to transfer data.
The triangular arrays 472 and 474 may operate as discussed above with reference to the PE array 76, but may be able to communicate from Plane 1 to Plane 2, as will be discussed further below. Moreover, the D PEs 110 and the M PEs 112 may operate with different instructions in the different planes. For ease of explanation, these are referred to as D PEs 110-1 and M PEs 112-1 in the triangular array 472 of Plane 1, and D PEs 110-2 and M PEs 112-2 in the triangular array 474 of Plane 2. In other embodiments, the K-best MLD detector 470 may use only one triangular systolic array by combining every two PEs in the same location of the two planes. Finally, it should be appreciated that the systolic array structure shown in FIG. 35 is meant to represent a logical arrangement and that the physical location of the triangular arrays 472 and 474 may take any suitable positioning that permits communication between the various PEs as provided in this disclosure.
In the example of FIG. 35, since all or some of the PEs 110-1, 110-2, 112-1, and 112-2 in Plane1 and Plane 2 may include a complex multiply-accumulate (CMAC) function, the K-best MLD detector 470 may be used for other matrix operations such as matrix multiplication, Cholesky and LU decomposition, and linear equation solving, as well.
The array 472 in Plane 1 may be used to perform QR decomposition. The data flow between the D PEs 110-1 and M PEs 112-1 is demonstrated in FIG. 36. As discussed above, there are two types of PEs 110-1 and 112-1 in the array 472. The diagonal PEs 110-1, denoted by circles, are used to compute a series of sines and cosines according to Givens rotation. The off-diagonal M PEs 112-1, denoted by squares, may operate as complex multiply-accumulate (CMAC) units that apply the rotations computed by the diagonal D PE 110-1 in the same row to other entries of the channel matrix. Each M PE 112-1 can get input data from the D PE 110-1 on the left in the same row and from the M PE 112-1 above, and output results to the two PEs on its right and below, respectively. In other words, the data flow of the array 472 is from left to right and from top to bottom.
The D PEs 110-1 and M PEs 112-1 in the same row (e.g., the i-th row) zero the lower off-diagonal entries in the i-th column of a channel matrix. When a pair of sine and cosine indicating a certain rotation is calculated, it not only zeros an off-diagonal entry, but also makes the resulting diagonal entries of R real numbers. The singleton diagonal PE 110-1 in the last row applies a rotation to a complex number, which makes R_N,N real.
The channel matrix H is fed into the array 472 from the top, column-wise. The i-th column of H feeds into the i-th D PE 110-1 or M PE 112-1 in the first row. Each PE 110-1 or 112-1 may read data from an input port in a cycle. Thus, for an N×N calculation, the throughput of QR decomposition is N cycles per matrix. The time difference between input elements of different columns, for example the inputs of H_11 and H_12, depends on the processing latency of the D PEs 110-1 and M PEs 112-1.
The function (e.g., program, configuration) of the diagonal D PEs 110-1 is shown in FIG. 37. The input data is a column of the original matrix or of a submatrix, all of which is to be zeroed except its first element. If the diagonal D PE 110-1 is in the i-th row, the number of input data is M = N+1−i. Based on Givens rotation, M−1 rotations are applied to the tail M−1 elements to make them zero. These M−1 pairs of sines and cosines are output to the off-diagonal M PEs 112-1 on the right. After the M−1 rotations, the first non-zero element represents the reciprocal of the diagonal of matrix R, which is
Figure PCTCN2020117947-appb-000058
This number L_ii is passed to a corresponding diagonal D PE 110-2 in Plane 2 for further decoding tree traverse.
An example of an internal functional arrangement (e.g., program, configuration) of the diagonal D PE 110-1 of Plane 1 is illustrated in FIG. 38. The D PE 110-1 uses three basic arithmetic modules. A first module 480 is a squared accumulator. The first module 480 calculates the squared magnitudes of the input data and then outputs the cumulative accumulated results of the squared magnitudes. A second module 482 is used to compute both square roots and reciprocal square roots of the output of the first module 480. Lastly, a third module 484 performs a complex multiply-accumulate (CMAC) operation, in which the output of the second module 482 and the initial inputs h_1, …, h_M are multiplied together to obtain the final sines and cosines. The output L_ii is the last output of the second module 482.
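A minimal Python sketch of this three-module flow is shown below. The phase convention for the first rotation (absorbing the phase of h_1 into a complex "cosine") and the lack of pipelining are assumptions for the example, not details taken from FIG. 38.

import numpy as np

def diagonal_pe_plane1(h):
    # h is the length-M input column.  Module 480: running squared magnitudes;
    # module 482: square roots and their reciprocals; module 484: multiplies
    # that form the M-1 (cosine, sine) pairs zeroing h[1:].  L_ii = 1/R_ii.
    acc = np.cumsum(np.abs(h) ** 2)
    r = np.sqrt(acc)
    pairs = []
    top = h[0]
    for k in range(1, len(h)):
        c = np.conj(top) / r[k]
        s = np.conj(h[k]) / r[k]
        pairs.append((c, s))
        top = r[k]                 # rotated top element, now real
    return pairs, 1.0 / r[-1]      # (cosine, sine) pairs and L_ii

h = np.array([1 + 1j, 2 - 1j, 0.5j])
pairs, L_ii = diagonal_pe_plane1(h)
print(abs(1.0 / L_ii - np.linalg.norm(h)) < 1e-12)   # True: 1/L_ii equals ||h||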
A function (e.g., program, configuration) of the off-diagonal M PEs 112-1 is shown in FIG. 39. If the M PE 112-1 is in the i-th row, it receives M = N+1−i data elements from the top and M−1 pairs of sines and cosines from the left. It directly forwards the sines and cosines to the M PE 112-1 on its right. It also applies the M−1 rotations successively to the inputs h_i and internally updated values r_i. The rotation matrix is:
Figure PCTCN2020117947-appb-000059
The results h′_2, …, h′_M are output to the bottom. The final internal result r_M is passed to the corresponding M PE 112-2 in Plane 2. If the off-diagonal M PE 112-1 is at row i and column j, then its r_M is equal to R_ij after QR decomposition.
An example of an internal functional arrangement (e.g., program, configuration) of the off-diagonal M PEs 112-1 of Plane 1 is illustrated in FIG. 40. An off-diagonal M PE 112-1 may include four identical CMACs 490, 492, 494, and 496 (e.g., the M PE 112-1 may be programmed to perform four identical CMAC operations using the ALU 164 as a CALU shown in FIG. 7) . The CMACs 490 and 492 on the left compute the downward outputs h′_i, and the related data paths are marked by solid lines. The CMACs 494 and 496 on the right compute the internal values r_i, and the related data paths are marked by dashed lines.
In each diagonal D PE 110-1, there may be an additional rotation module that is the same as an off-diagonal M PE 112-1, which may be used to compute y′ = Q^H·y. As shown in FIG. 41, the sines and cosines that are output to the off-diagonal M PEs 112-1 on the right also feed into this rotation module. Rotations represented by the sines and cosines are applied to the inputs y_i along the diagonal. The final internal value of the module, like r_M in the off-diagonal M PEs 112-1, is the y′_i that is transferred to a corresponding diagonal D PE 110-2 in Plane 2.
After computing the QR decomposition and y′ with the PE array 472 in Plane 1, the result may be transferred to the PE array 474 in Plane 2 for decoding tree traverse. The data that is passed is shown in FIG. 42. The PE 110-1 or 112-1 in the i-th row and j-th column of Plane 1 may transmit data to a corresponding PE 110-2 or 112-2 also in the i-th row and j-th column of Plane 2. For example, in FIG. 42, the i-th diagonal PE 110-1 transmits two values, L_ii and y′_i, to a corresponding i-th diagonal PE 110-2 in Plane 2. The off-diagonal M PE 112-1 located in the i-th row and j-th column may transmit R_ij downwards to a corresponding off-diagonal M PE 112-2 in Plane 2.
The array 474 of Plane 2 may be used to traverse a decoding tree. The data flow between PEs 110-2 and 112-2 is demonstrated in FIG. 43. Similar to the array 472 of Plane 1, there are two types of PEs 110-2 and 112-2 in Plane 2. The diagonal PEs 110-2 denoted in circles are used to compute K candidate partial data vectors
Figure PCTCN2020117947-appb-000060
based on the inputted partial Euclidean distances (PED) and partial data vector
Figure PCTCN2020117947-appb-000061
Here, 
Figure PCTCN2020117947-appb-000062
is a shorthand of the partial vector
Figure PCTCN2020117947-appb-000063
The off-diagonal M PEs 112-2 denoted in squares also perform  CMAC operations to calculate inter-stream interferences from the previously decoded layers. The inter-stream interferences
Figure PCTCN2020117947-appb-000064
are subtracted away at the diagonal PE before detecting symbol x_i.
The direction of data flow of the array 474 of Plane 2 is opposite to that of the array 472 of Plane 1. The PEs 110-2 and 112-2 receive input data from neighboring PEs 110-2 and 112-2 on the right and below, and output results to the two PEs 110-2 and 112-2 on the left and above. In other words, data flow of the array 474 is from right to left and from bottom to top.
The PEs 110-2 and 112-2 in the i-th row traverse the i-th layer of the decoding tree, which detects the transmitted symbol x_i with K possible outcomes. The tree traverse starts from the last diagonal D PE 110-2 to decode x_N. Next, the K candidates of x_N are propagated upward to construct inter-stream interferences for the remaining layers. The diagonal PE 110-2 one above the last one starts to decode x_{N-1}. The decoding proceeds in this manner until the first diagonal PE 110-2 in the uppermost row is reached. The value x_1 is the last one to be decoded.
A function (e.g., program, configuration) of the diagonal D PEs 110-2 in Plane 2 is shown in FIG. 44. It receives K previously decoded partial data vectors from the diagonal D PE 110-2 below. For the diagonal D PE 110-2 in the i-th row, K different possible sets of symbols x_{i+1}, …, x_N may already be determined, as well as the PEDs accumulated over coordinates i+1 to N. From the right, the D PE 110-2 may also receive inter-stream interferences from the previously decoded symbols. The inter-stream interferences are represented as
Figure PCTCN2020117947-appb-000065
Additionally, L_ii and y′_i are from the D PE 110-1 in the same position in Plane 1. A diagonal D PE 110-2 traverses the decoding tree one layer deeper, as illustrated in FIG. 34. After the computation, the D PE 110-2 sends to the diagonal D PE 110-2 above it the updated PEDs and partially decoded vectors with new symbols
Figure PCTCN2020117947-appb-000066
prepended to input partial decoded vectors. Meanwhile, K
Figure PCTCN2020117947-appb-000067
candidates and indices indicating which input partial vector they correspond to are also output to the off-diagonal M PE 112-2 above.
An example of an internal functional arrangement (e.g., program, configuration) of a diagonal PE 110-2 of Plane 2 is illustrated in FIG. 45. There are four modules 500, 502, 504, and 506 to perform various computations. In the module 500, which operates as a CMAC, the inter-stream interferences are subtracted away from y′_i. Since there are K different possible interferences, it has K results. After that, L_ii is multiplied in to obtain K least-squares (LS) estimates of
Figure PCTCN2020117947-appb-000068
The module 502 may operate as an enumeration (enum) module. In the module 502, based on each input LS estimate of
Figure PCTCN2020117947-appb-000069
K constellation points are selected that have the minimum Euclidean distances from the LS estimate. For the k-th LS estimate, the K possible constellation points chosen are
Figure PCTCN2020117947-appb-000070
Every cycle, K constellation points
Figure PCTCN2020117947-appb-000071
are transferred to the module 504, which may operate as a CMAC, on the lower right. The module 504 may represent not a single CMAC, but rather K CMAC complex multipliers. The module 504 calculates the squared magnitudes of inputted
Figure PCTCN2020117947-appb-000072
and adds them to
Figure PCTCN2020117947-appb-000073
whose results are updated PEDs
Figure PCTCN2020117947-appb-000074
In K cycles, K² PEDs are received by the fourth module 506, which represents a sorter. The module 506 may be a partially or fully pipelined insertion sorter that outputs the K smallest PEDs among the K² distances, along with their indices. Based on the indices, K corresponding candidates of
Figure PCTCN2020117947-appb-000075
are selected from the K² output constellation points and are transferred upwards along with the indices, so that the off-diagonal M PEs 112-2 can construct the inter-stream interferences. These K corresponding candidates of
Figure PCTCN2020117947-appb-000076
are appended to input partial decoded data vectors and passed to the diagonal PE above it. The K smallest PEDs from the module 506 may also be transferred upwards along the diagonal.
A function (e.g., program, configuration) for the off-diagonal M PEs 112-2 is shown in FIG. 46. In the horizontal direction, the M PEs 112-2 operate as a CMAC that multiplies K candidates of
Figure PCTCN2020117947-appb-000077
from below and R_ij from Plane 1 to get the interference of these
Figure PCTCN2020117947-appb-000078
onto layer i. The computed interferences are added to the other interferences to layer i that are input from the right, and the sums are then output to the left. Note that the interference
Figure PCTCN2020117947-appb-000079
is accumulated into the idx_k-th input from the right. In the vertical direction, the off-diagonal M PEs 112-2 directly forward inputs from the M PEs 112-2 below to the M PEs 112-2 above.
EXAMPLE EMBODIMENTS
Various example embodiments, representing a non-limiting set of embodiments that may follow from this disclosure, are provided below.
EXAMPLE EMBODIMENT 1. A system comprising:
a first spatial array of processing elements that perform QR decomposition; and
a second spatial array of processing elements, in communication with the first spatial array of processing elements, that perform decoding tree traverse in parallel using input data from the first spatial array of processing elements.
EXAMPLE EMBODIMENT 2. The system of example embodiment 1, wherein the first spatial array of processing elements and the second spatial array of processing elements comprise the same respective number of processing elements.
EXAMPLE EMBODIMENT 3. The system of example embodiment 1, wherein the first spatial array of processing elements and the second spatial array of processing elements comprise a triangular arrangement, wherein data flow through the first spatial array of processing elements is opposite to data flow through the second spatial array of processing elements.
EXAMPLE EMBODIMENT 4. The system of example embodiment 1, wherein a plurality of processing elements of the first spatial array of processing elements provide input data to a corresponding plurality of processing elements of the second spatial array of processing elements.
EXAMPLE EMBODIMENT 5. The system of example embodiment 1, wherein:
the processing elements of the first array of processing elements comprise processing elements of a diagonal processing element type and processing elements of an off-diagonal processing element type; and
the processing elements of the second array of processing elements comprise processing elements of the diagonal processing element type and processing elements of the off-diagonal processing element type but having different configurations with respect to those of the first array of processing elements.
EXAMPLE EMBODIMENT 6. The system of any of example embodiments 1–5, wherein a first plurality of the processing elements of the first array of processing elements perform squared accumulate, square root and reciprocal square root, and complex multiply-accumulate operations.
EXAMPLE EMBODIMENT 7. The system of any of example embodiments 1–5, wherein a second plurality of the processing elements of the first array of processing elements perform four complex multiply-accumulate operations.
EXAMPLE EMBODIMENT 8. The system of any of example embodiments 1–5, wherein a first plurality of the processing elements of the second array of processing elements perform complex multiply-accumulate, enumeration, and sorting operations.
EXAMPLE EMBODIMENT 9. The system of any of example embodiments 1–5, wherein a second plurality of the processing elements of the second array of processing elements perform a complex multiply-accumulate operation.
EXAMPLE EMBODIMENT 10. The system of any of example embodiments 1–5, comprising a plurality of antennas, wherein the first array of processing elements and the second array of processing elements perform a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication using the antennas.
EXAMPLE EMBODIMENT 11. An article of manufacture comprising one or more tangible, non-transitory, machine readable media comprising instructions that, when executed by processing circuitry, cause the processing circuitry to:
instruct a triangular spatial array of processing elements to perform partial QR decomposition; and
instruct the triangular spatial array of processing elements to perform partial decoding tree traverse using data from the partial QR decomposition.
EXAMPLE EMBODIMENT 12. The article of manufacture of example embodiment 11, wherein the instructions cause the processing circuitry to instruct the triangular  spatial array of processing elements to time multiplex between performing partial QR decomposition and partial decoding tree traverse.
EXAMPLE EMBODIMENT 13. The article of manufacture of  example embodiments  11 or 12, wherein the instructions cause the processing circuitry to instruct the triangular spatial array of processing elements to perform the partial QR decomposition and partial decoding tree traverse to carry out a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication.
EXAMPLE EMBODIMENT 14. An electronic device comprising:
a plurality of antennas; and
one or more configurable triangular spatial array of processing elements configurable to carry out a K-best maximum likelihood detector (MLD) for multiple-input multiple-output (MIMO) wireless communication via the plurality of antennas.
EXAMPLE EMBODIMENT 15. The electronic device of example embodiment 14, wherein the one or more configurable triangular spatial array of processing elements comprises a single configurable triangular spatial array of processing elements that is time multiplexed to alternate between QR decomposition and decoding tree traverse.
EXAMPLE EMBODIMENT 16. The electronic device of example embodiment 15, wherein the single configurable triangular spatial array of processing elements performs decoding tree traverse using inputs obtained during performance of QR decomposition.
EXAMPLE EMBODIMENT 17. The electronic device of example embodiment 15, wherein:
when the single configurable triangular spatial array of processing elements performs QR decomposition, the single configurable triangular spatial array of processing elements has a first data flow through the processing elements; and
when the single configurable triangular spatial array of processing elements performs decoding tree traverse, the single configurable triangular spatial array of processing elements has a second data flow through the processing elements that is different from the first data flow.
EXAMPLE EMBODIMENT 18. The electronic device of example embodiment 17, wherein the second data flow is at least partially opposite the first data flow.
EXAMPLE EMBODIMENT 19. The electronic device of any of example embodiments 14-18, wherein the one or more configurable triangular spatial array of processing elements is configurable to perform Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE) , Givens-Rotation QR based MMSE, and Gram-Schmidt QR decomposition.
EXAMPLE EMBODIMENT 20. The electronic device of example embodiment 14, wherein the one or more configurable triangular spatial array of processing elements comprise:
a first spatial array of processing elements that perform QR decomposition; and
a second spatial array of processing elements, in communication with the first spatial array of processing elements, that perform decoding tree traverse in parallel using input data from the first spatial array of processing elements.
EXAMPLE EMBODIMENT 21. A method comprising:
using a triangular spatial array of processing elements to perform QR decomposition; and
using the triangular spatial array of processing elements to perform decoding tree traverse using data from the partial QR decomposition.
EXAMPLE EMBODIMENT 22. The method of example embodiment 21, wherein the triangular spatial array of processing elements is time-multiplexed between performing the QR decomposition and the decoding tree traverse.
EXAMPLE EMBODIMENT 23. The method of  example embodiments  21 or 22, wherein using the triangular spatial array of processing elements to perform the QR decomposition and the decoding tree traverse comprises carrying out a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication.
EXAMPLE EMBODIMENT 24. A method comprising:
using a first spatial array of processing elements to perform QR decomposition; and
using a second spatial array of processing elements, in communication with the first spatial array of processing elements, to perform decoding tree traverse in parallel using input data from the first spatial array of processing elements.
EXAMPLE EMBODIMENT 25. The method of example embodiment 24, comprising providing the input data from a plurality of processing elements of the first spatial array of processing elements to a corresponding respective plurality of processing elements of the second spatial array of processing elements.
While the embodiments set forth in the present disclosure may be susceptible to various modifications and alternative forms, specific embodiments have been shown by way of example  in the drawings and have been described in detail herein. However, it should be understood that the disclosure is not intended to be limited to the particular forms disclosed. The disclosure is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the following appended claims. Moreover, the techniques presented and claimed herein are referenced and applied to material objects and concrete examples of a practical nature that demonstrably improve the present technical field and, as such, are not abstract, intangible or purely theoretical. Further, if any claims appended to the end of this specification contain one or more elements designated as “means for [perform] ing [a function] …” or “step for [perform] ing [a function] …” , it is intended that such elements are to be interpreted under 35 U.S.C. 112 (f) . However, for any claims containing elements designated in any other manner, it is intended that such elements are not to be interpreted under 35 U.S.C. 112 (f) .

Claims (25)

  1. A system comprising:
    a first spatial array of processing elements that perform QR decomposition; and
    a second spatial array of processing elements, in communication with the first spatial array of processing elements, that perform decoding tree traverse in parallel using input data from the first spatial array of processing elements.
  2. The system of claim 1, wherein the first spatial array of processing elements and the second spatial array of processing elements comprise the same respective number of processing elements.
  3. The system of claim 1, wherein the first spatial array of processing elements and the second spatial array of processing elements comprise a triangular arrangement, wherein data flow through the first spatial array of processing elements is opposite to data flow through the second spatial array of processing elements.
  4. The system of claim 1, wherein a plurality of processing elements of the first spatial array of processing elements provide input data to a corresponding plurality of processing elements of the second spatial array of processing elements.
  5. The system of claim 1, wherein:
    the processing elements of the first array of processing elements comprise processing elements of a diagonal processing element type and processing elements of an off-diagonal processing element type; and
    the processing elements of the second array of processing elements comprise processing elements of the diagonal processing element type and processing elements of the off-diagonal processing element type but having different configurations with respect to those of the first array of processing elements.
  6. The system of any of claims 1–5, wherein a first plurality of the processing elements of the first array of processing elements perform squared accumulate, square root and reciprocal square root, and complex multiply-accumulate operations.
  7. The system of any of claims 1–5, wherein a second plurality of the processing elements of the first array of processing elements perform four complex multiply-accumulate operations.
  8. The system of any of claims 1–5, wherein a first plurality of the processing elements of the second array of processing elements perform complex multiply-accumulate, enumeration, and sorting operations.
  9. The system of any of claims 1–5, wherein a second plurality of the processing elements of the second array of processing elements perform a complex multiply-accumulate operation.
  10. The system of any of claims 1–5, comprising a plurality of antennas, wherein the first array of processing elements and the second array of processing elements perform a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication using the antennas.
  11. An article of manufacture comprising one or more tangible, non-transitory, machine readable media comprising instructions that, when executed by processing circuitry, cause the processing circuitry to:
    instruct a triangular spatial array of processing elements to perform partial QR decomposition; and
    instruct the triangular spatial array of processing elements to perform partial decoding tree traverse using data from the partial QR decomposition.
  12. The article of manufacture of claim 11, wherein the instructions cause the processing circuitry to instruct the triangular spatial array of processing elements to time multiplex between performing partial QR decomposition and partial decoding tree traverse.
  13. The article of manufacture of claim 11 or 12, wherein the instructions cause the processing circuitry to instruct the triangular spatial array of processing elements to perform the partial QR decomposition and partial decoding tree traverse to carry out a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication.
  14. An electronic device comprising:
    a plurality of antennas; and
    one or more configurable triangular spatial arrays of processing elements configurable to carry out a K-best maximum likelihood detector (MLD) for multiple-input multiple-output (MIMO) wireless communication via the plurality of antennas.
  15. The electronic device of claim 14, wherein the one or more configurable triangular spatial arrays of processing elements comprise a single configurable triangular spatial array of processing elements that is time-multiplexed to alternate between QR decomposition and decoding tree traverse.
  16. The electronic device of claim 15, wherein the single configurable triangular spatial array of processing elements performs decoding tree traverse using inputs obtained during performance of QR decomposition.
  17. The electronic device of claim 15, wherein:
    when the single configurable triangular spatial array of processing elements performs QR decomposition, the single configurable triangular spatial array of processing elements has a first data flow through the processing elements; and
    when the single configurable triangular spatial array of processing elements performs decoding tree traverse, the single configurable triangular spatial array of processing elements has a second data flow through the processing elements that is different from the first data flow.
  18. The electronic device of claim 17, wherein the second data flow is at least partially opposite the first data flow.
  19. The electronic device of any of claims 14-18, wherein the one or more configurable triangular spatial arrays of processing elements are configurable to perform Cholesky decomposition, LU decomposition, Cholesky-based minimum mean square error (MMSE), Givens-rotation QR-based MMSE, and Gram-Schmidt QR decomposition.
  20. The electronic device of claim 14, wherein the one or more configurable triangular spatial arrays of processing elements comprise:
    a first spatial array of processing elements that perform QR decomposition; and
    a second spatial array of processing elements, in communication with the first spatial array of processing elements, that perform decoding tree traverse in parallel using input data from the first spatial array of processing elements.
  21. A method comprising:
    using a triangular spatial array of processing elements to perform QR decomposition; and
    using the triangular spatial array of processing elements to perform decoding tree traverse using data from the QR decomposition.
  22. The method of claim 21, wherein the triangular spatial array of processing elements is time-multiplexed between performing the QR decomposition and the decoding tree traverse.
  23. The method of claim 21 or 22, wherein using the triangular spatial array of processing elements to perform the QR decomposition and the decoding tree traverse comprises carrying out a K-best maximum likelihood detector (MLD) method for multiple-input multiple-output (MIMO) wireless communication.
  24. A method comprising:
    using a first spatial array of processing elements to perform QR decomposition; and
    using a second spatial array of processing elements, in communication with the first spatial array of processing elements, to perform decoding tree traverse in parallel using input data from the first spatial array of processing elements.
  25. The method of claim 24, comprising providing the input data from a plurality of processing elements of the first spatial array of processing elements to a corresponding respective plurality of processing elements of the second spatial array of processing elements.
PCT/CN2020/117947 2020-09-25 2020-09-25 Versatile systolic array for maximum likelihood mimo detectors WO2022061788A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/117947 WO2022061788A1 (en) 2020-09-25 2020-09-25 Versatile systolic array for maximum likelihood mimo detectors

Publications (1)

Publication Number Publication Date
WO2022061788A1 true WO2022061788A1 (en) 2022-03-31

Family

ID=80844806

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117947 WO2022061788A1 (en) 2020-09-25 2020-09-25 Versatile systolic array for maximum likelihood mimo detectors

Country Status (1)

Country Link
WO (1) WO2022061788A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140185716A1 (en) * 2011-05-09 2014-07-03 St-Ericsson Sa Mimo Receiver Using Lattic Reduction and K-Best Detection
CN102291215A (en) * 2011-09-14 2011-12-21 北京大学 Signal detection method and device for MIMO (Multiple Input Multiple Output) system
CN102307080A (en) * 2011-09-14 2012-01-04 北京大学 Method and device for detecting serial block signal in MIMO (multiple-input multiple-output) system
US20160218827A1 (en) * 2015-01-26 2016-07-28 Mitsubishi Electric Research Laboratories, Inc. System and Method for Decoding Block of Data Received Over Communication Channel

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033843A (en) * 2022-08-09 2022-09-09 之江实验室 Circuit implementation method for covariance matrix calculation based on triangular pulse array
CN115033843B (en) * 2022-08-09 2022-11-08 之江实验室 Circuit implementation method for covariance matrix calculation based on triangular pulse array

Similar Documents

Publication Publication Date Title
US9647731B2 (en) Reconfigurable network on a chip (NoC) radio through reduced instruction set computer (RISC) agents by overwriting program store for different phases of demodulation
US8819099B2 (en) Software implementation of matrix inversion in a wireless communication system
Luethi et al. VLSI implementation of a high-speed iterative sorted MMSE QR decomposition
Liao et al. A 3.1 Gb/s 8×8 Sorting Reduced K-Best Detector With Lattice Reduction and QR Decomposition
CN103516643A (en) MIMO detecting preprocessing device and method
WO2022061788A1 (en) Versatile systolic array for maximum likelihood mimo detectors
Lee et al. Efficient low-latency implementation of CORDIC-based sorted QR decomposition for multi-Gbps MIMO systems
Shabany et al. High-Throughput 0.13-μm CMOS Lattice Reduction Core Supporting 880 Mb/s Detection
CN107483090B (en) Large-scale MIMO system precoding realization method based on LDLT decomposition
Wu et al. A GPU implementation of a real-time MIMO detector
Eberli et al. Divide-and-conquer matrix inversion for linear MMSE detection in SDR MIMO receivers
Shahabuddin et al. Programmable ASIPs for multimode MIMO transceiver
WO2022061781A1 (en) Programmable spatial array for matrix decomposition
CN112528224B (en) Matrix eigenvalue decomposition grouping circulation iteration flow realization method and system
Hänninen et al. Novel detector implementations for 3G LTE downlink and uplink
Irturk et al. Automatic generation of decomposition based matrix inversion architectures
Guo et al. Scalable FPGA architectures for LMMSE-based SIMO chip equalizer in HSDPA downlink
Guo et al. Rapid prototyping and VLSI exploration for 3G/4G MIMO wireless systems using integrated Catapult-C methodology
Mohammed et al. A MIMO decoder accelerator for next generation wireless communications
Xu Systolic array for universal matrix arithmetic
Munafo Cooperative high-performance computing with FPGAs-matrix multiply case-study
Zhang et al. Heterogeneous reconfigurable processors for real-Time baseband processing
Irturk Implementation of QR decomposition algorithm using FPGAs
Barrenechea et al. Implementation of complex enumeration for multiuser MIMO vector precoding
Salmela et al. 3G Long Term Evolution baseband processing with application-specific processors

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20954640

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20954640

Country of ref document: EP

Kind code of ref document: A1