WO2020259805A1 - De-spreader system and method

Info

Publication number: WO2020259805A1
Application number: PCT/EP2019/066778
Authority: WO
Inventors: Moti BAR; Avraham Gal; Alon Eran; Amir ARTSI
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2019-06-25
Filing date: 2019-06-25
Publication date: 2020-12-30
Also published as: EP3987392A1

Abstract

Apparatus for signal extraction and de-spreading of complex signal elements comprises a vector processing unit arranged into lanes, and a despreader that carries out a despread instruction over the lanes, the despread instruction defining obtaining successive complex chips, and transforming the complex chips in each lane by multiplication in a sequence of iterations according to a transformation defined in a control code obtained from a scrambling sequence. The transformations are summed following the iterations and the result is provided as separate real and imaginary parts.

Description

DE-SPREADER SYSTEM AND METHOD

FIELD AND BACKGROUND OF THE INVENTION

The present invention, in some embodiments thereof, relates to a de-spreader system and method and, more particularly, but not exclusively, to a way of improving the efficiency of the de-spreading operation by modifying the hardware in the processing or DSP core, and modifying the corresponding instruction set.

Spread and de-spread functionality is widely used in different telecommunications modules such as Spread-Spectrum De-spreaders and Correlators, and similar functionality may be used in channel decoders (Viterbi, Turbo, etc.). The applications often demand high-speed computation. To satisfy the speed requirement as well as power efficiency, efficient De-spread architectures are required as well as advanced IC technologies.

The more straightforward of the existing solutions may use specific HW or DSP processor instructions. While the latter gives flexibility with different de-spreading requirements, the former solution keeps the same level of flexibility as other vector processor based solutions while increasing efficiency, both in speed and in power consumption.

An issue arises as to how to assign an efficient solution for a vector processor.

One of the challenges for the DSP solution is to fully utilize the vector processor. The optimal solution may take advantage of the different data types (such as char, short, long). Current state of the art solutions do not fully utilize the hardware.

An example of a current solution that does not fully utilize the hardware is the PRACH Preamble searcher. The PRACH preamble searcher is a major function in the UMTS receiver. A typical NodeB PRACH Preamble searcher implementation requires a bank of 16 correlation machines that produce signature hypothesis power profiles, from which a so-called peak finding machine produces a Random Access hypothesis.

Fig. 1 shows a typical implementation of such a current solution in the form of an FHT unit 10. By far, the main computational load in this implementation is due to correlators 12. The correlators in practice perform De-Scramble based on a descrambling code generator 14, and De-Spread operations. The typical implementation of the correlation part of the searcher uses fixed function hardware acceleration. The operation is carried out on an incoming signal received at Rx antennas 16, and is computed one antenna at a time, and the peak finding is carried out on the descrambled and despread results after the FHT unit 10 and peak finding machine 18.

Fig. 2 shows three of the correlators 12 of a sequence of 15 correlators. The correlators take as inputs antenna samples 20 from the antennas 16 (Fig. 1) and a scrambling sequence 22 from scrambling code generator 14 (Fig. 1) at 16-chip intervals. This imposes even greater constraints on implementation due to memory access patterns.

Prior art system BBE DSPR1DANX8CSF16 performs a 16-way vector multiply-accumulate (MAC) of 8-bit complex elements which are contained in narrow input vector register vs, and coded 2-bit complex elements which are contained in narrow input vector register vr. The input registers from vector register file vec are 256 bits wide. A multiply accumulate operation is involved, which uses 256 bits from the narrow input vector register vr. Furthermore the multiply accumulate operation uses 32 bits from the coded narrow input/output vector register vr to form a codeset in the coded multiplicand register. The 3-bit immediate opl_idx8 determines which codeset from the vr register to use. When opl_idx8 is 0, the least significant useful 32 bits of the register are used. When opl_idx8 is 7, the most significant 32 bits are used. Values between 1 and 6 select the corresponding intermediate sets of 32 bits. For a given spreading factor of 16, 16 contiguous complex products are added together to form the (1) result elements. The 1-bit immediate dspr_code identifies which de-spreading code book to use. When dspr_code == 0, the codes are taken from the set {±1±i}. When dspr_code == 1, the codes are taken from the set {1, -1, i, -i}.
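The codeset selection described above may be illustrated with a minimal Python sketch. It models only the bit slicing stated in the text, namely eight 32-bit codesets within a 256-bit register, each holding sixteen 2-bit codes. The function names are hypothetical, the ordering of the 2-bit codes inside a codeset is an assumption, and the mapping from a 2-bit code to a code-book value is not given here and is therefore not modeled.

```python
def select_codeset(vr: int, opl_idx8: int) -> int:
    """Return the 32-bit codeset selected by the 3-bit immediate opl_idx8.

    vr is the 256-bit register content held as a Python integer.
    Index 0 selects the least significant 32 bits, index 7 the most significant.
    """
    if not 0 <= opl_idx8 <= 7:
        raise ValueError("opl_idx8 is a 3-bit immediate (0..7)")
    return (vr >> (32 * opl_idx8)) & 0xFFFFFFFF


def split_codes(codeset: int) -> list[int]:
    """Split a 32-bit codeset into sixteen 2-bit codes.

    Assumption: codes are ordered from the least significant bit pair upwards.
    """
    return [(codeset >> (2 * k)) & 0x3 for k in range(16)]
```

Sixteen 2-bit codes per codeset match the spreading factor of 16 mentioned above, one code per complex product.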

The above instruction performs 16-way MAC but only on 8-bit complex elements. The performance is the same as if performed on the full 16-bit complex elements and the instruction does not take advantage of the fact that the data type is half the width.

Fixed function hardware correlators on the other hand have the following drawbacks:

The correlator may have only a single functionality, and that functionality is only required for a UMTS preamble search. When operating in a different mode (or RAT) the silicon area, and associated leakage, is redundant.

The hardware may lack flexibility for performing similar tasks in the chip-rate processing domain.

Typical SW based correlator solutions may have the following drawbacks:

A large silicon-area to processing-capacity ratio; and

Inefficiencies in relation to memory access patterns.

SUMMARY OF THE INVENTION

A software-based method may utilize vector processing units, specifically the components already available in vector complex MAC machines, to efficiently perform multiply accumulate and preamble search functionality. An ability to utilize the full width of the registers is provided in a way that is suitable for the data type, and suitable instructions are provided for efficient hardware utilization.

According to an aspect of some embodiments of the present invention there is provided apparatus for signal extraction and de-spreading of complex signal elements comprising:

a vector processing unit arranged into lanes;

a despreader configured to carry out a despread instruction over the lanes, the despread instruction defining obtaining successive complex elements, the despread instruction further defining a transform of respective complex elements in each lane by multiplication in a sequence of iterations according to a transformation defined by control inputs, the de-spreader configured to sum the transformed elements following the iterations to form real and imaginary results. Thus each lane computes a part of the de-spread function. Eventually, when all the iterations of the loop are done, all the lane results are combined (summed) to give the final de-spread result.

A lane, also called an element, may be defined as the element on which the operation is done. The same operation is performed on all vector (register) lanes.

In an embodiment, the de-spreader is configured to carry out the despread instruction on 64 eight-bit complex elements per iteration from first and second source registers into a result register. It is noted that 64 complex chips per iteration gives 64N complex chips for N iterations of the loop.

In an embodiment, the 64 eight-bit complex elements are processed in 16 thirty-two-bit lanes, and each lane may take two 8-bit element pairs from the first source register and two 8-bit element pairs from the second source register.
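The lane layout described above may be sketched as follows. A 32-bit lane holds two complex elements, each made of a signed 8-bit real part and a signed 8-bit imaginary part. The byte order within the lane (real part in the lower byte, first element in the lower half) is an assumption made for illustration only, and the function names are hypothetical.

```python
import struct


def pack_lane(chip0: tuple[int, int], chip1: tuple[int, int]) -> int:
    """Pack two (real, imag) pairs of signed 8-bit values into one 32-bit lane word.

    Assumed layout, least significant byte first: re0, im0, re1, im1.
    """
    re0, im0 = chip0
    re1, im1 = chip1
    return int.from_bytes(struct.pack("<4b", re0, im0, re1, im1), "little")


def unpack_lane(word: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Recover the two (real, imag) pairs from a 32-bit lane word."""
    re0, im0, re1, im1 = struct.unpack("<4b", word.to_bytes(4, "little"))
    return (re0, im0), (re1, im1)
```

With 16 lanes, two elements per lane and two source registers, this accounts for the 64 complex elements handled per iteration.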

In an embodiment, each lane is configured to carry out one of a set of four complex multiplications according to the control inputs, the control inputs being applied with the defined instruction.

In an embodiment, each lane has a connection to receive the control input for the transformation instruction.

In an embodiment, the control input is derived from a scrambling sequence.

In an embodiment, the scrambling sequence comprises a pseudo-noise sequence.

In an embodiment, each lane is configured to accommodate a real part and an imaginary part of a current one of the complex elements, such that the real parts at each lane respectively are connected to a real accumulator and imaginary parts at each lane respectively are connected to an imaginary accumulator, therefrom to provide separate real and imaginary accumulated results.

In an embodiment, an extractor may be provided to carry out extraction and interleaving.

Embodiments may extract a consecutive four-bit value from a register and interleave it with a further consecutive four-bit value to form an eight-bit value.

Embodiments may form 32 bit values from the eight-bit values by zero extension. Embodiments may provide sixteen of the thirty-two bit entries from an initial set of 32 of the four-bit values.

According to a second aspect of the present invention there is provided a method for signal extraction and de-spreading of complex signal elements comprising:

providing lanes;

assigning complex signal elements to respective lanes; and

applying a despread instruction over the lanes, the despread instruction comprising:

obtaining successive complex elements;

defining a transform of respective complex elements in each lane by multiplication in a sequence of iterations according to a transformation defined by control inputs; and

summing the transformed elements following the iterations to form a result in real and imaginary parts.

The method may comprise applying values to the control inputs from a pseudo noise sequence.

Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.

Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.

For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions, and may in particular involve a signal processor. Optionally, the data processor or signal processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media or other flash memory, for storing instructions and/or data. Optionally, a network connection is provided as well.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)

Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.

In the drawings:

FIG. 1 is a simplified diagram showing a prior art FHT processor;

FIG. 2 is a simplified diagram showing in greater detail the correlators of Fig. 1;

FIG. 3 is a simplified diagram showing de-spreading and descrambling apparatus according to an embodiment of the present invention; and

FIG. 4 is a simplified flow chart illustrating operation of the apparatus of Fig. 3.

DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION

The present invention, in some embodiments thereof, relates to a de-spreader system and method and, more particularly, but not exclusively, to a way of improving the efficiency of the de-spreading operation by modifying the hardware and the corresponding instruction set.

The present embodiments describe a SW based method to utilize vector processing units, specifically the components already available in vector complex MAC machines, to efficiently perform said functionality.

The vector processing units have all the flexibility to perform other computational tasks in other modes.

The present embodiments may provide processor instructions and a method that accelerates the calculation of the de-spreading procedure and supports different types of data, de-spreading factors and code sets.

The present embodiments may accelerate de-spreading, and this may be achieved with one or both of an extract instruction that includes interleaving, and specific de-spread instructions provided by the present embodiments. The present embodiments may provide improved performance compared to methods known in the art. Such may be achieved with negligible additional hardware and relatively small complexity.

Using the current art as a benchmark, the present embodiments may provide a factor of two improvement. Specifically, the present embodiments may utilize 8-bit complex data and may achieve a de-spread factor of 4, using a code set {1, -1, i, -i}.

As will be discussed below, the present embodiments may use 2X vector length differences to provide approximately 2X better performance.

For purposes of better understanding some embodiments of the present invention, as illustrated in Figures 3 onwards of the drawings, reference is first made to the construction and operation of a prior art de-spreader as illustrated in Figures 1 and 2.

Fig. 1 shows a typical implementation of such a current solution in the form of an FHT unit 10. By far, the main computational load in this implementation is due to correlators 12. The correlators in practice perform De-Scramble based on a descrambling code generator 14, and De-Spread operations. The typical implementation of the correlation part of the searcher uses fixed function hardware acceleration. The operation is carried out on an incoming signal received at Rx antennas 16, and the peak finding is carried out on the descrambled and despread results after the FHT unit 10 and peak finding machine 18.

Fig. 2 shows three of the correlators 12 of a sequence of 15 correlators. The correlators take as inputs antenna samples 20 from the antennas 16 (Fig. 1) and a scrambling sequence 22 from scrambling code generator 14 (Fig. 1) at 16-chip intervals. The use of correlators imposes even greater constraints on implementation due to memory access patterns.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.

Reference is now made to Figure 3, which illustrates vector processing apparatus 100 for signal extraction and de-spreading of complex signal elements according to embodiments of the present invention. The apparatus may typically be part of a vector processor. As with Fig. 1, antenna samples are processed to provide de-spreading. The complex signal elements, or chips, are sixteen bit chips made up of an eight-bit real part and an eight-bit imaginary part. The apparatus may be built into lanes 102, each lane taking one chip at a time.

A despreader 104 may carry out a de-spread instruction over the lanes. The despread instruction may involve obtaining successive complex chips, say two successive chips, and then two further successive chips from 32 positions further ahead. Thus as illustrated in the example in the figure, chips n, n+1, 32+n and 32+n+1 are taken. For each chip the real and imaginary parts are placed in separate eight-bit registers, thus registers 106 and 108 for the real and imaginary parts respectively of the n-th chip, 110 and 112 for the (n+1)-th chip, 114 and 116 for the (32+n)-th chip, and 118 and 120 for the (32+n+1)-th chip. The complex chips in each lane 102 are multiplied in the lanes 102 according to a transformation defined in an instruction, and as modified by control bits c0, c1, c2, c3 or c4. The settings of the control bits are discussed in greater detail hereinbelow. The results are inserted into registers 121, and then the transformed results from registers 121 are summed by summer 122 which forms the real sum and summer 124 which forms the imaginary sum. The summed results are added to any previous results as the procedure iterates through a full set of chips. In the example illustrated, the full set is 64 chips, where the first 32 chips are interleaved with the following 32 chips in pairs. Thus the result is a multiply, interleave and accumulate over a sequence of 64 chips to form a single iteration of the de-spread.

The de-spread instruction on 64 eight bit complex chips per iteration is repeated for a number N of iterations, hence there are 64N complex chips for N iterations of the loop. Each iteration constitutes an invocation of the de-spread instruction.

Each of the lanes 102 is a thirty-two-bit lane, and each lane is configured to take two 8-bit chip pairs from a first source register and two 8-bit chip pairs from a second source register. As in the above example the source registers may be 32 chips apart. There may be sixteen such lanes altogether.

Each lane is configured to carry out one of a set of four complex multiplications according to the control inputs c0 to c4 as mentioned, which may be applied along with the defined instruction. The different multiplications are a function of the control inputs and not of instruction flavor. It is noted that the control inputs to the instruction are derived from the spreading sequence.

The control inputs are derived from the spreading sequence, as will be discussed in greater detail hereinbelow.

Register 126 holds the summed real parts of each lane and register 128 holds the summed imaginary parts of each lane. Successive sums from these registers are accumulated at accumulation registers RL(V[i]) and IM(V[i]) to provide separate real and imaginary accumulated results.

The initial set of registers 106 - 120 carries out extraction of the chips for current processing and carries out the interleaving. Interleaving may involve taking four-bit values from one of the registers, either real or imaginary, and interleaving with a consecutive four-bit value to form an interleaved eight-bit value.

The eight-bit values may be formed into 32-bit values by zero extension.

Sixteen such thirty-two-bit entries may be formed by interleaving and zero extension from an initial set of 32 four-bit values.

It is noted that each lane computes a part of the de-spread function. Eventually, when all the iterations of the loop are done, all the lane results are combined (summed) to give the final de-spread result.

The embodiment shown in Fig. 3 is now considered in greater detail, with reference to table 1 and Fig. 4.

The code shown in table 1 below provides combined de-spreading and descrambling over the apparatus as illustrated in Fig. 3.

Table 1 - code for de-spreading and descrambling

Reference is now made to Fig. 4, which is a simplified flow chart explaining Table 1. Each iteration or loop cycle performs 64 complex MAC operations as discussed above. In lines 3 and 4 of table 1, the 64 8-bit complex data chips are loaded - 140. Then, in line 5, the PN.POLY instruction generates a 128-bit pseudo noise (PN) sequence - 142. The sequence forms 64 bit pairs. Subsequently, in line 6, interleaved extraction of the PN sequence is carried out in order to control the de-spreading - 144. That is to say, the PN sequence bit pairs provide the control bits c0..c4 as shown in Fig. 3. An instruction EXT.U8.ITL.V in line 6 carries out the interleaved extraction. Further details of the extraction instruction EXT.U8.ITL.V are given in table 2. The role of the EXT.U8.ITL.V instruction is to organize the control bits in the interleaved order (0, 1, 32, 33, 2, 3, 34, 35, ...) and to place each group of 4 pairs in the appropriate lane.

The instruction used for de-spreading is, in the present example, DSPRD4A, and in line 7 the de-spread instruction carries out 64 complex multiply and accumulate (MAC) operations, with addition of the results (ADD) - 146. Each quadruplet of the DSPRD4A instruction operates on complex elements:

x(n), x(n+1), x(n+32), x(n+33).

Further details of the DSPRD4A de-spreading instructions are given in Table 3.

The de-spreading operates on 64 elements located in two vectors V0:V1 in the example. Vector V0 holds elements 0..31 and vector V1 holds elements 32..63. The 4 elements in lane 0 may thus be elements 0, 1, 32 and 33. The next 4 elements are in lane 1 and these will be elements 2, 3, 34 and 35 and so on. The PN.POLY instruction in line 5 generates the 128-bit PN sequence that is mentioned, from which 64 bit pairs may be formed in the order 0, 1, 2, 3, 4, 5, 6, ....
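The assignment of elements to lanes described above can be written out as a short Python sketch. The function name lane_elements is hypothetical; the index pattern is taken directly from the lane 0 and lane 1 examples in the text.

```python
def lane_elements(lane: int) -> tuple[int, int, int, int]:
    """Indices of the four complex elements handled by one of the 16 lanes.

    Lane k takes elements 2k and 2k+1 from the first vector (elements 0..31)
    and elements 32+2k and 33+2k from the second vector (elements 32..63).
    """
    if not 0 <= lane < 16:
        raise ValueError("lane must be in 0..15")
    n = 2 * lane
    return (n, n + 1, 32 + n, 32 + n + 1)


# Interleaved order over all lanes: 0, 1, 32, 33, 2, 3, 34, 35, ...
interleaved_order = [i for lane in range(16) for i in lane_elements(lane)]
```

Every one of the 64 elements appears exactly once in the interleaved order, so the control bit pairs have to be rearranged into the same order before they are applied to the lanes.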

Subsequently, line 9 obtains the inner sum of the 16 vector elements to produce the result - 148.

Thus the loop processes 64 complex 8-bit data elements in 1 cycle, and performance is 2X better, per SIMD lane, than the prior art BBE32.
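The loop as a whole may be modeled behaviourally as in the following Python sketch. It is not the code of Table 1: it uses Python complex numbers instead of 8-bit fixed-point registers, the function names are hypothetical, and the control values are assumed to be already available as integers 0..3 indexed by element number. Each iteration handles 64 chips, each lane keeps its own partial sum across iterations, and the inner sum over the 16 lanes is taken once at the end.

```python
# Multipliers selected by control values 0, 1, 2, 3.
MULTIPLIERS = (1, -1j, -1, 1j)


def despread_by_lanes(blocks):
    """De-spread a list of (chips, ctrl) blocks, 64 entries each, lane by lane."""
    lane_acc = [0j] * 16
    for chips, ctrl in blocks:              # one loop iteration per 64 chips
        for lane in range(16):
            n = 2 * lane
            for i in (n, n + 1, 32 + n, 33 + n):
                lane_acc[lane] += chips[i] * MULTIPLIERS[ctrl[i]]
    total = sum(lane_acc)                   # inner sum of the 16 lane results
    return int(total.real), int(total.imag)


def despread_direct(blocks):
    """Reference result: plain sum of chip times multiplier over all chips."""
    total = sum(c * MULTIPLIERS[k]
                for chips, ctrl in blocks
                for c, k in zip(chips, ctrl))
    return int(total.real), int(total.imag)
```

Because addition is commutative, the lane-wise result equals the direct correlation, which is why the per-lane partial sums can be kept separate until the final inner sum.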

Using the above instructions, an interleaved extraction of the PN sequence is achieved.

The following table contains a more detailed description of the EXT.U8.ITL.V instruction:

Description

EXT.U8.ITL.V performs an extraction of 32 consecutive 4-bit values from the low 128 bits of vector register Vs into vector register Vv. The 4-bit values are interleaved to form 8-bit values. The 8-bit values are zero extended to 32 bits and placed in the 16 consecutive 32-bit entries of vector register Vv.

• Vv is restricted to be one of the following group of 4 vector registers: {V2, V6, V10, V14}

Bit-Exact Description

Table 2 - Details of the EXT.U8.ITL.V extraction instruction.
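The extraction may be illustrated with the following Python sketch. The bit-exact behaviour is not reproduced in this text, so the pairing of 4-bit values below is an assumption chosen to match the interleaved order (0, 1, 32, 33, 2, 3, 34, 35, ...) described earlier: entry k combines 4-bit value k as the low half with 4-bit value 16+k as the high half. The function name is hypothetical.

```python
def ext_u8_itl(vs_low128: int) -> list[int]:
    """Model of the interleaved extraction on the low 128 bits of a register.

    Returns 16 entries, each an 8-bit value zero extended to 32 bits.
    Assumption: entry k = nibble k (bits 0..3) | nibble 16+k (bits 4..7).
    """
    mask128 = (1 << 128) - 1
    vs_low128 &= mask128
    nibbles = [(vs_low128 >> (4 * k)) & 0xF for k in range(32)]
    return [nibbles[k] | (nibbles[16 + k] << 4) for k in range(16)]
```

Under this assumption, the four 2-bit control values in entry k are, from least to most significant, the bit pairs 2k, 2k+1, 32+2k and 33+2k of the 128-bit sequence.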

The following table describes in greater detail the DSPRD4.CS16.CS8.CS8.V command:

Description

DSPRD4.CS16.CS8.CS8.V performs a despread 4 of 64 8-bit complex chips from vector register pair Vp:Vq into vector register Vt. Vector registers Vt, Vp, Vq and Vy are viewed as 16 32-bit lanes. In each lane the two 8-bit complex chips from vector register Vq and the two 8-bit complex chips from vector register Vp are transformed by transformation T according to the control field located in the least significant 8 bits of the corresponding vector register Vy lane. The 4 transformed 8-bit complex chips are summed as a 16-bit complex value and placed in the corresponding vector register Vt lane. The control field is a sequence of four 2-bit values, where each 2-bit value controls the transformation of one 8-bit complex chip. Ordered from least to most significant, the 4 control values are assigned to the 2 Vq chips, least to most significant, followed by the 2 Vp chips, least to most significant. The transformation T uses the control values 0, 1, 2 or 3 to multiply the 8-bit complex value by 1, -j, -1 or j respectively, as shown in the Transform T table below. The actual basic operations performed are a possible swap of the real and imaginary values and a possible negation of one or both of them.

• Vp:Vq is a sequential register pair, where p can be any even number in the range [0..14] and q = p+1

• Vy is restricted to be one of the following group of 4 vector registers: {V2, V6, V10, V14}
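The per-lane behaviour described above may be sketched in Python as follows. This is a behavioural illustration only: chips are represented as (real, imag) integer pairs rather than packed register bytes, the function names are hypothetical, and the accumulation into a previous result is not modeled.

```python
def transform_t(re: int, im: int, ctrl: int) -> tuple[int, int]:
    """Multiply (re + j*im) by 1, -j, -1 or j for ctrl = 0, 1, 2 or 3."""
    return ((re, im), (im, -re), (-re, -im), (-im, re))[ctrl]


def dsprd4_lane(vq_chips, vp_chips, ctrl_field: int) -> tuple[int, int]:
    """Transform and sum the four chips of one lane.

    vq_chips, vp_chips: two (real, imag) pairs each, least significant first.
    ctrl_field: 8-bit control field; its four 2-bit values, least significant
    first, control the two Vq chips and then the two Vp chips.
    """
    re_sum = 0
    im_sum = 0
    for k, (re, im) in enumerate(list(vq_chips) + list(vp_chips)):
        ctrl = (ctrl_field >> (2 * k)) & 0x3
        t_re, t_im = transform_t(re, im, ctrl)
        re_sum += t_re
        im_sum += t_im
    return re_sum, im_sum
```

For example, dsprd4_lane([(1, 2), (3, 4)], [(5, 6), (7, 8)], 0b11100100) applies control values 0, 1, 2 and 3 to the four chips in turn. The sum of four 8-bit values always fits within 16 bits, consistent with the 16-bit complex result per lane.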

Table 4 - The Control values and their corresponding transformations T.

Table 4 shows exemplary control values 0 to 3 which may be derived from the pseudo- noise or scrambling sequence and which may be applied to the control inputs at the different lanes.
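The statement that each transformation reduces to a swap and a negation can be checked with a short Python sketch. The table of operations below is derived from the multipliers 1, -j, -1 and j given in the description; the names OPS and transform_t are hypothetical, and the check against Python's built-in complex multiplication is for illustration only.

```python
# control value -> (swap real/imag, negate real, negate imag), applied in that order
OPS = {
    0: (False, False, False),   # multiply by 1
    1: (True, False, True),     # multiply by -j: (a, b) -> (b, -a)
    2: (False, True, True),     # multiply by -1: (a, b) -> (-a, -b)
    3: (True, True, False),     # multiply by j:  (a, b) -> (-b, a)
}


def transform_t(re: int, im: int, ctrl: int) -> tuple[int, int]:
    """Apply transformation T using only swap and negate operations."""
    swap, neg_re, neg_im = OPS[ctrl]
    if swap:
        re, im = im, re
    if neg_re:
        re = -re
    if neg_im:
        im = -im
    return re, im


def matches_complex_multiply(re: int, im: int) -> bool:
    """Compare transform_t with a true complex multiplication for all controls."""
    for ctrl, factor in enumerate((1, -1j, -1, 1j)):
        z = complex(re, im) * factor
        if transform_t(re, im, ctrl) != (z.real, z.imag):
            return False
    return True
```

No multiplier hardware is needed for these four cases, which is consistent with the negligible additional hardware mentioned above.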

The present embodiments may provide improved performance for various de-spreaders in digital processing techniques and standards.

Specifically, additional hardware may be provided as an accelerator on the digital signal processor to accelerate the De-spreader.

The additional instructions according to the present embodiments may accelerate de-spreading by approximately 2x per SIMD lane.

It is expected that during the life of a patent maturing from this application many relevant de-spreading and extraction techniques will be developed and the scopes of the corresponding terms are intended to include all such new technologies a priori.

The terms "comprises", "comprising", "includes", "including", “having” and their conjugates mean "including but not limited to".

The term "consisting of" means "including and limited to". The term "consisting essentially of" means that the composition, method or structure may include additional ingredients, steps and/or parts, but only if the additional ingredients, steps and/or parts do not materially alter the basic and novel characteristics of the claimed composition, method or structure.

As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment, and the text is to be construed as if such a single embodiment is explicitly written out in detail. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention, and the text is to be construed as if such separate embodiments or subcombinations are explicitly set forth herein in detail.

Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.

All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

WHAT IS CLAIMED IS:

1. Apparatus for signal extraction and de-spreading of complex signal elements comprising:

a vector processing unit arranged into lanes;

a despreader configured to carry out a despread instruction over said lanes, the despread instruction defining obtaining successive complex elements, the despread instruction further defining a transform of respective complex elements in each lane by multiplication in a sequence of iterations according to a transformation defined by control inputs, the de-spreader configured to sum the transformed elements following the iterations to form real and imaginary results.

2. Apparatus according to claim 1, wherein the de-spreader is configured to carry out said despread instruction on 64 eight-bit complex elements per iteration from first and second source registers into a result register.

3. Apparatus according to claim 2, wherein said 64 eight-bit complex elements are processed in 16 thirty-two-bit lanes, wherein each lane is configured to take two 8-bit element pairs from said first source register and two 8-bit element pairs from said second source register.

4. Apparatus according to claim 3, wherein each lane is configured to carry out one of a set of four complex multiplications according to said control inputs, said control inputs being applied with said defined instruction.

5. Apparatus according to any one of the preceding claims, each lane having a connection to receive said control input for said transformation instruction.

6. Apparatus according to any one of the preceding claims, wherein said control input is derived from a scrambling sequence.

7. Apparatus according to claim 6, wherein said scrambling sequence comprises a pseudo-noise sequence.

8. Apparatus according to any one of the preceding claims, wherein each lane is configured to accommodate a real part and an imaginary part of a current one of said complex elements, such that said real parts at each lane respectively are connected to a real accumulator and imaginary parts at each lane respectively are connected to an imaginary accumulator, therefrom to provide separate real and imaginary accumulated results.

9. Apparatus according to any one of the preceding claims, further comprising an extractor, the extractor configured to carry out extraction and interleaving.

10. Apparatus according to claim 9, configured to extract a consecutive four-bit value from a register and to interleave with a further consecutive four-bit value to form an eight-bit value.

11. Apparatus according to claim 10, further configured to form 32-bit values from the eight-bit values by zero extension.

12. Apparatus according to claim 11, configured to provide sixteen of said thirty-two-bit entries from an initial set of 32 of said four-bit values.

13. Method for signal extraction and de-spreading of complex signal elements comprising:

providing lanes;

assigning complex signal elements to respective lanes; and

applying a despread instruction over said lanes, the despread instruction comprising:

obtaining successive complex elements;

defining a transform of respective complex elements in each lane by multiplication in a sequence of iterations according to a transformation defined by control inputs; and

summing the transformed elements following the iterations to form a result in real and imaginary parts.

14. The method of claim 13, comprising applying values to said control inputs from a pseudo-noise sequence.