CN116775123A - Processor for channel estimation

Processor for channel estimation

Info

Publication number
CN116775123A
Authority
CN
China
Prior art keywords
processor
instruction
vector
channel estimation
logic unit
Prior art date
Legal status
Pending
Application number
CN202210219640.6A
Other languages
Chinese (zh)
Inventor
冯笑天
黄俊桦
葛文洁
沈西哲
Current Assignee
Chenxin Technology Co ltd
Original Assignee
Chenxin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chenxin Technology Co ltd filed Critical Chenxin Technology Co ltd
Priority to CN202210219640.6A
Publication of CN116775123A
Legal status: Pending


Abstract

The invention discloses a processor for channel estimation, comprising: a hardware message management module for processing serial instructions in a channel-estimation-specific instruction set and configuring the execution parameters of parallel instructions in that instruction set; a cache management subsystem for moving vector data according to the execution parameters; an arithmetic logic unit for performing vector operations on the vector data according to the execution parameters and returning the operation results to the cache management subsystem; and an arbiter for accessing the corresponding cached data according to the vector data moved by the target channel of the cache management subsystem. This processor for channel estimation is compatible with a variety of channel estimation algorithms, provides customized operations and processing, and has a simpler instruction set, and therefore lower power consumption and complexity.

Description

Processor for channel estimation
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a processor for channel estimation.
Background
Current implementations of the vector operations used in channel estimation are mainly based on either digital signal processor (Digital Signal Processor, DSP) schemes or custom application-specific integrated circuit (Application Specific Integrated Circuit, ASIC) schemes.
A DSP implements its computation in software, which makes it highly flexible, but also power-hungry and complex. At the same time, a DSP is general-purpose enough to support most algorithms and offers few instructions customized specifically for channel estimation, so its channel estimation performance is low. The ASIC scheme customizes a circuit to a particular channel estimation algorithm; its performance can be very high, but it lacks flexibility and cannot accommodate a variety of different channel estimation algorithms.
Disclosure of Invention
The invention provides a processor for channel estimation with an instruction set and a hardware structure customized specifically for channel estimation; it is compatible with different channel estimation algorithms and is flexibly configurable.
The embodiment of the invention provides a processor for channel estimation, comprising:
a hardware message management module (Hardware Message Management, HMM) for processing serial instructions in a channel estimation specific instruction set and configuring execution parameters of parallel instructions in the channel estimation specific instruction set;
a cache management subsystem (Memory Management Subsystem, MMS) for moving vector data in accordance with the execution parameters;
an arithmetic logic unit (Arithmetic and Logic Unit, ALU) for performing a vector operation on the vector data according to the execution parameters and for returning the operation result to the cache management subsystem;
an arbiter (ARBITER) for accessing the corresponding cached data according to the vector data moved by the target channel of the cache management subsystem.
Optionally, the hardware message management module HMM comprises:
an instruction cache module (Instruction Random Access Memory, IRAM) for storing the serial instruction and the parallel instruction;
the vector parameter configuration module is used for configuring the execution parameters of the parallel instruction and respectively transmitting the execution parameters to the cache management subsystem and the arithmetic logic unit;
and the instruction processing module is used for processing the serial instructions.
Optionally, while the cache management subsystem and the arithmetic logic unit execute a first parallel instruction, the hardware message management module parses and executes serial instructions and parses a second parallel instruction;
after the hardware message management module has parsed the second parallel instruction, if the cache management subsystem and the arithmetic logic unit have finished executing the first parallel instruction, the hardware message management module configures the execution parameters of the second parallel instruction; otherwise, it waits for the first parallel instruction to finish executing and then configures the execution parameters of the second parallel instruction.
Optionally, the vector parameter configuration module includes a first vector parameter register set and a second vector parameter register set;
the value of the first vector parameter register set is updated in real time according to the currently parsed channel-estimation-specific instruction;
the values of the second vector parameter register set are synchronized to the values of the first vector parameter register set at the time of configuration of the execution parameters.
Optionally, the MMS is specifically configured to:
interleaving the vector data according to the operation mode of the arithmetic logic unit and transmitting the vector data to the arithmetic logic unit;
and interleaving the operation result according to the operation mode of the arithmetic logic unit ALU and transmitting the operation result to the arbiter.
Optionally, the MMS is further configured to:
and extracting vector data according to the starting address, the length, the jump step length and the repetition number.
Optionally, the MMS is specifically configured to:
in each extraction pass, determining the start address of the current pass, the start address of the next pass, and the current pass count;
if, from the current start address, the jump address given by the jump step does not exceed the range indicated by the length, extracting vector data from the jump address;
repeating the extraction pass until the pass count reaches the configured number of repetitions.
Optionally, the ALU includes an adder, a multiplier, and a selector, the adder, the multiplier, and the selector being configured to provide a plurality of operational modes.
Optionally, the ALU includes an operation mode for processing butterfly operation and complex multiplication together.
Optionally, the arithmetic logic unit further comprises a flip-flop for storing the Real Part (RE) and the Imaginary Part (IM) of the butterfly factor.
The embodiment of the invention provides a processor for channel estimation, comprising: a hardware message management module for processing serial instructions in the channel-estimation-specific instruction set and configuring the execution parameters of parallel instructions in that instruction set; a cache management subsystem for moving vector data according to the execution parameters; an arithmetic logic unit for performing vector operations on the vector data according to the execution parameters and returning the operation results to the cache management subsystem; and an arbiter for accessing the corresponding cached data according to the vector data moved by the target channel of the cache management subsystem. Because this processor is customized for channel estimation algorithms, its performance is markedly better than that of the DSP scheme; at the same time, it provides many customized operations and processes and has a simpler instruction set, so its power consumption and complexity are lower than a DSP's.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a processor for channel estimation according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a processor for channel estimation according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an execution sequence of serial and parallel instructions provided in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of the internal structure of an HMM provided according to an embodiment of the invention;
FIG. 5 is a schematic diagram of a prior art 32-point radix-4 butterfly operation according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a radix-4 butterfly operation with a mechanism for repeating skip and interleaving provided in accordance with an embodiment of the invention;
FIG. 7 is a flow chart of request address acquisition provided in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of a radix-4 butterfly mode of operation according to an embodiment of the invention;
FIG. 9 is a schematic diagram of the flow of input data inside an ALU in radix-4 butterfly mode according to an embodiment of the invention;
FIG. 10 is a schematic diagram of a data flow of a complex multiplication operation mode according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of an ALU according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of an ALU mode of a radix-4 butterfly and complex multiplication co-process specifically designed for FFT algorithms according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic diagram of a processor for channel estimation according to an embodiment of the present invention. The processor is suited to channel estimation, and the invention is an application-specific instruction-set processor (ASIP) implementation. As shown in fig. 1, the processor for channel estimation comprises:
a hardware message management module (HMM) 10 for processing serial instructions in the channel estimation dedicated instruction set and configuring execution parameters of parallel instructions in the channel estimation dedicated instruction set; a cache management subsystem (MMS) 11 for moving vector data in accordance with the execution parameters; an Arithmetic Logic Unit (ALU) 12, configured to perform vector operation on vector data according to the execution parameters, and return the operation result to the cache management subsystem (MMS) 11; an arbiter 13 for accessing the corresponding buffer data according to the vector data moved by the destination channel of the MMS 11.
Specifically, the architecture of the whole ASIP can be seen as a scalar processor combined with vector processing and operation units, and the instruction set of the processor for channel estimation can be divided into two types: serial instructions and parallel instructions. Serial instructions mainly complete operations such as register-set assignment, memory access and branching; this is the scalar processing. Parallel instructions mainly indicate which channels must move data for a vector operation and the mode of that vector operation; from this information and the channel parameters, the vector processing and operation units complete the reading of the source data stream, the vector operation, and the output of the result data stream.
The HMM 10 may be configured to process the serial instructions in the channel-estimation-specific instruction set. Specifically, the HMM 10 fetches and decodes all instructions and completes the execution and write-back of serial instructions; for a parallel instruction, after decoding it configures that instruction's execution parameters, such as which operation mode to use, the fetch-related parameters, the interleaving-related parameters, and the channels involved. The MMS 11 is mainly used to move vector data in memory, to interleave data according to the needs of the ALU 12, or to split single-beat data into several fields and transmit them to the ALU 12 beat by beat; it is likewise responsible for processing the output data of the ALU 12 and moving it back to memory. The ALU 12 mainly performs the configured mode of ALU operation on the input data, supporting vector-vector, vector-scalar and scalar-scalar operations; the specific calculation is determined by the ALU operation mode (also called the ALU mode), and the result of the operation is returned to the MMS 11. The arbiter 13 may select one channel of the MMS 11 as the target channel and access the corresponding cached data (MEMORY) according to the vector data moved by that channel.
The processor for channel estimation provided by the embodiment of the invention has a plurality of instructions special for channel estimation and a special hardware implementation structure, including data movement and connection inside the ALU 12, and can keep a plurality of general calculation modes, thereby improving the flexibility of algorithm modification and realizing channel estimation.
In one embodiment, the HMM10 comprises: IRAM for storing serial and parallel instructions; the vector parameter configuration module is used for configuring the execution parameters of the parallel instructions and transmitting the execution parameters to the MMS 11 and the ALU 12 respectively; the instruction processing module 22 is used for processing serial instructions.
Fig. 2 is a schematic diagram of a processor for channel estimation according to an embodiment of the present invention. As shown in fig. 2, the processor includes the HMM 10, which further comprises: an instruction cache module (IRAM) 20, a vector parameter configuration module 21 and an instruction processing module 22. The processor further comprises the MMS 11, which contains a plurality of channels (CH) 23; fig. 2 shows four channels, CH0, CH1, CH2 and CH3, each with a corresponding first-in first-out memory, denoted FIFO0, FIFO1, FIFO2 and FIFO3 respectively. The processor for channel estimation further comprises the ALU 12, the arbiter 13 and a plurality of MEMORY ports. When different channels act as the target channel, the processor accesses the cached data through different MEMORY ports.
Here, a CH 23 handles the processing and movement of read and write data, and MEMORY is the memory that caches and stores data. The arbiter 13 may select one of the read/write requests from the 4 CHs 23, respond to it, and access MEMORY on its behalf. A FIFO is a first-in first-out, i.e., pipelined, data storage structure: the data written first is read out first. The HMM 10 fetches and decodes all instructions and completes the execution and write-back of serial instructions; for parallel instructions, after decoding it configures the information relevant to the vector operation to the MMS 11 and the ALU 12 through the vector parameter configuration module 21.
In one embodiment, during execution of the first parallel instruction by the MMS 11 and ALU 12, the HMM10 parses and executes the serial instruction and parses the second parallel instruction; after the HMM10 parses the second parallel instruction, if the MMS 11 and the ALU 12 have completed the first parallel instruction, the HMM10 configures an execution parameter of the second parallel instruction, otherwise, waits for the execution of the first parallel instruction to complete and then configures an execution parameter of the second parallel instruction.
Specifically, in the related art, each vector operation must first execute serial instructions and only then configure the related parameters and execute the parallel instruction; that is, besides waiting for the current vector operation, the next vector operation must also wait for the serial instructions that produce its parameters, causing considerable idle time and low vector-operation efficiency. In the embodiment of the present invention, however, while the MMS 11 and the ALU 12 execute the first parallel instruction, the HMM 10 can parse and execute serial instructions and parse the second parallel instruction; that is, one instruction is parsed while another is being executed. Compared with the related art, this saves vector-operation time and improves operation efficiency.
Fig. 3 is a schematic diagram of an execution sequence of a serial instruction and a parallel instruction according to an embodiment of the present invention. As shown in FIG. 3, P1-Pn represent parallel instructions and S1-Sn represent serial instructions.
While the MMS 11 and the ALU 12 execute P1, the HMM 10 parses and executes S2. After the HMM 10 has parsed P2, if P1 has finished executing, the HMM 10 configures the execution parameters of P2; otherwise it waits for P1 to finish and then configures them, and so on, executing the Pn and Sn in the order shown in fig. 3. The instruction execution of fig. 3 processes the Pn and Sn smoothly in parallel, saving vector-operation time and improving operation efficiency.
In one embodiment, vector parameter configuration module 21 includes a first vector parameter register set and a second vector parameter register set; the value of the first vector parameter register set is updated in real time according to the current resolved channel estimation special instruction; the values of the second set of vector parameter registers are synchronized to the values of the first set of vector parameter registers at the moment of configuration of the execution parameters.
Fig. 4 is a schematic diagram of the internal structure of the HMM according to an embodiment of the present invention. As shown in fig. 4, the vector parameter configuration module 21 contains two vector parameter register sets: vector parameter register set 1 and vector parameter register set 2. The invention executes serial and parallel instructions with the processing structure shown in fig. 4: vector parameter register set 1 updates its values in real time according to the currently parsed instruction, while vector parameter register set 2 is the interface between the HMM 10 and the vector processing and operation units and only needs to take a snapshot of register set 1 at the moment a vector operation is configured. This overcomes, to a certain extent, the problems in the related art and improves instruction execution efficiency.
With reference to fig. 3, the execution of serial and parallel instructions specifically proceeds as follows. When the parsing of P1 is complete and the vector processing and operation units are configured, vector parameter register set 2 synchronously takes the current values of register set 1 and, acting as the interface, assists in completing the vector operation. During the execution of P1, the HMM 10 continues to parse and execute S2 and to update the values of register set 1; since the values of register set 2 do not change during this time, the normal execution of P1 is unaffected. Meanwhile, parsing of P2 continues; once P2 has been parsed, the state of the vector processing and operation units is checked, and if P1 is finished, the parameters of the second vector operation are configured; otherwise, execution waits for P1 to complete.
It should be noted that, there may be a case where the execution of S2 needs to depend on the execution result of P1, and in this case, it is only possible to wait for the execution of P1 to be completed and then parse and execute S2, but this case may be avoided as much as possible in the instruction writing stage of software.
It should be noted that, the vector processing in fig. 4 refers to MMS 11, where CH0, CH1, CH2, and CH3 in MMS 11 need to receive parameters to correctly complete data movement; the arithmetic unit refers to an ALU that requires mode parameters to obtain the kind of operation that currently needs to be performed.
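The double-buffering behaviour of the two vector parameter register sets can be sketched as follows. This is a hypothetical Python model, not the patent's hardware; the class, method and parameter names are illustrative only:

```python
# Hypothetical model of the two vector parameter register sets of Fig. 4:
# set 1 is updated in real time by instruction parsing, while set 2
# snapshots set 1 only at the instant a parallel instruction is
# dispatched, so a running vector operation is never disturbed.
class VectorParamConfig:
    def __init__(self):
        self.set1 = {}  # updated by every newly parsed instruction
        self.set2 = {}  # interface to the MMS/ALU; stable during execution

    def parse(self, name, value):
        # Parsing an instruction updates register set 1 immediately.
        self.set1[name] = value

    def dispatch(self):
        # At vector-operation configuration time, set 2 copies set 1.
        self.set2 = dict(self.set1)
        return self.set2


cfg = VectorParamConfig()
cfg.parse("alu_mode", "radix4_butterfly")
p1_params = cfg.dispatch()            # P1 is configured and starts running
cfg.parse("alu_mode", "complex_mul")  # parsing for P2 touches only set 1
print(p1_params["alu_mode"])          # P1 still sees "radix4_butterfly"
```

Because `dispatch` copies the dictionary, later updates to set 1 cannot leak into the parameters already handed to the vector units, mirroring the text above.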
In one embodiment, the MMS 11 is specifically configured to: interleaving the vector data according to the operation mode of the ALU 12 and transmitting the vector data to the ALU 12; the operation result is interleaved according to the operation mode of the ALU 12 and transmitted to the arbiter 13.
MMS 11 is used primarily for the handling and movement of data from memory to ALU 12 and from ALU 12 back to memory. As ASIP dedicated to channel estimation, a function of supporting fast fourier transform (Fast Fourier Transformation, FFT)/inverse fast fourier transform (Inverse Fast Fourier Transform, IFFT) is indispensable.
Thus, in addition to basic continuous reading by address and length, the MMS 11 also implements some dedicated addressing modes required by channel estimation algorithms. Fig. 5 is a schematic diagram of a prior-art 32-point radix-4 butterfly operation according to an embodiment of the present invention; as shown in fig. 5, assume that one cycle of an MMS 11 channel can read 4 complex points. The access pattern of fig. 5 cannot use the continuous address mode for the ALU 12; it must use a step of 8 complex points with an effective length of 1 complex point, so 4 cycles are needed each time to assemble the inputs of a single radix-4 butterfly, which is inefficient.
Therefore, the ASIP designs a repeated-jump access and interleaving mechanism for the data movement and processing of the radix-4 butterfly, improving its data processing efficiency. Fig. 6 is a schematic diagram of a radix-4 butterfly operation with the repeated-jump and interleaving mechanism provided in an embodiment of the present invention. As shown in fig. 6, the vector data is first fetched with repeated jumps according to the operation mode of the ALU 12, then interleaved and transmitted to the ALU 12; the operation result is likewise interleaved according to the operation mode of the ALU 12 and transmitted to the arbiter.
In an embodiment, the MMS 11 is further used to: and extracting vector data according to the starting address, the length, the jump step length and the repetition number.
Here, the start address and the jump step jointly determine the jump address, and the length determines the range of each repeated-jump round: if the jump address for extracting vector data lies within the range of the current round, vector data can be extracted from that jump address; otherwise the next round begins.
In one embodiment, the MMS 11 is specifically configured to:
in each extraction process, determining the starting address of the current extraction, the starting address of the next extraction and the number of times of the current extraction; if the jump address corresponding to the jump step length does not exceed the range indicated by the length based on the currently extracted starting address, extracting vector data from the jump address; repeating the extraction process until the current extraction times reach the set times.
Specifically, to support the repeated-jump access function, a jump step (Jump_step) and a repetition count (Rep_cnt) are introduced in addition to the start address (Source_addr) and the length (Data_length).
Fig. 7 is a flowchart of obtaining a request address according to an embodiment of the present invention, and as shown in fig. 7, specific steps include:
S10、Req_addr=Source_addr;Rep_source_addr=Source_addr+C_len;Rep_cnt_c=0。
Specifically, the address of the first request (Req_addr) is the start address (Source_addr), i.e., Req_addr = Source_addr; Rep_source_addr, computed from Source_addr, is the initial address of the next round of fetches; C_len is the length of 4 complex points; and Rep_cnt_c is the counter of the current round.
S11, judging whether Req_addr+Jump_step exceeds the range of Source_addr+Data_length; if so, executing S12; if not, executing S13.
S12, judging whether the Rep_cnt_c is smaller than the value of the Rep_cnt, if so, executing S14; if not, the process is ended.
Specifically, it is determined whether the current number of rounds (rep_cnt_c) is smaller than the repetition number (rep_cnt).
S13、Req_addr=Req_addr+Jump_step。
S14、Req_addr=Rep_source_addr;
Rep_source_addr=Rep_source_addr+C_len;Rep_cnt_c=Rep_cnt_c+1。
Specifically: judge whether the request address plus the jump step exceeds the range given by the start address and the length; if it does not, continue jumping; if it does, judge whether another round remains; if so, begin the next round of jumps, and if not, the current move task is finished.
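The S10-S14 flow above can be sketched in Python. This is a hypothetical model of the address generator, not the patent's RTL; the function name and the trial parameter values are illustrative:

```python
def request_addresses(source_addr, data_length, jump_step, rep_cnt, c_len):
    """Generate the request addresses of the repeated-jump access flow
    (S10-S14 of Fig. 7); c_len is the size of one 4-complex-point read."""
    addrs = []
    req_addr = source_addr                    # S10
    rep_source_addr = source_addr + c_len
    rep_cnt_c = 0
    while True:
        addrs.append(req_addr)
        if req_addr + jump_step < source_addr + data_length:  # S11: in range
            req_addr += jump_step                             # S13: keep jumping
        elif rep_cnt_c < rep_cnt:                             # S12: rounds left?
            req_addr = rep_source_addr                        # S14: next round
            rep_source_addr += c_len
            rep_cnt_c += 1
        else:
            return addrs                                      # task finished

# A 32-point buffer read as two rounds of 4-complex-point fetches with an
# 8-point stride, grouping butterfly inputs as in Fig. 6:
print(request_addresses(0, 32, 8, 1, 4))  # [0, 8, 16, 24, 4, 12, 20, 28]
```

Each round strides through the buffer at the jump step; when a jump would leave the range, the generator restarts one `c_len` further in, so consecutive reads deliver complete butterfly input groups instead of single complex points.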
In one embodiment, the ALU 12 includes adders, multipliers and selectors for providing a plurality of operational modes.
Specifically, the ALU 12 mainly selects the path of the input data through the module according to the currently configured operation mode, thereby completing the corresponding operation. For example, assume the ALU 12 internally has 1 adder and 1 multiplier, and that there are 3 input data a, b and c. If a and b are first fed through a selector into the adder, and the adder's output is then selected together with c as the inputs of the multiplier, (a+b)*c is computed; if instead b and c are fed into the multiplier and its output is then fed with a into the adder, a+b*c is computed. By selecting the paths of the input data between the computing units inside the ALU 12 with selectors, a variety of operations can be implemented.
Fig. 8 is a schematic diagram of a radix-4 butterfly operation mode according to an embodiment of the present invention. As shown in fig. 8, A0 to A3 represent the inputs of the radix-4 butterfly formula, C0 to C3 its outputs, and T0 to T3 its intermediate results; specifically, C0 to C3 are calculated from A0 to A3 by the following formulas.
T0=A0+A2
T1=A1+A3
T2=A0-A2
T3=(A1-A3)*(-j)=(Im[A1]-Im[A3])+j*(Re[A3]-Re[A1])
C0=T0+T1
C1=T0-T1
C2=T2+T3
C3=T2-T3
According to the above calculation formulas, the flow of the input data inside the ALU 12 in the radix-4 butterfly operation mode can be drawn.
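The eight formulas above can be checked with a small Python sketch (illustrative only, using Python's built-in complex type; the output ordering follows the patent's C0-C3 labels):

```python
def radix4_butterfly(a0, a1, a2, a3):
    """Radix-4 butterfly per the formulas above; (A1-A3)*(-j) is expanded
    into real and imaginary parts exactly as in the T3 formula."""
    t0 = a0 + a2
    t1 = a1 + a3
    t2 = a0 - a2
    d = a1 - a3
    t3 = complex(d.imag, -d.real)   # (A1-A3)*(-j)
    return (t0 + t1, t0 - t1, t2 + t3, t2 - t3)

# An impulse input produces four equal outputs, as for any 4-point DFT:
print(radix4_butterfly(1 + 0j, 0j, 0j, 0j))
```

Note that C1 and C2 correspond to the digit-reversed 4-point DFT bins 2 and 1; this reversed ordering is standard for radix-4 decimation stages and matches the labeling of the formulas above.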
Fig. 9 is a schematic diagram of the flow of input data inside the ALU 12 in radix-4 butterfly mode according to an embodiment of the present invention. As shown in fig. 9, a complex number is generally expressed as a+b*i; Re denotes the real part, i.e. a, and Im denotes the imaginary part, i.e. b. The radix-4 butterfly operation mode requires 16 adders (add_0 to add_15). Besides the necessary clock and reset signals, each adder module has 3 input interfaces: two data interfaces and one parameter-configuration interface; depending on this parameter, the adder performs either addition (parameter value 0) or subtraction (parameter value 1) of its two input data.
In one embodiment, the operation mode includes an operation mode in which butterfly operation and complex multiplication are processed together.
Fig. 10 is a schematic diagram of the data flow of the complex multiplication operation mode provided in an embodiment of the present invention. In this mode the operation is complex multiplication, that is, Ai and Bi are both complex numbers; this is a general-purpose mode. According to the complex representation:
Ai=Re[Ai]+Im[Ai]*i
Bi=Re[Bi]+Im[Bi]*i
Ci=Re[Ci]+Im[Ci]*i
Ai and Bi are the inputs of the complex multiplication and Ci is the output result, so
Ci=Ai*Bi
=(Re[Ai]+Im[Ai]*i)*(Re[Bi]+Im[Bi]*i)
=re [ Ai ] Re [ Bi ] -Im [ Ai ] Im [ Bi ] + (Re [ Ai ] Re [ Bi ] +im [ Ai ] Im [ Bi ]) i then compares the manifestations of complex Ci, i.e.
Re[Ci]=Re[Ai]Re[Bi]-Im[Ai]Im[Bi]
Im[Ci]=Re[Ai]Im[Bi]+Im[Ai]Re[Bi]
Since the ASIP has two input channels and each channel can input 4 complex numbers, 4 complex multiplications can be performed at once in the complex multiplication ALU mode, with i taking the values 0, 1, 2 and 3. This mode requires 16 multipliers and 8 adders.
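The decomposition into real-valued operations can be sketched as follows (illustrative Python, not the patent's hardware). Each complex product costs 4 real multiplications and 2 real additions/subtractions, so the 4 parallel products of this mode account for the 16 multipliers and 8 adders:

```python
def complex_mult(re_a, im_a, re_b, im_b):
    """One complex product from real/imaginary parts: 4 multiplies, 2 adds."""
    re_c = re_a * re_b - im_a * im_b
    im_c = re_a * im_b + im_a * re_b
    return re_c, im_c
```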
The ALU modes of fig. 9 and fig. 10 both use the adders numbered 0-7, but the data sources of these adders differ between the two modes. The ALU 12 can be reconfigured between the two modes mainly by means of the structure shown in fig. 11. Fig. 11 is a schematic structural diagram of an ALU according to an embodiment of the present invention, taking the adder numbered 5 (add_5) as an example: the parameter interface and the two data interfaces of the adder are each connected to a mux, and each mux gates the corresponding signal onto the adder input according to a selection signal generated from the ALU mode. For example, when the mode is complex multiplication, the two data-input muxes gate the output signals of the multipliers numbered 10 and 11, respectively, and the parameter mux gates a constant 0, indicating that the adder currently performs an addition.
Through these muxes, the computation units (adders, multipliers, etc.) inside the ALU can be reused as much as possible across the different ALU modes, saving area.
Meanwhile, since the ASIP is dedicated to channel estimation algorithms, dedicated hardware connection paths for channel estimation can be designed inside the ALU. Taking the 4096-point FFT algorithm as an example: if each stage is completed as one radix-4 butterfly pass followed by one complex multiplication pass, then 6 radix-4 butterfly passes and 5 complex multiplication passes are required to complete the whole algorithm. Assuming that each pass requires n cycles from read through calculation to write-back, the FFT algorithm requires 11n cycles to complete.
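The pass count in this example follows from 4096 = 4^6 (a quick arithmetic check, not part of the patent text; it assumes, as the text implies, that the final stage needs no twiddle pass):

```python
import math

N = 4096
butterfly_passes = round(math.log(N, 4))  # 4**6 == 4096 -> 6 radix-4 stages
twiddle_passes = butterfly_passes - 1     # one complex-multiply pass between stages -> 5
total_passes = butterfly_passes + twiddle_passes
# With n cycles per pass (read, calculate, write back), the split
# implementation needs total_passes * n = 11n cycles.
```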
In an embodiment, the arithmetic logic unit further comprises a flip-flop for storing the real and imaginary parts of the butterfly factor.
Fig. 12 is a schematic diagram of an ALU mode, specially designed for the FFT algorithm, in which the radix-4 butterfly and the complex multiplication are processed jointly. As shown in fig. 12, the complex multiplication part contains 4 complex multiplication structures (i.e., the structure of fig. 10), and therefore 16 multipliers and 8 adders. In this mode the radix-4 butterfly, the complex multiplication and the CDFF shown in fig. 12 all operate simultaneously as a joint processing pipeline.
In fig. 12, A0 to A3 are the input data of each butterfly stage, and B0 to B3 are the corresponding butterfly factors. In order for the butterfly factors to enter the complex multiplication stage synchronously with the outputs of the radix-4 butterfly, two stages of D flip-flops must be added on the butterfly-factor path. Since B0 to B3 are complex numbers, Complex D Flip-Flops (CDFF) are used, meaning that both the real part and the imaginary part must be registered. The newly added dedicated mode therefore requires 16 multipliers, 24 adders and 16 D flip-flops.
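The two-stage delay can be illustrated with a small behavioral model (hypothetical Python, not RTL): a value presented at the input of the chain emerges at its output two ticks later, which is what aligns B0 to B3 with the butterfly outputs.

```python
class CDFFChain:
    """Two cascaded complex registers for one butterfly factor (assumed model)."""

    def __init__(self):
        self.stage1 = 0 + 0j  # first D flip-flop pair (real + imaginary)
        self.stage2 = 0 + 0j  # second D flip-flop pair

    def tick(self, b_in):
        """One clock edge: shift the chain and return the delayed value."""
        out = self.stage2
        self.stage2 = self.stage1
        self.stage1 = b_in
        return out
```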
Although this dedicated mode uses 8 more adders and 16 more D flip-flops than the separate processing described above, it reduces the time to execute the 4096-point FFT algorithm to 6n cycles, improving performance by nearly a factor of 2. Meanwhile, the added adders and D flip-flops can also be used to customize dedicated ALU modes for other channel estimation algorithms, so that resources are essentially not wasted.
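The claimed gain is simple to check (illustrative arithmetic, not from the patent text):

```python
split_passes = 6 + 5  # separate butterfly and complex-multiply passes: 11n cycles
fused_passes = 6      # fused butterfly + multiply passes: 6n cycles
speedup = split_passes / fused_passes
# speedup == 11/6, roughly 1.83x -- "approximately 2 times"
```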
Accordingly, the ASIP has instructions dedicated to channel estimation and dedicated hardware structures, including data-movement paths and connections inside the ALU, while retaining some general-purpose computing modes to provide flexibility for algorithm modification.
The foregoing description covers only exemplary embodiments of the invention and is not intended to limit the scope of the invention.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
Embodiments of the invention may be implemented by a data processor of a mobile device executing computer program instructions, e.g. in a processor entity, either in hardware, or in a combination of software and hardware. The computer program instructions may be assembly instructions, instruction set architecture (Instruction Set Architecture, ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages.
The block diagrams of any logic flows in the figures of this invention may represent program steps, or interconnected logic circuits, modules and functions, or a combination of program steps and logic circuits, modules and functions. The computer program may be stored on a memory. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as, but not limited to, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), and optical storage devices and systems (digital versatile discs (Digital Video Disc, DVD), compact discs (Compact Disc, CD), etc.). The computer readable medium may comprise a non-transitory storage medium. The data processor may be of any type suitable to the local technical environment, such as, but not limited to, a general purpose computer, a special purpose computer, a microprocessor, a digital signal processor (Digital Signal Processing, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic device (Field-Programmable Gate Array, FPGA), or a processor based on a multi-core processor architecture.
The foregoing detailed description of exemplary embodiments of the invention has been provided by way of exemplary and non-limiting examples. Various modifications and adaptations to the above embodiments may become apparent to those skilled in the art without departing from the scope of the invention, which is defined in the accompanying drawings and claims. Accordingly, the proper scope of the invention is to be determined according to the claims.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A processor for channel estimation, comprising:
the hardware message management module is used for processing serial instructions in the channel estimation special instruction set and configuring execution parameters of parallel instructions in the channel estimation special instruction set;
the cache management subsystem is used for moving vector data according to the execution parameters;
the arithmetic logic unit is used for carrying out vector operation on the vector data according to the execution parameters and transmitting the operation result back to the cache management subsystem;
and the arbiter is used for accessing corresponding cache data according to the vector data moved by the target channel of the cache management subsystem.
2. The processor of claim 1, wherein the hardware message management module comprises:
the instruction cache module is used for storing the serial instructions and the parallel instructions;
the vector parameter configuration module is used for configuring the execution parameters of the parallel instruction and respectively transmitting the execution parameters to the cache management subsystem and the arithmetic logic unit;
and the instruction processing module is used for processing the serial instructions.
3. The processor of claim 1, wherein:
during the execution of a first parallel instruction by the cache management subsystem and the arithmetic logic unit, the hardware message management module parses and executes a serial instruction and parses a second parallel instruction;
after the hardware message management module analyzes the second parallel instruction, if the cache management subsystem and the arithmetic logic unit have completed the first parallel instruction, the hardware message management module configures the execution parameters of the second parallel instruction, otherwise, the hardware message management module waits for the execution of the first parallel instruction to complete and configures the execution parameters of the second parallel instruction.
4. The processor of claim 2, wherein the vector parameter configuration module comprises a first vector parameter register set and a second vector parameter register set;
the value of the first vector parameter register set is updated in real time according to the current analyzed channel estimation special instruction;
the values of the second vector parameter register set are synchronized to the values of the first vector parameter register set at the time of configuration of the execution parameters.
5. The processor of claim 1, wherein the cache management subsystem is specifically configured to:
interleaving the vector data according to the operation mode of the arithmetic logic unit and transmitting the vector data to the arithmetic logic unit;
and interleaving the operation result according to the operation mode of the arithmetic logic unit and transmitting the operation result to the arbiter.
6. The processor of claim 1, wherein the cache management subsystem is further to:
and extracting vector data according to the starting address, the length, the jump step length and the repetition number.
7. The processor of claim 6, wherein the cache management subsystem is specifically configured to:
in each extraction process, determining the starting address of the current extraction, the starting address of the next extraction and the number of times of the current extraction;
if the jump address corresponding to the jump step length does not exceed the range indicated by the length based on the currently extracted starting address, extracting vector data from the jump address;
repeating the extraction process until the current extraction times reach the set times.
8. The processor of claim 1, wherein the arithmetic logic unit comprises an adder, a multiplier, and a selector, the adder, multiplier, and selector to provide a plurality of modes of operation.
9. The processor of claim 8, wherein the operation mode comprises an operation mode in which butterfly operations and complex multiplication are jointly processed.
10. The processor of claim 9, wherein the arithmetic logic unit further comprises a flip-flop for storing the real and imaginary parts of the butterfly factor.
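As an illustration of the extraction pattern in claims 6 and 7, the following is a hypothetical software model; the claim text leaves the exact addressing semantics open, so the window handling and the choice of the next start address below are assumptions, not the patent's definition:

```python
def extract_vector(memory, start, length, step, repeats):
    """Model extraction by start address, length, jump step, and repetition count."""
    out = []
    base = start
    for _ in range(repeats):
        addr = base
        # Jump by `step` while the jump target stays inside the window of
        # `length` elements that begins at the current start address.
        while addr < base + length:
            out.append(memory[addr])
            addr += step
        base += length  # assumed: the next extraction starts after the window
    return out
```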
CN202210219640.6A 2022-03-08 2022-03-08 Processor for channel estimation Pending CN116775123A (en)

Publications (1)

Publication Number Publication Date
CN116775123A true CN116775123A (en) 2023-09-19

Family

ID=88008577



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination