Background
Mobile communication currently forms the main body of modern wireless communication. It operates in a complex and diverse mobile environment, so the effects of severe time variation and multipath propagation must be taken into account (see International Telecommunication Union (ITU) Recommendation M.1225). In modern wireless communication systems, particularly Code Division Multiple Access (CDMA) systems, it is generally desirable to use smart antennas together with joint detection techniques in order to increase system capacity, improve system sensitivity, and achieve greater communication distances at lower transmit power.
Many published technical documents address beamforming algorithms for smart antennas, and their results show that the more capable the algorithm, the more complex it is. In a mobile communication environment, however, joint detection and beamforming must be completed in real time, and the time available for completing the algorithm is measured in microseconds. Given the current state of microelectronics, Digital Signal Processors (DSPs) and application-specific integrated circuits (ASICs) cannot perform overly complex real-time processing within such short periods.
On the other hand, mobile communication technologies and standards are continually being proposed and updated, and software radio technology is attracting increasing attention. How to support the air interfaces of different systems on a common hardware platform using programmable devices such as a Digital Signal Processor (DSP) or a field-programmable gate array (FPGA) has become a main research topic for communication companies around the world. Software radio can be used not only in user terminals, to solve the problem of multimode handsets, but also in wireless base stations. Particularly as third generation mobile communication technologies and standards continue to be updated, only software radio technology allows products to keep up with technological development.
Research on implementing software radio has shown that programmable logic devices perform better than the currently widely used DSPs, especially for highly parallel operations. Their clear advantage is that they can not only raise the operation speed but also, through an effective and flexible design method, improve the overall working efficiency of the system hardware; that is, all logic resources in the system are kept in an effective working state as far as possible, and the power consumption of the system is reduced. Some current dedicated chips, and even DSPs, cannot match this. However, for highly iterative operations, it is generally considered difficult to achieve a good cost-performance ratio in an FPGA. Take the solution of an equation as an example: given the vector e and the matrix A, the vector d is to be solved from equation (1), where A is an m × m non-negative definite Hermitian matrix.
e= A· d (1)
Solving this by the conventional method generally requires the following three iterative operations:
Step one: the matrix A is decomposed as shown in equation (2):
$$A = L^{*T} L \qquad (2)$$
where L is a lower triangular matrix and $L^{*T}$ is the conjugate transpose of L.
Step two: the iterative operation shown in equation (3) is completed:
$$L^{*T} y = e \qquad (3)$$
where y is an intermediate variable to be solved.
Step three: the iterative operation shown in equation (4) is completed:
$$L d = y \qquad (4)$$
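The three steps can be sketched in Python as follows. This is a minimal illustration rather than the conventional implementation itself. It assumes that NumPy is available and that A is Hermitian positive definite. It uses NumPy's Cholesky routine, which returns a lower triangular C with A = C·C^H, so the order of the triangular factors differs from equation (2). The function names are hypothetical.

```python
import numpy as np

def forward_substitution(C, e):
    """Solve C y = e for lower triangular C; y[i] depends on all earlier y[k]."""
    m = len(e)
    y = np.zeros(m, dtype=complex)
    for i in range(m):
        y[i] = (e[i] - np.dot(C[i, :i], y[:i])) / C[i, i]
    return y

def back_substitution(U, y):
    """Solve U d = y for upper triangular U; d[i] depends on all later d[k]."""
    m = len(y)
    d = np.zeros(m, dtype=complex)
    for i in range(m - 1, -1, -1):
        d[i] = (y[i] - np.dot(U[i, i + 1:], d[i + 1:])) / U[i, i]
    return d

def conventional_solve(A, e):
    """Three iterative steps: decomposition, then two triangular solves."""
    C = np.linalg.cholesky(A)                # step one: A = C C^H (assumed factor order)
    y = forward_substitution(C, e)           # step two: C y = e
    return back_substitution(C.conj().T, y)  # step three: C^H d = y

# Small hypothetical test system (Hermitian positive definite).
A = np.array([[4.0, 1.0 + 1.0j], [1.0 - 1.0j, 3.0]])
e = np.array([1.0, 2.0], dtype=complex)
d = conventional_solve(A, e)
print(np.allclose(A @ d, e))
```

Each of the three steps contains a loop whose iteration i needs the results of earlier iterations, which is why all three are iterative in the sense used here.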
In the above solving process, when the matrix A is large, several processors are often required to work in parallel on these steps in order to increase the processing speed. When N processors carry out step one in parallel, the result shown in fig. 2 is obtained: during the initial operation time T1, all of the processors Processor 1, Processor 2, …, Processor N are active; during the period T1 < T < Tn, Processor 1 is idle; during the period T2 < T < Tn, Processor 2 is also idle; and so on, until step one is completed at time Tn. Thus Processor 1 has a life cycle of T1, Processor 2 a life cycle of T2, Processor 3 a life cycle of T3, Processor 4 a life cycle of T4, …, Processor N-1 a life cycle of T(N-1), and Processor N a life cycle of Tn. When an iterative operation is performed on multiple processors and the processors show the activity pattern of fig. 2, the operation is said to have staircase operation characteristics. Taking Tn as the computation time of the whole iterative operation, the idle time of Processor 1 is Tn - T1, that of Processor 2 is Tn - T2, that of Processor 3 is Tn - T3, that of Processor 4 is Tn - T4, …, and that of Processor N-1 is Tn - T(N-1). For the iterative operation unit as a whole, the wasted hardware resources therefore amount to (N - 1) × Tn - (T1 + T2 + T3 + T4 + … + T(N-1)). In other words, using more processors for this operation increases the processing speed but may also waste more hardware resources. The processing operations of step two and step three suffer from similar problems.
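The idle-time bookkeeping for a staircase operation can be illustrated with a few lines of Python. The sketch simply sums the per-processor idle times Tn - Ti listed above. The function name and the example life cycles (1 to 5 time units) are hypothetical values chosen for illustration.

```python
def staircase_idle_times(life_cycles):
    """Return (per-processor idle times, total idle time).

    life_cycles[i] is the life cycle T(i+1) of processor i+1; the largest
    value plays the role of Tn, the duration of the whole iterative step.
    """
    t_n = max(life_cycles)
    idle = [t_n - t for t in life_cycles]
    return idle, sum(idle)

# Hypothetical example: 5 processors whose life cycles are 1, 2, 3, 4, 5.
life_cycles = [1, 2, 3, 4, 5]
idle, wasted = staircase_idle_times(life_cycles)
utilization = sum(life_cycles) / (len(life_cycles) * max(life_cycles))
print(idle, wasted, utilization)
```

With these example values, 10 of the 25 available processor time units are idle, so only 60% of the hardware capacity is used during the step.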
Based on the above analysis, DSPs tend to be used to implement such iterative operations, so a complete operation module has to be divided into several operation sub-modules: the highly parallel operations with higher performance requirements are implemented in the FPGA, and the highly iterative operations are implemented in the DSP. This brings a series of negative effects, most prominently the increased overhead of data communication between the modules, which reduces the overall performance of the system. Moreover, because continually improving smart antenna and joint detection algorithms demand ever higher baseband processing power and speed, even the highest-performance DSPs or ASICs currently available cannot perform overly complex real-time processing. It is therefore necessary to seek a higher-performance processing method.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide an iterative operation structure and method suitable for software radio implementation that make full use of hardware resources, improve computational efficiency, reduce hardware resource occupation, increase the operation processing speed, improve the baseband processing capability, and are simple to implement with good performance.
To achieve this objective, the technical solution of the invention is realized as follows:
An iterative operation structure suitable for software radio implementation comprises at least: a processor module for data processing and computation; a central control unit for controlling and coordinating the work of each module; a system matrix memory for storing system matrix data; a master common factor generator unit for extracting the main common factor; a slave common factor generator unit for generating slave operation factors; a master common factor memory for storing main operation factors; and a slave common factor memory for storing slave operation factors. The central control unit controls the master common factor generator unit, the slave common factor generator unit and the processor module. The data stored in the system matrix memory is sent to the slave common factor generator unit as input; the slave common factor generator unit is connected to the slave common factor memory; the output of the slave common factor memory is connected to the input of the system matrix memory; the master common factor generator unit is connected to the master common factor memory; and the output of the master common factor memory is connected to the input of the slave common factor generator unit.
The key point is that the processor module further comprises more than one sub-processing module of identical structure. Each sub-processing module mainly comprises a sub-processor unit, a memory and a multiplexing unit; data to be processed is input to the sub-processor unit through the memory and the multiplexing unit, and the data processed by the sub-processor unit is input to the memory as the data to be processed in the next step.
The central control unit is connected to the multiplexing unit of each sub-processing module through a bus. Data from the system matrix memory, the slave common factor generator unit, the slave common factor memory, the master common factor generator unit and the master common factor memory are input through buses to the sub-processor units of all the sub-processing modules for processing.
The number of the sub-processing modules is determined according to the size of the data to be processed, the time specified for completing the corresponding operation and the number of available hardware resources.
The sub-processor unit further comprises a main operation module and a slave operation module, and the central control unit selects which of the two works according to a certain condition, namely that all data required for the calculation by the main operation module or the slave operation module have been generated.
A method for realizing iterative operation with the above iterative operation structure is characterized in that, when more than one processor is needed for iterative operation processing, iterative operations with complementary staircase operation characteristics are completed by at least two processors in the same time slice under a certain condition.
Complementary staircase operation means that two iterative operation steps share more than one common factor and that, in a calculation framework with the processors on the vertical axis and the time slices on the horizontal axis, the per-step calculation results of the two iterative operations form complementary staircases. The iterative operations may be more than one different iterative operation step, or different levels of iteration within the same iterative operation step. The certain condition is that all data required by the next iterative operation step have been generated by the previous step, or that all data required for processing the next data item in the current iterative operation step have been generated by that step.
Therefore, the iterative operation structure and the iterative operation method which are suitable for the software radio technology implementation provided by the invention have the following advantages and characteristics:
1) Because the iP structure and the pP structure contain a number of multiplexing modules, the logic resources of the programmable gate array can be multiplexed so that the available logic resources are used to the greatest possible extent in every time slice. The best performance is thus obtained while resources are used optimally, that is, the hardware is used up to its limit.
2) All sub-processors in the iP structure work simultaneously and in parallel, so the life cycles of all sub-processor units in the iP structure are the same, current hardware resources can be fully used, and the resource utilization is improved. The iterative operation required by the system is completed with higher performance, which provides a solution for realizing software radio with an FPGA or similar devices.
3) The invention improves only the iterative part of the overall operation structure, without changing the structure as a whole, so it is simple to implement and well suited to real-time calculation.
4) Because the invention extracts the common factors of the several iterative sub-operations contained in the iterative operation, the main operation module and the slave operation module logically serve different iterative sub-operations while sharing the same multiplexed hardware resources. The scheme therefore occupies few resources, has high computational efficiency, and can complete the iterative operation with higher performance than the highest-performance DSP.
5) When used in a mobile communication system, the invention offers higher capacity and better performance, greatly improves the baseband processing capability, and facilitates the implementation of complex baseband algorithms.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is a schematic diagram of the structure of an iterative operation unit, which mainly includes an iP structure 10 and a pP structure 30. The iP structure 10 is a module for performing iterative operations, and the pP structure 30 is a module for performing flat, i.e. non-iterative, operations. As shown in fig. 1, the result of the iterative operation of the iP structure 10, i.e. the signal S101, is used as the input of the pP structure 30; the signal S100 is a control signal output by the central control unit in the iP structure and is used to control and start the modules in the pP structure 30 to perform the corresponding signal processing and calculation, thereby completing the operation required by the user.
In a wireless communication system, the data symbol sequence $d$ transmitted by each user is generally recovered from the received signal $e^{(k_a)}$ on the basis of channel estimation. Channel estimation is usually performed in the channel estimation module; after processing by the other preceding modules in the system, the corresponding channel impulse response is obtained, and the system matrix $A^{(k_a)}$ is derived from it. This embodiment is the operation structure that uses the system matrix $A^{(k_a)}$ to recover, from the received signal $e^{(k_a)}$, the data symbol sequence $d$ transmitted by each user.
The data symbol sequence transmitted by the user is recovered from the received signal and the system matrix using the following equation:
$$e^{(k_a)} = A^{(k_a)} \cdot d + n^{(k_a)}, \quad k_a = 0, \ldots, K_a - 1$$
where $A^{(k_a)}$ is the channel impulse response matrix corresponding to a particular antenna, $d$ is the symbol vector transmitted by the transmitting end, $n^{(k_a)}$ is the interference vector corresponding to a specific antenna $k_a$, and $e^{(k_a)}$ is the signal received at that antenna $k_a$. The purpose of this embodiment is to solve this equation for $d$; all $K_a$ antennas can be solved in the same way.
To simplify the explanation of the algorithm structure, the parameter $n^{(k_a)}$ may be ignored; for a given antenna, the above equation then simplifies to:
e= A· d (5)
Assume that A in equation (5) is a non-negative definite Hermitian matrix. Solving the equation by the conventional method requires three iterative operation steps, whereas the iterative operation structure containing the iP structure needs only one iterative operation step and two flat operation steps. The embodiment adopts a multiplexed iP structure and a multiplexed pP structure; their implementation, principle and effect are described in detail below with reference to fig. 3, fig. 4 and fig. 5.
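One way to see why a single iterative step followed by two flat steps can suffice is sketched below. The sketch assumes that A is positive definite and that the iterative step delivers the inverse $T$ of a lower triangular factor $L$ of A. The relation $T = L^{-1}$ and the factor order $A = L L^{H}$ are a reading of equations (6) to (10), not a statement taken from the text.

```latex
\begin{aligned}
A &= L L^{H}, \qquad T = L^{-1} \\
d &= A^{-1} e = \left(L L^{H}\right)^{-1} e = L^{-H} L^{-1} e = T^{H} \left(T e\right) \\
r &= T e \qquad \text{(first flat step: a matrix-vector product)} \\
d &= T^{H} r \qquad \text{(second flat step: a matrix-vector product)}
\end{aligned}
```

Under these assumptions, every element of $r$ and of $d$ is an independent sum of products, so no element has to wait for another; only the computation of $L$ and $T$ carries data dependencies between steps.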
As shown in fig. 3, the process of solving equations using the iterative calculation structure of the present invention is as follows:
The iP structure 10 in fig. 3 mainly comprises a central control unit 100, a system matrix memory 102, a master common factor generator unit 104, a slave common factor generator unit 101, a master common factor memory 103, a slave common factor memory 105 and a processor module 106. The central control unit 100 controls the master common factor generator unit 104, the slave common factor generator unit 101 and the processor module 106. The data stored in the system matrix memory 102 is sent as input to the slave common factor generator unit 101, which is connected to the slave common factor memory 105. The output of the slave common factor memory 105 is in turn connected to the input of the system matrix memory 102: because the system matrix memory 102 has a much larger storage space than the slave common factor memory 105, some of the data processed by the slave common factor memory 105 is also stored in the system matrix memory 102. The master common factor generator unit 104 is connected to the master common factor memory 103, whose output is in turn connected to the input of the slave common factor generator unit 101.
The processor module 106 further comprises a plurality of sub-processing modules U1, U2, …, Un. Each sub-processing module is composed of a sub-processor unit (P1, P2, …, Pn), a memory (M1, M2, …, Mn) and a multiplexing unit (X1, X2, …, Xn). Data to be processed is input to the sub-processor units P1, P2, …, Pn through the memories M1, M2, …, Mn and the multiplexing units X1, X2, …, Xn, and the data processed by the sub-processor units is input to the memories M1, M2, …, Mn as the data to be processed next. The central control unit 100 is connected to the multiplexing unit of each sub-processing module via a bus. The input data of each sub-processor unit P1, P2, …, Pn comes from four sources, all transmitted over the bus: data to be processed from the system matrix memory 102; processed data output by the slave common factor memory 105; stored data from the master common factor memory 103; and data from the memories M1, M2, …, Mn. The specific number of sub-processing modules is determined by the size of the data to be processed, the required computational performance and the amount of available hardware resources. All sub-processing modules have the same structure and can operate in parallel under the scheduling and control of the central control unit 100.
The iterative operation in the iP structure at least comprises the following three steps:
a) Suppose that $A$ is a known system matrix stored in a predetermined manner in the system matrix memory 102. Under the control of the central control unit 100, the system matrix memory 102 supplies, based on the system matrix $A$, the input parameter $a_{ij}$ to the slave common factor generator unit 101 and to each of the sub-processor units P1, P2, …, Pn in the processor module 106. At the same time, the master common factor memory 103 supplies the parameter $l_{ij}$ to the slave common factor generator unit 101; the initial value of $l_{ij}$ is set to 0. The slave common factor generator unit 101 then obtains the slave operation factor $q_j$ by calculation according to equation (6):
$$q_j = \left(a_{jj} - \sum_{k=1}^{j-1} l_{jk}\, l_{jk}^{*}\right)^{-1/2}, \quad j = 1, 2, \ldots, m \qquad (6)$$
where $a_{ij}$ is an element of the system matrix $A$ and $l_{jk}^{*}$ is the complex conjugate of $l_{jk}$.
b) Under the control of the central control unit 100, the slave operation factor $q_j$ is supplied to each sub-processor unit P1, P2, …, Pn after intermediate storage in the slave common factor memory 105. At the same time, the master common factor memory 103 supplies the main operation factors $(p_{i1}, p_{i2}, \ldots, p_{ik})$ to each sub-processor unit P1, P2, …, Pn, and the system matrix memory 102 supplies, based on the system matrix $A$, the input parameter $a_{ij}$ to the sub-processor units P1, P2, …, Pn, which then calculate the intermediate results $l_{ij}$ and $t_{ij}$.
Taking the sub-processing module 200A as an example: after the slave operation factor $q_j$, the main operation factor $p_{ik}$ and the system matrix element $a_{ij}$ have been input to the sub-processing module 200A, they are fed, together with the iteration data in the memory M1, through the multiplexing unit X1 to the main operation module or the slave operation module inside the sub-processor unit P1 to calculate the intermediate results $l_{ij}$ and $t_{ij}$. The main operation module and the slave operation module in the sub-processor unit P1 are multiplexing modules.
The main operation module in the sub-processor unit P1 works according to equation (7) and is mainly used to calculate the intermediate result $l_{ij}$:
$$l_{ij} = q_j \left(a_{ij} - \sum_{k=1}^{j-1} p_{ik}\, l_{jk}^{*}\right), \quad i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, m \qquad (7)$$
In equation (7), $a_{ij}$ is an element of the system matrix $A$, $q_j$ is a slave operation factor, $p_{ik}$ is a main operation factor, $l_{jk}^{*}$ is the complex conjugate of $l_{jk}$, and $l_{ij}$ and $l_{jk}$ are intermediate operation results of the iP structure.
The slave operation module in the sub-processor unit P1 works according to equation (8) and is mainly used to calculate the intermediate result $t_{ij}$:
$$t_{ij} = -q_j \sum_{k=j}^{i-1} \left(p_{ik}\, t_{kj}\right), \quad i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, m \qquad (8)$$
In equation (8), $q_j$ is a slave operation factor, $p_{ik}$ is a main operation factor, and $t_{ij}$ is the operation result of the iP operation module, which satisfies $t_{jj} = q_j$.
c) The values $l_{ij}$ and $t_{ij}$ calculated by each sub-processor are sent to and stored in the memories M1, M2, …, Mn. Meanwhile, the master common factor generator unit 104 takes the output results of P1, P2, …, Pn as its input and generates the main operation factor $p_{ik}$ required for the next iteration. The master common factor memory 103 on the one hand supplies the slave common factor generator unit 101 with the input parameter $l_{ij}$, which is used to generate the slave operation factor $q_j$ required for the next iteration, and on the other hand supplies each sub-processor unit P1, P2, …, Pn in the processor module 106 with an input parameter, namely the main operation factor $p_{ik}$.
Steps a) to c) are repeated until all data in the system matrix memory 102 have been processed and stored in M1, M2, …, Mn. At this point all operations in the iP structure are complete, and the system can start the subsequent operations in the pP structure. The system may also start the operation of the pP structure before all operations in the iP structure are complete, provided that the parameters required by the pP structure have already been generated.
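The repetition of steps a) to c) can be sketched in software as a loop over columns. The sketch below is an illustration under assumptions, not the hardware structure. It assumes NumPy, a Hermitian positive definite A, and that the main operation factor $p_{ik}$ equals the previously computed $l_{ik}$. It also scales each row of t by the factor belonging to that row, which is the standard form for inverting a lower triangular matrix; equation (8) as printed writes $q_j$, and the sketch does not attempt to reconcile the two. The function name is hypothetical.

```python
import numpy as np

def ip_iteration(A):
    """Return (L, T) with A = L L^H and T = inverse of L, built step by step."""
    A = np.asarray(A, dtype=complex)
    m = A.shape[0]
    L = np.zeros((m, m), dtype=complex)
    T = np.zeros((m, m), dtype=complex)
    for j in range(m):
        # Step a), equation (6): slave operation factor for this step.
        q_j = (A[j, j].real - np.sum(np.abs(L[j, :j]) ** 2)) ** -0.5
        # Step b), equation (7): column j of L. The m - j results do not
        # depend on each other, so they could run on parallel processors.
        for i in range(j, m):
            L[i, j] = q_j * (A[i, j] - np.sum(L[i, :j] * np.conj(L[j, :j])))
        # Step b), equation (8) in the form assumed here: row j of T,
        # with the diagonal element equal to the slave operation factor.
        T[j, j] = q_j
        for c in range(j):
            T[j, c] = -q_j * np.sum(L[j, c:j] * T[c:j, c])
        # Step c): L and T now hold the factors needed by step j + 1.
    return L, T

# Small hypothetical Hermitian positive definite system matrix.
A = np.array([[4.0, 1.0 + 1.0j, 0.5],
              [1.0 - 1.0j, 3.0, 0.2j],
              [0.5, -0.2j, 2.0]])
L, T = ip_iteration(A)
print(np.allclose(L @ L.conj().T, A), np.allclose(T @ L, np.eye(3)))
```

In step j the loop produces m - j elements of L and j elements of T (counting from zero), so the total amount of work per step stays constant, which is the property the parallel structure exploits.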
Under the control of the central control unit 100, the master common factor generator unit 104, the master common factor memory 103, the slave common factor generator unit 101, the slave common factor memory 105 and the processor module 106 form a pipelined operation structure; that is, the data streams of these parts are pipelined. Throughout the operation, all hardware modules work cooperatively, so that hardware resources are used effectively and the real-time requirements of the system are met.
After the operation in the iP structure is finished, the final operation result is sent to the pP structure as an input signal, and equation (5) is finally solved there. The pP operation structure is a flat structure, meaning that all elements of the vector to be solved have an equal chance of being solved, with no required order; that is, the structure is not iterative.
The pP operation structure mainly comprises a local controller 302, a multiplexed processing module 303, multiplexers 305 and 306, a conjugate transpose module 301 and a memory 304; S100, S101 and S102 are its three input signals. The local controller 302 simultaneously controls the multiplexed processing module 303, the multiplexers 305 and 306, the conjugate transpose module 301 and the memory 304. The output of the processing module 303 is input to the multiplexer 306 through the memory 304; the output signal S101 of the iP structure is input to the multiplexer 305 either directly or through the conjugate transpose module 301; the output of the multiplexer 305 is connected to the multiplexed processing module 303; and the control signal S100 of the iP structure is connected directly to the local controller 302. As shown in fig. 4, S100 is a control signal output by the central control unit 100 in the iP structure and is used to start the local controller 302 in the pP structure; S101 is the calculation result $t_{ij}$ of the iP structure; and S102 is the known received signal data $e^{(k_a)}$ input by the system. The functional operation performed by the pP structure mainly comprises the following two steps:
the first step is as follows: <math> <mrow> <msub> <mi>r</mi> <mi>i</mi> </msub> <mo>=</mo> <munderover> <mi>Σ</mi> <mrow> <mi>k</mi> <mo>=</mo> <mn>1</mn> </mrow> <mi>i</mi> </munderover> <msub> <mi>t</mi> <mi>ik</mi> </msub> <msub> <mi>e</mi> <mi>k</mi> </msub> <mi>i</mi> <mo>=</mo> <mn>1,2</mn> <mo>,</mo> <mo>.</mo> <mo>.</mo> <mo>.</mo> <mi>m</mi> <mo>-</mo> <mo>-</mo> <mo>-</mo> <mrow> <mo>(</mo> <mn>9</mn> <mo>)</mo> </mrow> </mrow> </math>
in the formula (9), tikIs the resulting vector from the iP structure, i.e., signal S101 in fig. 4; e.g. of the typekIs receiving a data vectore (k,a)I.e., signal S102 in fig. 4; the result obtained by this step is riI.e. signal S105 in fig. 4, which is stored in the memory 304 and further used for the next operation.
The second step:
$$d_i = \sum_{k=1}^{m} t_{ik}^{*}\, r_k, \quad i = 1, 2, \ldots, m \qquad (10)$$
In equation (10), $t_{ik}^{*}$ is the conjugate transpose of $t_{ik}$, i.e. signal S106 in fig. 4; $d_i$ is an element of the symbol vector $d$ transmitted by the transmitting end, i.e. signal S107 in fig. 4; and m is the length of $r_i$.
Under the control of the local controller 302, the multiplexers 305 and 306 select between the signals S101 and S106 and between the signals S102 and S105, respectively; their outputs S103 and S104 are the input parameters of the processing module 303. The two steps above yield the final output result, signal S107, which is the user transmitted data symbol sequence $d$ sought by this embodiment.
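The two flat steps can be written as two matrix-vector products. The following Python sketch is only an illustration. It assumes NumPy, assumes that the iP result $t_{ik}$ forms the inverse of a lower triangular Cholesky factor of A, and reads $t_{ik}^{*}$ in equation (10) as an element of the conjugate transpose of that matrix. The matrix T is obtained here from NumPy routines as a stand-in for signal S101, and the function name is hypothetical.

```python
import numpy as np

def pp_flat_steps(T, e):
    """Apply equations (9) and (10): r = T e, then d = T^H r."""
    r = T @ e            # equation (9); T is lower triangular, so k stops at i
    d = T.conj().T @ r   # equation (10), with t* read as the conjugate transpose
    return d

# Small hypothetical Hermitian positive definite system and received vector.
A = np.array([[4.0, 1.0 + 1.0j], [1.0 - 1.0j, 3.0]])
e = np.array([1.0, 2.0], dtype=complex)

# Stand-in for the iP output (signal S101): inverse of the Cholesky factor.
T = np.linalg.inv(np.linalg.cholesky(A))
d = pp_flat_steps(T, e)
print(np.allclose(A @ d, e))
```

Neither product has a dependency between output elements, so each $r_i$ and each $d_i$ could be computed by a separate processing element at the same time.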
Although only the $d$ corresponding to antenna $k_a$ is solved in this embodiment, all $K_a$ antennas can likewise be handled with the structure and method of the invention. In this embodiment the matrix A is assumed to be a non-negative definite Hermitian matrix; in practical applications, a similar architecture can be used, and higher performance achieved, as long as the leading principal minors of the matrix A are non-zero.
When the iterative operation structure of the invention runs multiple processors in parallel, the n sub-processors shown in fig. 3 process in parallel. As the iterative operations of equations (7) and (8) show, the structure extracts the common factors $q_j$ and $p_{ik}$ of the two iterative operations and multiplexes the common operation modules, namely the master operation module and the slave operation module in the sub-processor units P1, P2, …, Pn. The performance shown in fig. 5 can thus be achieved, in which the life cycles of the sub-processors are essentially the same. For example, in the equation-solving process above, suppose the coefficient matrix A is an m × m Hermitian matrix and m processors are used for parallel processing, so that n = m. Since A is a Hermitian matrix, triangular decomposition yields two mutually conjugate matrices, and the elements of the i-th column (or row) of the generated matrix depend on those of the (i-1)-th column: the elements of column (or row) 2 can only be processed after all elements of column (or row) 1 have been processed, the elements of column (or row) 3 only after all elements of column (or row) 2, and so on. In addition, equations (7) and (8) share two common factors.
Then, at time t1, the m processors work in parallel and process the m elements of column (or row) 1 of matrix A according to equation (7), yielding m non-zero elements. At time t2, m-1 processors work in parallel and process the elements of column (or row) 2 of matrix A according to equation (7), yielding m-1 non-zero elements; at the same time the input data of equation (8) are already available, so processor 1 processes according to equation (8). At time t3, m-2 processors work in parallel and process the elements of column (or row) 3 of matrix A according to equation (7), yielding m-2 non-zero elements, while processors 1 and 2 process the elements of column 3 according to equation (8); and so on. For m = 5 this gives the case shown in fig. 6, where P1 to P5 denote the 5 processors and T1 to T5 denote 5 time instants; the area filled with positive-slope hatching is the calculation result of the iterative sub-operation of equation (7), and the area filled with negative-slope hatching is that of the iterative sub-operation of equation (8). The two hatched areas have complementary staircase operation characteristics: the calculation results of the iterative sub-operations of equation (7) and of equation (8) each form a staircase, and the two staircases are complementary, i.e. they fill each other out to form a rectangle. The elements in the area filled with grid lines are computed by the slave common factor generator unit 101 of the illustrated embodiment, which calculates the operation factors required by equations (7) and (8). In this way the life cycle of each sub-processor is close to Tn, which means that hardware resources are fully used: the computational efficiency is improved and the resource utilization is greatly increased.
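The complementary staircase can be made concrete with a small scheduling table. The Python sketch below is illustrative only. It assumes that in time slice t the processors t to m work on equation (7) and the processors 1 to t-1 work on equation (8); the text fixes only how many processors run each equation, so the assignment to particular processors is an assumption. The elements handled by the slave common factor generator unit are not modelled, and the function name is hypothetical.

```python
def complementary_schedule(m):
    """For each time slice t = 1..m, list the task of processors 1..m.

    Processors t..m work on equation (7); processors 1..t-1, which would
    be idle in a plain staircase, work on equation (8) instead.
    """
    return [["eq8" if p < t else "eq7" for p in range(1, m + 1)]
            for t in range(1, m + 1)]

schedule = complementary_schedule(5)
for t, row in enumerate(schedule, start=1):
    print(f"T{t}:", " ".join(row))
```

Every row of the table has m entries and none of them is idle: the number of processors on equation (7) shrinks by one per time slice while the number on equation (8) grows by one, so the two staircases together fill a rectangle.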
The above embodiment is mainly applied to implementing the iterative algorithms of software radio in a wireless communication system; the operation structure and implementation method of the invention provide a high-performance solution for implementing software radio on a single large-scale programmable logic device. The invention can also be applied, with only slight changes to the input signals and the structure, to other settings that require solving systems of linear equations, such as image processing systems and pattern recognition systems. Provided that the leading principal minors of the coefficient matrix of the linear system are non-zero, or that an operation has the staircase operation characteristics shown in fig. 2, a similar system structure can be used to carry out several operations with complementary staircase characteristics and achieve higher performance, and thus a better cost-performance ratio.
The hardware architecture of the invention is fully applicable to the design of hard cores and soft cores based on iterative operation, and it provides a solution for the design of high-performance dedicated chips.
In short, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.