CN105323036A - Method and device for performing singular value decomposition on a complex matrix, and computing equipment

Publication number
CN105323036A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410377159.5A
Other languages
Chinese (zh)
Inventor
袁雁南
孔令斌
段然
崔春风
易芝玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Application filed by China Mobile Communications Group Co Ltd
Priority to CN201410377159.5A
Publication of CN105323036A
Landscapes

  • Complex Calculations (AREA)

Abstract

Embodiments of the invention provide a method and a device for performing singular value decomposition (SVD) on a complex matrix, and computing equipment, in the field of singular value decomposition. The method comprises: computing, in parallel with multiple threads, the elements of the real matrix X that corresponds to the complex matrix (X is given in the description), and constructing the real matrix; diagonalizing the real matrix X by Jacobi iteration to obtain a diagonalized matrix whose off-diagonal elements are all less than or equal to a preset threshold, wherein in each iteration the matrix operations of the Jacobi rotation matrices corresponding to different index pairs are performed in parallel by multiple threads; and computing the diagonal matrix D and the right singular matrix V of the complex matrix from the diagonalized matrix and the Jacobi rotation matrices used to obtain it. This technical scheme reduces the data-processing waiting delay.

Description

Method, device, and computing equipment for performing singular value decomposition on a complex matrix
Technical field
The present invention relates to singular value decomposition, and in particular to a method, a device, and computing equipment for performing singular value decomposition on a complex matrix H = A + Bi that can reduce the data-processing waiting delay.
Background
SVD (singular value decomposition) is one of the most basic and most important tools of modern numerical analysis. It is widely used in statistical analysis, signal and image processing, and systems theory and control.
In a communication system, for example, channel capacity is one of the key metrics of Shannon's information theory: it characterizes the maximum data-transmission capability of a communication channel and is an important basis for comparing and evaluating communication systems. Ever since MIMO was introduced into HSDPA communication systems, MIMO has been a key technology in mobile communications.
At present, the mainstream wireless communication standards, including 802.11n/ac and LTE-Advanced, all continue to use MIMO. An important reason is that MIMO can exploit the "orthogonality" of the spatial channels to further increase the maximum data-transmission capability of the communication channel, which a traditional SISO system cannot achieve.
In a MIMO communication system, the first problem to solve is how to suppress the interference between multiple data streams as they propagate through the wireless channel. To solve it, precoding is introduced in the baseband signal processing of the MIMO system. MIMO precoding uses known channel state information to precode the signal at the MIMO transmitter so that the transmitted signal best matches the current channel, yielding good error-rate performance or system capacity. Optimal precoding assumes that the transmitter has complete knowledge of the channel; in that case the optimal precoding is SVD-based precoding.
However, existing methods for performing SVD on a complex matrix all take a long time to run and cannot meet increasingly strict real-time requirements.
Summary of the invention
The object of the present invention is to provide a method, a device, and computing equipment for performing singular value decomposition on a complex matrix H = A + Bi that can reduce the data-processing waiting delay.
To achieve this object, an embodiment of the invention provides a method for performing singular value decomposition on a complex matrix H = A + Bi. The method decomposes the channel complex matrix H of a multiple-input multiple-output (MIMO) system to obtain a right singular matrix V, so that the transmitter of the MIMO system can use the matrix V to precode the signal to be transmitted. The method comprises:
A real matrix construction step: using multiple threads to compute in parallel the elements of the real matrix $X = \begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$ corresponding to the complex matrix, and constructing the real matrix;
A diagonalization step: diagonalizing the real matrix X by Jacobi iteration to obtain a diagonalized matrix whose off-diagonal elements are all less than or equal to a predetermined threshold, wherein in each iteration the matrix operations of the Jacobi rotation matrices corresponding to different index pairs are performed in parallel by multiple threads;
A matrix computation step: computing the diagonal matrix D and the right singular matrix V of the complex matrix from the diagonalized matrix and the Jacobi rotation matrices used to obtain it.
In the above method, the real matrix construction step specifically comprises:
A computing sub-step: using multiple threads to compute in parallel the elements of the upper triangular part of the matrix $A^T A + B^T B$ and the elements of the matrix $B^T A - A^T B$;
A building sub-step: using the upper triangular part of $A^T A + B^T B$ and the matrix $B^T A - A^T B$ to build the real matrix $\begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$.
In the above method, the diagonalization step performs the iterative computation by the following formula:
$X_{k+1} = C J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where $C = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k$, $X_k$ is the intermediate matrix output by the previous iteration, $J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation matrices used in the current iteration, and N is the order of the matrix X.
In the above method, the diagonalization step specifically comprises:
A first diagonalization sub-step: using $N^2$ threads to compute the $N^2$ elements of matrix C in parallel;
A second diagonalization sub-step: using $N^2$ threads to compute the $N^2$ elements of matrix $X_{k+1}$ in parallel.
In the above method, in the matrix computation step, multiple threads compute in parallel the elements of the diagonal matrix D′ and the right singular matrix V′ of the real matrix, after which the diagonal matrix D and the right singular matrix V of the complex matrix are constructed from D′ and V′.
To better achieve the above object, an embodiment of the invention further provides a device for performing singular value decomposition on a complex matrix H = A + Bi. The device decomposes the channel complex matrix H of a MIMO system to obtain a right singular matrix V, so that the transmitter of the MIMO system can use the matrix V to precode the signal to be transmitted. The device comprises:
A real matrix construction module, configured to use multiple threads to compute in parallel the elements of the real matrix $X = \begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$ corresponding to the complex matrix, and to construct the real matrix;
A diagonalization module, configured to diagonalize the real matrix X by Jacobi iteration to obtain a diagonalized matrix whose off-diagonal elements are all less than or equal to a predetermined threshold, wherein in each iteration the matrix operations of the Jacobi rotation matrices corresponding to different index pairs are performed in parallel by multiple threads;
A second matrix computation module, configured to compute the diagonal matrix D and the right singular matrix V of the complex matrix from the diagonalized matrix and the Jacobi rotation matrices used to obtain it.
In the above device, the real matrix construction module specifically comprises:
A computing unit, configured to use multiple threads to compute in parallel the elements of the upper triangular part of the matrix $A^T A + B^T B$ and the elements of the matrix $B^T A - A^T B$;
A construction unit, configured to use the upper triangular part of $A^T A + B^T B$ and the matrix $B^T A - A^T B$ to build the real matrix $\begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$.
In the above device, the diagonalization module is specifically configured to perform the iterative computation by the following formula:
$X_{k+1} = C J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where $C = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k$, $X_k$ is the intermediate matrix output by the previous iteration, $J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation matrices used in the current iteration, and N is the order of the matrix X.
In the above device, the diagonalization module specifically comprises:
A first diagonalization unit, configured to use $N^2$ threads to compute the $N^2$ elements of matrix C in parallel;
A second diagonalization unit, configured to use $N^2$ threads to compute the $N^2$ elements of matrix $X_{k+1}$ in parallel.
To better achieve the above object, an embodiment of the invention further provides computing equipment comprising a main control unit and a many-core processor with memory. The equipment decomposes the channel complex matrix H of a MIMO system to obtain a right singular matrix V, so that the transmitter of the MIMO system can use the matrix V to precode the signal to be transmitted. The main control unit sends the complex matrix H = A + Bi that is to undergo singular value decomposition to the memory of the many-core processor, and the many-core processor is configured to:
Use multiple threads to compute in parallel the elements of the real matrix $X = \begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$ corresponding to the complex matrix, and construct the real matrix;
Diagonalize the real matrix X by Jacobi iteration to obtain a diagonalized matrix whose off-diagonal elements are all less than or equal to a predetermined threshold, wherein in each iteration the matrix operations of the Jacobi rotation matrices corresponding to different index pairs are performed in parallel by multiple threads;
Compute the diagonal matrix D and the right singular matrix V of the complex matrix from the diagonalized matrix and the Jacobi rotation matrices used to obtain it;
The main control unit copies the diagonal matrix D and the right singular matrix V and then outputs them.
In the above computing equipment, the many-core processor is specifically configured to use multiple threads to compute in parallel the elements of the upper triangular part of the matrix $A^T A + B^T B$ and the elements of the matrix $B^T A - A^T B$, and then to use them to build the real matrix $\begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$.
In the above computing equipment, the many-core processor is specifically configured to perform the iterative computation by the following formula:
$X_{k+1} = C J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where $C = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k$, $X_k$ is the intermediate matrix output by the previous iteration, $J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation matrices used in the current iteration, and N is the order of the matrix X.
In the above computing equipment, the many-core processor is specifically configured to use $N^2$ threads to compute the $N^2$ elements of matrix C in parallel, and then to use $N^2$ threads to compute the $N^2$ elements of matrix $X_{k+1}$ in parallel.
In the above computing equipment, the main control unit is further configured to release the resources occupied by the singular value decomposition after the decomposition ends.
In the above computing equipment, the main control unit is a general-purpose processor.
In embodiments of the invention, multiple threads compute in parallel while the real matrix X is diagonalized by Jacobi iteration into a diagonalized matrix whose off-diagonal elements are all less than or equal to a predetermined threshold. This shortens the time needed for diagonalization and reduces the data-processing waiting delay, so that increasingly strict real-time requirements can be met.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the method of an embodiment of the invention;
Fig. 2 is a schematic diagram of the structure of the real matrix of an embodiment of the invention;
Fig. 3 is a schematic diagram of a parallel ordering rule for determining the index pairs (i, j);
Fig. 4 is a schematic structural diagram of the device of an embodiment of the invention;
Fig. 5 is a schematic diagram of the hardware architecture of the computing equipment of an embodiment of the invention;
Fig. 6 is a schematic flow chart of SVD performed by the computing equipment of an embodiment of the invention;
Fig. 7 is a schematic flow chart of parallel SVD performed by the computing equipment of an embodiment of the invention.
Detailed description of the embodiments
In embodiments of the invention, multiple threads compute in parallel while the real matrix X is diagonalized by Jacobi iteration into a diagonalized matrix whose off-diagonal elements are all less than or equal to a predetermined threshold. This shortens the time needed for diagonalization and reduces the data-processing waiting delay, so that increasingly strict real-time requirements can be met.
The method of an embodiment of the invention for performing singular value decomposition on a complex matrix H = A + Bi decomposes the channel complex matrix H of a MIMO system to obtain a right singular matrix V, so that the transmitter of the MIMO system can use the matrix V to precode the signal to be transmitted. As shown in Fig. 1, the method comprises:
Real matrix construction step 101: using multiple threads to compute in parallel the elements of the real matrix $X = \begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$ corresponding to the complex matrix, and constructing the real matrix;
Diagonalization step 102: diagonalizing the real matrix X by Jacobi iteration to obtain a diagonalized matrix whose off-diagonal elements are all less than or equal to a predetermined threshold, wherein in each iteration the matrix operations of the Jacobi rotation matrices corresponding to different index pairs are performed in parallel by multiple threads;
Matrix computation step 103: computing the diagonal matrix D and the right singular matrix V of the complex matrix from the diagonalized matrix and the Jacobi rotation matrices used to obtain it.
For the complex matrix H = A + Bi, suppose its singular value is $\sigma$. Its conjugate transpose is $H^H = A^T - B^T i$, so $H^H H = (A^T - B^T i)(A + Bi) = (A^T A + B^T B) + (A^T B - B^T A) i$. The corresponding eigenvalue of $H^H H$ is $\sigma^2$; writing the corresponding eigenvector as $u + vi$ gives:
$[(A^T A + B^T B) + (A^T B - B^T A) i](u + vi) = \sigma^2 (u + vi)$
Separating the real and imaginary parts:
$\begin{cases} (A^T A + B^T B)u + (B^T A - A^T B)v = \sigma^2 u \\ (A^T A + B^T B)v + (A^T B - B^T A)u = \sigma^2 v \end{cases}$
In matrix form:
$\begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = \sigma^2 \begin{bmatrix} u \\ v \end{bmatrix}$
The analysis above shows that the decomposition of an N × N complex matrix can be converted into the decomposition of a 2N × 2N real matrix; that is, what must be solved is the decomposition of the real matrix $X = \begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$.
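The equivalence derived above can be checked numerically. The following is a minimal sketch in plain Python, not the patent's implementation; the helper names `build_real_matrix`, `complex_side`, and `real_side` are hypothetical. It builds X from A and B and compares $H^H H (u + vi)$, computed with complex arithmetic, against $X [u; v]$, computed with real arithmetic only. The identity holds for any vector, so the check does not need an actual eigenvector.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(p, q):
    """Plain matrix product; works for real or complex entries."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*q)] for row in p]


def build_real_matrix(a, b):
    """Build the 2N x 2N real matrix X for H = A + Bi (illustrative helper)."""
    n = len(a)
    at, bt = transpose(a), transpose(b)
    ata, btb = matmul(at, a), matmul(bt, b)
    bta, atb = matmul(bt, a), matmul(at, b)
    p = [[ata[r][c] + btb[r][c] for c in range(n)] for r in range(n)]  # A^T A + B^T B
    q = [[bta[r][c] - atb[r][c] for c in range(n)] for r in range(n)]  # B^T A - A^T B
    top = [p[r] + q[r] for r in range(n)]
    bottom = [[-q[r][c] for c in range(n)] + p[r] for r in range(n)]   # A^T B - B^T A = -Q
    return top + bottom


def complex_side(a, b, u, v):
    """Return H^H H (u + vi), computed with complex arithmetic."""
    n = len(a)
    h = [[complex(a[r][c], b[r][c]) for c in range(n)] for r in range(n)]
    hh = [[h[c][r].conjugate() for c in range(n)] for r in range(n)]  # conjugate transpose
    w = [[complex(u[k], v[k])] for k in range(n)]                     # column vector
    return [row[0] for row in matmul(matmul(hh, h), w)]


def real_side(a, b, u, v):
    """Return X [u; v], computed with real arithmetic only."""
    x = build_real_matrix(a, b)
    stacked = [[val] for val in u + v]                                # [u; v] as a column
    return [row[0] for row in matmul(x, stacked)]
```

For an N × N input, the first N entries of `real_side` should match the real parts of `complex_side`, and the last N entries should match the imaginary parts.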
As shown in Fig. 2, the real matrix X is in fact a symmetric matrix: its upper triangular part and its lower triangular part mirror each other across the main diagonal, so computing only the upper triangular part (or only the lower triangular part) of X yields the complete matrix.
Moreover, the matrix $A^T A + B^T B$ equals its own transpose. In terms of the blocks labelled in Fig. 2, block D (the upper triangular part of $A^T A + B^T B$) gives block E after transposition (D and E are in fact the same matrix), so only one of them needs to be computed (here, block D). Converting the complex-matrix decomposition into a real-matrix decomposition therefore makes full use of the symmetry of the real matrix: only blocks C and D of Fig. 2 need to be computed rather than the whole real matrix, which greatly reduces the workload.
Therefore, to reduce the amount of computation and increase the computation speed, the real matrix construction step 101 in embodiments of the invention specifically comprises:
A computing sub-step: using multiple threads to compute in parallel the elements of the upper triangular part of the matrix $A^T A + B^T B$ and the elements of the matrix $B^T A - A^T B$;
A building sub-step: using the upper triangular part of $A^T A + B^T B$ and the matrix $B^T A - A^T B$ to build the real matrix $\begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$.
Embodiments of the invention exploit the symmetry of the real matrix X: once about 3/8 of the elements of X have been computed (for an N × N complex matrix, the $N^2$ elements of $B^T A - A^T B$ plus the roughly $N^2/2$ elements of the upper triangular part of $A^T A + B^T B$, out of $4N^2$ elements in total), the whole matrix X can be obtained from them. This reduces the amount of computation, which further shortens the processing time and reduces the data-processing waiting delay, so that increasingly strict real-time requirements can be met.
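The saving can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: the function name `build_x_with_symmetry` is hypothetical, A and B are assumed to be square N × N lists of rows, and a standard-library thread pool stands in for the threads of a many-core processor (in CPython the pool shows the task decomposition, not a real speed-up). Only the upper triangle of $A^T A + B^T B$ and the full block $B^T A - A^T B$ are computed, one task per element; everything else is filled in by symmetry.

```python
from concurrent.futures import ThreadPoolExecutor


def build_x_with_symmetry(a, b, workers=4):
    """Build X from the upper triangle of P = A^T A + B^T B and all of Q = B^T A - A^T B."""
    n = len(a)

    def col_dot(m1, c1, m2, c2):
        # Entry (c1, c2) of M1^T M2 is the dot product of two columns.
        return sum(m1[k][c1] * m2[k][c2] for k in range(n))

    def p_entry(rc):
        r, c = rc
        return col_dot(a, r, a, c) + col_dot(b, r, b, c)

    def q_entry(rc):
        r, c = rc
        return col_dot(b, r, a, c) - col_dot(a, r, b, c)

    upper = [(r, c) for r in range(n) for c in range(r, n)]  # upper triangle of P
    full = [(r, c) for r in range(n) for c in range(n)]      # every entry of Q
    with ThreadPoolExecutor(max_workers=workers) as pool:
        p_vals = list(pool.map(p_entry, upper))              # one task per element
        q_vals = list(pool.map(q_entry, full))

    x = [[0.0] * (2 * n) for _ in range(2 * n)]
    for (r, c), val in zip(upper, p_vals):
        for off in (0, n):                                   # both diagonal blocks
            x[r + off][c + off] = val
            x[c + off][r + off] = val                        # mirror across the diagonal
    for (r, c), val in zip(full, q_vals):
        x[r][c + n] = val                                    # upper-right block Q
        x[c + n][r] = val                                    # lower-left block is Q^T
    return x, len(upper) + len(full)
```

The second return value is the number of elements actually computed, $N(N+1)/2 + N^2$, which approaches 3/8 of the $4N^2$ elements of X as N grows.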
At the same time, in specific embodiments of the invention, the elements of the upper triangular part of the matrix $A^T A + B^T B$ and the elements of the matrix $B^T A - A^T B$ are computed in parallel by multiple threads, which further shortens the processing time and reduces the data-processing waiting delay, so that increasingly strict real-time requirements can be met.
Once the real matrix X has been obtained, it must be diagonalized. In specific embodiments of the invention, the Jacobi algorithm is used to diagonalize the real matrix X.
The basic idea of the Jacobi algorithm is to transform the real matrix by a series of rotation matrices, reducing the norm of the off-diagonal elements step by step until the matrix is diagonalized.
A Jacobi rotation transformation of a matrix K uses a rotation matrix $J(i, j, \theta)$ whose elements satisfy the following conditions:
$j_{mm} = 1,\quad j_{mn} = 0$ (for all entries other than the four listed below)
$j_{ii} = \cos\theta,\quad j_{ij} = \sin\theta$
$j_{ji} = -\sin\theta,\quad j_{jj} = \cos\theta$
where:
$\varphi + \theta = \tan^{-1}\dfrac{K(i,j) + K(j,i)}{K(j,j) - K(i,i)}$;
$\varphi - \theta = \tan^{-1}\dfrac{K(j,i) - K(i,j)}{K(i,i) + K(j,j)}$.
Because the real matrix X is symmetric:
$K(j,i) = K(i,j)$
Therefore:
$\varphi = \theta = \dfrac{1}{2}\tan^{-1}\dfrac{K(i,j) + K(j,i)}{K(j,j) - K(i,i)}$.
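As an illustration of the angle formula for the symmetric case, the sketch below computes $\theta$ for an index pair, builds $J(i, j, \theta)$ with the element layout given above, and applies the two-sided rotation $J^T K J$. The function names are hypothetical, and the guard for $K(j,j) = K(i,i)$ (where the fraction is undefined) is an added assumption: it uses $\theta = \pi/4$, which also zeroes the element in that case.

```python
import math


def jacobi_angle(k, i, j):
    """theta = 1/2 * atan((K(i,j) + K(j,i)) / (K(j,j) - K(i,i))), guarded for a zero denominator."""
    num = k[i][j] + k[j][i]
    den = k[j][j] - k[i][i]
    if den == 0.0:
        # Assumption for the degenerate case: pi/4 cancels the off-diagonal pair.
        return math.pi / 4 if num != 0.0 else 0.0
    return 0.5 * math.atan(num / den)


def rotation_matrix(n, i, j, theta):
    """Identity matrix with the four entries (i,i), (i,j), (j,i), (j,j) replaced."""
    jm = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    jm[i][i], jm[i][j] = math.cos(theta), math.sin(theta)
    jm[j][i], jm[j][j] = -math.sin(theta), math.cos(theta)
    return jm


def matmul(p, q):
    """Plain matrix product on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*q)] for row in p]


def rotate(k, i, j):
    """Apply the two-sided rotation J^T K J for the index pair (i, j)."""
    jm = rotation_matrix(len(k), i, j, jacobi_angle(k, i, j))
    jt = [list(col) for col in zip(*jm)]
    return matmul(matmul(jt, k), jm)
```

After `rotate(k, i, j)` on a symmetric `k`, the elements at positions (i, j) and (j, i) are zero up to rounding, while the trace of the matrix is unchanged.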
For the real matrix X, the Jacobi iteration can be expressed as follows; the intermediate matrix $X_{k+1}$ output by the current iteration is:
$X_{k+1} = C J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where $C = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k$, $X_k$ is the intermediate matrix output by the previous iteration, $J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation matrices used in the current iteration, and N is the order of the matrix X.
For a given index pair (i, j), the corresponding rotation matrix is $J(i, j, \theta)$. Applying it to a matrix (by left or right multiplication) affects only rows i and j, or only columns i and j, of the matrix. The rotation angle $\theta$ depends only on the elements $\begin{bmatrix} a_{ii} & a_{ij} \\ a_{ji} & a_{jj} \end{bmatrix}$ of the matrix.
Each iteration selects a different index pair (i, j) and computes the corresponding rotation matrix $J(i, j, \theta)$; transforming the matrix repeatedly in this way diagonalizes it.
Over the course of the iteration, if the 2 × 2 submatrices generated by the index pairs include every 2 × 2 submatrix of the matrix exactly once, the process is called one sweep.
Specific embodiments of the invention exploit the fact that each step of the Jacobi iteration affects only part of the matrix, and use multiple threads to carry out the operations in parallel. This shortens the diagonalization time and reduces the data-processing waiting delay, so that increasingly strict real-time requirements can be met.
A two-sided rotation for an index pair (i, j) affects only rows i and j and columns i and j of the matrix, so within one sweep it is feasible to perform non-conflicting two-sided rotations in parallel. Suppose the matrix is of order 4; one sweep can then be decomposed into the following three non-conflicting rotation sets:
rot_set(1)={(1,2),(3,4)};
rot_set(2)={(1,3),(2,4)};
rot_set(3)={(1,4),(2,3)};
The rotation subproblems within each rotation set do not conflict and can be processed in parallel. For example, subproblems (1, 2) and (3, 4) of the first rotation set can be processed in parallel, because each subproblem affects only the rows and columns it selects: subproblem (1, 2) affects only rows and columns 1 and 2 of the matrix, and subproblem (3, 4) affects only rows and columns 3 and 4.
Fig. 3 shows a parallel ordering rule for determining the index pairs (i, j): each row forms one rotation, and after each shift along the arrows the resulting rotations do not conflict with one another, so all of them can be processed in parallel. A matrix of order N needs N-1 steps to complete one sweep. For N = 8, one sweep decomposes into the following seven steps:
S1:(1,2),(3,4),(5,6),(7,8)
S2:(1,4),(2,6),(3,8),(5,7)
S3:(1,6),(4,8),(2,7),(3,5)
S4:(1,8),(6,7),(4,5),(2,3)
S5:(1,7),(8,5),(6,3),(4,2)
S6:(1,5),(7,3),(8,2),(6,4)
S7:(1,3),(5,2),(7,4),(8,6)
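A schedule with the same properties as the steps listed above can be generated programmatically. The sketch below uses the standard round-robin (circle) method, which is an assumption on my part: it yields N-1 steps of N/2 disjoint pairs covering every index pair exactly once, but the order of steps and pairs need not match the rule of Fig. 3. The function name `rotation_sets` is hypothetical.

```python
def rotation_sets(n):
    """Return n-1 steps, each a list of n/2 disjoint 1-based index pairs (n must be even)."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    order = list(range(1, n + 1))
    steps = []
    for _ in range(n - 1):
        # Pair the k-th entry from the front with the k-th entry from the back.
        pairs = [tuple(sorted((order[k], order[n - 1 - k]))) for k in range(n // 2)]
        steps.append(pairs)
        # Keep the first index fixed and rotate all the others by one position.
        order = [order[0], order[-1]] + order[1:-1]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(rotation_sets(8), start=1):
        print(f"S{number}:", step)
```

Because the pairs within a step share no index, the corresponding rotations touch disjoint rows and columns, which is the property the parallel processing relies on.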
A matrix can be diagonalized in a finite number of sweeps.
Returning to the iteration on the real matrix X in embodiments of the invention:
$X_{k+1} = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
Let:
$C = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k$
Then:
$X_{k+1} = C J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
Here $J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation matrices corresponding to the different index pairs in one sweep; they operate on different rows or columns of the matrix $X_k$ and do not conflict.
By the properties of matrix multiplication, left-multiplying (or right-multiplying) a matrix by a rotation matrix changes only the corresponding rows (or columns) of the matrix. The formula $C = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k$ therefore operates only on the rows of $X_k$, and $J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ update the rows of $X_k$ without conflicting with one another.
Likewise, the formula $X_{k+1} = C J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$ is a conflict-free update of the columns of matrix C. Together, the two formulas complete one sweep.
The rotation sets for the next sweep are then determined according to the parallel ordering rule, and the real matrix is diagonalized after a finite number of sweeps.
Because the row operations on $X_k$ in the formula $C = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k$ do not conflict, matrix C can be obtained by parallel updates. One thread can be made responsible for computing one element of C, so for a real matrix of order N, N × N threads are needed to compute the elements of C in parallel.
Similarly, in the formula $X_{k+1} = C J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$ the column operations on matrix C do not conflict (they are in effect operations on the matrix $X_k$), so the same approach can be used: N × N threads compute the matrix $X_{k+1}$ in parallel.
That is, in specific embodiments of the invention, the diagonalization step specifically comprises:
A first diagonalization sub-step: using $N^2$ threads to compute the $N^2$ elements of matrix C in parallel;
A second diagonalization sub-step: using $N^2$ threads to compute the $N^2$ elements of matrix $X_{k+1}$ in parallel.
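The two sub-steps can be sketched end to end as follows. This is a hedged illustration, not the patent's implementation: all function names are hypothetical, the schedule comes from a round-robin method rather than Fig. 3, one rotation set at a time plays the role of the product $J_{k,1} \cdots J_{k,N}$, and a standard-library thread pool with one task per matrix element stands in for the $N^2$ threads of a many-core processor. The matrix order is assumed to be even, which always holds for the 2N × 2N matrix X.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def rotation_sets(n):
    """Round-robin index pairs (0-based): n-1 sets of n/2 disjoint pairs, n even."""
    order = list(range(n))
    sets = []
    for _ in range(n - 1):
        sets.append([tuple(sorted((order[k], order[n - 1 - k]))) for k in range(n // 2)])
        order = [order[0], order[-1]] + order[1:-1]
    return sets


def combined_rotation(x, pairs):
    """Product of the disjoint rotations J(i, j, theta) of one rotation set."""
    n = len(x)
    jm = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    for i, j in pairs:
        num, den = x[i][j] + x[j][i], x[j][j] - x[i][i]
        if den == 0.0:
            theta = math.pi / 4 if num != 0.0 else 0.0  # assumed guard for equal diagonals
        else:
            theta = 0.5 * math.atan(num / den)
        jm[i][i], jm[i][j] = math.cos(theta), math.sin(theta)
        jm[j][i], jm[j][j] = -math.sin(theta), math.cos(theta)
    return jm


def parallel_product(left, right, pool):
    """Matrix product with one pool task per output element."""
    n = len(left)
    cells = [(r, c) for r in range(n) for c in range(n)]
    vals = list(pool.map(
        lambda rc: sum(left[rc[0]][k] * right[k][rc[1]] for k in range(n)), cells))
    return [vals[r * n:(r + 1) * n] for r in range(n)]


def diagonalize(x, threshold=1e-10, max_sweeps=50):
    """Sweep until every off-diagonal element is <= threshold; return (diagonalized X, V)."""
    n = len(x)
    v = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(max_sweeps):
            if max(abs(x[r][c]) for r in range(n) for c in range(n) if r != c) <= threshold:
                break
            for pairs in rotation_sets(n):
                jm = combined_rotation(x, pairs)
                jt = [list(col) for col in zip(*jm)]
                c_mat = parallel_product(jt, x, pool)  # first sub-step: C = J^T X_k
                x = parallel_product(c_mat, jm, pool)  # second sub-step: X_{k+1} = C J
                v = parallel_product(v, jm, pool)      # accumulate the rotations
    return x, v
```

The returned `v` accumulates every rotation applied, so the original matrix equals `v` times the diagonalized matrix times the transpose of `v`, up to rounding.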
Finally, after the real matrix has been diagonalized, the diagonal matrix D′ and the right singular matrix V′ of the real matrix X are computed as follows:
$D' = J_N^T J_{N-1}^T \cdots J_1^T C J_1 J_2 \cdots J_N$
$V' = J_1 J_2 \cdots J_N$
where $J_1, J_2, \ldots, J_N$ are the rotation matrices applied to the second-to-last intermediate matrix.
The diagonal matrix D and the matrix V of the SVD of the complex matrix H = A + Bi follow from these; the matrix V is:
$V = [U_1 + iV_1, \ldots, U_i + iV_i, \ldots]$
where:
$v_i = \begin{bmatrix} U_i \\ V_i \end{bmatrix}$
Finally, the eigenvalues are arranged in descending order, and the eigenvectors are reordered accordingly.
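The conversion from real eigenvectors to complex vectors, followed by the final sort, can be sketched as below. The function name `complex_singular_pairs` and its argument layout are assumptions: `eigenvalues` holds the diagonal of D′ (the values $\sigma^2$), and the columns of `v_real` are the stacked vectors $[U_i; V_i]$. Each $\sigma^2$ appears twice among the 2N eigenvalues of X, because $[-v; u]$ is an eigenvector whenever $[u; v]$ is; choosing N representative columns from these pairs is not shown here.

```python
import math


def complex_singular_pairs(eigenvalues, v_real):
    """Return (sigma, complex vector) pairs sorted by descending sigma.

    eigenvalues: the 2N diagonal entries of D' (each one is sigma squared).
    v_real: 2N x 2N list of rows whose columns are the stacked vectors [U_i; V_i].
    """
    n = len(v_real) // 2
    pairs = []
    for col, lam in enumerate(eigenvalues):
        # The top half of the column is the real part, the bottom half the imaginary part.
        vec = [complex(v_real[r][col], v_real[r + n][col]) for r in range(n)]
        # Clamp tiny negative rounding errors before taking the square root.
        pairs.append((math.sqrt(max(lam, 0.0)), vec))
    pairs.sort(key=lambda item: item[0], reverse=True)
    return pairs
```

Sorting the pairs together keeps each vector attached to its singular value, which is the reordering of eigenvectors that accompanies the descending sort of eigenvalues.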
It can be seen that, in embodiments of the invention, multiple threads may also compute in parallel the elements of the diagonal matrix D′ and the right singular matrix V′ of the real matrix, after which the diagonal matrix D and the right singular matrix V of the complex matrix are constructed from D′ and V′, further reducing the time consumed.
An embodiment of the invention further provides a device for performing singular value decomposition on a complex matrix H = A + Bi. The device decomposes the channel complex matrix H of a MIMO system to obtain a right singular matrix V, so that the transmitter of the MIMO system can use the matrix V to precode the signal to be transmitted. As shown in Fig. 4, the device comprises:
A real matrix construction module, configured to use multiple threads to compute in parallel the elements of the real matrix $X = \begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$ corresponding to the complex matrix, and to construct the real matrix;
A diagonalization module, configured to diagonalize the real matrix X by Jacobi iteration to obtain a diagonalized matrix whose off-diagonal elements are all less than or equal to a predetermined threshold, wherein in each iteration the matrix operations of the Jacobi rotation matrices corresponding to different index pairs are performed in parallel by multiple threads;
A second matrix computation module, configured to compute the diagonal matrix D and the right singular matrix V of the complex matrix from the diagonalized matrix and the Jacobi rotation matrices used to obtain it.
The real matrix construction module specifically comprises:
A computing unit, configured to use multiple threads to compute in parallel the elements of the upper triangular part of the matrix $A^T A + B^T B$ and the elements of the matrix $B^T A - A^T B$;
A construction unit, configured to use the upper triangular part of $A^T A + B^T B$ and the matrix $B^T A - A^T B$ to build the real matrix $\begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$.
In the above device, the diagonalization module is specifically configured to perform the iterative computation by the following formula:
$X_{k+1} = C J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where $C = J_{k,N}^T J_{k,N-1}^T \cdots J_{k,2}^T J_{k,1}^T X_k$, $X_k$ is the intermediate matrix output by the previous iteration, $J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation matrices used in the current iteration, and N is the order of the matrix X.
In the above device, the diagonalization module specifically comprises:
A first diagonalization unit, configured to use $N^2$ threads to compute the $N^2$ elements of matrix C in parallel;
A second diagonalization unit, configured to use $N^2$ threads to compute the $N^2$ elements of matrix $X_{k+1}$ in parallel.
In embodiments of the invention, the second matrix computation module is specifically configured to have multiple threads compute in parallel the elements of the diagonal matrix D′ and the right singular matrix V′ of the real matrix, and then to construct the diagonal matrix D and the right singular matrix V of the complex matrix from D′ and V′.
An embodiment of the invention further provides computing equipment comprising a main control unit and a many-core processor with memory. The equipment decomposes the channel complex matrix H of a MIMO system to obtain a right singular matrix V, so that the transmitter of the MIMO system can use the matrix V to precode the signal to be transmitted. The main control unit sends the complex matrix H = A + Bi that is to undergo singular value decomposition to the memory of the many-core processor, and the many-core processor is configured to:
Use multiple threads to compute in parallel the elements of the real matrix $X = \begin{bmatrix} A^T A + B^T B & B^T A - A^T B \\ A^T B - B^T A & A^T A + B^T B \end{bmatrix}$ corresponding to the complex matrix, and construct the real matrix;
Diagonalize the real matrix X by Jacobi iteration to obtain a diagonalized matrix whose off-diagonal elements are all less than or equal to a predetermined threshold, wherein in each iteration the matrix operations of the Jacobi rotation matrices corresponding to different index pairs are performed in parallel by multiple threads;
Compute the diagonal matrix D and the right singular matrix V of the complex matrix from the diagonalized matrix and the Jacobi rotation matrices used to obtain it;
The main control unit copies the diagonal matrix D and the right singular matrix V and then outputs them.
The many-core processor is specifically configured to compute, in parallel using multiple threads, the constituent elements of the upper triangular part of the matrix $A^{T}A + B^{T}B$ and the constituent elements of the matrix $B^{T}A - A^{T}B$, and then to use them to build the real matrix $\begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$.
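The construction can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function name `build_real_matrix`, the helper `col_dot`, and the list-of-rows layout are assumptions, and sequential loops stand in for the parallel threads.

```python
def build_real_matrix(a, b):
    """Build X = [[P, -Q], [Q, P]] for H = A + Bi (a, b: lists of rows).

    P = A^T A + B^T B is symmetric and Q = A^T B - B^T A is skew-symmetric,
    so only entries with j >= i are computed; the rest are mirrored.
    """
    m, n = len(a), len(a[0])

    def col_dot(u, i, v, j):
        # Dot product of column i of u with column j of v.
        return sum(u[k][i] * v[k][j] for k in range(m))

    p = [[0] * n for _ in range(n)]
    q = [[0] * n for _ in range(n)]
    # Every (i, j) below is independent of the others, so on a many-core
    # device each entry could be handed to its own thread.
    for i in range(n):
        for j in range(i, n):
            p[i][j] = p[j][i] = col_dot(a, i, a, j) + col_dot(b, i, b, j)
            q[i][j] = col_dot(a, i, b, j) - col_dot(b, i, a, j)
            q[j][i] = -q[i][j]
    top = [p[i] + [-v for v in q[i]] for i in range(n)]  # [P, B^T A - A^T B]
    bottom = [q[i] + p[i] for i in range(n)]             # [A^T B - B^T A, P]
    return top + bottom
```

For example, H = [[1+2i, 3], [0, 1−i]] corresponds to `a = [[1, 3], [0, 1]]` and `b = [[2, 0], [0, -1]]`, and the result is a symmetric 4×4 matrix.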
The many-core processor is specifically configured to perform the iterative computation according to the following formula:
$X_{k+1} = C\,J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where $C = J_{k,N}^{T} J_{k,N-1}^{T} \cdots J_{k,2}^{T} J_{k,1}^{T} X_{k}$, $X_{k}$ is the intermediate matrix output by the previous iteration, $J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation transformation matrices used in the current iteration, and N is the order of matrix X.
The many-core processor is specifically configured to compute the $N^{2}$ elements of matrix C in parallel using $N^{2}$ threads, and then to compute the $N^{2}$ elements of matrix $X_{k+1}$ in parallel using $N^{2}$ threads.
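The two-stage update (first C, then the next iterate) can be sketched as below. This is an illustrative sketch under stated assumptions: the names `rotation_params` and `sweep_step` are hypothetical, the index pairs passed in are assumed to be disjoint, the rotation angle is the usual small-angle Jacobi choice (which the source does not specify), and the loops run sequentially where the source uses one thread per element.

```python
import math


def rotation_params(x, p, q):
    """Return (cos, sin) of the angle that zeroes x[p][q], with |theta| <= pi/4."""
    diff = x[q][q] - x[p][p]
    y = 2.0 * x[p][q] if diff >= 0 else -2.0 * x[p][q]
    theta = 0.5 * math.atan2(y, abs(diff))
    return math.cos(theta), math.sin(theta)


def sweep_step(x, pairs):
    """Return J^T X J for the product J of rotations on disjoint index pairs."""
    n = len(x)
    cs = {pq: rotation_params(x, *pq) for pq in pairs}
    # Stage 1: C = J^T X. Each rotation mixes only rows p and q, and every
    # element of C depends on x alone, so all N*N elements are independent.
    c_mat = [row[:] for row in x]
    for (p, q), (c, s) in cs.items():
        for j in range(n):
            c_mat[p][j] = c * x[p][j] - s * x[q][j]
            c_mat[q][j] = s * x[p][j] + c * x[q][j]
    # Stage 2: X_next = C J. Each rotation mixes only columns p and q.
    nxt = [row[:] for row in c_mat]
    for (p, q), (c, s) in cs.items():
        for i in range(n):
            nxt[i][p] = c * c_mat[i][p] - s * c_mat[i][q]
            nxt[i][q] = s * c_mat[i][p] + c * c_mat[i][q]
    return nxt
```

Because the pairs are disjoint, the entry at each chosen pair is driven to zero in the result while the matrix stays symmetric and its trace is preserved.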
The main control unit is further configured to release the resources occupied by the singular value decomposition after the decomposition ends.
The main control unit is a general-purpose processor.
The process by which the computing device of the embodiment of the present invention performs the singular value decomposition of a complex matrix is described in detail below.
In a specific embodiment of the invention, the problem of solving the singular values of a complex matrix is divided into several parts, each computed in parallel by an independent thread, so as to reduce the time spent on the solution.
A parallel computing system may be a purpose-built computing device containing multiple processors, a cluster of independent computers interconnected in some way, or a graphics processing unit.
Multi-threaded parallel processing requires a hardware platform on which multiple threads can execute simultaneously. The parallel programming model and hardware structure are introduced below, taking the programmable graphics processing unit (GPU) as an example.
At present, driven by the demand for real-time, high-definition graphics processing, the programmable GPU has evolved into a highly parallel, multi-threaded, many-core processor with large computing capability and high memory bandwidth. In particular, the GPU is well suited to problems that can be expressed as data-parallel computations (the same program executed in parallel on many data elements) with high arithmetic intensity (the ratio of arithmetic operations to memory operations). Because the same program executes on every element, little sophisticated flow control is required; and because the program executes on many elements with high arithmetic intensity, memory access latency can be hidden by computation, so large data caches are not needed. This parallel computing hardware architecture can be used to accelerate the complex matrix SVD of the embodiment of the present invention.
CUDA is a software and hardware architecture that uses the GPU as a data-parallel device, released by NVIDIA in June 2007. The CUDA programming model treats the CPU as the host and the GPU as a coprocessor or device. In this model the CPU and GPU work together, each performing its own role: the CPU handles logic-heavy transactions and serial computation, while the GPU focuses on highly threaded parallel processing. Once the parallel portion of a program has been identified, that portion of the computation can be handed to the GPU. A CUDA parallel function that runs on the GPU is called a kernel (kernel function). A complete CUDA program consists of a series of parallel kernel steps on the device side and serial processing steps on the main control unit. The work done by the CPU serial code includes preparing data and initializing the device before a kernel starts, and performing some serial computation between kernels. A CUDA computation is mapped onto a large number of threads that can execute in parallel, and the hardware schedules and executes these threads dynamically.
In a specific embodiment of the invention, the computing device comprises a general-purpose processor and a many-core processor, which together realize parallel processing of the singular value problem. How the embodiment uses the general-purpose processor and the many-core processor to solve the singular values of the complex matrix is described in detail below.
The many-core processor contains hundreds of computing cores, called scalar stream processors (SP, Stream Processor). Multiple SPs are organized into a streaming multiprocessor (SM, Streaming Multiprocessor). An SM executes one or more thread blocks; threads in the same thread block can exchange data through shared memory, and different thread blocks can exchange data through global memory. When a kernel function is called, the number of threads executed in parallel on the many-core processor can be set by configuring execution parameters, which is equivalent to copying the hardware module that implements the function N times and running the copies concurrently.
As shown in Figure 5, the computing device of the embodiment of the present invention comprises three layers: a master control layer, a many-core processing layer, and a high-speed data channel layer connecting the two. The master control layer contains the main control unit (which may be implemented with a general-purpose processor) and the data input/output interfaces; the many-core processing layer comprises the many-core processor and global memory.
The main control unit provides the data to be processed and sends it to the global memory of the many-core processor over the high-speed data channel (such as PCIe or 10 GbE); the many-core processor then returns the computed result to the host over the same channel.
Figure 6 shows the workflow by which the architecture of Figure 5 performs the SVD.
As shown in Figure 6, when the system starts, the main control unit (general-purpose processor) works first: it allocates memory space on the many-core processor, copies the data from the main control unit to the many-core processor, sets the execution parameters of the many-core processor, and initializes it.
After this work is complete, the many-core processor starts working and performs the SVD of the complex matrix, which comprises the following process:
The decomposition of the complex matrix is converted into the decomposition of a real matrix, so this real matrix must be obtained first. Owing to the symmetry of the real matrix, only the C matrix and D matrix shown in Figure 2 need to be computed; the remaining elements of the real matrix then follow from symmetry.
After the real matrix is obtained, an iterative method is used to carry out in parallel, within one sweep, the diagonalization computations of the different Jacobi rotation transformation matrices on the real matrix.
After each iteration, it is determined whether the current iteration has made all off-diagonal elements of the real matrix small enough to be regarded as zero. If so, the iteration ends and the diagonalization of the real matrix is complete; otherwise, the newly computed real matrix replaces that of the previous iteration and the next iteration begins.
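The loop described above (sweep, test the off-diagonal elements against a threshold, repeat) can be sketched as follows. This is a simplified, sequential illustration rather than the parallel scheme of the embodiment: the names `max_off_diagonal`, `rotate`, and `jacobi_diagonalize`, the default threshold, and the sweep limit are all assumptions, and the pairs are visited one at a time in cyclic order.

```python
import math


def max_off_diagonal(x):
    """Largest absolute off-diagonal entry, used as the convergence test."""
    n = len(x)
    return max((abs(x[i][j]) for i in range(n) for j in range(n) if i != j),
               default=0.0)


def rotate(x, v, p, q):
    """Apply X <- J^T X J and V <- V J in place for the index pair (p, q)."""
    diff = x[q][q] - x[p][p]
    y = 2.0 * x[p][q] if diff >= 0 else -2.0 * x[p][q]
    theta = 0.5 * math.atan2(y, abs(diff))  # small-angle choice, |theta| <= pi/4
    c, s = math.cos(theta), math.sin(theta)
    n = len(x)
    for j in range(n):  # rows p and q of J^T X
        x[p][j], x[q][j] = c * x[p][j] - s * x[q][j], s * x[p][j] + c * x[q][j]
    for m in (x, v):    # columns p and q of (J^T X) J and of V J
        for i in range(n):
            m[i][p], m[i][q] = c * m[i][p] - s * m[i][q], s * m[i][p] + c * m[i][q]


def jacobi_diagonalize(x, threshold=1e-12, max_sweeps=50):
    """Return (diagonal entries, accumulated rotations V) with V^T X V diagonal."""
    n = len(x)
    x = [[float(e) for e in row] for row in x]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        if max_off_diagonal(x) <= threshold:
            break  # all off-diagonal elements are regarded as zero
        for p in range(n - 1):
            for q in range(p + 1, n):
                rotate(x, v, p, q)
    return [x[i][i] for i in range(n)], v
```

The accumulated matrix V plays the role of the product of the Jacobi rotation transformation matrices, so the input equals V · diag(d) · Vᵀ up to rounding.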
After the iteration ends, the Jacobi rotation transformation matrices are applied to the complex matrix to compute its diagonal matrix and right singular matrix.
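The source does not spell out how the eigen-decomposition of the real matrix maps back to the complex matrix. One standard way to see the link, offered here as an assumption rather than as the exact procedure of the embodiment, uses the shorthand $P = A^{T}A + B^{T}B$ and $Q = A^{T}B - B^{T}A$ (names introduced only for this note):

$$
\begin{aligned}
H^{H}H &= (A^{T} - iB^{T})(A + iB) = P + iQ, \qquad
X = \begin{bmatrix} P & -Q \\ Q & P \end{bmatrix}, \\
X \begin{bmatrix} x \\ y \end{bmatrix} = \lambda \begin{bmatrix} x \\ y \end{bmatrix}
&\;\Longrightarrow\; Px - Qy = \lambda x, \quad Qx + Py = \lambda y \\
&\;\Longrightarrow\; (P + iQ)(x + iy) = \lambda\,(x + iy).
\end{aligned}
$$

Under this reading, each eigenvalue $\lambda$ of X is a squared singular value of H, so $\sigma = \sqrt{\lambda}$, and $x + iy$ is a matching right singular vector. Each $\lambda$ appears twice in X, because $\begin{bmatrix} -y \\ x \end{bmatrix}$ is also an eigenvector for the same $\lambda$.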
At this point, the computation of the complex matrix SVD on the many-core processor is complete.
Finally, the result of the complex matrix SVD is copied from the many-core processor to the general-purpose processor, the resources occupied by the data processing are released, and the system finishes its work.
In the above process, in keeping with the hardware architecture of the many-core processor, the complex matrix SVD module can be written as a kernel function. When the execution parameters are set according to the hardware resources of the many-core processor, the number of threads in the kernel function and their organization are determined, so that the SVDs of multiple complex matrices are completed in parallel.
As shown in Figure 7, once the main control unit has initialized the many-core processor and copied the data to it, the parallel complex matrix SVD modules start working: multiple different complex matrices are assigned to the corresponding SVD modules in the many-core processor according to the index of the kernel function. After all the SVD modules have finished, the results are copied together to the main control unit over the data bus, and the corresponding resources are released.
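The dispatch pattern of Figure 7 (one worker per matrix, selected by index, with results gathered together at the end) can be imitated on a CPU with a thread pool. The sketch below only illustrates that pattern: `gram_trace` is a hypothetical stand-in for the per-matrix SVD kernel, `process_batch` is an assumed name, and Python threads do not deliver the hardware parallelism of a many-core processor.

```python
from concurrent.futures import ThreadPoolExecutor


def gram_trace(h):
    """Stand-in for the per-matrix kernel: the sum of squared singular values
    of H, which equals the sum of |h_ij|^2 over all entries."""
    return sum(abs(z) ** 2 for row in h for z in row)


def process_batch(matrices, workers=4):
    """Run the per-matrix task on every matrix and gather results in order."""
    # pool.map preserves input order, so result i belongs to matrix i,
    # mirroring the index-based assignment of matrices to kernel instances.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gram_trace, matrices))
```

Leaving the `with` block shuts the pool down, which loosely corresponds to releasing the resources after all modules have finished.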
The embodiment of the present invention has the following beneficial effects:
In the embodiment of the present invention, the SVDs of multiple different complex matrices can be performed simultaneously in parallel, which increases data processing capacity, reduces the overall data processing waiting delay, and benefits real-time applications.
When the decomposition of the complex matrix is converted into the decomposition of a real matrix, the real matrix $\begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$ is symmetric, so the embodiment only needs to compute the elements of its upper (or lower) triangular part, which reduces the amount of computation.
Because each element of the upper (or lower) triangular part of the real matrix $\begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$ is uniquely determined by the corresponding rows and columns of matrix A and matrix B, these elements can be computed in parallel using multiple threads. This not only reduces the amount of computation but also shortens the data processing time and improves data processing efficiency to some extent.
In the matrix operations within one sweep, and in the computation of the diagonal matrix D and the matrix V, the index pairs corresponding to the different rotation transformation matrices applied (by left or right multiplication) to the same matrix do not conflict, so different rotation matrices change different rows or columns of that matrix. Exploiting this property, multiple threads complete one update of the matrix at once, with each thread responsible for computing one element of the updated matrix. A conventional computation must process the elements sequentially, which increases the data processing waiting time; the embodiment therefore substantially shortens the waiting time, and the effect becomes more pronounced as the matrix size grows.
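One common way to obtain non-conflicting index pairs is a round-robin (circle) schedule, sketched below. The source does not state which ordering it uses, so this is an assumption; the name `round_robin_pairs` is hypothetical, and the order n is taken to be even, as it is for the real matrix X built from a complex matrix.

```python
def round_robin_pairs(n):
    """Yield n - 1 rounds, each holding n / 2 disjoint index pairs (n even).

    Over all rounds every pair (p, q) with p < q appears exactly once, and
    within a round no index is shared, so the rotations of one round touch
    different rows and columns and can be applied in parallel.
    """
    idx = list(range(n))
    for _ in range(n - 1):
        yield [tuple(sorted((idx[i], idx[n - 1 - i]))) for i in range(n // 2)]
        # Keep the first index fixed and rotate the others by one position.
        idx = [idx[0], idx[-1]] + idx[1:-1]
```

For n = 4 the rounds are [(0, 3), (1, 2)], then [(0, 2), (1, 3)], then [(0, 1), (2, 3)].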
Many of the functional components described in this specification are referred to as modules, in order to emphasize more particularly the independence of their implementation.
In the embodiment of the present invention, a module may be implemented in software for execution by various types of processors. For example, an identified module of executable code may comprise one or more physical or logical blocks of computer instructions, which may be organized as an object, procedure, or function. Nevertheless, the executable code of an identified module need not be physically located together; it may comprise different instructions stored in different locations which, when joined logically together, constitute the module and achieve its stated purpose.
Indeed, a module of executable code may be a single instruction or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Similarly, operational data may be identified within modules, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set or distributed over different locations (including over different storage devices), and may exist, at least in part, merely as electronic signals on a system or network.
Where a module can be implemented in software, given the level of existing hardware technology, a person skilled in the art could, cost aside, also build a corresponding hardware circuit to realize the same function. Such a hardware circuit includes conventional very-large-scale integration (VLSI) circuits or gate arrays, and existing semiconductors such as logic chips and transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field-programmable gate arrays, programmable array logic, or programmable logic devices.
In the method embodiments of the present invention, the sequence numbers of the steps do not limit their order. For a person of ordinary skill in the art, changing the order of the steps without creative effort also falls within the protection scope of the present invention.
The above is the preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (15)

1. A method for performing singular value decomposition on a complex matrix H = A + Bi, used to decompose the channel complex matrix H of a multiple-input multiple-output (MIMO) system to obtain a right singular matrix V, so that a transmitter of the MIMO system uses the matrix V to precode a signal to be transmitted, characterized in that the method comprises:
a real matrix construction step: computing, in parallel using multiple threads, the constituent elements of the real matrix $X = \begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$ corresponding to the complex matrix, and building the real matrix;
a diagonalization step: diagonalizing the real matrix X by Jacobi iteration to obtain a diagonalized matrix in which every off-diagonal element is less than or equal to a predetermined threshold, wherein in each iteration the operations on the matrix of the Jacobi rotation transformation matrices corresponding to different index pairs are performed in parallel by multiple threads;
a matrix computation step: computing the diagonal matrix D and the right singular matrix V of the complex matrix from the Jacobi rotation transformation matrices used to obtain the diagonalized matrix.
2. The method according to claim 1, characterized in that the real matrix construction step specifically comprises:
a computation sub-step: computing, in parallel using multiple threads, the constituent elements of the upper triangular part of the matrix $A^{T}A + B^{T}B$ and the constituent elements of the matrix $B^{T}A - A^{T}B$;
a construction sub-step: using the upper triangular part of the matrix $A^{T}A + B^{T}B$ and the matrix $B^{T}A - A^{T}B$ to build the real matrix $X = \begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$.
3. The method according to claim 2, characterized in that, in the diagonalization step, the iterative computation is performed according to the following formula:
$X_{k+1} = C\,J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where:
$C = J_{k,N}^{T} J_{k,N-1}^{T} \cdots J_{k,2}^{T} J_{k,1}^{T} X_{k}$;
$X_{k}$ is the intermediate matrix output by the previous iteration;
$J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation transformation matrices used in the current iteration;
N is the order of matrix X.
4. The method according to claim 3, characterized in that the diagonalization step specifically comprises:
a first diagonalization sub-step: computing the $N^{2}$ elements of matrix C in parallel using $N^{2}$ threads;
a second diagonalization sub-step: computing the $N^{2}$ elements of matrix $X_{k+1}$ in parallel using $N^{2}$ threads.
5. The method according to claim 1, characterized in that, in the matrix computation step, the elements of the diagonal matrix D′ and the right singular matrix V′ of the real matrix are computed in parallel using multiple threads, and the diagonal matrix D and the right singular matrix V of the complex matrix are then built from D′ and V′.
6. A device for performing singular value decomposition on a complex matrix H = A + Bi, used to decompose the channel complex matrix H of a multiple-input multiple-output (MIMO) system to obtain a right singular matrix V, so that a transmitter of the MIMO system uses the matrix V to precode a signal to be transmitted, characterized in that the device comprises:
a real matrix construction module, configured to compute, in parallel using multiple threads, the constituent elements of the real matrix $X = \begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$ corresponding to the complex matrix, and to build the real matrix;
a diagonalization module, configured to diagonalize the real matrix X by Jacobi iteration to obtain a diagonalized matrix in which every off-diagonal element is less than or equal to a predetermined threshold, wherein in each iteration the operations on the matrix of the Jacobi rotation transformation matrices corresponding to different index pairs are performed in parallel by multiple threads;
a second matrix computation module, configured to compute the diagonal matrix D and the right singular matrix V of the complex matrix from the Jacobi rotation transformation matrices used to obtain the diagonalized matrix.
7. The device according to claim 6, characterized in that the real matrix construction module specifically comprises:
a computation unit, configured to compute, in parallel using multiple threads, the constituent elements of the upper triangular part of the matrix $A^{T}A + B^{T}B$ and the constituent elements of the matrix $B^{T}A - A^{T}B$;
a construction unit, configured to use the upper triangular part of the matrix $A^{T}A + B^{T}B$ and the matrix $B^{T}A - A^{T}B$ to build the real matrix $\begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$.
8. The device according to claim 7, characterized in that the diagonalization module is specifically configured to perform the iterative computation according to the following formula:
$X_{k+1} = C\,J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where:
$C = J_{k,N}^{T} J_{k,N-1}^{T} \cdots J_{k,2}^{T} J_{k,1}^{T} X_{k}$;
$X_{k}$ is the intermediate matrix output by the previous iteration;
$J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation transformation matrices used in the current iteration;
N is the order of matrix X.
9. The device according to claim 8, characterized in that the diagonalization module specifically comprises:
a first diagonalization unit, configured to compute the $N^{2}$ elements of matrix C in parallel using $N^{2}$ threads;
a second diagonalization unit, configured to compute the $N^{2}$ elements of matrix $X_{k+1}$ in parallel using $N^{2}$ threads.
10. The device according to claim 6, characterized in that the second matrix computation module is specifically configured to compute, in parallel using multiple threads, the elements of the diagonal matrix D′ and the right singular matrix V′ of the real matrix, and then to build the diagonal matrix D and the right singular matrix V of the complex matrix from D′ and V′.
11. A computing device comprising a main control unit and a many-core processor with a memory, used to decompose the channel complex matrix H of a multiple-input multiple-output (MIMO) system to obtain a right singular matrix V, so that a transmitter of the MIMO system uses the matrix V to precode a signal to be transmitted, characterized in that the main control unit is configured to send the complex matrix H = A + Bi to be decomposed to the memory of the many-core processor, and the many-core processor is configured to:
compute, in parallel using multiple threads, the constituent elements of the real matrix $X = \begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$ corresponding to the complex matrix, and build the real matrix;
diagonalize the real matrix X by Jacobi iteration to obtain a diagonalized matrix in which every off-diagonal element is less than or equal to a predetermined threshold, wherein in each iteration the operations on the matrix of the Jacobi rotation transformation matrices corresponding to different index pairs are performed in parallel by multiple threads;
compute the diagonal matrix D and the right singular matrix V of the complex matrix from the Jacobi rotation transformation matrices used to obtain the diagonalized matrix;
wherein the main control unit copies the diagonal matrix D and the right singular matrix V and then outputs them.
12. The computing device according to claim 11, characterized in that the many-core processor is specifically configured to compute, in parallel using multiple threads, the constituent elements of the upper triangular part of the matrix $A^{T}A + B^{T}B$ and the constituent elements of the matrix $B^{T}A - A^{T}B$, and then to use them to build the real matrix $\begin{bmatrix} A^{T}A + B^{T}B & B^{T}A - A^{T}B \\ A^{T}B - B^{T}A & A^{T}A + B^{T}B \end{bmatrix}$.
13. The computing device according to claim 12, characterized in that the many-core processor is specifically configured to perform the iterative computation according to the following formula:
$X_{k+1} = C\,J_{k,1} J_{k,2} \cdots J_{k,N-1} J_{k,N}$
where:
$C = J_{k,N}^{T} J_{k,N-1}^{T} \cdots J_{k,2}^{T} J_{k,1}^{T} X_{k}$;
$X_{k}$ is the intermediate matrix output by the previous iteration;
$J_{k,1}, J_{k,2}, \ldots, J_{k,N-1}, J_{k,N}$ are the Jacobi rotation transformation matrices used in the current iteration;
N is the order of matrix X.
14. The computing device according to claim 13, characterized in that the many-core processor is specifically configured to compute the $N^{2}$ elements of matrix C in parallel using $N^{2}$ threads, and then to compute the $N^{2}$ elements of matrix $X_{k+1}$ in parallel using $N^{2}$ threads.
15. The computing device according to claim 11, characterized in that the many-core processor is specifically configured to compute, in parallel using multiple threads, the elements of the diagonal matrix D′ and the right singular matrix V′ of the real matrix, and then to build the diagonal matrix D and the right singular matrix V of the complex matrix from D′ and V′.
Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160210