CN104572588A - Matrix inversion processing method and device - Google Patents


Info

Publication number: CN104572588A (granted as CN104572588B)
Application number: CN201410816765.2A
Authority: CN (China)
Prior art keywords: matrix, vector, thread, data, current line
Inventors: 魏一雄, 陈兴玉, 程五四, 陈帝江, 胡祥涛, 张红旗, 苏建军
Original and current assignee: CETC 38 Research Institute
Legal status: Granted; Expired - Fee Related
Other languages: Chinese (zh)

Landscapes

  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a matrix inversion processing method and device. The method comprises the following steps: acquiring an extended matrix built by a central processing unit (CPU); using a Compute Unified Device Architecture (CUDA) platform to construct, according to the extended matrix, a global grid structure comprising a plurality of thread blocks and threads; using the global grid structure to process the column vectors of the extended matrix in parallel, with the data within each column vector computed serially, to obtain a computing result comprising the inverse of the target matrix and an identity matrix of the same size as the inverse; and outputting the computing result to the CPU, which extracts the inverse matrix from it. The method and device solve the problem of low computational efficiency of matrix inversion in the prior art and achieve the effect of improving the efficiency of the inversion process.

Description

Matrix inversion processing method and apparatus
Technical field
The present invention relates to the field of data processing, and in particular to a matrix inversion processing method and apparatus.
Background technology
In today's digitized industrial landscape, the rapid development of computer technology has drawn in techniques from more and more fields, and digital technology delivers an evident industrial driving force. This is especially true in manufacturing: design and simulation analysis occupy a growing proportion of product development, and the fast, convenient computational analysis capability of computers means that modern product development is based almost entirely on digital technology. This very dependence drives continual upgrades of computer hardware and software to meet ever-increasing performance requirements.
Matrix-based computation generally involves the finite element method, the finite difference method, and the boundary element method. Compared with the finite element and finite difference methods, the boundary element method, owing to its high accuracy and dimensionality-reduction advantage, is better suited to rapid preprocessing and adaptive structural analysis. However, the coefficient matrices produced by the boundary element method are dense and asymmetric, which limits both the solution efficiency and the solvable problem scale of the traditional boundary element method. Many researchers have accelerated the boundary element solution process and enlarged the solvable problem scale by introducing fast numerical algorithms such as the fast multipole method. But for time-dependent engineering problems solved with the boundary element method in the time or frequency domain, the complexity of the fundamental solutions and the requirements of time discretization or transformation prevent these numerical algorithms from performing well. In particular, the matrix inversion operation, with its O(N³) computational complexity, consumes a large share of the numerical computation time.
In 1999, Hackbusch proposed a numerical solution algorithm accelerated by a hierarchical tree structure. It partitions the matrix into blocks, approximates admissible sub-blocks by outer products so as to compress the stored matrix data and reduce the volume of data participating in matrix operations, and applies recursion over the hierarchical tree to obtain an approximate inversion technique. However, this method offers only a limited gain in computational efficiency, and because it is an approximate fit, computational accuracy cannot be guaranteed. Moreover, the strong data dependence within matrix inversion makes it difficult to exploit multi-core central processing unit (CPU) parallelism to reduce computation time, so the computational efficiency of matrix inversion remains low.
No effective solution has yet been proposed for the problem of low computational efficiency of matrix inversion in the prior art.
Summary of the invention
A primary object of the present invention is to provide a matrix inversion processing method and apparatus, so as to solve the problem of low computational efficiency of matrix inversion in the prior art.
To achieve this object, according to one aspect of the embodiments of the present invention, a matrix inversion processing method is provided. The method comprises: acquiring an extended matrix built by a central processing unit, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix; constructing a global grid structure from the extended matrix on a Compute Unified Device Architecture platform, the global grid structure comprising a plurality of threads; using the global grid structure to compute the column vectors of the extended matrix in parallel while computing the inter-row data of its row vectors serially, to obtain a computing result comprising the inverse of the target matrix and an identity matrix of the same size as the inverse; and outputting the computing result to the central processing unit, which extracts the inverse matrix from it.
Further, constructing the global grid structure from the extended matrix on the Compute Unified Device Architecture platform comprises: determining a segmentation base according to the size of the extended matrix; dividing the row vectors and column vectors of the extended matrix according to the segmentation base to obtain a plurality of data segments; and building a thread block structure from the segmentation base and the number of data segments to form the global grid structure, wherein the global grid structure comprises one thread block per data segment, and each thread block contains as many threads as the segmentation base.
Further, using the global grid structure to compute the column vectors of the extended matrix in parallel while computing the inter-row data of its row vectors serially, to obtain the computing result, comprises: computing the coefficient vector of the current row vector of the extended matrix, the coefficient vector holding, for every other row vector of the extended matrix, its coefficient with respect to the current row vector; using each thread of the global grid structure to compute a transformation result for its mapped position in the extended matrix and replacing the data at that position with the transformation result, to obtain a replaced extended matrix; judging whether the current row vector is the last row of the extended matrix; if it is not, taking the next row as the current row vector and returning to the step of computing the coefficient vector; and if it is, performing unitization on the replaced extended matrix to obtain the computing result.
Further, computing the coefficient vector of the current row vector of the extended matrix comprises: obtaining the diagonal-position datum of the current row vector; obtaining the data of the column vector containing that diagonal position; and dividing, in turn, the data of that column vector by the diagonal-position datum to obtain the coefficient vector.
Further, before each thread of the global grid structure computes the transformation result for its mapped position in the extended matrix, the method also comprises: establishing a shared video memory space for storing the data of the current row vector, from which the threads of the global grid structure fetch the current row vector's data when computing the transformation results at their mapped positions.
To achieve the above object, according to another aspect of the embodiments of the present invention, a matrix inversion processing apparatus is provided. The apparatus comprises: an acquiring unit for acquiring the extended matrix built by the central processing unit, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix; a first establishing unit for constructing the global grid structure from the extended matrix on the Compute Unified Device Architecture platform, the global grid structure comprising a plurality of threads; a computing unit for using the global grid structure to compute the column vectors of the extended matrix in parallel while computing the inter-row data of its row vectors serially, to obtain the computing result comprising the inverse of the target matrix and an identity matrix of the same size as the inverse; and an output unit for outputting the computing result to the central processing unit, which extracts the inverse matrix from it.
Further, the first establishing unit comprises: a determining module for determining the segmentation base according to the size of the extended matrix; a dividing module for dividing the row vectors and column vectors of the extended matrix according to the segmentation base to obtain a plurality of data segments; and a building module for constructing the thread block structure from the segmentation base and the number of data segments to form the global grid structure, wherein the global grid structure comprises one thread block per data segment and each thread block contains as many threads as the segmentation base.
Further, the computing unit comprises: a first computing module for computing the coefficient vector of the current row vector of the extended matrix, the coefficient vector holding, for every other row vector, its coefficient with respect to the current row vector; a second computing module for using each thread of the global grid structure to compute the transformation result for its mapped position in the extended matrix and replace the data at that position, obtaining the replaced extended matrix; a judging module for judging whether the current row vector is the last row of the extended matrix; the first computing module being further configured, when the current row vector is judged not to be the last row, to take the next row as the current row vector and compute its coefficient vector; and a processing module for performing, when the current row vector is judged to be the last row, unitization on the replaced extended matrix to obtain the computing result.
Further, the first computing module comprises: a first obtaining submodule for obtaining the diagonal-position datum of the current row vector; a second obtaining submodule for obtaining the data of the column vector containing that diagonal position; and a dividing submodule for dividing, in turn, the data of that column vector by the diagonal-position datum to obtain the coefficient vector.
Further, the apparatus also comprises: a second establishing unit for establishing, before the threads of the global grid structure compute their transformation results, a shared video memory space storing the data of the current row vector; the second computing module is further configured so that the threads fetch the current row vector's data from the shared video memory space when computing the transformation results at their mapped positions.
According to the embodiments of the present invention, the extended matrix built by the central processing unit is acquired; a global grid structure is constructed from it on the Compute Unified Device Architecture platform; the column vectors of the extended matrix are computed in parallel with the global grid structure while the inter-row data of its row vectors are computed serially, yielding the computing result; and the computing result is output to the central processing unit, which extracts the inverse matrix from it. This solves the problem of low computational efficiency of matrix inversion in the prior art and achieves the effect of improving the efficiency of the inversion process.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The schematic embodiments and their description explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a matrix inversion processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a global grid structure according to an embodiment of the present invention;
Fig. 3 is a flowchart of a preferred matrix inversion processing method according to an embodiment of the present invention; and
Fig. 4 is a schematic diagram of a matrix inversion processing apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that the embodiments in this application and the features of those embodiments may be combined with one another when no conflict arises. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To give those skilled in the art a better understanding of the present solution, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", etc. in the description, claims, and drawings of the present invention are used to distinguish similar objects and need not describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented. Moreover, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to it.
An embodiment of the present invention provides a matrix inversion processing method. The method may be performed by a graphics processing unit (GPU).
Fig. 1 is a flowchart of a matrix inversion processing method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: acquire the extended matrix built by the central processing unit, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix.
The target matrix is the matrix to be inverted. Before it is processed, the host-side central processing unit (CPU) first extends it: a target matrix [M] of size m*n, for example, is extended to [M|E]. The extended matrix is then transferred to the device-side graphics processing unit (GPU), which carries out the inversion of the target matrix.
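As an illustration of this extension step, a minimal pure-Python sketch (the function name is hypothetical, not from the patent) that augments a square target matrix with an identity matrix of the same size:

```python
def extend(M):
    """Build the extended matrix [M|E] from a square target matrix M."""
    m = len(M)
    # Append one identity-matrix row segment to each row of M.
    return [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
            for i, row in enumerate(M)]
```

On the device side this corresponds to the m-by-2n buffer (g_iMatrix in the text below) that the kernels operate on.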
Step S104: construct a global grid structure from the extended matrix on the Compute Unified Device Architecture platform. The global grid structure comprises a plurality of threads.
The Compute Unified Device Architecture (CUDA) platform is an extension of the C language that allows GPU code to be written in standard C; such code can target both the central processing unit (CPU) and the graphics processing unit (GPU). The host side is responsible for launching the multithreaded tasks (kernel functions) that run on the GPU device side; the device side uses its internal scheduler to distribute the kernels onto the corresponding GPU hardware. A global grid structure composed of many threads is built on the CUDA platform, and this global grid structure is used to process the extended matrix. Specifically, the global grid structure may comprise at least three sub-grids, used respectively for the three phases of coefficient vector computation, matrix row computation, and matrix unitization. The sub-grid for coefficient vector computation divides thread blocks both horizontally and vertically, each thread handling one data element of the target matrix within the extended matrix; the sub-grid for matrix row computation divides thread blocks only horizontally, each thread handling one column vector of the extended matrix; and the sub-grid for matrix unitization divides thread blocks only vertically, each thread handling one row vector of the extended matrix.
Step S106: use the global grid structure to process the column vectors of the extended matrix in parallel, with the data within each column vector computed serially, to obtain the computing result. The computing result comprises the inverse of the target matrix and an identity matrix of the same size as the inverse.
Because the data between the row vectors of the extended matrix, i.e. the inter-row data, are strongly interdependent during the inversion of the target matrix, the data between different row vectors are computed serially: within each thread, the inter-row data are computed in row order. The column vectors of the extended matrix, by contrast, can be computed in parallel, one thread per column, which improves the computational efficiency of processing the inter-row data.
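The serial-over-rows, parallel-over-columns decomposition described above can be sketched in pure Python (a CPU reference under stated assumptions, not the CUDA implementation; names are hypothetical). Like the patent's procedure, it performs no pivoting, so it assumes the diagonal stays nonzero. Comments mark which loops the GPU threads would absorb:

```python
def gauss_jordan_inverse(M):
    """Invert M by Gauss-Jordan elimination on the extended matrix [M|E].

    The outer loop over pivot rows k is inherently serial; the loops
    inside it touch independent elements and are the part that maps
    onto the CUDA thread grid in the scheme described here.
    """
    m = len(M)
    n2 = 2 * m
    A = [row[:] + [float(i == j) for j in range(m)] for i, row in enumerate(M)]
    for k in range(m):                              # serial: inter-row dependence
        base = A[k][k]
        coeff = [A[i][k] / base for i in range(m)]  # parallelizable per row
        pivot = A[k][:]                             # shared copy of row k
        for i in range(m):                          # parallelizable per element
            if i != k:
                for j in range(n2):
                    A[i][j] -= coeff[i] * pivot[j]
    for k in range(m):                              # unitization, parallelizable
        base = A[k][k]
        A[k] = [v / base for v in A[k]]
    return [row[m:] for row in A]                   # right half is the inverse
```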
Step S108: output the computing result to the central processing unit, which extracts the inverse matrix from it.
After the device-side GPU obtains the computing result, the result is output to the CPU, which extracts the inverse of the target matrix.
For a target matrix [M] of size m*n, data space must be allocated according to the size of [M] before its inverse is computed. The host side needs a space of size m*n to store the target matrix. The device side needs a space of size 2*m*n*sizeof(float) (g_iMatrix) to store intermediate computation data, plus a space of size sizeof(float)*m*1 (g_tVector) to store the computation vector. The target matrix data are then copied from the host into video memory, and the GPU inverts the target matrix.
The matrix finally obtained in CUDA video memory is in fact [E|InM], where InM is the inverse of the original matrix and [E] is the identity matrix of size m*n. The data are copied from video memory into host memory, InM is extracted to replace the original matrix M, and both the video memory and the host memory space are released.
According to the embodiments of the present invention, the extended matrix built by the CPU is acquired; a global grid structure is constructed from it on the CUDA platform; the column vectors of the extended matrix are processed in parallel with the global grid structure, with the data within each column vector computed serially, to obtain the computing result; and the computing result is output to the CPU, which extracts the inverse matrix. By exploiting the powerful logic-processing capability of the host-side CPU and the powerful arithmetic capability of the device-side GPU, and by decomposing the row and column processing of the inversion, the parallel computing power of CUDA is used to the full to reduce computation time and raise the efficiency of matrix inversion.
Preferably, constructing the global grid structure from the extended matrix on the CUDA platform comprises: determining a segmentation base according to the size of the extended matrix; dividing the row vectors and column vectors of the extended matrix according to the segmentation base to obtain a plurality of data segments; and building the thread block structure from the segmentation base and the number of data segments to form the global grid structure, wherein the global grid structure comprises one thread block per data segment and each thread block contains as many threads as the segmentation base.
Specifically, the global grid structure may comprise at least three sub-grids, used respectively for coefficient vector computation, matrix row computation, and matrix unitization.
A target matrix [M] of size m*n is extended to the matrix [M|E] of size 2*m*n.
The rows m and columns n are divided according to a segmentation base s (s may be 8, 16, 32, or 64, chosen according to the scale of the computation) into data segments of sizes Segment_m = m/s+1 and Segment_n = n/s+1 respectively. The segments obtained by splitting the rows and columns are used to build the overall Grid, and the Block structure is built according to the segmentation base s, forming the global grid structure. In Fig. 2, T denotes a thread (Thread), B a thread block (Block), x and y the dimensions of a thread block in its two directions, and X and Y the dimensions of the grid in its two directions.
Specifically, in terms of the three dimensions X, Y, Z, the structures used in the computation are: during coefficient computation, the Grid structure is (Segment_m, Segment_n, 1) and the Block structure is (s, s, 1); for matrix row computation, the Grid structure is (1, 2*Segment_n, 1) and the Block structure is (1, s, 1); for matrix unitization, the Grid structure is (Segment_m, 1, 1) and the Block structure is (s, 1, 1).
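These shapes follow mechanically from m, n, and s. A small helper (hypothetical, not part of the patent) reproduces them, using the segmentation rule Segment = dim/s + 1 stated above:

```python
def grid_shapes(m, n, s):
    """Grid/Block (X, Y, Z) shapes for the three kernel phases."""
    seg_m = m // s + 1
    seg_n = n // s + 1
    return {
        "coefficient": {"grid": (seg_m, seg_n, 1), "block": (s, s, 1)},
        "row_compute": {"grid": (1, 2 * seg_n, 1), "block": (1, s, 1)},
        "unitize":     {"grid": (seg_m, 1, 1),     "block": (s, 1, 1)},
    }
```

For m = n = 100 and s = 16 this yields the (7, 7, 1)/(16, 16, 1) shapes used in the worked example at the end of the description.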
Preferably, using the global grid structure to process the column vectors of the extended matrix in parallel, with the data within each column vector computed serially, to obtain the computing result comprises: computing the coefficient vector of the current row vector of the extended matrix, the coefficient vector holding, for every other row vector, its coefficient with respect to the current row vector; using each thread of the global grid structure to compute the transformation result for its mapped position in the extended matrix and replacing the data at that position with the transformation result, to obtain the replaced extended matrix; judging whether the current row vector is the last row of the extended matrix; if it is not, taking the next row as the current row vector and returning to the step of computing the coefficient vector; and if it is, performing unitization on the replaced extended matrix to obtain the computing result.
Specifically, the row vectors of the extended matrix can be traversed serially (row k being the current row vector), while the CUDA threads operate concurrently on the columns of the extended matrix: the coefficients of the other rows with respect to row k are computed to form the coefficient vector, which is stored in the vector space g_tVector.
For the currently traversed row k, the coefficient vector is used in each thread to compute the transformation result at its mapped matrix position, i.e. the matrix row computation, and the former data are replaced.
Preferably, before each thread of the global grid structure computes its transformation result in the extended matrix, the method also comprises: establishing a shared video memory space that stores the data of the current row vector, from which the threads fetch those data when computing the transformation results at their mapped positions.
Storing the current row vector in a shared video memory space lets different threads read the row's data simultaneously during the computation and avoids the repeated fetches that would consume computational resources.
In each thread, the original matrix datum is divided by the current matrix diagonal-position datum (base) and replaces the former datum, i.e. the unitization of the data at the mapped matrix positions. Finally, it is judged whether the last row of the extended matrix has been traversed; if not, the above steps are repeated for the next row vector.
According to the embodiments of the present invention, the row vectors of the matrix are traversed in turn and the data of each row vector are computed in parallel, which improves the computational efficiency of the inversion.
Further, computing the coefficient vector of the current row vector of the extended matrix comprises: obtaining the diagonal-position datum of the current row vector; obtaining the data of the column vector containing that diagonal position; and dividing, in turn, the data of that column vector by the diagonal-position datum to obtain the coefficient vector.
Specifically, during coefficient computation the Grid structure is (Segment_m, Segment_n, 1) and the Block structure is (s, s, 1). Under this Grid and Block structure, each thread processes the corresponding position of the global matrix (g_iMatrix). Inside the thread kernel, a thread's mapped position in the global X, Y, or Z direction (tid_in_grid_x/y/z) is determined by the Block position and width together with the thread's position within the Block, e.g.:
tid_in_grid_y = blockDim.y*blockIdx.y + threadIdx.y;
The row of the matrix to which the current thread maps in the Y direction is looked up; the datum of row tid_in_grid_y in the pivot column is divided by the matrix diagonal-position datum (base) of row k, and the result is placed at position tid_in_grid_y of the coefficient vector (g_tVector).
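The source announces execution pseudocode at this point that does not survive in this text. Under the stated reading (coefficient = column-k datum of each row divided by the diagonal datum base = A[k][k]), a hedged Python emulation of what one launch of this kernel computes might look like this; the list comprehension stands in for the one-thread-per-entry grid:

```python
def coefficient_kernel(A, k):
    """Emulate the coefficient kernel for pivot row k: g_tVector[i] = A[i][k] / base."""
    base = A[k][k]                       # diagonal datum of the pivot row
    return [A[i][k] / base for i in range(len(A))]
```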
When row matrix computing, Grid structure is (1, Segment_m, 1), and Block structure is (1, s, 1).
Utilize thread global position determination thread matrix data position (tid_in_grid) to be dealt with, determine the capable respective column position (T_in_obj) of k needing to carry out subtraction simultaneously.
tid_in_grid=tid_in_grid_y*n+tid_in_grid_x;
T_in_obj=k*n+tid_in_grid_x;
Because the data of row k are needed in every thread's computation, a shared video memory space (sdata) is set up within each thread block to store them, so that different threads can access them simultaneously instead of fetching them repeatedly and wasting computational resources.
The datum of row k at column T_in_obj is multiplied by the corresponding entry of the coefficient vector (g_tVector), the product is subtracted from the matrix datum at position tid_in_grid, and the result replaces the former datum.
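A serial sketch of this row-operation step, one call per logical thread (the function name is an illustrative assumption, and the ordinary variable sdata stands in for the shared-memory copy of row k):

```python
def row_op_kernel(g_iMatrix, g_tVector, n2, k, tid_in_grid_x, tid_in_grid_y):
    """Subtract coefficient * (row-k datum in the same column) from the
    datum at (tid_in_grid_y, tid_in_grid_x), replacing it in place."""
    if tid_in_grid_y == k:
        return                                    # pivot row is not reduced
    tid_in_grid = tid_in_grid_y * n2 + tid_in_grid_x
    T_in_obj = k * n2 + tid_in_grid_x             # row-k datum, same column
    sdata = g_iMatrix[T_in_obj]                   # stand-in for shared memory
    g_iMatrix[tid_in_grid] -= g_tVector[tid_in_grid_y] * sdata

A = [2.0, 1.0, 1.0, 0.0,    # extended 2x4 matrix [M|E], row-major
     4.0, 3.0, 0.0, 1.0]
coeff = [1.0, 2.0]          # coefficient vector for pivot row k = 0
for x in range(4):
    for y in range(2):
        row_op_kernel(A, coeff, 4, 0, x, y)
print(A[4:])  # [0.0, 1.0, -2.0, 1.0]: row 1 after elimination
```

The nested loops emulate the thread grid; on the device every (x, y) pair runs concurrently.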
During the matrix unitization computation, the Grid structure is (Segment_m, 1, 1) and the Block structure is (s, 1, 1). In each thread, the matrix position tid_in_matrix at row k, column tid_in_grid_x is indexed; the datum there is divided by the diagonal datum base of row k of the current matrix, and the result is stored back at position tid_in_matrix.
Taking a target matrix [M] of size 100*100 as an example, the embodiment of the present invention is described below with reference to Fig. 3.
Step S302: allocate data space according to the size of the target matrix [M]. On the host side, a 100*100 space must be allocated to store the target matrix; on the device side, a 156.25 KB space (g_iMatrix) must be allocated to store intermediate computation data, and a 1 KB space (g_tVector) is opened up at the same time to store the computation vector. The matrix data to be inverted are then copied from the host side into the video memory space.
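The 156.25 KB figure implies 8-byte (double-precision) elements; this is an inference from the arithmetic below, since a later passage on the general m*n case writes sizeof(float):

```python
rows = cols = 100
elem = 8                                  # bytes per element (inferred)
g_iMatrix_bytes = 2 * rows * cols * elem  # extended matrix [M|E]
g_tVector_bytes = rows * elem             # coefficient vector
print(g_iMatrix_bytes / 1024)             # 156.25 (KB)
print(g_tVector_bytes)                    # 800 bytes, held in a 1 KB block
```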
Step S304: expand the target matrix [M] into the matrix [M|E], whose size is 2*100*100.
Step S306: divide the m rows and n columns into data segments of equal size. Specifically, the rows m and columns n are divided according to the segmentation radix s into Segment_m = 7 and Segment_n = 7 segments (s may be 8, 16, 32, or 64, chosen according to the scale of the data to be computed; here s = 16).
Step S308: build the global grid structure from the data segments obtained by splitting the rows and columns. Specifically, the Block structure is built according to the segmentation radix s. In terms of the three dimensions X, Y, Z: the Grid structure for coefficient computation is (7, 7, 1) with Block structure (16, 16, 1); the Grid structure for the matrix-row operation is (1, 14, 1) with Block structure (1, 16, 1); and the Grid structure for matrix unitization is (7, 1, 1) with Block structure (16, 1, 1).
After the global grid structure is built, the following steps S310 to S316 are executed in parallel in each thread, i.e. the column vectors of the extended matrix are computed in parallel.
Step S310: traverse to the current row vector k of the matrix. Specifically, the matrix row vectors are looped over from row 1 to row 100, k = 1, 2, ..., 100.
Step S312: coefficient computation. Under the Grid and Block structure, each thread processes the datum at the corresponding position of the global matrix (g_iMatrix). In the thread kernel function, a thread's global position in the X, Y, or Z direction (tid_in_grid_x/y/z) is determined by the position and width of its Block together with the thread's position within the Block, e.g.:
tid_in_grid_y = blockDim.y * blockIdx.y + threadIdx.y;
Each thread looks up the row of the matrix onto which its Y-direction position maps, divides that row's datum in the pivot column by the diagonal (pivot) datum base of row k, and stores the result at position tid_in_grid_y of the coefficient vector (g_tVector).
Step S314: matrix-row operation. In CUDA, each thread's datum is operated on together with the data of row k of the matrix currently being traversed. Each thread first uses its global position to determine the position of the matrix datum to be processed (tid_in_grid), and at the same time determines the corresponding column position in row k that takes part in the subtraction (T_in_obj):
tid_in_grid=tid_in_grid_y*n+tid_in_grid_x;
T_in_obj=k*n+tid_in_grid_x;
Because the data of row k are needed in every thread's computation, a shared video memory space (sdata) is set up within each thread block to store them, so that different threads can access them simultaneously instead of fetching them repeatedly and wasting computational resources.
The datum of row k at column T_in_obj is multiplied by the corresponding entry of the coefficient vector (g_tVector), the product is subtracted from the matrix datum at position tid_in_grid, and the result replaces the former datum.
Step S316: matrix unitization. In each thread, the matrix position tid_in_matrix at row k, column tid_in_grid_x is indexed; the datum there is divided by the diagonal datum base of row k of the current matrix, and the computation result is stored back at position tid_in_matrix.
Step S318: judge whether k equals m. If not, increment k by 1 and return to step S310; otherwise, execute step S320.
Step S320: the matrix actually obtained in CUDA video memory is [E|InM], where InM is the inverse of the original matrix and [E] is the identity matrix of size 100*100. The data are copied from the video memory space into main memory, InM is extracted to replace the original matrix M, and both the video memory and the main-memory space are released.
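Steps S304 to S320 amount to Gauss-Jordan elimination on the augmented matrix. A plain serial Python sketch of that arithmetic (no CUDA and no pivoting; it assumes nonzero diagonal pivots, as the text does):

```python
def invert_gauss_jordan(M):
    m = len(M)
    # S304: expand [M] to [M|E]
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
         for i, row in enumerate(M)]
    for k in range(m):                               # S310: traverse rows
        coeff = [A[i][k] / A[k][k] for i in range(m)]        # S312
        for i in range(m):                           # S314: row operations
            if i != k:
                for j in range(2 * m):
                    A[i][j] -= coeff[i] * A[k][j]
        base = A[k][k]                               # S316: unitization
        for j in range(2 * m):
            A[k][j] /= base
    return [row[m:] for row in A]                    # S320: extract InM

InM = invert_gauss_jordan([[4.0, 7.0], [2.0, 6.0]])
print(InM)  # approximately [[0.6, -0.7], [-0.2, 0.4]]
```

In the patented scheme the inner loops over i and j are what the CUDA grid parallelizes, while the outer loop over k remains serial.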
An embodiment of the present invention further provides a matrix inversion processing device. It should be noted that the matrix inversion processing device of this embodiment may be used to execute the matrix inversion processing method provided by the embodiment of the present invention, and that method may likewise be executed by this device.
Fig. 4 is a schematic diagram of the matrix inversion processing device according to the embodiment of the present invention. As shown in Fig. 4, the device comprises: an acquiring unit 10, a first establishing unit 20, a computing unit 30, and an output unit 40.
The acquiring unit 10 is configured to obtain the extended matrix produced by the central processing unit's expansion; the extended matrix comprises the target matrix and an identity matrix of the same size as the target matrix.
The target matrix is the matrix to be inverted. Before the computation, the host-side central processing unit (CPU) first expands the target matrix into the extended matrix; for example, a target matrix [M] of size m*n is expanded into the extended matrix [M|E]. The extended matrix is then output to the device-side graphics processing unit (GPU), which performs the inversion of the target matrix.
The first establishing unit 20 is configured to build a global grid structure from the extended matrix on the Compute Unified Device Architecture platform, where the global grid structure comprises a plurality of threads.
CUDA (Compute Unified Device Architecture) is an extension of the C language that allows GPU code to be written in standard C; such code runs both on the central processing unit (CPU) and on the graphics processing unit (GPU). The host side is responsible for launching the multithreaded tasks (kernel functions) that run on the GPU device side; the device side uses a built-in scheduler to distribute the kernel program over the corresponding GPU hardware. The CUDA platform is used to build a global grid structure composed of a plurality of threads, and this structure is used to process the extended matrix. Specifically, the global grid structure may comprise at least three sub-grid structures, used respectively for the three processes of coefficient-vector computation, matrix-row operation, and matrix unitization: the sub-grid for coefficient-vector computation divides thread blocks both horizontally and vertically, each thread handling one data element of the target matrix within the extended matrix; the sub-grid for the matrix-row operation divides thread blocks only horizontally, each thread handling one column vector of the extended matrix; and the sub-grid for matrix unitization divides thread blocks only vertically, each thread handling one row vector of the extended matrix.
The computing unit 30 is configured to use the global grid structure to process the column vectors of the extended matrix in parallel, the data within each column vector being computed serially, to obtain the computation result. The computation result comprises the inverse of the target matrix and an identity matrix of the same size as the inverse.
In the process of inverting the target matrix, the data of the different row vectors of the extended matrix are strongly interdependent; the data are therefore computed serially from one row vector to the next, i.e. each thread computes the row-wise data serially in row order. For the individual column vectors of the extended matrix, however, parallel computation can be adopted, i.e. the threads compute in parallel, which improves the computing efficiency of the row-wise data.
The output unit 40 is configured to output the computation result to the central processing unit, which extracts the inverse matrix from the result.
After the device-side GPU obtains the computation result, the result is output to the CPU, which extracts the inverse of the target matrix.
For a target matrix [M] of size m*n, data space must be allocated according to the size of [M] before its inverse is computed. On the host side, an m*n space is allocated to store the target matrix; on the device side, a space of 2*m*n*sizeof(float) bytes (g_iMatrix) is allocated to store intermediate computation data, and a space of sizeof(float)*m*1 bytes (g_tVector) is opened up at the same time to store the computation vector. The target matrix data are then copied from the host side into video memory, and the GPU inverts the target matrix.
The matrix actually obtained in CUDA video memory is [E|InM], where InM is the inverse of the original matrix and [E] is the identity matrix of size m*n. The data are copied from video memory into main memory, InM is extracted to replace the original matrix M, and both the video memory and the main-memory space are released.
According to the embodiment of the present invention, the extended matrix produced by the central processing unit's expansion is obtained; a global grid structure is built from the extended matrix on the Compute Unified Device Architecture platform; the column vectors of the extended matrix are processed in parallel using the global grid structure, the data within each column vector being computed serially, to obtain the computation result; and the result is output to the central processing unit, which extracts the inverse matrix from it. By exploiting the powerful logic-processing capability of the host-side CPU and the powerful arithmetic capability of the device-side GPU, and by decomposing the row and column data processing of matrix inversion, the parallel computing power of CUDA is used to the fullest to reduce computation time and raise the efficiency of matrix inversion.
Preferably, the first establishing unit comprises: a determining module, configured to determine the segmentation radix according to the size of the extended matrix; a dividing module, configured to divide the row vectors and column vectors of the extended matrix according to the segmentation radix, obtaining a plurality of data segments; and a building module, configured to build the thread-block structure according to the segmentation radix and the number of data segments, forming the global grid structure, where the global grid structure comprises thread blocks in one-to-one correspondence with the data segments and each thread block has a number of threads equal to the segmentation radix.
Specifically, the global grid structure may comprise at least three sub-grid structures, used respectively for the three processes of coefficient-vector computation, matrix-row operation, and matrix unitization.
A target matrix [M] of size m*n is expanded into a matrix [M|E] of size 2*m*n.
The rows m and columns n are divided according to the segmentation radix s into data segments of sizes Segment_m = m/s+1 and Segment_n = n/s+1 (s may be taken as 8, 16, 32, or 64, chosen according to the scale of the data to be computed). The segments obtained by splitting the rows and columns are used to build the overall Grid; the Block structure is then built according to the segmentation radix s, forming the global grid structure. As shown in Fig. 2, T denotes a thread (Thread), B denotes a thread block (Block), x and y denote the thread-block dimensions in the two directions, and X and Y denote the grid dimensions in the two directions.
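The segment and grid arithmetic above can be checked directly, reading "/" as integer division (an assumption consistent with the 100*100 example, where s = 16 gives Segment_m = 100/16 + 1 = 7):

```python
def segments(dim, s):
    return dim // s + 1        # Segment formula from the text

m = n = 100
s = 16
coeff_grid = (segments(m, s), segments(n, s), 1)      # (7, 7, 1)
rowop_grid = (1, 2 * segments(n, s), 1)               # (1, 14, 1)
unit_grid  = (segments(m, s), 1, 1)                   # (7, 1, 1)
print(coeff_grid, rowop_grid, unit_grid)
```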
Specifically, in terms of the three dimensions X, Y, Z, the structures used in the computation are: for coefficient computation, Grid (Segment_m, Segment_n, 1) with Block (s, s, 1); for the matrix-row operation, Grid (1, 2*Segment_n, 1) with Block (1, s, 1); and for matrix unitization, Grid (Segment_m, 1, 1) with Block (s, 1, 1).
Preferably, the computing unit comprises: a first computing module, configured to compute the coefficient vector of the current row vector in the extended matrix, the coefficient vector comprising the coefficients of the other row vectors of the extended matrix with respect to the current row vector; a second computing module, configured to use each thread of the global grid structure to compute that thread's transformation result at its mapped position in the extended matrix and to replace the data at that position with the result, obtaining the replaced extended matrix; a judging module, configured to judge whether the current row vector is the last row vector of the extended matrix; the first computing module being further configured, if the current row vector is judged not to be the last row vector, to take the next row vector as the current row vector and compute its coefficient vector; and a processing module, configured, if the current row vector is judged to be the last row vector, to perform unitization on the replaced extended matrix to obtain the computation result.
Specifically, the row vectors of the extended matrix can be traversed serially (row k being the current row vector of the extended matrix); the columns of the extended matrix are operated on concurrently by the CUDA threads to compute the coefficients of the other rows with respect to row k, and the resulting coefficient vector is stored in the vector space g_tVector.
For the row k currently being traversed, the coefficient vector is used in each thread to compute the transformation result at the thread's mapped position in the matrix, i.e. the matrix-row operation, and the former data are replaced.
Preferably, the matrix inversion processing device further comprises: a second establishing unit, configured to set up, before the threads of the global grid structure compute their transformation results at their mapped positions in the extended matrix, a shared video memory space for storing the data of the current row vector; the second computing module being further configured to have the threads fetch the data of the current row vector from the shared video memory space when computing their transformation results.
By storing the data of the current row vector in a shared video memory space, different threads can access them simultaneously during the computation, avoiding the waste of computational resources caused by repeated fetching.
In each thread, the original matrix datum is divided by the diagonal datum (base) of the current matrix and the former datum is replaced, i.e. the data at the thread's mapped position are unitized. Finally, it is judged whether the last row of the extended matrix has been reached; if not, the above steps are executed again for the next row vector.
According to the embodiment of the present invention, the matrix row vectors are traversed in sequence and the data of each row vector are computed in parallel, thereby improving the computing efficiency of matrix inversion.
Further, the first computing module comprises: a first obtaining submodule, configured to obtain the diagonal-position datum of the current row vector; a second obtaining submodule, configured to obtain the data in the column vector containing the diagonal-position datum; and a division submodule, configured to divide, in turn, each datum in that column vector by the diagonal-position datum, obtaining the coefficient vector.
Specifically, during coefficient computation the Grid structure is (Segment_m, Segment_n, 1) and the Block structure is (s, s, 1). Under this Grid and Block structure, each thread processes the datum at the corresponding position of the global matrix (g_iMatrix). In the thread kernel function, a thread's global position in the X, Y, or Z direction (tid_in_grid_x/y/z) is determined by the position and width of its Block together with the thread's position within the Block, e.g.:
tid_in_grid_y = blockDim.y * blockIdx.y + threadIdx.y;
Each thread looks up the row of the matrix onto which its Y-direction position maps, divides that row's datum in the pivot column by the diagonal (pivot) datum base of row k, and stores the result at position tid_in_grid_y of the coefficient vector (g_tVector).
When the matrix-row operation is performed, the Grid structure is (1, Segment_m, 1) and the Block structure is (1, s, 1).
Each thread uses its global position to determine the position of the matrix datum to be processed (tid_in_grid), and at the same time determines the corresponding column position in row k that takes part in the subtraction (T_in_obj):
tid_in_grid=tid_in_grid_y*n+tid_in_grid_x;
T_in_obj=k*n+tid_in_grid_x;
Because the data of row k are needed in every thread's computation, a shared video memory space (sdata) is set up within each thread block to store them, so that different threads can access them simultaneously instead of fetching them repeatedly and wasting computational resources.
The datum of row k at column T_in_obj is multiplied by the corresponding entry of the coefficient vector (g_tVector), the product is subtracted from the matrix datum at position tid_in_grid, and the result replaces the former datum.
During the matrix unitization computation, the Grid structure is (Segment_m, 1, 1) and the Block structure is (s, 1, 1). In each thread, the matrix position tid_in_matrix at row k, column tid_in_grid_x is indexed; the datum there is divided by the diagonal datum base of row k of the current matrix, and the result is stored back at position tid_in_matrix.
The effect of the present invention can be further illustrated by the following simulation and measured-data experiments.
Simulation conditions
Algorithm operating platform:
CPU:Intel(R)Xeon(R)CPU E5-1620v2(3.70GHz);
GPU:NVIDIA Quadro K4000;
Memory: 16GB;
Compiler: Visual Studio 2010;
Simulation content
Serial and CUDA-parallel inversion operations were carried out on the dense unsymmetric matrices arising in boundary-element-method numerical computation; the time spent on each computation was recorded (in seconds) and the precision of the results was verified.
Measured-data experiments
Two dense unsymmetric matrices were solved, of sizes 3084*3084 and 7605*7605 respectively. For the first problem size, the parallel CUDA inversion took 5.383 s, the serial CPU inversion took 32.325 s, and the speed-up ratio was 6.01.
For the second problem size, the parallel CUDA inversion took 14.447 s, the serial CPU inversion took 198.144 s, and the speed-up ratio was 13.72.
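The reported speed-up ratios are simply the quotients of the serial and parallel times:

```python
runs = {"3084*3084": (32.325, 5.383), "7605*7605": (198.144, 14.447)}
for size, (t_cpu, t_gpu) in runs.items():
    # roughly 6.01 and 13.72, matching the reported ratios
    print(size, round(t_cpu / t_gpu, 2))
```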
It can be seen that, relative to serial CPU computation, the CUDA parallel inversion algorithm improves computing efficiency markedly, and the gain grows as the scale of the computed data increases.
It should be noted that, for brevity of description, each of the foregoing method embodiments is expressed as a series of combined actions; those skilled in the art should understand, however, that the present invention is not limited by the order of actions described, since according to the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in a given embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device may be realized in other ways. For example, the device embodiments described above are merely schematic: the division into units is only a division by logical function, and in actual realization there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections of devices or units through certain interfaces, and may be electrical or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be realized either in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the part of the technical scheme of the present invention that in essence contributes to the prior art, or all or part of the technical scheme, may be embodied in the form of a software product; the computer software product is stored in a storage medium and comprises instructions for causing a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforesaid storage medium includes various media capable of storing program code: a USB flash disk, read-only memory (ROM), random access memory (RAM), a portable hard drive, a magnetic disk, an optical disc, and the like.
The foregoing are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A matrix inversion processing method, characterized by comprising:
obtaining an extended matrix produced by a central processing unit's expansion, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix;
building a global grid structure from the extended matrix on a Compute Unified Device Architecture platform, wherein the global grid structure comprises a plurality of threads;
performing parallel processing on the column vectors of the extended matrix by means of the global grid structure, wherein the data within each column vector of the extended matrix are computed in a serial manner, obtaining a computation result, the computation result comprising an inverse of the target matrix and an identity matrix of the same size as the inverse; and
outputting the computation result to the central processing unit, wherein the central processing unit extracts the inverse matrix from the computation result.
2. The matrix inversion processing method according to claim 1, characterized in that building the global grid structure from the extended matrix on the Compute Unified Device Architecture platform comprises:
determining a segmentation radix according to the size of the extended matrix;
dividing the row vectors and column vectors of the extended matrix according to the segmentation radix, obtaining a plurality of data segments; and
building a thread-block structure according to the segmentation radix and the number of data segments, forming the global grid structure, wherein the global grid structure comprises thread blocks in one-to-one correspondence with the plurality of data segments, and each thread block has a number of threads equal to the segmentation radix.
3. The matrix inversion processing method according to claim 1, characterized in that performing parallel processing on the column vectors of the extended matrix by means of the global grid structure, wherein the data within each column vector of the extended matrix are computed in a serial manner, obtaining a computation result, comprises:
computing the coefficient vector of a current row vector in the extended matrix, the coefficient vector comprising the coefficients of the other row vectors of the extended matrix with respect to the current row vector;
using the threads of the global grid structure to compute each thread's transformation result at its mapped position in the extended matrix, and replacing the data at that mapped position with the transformation result, obtaining a replaced extended matrix;
judging whether the current row vector is the last row vector of the extended matrix;
if it is judged that the current row vector is not the last row vector of the extended matrix, taking the row vector following the current row vector as the current row vector, and returning to the step of computing the coefficient vector of the current row vector; and
if it is judged that the current row vector is the last row vector of the extended matrix, performing unitization on the replaced extended matrix to obtain the computation result.
4. The matrix inversion processing method according to claim 3, characterized in that computing the coefficient vector of the current row vector in the extended matrix comprises:
obtaining the diagonal-position datum of the current row vector;
obtaining the data in the column vector containing the diagonal-position datum; and
dividing, in turn, each datum in that column vector by the diagonal-position datum, obtaining the coefficient vector.
5. The matrix inversion processing method according to claim 3, characterized in that, before the threads of the global grid structure compute their transformation results at their mapped positions in the extended matrix, the matrix inversion processing method further comprises:
setting up a shared video memory space for storing the data of the current row vector;
wherein the threads of the global grid structure fetch the data of the current row vector from the shared video memory space when computing their transformation results at their mapped positions in the extended matrix.
6. A matrix inversion processing device, characterized by comprising:
an acquiring unit, configured to obtain an extended matrix produced by a central processing unit's expansion, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix;
a first establishing unit, configured to build a global grid structure from the extended matrix on a Compute Unified Device Architecture platform, wherein the global grid structure comprises a plurality of threads;
a computing unit, configured to perform parallel processing on the column vectors of the extended matrix by means of the global grid structure, wherein the data within each column vector of the extended matrix are computed in a serial manner, obtaining a computation result, the computation result comprising an inverse of the target matrix and an identity matrix of the same size as the inverse; and
an output unit, configured to output the computation result to the central processing unit, wherein the central processing unit extracts the inverse matrix from the computation result.
7. The matrix inversion processing device according to claim 6, characterized in that the first establishing unit comprises:
a determining module, configured to determine a segmentation radix according to the size of the extended matrix;
a dividing module, configured to divide the row vectors and column vectors of the extended matrix according to the segmentation radix, obtaining a plurality of data segments; and
a building module, configured to build a thread-block structure according to the segmentation radix and the number of data segments, forming the global grid structure, wherein the global grid structure comprises thread blocks in one-to-one correspondence with the plurality of data segments, and each thread block has a number of threads equal to the segmentation radix.
8. The matrix inversion processing device according to claim 6, characterized in that the computing unit comprises:
a first computing module, configured to compute the coefficient vector of a current row vector in the extended matrix, the coefficient vector comprising the coefficients of the other row vectors of the extended matrix with respect to the current row vector;
a second computing module, configured to use the threads of the global grid structure to compute each thread's transformation result at its mapped position in the extended matrix, and to replace the data at that mapped position with the transformation result, obtaining a replaced extended matrix;
a judging module, configured to judge whether the current row vector is the last row vector of the extended matrix;
the first computing module being further configured, if it is judged that the current row vector is not the last row vector of the extended matrix, to take the row vector following the current row vector as the current row vector and compute the coefficient vector of the current row vector; and
a processing module, configured, if it is judged that the current row vector is the last row vector of the extended matrix, to perform unitization on the replaced extended matrix to obtain the computation result.
9. The matrix inversion processing device according to claim 8, characterized in that the first computing module comprises:
a first obtaining submodule, configured to obtain the diagonal-position datum of the current row vector;
a second obtaining submodule, configured to obtain the data in the column vector containing the diagonal-position datum; and
a division submodule, configured to divide, in turn, each datum in that column vector by the diagonal-position datum, obtaining the coefficient vector.
10. The matrix inversion processing device according to claim 8, wherein the matrix inversion processing device further comprises:
a second establishing unit, configured to establish a shared video memory space before the threads of the global grid structure compute their transformation results at their mapping positions in the extended matrix, wherein the shared video memory space is used to store the data of the current row vector; and
the second computing module being further configured such that, when computing the transformation result at a thread's mapping position in the extended matrix, the threads of the global grid structure retrieve the data of the current row vector from the shared video memory space.
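Claim 10's point is that the current row is staged once in fast shared (video) memory so every thread in a block reads it from the cache rather than from global memory. A serial sketch of a single elimination step can mimic that staging with an explicit buffer copy; the function name and loop structure are illustrative assumptions, not the patented kernel:

```python
import numpy as np

def eliminate_step_with_cached_row(ext, k):
    """One elimination step on the extended matrix, with the current row
    staged in a separate buffer first -- the serial analogue of claim 10's
    shared video memory: every per-element update reads the current row
    from the cache instead of from the (mutable) matrix itself."""
    shared_row = ext[k, :].copy()            # staged once per step (claim 10)
    pivot = shared_row[k]                    # the diagonal-position datum
    rows, cols = ext.shape
    for i in range(rows):                    # each (i, j) pair maps to one thread
        if i == k:
            continue                         # the current row is left untouched
        coeff = ext[i, k] / pivot            # claim 9's coefficient
        for j in range(cols):
            ext[i, j] -= coeff * shared_row[j]   # read only from the cached row
    return ext
```

Running one step per row and then normalizing the diagonal reproduces the full inversion; on a GPU the copy into shared memory is what turns n reads of the current row per block into one.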
CN201410816765.2A 2014-12-23 2014-12-23 Matrix inversion process method and apparatus Expired - Fee Related CN104572588B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410816765.2A CN104572588B (en) 2014-12-23 2014-12-23 Matrix inversion process method and apparatus


Publications (2)

Publication Number Publication Date
CN104572588A true CN104572588A (en) 2015-04-29
CN104572588B CN104572588B (en) 2018-10-23

Family

ID=53088693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410816765.2A Expired - Fee Related CN104572588B (en) 2014-12-23 2014-12-23 Matrix inversion process method and apparatus

Country Status (1)

Country Link
CN (1) CN104572588B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011038940A1 (en) * 2009-10-01 2011-04-07 Intracom S.A. Telecom Solutions Matrix inversion using qr decomposition on a parallel pipelined systolic array
CN103631761A (en) * 2012-08-29 2014-03-12 睿励科学仪器(上海)有限公司 Method for matrix operation and rigorous wave coupling analysis through parallel processing architecture


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Peter Benner et al.: "High Performance Matrix Inversion of SPD Matrices on Graphics Processors", High Performance Computing and Simulation, 2011 Int. Conf. *
Shane Ryoo et al.: "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA", ACM PPoPP 2008 *
Liu Li: "Application of GPU Parallel Technology in Matrix Operations and Canonical Mode Analysis", China Master's Theses Full-text Database, Information Science and Technology *
Gao Yueqing et al.: "Research on the CUDA-based CS Algorithm for SAR Imaging", Computer & Network *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021188A (en) * 2016-05-11 2016-10-12 广州广电运通金融电子股份有限公司 Parallel hardware architecture and parallel computing method for floating point matrix inversion
WO2017193922A1 (en) * 2016-05-11 2017-11-16 广州广电运通金融电子股份有限公司 Parallel hardware architecture and parallel computing method for floating point matrix inversion
WO2022022362A1 (en) * 2020-07-31 2022-02-03 中兴通讯股份有限公司 Data processing method and device, and storage medium
CN112837205A (en) * 2021-03-05 2021-05-25 中国科学院计算机网络信息中心 Delay correction-based batch matrix inversion method on graphics processor
CN112837205B (en) * 2021-03-05 2022-07-26 中国科学院计算机网络信息中心 Delay correction-based batch matrix inversion method on graphics processor
CN114417249A (en) * 2022-01-24 2022-04-29 合肥工业大学 Multi-order matrix fast inversion hardware structure implementation method
CN114417249B (en) * 2022-01-24 2024-03-26 合肥工业大学 Method for realizing multi-order matrix rapid inversion hardware structure

Also Published As

Publication number Publication date
CN104572588B (en) 2018-10-23

Similar Documents

Publication Publication Date Title
Guo et al. A survey of FPGA-based neural network accelerator
Zachariadis et al. Accelerating sparse matrix–matrix multiplication with GPU Tensor Cores
US9886418B2 (en) Matrix operands for linear algebra operations
CN108170639B (en) Tensor CP decomposition implementation method based on distributed environment
CN109726441B (en) Body and surface mixed GPU parallel computing electromagnetism DGTD method
CN109086244A (en) Matrix convolution vectorization implementation method based on vector processor
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN112668708B (en) Convolution operation device for improving data utilization rate
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
CN104182571B (en) Kriging interpolation methods based on Delaunay and GPU
CN104572588A (en) Matrix inversion processing method and device
Rybacki et al. Experiments with single core, multi-core, and GPU based computation of cellular automata
CN110135569A (en) Heterogeneous platform neuron positioning three-level flow parallel method, system and medium
CN106202224B (en) Search processing method and device
CN110782009B (en) Computing kernel optimization method based on ARMv8 system
Shi et al. Efficient sparse-dense matrix-matrix multiplication on GPUs using the customized sparse storage format
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
Wu et al. Optimizing dynamic programming on graphics processing units via adaptive thread-level parallelism
CN106933777B (en) The high-performance implementation method of the one-dimensional FFT of base 2 based on domestic 26010 processor of Shen prestige
CN110245706B (en) Lightweight target detection method for embedded application
Shi et al. Geocomputation over the emerging heterogeneous computing infrastructure
Lai et al. Accelerating geospatial applications on hybrid architectures
Husselmann et al. Spatial data structures, sorting and gpu parallelism for situated-agent simulation and visualisation
CN103559312B (en) GPU (graphics processing unit) based melody matching parallelization method
Shantharam et al. Exploiting dense substructures for fast sparse matrix vector multiplication

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181023

Termination date: 20191223