CN104572588A - Matrix inversion processing method and device - Google Patents


Info

Publication number: CN104572588A (granted as CN104572588B)
Application number: CN201410816765.2A
Authority: CN (China)
Prior art keywords: matrix, vector, thread, data, current line
Inventors: 魏一雄, 陈兴玉, 程五四, 陈帝江, 胡祥涛, 张红旗, 苏建军
Original and current assignee: CETC 38 Research Institute
Legal status: Granted; Expired - Fee Related
Other languages: Chinese (zh)

Landscapes

  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a matrix inversion processing method and device. The method comprises the following steps: acquiring an extended matrix built by a central processing unit (CPU); using a Compute Unified Device Architecture (CUDA) platform to construct, according to the extended matrix, a global grid structure comprising a plurality of thread blocks and threads; using the global grid structure to process the column vectors of the extended matrix in parallel, with the data within each column vector computed serially, to obtain a computing result comprising the inverse of the target matrix and an identity matrix of the same size as the inverse; and outputting the computing result to the CPU, which extracts the inverse matrix from it. The method and device solve the problem of low computational efficiency of matrix inversion in the prior art and achieve the effect of improving the efficiency of the inversion process.

Description

Matrix inversion processing method and apparatus
Technical field
The present invention relates to the field of data processing, and in particular to a matrix inversion processing method and apparatus.
Background technology
In today's digitized industrial landscape, the rapid development of computer technology has drawn in techniques from more and more fields, and digital technology delivers an evident industrial driving force. This is especially true in manufacturing: design and simulation analysis occupy a growing proportion of product development, and the fast, convenient computational analysis capability of computers means that modern product development is based almost entirely on digital technology. This very dependence drives continual upgrades of computer hardware and software to meet ever-increasing performance requirements.
Matrix-based computation generally involves the finite element method, the finite difference method, and the boundary element method. Compared with the finite element and finite difference methods, the boundary element method, owing to its high accuracy and dimensionality-reduction advantage, is better suited to rapid preprocessing and adaptive structural analysis. However, the coefficient matrices produced by the boundary element method are dense and asymmetric, which limits both the solution efficiency and the solvable problem scale of the traditional boundary element method. Many researchers have accelerated the boundary element solution process and enlarged the solvable problem scale by introducing fast numerical algorithms such as the fast multipole method. But for time-dependent engineering problems solved with the boundary element method in the time or frequency domain, the complexity of the fundamental solutions and the requirements of time discretization or transformation prevent these numerical algorithms from performing well. In particular, the matrix inversion operation, with its O(N³) computational complexity, consumes a large share of the numerical computation time.
In 1999, Hackbusch proposed a numerical solution algorithm accelerated by a hierarchical tree structure. It partitions the matrix into blocks, approximates admissible sub-blocks by outer products so as to compress the stored matrix data and reduce the volume of data participating in matrix operations, and applies recursion over the hierarchical tree to obtain an approximate inversion technique. However, this method offers only a limited gain in computational efficiency, and because it is an approximate fit, computational accuracy cannot be guaranteed. Moreover, the strong data dependence within matrix inversion makes it difficult to exploit multi-core central processing unit (CPU) parallelism to reduce computation time, so the computational efficiency of matrix inversion remains low.
No effective solution has yet been proposed for the problem of low computational efficiency of matrix inversion in the prior art.
Summary of the invention
A primary object of the present invention is to provide a matrix inversion processing method and apparatus, so as to solve the problem of low computational efficiency of matrix inversion in the prior art.
To achieve this object, according to one aspect of the embodiments of the present invention, a matrix inversion processing method is provided. The method comprises: acquiring an extended matrix built by a central processing unit, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix; constructing a global grid structure from the extended matrix on a Compute Unified Device Architecture platform, the global grid structure comprising a plurality of threads; using the global grid structure to compute the column vectors of the extended matrix in parallel while computing the inter-row data of its row vectors serially, to obtain a computing result comprising the inverse of the target matrix and an identity matrix of the same size as the inverse; and outputting the computing result to the central processing unit, which extracts the inverse matrix from it.
Further, constructing the global grid structure from the extended matrix on the Compute Unified Device Architecture platform comprises: determining a segmentation base according to the size of the extended matrix; dividing the row vectors and column vectors of the extended matrix according to the segmentation base to obtain a plurality of data segments; and building a thread block structure from the segmentation base and the number of data segments to form the global grid structure, wherein the global grid structure comprises one thread block per data segment, and each thread block contains as many threads as the segmentation base.
Further, using the global grid structure to compute the column vectors of the extended matrix in parallel while computing the inter-row data of its row vectors serially, to obtain the computing result, comprises: computing the coefficient vector of the current row vector of the extended matrix, the coefficient vector holding, for every other row vector of the extended matrix, its coefficient with respect to the current row vector; using each thread of the global grid structure to compute a transformation result for its mapped position in the extended matrix and replacing the data at that position with the transformation result, to obtain a replaced extended matrix; judging whether the current row vector is the last row of the extended matrix; if it is not, taking the next row as the current row vector and returning to the step of computing the coefficient vector; and if it is, performing unitization on the replaced extended matrix to obtain the computing result.
Further, computing the coefficient vector of the current row vector of the extended matrix comprises: obtaining the diagonal-position datum of the current row vector; obtaining the data of the column vector containing that diagonal position; and dividing, in turn, the data of that column vector by the diagonal-position datum to obtain the coefficient vector.
Further, before each thread of the global grid structure computes the transformation result for its mapped position in the extended matrix, the method also comprises: establishing a shared video memory space for storing the data of the current row vector, from which the threads of the global grid structure fetch the current row vector's data when computing the transformation results at their mapped positions.
To achieve the above object, according to another aspect of the embodiments of the present invention, a matrix inversion processing apparatus is provided. The apparatus comprises: an acquiring unit for acquiring the extended matrix built by the central processing unit, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix; a first establishing unit for constructing the global grid structure from the extended matrix on the Compute Unified Device Architecture platform, the global grid structure comprising a plurality of threads; a computing unit for using the global grid structure to compute the column vectors of the extended matrix in parallel while computing the inter-row data of its row vectors serially, to obtain the computing result comprising the inverse of the target matrix and an identity matrix of the same size as the inverse; and an output unit for outputting the computing result to the central processing unit, which extracts the inverse matrix from it.
Further, the first establishing unit comprises: a determining module for determining the segmentation base according to the size of the extended matrix; a dividing module for dividing the row vectors and column vectors of the extended matrix according to the segmentation base to obtain a plurality of data segments; and a building module for constructing the thread block structure from the segmentation base and the number of data segments to form the global grid structure, wherein the global grid structure comprises one thread block per data segment and each thread block contains as many threads as the segmentation base.
Further, the computing unit comprises: a first computing module for computing the coefficient vector of the current row vector of the extended matrix, the coefficient vector holding, for every other row vector, its coefficient with respect to the current row vector; a second computing module for using each thread of the global grid structure to compute the transformation result for its mapped position in the extended matrix and replace the data at that position, obtaining the replaced extended matrix; a judging module for judging whether the current row vector is the last row of the extended matrix; the first computing module being further configured, when the current row vector is judged not to be the last row, to take the next row as the current row vector and compute its coefficient vector; and a processing module for performing, when the current row vector is judged to be the last row, unitization on the replaced extended matrix to obtain the computing result.
Further, the first computing module comprises: a first obtaining submodule for obtaining the diagonal-position datum of the current row vector; a second obtaining submodule for obtaining the data of the column vector containing that diagonal position; and a dividing submodule for dividing, in turn, the data of that column vector by the diagonal-position datum to obtain the coefficient vector.
Further, the apparatus also comprises: a second establishing unit for establishing, before the threads of the global grid structure compute their transformation results, a shared video memory space storing the data of the current row vector; the second computing module is further configured so that the threads fetch the current row vector's data from the shared video memory space when computing the transformation results at their mapped positions.
According to the embodiments of the present invention, the extended matrix built by the central processing unit is acquired; a global grid structure is constructed from it on the Compute Unified Device Architecture platform; the column vectors of the extended matrix are computed in parallel with the global grid structure while the inter-row data of its row vectors are computed serially, yielding the computing result; and the computing result is output to the central processing unit, which extracts the inverse matrix from it. This solves the problem of low computational efficiency of matrix inversion in the prior art and achieves the effect of improving the efficiency of the inversion process.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The schematic embodiments and their description explain the present invention and do not constitute an improper limitation of it. In the drawings:
Fig. 1 is a flowchart of a matrix inversion processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a global grid structure according to an embodiment of the present invention;
Fig. 3 is a flowchart of a preferred matrix inversion processing method according to an embodiment of the present invention; and
Fig. 4 is a schematic diagram of a matrix inversion processing apparatus according to an embodiment of the present invention.
Detailed description of the embodiments
It should be noted that the embodiments in this application and the features of those embodiments may be combined with one another when no conflict arises. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To give those skilled in the art a better understanding of the present solution, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", etc. in the description, claims, and drawings of the present invention are used to distinguish similar objects and need not describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented. Moreover, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to it.
An embodiment of the present invention provides a matrix inversion processing method. The method may be performed by a graphics processing unit (GPU).
Fig. 1 is a flowchart of a matrix inversion processing method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: acquire the extended matrix built by the central processing unit, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix.
The target matrix is the matrix to be inverted. Before it is processed, the host-side central processing unit (CPU) first extends it: a target matrix [M] of size m*n, for example, is extended to [M|E]. The extended matrix is then transferred to the device-side graphics processing unit (GPU), which carries out the inversion of the target matrix.
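As an illustration of this extension step, a minimal pure-Python sketch (the function name is hypothetical, not from the patent) that augments a square target matrix with an identity matrix of the same size:

```python
def extend(M):
    """Build the extended matrix [M|E] from a square target matrix M."""
    m = len(M)
    # Append one identity-matrix row segment to each row of M.
    return [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
            for i, row in enumerate(M)]
```

On the device side this corresponds to the m-by-2n buffer (g_iMatrix in the text below) that the kernels operate on.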
Step S104: construct a global grid structure from the extended matrix on the Compute Unified Device Architecture platform. The global grid structure comprises a plurality of threads.
The Compute Unified Device Architecture (CUDA) platform is an extension of the C language that allows GPU code to be written in standard C; such code can target both the central processing unit (CPU) and the graphics processing unit (GPU). The host side is responsible for launching the multithreaded tasks (kernel functions) that run on the GPU device side; the device side uses its internal scheduler to distribute the kernels onto the corresponding GPU hardware. A global grid structure composed of many threads is built on the CUDA platform, and this global grid structure is used to process the extended matrix. Specifically, the global grid structure may comprise at least three sub-grids, used respectively for the three phases of coefficient vector computation, matrix row computation, and matrix unitization. The sub-grid for coefficient vector computation divides thread blocks both horizontally and vertically, each thread handling one data element of the target matrix within the extended matrix; the sub-grid for matrix row computation divides thread blocks only horizontally, each thread handling one column vector of the extended matrix; and the sub-grid for matrix unitization divides thread blocks only vertically, each thread handling one row vector of the extended matrix.
Step S106: use the global grid structure to process the column vectors of the extended matrix in parallel, with the data within each column vector computed serially, to obtain the computing result. The computing result comprises the inverse of the target matrix and an identity matrix of the same size as the inverse.
Because the data between the row vectors of the extended matrix, i.e. the inter-row data, are strongly interdependent during the inversion of the target matrix, the data between different row vectors are computed serially: within each thread, the inter-row data are computed in row order. The column vectors of the extended matrix, by contrast, can be computed in parallel, one thread per column, which improves the computational efficiency of processing the inter-row data.
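The serial-over-rows, parallel-over-columns decomposition described above can be sketched in pure Python (a CPU reference under stated assumptions, not the CUDA implementation; names are hypothetical). Like the patent's procedure, it performs no pivoting, so it assumes the diagonal stays nonzero. Comments mark which loops the GPU threads would absorb:

```python
def gauss_jordan_inverse(M):
    """Invert M by Gauss-Jordan elimination on the extended matrix [M|E].

    The outer loop over pivot rows k is inherently serial; the loops
    inside it touch independent elements and are the part that maps
    onto the CUDA thread grid in the scheme described here.
    """
    m = len(M)
    n2 = 2 * m
    A = [row[:] + [float(i == j) for j in range(m)] for i, row in enumerate(M)]
    for k in range(m):                              # serial: inter-row dependence
        base = A[k][k]
        coeff = [A[i][k] / base for i in range(m)]  # parallelizable per row
        pivot = A[k][:]                             # shared copy of row k
        for i in range(m):                          # parallelizable per element
            if i != k:
                for j in range(n2):
                    A[i][j] -= coeff[i] * pivot[j]
    for k in range(m):                              # unitization, parallelizable
        base = A[k][k]
        A[k] = [v / base for v in A[k]]
    return [row[m:] for row in A]                   # right half is the inverse
```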
Step S108: output the computing result to the central processing unit, which extracts the inverse matrix from it.
After the device-side GPU obtains the computing result, the result is output to the CPU, which extracts the inverse of the target matrix.
For a target matrix [M] of size m*n, data space must be allocated according to the size of [M] before its inverse is computed. The host side needs a space of size m*n to store the target matrix. The device side needs a space of size 2*m*n*sizeof(float) (g_iMatrix) to store intermediate computation data, plus a space of size sizeof(float)*m*1 (g_tVector) to store the computation vector. The target matrix data are then copied from the host into video memory, and the GPU inverts the target matrix.
The matrix finally obtained in CUDA video memory is in fact [E|InM], where InM is the inverse of the original matrix and [E] is the identity matrix of size m*n. The data are copied from video memory into host memory, InM is extracted to replace the original matrix M, and both the video memory and the host memory space are released.
According to the embodiments of the present invention, the extended matrix built by the CPU is acquired; a global grid structure is constructed from it on the CUDA platform; the column vectors of the extended matrix are processed in parallel with the global grid structure, with the data within each column vector computed serially, to obtain the computing result; and the computing result is output to the CPU, which extracts the inverse matrix. By exploiting the powerful logic-processing capability of the host-side CPU and the powerful arithmetic capability of the device-side GPU, and by decomposing the row and column processing of the inversion, the parallel computing power of CUDA is used to the full to reduce computation time and raise the efficiency of matrix inversion.
Preferably, constructing the global grid structure from the extended matrix on the CUDA platform comprises: determining a segmentation base according to the size of the extended matrix; dividing the row vectors and column vectors of the extended matrix according to the segmentation base to obtain a plurality of data segments; and building the thread block structure from the segmentation base and the number of data segments to form the global grid structure, wherein the global grid structure comprises one thread block per data segment and each thread block contains as many threads as the segmentation base.
Specifically, the global grid structure may comprise at least three sub-grids, used respectively for coefficient vector computation, matrix row computation, and matrix unitization.
A target matrix [M] of size m*n is extended to the matrix [M|E] of size 2*m*n.
The rows m and columns n are divided according to a segmentation base s (s may be 8, 16, 32, or 64, chosen according to the scale of the computation) into data segments of sizes Segment_m = m/s+1 and Segment_n = n/s+1 respectively. The segments obtained by splitting the rows and columns are used to build the overall Grid, and the Block structure is built according to the segmentation base s, forming the global grid structure. In Fig. 2, T denotes a thread (Thread), B a thread block (Block), x and y the dimensions of a thread block in its two directions, and X and Y the dimensions of the grid in its two directions.
Specifically, in terms of the three dimensions X, Y, Z, the structures used in the computation are: during coefficient computation, the Grid structure is (Segment_m, Segment_n, 1) and the Block structure is (s, s, 1); for matrix row computation, the Grid structure is (1, 2*Segment_n, 1) and the Block structure is (1, s, 1); for matrix unitization, the Grid structure is (Segment_m, 1, 1) and the Block structure is (s, 1, 1).
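These shapes follow mechanically from m, n, and s. A small helper (hypothetical, not part of the patent) reproduces them, using the segmentation rule Segment = dim/s + 1 stated above:

```python
def grid_shapes(m, n, s):
    """Grid/Block (X, Y, Z) shapes for the three kernel phases."""
    seg_m = m // s + 1
    seg_n = n // s + 1
    return {
        "coefficient": {"grid": (seg_m, seg_n, 1), "block": (s, s, 1)},
        "row_compute": {"grid": (1, 2 * seg_n, 1), "block": (1, s, 1)},
        "unitize":     {"grid": (seg_m, 1, 1),     "block": (s, 1, 1)},
    }
```

For m = n = 100 and s = 16 this yields the (7, 7, 1)/(16, 16, 1) shapes used in the worked example at the end of the description.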
Preferably, using the global grid structure to process the column vectors of the extended matrix in parallel, with the data within each column vector computed serially, to obtain the computing result comprises: computing the coefficient vector of the current row vector of the extended matrix, the coefficient vector holding, for every other row vector, its coefficient with respect to the current row vector; using each thread of the global grid structure to compute the transformation result for its mapped position in the extended matrix and replacing the data at that position with the transformation result, to obtain the replaced extended matrix; judging whether the current row vector is the last row of the extended matrix; if it is not, taking the next row as the current row vector and returning to the step of computing the coefficient vector; and if it is, performing unitization on the replaced extended matrix to obtain the computing result.
Specifically, the row vectors of the extended matrix can be traversed serially (row k being the current row vector), while the CUDA threads operate concurrently on the columns of the extended matrix: the coefficients of the other rows with respect to row k are computed to form the coefficient vector, which is stored in the vector space g_tVector.
For the currently traversed row k, the coefficient vector is used in each thread to compute the transformation result at its mapped matrix position, i.e. the matrix row computation, and the former data are replaced.
Preferably, before each thread of the global grid structure computes its transformation result in the extended matrix, the method also comprises: establishing a shared video memory space that stores the data of the current row vector, from which the threads fetch those data when computing the transformation results at their mapped positions.
Storing the current row vector in a shared video memory space lets different threads read the row's data simultaneously during the computation and avoids the repeated fetches that would consume computational resources.
In each thread, the original matrix datum is divided by the current matrix diagonal-position datum (base) and replaces the former datum, i.e. the unitization of the data at the mapped matrix positions. Finally, it is judged whether the last row of the extended matrix has been traversed; if not, the above steps are repeated for the next row vector.
According to the embodiments of the present invention, the row vectors of the matrix are traversed in turn and the data of each row vector are computed in parallel, which improves the computational efficiency of the inversion.
Further, computing the coefficient vector of the current row vector of the extended matrix comprises: obtaining the diagonal-position datum of the current row vector; obtaining the data of the column vector containing that diagonal position; and dividing, in turn, the data of that column vector by the diagonal-position datum to obtain the coefficient vector.
Specifically, during coefficient computation the Grid structure is (Segment_m, Segment_n, 1) and the Block structure is (s, s, 1). Under this Grid and Block structure, each thread processes the corresponding position of the global matrix (g_iMatrix). Inside the thread kernel, a thread's mapped position in the global X, Y, or Z direction (tid_in_grid_x/y/z) is determined by the Block position and width together with the thread's position within the Block, e.g.:
tid_in_grid_y = blockDim.y*blockIdx.y + threadIdx.y;
The row of the matrix to which the current thread maps in the Y direction is looked up; the datum of row tid_in_grid_y in the pivot column is divided by the matrix diagonal-position datum (base) of row k, and the result is placed at position tid_in_grid_y of the coefficient vector (g_tVector).
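The source announces execution pseudocode at this point that does not survive in this text. Under the stated reading (coefficient = column-k datum of each row divided by the diagonal datum base = A[k][k]), a hedged Python emulation of what one launch of this kernel computes might look like this; the list comprehension stands in for the one-thread-per-entry grid:

```python
def coefficient_kernel(A, k):
    """Emulate the coefficient kernel for pivot row k: g_tVector[i] = A[i][k] / base."""
    base = A[k][k]                       # diagonal datum of the pivot row
    return [A[i][k] / base for i in range(len(A))]
```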
When row matrix computing, Grid structure is (1, Segment_m, 1), and Block structure is (1, s, 1).
Utilize thread global position determination thread matrix data position (tid_in_grid) to be dealt with, determine the capable respective column position (T_in_obj) of k needing to carry out subtraction simultaneously.
tid_in_grid=tid_in_grid_y*n+tid_in_grid_x;
T_in_obj=k*n+tid_in_grid_x;
Because the data of row k are needed in every thread's computation, a shared video memory space (sdata) is set up within each thread block to store them, so that different threads can access them simultaneously instead of fetching them repeatedly and wasting computational resources.
The datum of row k at column T_in_obj is multiplied by the corresponding entry of the coefficient vector (g_tVector), the product is subtracted from the matrix datum at position tid_in_grid, and the result replaces the former datum.
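A serial sketch of this row-operation step, one call per logical thread (the function name is an illustrative assumption, and the ordinary variable sdata stands in for the shared-memory copy of row k):

```python
def row_op_kernel(g_iMatrix, g_tVector, n2, k, tid_in_grid_x, tid_in_grid_y):
    """Subtract coefficient * (row-k datum in the same column) from the
    datum at (tid_in_grid_y, tid_in_grid_x), replacing it in place."""
    if tid_in_grid_y == k:
        return                                    # pivot row is not reduced
    tid_in_grid = tid_in_grid_y * n2 + tid_in_grid_x
    T_in_obj = k * n2 + tid_in_grid_x             # row-k datum, same column
    sdata = g_iMatrix[T_in_obj]                   # stand-in for shared memory
    g_iMatrix[tid_in_grid] -= g_tVector[tid_in_grid_y] * sdata

A = [2.0, 1.0, 1.0, 0.0,    # extended 2x4 matrix [M|E], row-major
     4.0, 3.0, 0.0, 1.0]
coeff = [1.0, 2.0]          # coefficient vector for pivot row k = 0
for x in range(4):
    for y in range(2):
        row_op_kernel(A, coeff, 4, 0, x, y)
print(A[4:])  # [0.0, 1.0, -2.0, 1.0]: row 1 after elimination
```

The nested loops emulate the thread grid; on the device every (x, y) pair runs concurrently.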
During the matrix unitization computation, the Grid structure is (Segment_m, 1, 1) and the Block structure is (s, 1, 1). In each thread, the matrix position tid_in_matrix at row k, column tid_in_grid_x is indexed; the datum there is divided by the diagonal datum base of row k of the current matrix, and the result is stored back at position tid_in_matrix.
Taking a target matrix [M] of size 100*100 as an example, the embodiment of the present invention is described below with reference to Fig. 3.
Step S302: allocate data space according to the size of the target matrix [M]. On the host side, a 100*100 space must be allocated to store the target matrix; on the device side, a 156.25 KB space (g_iMatrix) must be allocated to store intermediate computation data, and a 1 KB space (g_tVector) is opened up at the same time to store the computation vector. The matrix data to be inverted are then copied from the host side into the video memory space.
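The 156.25 KB figure implies 8-byte (double-precision) elements; this is an inference from the arithmetic below, since a later passage on the general m*n case writes sizeof(float):

```python
rows = cols = 100
elem = 8                                  # bytes per element (inferred)
g_iMatrix_bytes = 2 * rows * cols * elem  # extended matrix [M|E]
g_tVector_bytes = rows * elem             # coefficient vector
print(g_iMatrix_bytes / 1024)             # 156.25 (KB)
print(g_tVector_bytes)                    # 800 bytes, held in a 1 KB block
```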
Step S304: expand the target matrix [M] into the matrix [M|E], whose size is 2*100*100.
Step S306: divide the m rows and n columns into data segments of equal size. Specifically, the rows m and columns n are divided according to the segmentation radix s into Segment_m = 7 and Segment_n = 7 segments (s may be 8, 16, 32, or 64, chosen according to the scale of the data to be computed; here s = 16).
Step S308: build the global grid structure from the data segments obtained by splitting the rows and columns. Specifically, the Block structure is built according to the segmentation radix s. In terms of the three dimensions X, Y, Z: the Grid structure for coefficient computation is (7, 7, 1) with Block structure (16, 16, 1); the Grid structure for the matrix-row operation is (1, 14, 1) with Block structure (1, 16, 1); and the Grid structure for matrix unitization is (7, 1, 1) with Block structure (16, 1, 1).
After the global grid structure is built, the following steps S310 to S316 are executed in parallel in each thread, i.e. the column vectors of the extended matrix are computed in parallel.
Step S310: traverse to the current row vector k of the matrix. Specifically, the matrix row vectors are looped over from row 1 to row 100, k = 1, 2, ..., 100.
Step S312: coefficient computation. Under the Grid and Block structure, each thread processes the datum at the corresponding position of the global matrix (g_iMatrix). In the thread kernel function, a thread's global position in the X, Y, or Z direction (tid_in_grid_x/y/z) is determined by the position and width of its Block together with the thread's position within the Block, e.g.:
tid_in_grid_y = blockDim.y * blockIdx.y + threadIdx.y;
Each thread looks up the row of the matrix onto which its Y-direction position maps, divides that row's datum in the pivot column by the diagonal (pivot) datum base of row k, and stores the result at position tid_in_grid_y of the coefficient vector (g_tVector).
Step S314: matrix-row operation. In CUDA, each thread's datum is operated on together with the data of row k of the matrix currently being traversed. Each thread first uses its global position to determine the position of the matrix datum to be processed (tid_in_grid), and at the same time determines the corresponding column position in row k that takes part in the subtraction (T_in_obj):
tid_in_grid=tid_in_grid_y*n+tid_in_grid_x;
T_in_obj=k*n+tid_in_grid_x;
Because the data of row k are needed in every thread's computation, a shared video memory space (sdata) is set up within each thread block to store them, so that different threads can access them simultaneously instead of fetching them repeatedly and wasting computational resources.
The datum of row k at column T_in_obj is multiplied by the corresponding entry of the coefficient vector (g_tVector), the product is subtracted from the matrix datum at position tid_in_grid, and the result replaces the former datum.
Step S316: matrix unitization. In each thread, the matrix position tid_in_matrix at row k, column tid_in_grid_x is indexed; the datum there is divided by the diagonal datum base of row k of the current matrix, and the computation result is stored back at position tid_in_matrix.
Step S318: judge whether k equals m. If not, increment k by 1 and return to step S310; otherwise, execute step S320.
Step S320: the matrix actually obtained in CUDA video memory is [E|InM], where InM is the inverse of the original matrix and [E] is the identity matrix of size 100*100. The data are copied from the video memory space into main memory, InM is extracted to replace the original matrix M, and both the video memory and the main-memory space are released.
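Steps S304 to S320 amount to Gauss-Jordan elimination on the augmented matrix. A plain serial Python sketch of that arithmetic (no CUDA and no pivoting; it assumes nonzero diagonal pivots, as the text does):

```python
def invert_gauss_jordan(M):
    m = len(M)
    # S304: expand [M] to [M|E]
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(m)]
         for i, row in enumerate(M)]
    for k in range(m):                               # S310: traverse rows
        coeff = [A[i][k] / A[k][k] for i in range(m)]        # S312
        for i in range(m):                           # S314: row operations
            if i != k:
                for j in range(2 * m):
                    A[i][j] -= coeff[i] * A[k][j]
        base = A[k][k]                               # S316: unitization
        for j in range(2 * m):
            A[k][j] /= base
    return [row[m:] for row in A]                    # S320: extract InM

InM = invert_gauss_jordan([[4.0, 7.0], [2.0, 6.0]])
print(InM)  # approximately [[0.6, -0.7], [-0.2, 0.4]]
```

In the patented scheme the inner loops over i and j are what the CUDA grid parallelizes, while the outer loop over k remains serial.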
An embodiment of the present invention further provides a matrix inversion processing device. It should be noted that the matrix inversion processing device of this embodiment may be used to execute the matrix inversion processing method provided by the embodiment of the present invention, and that method may likewise be executed by this device.
Fig. 4 is a schematic diagram of the matrix inversion processing device according to the embodiment of the present invention. As shown in Fig. 4, the device comprises: an acquiring unit 10, a first establishing unit 20, a computing unit 30, and an output unit 40.
The acquiring unit 10 is configured to obtain the extended matrix produced by the central processing unit's expansion; the extended matrix comprises the target matrix and an identity matrix of the same size as the target matrix.
The target matrix is the matrix to be inverted. Before the computation, the host-side central processing unit (CPU) first expands the target matrix into the extended matrix; for example, a target matrix [M] of size m*n is expanded into the extended matrix [M|E]. The extended matrix is then output to the device-side graphics processing unit (GPU), which performs the inversion of the target matrix.
The first establishing unit 20 is configured to build a global grid structure from the extended matrix on the Compute Unified Device Architecture platform, where the global grid structure comprises a plurality of threads.
CUDA (Compute Unified Device Architecture) is an extension of the C language that allows GPU code to be written in standard C; such code runs both on the central processing unit (CPU) and on the graphics processing unit (GPU). The host side is responsible for launching the multithreaded tasks (kernel functions) that run on the GPU device side; the device side uses a built-in scheduler to distribute the kernel program over the corresponding GPU hardware. The CUDA platform is used to build a global grid structure composed of a plurality of threads, and this structure is used to process the extended matrix. Specifically, the global grid structure may comprise at least three sub-grid structures, used respectively for the three processes of coefficient-vector computation, matrix-row operation, and matrix unitization: the sub-grid for coefficient-vector computation divides thread blocks both horizontally and vertically, each thread handling one data element of the target matrix within the extended matrix; the sub-grid for the matrix-row operation divides thread blocks only horizontally, each thread handling one column vector of the extended matrix; and the sub-grid for matrix unitization divides thread blocks only vertically, each thread handling one row vector of the extended matrix.
The computing unit 30 is configured to use the global grid structure to process the column vectors of the extended matrix in parallel, the data within each column vector being computed serially, to obtain the computation result. The computation result comprises the inverse of the target matrix and an identity matrix of the same size as the inverse.
In the process of inverting the target matrix, the data of the different row vectors of the extended matrix are strongly interdependent; the data are therefore computed serially from one row vector to the next, i.e. each thread computes the row-wise data serially in row order. For the individual column vectors of the extended matrix, however, parallel computation can be adopted, i.e. the threads compute in parallel, which improves the computing efficiency of the row-wise data.
The output unit 40 is configured to output the computation result to the central processing unit, which extracts the inverse matrix from the result.
After the device-side GPU obtains the computation result, the result is output to the CPU, which extracts the inverse of the target matrix.
For a target matrix [M] of size m*n, data space must be allocated according to the size of [M] before its inverse is computed. On the host side, an m*n space is allocated to store the target matrix; on the device side, a space of 2*m*n*sizeof(float) bytes (g_iMatrix) is allocated to store intermediate computation data, and a space of sizeof(float)*m*1 bytes (g_tVector) is opened up at the same time to store the computation vector. The target matrix data are then copied from the host side into video memory, and the GPU inverts the target matrix.
The matrix actually obtained in CUDA video memory is [E|InM], where InM is the inverse of the original matrix and [E] is the identity matrix of size m*n. The data are copied from video memory into main memory, InM is extracted to replace the original matrix M, and both the video memory and the main-memory space are released.
According to the embodiment of the present invention, the extended matrix produced by the central processing unit's expansion is obtained; a global grid structure is built from the extended matrix on the Compute Unified Device Architecture platform; the column vectors of the extended matrix are processed in parallel using the global grid structure, the data within each column vector being computed serially, to obtain the computation result; and the result is output to the central processing unit, which extracts the inverse matrix from it. By exploiting the powerful logic-processing capability of the host-side CPU and the powerful arithmetic capability of the device-side GPU, and by decomposing the row and column data processing of matrix inversion, the parallel computing power of CUDA is used to the fullest to reduce computation time and raise the efficiency of matrix inversion.
Preferably, the first establishing unit comprises: a determining module, configured to determine the segmentation radix according to the size of the extended matrix; a dividing module, configured to divide the row vectors and column vectors of the extended matrix according to the segmentation radix, obtaining a plurality of data segments; and a building module, configured to build the thread-block structure according to the segmentation radix and the number of data segments, forming the global grid structure, where the global grid structure comprises thread blocks in one-to-one correspondence with the data segments and each thread block has a number of threads equal to the segmentation radix.
Specifically, the global grid structure may comprise at least three sub-grid structures, used respectively for the three processes of coefficient-vector computation, matrix-row operation, and matrix unitization.
A target matrix [M] of size m*n is expanded into a matrix [M|E] of size 2*m*n.
The rows m and columns n are divided according to the segmentation radix s into data segments of sizes Segment_m = m/s+1 and Segment_n = n/s+1 (s may be taken as 8, 16, 32, or 64, chosen according to the scale of the data to be computed). The segments obtained by splitting the rows and columns are used to build the overall Grid; the Block structure is then built according to the segmentation radix s, forming the global grid structure. As shown in Fig. 2, T denotes a thread (Thread), B denotes a thread block (Block), x and y denote the thread-block dimensions in the two directions, and X and Y denote the grid dimensions in the two directions.
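The segment and grid arithmetic above can be checked directly, reading "/" as integer division (an assumption consistent with the 100*100 example, where s = 16 gives Segment_m = 100/16 + 1 = 7):

```python
def segments(dim, s):
    return dim // s + 1        # Segment formula from the text

m = n = 100
s = 16
coeff_grid = (segments(m, s), segments(n, s), 1)      # (7, 7, 1)
rowop_grid = (1, 2 * segments(n, s), 1)               # (1, 14, 1)
unit_grid  = (segments(m, s), 1, 1)                   # (7, 1, 1)
print(coeff_grid, rowop_grid, unit_grid)
```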
Specifically, in terms of the three dimensions X, Y, Z, the structures used in the computation are: for coefficient computation, Grid (Segment_m, Segment_n, 1) with Block (s, s, 1); for the matrix-row operation, Grid (1, 2*Segment_n, 1) with Block (1, s, 1); and for matrix unitization, Grid (Segment_m, 1, 1) with Block (s, 1, 1).
Preferably, the computing unit comprises: a first computing module, configured to compute the coefficient vector of the current row vector in the extended matrix, the coefficient vector comprising the coefficients of the other row vectors of the extended matrix with respect to the current row vector; a second computing module, configured to use each thread of the global grid structure to compute that thread's transformation result at its mapped position in the extended matrix and to replace the data at that position with the result, obtaining the replaced extended matrix; a judging module, configured to judge whether the current row vector is the last row vector of the extended matrix; the first computing module being further configured, if the current row vector is judged not to be the last row vector, to take the next row vector as the current row vector and compute its coefficient vector; and a processing module, configured, if the current row vector is judged to be the last row vector, to perform unitization on the replaced extended matrix to obtain the computation result.
Specifically, the row vectors of the extended matrix can be traversed serially (row k being the current row vector of the extended matrix); the columns of the extended matrix are operated on concurrently by the CUDA threads to compute the coefficients of the other rows with respect to row k, and the resulting coefficient vector is stored in the vector space g_tVector.
For the row k currently being traversed, the coefficient vector is used in each thread to compute the transformation result at the thread's mapped position in the matrix, i.e. the matrix-row operation, and the former data are replaced.
Preferably, the matrix inversion processing device further comprises: a second establishing unit, configured to set up, before the threads of the global grid structure compute their transformation results at their mapped positions in the extended matrix, a shared video memory space for storing the data of the current row vector; the second computing module being further configured to have the threads fetch the data of the current row vector from the shared video memory space when computing their transformation results.
By storing the data of the current row vector in a shared video memory space, different threads can access them simultaneously during the computation, avoiding the waste of computational resources caused by repeated fetching.
In each thread, the original matrix datum is divided by the diagonal datum (base) of the current matrix and the former datum is replaced, i.e. the data at the thread's mapped position are unitized. Finally, it is judged whether the last row of the extended matrix has been reached; if not, the above steps are executed again for the next row vector.
According to the embodiment of the present invention, the matrix row vectors are traversed in sequence and the data of each row vector are computed in parallel, thereby improving the computing efficiency of matrix inversion.
Further, the first computing module comprises: a first obtaining submodule, configured to obtain the diagonal-position datum of the current row vector; a second obtaining submodule, configured to obtain the data in the column vector containing the diagonal-position datum; and a division submodule, configured to divide, in turn, each datum in that column vector by the diagonal-position datum, obtaining the coefficient vector.
Specifically, during coefficient computation the Grid structure is (Segment_m, Segment_n, 1) and the Block structure is (s, s, 1). Under this Grid and Block structure, each thread processes the datum at the corresponding position of the global matrix (g_iMatrix). In the thread kernel function, a thread's global position in the X, Y, or Z direction (tid_in_grid_x/y/z) is determined by the position and width of its Block together with the thread's position within the Block, e.g.:
tid_in_grid_y = blockDim.y * blockIdx.y + threadIdx.y;
Each thread looks up the row of the matrix onto which its Y-direction position maps, divides that row's datum in the pivot column by the diagonal (pivot) datum base of row k, and stores the result at position tid_in_grid_y of the coefficient vector (g_tVector).
When the matrix-row operation is performed, the Grid structure is (1, Segment_m, 1) and the Block structure is (1, s, 1).
Each thread uses its global position to determine the position of the matrix datum to be processed (tid_in_grid), and at the same time determines the corresponding column position in row k that takes part in the subtraction (T_in_obj):
tid_in_grid=tid_in_grid_y*n+tid_in_grid_x;
T_in_obj=k*n+tid_in_grid_x;
Because the data of row k are needed in every thread's computation, a shared video memory space (sdata) is set up within each thread block to store them, so that different threads can access them simultaneously instead of fetching them repeatedly and wasting computational resources.
The datum of row k at column T_in_obj is multiplied by the corresponding entry of the coefficient vector (g_tVector), the product is subtracted from the matrix datum at position tid_in_grid, and the result replaces the former datum.
During the matrix unitization computation, the Grid structure is (Segment_m, 1, 1) and the Block structure is (s, 1, 1). In each thread, the matrix position tid_in_matrix at row k, column tid_in_grid_x is indexed; the datum there is divided by the diagonal datum base of row k of the current matrix, and the result is stored back at position tid_in_matrix.
The effect of the present invention can be further illustrated by the following simulation and measured-data experiments.
Simulation conditions
Algorithm operating platform:
CPU:Intel(R)Xeon(R)CPU E5-1620v2(3.70GHz);
GPU:NVIDIA Quadro K4000;
Memory: 16GB;
Compiler: Visual Studio 2010;
Simulation content
Serial and CUDA-parallel inversion operations were carried out on the dense unsymmetric matrices arising in boundary-element-method numerical computation; the time spent on each computation was recorded (in seconds) and the precision of the results was verified.
Measured-data experiments
Two dense unsymmetric matrices were solved, of sizes 3084*3084 and 7605*7605 respectively. For the first problem size, the parallel CUDA inversion took 5.383 s, the serial CPU inversion took 32.325 s, and the speed-up ratio was 6.01.
For the second problem size, the parallel CUDA inversion took 14.447 s, the serial CPU inversion took 198.144 s, and the speed-up ratio was 13.72.
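The reported speed-up ratios are simply the quotients of the serial and parallel times:

```python
runs = {"3084*3084": (32.325, 5.383), "7605*7605": (198.144, 14.447)}
for size, (t_cpu, t_gpu) in runs.items():
    # roughly 6.01 and 13.72, matching the reported ratios
    print(size, round(t_cpu / t_gpu, 2))
```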
It can be seen that, relative to serial CPU computation, the CUDA parallel inversion algorithm improves computing efficiency markedly, and the gain grows as the scale of the computed data increases.
It should be noted that, for brevity of description, each of the foregoing method embodiments is expressed as a series of combined actions; those skilled in the art should understand, however, that the present invention is not limited by the order of actions described, since according to the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in a given embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device may be realized in other ways. For example, the device embodiments described above are merely schematic: the division into units is only a division by logical function, and in actual realization there may be other ways of dividing; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections of devices or units through certain interfaces, and may be electrical or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be realized either in the form of hardware or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the part of the technical scheme of the present invention that in essence contributes to the prior art, or all or part of the technical scheme, may be embodied in the form of a software product; the computer software product is stored in a storage medium and comprises instructions for causing a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforesaid storage medium includes various media capable of storing program code: a USB flash disk, read-only memory (ROM), random access memory (RAM), a portable hard drive, a magnetic disk, an optical disc, and the like.
The foregoing are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A matrix inversion processing method, characterized by comprising:
obtaining an extended matrix produced by a central processing unit's expansion, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix;
building a global grid structure from the extended matrix on a Compute Unified Device Architecture platform, wherein the global grid structure comprises a plurality of threads;
performing parallel processing on the column vectors of the extended matrix by means of the global grid structure, wherein the data within each column vector of the extended matrix are computed in a serial manner, obtaining a computation result, the computation result comprising an inverse of the target matrix and an identity matrix of the same size as the inverse; and
outputting the computation result to the central processing unit, wherein the central processing unit extracts the inverse matrix from the computation result.
2. The matrix inversion processing method according to claim 1, characterized in that building the global grid structure from the extended matrix on the Compute Unified Device Architecture platform comprises:
determining a segmentation radix according to the size of the extended matrix;
dividing the row vectors and column vectors of the extended matrix according to the segmentation radix, obtaining a plurality of data segments; and
building a thread-block structure according to the segmentation radix and the number of data segments, forming the global grid structure, wherein the global grid structure comprises thread blocks in one-to-one correspondence with the plurality of data segments, and each thread block has a number of threads equal to the segmentation radix.
3. The matrix inversion processing method according to claim 1, characterized in that performing parallel processing on the column vectors of the extended matrix by means of the global grid structure, wherein the data within each column vector of the extended matrix are computed in a serial manner, obtaining a computation result, comprises:
computing the coefficient vector of a current row vector in the extended matrix, the coefficient vector comprising the coefficients of the other row vectors of the extended matrix with respect to the current row vector;
using the threads of the global grid structure to compute each thread's transformation result at its mapped position in the extended matrix, and replacing the data at that mapped position with the transformation result, obtaining a replaced extended matrix;
judging whether the current row vector is the last row vector of the extended matrix;
if it is judged that the current row vector is not the last row vector of the extended matrix, taking the row vector following the current row vector as the current row vector, and returning to the step of computing the coefficient vector of the current row vector; and
if it is judged that the current row vector is the last row vector of the extended matrix, performing unitization on the replaced extended matrix to obtain the computation result.
4. The matrix inversion processing method according to claim 3, characterized in that computing the coefficient vector of the current row vector in the extended matrix comprises:
obtaining the diagonal-position datum of the current row vector;
obtaining the data in the column vector containing the diagonal-position datum; and
dividing, in turn, each datum in that column vector by the diagonal-position datum, obtaining the coefficient vector.
5. The matrix inversion processing method according to claim 3, characterized in that, before the threads of the global grid structure compute their transformation results at their mapped positions in the extended matrix, the matrix inversion processing method further comprises:
setting up a shared video memory space for storing the data of the current row vector;
wherein the threads of the global grid structure fetch the data of the current row vector from the shared video memory space when computing their transformation results at their mapped positions in the extended matrix.
6. A matrix inversion processing device, characterized by comprising:
an acquiring unit, configured to obtain an extended matrix produced by a central processing unit's expansion, the extended matrix comprising a target matrix and an identity matrix of the same size as the target matrix;
a first establishing unit, configured to build a global grid structure from the extended matrix on a Compute Unified Device Architecture platform, wherein the global grid structure comprises a plurality of threads;
a computing unit, configured to perform parallel processing on the column vectors of the extended matrix by means of the global grid structure, wherein the data within each column vector of the extended matrix are computed in a serial manner, obtaining a computation result, the computation result comprising an inverse of the target matrix and an identity matrix of the same size as the inverse; and
an output unit, configured to output the computation result to the central processing unit, wherein the central processing unit extracts the inverse matrix from the computation result.
7. The matrix inversion processing device according to claim 6, characterized in that the first establishing unit comprises:
a determining module, configured to determine a segmentation radix according to the size of the extended matrix;
a dividing module, configured to divide the row vectors and column vectors of the extended matrix according to the segmentation radix, obtaining a plurality of data segments; and
a building module, configured to build a thread-block structure according to the segmentation radix and the number of data segments, forming the global grid structure, wherein the global grid structure comprises thread blocks in one-to-one correspondence with the plurality of data segments, and each thread block has a number of threads equal to the segmentation radix.
8. The matrix inversion processing device according to claim 6, characterized in that the computing unit comprises:
a first computing module, configured to compute the coefficient vector of a current row vector in the extended matrix, the coefficient vector comprising the coefficients of the other row vectors of the extended matrix with respect to the current row vector;
a second computing module, configured to use the threads of the global grid structure to compute each thread's transformation result at its mapped position in the extended matrix, and to replace the data at that mapped position with the transformation result, obtaining a replaced extended matrix;
a judging module, configured to judge whether the current row vector is the last row vector of the extended matrix;
the first computing module being further configured, if it is judged that the current row vector is not the last row vector of the extended matrix, to take the row vector following the current row vector as the current row vector and compute the coefficient vector of the current row vector; and
a processing module, configured, if it is judged that the current row vector is the last row vector of the extended matrix, to perform unitization on the replaced extended matrix to obtain the computation result.
9. The matrix inversion processing device according to claim 8, characterized in that the first computing module comprises:
a first obtaining submodule, configured to obtain the diagonal-position datum of the current row vector;
a second obtaining submodule, configured to obtain the data in the column vector containing the diagonal-position datum; and
a division submodule, configured to divide, in turn, each datum in that column vector by the diagonal-position datum, obtaining the coefficient vector.
10. The matrix inversion processing device according to claim 8, wherein the matrix inversion processing device further comprises:
a second establishing unit, configured to establish a shared video memory space before the threads of the global grid structure compute their transformation results at their mapping positions in the extended matrix, wherein the shared video memory space is used to store the data of the current row vector; and
the second computing module being further configured such that, when computing the transformation result at a thread's mapping position in the extended matrix, the threads of the global grid structure retrieve the data of the current row vector from the shared video memory space.
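Claim 10's point is that the current row is staged once in fast shared (video) memory so every thread in a block reads it from the cache rather than from global memory. A serial sketch of a single elimination step can mimic that staging with an explicit buffer copy; the function name and loop structure are illustrative assumptions, not the patented kernel:

```python
import numpy as np

def eliminate_step_with_cached_row(ext, k):
    """One elimination step on the extended matrix, with the current row
    staged in a separate buffer first -- the serial analogue of claim 10's
    shared video memory: every per-element update reads the current row
    from the cache instead of from the (mutable) matrix itself."""
    shared_row = ext[k, :].copy()            # staged once per step (claim 10)
    pivot = shared_row[k]                    # the diagonal-position datum
    rows, cols = ext.shape
    for i in range(rows):                    # each (i, j) pair maps to one thread
        if i == k:
            continue                         # the current row is left untouched
        coeff = ext[i, k] / pivot            # claim 9's coefficient
        for j in range(cols):
            ext[i, j] -= coeff * shared_row[j]   # read only from the cached row
    return ext
```

Running one step per row and then normalizing the diagonal reproduces the full inversion; on a GPU the copy into shared memory is what turns n reads of the current row per block into one.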
CN201410816765.2A 2014-12-23 2014-12-23 Matrix inversion process method and apparatus Expired - Fee Related CN104572588B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410816765.2A CN104572588B (en) 2014-12-23 2014-12-23 Matrix inversion process method and apparatus


Publications (2)

Publication Number Publication Date
CN104572588A true CN104572588A (en) 2015-04-29
CN104572588B CN104572588B (en) 2018-10-23

Family

ID=53088693

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410816765.2A Expired - Fee Related CN104572588B (en) 2014-12-23 2014-12-23 Matrix inversion process method and apparatus

Country Status (1)

Country Link
CN (1) CN104572588B (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011038940A1 (en) * 2009-10-01 2011-04-07 Intracom S.A. Telecom Solutions Matrix inversion using qr decomposition on a parallel pipelined systolic array
CN103631761A (en) * 2012-08-29 2014-03-12 睿励科学仪器(上海)有限公司 Method for matrix operation and rigorous wave coupling analysis through parallel processing architecture


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Peter Benner et al.: "High Performance Matrix Inversion of SPD Matrices on Graphics Processors", High Performance Computing and Simulation, 2011 Int. Conf. *
Shane Ryoo et al.: "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA", ACM PPoPP 2008 *
Liu Li: "Application of GPU Parallel Technology in Matrix Operations and Canonical Mode Analysis", China Master's Theses Full-text Database, Information Science and Technology *
Gao Yueqing et al.: "Research on the CUDA-based CS Algorithm for SAR Imaging", Computer & Network *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021188A (en) * 2016-05-11 2016-10-12 广州广电运通金融电子股份有限公司 Parallel hardware architecture and parallel computing method for floating point matrix inversion
WO2017193922A1 (en) * 2016-05-11 2017-11-16 广州广电运通金融电子股份有限公司 Parallel hardware architecture and parallel computing method for floating point matrix inversion
WO2022022362A1 (en) * 2020-07-31 2022-02-03 中兴通讯股份有限公司 Data processing method and device, and storage medium
CN112837205A (en) * 2021-03-05 2021-05-25 中国科学院计算机网络信息中心 Delay correction-based batch matrix inversion method on graphics processor
CN112837205B (en) * 2021-03-05 2022-07-26 中国科学院计算机网络信息中心 Delay correction-based batch matrix inversion method on graphics processor
CN114417249A (en) * 2022-01-24 2022-04-29 合肥工业大学 Multi-order matrix fast inversion hardware structure implementation method
CN114417249B (en) * 2022-01-24 2024-03-26 合肥工业大学 Method for realizing multi-order matrix rapid inversion hardware structure

Also Published As

Publication number Publication date
CN104572588B (en) 2018-10-23

Similar Documents

Publication Publication Date Title
Guo et al. A survey of FPGA-based neural network accelerator
Zachariadis et al. Accelerating sparse matrix–matrix multiplication with GPU Tensor Cores
US9886418B2 (en) Matrix operands for linear algebra operations
CN108170639B (en) Tensor CP decomposition implementation method based on distributed environment
CN109726441B (en) Body and surface mixed GPU parallel computing electromagnetism DGTD method
CN109086244A (en) Matrix convolution vectorization implementation method based on vector processor
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
CN112668708B (en) Convolution operation device for improving data utilization rate
CN103177414A (en) Structure-based dependency graph node similarity concurrent computation method
CN104182571B (en) Kriging interpolation methods based on Delaunay and GPU
CN104572588A (en) Matrix inversion processing method and device
Rybacki et al. Experiments with single core, multi-core, and GPU based computation of cellular automata
CN110135569A (en) Heterogeneous platform neuron positioning three-level flow parallel method, system and medium
CN106202224B (en) Search processing method and device
CN110782009B (en) Computing kernel optimization method based on ARMv8 system
Shi et al. Efficient sparse-dense matrix-matrix multiplication on GPUs using the customized sparse storage format
CN106484532B (en) GPGPU parallel calculating method towards SPH fluid simulation
Wu et al. Optimizing dynamic programming on graphics processing units via adaptive thread-level parallelism
CN106933777B (en) The high-performance implementation method of the one-dimensional FFT of base 2 based on domestic 26010 processor of Shen prestige
CN110245706B (en) Lightweight target detection method for embedded application
Shi et al. Geocomputation over the emerging heterogeneous computing infrastructure
Lai et al. Accelerating geospatial applications on hybrid architectures
Husselmann et al. Spatial data structures, sorting and gpu parallelism for situated-agent simulation and visualisation
CN103559312B (en) GPU (graphics processing unit) based melody matching parallelization method
Shantharam et al. Exploiting dense substructures for fast sparse matrix vector multiplication

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181023

Termination date: 20191223