CN107256203A - Method and device for implementing matrix-vector multiplication - Google Patents
- Publication number: CN107256203A
- Application number: CN201710506697.3A
- Authority: CN (China)
- Prior art keywords
- matrix
- data block
- vector
- row
- column
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Abstract
The embodiments of the invention disclose a method for implementing matrix-vector multiplication. The method includes: under the Open Computing Language (OpenCL) framework, vectorizing the first matrix and the second matrix of the multiplication respectively; and performing parallel operations on the multiple sub-matrices obtained after the vectorization. The embodiments of the invention also disclose a device for implementing matrix-vector multiplication. Through this scheme, the method can be deployed on high-performance computing platforms, makes full use of computer hardware resources, greatly shortens computation time, and improves operation efficiency.
Description
Technical field
Embodiments of the present invention relate to the field of high-performance computing, and in particular to a method and device for implementing matrix-vector multiplication.
Background
Human society is producing an explosion of data: the volume of information keeps growing, and so do the demands placed on data-processing capability. In fields such as artificial intelligence, weather forecasting, aerospace and national defense, finance and economics, oil exploration, and scientific research, the demand for high-performance computing grows by the day, and high-performance matrix-vector multiplication is one of its most important foundations.
However, the scheme currently used for matrix-vector multiplication is serial computation on a CPU (Central Processing Unit): one element of the matrix is multiplied at a time, and only then is the next element processed. The computation time is long and the efficiency is low, far short of today's ever-growing requirements on data-processing speed.
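For reference, the serial CPU baseline that the patent contrasts against can be sketched in plain Python (an illustration added here, not part of the patent text): one multiply-accumulate at a time, with no parallelism.

```python
def serial_matvec(matrix, vector):
    """Serial CPU-style matrix-vector multiply: elements are
    multiplied one at a time, one row after another."""
    result = []
    for row in matrix:
        acc = 0.0
        for a, b in zip(row, vector):  # one multiplication per step
            acc += a * b
        result.append(acc)
    return result
```

For example, `serial_matvec([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])` yields `[3.0, 7.0]`; every product is computed sequentially, which is exactly the bottleneck the OpenCL scheme below removes.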
Summary of the invention
To solve the above problems, embodiments of the present invention propose a method and device for implementing matrix-vector multiplication that can be deployed on high-performance computing platforms, make full use of computer hardware resources, greatly shorten computation time, and improve operation efficiency.
To achieve the above object, an embodiment of the present invention proposes a method for implementing matrix-vector multiplication. The method includes:
under the Open Computing Language (OpenCL) framework, vectorizing the first matrix and the second matrix of the multiplication respectively;
performing parallel operations on the multiple sub-matrices obtained after the vectorization.
Optionally, vectorizing the first matrix and the second matrix of the multiplication under the Open Computing Language (OpenCL) framework includes:
taking each row vector of the first matrix as a row data block, and each column vector of the second matrix as a column data block;
passing one row data block and one column data block into each kernel function, thereby obtaining multiple kernel functions, where each kernel function corresponds one-to-one with the row data block and column data block passed into it;
vectorizing the row data block and the column data block in each kernel function, so as to obtain multiple row-vector sub-matrices and multiple column-vector sub-matrices.
Optionally, vectorizing the row data block and the column data block in each kernel function includes:
using the OpenCL vector (Vector) data types, vectorizing every n floating-point values in the row data block to obtain multiple row-vector sub-matrices, and vectorizing every n floating-point values in the column data block to obtain multiple column-vector sub-matrices, where n is a positive integer.
Optionally, performing parallel operations on the multiple sub-matrices obtained after the vectorization includes:
in each kernel function, multiplying each row-vector sub-matrix with its corresponding column-vector sub-matrix in parallel.
Optionally, n = 4.
To achieve the above object, an embodiment of the present invention also proposes a device for implementing matrix-vector multiplication. The device includes a processing module and a computing module:
the processing module is configured to, under the Open Computing Language (OpenCL) framework, vectorize the first matrix and the second matrix of the multiplication respectively;
the computing module is configured to perform parallel operations on the multiple sub-matrices obtained after the vectorization.
Optionally, the processing module vectorizing the first matrix and the second matrix of the multiplication under the OpenCL framework includes:
taking each row vector of the first matrix as a row data block, and each column vector of the second matrix as a column data block;
passing one row data block and one column data block into each kernel function, thereby obtaining multiple kernel functions, where each kernel function corresponds one-to-one with the row data block and column data block passed into it;
vectorizing the row data block and the column data block in each kernel function, so as to obtain multiple row-vector sub-matrices and multiple column-vector sub-matrices.
Optionally, the processing module vectorizing the row data block and the column data block in each kernel function includes:
using the OpenCL Vector data types, vectorizing every n floating-point values in the row data block to obtain multiple row-vector sub-matrices, and vectorizing every n floating-point values in the column data block to obtain multiple column-vector sub-matrices, where n is a positive integer.
Optionally, the computing module performing parallel operations on the multiple sub-matrices obtained after the vectorization includes:
in each kernel function, multiplying each row-vector sub-matrix with its corresponding column-vector sub-matrix in parallel.
Optionally, n = 4.
The scheme of the embodiments of the present invention includes: under the Open Computing Language (OpenCL) framework, vectorizing the first matrix and the second matrix of the multiplication respectively; and performing parallel operations on the multiple sub-matrices obtained after the vectorization. Through this scheme, the method can be deployed on high-performance computing platforms, makes full use of computer hardware resources, greatly shortens the computation time, and improves operation efficiency.
Brief description of the drawings
The accompanying drawings are provided for a further understanding of the embodiments of the present invention; together with the specification they serve to explain the embodiments and do not limit their scope of protection.
Fig. 1 is a flowchart of the matrix-vector multiplication method of an embodiment of the present invention;
Fig. 2 is a flowchart of vectorizing the first matrix and the second matrix of the multiplication, according to an embodiment;
Fig. 3 is a schematic diagram of dividing the first matrix and the second matrix into data blocks, according to an embodiment;
Fig. 4 is an OpenCL block diagram of the matrix-vector multiplication of an embodiment;
Fig. 5 is a schematic diagram of vectorizing the row data block and the column data block inside a kernel function, according to an embodiment;
Fig. 6 is a block diagram of the matrix-vector multiplication device of an embodiment.
Detailed description of the embodiments
To facilitate understanding by those skilled in the art, the embodiments of the present invention are further described below with reference to the accompanying drawings; this description does not limit the scope of protection of the embodiments.
An embodiment of the present invention proposes a method for implementing matrix-vector multiplication. As shown in Fig. 1, the method may include steps S101-S102:
S101: under the Open Computing Language (OpenCL) framework, vectorize the first matrix and the second matrix of the multiplication respectively.
In the embodiments of the present invention, to solve the slow speed and low efficiency of current matrix multiplication methods, an OpenCL-based implementation of matrix-vector multiplication is proposed. OpenCL (Open Computing Language) is an open, cross-platform parallel-programming framework for general-purpose computing on heterogeneous systems. Given hardware capable of parallel acceleration, it can make full use of computer hardware resources and improve the operation efficiency of the matrix-vector multiplication. The computer hardware referred to here is any hardware platform that supports OpenCL; for example, such a platform may consist of CPUs, GPUs (Graphics Processing Units), or other types of processors.
In the embodiments of the present invention, the OpenCL-based method divides the operand matrices into blocks, with the number of blocks equal to the number of rows (or columns) of the matrix, thereby realizing parallel processing of the data. Specifically, this can be realized through the following scheme.
Optionally, as shown in Fig. 2, vectorizing the first matrix and the second matrix of the multiplication under the Open Computing Language (OpenCL) framework may include steps S201-S203:
S201: take each row vector of the first matrix as a row data block, and each column vector of the second matrix as a column data block.
In the embodiments of the present invention, initialization is first performed under the OpenCL framework: the necessary components, such as the device (Device), context (Context), and program (Program), are defined and assigned. The first matrix and the second matrix of the multiplication are then divided into blocks. Specifically, each row vector of the first matrix is taken as a row data block, and each column vector of the second matrix is taken as a column data block, as in the row data blocks Hblock1, Hblock2, Hblock3, ..., Hblockn and the column data blocks Lblock1, Lblock2, Lblock3, ..., Lblockn shown in Fig. 3.
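The partitioning of S201 can be sketched in plain Python (an illustration, not the patent's OpenCL host code; the names `hblocks`/`lblocks` mirror the Hblock/Lblock labels of Fig. 3):

```python
def partition_blocks(first_matrix, second_matrix):
    """Each row vector of the first matrix becomes a row data block
    (Hblock1..Hblockn); each column vector of the second matrix
    becomes a column data block (Lblock1..Lblockn)."""
    hblocks = [list(row) for row in first_matrix]
    lblocks = [list(col) for col in zip(*second_matrix)]
    return hblocks, lblocks
```

With `first_matrix = [[1, 2], [3, 4]]` and `second_matrix = [[5, 6], [7, 8]]`, the row blocks are `[1, 2]` and `[3, 4]`, and the column blocks are `[5, 7]` and `[6, 8]`.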
S202: pass one row data block and one column data block into each kernel function, obtaining multiple kernel functions; each kernel function corresponds one-to-one with the row data block and column data block passed into it.
In the embodiments of the present invention, as shown in Fig. 4, during the computation an arbitrary row data block and a column data block are passed into each kernel function: for example, Hblock1 and Lblock1 into kernel1, Hblock2 and Lblock1 into kernel2, Hblock3 and Lblock1 into kernel3, and so on. Different combinations of row data blocks and column data blocks are placed into different kernel functions, so that each kernel function performs the multiplication of its own row data block and column data block in parallel with the others.
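The one-pair-per-kernel dispatch of S202 can be emulated on the host with a thread pool (a sketch only — in the patent the pairs go to independent OpenCL kernel instances, not Python threads, and `kernel` here stands for whatever per-pair computation the kernel performs):

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch_pairs(hblocks, lblock, kernel):
    """Give each (row data block, column data block) pair to its own
    independent worker, mirroring one kernel instance per pair.
    Workers share nothing and need no communication."""
    with ThreadPoolExecutor(max_workers=len(hblocks)) as pool:
        futures = [pool.submit(kernel, hb, lblock) for hb in hblocks]
        return [f.result() for f in futures]
```

Because each worker touches only its own pair, the results can be collected in order without any synchronization between workers — the same independence property the patent relies on for scalability.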
S203: vectorize the row data block and the column data block in each kernel function, so as to obtain multiple row-vector sub-matrices and multiple column-vector sub-matrices.
In the embodiments of the present invention, after the row data blocks and column data blocks have been passed into their respective kernel functions, each kernel function can further vectorize its row data block and column data block, dividing them into smaller sub-vectors or sub-matrices. Specifically, this can be realized through the following scheme.
Optionally, vectorizing the row data block and the column data block in each kernel function includes:
using the OpenCL vector (Vector) data types, vectorizing every n floating-point values in the row data block to obtain multiple row-vector sub-matrices, and vectorizing every n floating-point values in the column data block to obtain multiple column-vector sub-matrices, where n is a positive integer.
In the embodiments of the present invention, as shown in Fig. 5, within the computation of each kernel function, the data contained in each row data block and column data block can be further vectorized using the OpenCL Vector data types: each block is divided into multiple sub-matrices, each containing n floating-point values.
In the embodiments of the present invention, the value of n can be chosen according to the computing capability of the current platform: if the platform's computing capability is strong, n can be set smaller; if it is weak, n can be set larger. Optionally, n = 4. For example, every 4 adjacent single-precision floating-point values in the row data block form one row-vector sub-matrix, and correspondingly every 4 adjacent single-precision floating-point values in the column data block form one column-vector sub-matrix, as in the row-vector sub-matrices HVector1, HVector2, ..., HVectorn and the column-vector sub-matrices LVector1, LVector2, ..., LVectorn shown in Fig. 5.
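The splitting into n-wide sub-vectors can be sketched as follows (plain Python standing in for OpenCL's float4/floatn vector types — an illustration, not the patent's kernel code):

```python
def split_into_subvectors(block, n=4):
    """Divide a data block into sub-vectors of n floating-point
    values each, mirroring OpenCL vector types (float4 when n = 4)."""
    return [block[i:i + n] for i in range(0, len(block), n)]
```

An 8-element block with n = 4 yields two sub-vectors of 4 values each — the HVector1, HVector2, ... pieces of Fig. 5.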
S102: perform parallel operations on the multiple sub-matrices obtained after the vectorization.
In the embodiments of the present invention, once the data blocks in each kernel function have been further vectorized, the resulting sub-matrices can be operated on in parallel using asynchronous threads.
Optionally, performing parallel operations on the multiple sub-matrices obtained after the vectorization includes:
in each kernel function, multiplying each row-vector sub-matrix with its corresponding column-vector sub-matrix in parallel.
In the embodiments of the present invention, for example, HVector1 is multiplied with LVector1, HVector2 with LVector2, ..., and HVectorn with LVectorn, and all of these multiplications are carried out in parallel. With n = 4, each thread processes 4 single-precision floating-point operations at a time, so the vector operations further improve operation efficiency. The computing units are independent of one another during the computation and require no communication, which gives the scheme good scalability and balances parallel granularity against operation efficiency.
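Inside one kernel, the vectorized pairing can be sketched as follows (an illustration in plain Python; each n-wide step stands in for one float4 multiply that OpenCL would issue as a single vector operation):

```python
def kernel_vector_dot(hblock, lblock, n=4):
    """Pair HVector_k with LVector_k, multiply component-wise n
    values at a time (n = 4 -> one float4-style operation per step),
    and accumulate the partial products."""
    total = 0.0
    for i in range(0, len(hblock), n):
        hv = hblock[i:i + n]  # row-vector sub-matrix
        lv = lblock[i:i + n]  # column-vector sub-matrix
        total += sum(a * b for a, b in zip(hv, lv))
    return total
```

With n = 4, a length-8 block takes only two vector steps instead of eight scalar multiplications, which is the efficiency gain the patent attributes to vectorization.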
In the embodiments of the present invention, the high-performance computing units operate independently, with no communication and no mutual waiting, realizing fast high-performance computation. The scheme also has good platform portability: owing to the cross-platform nature of OpenCL, it can easily be ported to any heterogeneous high-performance computing platform that supports OpenCL. Compared with traditional serial CPU matrix-vector multiplication, the use of data parallelism and vectorization greatly improves computational efficiency.
To achieve the above object, an embodiment of the present invention also proposes a device 1 for implementing matrix-vector multiplication. It should be noted that any of the method embodiments above applies to this device embodiment, and the details are not repeated here. As shown in Fig. 6, the device may include a processing module 11 and a computing module 12:
the processing module 11 is configured to, under the Open Computing Language (OpenCL) framework, vectorize the first matrix and the second matrix of the multiplication respectively;
the computing module 12 is configured to perform parallel operations on the multiple sub-matrices obtained after the vectorization.
Optionally, the processing module 11 vectorizing the first matrix and the second matrix of the multiplication under the OpenCL framework includes:
taking each row vector of the first matrix as a row data block, and each column vector of the second matrix as a column data block;
passing one row data block and one column data block into each kernel function, thereby obtaining multiple kernel functions, where each kernel function corresponds one-to-one with the row data block and column data block passed into it;
vectorizing the row data block and the column data block in each kernel function, so as to obtain multiple row-vector sub-matrices and multiple column-vector sub-matrices.
Optionally, the processing module 11 vectorizing the row data block and the column data block in each kernel function includes:
using the OpenCL Vector data types, vectorizing every n floating-point values in the row data block to obtain multiple row-vector sub-matrices, and vectorizing every n floating-point values in the column data block to obtain multiple column-vector sub-matrices, where n is a positive integer.
Optionally, the computing module 12 performing parallel operations on the multiple sub-matrices obtained after the vectorization includes:
in each kernel function, multiplying each row-vector sub-matrix with its corresponding column-vector sub-matrix in parallel.
Optionally, n = 4.
The scheme of the embodiments of the present invention includes: under the Open Computing Language (OpenCL) framework, vectorizing the first matrix and the second matrix of the multiplication respectively; and performing parallel operations on the multiple sub-matrices obtained after the vectorization. Through this scheme, the method can be deployed on high-performance computing platforms, makes full use of computer hardware resources, greatly shortens the computation time, and improves operation efficiency.
It should be noted that the embodiments described above are provided only to facilitate understanding by those skilled in the art and do not limit the scope of protection of the embodiments of the present invention. Without departing from the inventive concept of the embodiments, any obvious substitutions, improvements, and the like made by those skilled in the art fall within the scope of protection of the embodiments of the present invention.
Claims (10)
1. A method for implementing matrix-vector multiplication, characterized in that the method comprises:
under the Open Computing Language (OpenCL) framework, vectorizing a first matrix and a second matrix of the multiplication respectively;
performing parallel operations on a plurality of sub-matrices obtained after the vectorization.
2. The method for implementing matrix-vector multiplication of claim 1, characterized in that vectorizing the first matrix and the second matrix of the multiplication under the Open Computing Language (OpenCL) framework comprises:
taking each row vector of the first matrix as a row data block, and each column vector of the second matrix as a column data block;
passing one row data block and one column data block into each kernel function to obtain a plurality of kernel functions, wherein each kernel function corresponds one-to-one with the row data block and the column data block passed into it;
vectorizing the row data block and the column data block in each kernel function to obtain a plurality of row-vector sub-matrices and a plurality of column-vector sub-matrices.
3. The method for implementing matrix-vector multiplication of claim 2, characterized in that vectorizing the row data block and the column data block in each kernel function comprises:
using the OpenCL vector (Vector) data types, vectorizing every n floating-point values in the row data block to obtain a plurality of row-vector sub-matrices, and vectorizing every n floating-point values in the column data block to obtain a plurality of column-vector sub-matrices, wherein n is a positive integer.
4. The method for implementing matrix-vector multiplication of claim 3, characterized in that performing parallel operations on the plurality of sub-matrices obtained after the vectorization comprises:
in each kernel function, multiplying each row-vector sub-matrix with its corresponding column-vector sub-matrix in parallel.
5. The method for implementing matrix-vector multiplication of claim 3, characterized in that n = 4.
6. A device for implementing matrix-vector multiplication, characterized in that the device comprises a processing module and a computing module:
the processing module is configured to, under the Open Computing Language (OpenCL) framework, vectorize a first matrix and a second matrix of the multiplication respectively;
the computing module is configured to perform parallel operations on a plurality of sub-matrices obtained after the vectorization.
7. The device for implementing matrix-vector multiplication of claim 6, characterized in that the processing module vectorizing the first matrix and the second matrix of the multiplication under the Open Computing Language (OpenCL) framework comprises:
taking each row vector of the first matrix as a row data block, and each column vector of the second matrix as a column data block;
passing one row data block and one column data block into each kernel function to obtain a plurality of kernel functions, wherein each kernel function corresponds one-to-one with the row data block and the column data block passed into it;
vectorizing the row data block and the column data block in each kernel function to obtain a plurality of row-vector sub-matrices and a plurality of column-vector sub-matrices.
8. The device for implementing matrix-vector multiplication of claim 7, characterized in that the processing module vectorizing the row data block and the column data block in each kernel function comprises:
using the OpenCL vector (Vector) data types, vectorizing every n floating-point values in the row data block to obtain a plurality of row-vector sub-matrices, and vectorizing every n floating-point values in the column data block to obtain a plurality of column-vector sub-matrices, wherein n is a positive integer.
9. The device for implementing matrix-vector multiplication of claim 8, characterized in that the computing module performing parallel operations on the plurality of sub-matrices obtained after the vectorization comprises:
in each kernel function, multiplying each row-vector sub-matrix with its corresponding column-vector sub-matrix in parallel.
10. The device for implementing matrix-vector multiplication of claim 8, characterized in that n = 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710506697.3A CN107256203A (en) | 2017-06-28 | 2017-06-28 | The implementation method and device of a kind of matrix-vector multiplication |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107256203A true CN107256203A (en) | 2017-10-17 |
Family
ID=60024258
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710506697.3A Pending CN107256203A (en) | 2017-06-28 | 2017-06-28 | The implementation method and device of a kind of matrix-vector multiplication |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107256203A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102411558A (en) * | 2011-10-31 | 2012-04-11 | 中国人民解放军国防科学技术大学 | Vector processor oriented large matrix multiplied vectorization realizing method |
CN103631761A (en) * | 2012-08-29 | 2014-03-12 | 睿励科学仪器(上海)有限公司 | Method for matrix operation and rigorous wave coupling analysis through parallel processing architecture |
CN105426344A (en) * | 2015-11-09 | 2016-03-23 | 南京大学 | Matrix calculation method of distributed large-scale matrix multiplication based on Spark |
US20160140084A1 (en) * | 2014-11-14 | 2016-05-19 | Advanced Micro Devices, Inc. | Efficient sparse matrix-vector multiplication on parallel processors |
- 2017-06-28 (CN): application CN201710506697.3A filed; publication CN107256203A, status Pending
Non-Patent Citations (1)
Title |
---|
刘文志 et al., *OpenCL Heterogeneous Parallel Computing: Principles, Mechanisms and Optimization Practice* (《OpenCL异构并行计算 原理、机制与优化实践》), 机械工业出版社 (China Machine Press), 31 January 2016 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109726357A (en) * | 2017-10-27 | 2019-05-07 | 阿里巴巴集团控股有限公司 | Matrix multiplication calculation method and calculating equipment |
CN109726357B (en) * | 2017-10-27 | 2023-04-07 | 阿里巴巴集团控股有限公司 | Matrix multiplication computing method and computing device |
CN111339490A (en) * | 2020-02-18 | 2020-06-26 | 三星(中国)半导体有限公司 | Matrix multiplication computing method and device |
CN111339490B (en) * | 2020-02-18 | 2024-04-19 | 三星(中国)半导体有限公司 | Matrix multiplication calculation method and device |
CN112632464A (en) * | 2020-12-28 | 2021-04-09 | 上海壁仞智能科技有限公司 | Processing device for processing data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20171017 |