CN109445850A

CN109445850A - Matrix transposition method and system based on the Sunway 26010 processor

Info

Publication number: CN109445850A
Application number: CN201811094916.2A
Authority: CN
Inventors: 胡波; 李明; 李一明; 秦旭; 彭星洪; 李晋
Original assignee: Chengdu Shen Wei Technology Co Ltd
Current assignee: Chengdu Shen Wei Technology Co Ltd
Priority date: 2018-09-19
Filing date: 2018-09-19
Publication date: 2019-03-08

Abstract

The present invention relates to a matrix transposition method based on the Sunway 26010 processor, comprising the following steps: S1, dividing a matrix A stored in the main core into 64 submatrices and numbering the 64 submatrices; S2, numbering the 64 slave cores in correspondence with the numbers of the 64 submatrices, and reading each of the 64 submatrices into the slave core whose number corresponds to that submatrix; S3, transposing the submatrix in each slave core, obtaining 64 transposed slave cores; S4, arranging the 64 transposed slave cores, in order of slave-core number, into a matrix B of 8 × 8 form, and transposing matrix B by inter-core register communication to obtain a matrix C; S5, storing matrix C to the main core, completing the transposition. By decomposing a larger matrix into multiple smaller blocks and then carrying out block transposition and matrix transfer in parallel, the transposition efficiency is improved.

Description

Matrix transposition method and system based on the Sunway 26010 processor

Technical field

The present invention relates to the field of matrix transposition methods for processors, and in particular to a matrix transposition method and system based on the Sunway 26010 processor.

Background art

The Sunway 26010 processor is a high-performance computing processor developed independently in China. It uses an extended ALPHA-architecture instruction set and contains 4 core groups. Each core group consists of one operation control core (the main core, a 64-bit RISC general-purpose processing unit) and one computing core array, i.e. 64 computing cores (slave cores) arranged in a mesh of 8 rows and 8 columns. The main core and the slave cores all support 256-bit vector floating-point instruction extensions. Each slave core contains 32 registers, 64 KB of user-controllable LDM and 16 KB of program space; direct access to the local LDM has the lowest latency, and the slave-core hardware pipeline supports issuing memory-access instructions and floating-point instructions at the same time. Within the slave-core array, register-level communication is available in units of one vector length: each slave core can broadcast or receive data within its own row or column. The row and column register communication buffers are FIFO (First Input First Output) buffers, and an entry is cleared once it has been read. An asynchronous DMA data transfer mechanism between the main core and the slave cores reads data from main memory into the slave-core LDM or writes it back from the slave-core LDM to main memory, and DMA offers several data transfer modes. Because the processor performs well, more and more computing libraries are being ported to this platform for use in related applications.

However, when matrix operations are carried out on the Sunway 26010 processor, the data usually has to be read row by row into the slave cores' local memory (LDM) for computation. After the computation, if the matrix has to be transposed, the data must be sent back to the main core by direct memory access (DMA) and transposed there; if further parallel computation on the slave cores is needed after the transposition, the data must again be sent by DMA to the slave-core LDM. This process is cumbersome and severely limits computing performance.

Summary of the invention

To solve the above technical problem, the present invention provides a matrix transposition method based on the Sunway 26010 processor.

The technical solution of the present invention to the above technical problem is as follows: a matrix transposition method based on the Sunway 26010 processor, one core group of which comprises 1 main core and 64 slave cores, the method comprising the following steps:

S1: dividing the matrix A stored in the main core into 64 submatrices, and numbering the 64 submatrices.

S2: numbering the 64 slave cores in correspondence with the numbers of the 64 submatrices, and reading each of the 64 submatrices into the slave core whose number corresponds to that submatrix.

S3: transposing the submatrix in each slave core, obtaining 64 transposed slave cores.

S4: arranging the 64 transposed slave cores, in order of slave-core number, into a matrix B of 8 × 8 form, and transposing the matrix B by inter-core register communication to obtain a matrix C.

S5: storing the matrix C into the main core, completing the transposition.

The beneficial effects of the invention are as follows: a larger matrix is decomposed into multiple smaller blocks, and block transposition and matrix transfer are then carried out in parallel, while the low access latency of the slave cores' local memory (LDM) is fully exploited, so that transposition efficiency is improved. Making full use of the parallel computing capability of the slave-core array of one core group of the processor can substantially improve the computing performance of the processor.

Based on the above technical solution, the present invention can also be improved as follows.

Further, S1 specifically comprises:

S11: determining the matrix A to be a matrix of m × n form, where m denotes the number of rows and n the number of columns of the matrix.

S12: dividing the matrix A into 64 submatrices, wherein, when m is divisible by 64, m is divided into 64 equal parts and n is kept unchanged, the 64 resulting submatrices having the forms m₀ × n, m₁ × n, ..., m₆₃ × n.

The beneficial effect of this further solution is that decomposing a larger matrix into multiple smaller block submatrices allows the transposition efficiency to be improved in the subsequent steps.

Further, S1 specifically also comprises:

S13: when m is not divisible by 64 and dividing m by 64 gives quotient d and remainder s, dividing the matrix A into 64 submatrices and numbering the 64 submatrices consecutively from 0, wherein each submatrix with a number less than s is a (d + 1) × n matrix and each submatrix with a number greater than or equal to s is a d × n matrix.

Further, S3 is specifically implemented by transposing the submatrix in each slave core by means of strided copies within the slave-core LDM.

The beneficial effect of this further solution is that transposing the submatrix in each slave core by strided copies within the slave-core LDM exploits the high access performance of the slave-core LDM as far as possible, so that the performance of submatrix transposition improves to some extent.
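As an illustration of what a strided copy does, the following is a minimal sketch in Python. It is not Sunway code: the function name `transpose_strided` and the assumption that the submatrix is held as a flat row-major buffer are introduced here only for the example. Consecutive source elements are written to destination positions that are a fixed stride (the row count) apart.

```python
def transpose_strided(src, rows, cols):
    """Transpose a rows x cols matrix stored row-major in the flat list src.

    Hypothetical helper for illustration: element (r, c) of the source is
    written to position (c, r) of the destination, so consecutive source
    elements land `rows` positions apart in the destination buffer.
    """
    dst = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            dst[c * rows + r] = src[r * cols + c]  # destination stride is `rows`
    return dst
```

For a 2 × 3 input `[1, 2, 3, 4, 5, 6]` the function returns the 3 × 2 matrix `[1, 4, 2, 5, 3, 6]` in row-major order.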

Further, S4 is specifically implemented by transposing the matrix B by inter-core register communication with token passing.

The beneficial effect of this further solution is that a data exchange scheme based on row and column tokens is designed on top of the row and column register communication within the slave-core array. It realizes data transposition inside the slave-core array, reduces the number of data transfers between the main core and the slave cores, reduces the complexity of the main-core program, and improves matrix transposition performance to some extent.

To solve the above technical problem, the present invention also provides a matrix transposition system based on the Sunway 26010 processor.

The technical solution of the present invention to the above technical problem is as follows: a matrix transposition system based on the Sunway 26010 processor, one core group of which comprises 1 main core and 64 slave cores, the system comprising:

A matrix division module, configured to divide the matrix A stored in the main core into 64 submatrices and to number the 64 submatrices.

A submatrix read module, configured to number the 64 slave cores and to read each of the 64 submatrices into the slave core whose number corresponds to that submatrix.

A slave-core submatrix transposition module, configured to transpose the submatrix in each slave core, obtaining 64 transposed slave cores.

A slave-core transposition exchange module, configured to arrange the 64 transposed slave cores, in order of slave-core number, into a matrix B of 8 × 8 form, and to transpose the matrix B by inter-core register communication to obtain a matrix C.

A data storage module, configured to store the matrix C into the main core, completing the transposition.

Further, the matrix division module is specifically configured to:

determine the matrix A to be a matrix of m × n form, where m denotes the number of rows and n the number of columns of the matrix; and

divide the matrix A into 64 submatrices, wherein, when m is divisible by 64, m is divided into 64 equal parts and n is kept unchanged, the 64 resulting submatrices having the forms m₀ × n, m₁ × n, ..., m₆₃ × n.

Further, the matrix division module is also configured to:

when m is not divisible by 64 and dividing m by 64 gives quotient d and remainder s, divide the matrix A into 64 submatrices and number the 64 submatrices consecutively from 0, wherein each submatrix with a number less than s is a (d + 1) × n matrix and each submatrix with a number greater than or equal to s is a d × n matrix.

Further, the slave-core submatrix transposition module is specifically configured to transpose the submatrix in each slave core by means of strided copies within the slave-core LDM.

Further, the slave-core transposition exchange module is specifically configured to transpose the matrix B by inter-core register communication with token passing.

Brief description of the drawings

Fig. 1 is a flow chart of the matrix transposition method based on the Sunway 26010 processor according to an embodiment of the present invention;

Fig. 2 is a detailed flow chart of step 1 of the embodiment of the present invention;

Fig. 3 is a detailed flow chart of step 2 of the embodiment of the present invention;

Fig. 4 is a detailed flow chart of step 4 of the embodiment of the present invention;

Fig. 5 is a detailed flow chart of step 6 of the embodiment of the present invention;

Fig. 6 is a schematic diagram of the positions, before and after transposition, of the blocks obtained by partitioning the original matrix when the matrix occupies more than 2 MB, in the embodiment of the present invention;

Fig. 7 is an architecture diagram of the Sunway 26010 heterogeneous many-core processor of the embodiment of the present invention;

Fig. 8 is a schematic diagram of the slave-core array of the Sunway 26010 processor of the embodiment of the present invention.

Detailed description of the embodiments

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the invention and are not intended to limit its scope.

As shown in Fig. 1, an embodiment of the present invention provides a matrix transposition method based on the Sunway 26010 processor. The Sunway 26010 processor comprises 4 core groups, one core group of which comprises 1 main core and 64 slave cores. The method comprises the following steps:

S5: storing the matrix C into the main core, completing the transposition.

It should be noted that each of the 4 core groups can carry out matrix transposition independently and in parallel.

Optionally, S3 is specifically implemented by transposing the submatrix in each slave core by means of strided copies within the slave-core LDM. Each slave core transposes its submatrix in turn. From the dimensions m × n of the matrix A to be processed, the slave-core number, and parameters such as the dimensions of each submatrix stored in the slave-core LDM, it computes the data that must be communicated for the current transposed submatrix. It then performs a row receive, row send, column receive or column send according to the token passed, and maintains the related state. The slave cores each maintain their own state, so no communication is needed for it. Here, a strided copy means that consecutive source elements are not written to adjacent locations: each time an element is copied into the destination array, the write position moves by a fixed stride.

For example, a single token is a number from 0 to 63, and the token value starts at 0. At the start, the slave core with ID 0 and the other slave cores in its column, 8 slave cores in total, initiate the token passing, i.e. 0 is written into the register. Next, the slave core with ID 1 and the other slave cores in its column, 8 in total, obtain the token with value 0. Each slave core in core 1's column transfers the transposed data needed by submatrix 0 to the slave core in its own row that lies in core 0's column. When core 0 and the slave cores in core 0's column obtain token 0 again, core 0 initiates the column communication, which proceeds in the same way as the row communication, and the transposition of submatrix 0 is completed.

Optionally, S4 is specifically implemented by transposing the matrix B by inter-core register communication with token passing. Specifically, the 64 slave cores form an 8 × 8 array, namely the matrix B. When matrix B is transposed, the 8 rows execute in parallel. In each row, the slave cores of that row initiate row communication in turn by passing a row token. After the row communication of every row is complete, the slave cores in the column containing the submatrix currently being transposed initiate column communication in turn by passing a column token. When this communication is complete, the transposition of one submatrix is finished. The cycle repeats until all submatrices have been transposed, which completes the transposition of matrix A.

In a practical application scenario, because the LDM of a slave core on the Sunway 26010 many-core platform is limited in size, a splitting method for the matrix A was designed that uses the computing capability of the slave cores and the inter-core data exchange capability: the matrix A is divided into 64 submatrices. If the original matrix occupies less than 2 MB (1 MB being 1024 × 1024 bytes) and the dimensions of A are m × n, A is divided into 64 submatrices A₀, A₁, A₂, A₃, ..., A₆₃ with dimensions m₀ × n, m₁ × n, ..., m₆₃ × n respectively. The submatrices A₀, A₁, A₂, A₃, ..., A₆₃ are read in turn into the numbered slave cores 0, 1, 2, 3, ..., 63.

Optionally, to divide the matrix into 64 groups as evenly as possible, m is divided by 64. When m is divisible by 64, m is divided into 64 equal parts and n is kept unchanged, the 64 resulting submatrices having the forms m₀ × n, m₁ × n, ..., m₆₃ × n.

If m is not divisible by 64, each slave core whose ID is less than the number of surplus rows is given one extra row, in order. For example, if dividing m by 64 leaves remainder s, the remaining s rows are added one by one, in order of row number, to the submatrices A₀, A₁, A₂, A₃, ..., Aₛ₋₁, each of which receives exactly one extra row. For instance, when 9 rows are assigned to 3 cores, each slave core gets 3 rows. When 10 rows are assigned to 3 cores, the slave cores get 4, 3 and 3 rows respectively: the remainder of 10 divided by 3 is 1, so each slave core whose ID is less than the remainder gets one extra row, i.e. slave core 0 gets 4 rows.
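The row-splitting rule just described can be sketched in a few lines of Python. The function name `split_rows` and its `cores` parameter are hypothetical names chosen for this illustration; only the quotient and remainder rule comes from the description.

```python
def split_rows(m, cores=64):
    """Return the number of rows given to each slave core, in core-ID order.

    With quotient d and remainder s of m / cores, cores with ID < s get
    d + 1 rows and the remaining cores get d rows.
    """
    d, s = divmod(m, cores)
    return [d + 1 if core_id < s else d for core_id in range(cores)]
```

Calling `split_rows(9, 3)` and `split_rows(10, 3)` reproduces the two examples in the text: three rows per core in the first case, and 4, 3 and 3 rows in the second.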

Next, from the matrix dimensions m and n, the address of the matrix in the main core, the element data type and similar parameters, each slave core computes the address offset and the data length for its DMA read from main memory, then initiates a DMA transfer that reads its submatrix into its LDM.
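A minimal sketch of this offset and length calculation follows, in Python. It assumes the matrix is stored contiguously in row-major order in main memory and that offsets are measured in bytes from the matrix base address; the function name `dma_read_params` and its parameters are hypothetical, and the sketch does not call any real DMA interface.

```python
def dma_read_params(core_id, m, n, elem_size=8, cores=64):
    """Return (byte_offset, byte_length) of the submatrix read by core_id.

    Assumptions for illustration: row-major contiguous storage, elem_size
    bytes per element (8 for double precision), and the d / d + 1 row split
    where cores with ID below the remainder s get one extra row.
    """
    d, s = divmod(m, cores)
    rows = d + 1 if core_id < s else d
    first_row = core_id * d + min(core_id, s)  # rows owned by lower-numbered cores
    return first_row * n * elem_size, rows * n * elem_size
```

Because the offsets and lengths are derived from the same split, the 64 regions tile the matrix without gaps or overlap.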

It should be noted that the LDM (Local Direct Memory) of a slave core is the slave core's local direct memory, and DMA (Direct Memory Access) is direct memory access.

Then, because of how register communication works within the slave-core array, row communication must take place between slave cores in the same row and column communication between slave cores in the same column; communication across rows and columns is not possible. For example, slave cores 0 to 7, which share a row, can carry out row communication with one another, and slave cores 0, 8, ..., 56, which share a column, can carry out column communication with one another, but slave core 1 and slave core 8 cannot communicate directly.
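The constraint can be expressed as a small predicate. The sketch below is in Python and assumes the row-major numbering implied by the examples (cores 0 to 7 in one row; cores 0, 8, ..., 56 in one column); the names `can_register_communicate` and `relay_cores` are hypothetical. The second function lists the two cores through which a pair in different rows and columns could be linked in two hops, which is the situation the token scheme addresses.

```python
SIDE = 8  # the slave-core array is 8 x 8

def can_register_communicate(a, b, side=SIDE):
    """True if cores a and b share a row or a column of the array."""
    return a // side == b // side or a % side == b % side

def relay_cores(a, b, side=SIDE):
    """Cores that share a row with one of a, b and a column with the other."""
    row_a, col_a = divmod(a, side)
    row_b, col_b = divmod(b, side)
    return row_a * side + col_b, row_b * side + col_a
```

For cores 1 and 8, `relay_cores(1, 8)` gives cores 0 and 9: core 0 shares a row with core 1 and a column with core 8, and core 9 shares a column with core 1 and a row with core 8.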

The submatrices in the slave cores therefore have to be transposed by passing tokens. Specifically, the data of the submatrices is exchanged by the following steps, which amount to a communication scheme based on row and column tokens. Each slave core switches between the states start, column receive token, column receive transposition data, row receive token, row receive transposition data, and end. The slave core holding a token sends data to the slave core that corresponds to the token value, i.e. the slave core identified by the token value or a slave core in the same column or row as it, thereby realizing the matrix transposition. The implementation proceeds as follows.

Step 1: for the 8 × 8 slave-core array, the slave core in the first column of each row generates a row token with value 0; the other slave cores of each row are in the row-receive-token state. The detailed flow of step 1 is shown in Fig. 2.

Step 2: the slave core holding a submatrix, for example A₀, holds the token. It allocates in its LDM the space for the transposed matrix A'₀ and places the data belonging to A'₀ into that space. After this operation it passes the row token, by inter-core row communication, to the next slave core in its row (here id denotes the slave-core ID value, i.e. the slave-core number) and enters the row-receive-transposition-data state. The detailed flow of step 2 is shown in Fig. 3.

Step 3: at the same time, the other slave cores in the same column as the slave core holding submatrix A₀, such as slave cores 8, 16, ..., 56, also hold a row token. Each of them also allocates a temporary space in its LDM, sized to hold the data from the 8 slave cores of its row that is needed for the transposition of A₀; the other operations are the same as in step 2.

Step 4: in this way the slave cores of the 8 rows exchange, in parallel, the data needed for the transposition of submatrix A₀. Any other slave core in a row, after obtaining the row token, checks the token value and transfers its local data to the slave core of its row that is identified by the token or lies in the same column as that core. Assuming the token value is v, the index within the current row of the slave core to which the data is sent is computed by the formula below (where "&" denotes bitwise AND, and "0x" and "h" mark an immediate in hexadecimal notation). The slave core then passes the token to the adjacent next slave core. When the row token reaches the last slave core of the row, it is passed cyclically to the first slave core of the row. The detailed flow of step 4 is shown in Fig. 4.

The formula for computing the row communication index from the token value is:

Index = v & 0x07h;

Step 5: when the slave core identified by the row token value, or a slave core in the same column as the core corresponding to that value, obtains the row token of the same value for the second time, the data of its row has been fully transferred. The slave core identified by the row token value generates a column token whose value equals the value of the row token. The other slave cores in its column, which at that moment also hold row tokens, enter the column-receive-token state. Because each of these slave cores has already allocated storage in its LDM and stored there the data of its row needed for the transposition of, for example, A₀, all that remains is to pass the column token in turn to the adjacent slave cores in the column.

Step 6: a slave core that receives the column token only needs to send the data in its LDM temporary space to the slave core in its column that corresponds to the column token value. Assuming the token value is v, the column communication index is computed by the formula below; that is, the data is sent to the slave core that holds the submatrix currently being transposed. The slave core then frees the LDM space immediately and passes the column token to the next slave core in its column. At the same time it increments the row token value by 1 and passes the row token to the next slave core in its row, and itself enters the row-receive-token state. The detailed flow of step 6 is shown in Fig. 5.

The formula for computing the column communication index from the token value is:

Index = (v & 0x038h) >> 3;

Step 7: the column token is passed in turn among the slave cores in the column of the slave core holding the submatrix being transposed. Like the row token, the column token circulates among the slave cores of one column; but when a slave core obtains a column token of the same value for the second time, the column token is released immediately, which means that the transposition of the corresponding submatrix, for example A₀, is complete. That slave core also increments the row token value by 1 and passes it to the adjacent slave core in its row. At this point the slave core can compute the position offset of the transposed submatrix A'₀ (of dimension n₀ × m) in the main core and the data length of the submatrix, and send the transposed submatrix back by DMA to the corresponding position in main-core memory. Overlapping the DMA transfer with the inter-core communication in this way improves performance to some extent. The slave core itself then enters the row-receive-token state.

Step 8: the submatrices A₁, A₂, A₃, ..., A₆₃ are transposed in turn as described in steps 1 to 7.

Step 9: the transposition ends. This happens when a slave core receives the row token and the row token value is the maximum token value, which is 63 if every slave core has been assigned data and otherwise the ID of the last slave core that was assigned a submatrix. A slave core that is not in the same column as the slave core corresponding to that value finishes its data transfer, passes the row token to the adjacent slave core in its row, cleans up its submatrix and exits. When a slave core obtains the row token of the same value for the second time and that value is the maximum token value, it completes the column communication, the row token is released, and these slave cores also exit in succession.
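The data movement that steps 1 to 9 achieve can be checked with a small sequential simulation. The Python sketch below is not Sunway code and does not model the token state machine, the FIFO register buffers or DMA; it only reproduces the two-phase routing (gather along each row into the target column, then gather down that column into the target core) and uses the two index formulas above. All function names are hypothetical, and splitting the n rows of the transposed matrix with the same quotient and remainder rule is an assumption made for the example.

```python
P = 8  # side of the 8 x 8 slave-core array

def row_comm_index(v):
    return v & 0x07            # position within a row of the core named by token v

def col_comm_index(v):
    return (v & 0x38) >> 3     # position within a column of the core named by token v

def split_sizes(total, parts=P * P):
    d, s = divmod(total, parts)
    return [d + 1 if k < s else d for k in range(parts)]

def prefix_starts(sizes):
    starts, acc = [], 0
    for size in sizes:
        starts.append(acc)
        acc += size
    return starts

def transpose_on_grid(a):
    """Transpose the list-of-rows matrix a by simulated row and column phases."""
    m, n = len(a), len(a[0])
    row_sizes, out_sizes = split_sizes(m), split_sizes(n)
    row_start, out_start = prefix_starts(row_sizes), prefix_starts(out_sizes)
    # S2 and S3: core j reads its rows of A and transposes them locally (n x m_j).
    local_t = []
    for core in range(P * P):
        block = a[row_start[core]:row_start[core] + row_sizes[core]]
        local_t.append([[row[c] for row in block] for c in range(n)])
    result = []
    for v in range(P * P):     # token value v selects transposed submatrix A'_v
        tgt_col, tgt_row = row_comm_index(v), col_comm_index(v)
        lo, hi = out_start[v], out_start[v] + out_sizes[v]
        # Row phase: in every row i, core (i, tgt_col) gathers from its row-mates.
        staged = {}
        for i in range(P):
            staged[(i, tgt_col)] = [local_t[i * P + q][lo:hi] for q in range(P)]
        # Column phase: core (tgt_row, tgt_col), which is core v, gathers the
        # staged buffers of its column, ordered by source core ID.
        assert tgt_row * P + tgt_col == v
        pieces = [p for i in range(P) for p in staged[(i, tgt_col)]]
        for r in range(hi - lo):
            result.append([x for p in pieces for x in p[r]])
    return result
```

Every piece travels at most two hops, one inside a row and one inside a column, which is all that the row and column register communication permits.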

It should be noted that if the original matrix occupies more than 2 MB, the matrix is partitioned into blocks, with the blocks made as equal as possible. Suppose the matrix data type is double-precision floating point, so that the standard block dimension is 64 × 4096, and the original matrix has dimension m × n with m ≥ 64 and n ≥ 4096. Let s be the quotient and r the remainder of n / 4096, and s' the quotient and r' the remainder of m / 64. If neither r nor r' is 0, the matrix is divided into (s + 1) × (s' + 1) blocks, and each block is still transposed by the steps above. The blocks can also be regarded as the elements of a matrix: a block originally at position A_ij is at position A_ji after the transposition.
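The block count can be worked through with a short Python sketch. The names `block_grid` and `transposed_block_position` are hypothetical. The text only states the case in which both remainders are non-zero; treating a zero remainder as needing no extra block is an assumption added here. A standard block of 64 × 4096 double-precision values occupies 64 × 4096 × 8 bytes, which is exactly 2 MB.

```python
BLOCK_ROWS, BLOCK_COLS, DOUBLE_BYTES = 64, 4096, 8

def block_grid(m, n):
    """Return (row_blocks, col_blocks) for an m x n matrix of doubles."""
    s, r = divmod(n, BLOCK_COLS)      # s, r as in the text (columns)
    s2, r2 = divmod(m, BLOCK_ROWS)    # s', r' as in the text (rows)
    # Assumption: an extra partial block is needed only for a non-zero remainder.
    return s2 + (1 if r2 else 0), s + (1 if r else 0)

def transposed_block_position(i, j):
    """Block A_ij of the original ends up at position (j, i)."""
    return j, i
```

For m = 130 and n = 12300, the remainders are 2 and 12, so the matrix is split into 3 × 4 = 12 blocks.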

After the transposition, each slave core computes the starting position of its block and then writes the result back by DMA to the corresponding position in the main core. The transposed matrices stored in the slave-core LDMs have dimensions n₀ × m, ..., n₆₃ × m respectively. The slave cores initiate DMA to store the submatrices into main-core memory in turn, which completes the matrix transposition. The transposition result is shown in Fig. 6.

It should be noted that, to improve performance as far as possible, the slave cores each maintain their own related state, including the length of every inter-core data exchange. In addition, the messages exchanged between slave cores carry payload data as far as possible; the row and column tokens are the only state data.

This embodiment also provides a matrix transposition system based on the Sunway 26010 processor. The Sunway 26010 processor comprises 4 core groups, each core group comprising 1 main core and 64 slave cores, and the system comprises:

A submatrix read module, configured to number the 64 slave cores in correspondence with the numbers of the 64 submatrices, and to read each of the 64 submatrices into the slave core whose number corresponds to that submatrix.

A slave-core transposition exchange module, configured to arrange the 64 transposed slave cores, in order of slave-core number, into an 8 × 8 matrix B, and to transpose the matrix B by inter-core register communication to obtain a matrix C.

Optionally, the matrix division module is specifically configured to:

Optionally, the matrix division module is also configured to:

Optionally, the slave-core submatrix transposition module is specifically configured to transpose the submatrix in each slave core by means of strided copies within the slave-core LDM.

Optionally, the slave-core transposition exchange module is specifically configured to transpose the matrix B by inter-core register communication with token passing.

In a practical application scenario, as shown in Fig. 7, the Sunway 26010 heterogeneous many-core processor of this embodiment uses an extended ALPHA-architecture instruction set and contains 4 core groups, each consisting of one operation control core (the main core, a 64-bit RISC general-purpose processing unit) and one computing core array.

As shown in Fig. 8, the computing cores of one core group of the Sunway 26010 processor form an 8 × 8 processor array. The arrows in the figure show how the row and column tokens are passed within the slave-core array during matrix transposition. Row tokens are passed in all 8 rows of slave cores in parallel, whereas the column token is passed in only one column of slave cores at a time.

The processor in this embodiment involves a main-core part and a slave-core part. The main-core part mainly obtains the matrix to be transposed, allocates the matrix storage space, and initiates the matrix transposition. The slave-core part comprises an interface and a core part. The interface part handles functions such as DMA communication between the main core and the slave cores and checking of the parameters passed by the main core. For the main flow of matrix transposition using the slave-core array in this embodiment, refer to Figs. 2 to 5.

In the slave-core part, the slave cores realize the slave-core state transitions described above by passing row and column tokens, and thereby the transposition of the matrix. In the detailed implementation, instructions are arranged according to the hardware pipeline design so as to avoid dependences between computation and memory-access variables as far as possible. In the function design, the use of local variables is optimized as far as possible and register variables are used, which improves computing performance.

It should be noted that single-instruction multiple-data (Single Instruction Multiple Data, SIMD) instructions are used when reading contiguous data, which further improves memory-access performance. When the transposed matrix is written back by DMA, inter-core communication is overlapped with the DMA transfer, which improves performance. Where there are multiple blocks, the transposition of a block is also overlapped with the transfer of results, which likewise improves performance noticeably.

Table 1 compares the time taken to transpose a matrix using the main core with the time taken using this method on the Sunway TaihuLight supercomputer (excluding the time for matrix assignment and data transfer); the transposed data type is double-precision floating point.

Table 1. Matrix transposition comparison test

As Table 1 shows, as the size of the matrix being transposed grows, the cache hit rate falls, memory accesses increase, and the time taken rises sharply; the time taken by this method, by contrast, grows roughly linearly with matrix size.

In summary, the matrix transposition method and system based on the Sunway 26010 processor in this embodiment use the row and column register communication within the slave-core array to build a data exchange scheme based on row and column tokens. The scheme realizes data transposition inside the slave-core array, reduces the number of data transfers between the main core and the slave cores, reduces the complexity of the main-core program, and improves matrix transposition performance to some extent. In addition, for larger matrices a two-level decomposition is used: the larger matrix is decomposed into multiple smaller blocks, block transposition and matrix transfer are carried out in parallel, and the low access latency of the slave cores' local memory is fully exploited, so that transposition efficiency is also improved.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A matrix transposition method based on the Sunway 26010 processor, one core group of the Sunway 26010 processor comprising 1 main core and 64 slave cores, characterized by comprising the following steps:

S1: dividing the matrix A stored in the main core into 64 submatrices, and numbering the 64 submatrices;

S2: numbering the 64 slave cores in correspondence with the numbers of the 64 submatrices, and reading each of the 64 submatrices into the slave core whose number corresponds to that submatrix;

S3: transposing the submatrix in each slave core, obtaining 64 transposed slave cores;

S4: arranging the 64 transposed slave cores, in order of slave-core number, into a matrix B of 8 × 8 form, and transposing the matrix B by inter-core register communication to obtain a matrix C;

S5: storing the matrix C into the main core, completing the transposition.

2. The matrix transposition method based on the Sunway 26010 processor according to claim 1, characterized in that S1 specifically comprises:

S11: determining the matrix A to be a matrix of m × n form, where m denotes the number of rows and n denotes the number of columns;

S12: when m is divisible by 64, dividing the matrix A into 64 submatrices, wherein m is divided into 64 parts and n is kept unchanged, the 64 resulting submatrices having the forms m₀ × n, m₁ × n, ..., m₆₃ × n respectively.

3. The matrix transposition method based on the Sunway 26010 processor according to claim 2, characterized in that S1 specifically further comprises:

S13: when m is not divisible by 64 and dividing m by 64 yields a quotient d and a remainder s, dividing the matrix A into 64 submatrices and numbering the 64 submatrices consecutively starting from 0, wherein each submatrix whose number is less than s is a (d+1) × n matrix and each submatrix whose number is greater than or equal to s is a d × n matrix.

4. The matrix transposition method based on the Sunway 26010 processor according to any one of claims 1-3, characterized in that S3 is specifically implemented as: transposing the submatrix in the slave core by strided copy within the slave core LDM.

5. The matrix transposition method based on the Sunway 26010 processor according to any one of claims 1-3, characterized in that S4 is specifically implemented as: transposing the matrix B by inter-core register communication based on token passing.

6. A matrix transposition system based on the Sunway 26010 processor, wherein one core group of the Sunway 26010 processor comprises 1 master core and 64 slave cores, characterized in that the system comprises:

a matrix division module, configured to divide the matrix A stored in the master core into 64 submatrices and to number the 64 submatrices;

a submatrix reading module, configured to number the 64 slave cores in correspondence with the numbers of the 64 submatrices, and to read each of the 64 submatrices into the slave core whose number corresponds to that submatrix;

a slave-core submatrix transposition module, configured to transpose the submatrix in each slave core, to obtain 64 transposed slave cores;

a slave-core transposition exchange module, configured to arrange the 64 transposed slave cores, in order of slave core number, into a matrix B of 8 × 8 form, and to transpose the matrix B by inter-core register communication to obtain a matrix C.

7. The matrix transposition system based on the Sunway 26010 processor according to claim 6, characterized in that the matrix division module is specifically configured to:

determine the matrix A to be a matrix of m × n form, where m denotes the number of rows and n denotes the number of columns;

when m is divisible by 64, divide the matrix A into 64 submatrices, wherein m is divided into 64 parts and n is kept unchanged, the 64 resulting submatrices having the forms m₀ × n, m₁ × n, ..., m₆₃ × n respectively.

8. The matrix transposition system based on the Sunway 26010 processor according to claim 7, characterized in that the matrix division module is further configured to:

when m is not divisible by 64 and dividing m by 64 yields a quotient d and a remainder s, divide the matrix A into 64 submatrices and number the 64 submatrices consecutively starting from 0, wherein each submatrix whose number is less than s is a (d+1) × n matrix and each submatrix whose number is greater than or equal to s is a d × n matrix.

9. The matrix transposition system based on the Sunway 26010 processor according to any one of claims 6-8, characterized in that the slave-core submatrix transposition module is specifically configured to: transpose the submatrix in the slave core by strided copy within the slave core LDM.

10. The matrix transposition system based on the Sunway 26010 processor according to any one of claims 6-8, characterized in that the slave-core transposition exchange module is specifically configured to: transpose the matrix B by inter-core register communication based on token passing.
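The 8 × 8 exchange recited for matrix B can be pictured with a small Python model. This is a sketch under assumptions, not the claimed token protocol: it assumes that register communication only moves data within one row or one column of the slave core array, and it routes each payload from core (r, c) to core (c, r) in two hops through the diagonal core (r, r), which is only one plausible routing. The names `route` and `transpose_grid` are hypothetical, and tokens, FIFO buffering, and timing are not modeled.

```python
GRID = 8  # the slave core array has 8 rows and 8 columns


def route(r, c):
    """Hops carrying data from core (r, c) to core (c, r) using only row/column moves."""
    if r == c:
        return []                      # diagonal cores keep their data
    return [((r, c), (r, r)),          # row hop: stay in row r, move to column r
            ((r, r), (c, r))]          # column hop: stay in column r, move to row c


def transpose_grid(b):
    """Return C with C[c][r] = B[r][c], delivering each payload along its route."""
    c_out = [[None] * GRID for _ in range(GRID)]
    for r in range(GRID):
        for c in range(GRID):
            hops = route(r, c)
            dest = hops[-1][1] if hops else (r, c)
            c_out[dest[0]][dest[1]] = b[r][c]
    return c_out
```

In this model every hop keeps either the row index or the column index fixed, which is the property a row/column-only communication network requires.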