CN109445850A - Matrix transposition method and system based on the Shenwei 26010 processor - Google Patents

Matrix transposition method and system based on the Shenwei 26010 processor

Info

Publication number
CN109445850A
Authority
CN
China
Prior art keywords
matrix
core
transposition
submatrix
submatrices
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811094916.2A
Other languages
Chinese (zh)
Inventor
胡波
李明
李一明
秦旭
彭星洪
李晋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Shen Wei Technology Co Ltd
Original Assignee
Chengdu Shen Wei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Shen Wei Technology Co Ltd filed Critical Chengdu Shen Wei Technology Co Ltd
Priority to CN201811094916.2A priority Critical patent/CN109445850A/en
Publication of CN109445850A publication Critical patent/CN109445850A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes
    • G06F9/345 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes of multiple operands or results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Multi Processors (AREA)

Abstract

The invention relates to a matrix transposition method based on the Shenwei 26010 processor, comprising the following steps: S1, dividing the matrix A stored in the main core into 64 submatrices and numbering the 64 submatrices; S2, numbering the 64 slave cores to correspond to the numbers of the 64 submatrices, and reading each submatrix into the slave core with the matching number; S3, transposing the submatrix in each slave core, yielding 64 slave cores that each hold a transposed submatrix; S4, arranging the 64 transposed slave cores, in slave-core number order, into a matrix B of 8 × 8 form, and transposing matrix B by inter-core register communication to obtain matrix C; S5, storing matrix C into the main core to complete the transposition. By decomposing a large matrix into multiple smaller blocks and then performing the block transposition and the transfer of the matrix in parallel, transposition efficiency is improved.

Description

Matrix transposition method and system based on the Shenwei 26010 processor
Technical field
The present invention relates to the field of matrix transposition methods for processors, and in particular to a matrix transposition method and system based on the Shenwei 26010 processor.
Background
The Shenwei 26010 (SW26010) is a high-performance computing processor developed independently in China. The processor uses an extended ALPHA-architecture instruction set and contains 4 core groups. Each core group consists of one operation control core (the main core, a 64-bit RISC general-purpose processing unit) and one computing core array, i.e. 64 computing cores (slave cores) arranged in a mesh of 8 rows and 8 columns. Both the main core and the slave cores support a 256-bit vector floating-point instruction extension. Each slave core contains 32 registers, 64 KB of user-controllable LDM, and 16 KB of program space; direct access to the local LDM has very low latency, and the slave-core hardware pipeline supports issuing a memory-access instruction and a floating-point instruction at the same time. Inside the slave-core array, register-level communication is available: in units of one vector length, each slave core can broadcast or receive data within its row or column. The row and column register communication buffers are FIFO (First In, First Out) buffers, and an entry is cleared once it has been read. A DMA mechanism is provided for synchronous and asynchronous data transfer between the main core and the slave cores, so that data can be read from main memory into the slave-core LDM or written back from the slave-core LDM to main memory, and DMA supports several data transfer modes. Because the processor performs well, more and more computing libraries are being ported to this platform for use in related applications.
However, when matrix operations are performed on the Shenwei 26010 processor, data usually has to be read row by row into the slave-core local memory (LDM) for computation. After the computation, if the matrix must be transposed, the data has to be sent back to the main core by direct memory access (DMA) and transposed there. Once the transposition is finished, if further parallel computation on the slave cores is required, the data has to be sent to the slave-core LDM by DMA again. This procedure is cumbersome and severely limits computing performance.
Summary of the invention
To solve the above technical problem, the present invention provides a matrix transposition method based on the Shenwei 26010 processor.
The technical solution of the present invention is as follows: a matrix transposition method based on the Shenwei 26010 processor, in which one core group of the Shenwei 26010 processor comprises 1 main core and 64 slave cores, the method comprising the following steps:
S1: divide the matrix A stored in the main core into 64 submatrices, and number the 64 submatrices.
S2: number the 64 slave cores to correspond to the numbers of the 64 submatrices, and read each of the 64 submatrices into the slave core whose number matches that of the submatrix.
S3: transpose the submatrix in each slave core, yielding 64 slave cores that each hold a transposed submatrix.
S4: arrange the 64 transposed slave cores, in slave-core number order, into a matrix B of 8 × 8 form, and transpose matrix B by inter-core register communication to obtain matrix C.
S5: store matrix C into the main core to complete the transposition.
The beneficial effect of the present invention is that a large matrix is decomposed into multiple smaller blocks, the block transposition and the transfer of the matrix are performed in parallel, and the low access latency of the slave-core local memory (LDM) is fully exploited, so that transposition efficiency is improved. By making full use of the parallel computing capability of the slave-core array in one core group of the processor, the computing performance of the processor can be improved considerably.
Based on the above technical solution, the present invention can also be improved as follows.
Further, S1 specifically comprises:
S11: determine matrix A to be a matrix of m × n form, where m is the number of rows and n is the number of columns.
S12: divide matrix A into 64 submatrices. When m is divisible by 64, m is divided into 64 equal parts and n is left unchanged, so that the 64 resulting submatrices have the forms m_0 × n, m_1 × n, ..., m_63 × n.
The beneficial effect of this further scheme is that decomposing a large matrix into multiple smaller block submatrices allows transposition efficiency to be improved in the subsequent steps.
Further, S1 also comprises:
S13: when m is not divisible by 64, and dividing m by 64 gives quotient d and remainder s, divide matrix A into 64 submatrices and number them consecutively from 0, where the submatrices numbered less than s are (d + 1) × n matrices and the submatrices numbered greater than or equal to s are d × n matrices.
Further, S3 is implemented by transposing the submatrix in each slave core through strided copying in the slave-core LDM.
The beneficial effect of this further scheme is that transposing the submatrix through strided copying in the slave-core LDM exploits the high access performance of the LDM as far as possible, which improves submatrix transposition performance to some extent.
Further, S4 is implemented by transposing matrix B through inter-core register communication with token passing.
The beneficial effect of this further scheme is that, based on row and column register communication within the slave-core array, a data exchange scheme driven by row and column tokens is designed that transposes the data inside the slave-core array. This reduces repeated data transfers between the main core and the slave cores, lowers the complexity of the main-core program, and also improves matrix transposition performance to some extent.
To solve the above technical problem, the present invention also provides a matrix transposition system based on the Shenwei 26010 processor.
The technical solution of the present invention is as follows: a matrix transposition system based on the Shenwei 26010 processor, in which one core group of the Shenwei 26010 processor comprises 1 main core and 64 slave cores, the system comprising:
A matrix division module, configured to divide the matrix A stored in the main core into 64 submatrices and to number the 64 submatrices.
A submatrix reading module, configured to number the 64 slave cores and to read each of the 64 submatrices into the slave core whose number matches that of the submatrix.
A slave-core submatrix transposition module, configured to transpose the submatrix in each slave core, yielding 64 slave cores that each hold a transposed submatrix.
A slave-core transposition exchange module, configured to arrange the 64 transposed slave cores, in slave-core number order, into a matrix B of 8 × 8 form, and to transpose matrix B by inter-core register communication to obtain matrix C.
A data storage module, configured to store matrix C into the main core to complete the transposition.
Further, the matrix division module is specifically configured to:
Determine matrix A to be a matrix of m × n form, where m is the number of rows and n is the number of columns.
Divide matrix A into 64 submatrices. When m is divisible by 64, m is divided into 64 equal parts and n is left unchanged, so that the 64 resulting submatrices have the forms m_0 × n, m_1 × n, ..., m_63 × n.
Further, the matrix division module is also configured to:
When m is not divisible by 64, and dividing m by 64 gives quotient d and remainder s, divide matrix A into 64 submatrices and number them consecutively from 0, where the submatrices numbered less than s are (d + 1) × n matrices and the submatrices numbered greater than or equal to s are d × n matrices.
Further, the slave-core submatrix transposition module is specifically configured to transpose the submatrix in each slave core through strided copying in the slave-core LDM.
Further, the slave-core transposition exchange module is specifically configured to transpose matrix B through inter-core register communication with token passing.
Brief description of the drawings
Fig. 1 is a flowchart of the matrix transposition method based on the Shenwei 26010 processor according to an embodiment of the present invention;
Fig. 2 is a detailed flowchart of step 1 of the embodiment of the present invention;
Fig. 3 is a detailed flowchart of step 2 of the embodiment of the present invention;
Fig. 4 is a detailed flowchart of step 4 of the embodiment of the present invention;
Fig. 5 is a detailed flowchart of step 6 of the embodiment of the present invention;
Fig. 6 is a schematic diagram of the positions, before and after transposition, of the blocks obtained by partitioning the original matrix when the matrix occupies more than 2 MB in the embodiment of the present invention;
Fig. 7 is an architecture diagram of the Shenwei 26010 heterogeneous many-core processor of the embodiment of the present invention;
Fig. 8 is a schematic diagram of the slave-core array of the Shenwei 26010 processor of the embodiment of the present invention.
Detailed description of the embodiments
The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the present invention and are not intended to limit its scope.
As shown in Fig. 1, an embodiment of the present invention provides a matrix transposition method based on the Shenwei 26010 processor. The Shenwei 26010 processor comprises 4 core groups, one core group of which comprises 1 main core and 64 slave cores. The method comprises the following steps:
S1: divide the matrix A stored in the main core into 64 submatrices, and number the 64 submatrices.
S2: number the 64 slave cores to correspond to the numbers of the 64 submatrices, and read each of the 64 submatrices into the slave core whose number matches that of the submatrix.
S3: transpose the submatrix in each slave core, yielding 64 slave cores that each hold a transposed submatrix.
S4: arrange the 64 transposed slave cores, in slave-core number order, into a matrix B of 8 × 8 form, and transpose matrix B by inter-core register communication to obtain matrix C.
S5: store matrix C into the main core to complete the transposition.
Note that each of the 4 core groups can carry out the matrix transposition independently and in parallel.
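The data movement in S1 to S5 can be illustrated with a small functional model. This is only a sketch under stated assumptions, not the patented implementation: it runs on ordinary Python lists, makes no hardware calls, and assumes that after the exchange in S4 slave core k holds rows of the transposed matrix in blocks of n_0 × m, ..., n_63 × m (as the description states later), with n split across the cores by the same rule as m. The names `split` and `transpose_with_cores` are hypothetical.

```python
def split(total, parts):
    """(start, count) per part; the first total % parts parts get one extra."""
    d, s = divmod(total, parts)
    spans, start = [], 0
    for i in range(parts):
        count = d + 1 if i < s else d
        spans.append((start, count))
        start += count
    return spans


def transpose_with_cores(a, cores=64):
    """Functional model of S1 to S5 on a list-of-rows matrix; no hardware calls."""
    m, n = len(a), len(a[0])
    # S1/S2: core i receives row block A_i (m_i x n).
    ldm = [a[start:start + count] for start, count in split(m, cores)]
    # S3: each core transposes its own block into n rows of m_i elements.
    local_t = [[[row[c] for row in block] for c in range(n)] for block in ldm]
    # S4: exchange. Core k collects result rows c_k .. c_k + n_k - 1, taking
    # the matching row from every core's local transpose (assumed layout).
    # S5: the collected rows are written back in core order.
    result = []
    for start, count in split(n, cores):
        for r in range(start, start + count):
            result.append([x for i in range(cores) for x in local_t[i][r]])
    return result
```

On real hardware the S4 exchange is restricted to row and column register communication; the model above only shows which data ends up where.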
Optionally, S3 is implemented by transposing the submatrix in each slave core through strided copying in the slave-core LDM. The slave cores transpose the submatrices in turn: from the dimensions m × n of the matrix A to be processed, the slave-core number, and parameters such as the dimensions of the submatrix stored in the local LDM, each slave core computes the data that must be communicated for the submatrix currently being transposed, and then performs row receive and row send, or column receive and column send, according to token passing, while maintaining the related state. Each slave core maintains its own state, so no communication is needed to keep state consistent between slave cores. Here, strided copying means that adjacent source elements are copied into another array at positions a fixed stride apart, so that the copies are no longer adjacent.
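A minimal sketch of strided copying, assuming the submatrix is held as a flat row-major buffer (the function name `transpose_strided` and the buffer layout are illustrative assumptions, not taken from the patent):

```python
def transpose_strided(src, rows, cols):
    """Transpose a row-major rows x cols buffer into a new cols x rows buffer."""
    dst = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            # Adjacent source elements (c and c + 1) land `rows` slots apart.
            dst[c * rows + r] = src[r * cols + c]
    return dst
```

For a 2 × 3 buffer `[0, 1, 2, 3, 4, 5]` the copy uses a stride of 2 in the destination, so source neighbours 0 and 1 end up two slots apart.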
For example, a single token is a number from 0 to 63, and the token value starts at 0. At the start, the slave core with ID 0 and the other slave cores in the column of core 0, 8 slave cores in total, initiate token passing, i.e. write 0 into the register. Next, the slave core with ID 1 and the other slave cores in the column of core 1, 8 in total, obtain the token with value 0; each slave core in the column of core 1 transfers the data that the transposition of submatrix 0 requires to the slave core in its own row that lies in the column of core 0. When core 0 and the slave cores in the column of core 0 obtain token 0 again, core 0 initiates column communication, which proceeds in a manner similar to row communication, completing the transposition of submatrix 0.
Optionally, S4 is implemented by transposing matrix B through inter-core register communication with token passing. Specifically, the 64 slave cores form an 8 × 8 array, i.e. matrix B, and when matrix B is transposed the 8 rows execute in parallel. In each row, the slave cores of that row initiate row communication in turn by passing a row token. After the row communication of every row is complete, the slave cores in the column containing the submatrix currently being transposed initiate column communication in turn by passing a column token; when this communication is complete, the transposition of one submatrix is finished. The cycle repeats until all submatrices have been transposed, which completes the transposition of matrix A.
In a practical application scenario, because the storage capacity of the slave-core LDM on the Shenwei 26010 many-core platform is limited, a splitting method for matrix A is designed that uses the computing capability of the slave cores and the data exchange capability between cores, and matrix A is divided into 64 submatrices. If the original matrix occupies less than 2 MB (1 MB being 1024 × 1024 bytes) and A has dimensions m × n, it is divided into 64 submatrices A_0, A_1, A_2, A_3, ..., A_63 with dimensions m_0 × n, m_1 × n, ..., m_63 × n. Submatrices A_0, A_1, A_2, A_3, ..., A_63 are read in turn into the numbered slave cores: slave core 0, slave core 1, slave core 2, slave core 3, ..., slave core 63.
Optionally, to divide the matrix into 64 groups as evenly as possible, m is divided by 64. When m is divisible by 64, m is divided into 64 equal parts and n is left unchanged, so that the 64 resulting submatrices have the forms m_0 × n, m_1 × n, ..., m_63 × n.
If m is not divisible by 64, the slave cores whose ID is less than the number of leftover rows each receive one additional row. For example, if the remainder of m divided by 64 is s, there are s leftover rows, and these s rows are added in order to submatrices A_0, A_1, ..., A_(s-1), one row per submatrix. For instance, 9 rows assigned to 3 cores gives each slave core 3 rows. 10 rows assigned to 3 cores gives the slave cores 4, 3 and 3 rows respectively: the remainder of 10 divided by 3 is 1, so the slave cores whose ID is less than the remainder get one extra row, i.e. slave core 0 gets 4 rows.
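The row distribution rule above can be written in a few lines. The helper name `rows_per_core` is hypothetical; the rule itself (cores with ID less than the remainder get one extra row) is the one described in the text.

```python
def rows_per_core(m, cores=64):
    """Rows assigned to each slave core: IDs below the remainder get d + 1."""
    d, s = divmod(m, cores)
    return [d + 1 if core_id < s else d for core_id in range(cores)]
```

With the numbers from the text, `rows_per_core(9, 3)` gives `[3, 3, 3]` and `rows_per_core(10, 3)` gives `[4, 3, 3]`.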
Next, from the matrix dimensions m and n, the address of the matrix in main-core memory, the element data type, and similar parameters, each slave core computes the address offset and the data length for its DMA read of main memory, then initiates a DMA transfer that reads its submatrix into the slave-core LDM.
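As a rough illustration of that offset calculation, the sketch below assumes that matrix A is stored contiguously in row-major order and that elements are 8-byte doubles; neither assumption is stated at this point in the patent, and `dma_read_params` is a hypothetical helper, not a DMA API.

```python
def dma_read_params(core_id, m, n, elem_size=8, cores=64):
    """Byte offset and byte length of the row block read by one slave core."""
    d, s = divmod(m, cores)
    rows = d + 1 if core_id < s else d
    # Cores before this one hold d rows each, plus one extra for IDs below s.
    first_row = core_id * d + min(core_id, s)
    offset_bytes = first_row * n * elem_size
    length_bytes = rows * n * elem_size
    return offset_bytes, length_bytes
```

The offsets of consecutive cores tile the matrix without gaps, which is a quick sanity check on any real implementation of this step.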
Note that the LDM of a slave core (Local Direct Memory) is the slave core's local direct memory, and DMA (Direct Memory Access) is direct memory access.
Then, because of how register communication works in the slave-core array, row communication must take place between slave cores in the same row and column communication between slave cores in the same column; cross communication is not possible. Slave cores 0 to 7 are in the same row, and any two of them can perform row communication. Slave cores 0, 8, ..., 56 are in the same column and can perform column communication with one another, but slave core 1 and slave core 8 cannot communicate directly.
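The constraint can be checked with a small helper. The mapping from core ID to (row, column) follows from the examples in the text (cores 0 to 7 share a row; cores 0, 8, ..., 56 share a column); the function names are hypothetical.

```python
def core_position(core_id):
    """(row, column) of a slave core in the 8 x 8 array."""
    return core_id >> 3, core_id & 0x07


def direct_link(a, b):
    """Return 'row', 'column', or None if the two cores cannot talk directly."""
    if a == b:
        return None
    (row_a, col_a), (row_b, col_b) = core_position(a), core_position(b)
    if row_a == row_b:
        return "row"
    if col_a == col_b:
        return "column"
    return None
```

`direct_link(1, 8)` returns `None`, which is why data between such cores has to be relayed through a core that shares a row with one and a column with the other.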
Therefore the submatrices in the slave cores must be transposed by token passing. Specifically, the data of a submatrix is exchanged through the following steps, which form a communication scheme based on row and column tokens. Every slave core switches between states such as start, column receive token, column receive transposition data, row receive token, row receive transposition data, and end. The slave core holding a token sends data to the slave core corresponding to the token value, i.e. the slave core whose ID equals the token value or a slave core in the same column or the same row as that core, thereby realizing the matrix transposition. The implementation proceeds as follows.
Step 1: for the 8 × 8 slave-core array, the slave core in the first column of each row generates a row token with value 0, and the other slave cores in each row are in the row-receive-token state. The detailed flow of step 1 is shown in Fig. 2.
Step 2: the slave core that holds a submatrix, e.g. A_0, holds the token. It allocates space in LDM for the transposed submatrix A'_0 and then places the data belonging to A'_0 into that space. After this operation, it passes the row token to the next slave core in the same row by inter-core row communication; here id denotes the slave-core ID value, i.e. the number of the slave core. The slave core then enters the row-receive-transposition-data state. The detailed flow of step 2 is shown in Fig. 3.
Step 3: at the same time, the other slave cores in the same column as the slave core holding submatrix A_0, e.g. slave cores 8, 16, ..., 56, also hold a row token. Each also allocates a temporary space in LDM, sized to hold the data from the 8 slave cores of its row that the transposition of A_0 requires; the other operations are the same as in step 2.
Step 4: in this way, the slave cores of the 8 rows exchange the data required for the transposition of submatrix A_0 in parallel. Any other slave core in a row, after obtaining the row token, checks the token value and transfers its local data to the slave core in its row that either is the core named by the token or lies in the same column as that core. Assuming the token value is v, the index within the row of the slave core to which data is sent is computed by the formula below (where "&" denotes bitwise AND, and "0x" and "h" mark an immediate value in hexadecimal form). The slave core then passes the token to the adjacent next slave core; when the row token reaches the last slave core of the row, it is passed cyclically back to the first slave core of the row. The detailed flow of step 4 is shown in Fig. 4.
The row communication index is computed from the token value as:
Index = v & 0x07h;
Step 5: when the slave core whose ID equals the row token value, or a slave core in the same column as that core, receives a row token with the same value for the second time, the data of its row has been fully transferred. The slave core whose ID equals the row token value then generates a column token whose value equals that of the row token. Meanwhile the other slave cores in the same column, which hold row tokens, enter the column-token-receive state. Since these slave cores have already allocated storage in LDM and stored the row data required for the transposition (e.g. of A_0), they only need to pass the column token in turn to the adjacent slave core in the same column.
Step 6: a slave core that receives the column token only needs to send the data in its LDM temporary space to the slave core in the same column that corresponds to the column token value. Assuming the token value is v, the column communication index is computed by the formula below, i.e. the data is sent to the slave core that holds the submatrix currently being transposed. The slave core then immediately clears its LDM and passes the column token to the next slave core in the same column; at the same time it increments the row token value by 1, passes the row token to the slave cores in its row, and itself enters the row-receive-token state. The detailed flow of step 6 is shown in Fig. 5.
The column communication index is computed from the token value as:
Index = (v & 0x038h) >> 3;
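The two index formulas can be combined to show how data from any slave core reaches the core named by token value v in two hops, first along its row and then along a column. This is an illustrative sketch: `row_index` and `column_index` are direct transcriptions of the formulas above, while `route` is a hypothetical helper added here and omits the token state machine entirely.

```python
def row_index(v):
    """Position within a row (the column number) of the core named by v."""
    return v & 0x07


def column_index(v):
    """Position within a column (the row number) of the core named by v."""
    return (v & 0x38) >> 3


def route(src, v):
    """Relay core and destination core for data sent from src toward core v."""
    src_row = src >> 3
    relay = src_row * 8 + row_index(v)          # row hop: stay in src's row
    dest = column_index(v) * 8 + row_index(v)   # column hop: stay in v's column
    return relay, dest
```

For example, data from slave core 9 for token value 0 is first sent along row 1 to slave core 8 and then along column 0 to slave core 0.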
Step 7: the column token is passed in turn among the slave cores in the same column as the slave core holding the submatrix being transposed. Like the row token, the column token circulates among the slave cores of a column, but when a slave core obtains a column token with the same value for the second time, the column token is released immediately, which means the transposition of the corresponding submatrix, e.g. A_0, is complete. That slave core also increments the row token value by 1 and passes it to the adjacent slave core in the same row. At this point the slave core can compute the position offset in main-core memory and the data length of the transposed submatrix A'_0 (of dimensions n_0 × m) and send the transposed submatrix back to the corresponding position in main-core memory by DMA. Overlapping the DMA transfer with inter-core communication here gives some performance gain. The slave core then enters the row-receive-token state.
Step 8: repeat steps 1 to 7 to complete the transposition of submatrices A_1, A_2, A_3, ..., A_63 in turn.
Step 9: the transposition ends. When a slave core receives the row token and the row token value is the maximum token value (63 if every slave core has been assigned data, otherwise the ID of the last slave core that was assigned a submatrix), a slave core that is not in the same column as the core corresponding to that value passes the row token to the adjacent slave core in its row after it has finished sending its data, then clears its submatrix and exits. When a slave core receives a row token with the same value for the second time and that value is the maximum token value, then after the column communication completes the row token is released, and the slave cores exit in succession.
Note that if the original matrix occupies more than 2 MB, the matrix is partitioned into blocks, with the blocks made as equal as possible. Assume the matrix data type is double-precision floating point; the standard block dimensions are then 64 × 4096. Let the original matrix have dimensions m × n with m ≥ 64 and n ≥ 4096, and let s = n / 4096 with remainder r and s' = m / 64 with remainder r'. If neither r nor r' is 0, the matrix is divided into (s + 1) × (s' + 1) blocks, and each block is still transposed by the steps above. The blocks can also be treated as elements of a matrix: an original block A_ij becomes A_ji after transposition.
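The block arithmetic can be checked numerically. The sketch below follows the text for the case where both remainders are non-zero; treating a zero remainder as "no extra block" is an assumption added here, since the patent does not spell out that case. The name `block_grid` is hypothetical.

```python
def block_grid(m, n, block_rows=64, block_cols=4096):
    """Number of blocks along rows and along columns for an m x n matrix."""
    s, r = divmod(n, block_cols)
    s_prime, r_prime = divmod(m, block_rows)
    col_blocks = s + (1 if r else 0)            # assumption: no extra block if r == 0
    row_blocks = s_prime + (1 if r_prime else 0)
    return row_blocks, col_blocks


# A standard block of doubles is exactly 2 MB, and one of its 64 rows is
# 32 KB, which fits in the 64 KB LDM of a single slave core.
STANDARD_BLOCK_BYTES = 64 * 4096 * 8
ROW_BYTES = 4096 * 8
```

For a 100 × 5000 matrix this gives a 2 × 2 grid, matching (s + 1) × (s' + 1) with s = 1 and s' = 1.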
After the transposition, each slave core must compute the starting position of its block and then write the data back to the corresponding position in main-core memory by DMA. The matrices stored in the slave-core LDMs after transposition have dimensions n_0 × m, ..., n_63 × m; each slave core initiates DMA to store its submatrix into main-core memory in turn, which completes the matrix transposition. The result is shown in Fig. 6.
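For the multi-block case, the starting position of a transposed block can be sketched as follows. This assumes the transposed n × m result is stored row-major and that all blocks before the one in question have the standard size; the patent only says blocks are divided "as equal as possible", so real block boundaries may differ. `transposed_block_origin` is a hypothetical helper.

```python
def transposed_block_origin(i, j, m, block_rows=64, block_cols=4096):
    """Element offset in the row-major n x m result where block A_ij lands."""
    dest_row = j * block_cols   # result rows come from the columns of A
    dest_col = i * block_rows   # result columns come from the rows of A
    return dest_row * m + dest_col
```

Successive rows of the transposed block are then m elements apart in main memory, so the write-back is a strided transfer rather than one contiguous run.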
Note that, to improve performance as far as possible, each slave core maintains the relevant state itself, including the length of each inter-core data exchange. As far as possible, messages exchanged between slave cores carry only valid data; the row and column tokens are plain integer data.
This embodiment also provides a matrix transposition system based on the Shenwei 26010 processor. The Shenwei 26010 processor comprises 4 core groups, each core group comprises 1 main core and 64 slave cores, and the system comprises:
A matrix division module, configured to divide the matrix A stored in the main core into 64 submatrices and to number the 64 submatrices.
A submatrix reading module, configured to number the 64 slave cores to correspond to the numbers of the 64 submatrices, and to read each of the 64 submatrices into the slave core whose number matches that of the submatrix.
A slave-core submatrix transposition module, configured to transpose the submatrix in each slave core, yielding 64 slave cores that each hold a transposed submatrix.
A slave-core transposition exchange module, configured to arrange the 64 transposed slave cores, in slave-core number order, into an 8 × 8 matrix B, and to transpose matrix B by inter-core register communication to obtain matrix C.
A data storage module, configured to store matrix C into the main core to complete the transposition.
Optionally, the matrix division module is specifically configured to:
Determine matrix A to be a matrix of m × n form, where m is the number of rows and n is the number of columns.
Divide matrix A into 64 submatrices. When m is divisible by 64, m is divided into 64 equal parts and n is left unchanged, so that the 64 resulting submatrices have the forms m_0 × n, m_1 × n, ..., m_63 × n.
Optionally, the matrix division module is also configured to:
When m is not divisible by 64, and dividing m by 64 gives quotient d and remainder s, divide matrix A into 64 submatrices and number them consecutively from 0, where the submatrices numbered less than s are (d + 1) × n matrices and the submatrices numbered greater than or equal to s are d × n matrices.
Optionally, the slave-core submatrix transposition module is specifically configured to transpose the submatrix in each slave core through strided copying in the slave-core LDM.
Optionally, the slave-core transposition exchange module is specifically configured to transpose matrix B through inter-core register communication with token passing.
In practical application scene, as shown in fig. 7, the 26010 isomery many-core processor of Shen prestige in the present embodiment uses expansion The ALPHA architecture instruction set of exhibition, processor use 4 core groups, each core group by an operation control core (main core, 64 Position risc architecture general processor unit) and an arithmetic core array.
If any shown in Fig. 8, arithmetic core in Shen Wei 26010 processor, one core group, is one 8 × 8 processor array It constitutes.Arrow is illustrated when carrying out matrix transposition, from the transfer mode of row, column token between core array in figure.It wherein plays drinking games board Transmitting is that 8 rows execute parallel from core, and the transmitting of column token only exists 1 column every time and transmitted from core.
The processor in this embodiment includes a master-core part and a slave-core part. The master-core part mainly handles acquisition of the matrix to be transposed, allocation of matrix storage space, and initiation of the matrix transposition. The slave-core part includes an interface part and a core part. The interface part handles DMA communication between the master and slave cores, checking of the parameters passed by the master core, and similar functions. For the main flow of matrix transposition using the slave-core array in this embodiment, refer to Figs. 2 to 5.
In the slave-core implementation, the slave cores pass row and column tokens to realize the slave-core state transitions described above, and thereby the transposition of the matrix. The detailed implementation is designed around the hardware pipeline and instructions, avoiding dependencies between computation and memory-access variables as far as possible. The function design optimizes the use of local variables as far as possible and uses register variables to improve performance.
It should be noted that single-instruction multiple-data (SIMD) instructions are used when reading contiguous data, which further improves memory access performance. When the transposed matrix is written back using DMA, inter-core communication is overlapped with DMA transfer, which improves performance. When there are multiple blocks, the transposition of one block is also overlapped with the transfer of results, which likewise improves performance noticeably.
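The benefit of overlapping block transposition with result transfer can be estimated with a simple two-stage pipeline model. This is a rough sketch: it assumes every block takes the same transposition time and the same transfer time, the function names are hypothetical, and the numbers in the example are made up for illustration rather than measured.

```python
def sequential_time(blocks, t_transpose, t_transfer):
    """Total time when each block is transposed and then transferred in turn."""
    return blocks * (t_transpose + t_transfer)


def overlapped_time(blocks, t_transpose, t_transfer):
    """Total time when the transfer of block k overlaps the transposition
    of block k + 1 (idealized two-stage pipeline)."""
    if blocks == 0:
        return 0
    # First transposition and last transfer cannot be hidden; in between,
    # the slower of the two stages sets the pace.
    return t_transpose + (blocks - 1) * max(t_transpose, t_transfer) + t_transfer


# Illustrative values only: 4 blocks, 3 time units to transpose, 2 to transfer.
print(sequential_time(4, 3, 2))  # 20
print(overlapped_time(4, 3, 2))  # 14
```

Under this model the overlapped schedule approaches the cost of the slower stage alone as the number of blocks grows.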
Table 1 compares the time taken for matrix transposition using the master core with the time taken using this method on the Sunway TaihuLight (Shenwei Taihu Light) supercomputer, excluding the time for matrix assignment and data transfer; the transposed data type is double-precision floating point.
Table 1: Matrix transposition comparison test
As Table 1 shows, as the scale of the matrix transposition increases, the cache hit rate declines, memory accesses increase, and the time consumed rises substantially. The time consumed by this method grows roughly linearly with matrix size.
In summary, the matrix transposition method and system based on the Shenwei 26010 processor in this embodiment use row and column register communication within the slave-core array to implement a data exchange scheme based on row and column tokens, so that data is transposed inside the slave-core array. This reduces repeated data transfers between the master core and the slave cores, reduces the complexity of the master-core program, and improves matrix transposition performance to some extent. In addition, for larger matrices a two-level decomposition is used: the larger matrix is decomposed into multiple smaller blocks, block transposition proceeds in parallel with matrix transfer, and the low access latency of the slave-core local memory (LDM) is fully exploited, which further improves transposition efficiency.
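The overall decomposition can be checked with a small sequential sketch: divide A into 64 row blocks, transpose each block by strided copy, and place each transposed block into the result. This is an illustration under assumptions, not the implementation of the embodiment: the matrix is assumed to be stored row-major in a flat list, the 64 slave cores are simulated by a loop, and the write-back by DMA and token passing is replaced by direct strided placement. All function names are hypothetical.

```python
def partition_rows(m, parts=64):
    """(start_row, row_count) per block; blocks below the remainder get d + 1 rows."""
    d, s = divmod(m, parts)
    blocks, start = [], 0
    for k in range(parts):
        rows = d + 1 if k < s else d
        blocks.append((start, rows))
        start += rows
    return blocks


def transpose_block_strided(flat, rows, cols):
    """Transpose a row-major rows x cols block by strided copy."""
    out = [0] * (rows * cols)
    for c in range(cols):
        # Read column c of the source with stride `cols`,
        # write it as contiguous row c of the transposed block.
        out[c * rows:(c + 1) * rows] = flat[c::cols]
    return out


def transpose_via_blocks(a, m, n, parts=64):
    """Transpose a row-major m x n matrix block by block."""
    c = [0] * (m * n)  # result is n x m, row-major
    for start, rows in partition_rows(m, parts):
        block = a[start * n:(start + rows) * n]          # rows x n block
        block_t = transpose_block_strided(block, rows, n)  # n x rows block
        for r in range(n):
            # Transposed block occupies columns start..start+rows-1 of the result.
            c[r * m + start:r * m + start + rows] = block_t[r * rows:(r + 1) * rows]
    return c


m, n = 130, 5
a = list(range(m * n))
expected = [a[i * n + j] for j in range(n) for i in range(m)]
assert transpose_via_blocks(a, m, n) == expected
```

The final assertion compares the block-wise result with a direct element-by-element transpose, which confirms that the uneven block sizes produced when m is not divisible by 64 are handled correctly.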
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A matrix transposition method based on the Shenwei 26010 processor, one core group of the Shenwei 26010 processor comprising 1 master core and 64 slave cores, characterized by comprising the following steps:
S1: dividing the matrix A stored in the master core into 64 submatrices, and numbering the 64 submatrices;
S2: numbering the 64 slave cores in correspondence with the numbers of the 64 submatrices, and reading each of the 64 submatrices into the slave core whose number corresponds to that of the submatrix;
S3: transposing the submatrix in each slave core, to obtain 64 transposed slave cores;
S4: arranging the 64 transposed slave cores, in order of slave-core number, into a matrix B of 8 × 8 form, and transposing the matrix B by inter-core register communication to obtain a matrix C;
S5: storing the matrix C into the master core to complete the transposition.
2. The matrix transposition method based on the Shenwei 26010 processor according to claim 1, characterized in that S1 specifically comprises:
S11: determining the matrix A to be a matrix of m × n form, where m denotes the number of rows of the matrix and n denotes the number of columns;
S12: dividing the matrix A into 64 submatrices, where, when m is divisible by 64, m is divided into 64 parts and n remains unchanged, so that the 64 resulting submatrices have the forms m0 × n, m1 × n, ..., m63 × n.
3. The matrix transposition method based on the Shenwei 26010 processor according to claim 2, characterized in that S1 specifically further comprises:
S13: when m is not divisible by 64, and dividing m by 64 gives a quotient d and a remainder s, dividing the matrix A into 64 submatrices and numbering the 64 submatrices consecutively starting from 0, where each submatrix whose number is less than s is a (d+1) × n matrix and each submatrix whose number is greater than or equal to s is a d × n matrix.
4. The matrix transposition method based on the Shenwei 26010 processor according to any one of claims 1-3, characterized in that S3 is specifically implemented by transposing the submatrix in each slave core by strided copy within the slave-core LDM.
5. The matrix transposition method based on the Shenwei 26010 processor according to any one of claims 1-3, characterized in that S4 is specifically implemented by transposing the matrix B by inter-core register communication based on token passing.
6. A matrix transposition system based on the Shenwei 26010 processor, one core group of the Shenwei 26010 processor comprising 1 master core and 64 slave cores, characterized by comprising:
a matrix division module, for dividing the matrix A stored in the master core into 64 submatrices and numbering the 64 submatrices;
a submatrix read module, for numbering the 64 slave cores in correspondence with the numbers of the 64 submatrices, and reading each of the 64 submatrices into the slave core whose number corresponds to that of the submatrix;
a slave-core submatrix transposition module, for transposing the submatrix in each slave core, to obtain 64 transposed slave cores;
a slave-core transposition exchange module, for arranging the 64 transposed slave cores, in order of slave-core number, into a matrix B of 8 × 8 form, and transposing the matrix B by inter-core register communication to obtain a matrix C;
a data storage module, for storing the matrix C into the master core to complete the transposition.
7. The matrix transposition system based on the Shenwei 26010 processor according to claim 6, characterized in that the matrix division module is specifically used for:
determining the matrix A to be a matrix of m × n form, where m denotes the number of rows of the matrix and n denotes the number of columns;
dividing the matrix A into 64 submatrices, where, when m is divisible by 64, m is divided into 64 parts and n remains unchanged, so that the 64 resulting submatrices have the forms m0 × n, m1 × n, ..., m63 × n.
8. The matrix transposition system based on the Shenwei 26010 processor according to claim 7, characterized in that the matrix division module is also used for:
when m is not divisible by 64, and dividing m by 64 gives a quotient d and a remainder s, dividing the matrix A into 64 submatrices and numbering the 64 submatrices consecutively starting from 0, where each submatrix whose number is less than s is a (d+1) × n matrix and each submatrix whose number is greater than or equal to s is a d × n matrix.
9. The matrix transposition system based on the Shenwei 26010 processor according to any one of claims 6-8, characterized in that the slave-core submatrix transposition module is specifically used for transposing the submatrix in each slave core by strided copy within the slave-core LDM.
10. The matrix transposition system based on the Shenwei 26010 processor according to any one of claims 6-8, characterized in that the slave-core transposition exchange module is specifically used for transposing the matrix B by inter-core register communication based on token passing.
CN201811094916.2A 2018-09-19 2018-09-19 Matrix transposition method and system based on the Shenwei 26010 processor Pending CN109445850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811094916.2A CN109445850A (en) 2018-09-19 2018-09-19 Matrix transposition method and system based on the Shenwei 26010 processor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811094916.2A CN109445850A (en) 2018-09-19 2018-09-19 Matrix transposition method and system based on the Shenwei 26010 processor

Publications (1)

Publication Number Publication Date
CN109445850A (en) 2019-03-08

Family

ID=65530524

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811094916.2A Pending CN109445850A (en) 2018-09-19 2018-09-19 Matrix transposition method and system based on the Shenwei 26010 processor

Country Status (1)

Country Link
CN (1) CN109445850A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181894A (en) * 2019-07-04 2021-01-05 山东省计算中心(国家超级计算济南中心) In-core group self-adaptive adjustment operation method based on Shenwei many-core processor
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN113704691A (en) * 2021-08-26 2021-11-26 中国科学院软件研究所 Small-scale symmetric matrix parallel three-diagonalization method of Shenwei many-core processor
CN117934532A (en) * 2024-03-22 2024-04-26 西南石油大学 Parallel optimization method and system for image edge detection

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020198911A1 (en) * 2001-06-06 2002-12-26 Blomgren James S. Rearranging data between vector and matrix forms in a SIMD matrix processor
WO2011064898A1 (en) * 2009-11-26 2011-06-03 Nec Corporation Apparatus to enable time and area efficient access to square matrices and its transposes distributed stored in internal memory of processing elements working in simd mode and method therefore
CN106775594A (en) * 2017-01-13 2017-05-31 中国科学院软件研究所 Heterogeneous many-core implementation method for sparse matrix-vector multiplication based on the domestic Shenwei 26010 processor
CN107168683A (en) * 2017-05-05 2017-09-15 中国科学院软件研究所 High-performance implementation method for GEMM dense matrix multiplication on the domestic Shenwei 26010 many-core CPU
CN108509270A (en) * 2018-03-08 2018-09-07 中国科学院软件研究所 High-performance parallel implementation method for the K-means algorithm on the domestic Shenwei 26010 many-core processor

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181894A (en) * 2019-07-04 2021-01-05 山东省计算中心(国家超级计算济南中心) In-core group self-adaptive adjustment operation method based on Shenwei many-core processor
CN112181894B (en) * 2019-07-04 2022-05-31 山东省计算中心(国家超级计算济南中心) In-core group adaptive adjustment operation method based on Shenwei many-core processor
CN112446007A (en) * 2019-08-29 2021-03-05 上海华为技术有限公司 Matrix operation method, operation device and processor
CN113704691A (en) * 2021-08-26 2021-11-26 中国科学院软件研究所 Small-scale symmetric matrix parallel three-diagonalization method of Shenwei many-core processor
CN113704691B (en) * 2021-08-26 2023-04-25 中国科学院软件研究所 Small-scale symmetric matrix parallel tri-diagonalization method of Shenwei many-core processor
CN117934532A (en) * 2024-03-22 2024-04-26 西南石油大学 Parallel optimization method and system for image edge detection
CN117934532B (en) * 2024-03-22 2024-06-04 西南石油大学 Parallel optimization method and system for image edge detection


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190308
