CN102360344A - Matrix processor as well as instruction set and embedded system thereof - Google Patents

Matrix processor as well as instruction set and embedded system thereof

Info

Publication number
CN102360344A
Authority
CN
China
Prior art keywords
matrix
instruction
matrix processor
floating
data
Prior art date
Legal status
Granted
Application number
CN2011103039194A
Other languages
Chinese (zh)
Other versions
CN102360344B (en)
Inventor
张斌
梅魁志
郑南宁
董培祥
张书锋
李宇海
赵晨
殷浩
Current Assignee
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201110303919.4A
Publication of CN102360344A
Application granted
Publication of CN102360344B
Status: Expired - Fee Related
Anticipated expiration

Abstract

The invention provides a matrix processor, an instruction set for it, and an embedded system built around it. The matrix processor comprises an external data interface, an IRAM (instruction RAM), a DRAM (data RAM), and a matrix processor core. The external data interface connects the IRAM and DRAM of the matrix processor to an external memory, writes the matrix processor's instructions, and exchanges data with the outside. The IRAM and DRAM serve as the matrix processor's buffers: the IRAM receives the instruction sequence written by an external module, while the DRAM receives matrices or other data written by the external module as well as results written back by the matrix processor core, to be used by the matrix processor or read by the external module, thereby completing the data exchange between the matrix processor and the external module. The matrix processor core performs instruction fetch, decoding, computation, write-back of results, and control. The matrix processor provided by the invention can independently complete a variety of matrix operations and other mathematical operations.

Description

Matrix processor, instruction set thereof, and embedded system
[technical field]
The present invention relates to the field of processor technology, and in particular to a matrix processor, its instruction set, and an embedded system.
[background technology]
Matrix operations are a fundamental problem in scientific and engineering computation. They are not only a subject of mathematics itself, but also an important mathematical tool for many scientific and engineering disciplines, indispensable in fields such as physics, mechanics, computer science, and aerospace. In computer science and technology in particular, many areas rely on matrix operations, for example digital image processing, computer graphics, pattern recognition, machine vision, artificial intelligence, scientific computing, and general algorithm design and analysis.
Classical matrix operations are usually realized as serial computations on a general-purpose processor, which severely limits the achievable speed. Implementing matrix operations in hardware can raise the computing speed, but existing hardware implementations are mostly dedicated structures designed for one specific matrix operation, such as matrix inversion or matrix multiplication; they require a large amount of hardware resources and offer poor flexibility. Moreover, practical algorithms that involve matrix operations typically combine several kinds of matrix operations with other non-matrix operations. When a hardware circuit for a particular matrix operation is added to such a computing system, long interaction and waiting times with the other parts of the system are needed, so the efficiency gained by adding a dedicated matrix-operation hardware module is very limited.
There is therefore a strong demand for a matrix operation unit that can independently complete an entire algorithm.
[summary of the invention]
The object of the present invention is to provide a matrix processor and an embedded system thereof that can independently complete various matrix operations and other mathematical operations.
In order to solve the above problem, the present invention adopts the following technical solution:
A matrix processor comprises an external data interface, an IRAM, a DRAM, and a matrix processor core.
The external data interface connects the IRAM and DRAM of the matrix processor with an external memory; it writes the matrix processor's instructions and exchanges data with the outside.
The IRAM and DRAM serve as the buffers of the matrix processor. The IRAM receives the instruction sequence written by an external module. The DRAM receives the matrices or other data written by the external module as well as the results written back by the matrix processor core, to be used by the matrix processor or read by the external module, thereby completing the data exchange between the matrix processor and the external module.
The matrix processor core performs instruction fetch, decoding, computation, result write-back, and control.
In a further improvement of the present invention, the external data interface, IRAM, DRAM, and matrix processor core are jointly connected to a register group, which stores the system information and interaction information of the external data interface, IRAM, DRAM, and matrix processor core.
In a further improvement, the external data interface, IRAM, DRAM, and matrix processor core are jointly connected to an interrupt generating device; interrupt requests from the external data interface, IRAM, DRAM, and matrix processor core are output to an external control module (an external CPU) through the register group and the interrupt generating device.
In a further improvement, the matrix processor core comprises an instruction fetch unit, a first decoding unit, a second decoding unit, a data read/write unit, a general-purpose register set, a floating-point unit, and a control module. The IRAM, instruction fetch unit, first decoding unit, second decoding unit, and floating-point unit are connected in sequence; the floating-point unit, general-purpose register set, data read/write unit, and DRAM are connected in sequence; and the first decoding unit is connected to the data read/write unit.
In a further improvement, the instruction fetch unit receives the fetch enable signal sent by the control module, reads instructions from the IRAM in a loop starting from the beginning, sends the instructions to the first decoding unit, and executes jump instructions. The first decoding unit receives the instructions sent by the fetch unit and decodes them according to their class: matrix operation and mathematical function instructions are converted into SIMD or floating-point operation instructions and written to the second decoding unit, while L/S and move instructions are sent to the data read/write unit. The data read/write unit receives the data address and enable signal sent by the first decoding unit, reads data from the DRAM into the general-purpose register set, writes data from the general-purpose register set back to the DRAM, and transfers data between registers of the general-purpose register set. The second decoding unit receives the SIMD and floating-point operation instructions sent by the first decoding unit, decodes them into floating-point operation instructions, and passes them to the floating-point unit. The floating-point unit comprises four first floating-point operation modules connected in parallel and one second floating-point operation module; the four parallel first modules are connected in series to the second module. The destination and source register addresses of each first floating-point operation module are controlled by the second decoding unit. The first floating-point operation modules perform extended single-precision floating-point operations, and the second floating-point operation module performs a four-input extended single-precision addition. The control module controls the operation of the matrix processor and sends an interrupt signal to the external CPU when a computation completes or an exception occurs.
In a further improvement, the matrix processor also comprises a special register, which holds the real number when the matrix processor executes an operation instruction that involves a real number.
In a further improvement, the instruction set used by this matrix processor comprises: L/S and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single-instruction-multiple-data (SIMD) instructions, and matrix operation instructions.
The L/S and move instructions perform data reads and writes between the matrix processor's buffers and registers, and between registers.
The jump instructions change the instruction execution order.
The floating-point operation instructions perform basic floating-point arithmetic, including absolute value, comparison, addition, subtraction, multiplication, division, square root, and multiply-add operations.
The mathematical function instructions compute elementary mathematical functions, including trigonometric functions, inverse trigonometric functions, logarithmic functions, and exponential functions.
The SIMD instructions perform the same operations as the floating-point operation instructions, but on several different floating-point numbers in parallel.
The matrix operation instructions perform basic and simple matrix computations, including matrix generation, matrix transpose, extraction of matrix rows or columns, summation over matrix rows or columns, addition, subtraction, multiplication, and division of a matrix by a real number, matrix addition, subtraction, and multiplication, and elementary matrix transformations.
An embedded system comprises a CPU, a bus, an SDRAM, the matrix processor, a register group, and an interrupt generating device. The CPU, the SDRAM, and the external data interface of the matrix processor are connected to the bus. The CPU is connected to the register group and the interrupt generating device through two data lines, and the register group is connected to the interrupt generating device and the matrix processor.
A working method of the embedded system comprises the following steps:
1) After the system powers up, the CPU resets, reads boot instructions from Flash, and completes processor initialization; it writes configuration parameters to the register group to configure the matrix processor; finally, it loads the operating system from the SDRAM and starts executing application programs.
2) The CPU writes the matrix processor's instruction sequence and the start and end addresses of the required data in the SDRAM to the register group, sends a start-work signal to the register group, and then releases the bus. After the matrix processor receives the start-work signal, it requests and takes the bus.
3) After taking the bus, the matrix processor reads the instruction sequence and data from the SDRAM according to the addresses the CPU wrote to the register group, writes them to the IRAM and DRAM respectively, and then releases the bus.
4) After the instruction sequence and data have been written, the matrix processor core starts working. The instruction fetch unit reads instructions from the IRAM in order and sends them to the first decoding unit; if it receives a jump instruction from the first decoding unit, it performs the corresponding instruction jump. The first decoding unit processes the received instructions as follows: L/S and move instructions are sent to the data read/write unit; jump instructions are sent back to the fetch unit; matrix and mathematical function instructions are decoded into SIMD or floating-point operation instructions and output to the second decoding unit; SIMD and floating-point operation instructions are passed straight through to the second decoding unit. The data read/write unit, according to Load instructions, reads data from the DRAM and writes it to the general-purpose register set. The second decoding unit decodes the instructions sent by the first decoding unit and sends them to the floating-point unit. The floating-point unit reads the corresponding floating-point data from the general-purpose registers according to the floating-point computation instructions supplied by the second decoding unit, performs the corresponding floating-point operations, and writes the results back to the general-purpose register set. The data read/write unit, according to Store instructions, writes data from the general-purpose register set of the matrix processor core back to the DRAM.
5) When all instructions in the IRAM have completed, the matrix processor writes the relevant parameters to the register group through the control module and raises the corresponding interrupt through the interrupt generating device. Based on the received interrupt and the overall task to be accomplished, the CPU decides whether the matrix processor should continue with further computations; if all computations are finished, it directs the matrix processor to write the results in the DRAM to the designated address region of the SDRAM.
An instruction set is the collection of instructions used by a processor for computation and control. Each processor defines, at design time, the set of instructions matched to its hardware circuits. How advanced the instruction set is also affects the processor's performance: it is both an effective tool for improving processor efficiency and an important indicator of processor capability.
Because of the functions it must provide, the matrix processor's instruction set differs from that of a general-purpose CPU, but its design still follows the principles of a reduced instruction set computer (RISC): (1) few instruction types, a regular format, and a uniform instruction length; (2) simplified addressing modes; (3) extensive register-to-register operation, with RAM accessed only by specific instructions; (4) a simplified processor structure; (5) strengthened parallelism in the processor.
To realize a matrix processor instruction set capable of completing various matrix operations and other mathematical operations, the instruction set should consist of six parts: L/S and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single-instruction-multiple-data (SIMD) instructions, and matrix operation instructions.
(1) L (Load)/S (Store) and move instructions perform data reads and writes between the matrix processor's buffers and registers, and between registers.
(2) Jump instructions change the instruction execution order.
(3) Floating-point operation instructions perform basic floating-point arithmetic, including absolute value, comparison, addition, subtraction, multiplication, division, square root, and multiply-add operations.
(4) Mathematical function instructions compute elementary mathematical functions, including trigonometric functions, inverse trigonometric functions, logarithmic functions, and exponential functions.
(5) SIMD (single instruction, multiple data) instructions perform the same operations as the floating-point operation instructions, but on several different floating-point numbers in parallel.
(6) Matrix operation instructions perform basic and simple matrix computations, including matrix generation, matrix transpose, extraction of matrix rows or columns, summation over matrix rows or columns, addition, subtraction, multiplication, and division of a matrix by a real number, matrix addition, subtraction, and multiplication, and elementary matrix transformations. By combining these different operations, the matrix processor can complete complicated algorithms.
The present invention also proposes a design method and flow for a matrix processor based on the above instruction set: first, determine according to actual needs which instructions the matrix processor must support, its computational precision, its working modes, the maximum matrix size supported, the sizes of the IRAM and DRAM, the composition of the arithmetic units, and the size of the general-purpose register set, and complete the instruction format design under these conditions; then design the structure, workflow, and exception handling of the matrix processor.
Following the above instruction set and design method, and according to a concrete usage environment, the present invention realizes an extended single-precision matrix processor and its working mechanism:
The concrete instructions to be implemented are: (1) L/S and move instructions: LM, SM, LMR, SMR, MOV; (2) jump instructions: JMP, JL, B, BL; (3) floating-point operation instructions: FABV, FCMP, FCMPZ, FCPY, FNEG, FADD, FSUB, FMUL, FDIV, FSQRT, FNMUL, FMAC, FMSB, NOP; (4) mathematical function instructions: FSINF, FCOSF, FTANF, FARCSINF, FARCCOSF, FARCTANF; (5) SIMD instructions: SABV, SCMP, SCMPEZ, SCPY, SNEG, SADD, SSUB, SMUL, SDIV, SSQRT, SNMUL, SMAC, SMSB, NOP; (6) matrix operation instructions: MGNM, MGOM, MGIM, MTRN, MRE, MCE, MCPY, MMRA, MNCA, MRA, MCA, MARN, MSRN, MMRN, MDRN, MAM, MSM, MDMM, MDD, MMM, METRS, METRM, METRA, METCS, METCM, METCA.
Computational precision: when exchanging data with the outside, floating-point numbers are single-precision floating-point numbers conforming to the IEEE 754 standard; internal calculations use an extended single-precision floating-point format based on the IEEE standard, with 1 sign bit, 8 exponent bits, and 39 mantissa bits.
Two working modes are provided: normal working mode and debug mode. The mode is selected mainly through the bus interface: when the bus interface is a master interface the processor is in working mode; when it is a slave interface the processor is in debug mode.
Floating-point matrix operations up to 32 × 32 are supported.
The IRAM size is 32 Kbits; the DRAM is 192 Kbits (holding at most four 32 × 32 matrices).
The arithmetic unit is realized by four identical floating-point operation modules (FPU1) connected in parallel, connected in series to one further floating-point operation module (FPU2). FPU1 performs absolute value, floating-point comparison, copy, negation, addition, subtraction, multiplication, division, square root, negated multiply, multiply-add, multiply-subtract, and no-operation. FPU2 performs a four-input floating-point addition.
Each register in the general-purpose register set is 48 bits wide and there are 64 registers in total; they can be used as a whole or divided into four groups of 16, each group organized as a register file with one write port and three read ports.
Compared with the prior art, the present invention has the following advantages: the matrix processor of the present invention can independently complete various matrix operations and other mathematical operations, and by combining these different operations it can complete a variety of complicated algorithms. The processor performs floating-point operations quickly, is highly flexible, and can independently complete the computation of a whole algorithm.
[description of drawings]
Fig. 1 is the basic design diagram of the matrix processor;
Fig. 2 shows the floating-point data format of the matrix processor;
Fig. 3 shows how the matrix processor is connected to the bus;
Fig. 4 is the structure diagram of the floating-point unit;
Fig. 5 shows the structure of the general-purpose register set;
Fig. 6 shows the instruction formats of the matrix processor;
Fig. 7 is the structure diagram of the matrix processor core;
Fig. 8 is the structure diagram of the matrix processor system.
[embodiment]
The present invention is described in detail below in conjunction with the accompanying drawings and embodiments.
To allow the matrix processor instruction set to perform matrix computations and to realize the computation and control of complete algorithms, the instruction set should consist of six parts: L/S and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single-instruction-multiple-data (SIMD) instructions, and matrix operation instructions.
According to actual needs, the concrete instructions implemented are: (1) L/S and move instructions: LM, SM, LMR, SMR, MOV; (2) jump instructions: JMP, JL, B, BL; (3) floating-point operation instructions: FABV, FCMP, FCMPZ, FCPY, FNEG, FADD, FSUB, FMUL, FDIV, FSQRT, FNMUL, FMAC, FMSB, NOP; (4) mathematical function instructions: FSINF, FCOSF, FTANF, FARCSINF, FARCCOSF, FARCTANF; (5) SIMD instructions: SABV, SCMP, SCMPEZ, SCPY, SNEG, SADD, SSUB, SMUL, SDIV, SSQRT, SNMUL, SMAC, SMSB, NOP; (6) matrix operation instructions: MGNM, MGOM, MGIM, MTRN, MRE, MCE, MCPY, MMRA, MNCA, MRA, MCA, MARN, MSRN, MMRN, MDRN, MAM, MSM, MDMM, MDD, MMM, METRS, METRM, METRA, METCS, METCM, METCA.
Fig. 1 shows the basic design structure of the matrix processor. The external data interface connects the IRAM and DRAM of the matrix processor, writes the matrix processor's instructions, and exchanges data with the outside; it may be connected directly to a CPU or to a system bus. The IRAM and DRAM serve as the buffers of the matrix processor: the IRAM receives the instruction sequence written by an external module; the DRAM receives the matrices or other data written by the external module and the results written back by the matrix processor core, to be used by the matrix processor or read by the external module, thereby completing the data exchange between the matrix processor and the external module. The width and depth of the IRAM and DRAM are determined according to actual requirements. The matrix processor core performs instruction fetch, decoding, computation, write-back, and control.
The computational precision of the matrix processor is as follows: when exchanging data with the outside, floating-point numbers are single-precision floating-point numbers conforming to the IEEE 754 standard; internal calculations use an extended single-precision format based on IEEE 754 with 1 sign bit, 8 exponent bits, and 39 mantissa bits, as shown in Fig. 2. The sign bit (sign) indicates whether the floating-point number is positive or negative; the exponent (exponent) is the exponent of the binary floating-point number; the mantissa (fraction) is the fractional part of the binary floating-point number. The value of an extended single-precision floating-point number is: value = (-1)^sign × 1.f × 2^(exp - bias), where f is the mantissa, exp is the exponent, and bias is the exponent offset (127, as in single precision).
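For illustration only, the following sketch decodes a 48-bit extended single-precision word of this format into a double; the bit placement (sign in the most significant bit) and the helper name are assumptions made for the example, and special cases such as zero or infinity are not modelled.

```c
#include <stdint.h>
#include <math.h>
#include <stdio.h>

/* Hypothetical software model of the 48-bit extended single-precision format:
 * bit 47 = sign, bits 46..39 = 8-bit exponent, bits 38..0 = 39-bit mantissa.
 * Zeros, subnormals, infinities and NaNs are ignored in this sketch.        */
static double mpu48_to_double(uint64_t w)
{
    int      sign = (int)((w >> 47) & 0x1);
    int      exp  = (int)((w >> 39) & 0xFF);          /* 8 exponent bits     */
    uint64_t frac = w & ((1ULL << 39) - 1);           /* 39 mantissa bits    */
    double   m    = 1.0 + (double)frac / (double)(1ULL << 39);   /* 1.f      */
    double   v    = ldexp(m, exp - 127);              /* 1.f * 2^(exp-bias)  */
    return sign ? -v : v;
}

int main(void)
{
    /* 1.5 * 2^0: sign 0, exponent 127, top mantissa bit set */
    uint64_t w = ((uint64_t)127 << 39) | (1ULL << 38);
    printf("%f\n", mpu48_to_double(w));   /* prints 1.500000 */
    return 0;
}
```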
Working modes: normal working mode and debug mode. The external module is an embedded CPU connected to the matrix processor through the bus, as shown in Fig. 3. The mode is selected through the bus interface. When the RAMs of the matrix processor are connected to the bus through a bus master interface, the processor is in working mode: the CPU writes to the control registers to start the matrix processor, the matrix processor then reads the required data from memory by itself and performs the computation, sends an interrupt to the CPU when the computation is finished, and during this time the CPU cannot read or write the matrix processor's IRAM and DRAM. When the RAMs are connected to the bus through a bus slave interface, the processor is in debug mode, and the CPU can read and write the matrix processor's IRAM and DRAM.
Floating-point matrix operations up to 32 × 32 are supported.
The IRAM size is 32 Kbits (1024 × 32 bits). The DRAM is 192 Kbits (4 × 32 × 32 × 48 bits, holding at most four 32 × 32 matrices); matrix data is written into the data RAM contiguously, row by row, starting at the matrix start address.
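A small sketch of the row-major layout implied here; the word-addressed helper and its name are illustrative assumptions rather than part of the design.

```c
/* Row-major layout assumed from the description: a matrix with `cols` columns
 * stored contiguously from `start` (addresses counted in 48-bit words).
 * Element (i, j), zero-based, then sits at:                                  */
static unsigned dram_elem_addr(unsigned start, unsigned cols,
                               unsigned i, unsigned j)
{
    return start + i * cols + j;
}
```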
The floating-point unit is shown in Fig. 4. It is realized by four identical floating-point operation modules (FPU1) connected in parallel, connected in series to one further floating-point operation module (FPU2). FPU1 performs absolute value, comparison of two floating-point numbers, comparison of a floating-point number with 0, copy, negation, addition, subtraction, multiplication, division, square root, negated multiply, multiply-add, multiply-subtract, and no-operation. FPU1 has two kinds of input: the operation-valid signals from the decoding unit, and the operands from the general-purpose register set and the special register. The operation-valid signals from the decoding unit cover absolute value, comparison of two floating-point numbers, comparison with 0, copy, negation, addition, subtraction, multiplication, division, square root, negated multiply, multiply-add, multiply-subtract, no-operation, and direct output. The operands come from the general-purpose register set and the special register: normally the three read ports of the general-purpose register set (read port 1, read port 2, read port 3) correspond to the three operand input ports of FPU1 (operand input port 1, operand input port 2, operand input port 3); when the special-register valid signal is asserted, operand input port 2 takes the output data of the special register instead. FPU2 performs a four-input floating-point addition. An output control module receives the outputs of the four FPU1 modules and of FPU2 and selects the appropriate operation result to output.
Fig. 5 shows the structure of the general-purpose register set. Each register is 48 bits wide and there are 64 registers in total; they can be used as a whole or in four groups of 16, each group organized as a register file with 1 write port and 3 read ports, and each group corresponding to one FPU1. The address of the general-purpose register set is 6 bits and consists of two parts: the upper 4 bits are the address within a group and the lower 2 bits are the group number. When the registers are used as a whole, addresses run from 0 to 63; when the register set is used as 4 groups, addresses within a group run from 0 to 15 and group numbers run from 0 to 3. Fig. 5(a) shows the structure of one group of the general-purpose register set; it consists of 3 register banks with identical structure and function. The inputs of a group are: write data information, comprising the write address, write enable, and write data; and read data information, comprising the read address. The outputs are the read ports of the 3 register banks (read port 1, read port 2, read port 3). Fig. 5(b) shows the structure of the whole general-purpose register set, which consists of the 4 register groups, 4 multiplexers (MUX1, MUX2, MUX3, MUX4), and one 2-to-4 decoder. MUX1 selects, according to a control signal, the correct write address from the different write-address inputs and passes it to the register set; the in-group address is sent to register groups 1-4 simultaneously, while the group number is sent to the 2-to-4 decoder, whose outputs are the write-enable signals (group numbers 00, 01, 10, 11 enable register groups 1-4 respectively). MUX2 selects, according to the same control signal, the correct write data to be written into the register groups. MUX3 selects, according to a control signal, the correct read address from the different read-address inputs and passes it to the register groups. MUX4 selects, according to a control signal, the output data of the correct group as the output of the general-purpose register set.
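The 6-bit register address split described above can be summarized by the following sketch; the struct and helper names are assumptions used only for illustration.

```c
/* Sketch of the 6-bit general-purpose register address described above:
 * upper 4 bits = address within a group, lower 2 bits = group number. */
struct reg_addr { unsigned group; unsigned index; };

static struct reg_addr split_reg_addr(unsigned addr6)   /* addr6 in 0..63 */
{
    struct reg_addr a;
    a.group = addr6 & 0x3;          /* low 2 bits: which of the 4 groups  */
    a.index = (addr6 >> 2) & 0xF;   /* high 4 bits: register 0..15 in it  */
    return a;
}
```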
Under the above conditions, the format of the matrix processor instructions is determined as shown in Fig. 6. The matrix processor instruction set is divided into six classes: (1) L/S and move instructions, which perform data exchange between the matrix processor's data RAM and the general-purpose register set; the move instruction performs data exchange between registers of the MPU general-purpose register set and between the general-purpose register set and the special register; (2) jump instructions, which change the instruction address and include conditional and unconditional jumps; (3) floating-point operation instructions, which perform floating-point arithmetic, including absolute value, comparison, negation, addition, subtraction, multiplication, division, square root, and multiply-accumulate; (4) mathematical function instructions, which compute mathematical functions, including trigonometric and inverse trigonometric functions; (5) SIMD instructions, which perform single-instruction-multiple-data operations, completing several groups of floating-point operations at once; (6) matrix operation instructions, which perform matrix operations, including generating an all-zero matrix, an all-one matrix, or an identity matrix, transposition, extracting a row, extracting a column, copying a matrix, summing a given row or column, summing by rows, summing by columns, adding, subtracting, multiplying, or dividing a matrix by a real number, matrix addition and subtraction, matrix dot product, matrix multiplication, and elementary matrix transformations.
The instruction class is indicated by the Type field, the highest 4 bits of each instruction. The Type values for L/S and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, SIMD instructions, and matrix operation instructions are 0001, 0010, 0011, 0100, 0101, and 0110 respectively.
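For reference, the Type encodings just listed can be written as the following enumeration; the identifier names are assumptions, only the numeric values come from the text.

```c
/* Type field (top 4 bits of every instruction), values as listed above. */
enum mpu_insn_type {
    TYPE_LS_MOVE  = 0x1,  /* 0001: L/S and move instructions      */
    TYPE_JUMP     = 0x2,  /* 0010: jump instructions              */
    TYPE_FLOAT    = 0x3,  /* 0011: floating-point operation       */
    TYPE_MATHFUNC = 0x4,  /* 0100: mathematical function          */
    TYPE_SIMD     = 0x5,  /* 0101: SIMD instructions              */
    TYPE_MATRIX   = 0x6   /* 0110: matrix operation instructions  */
};
```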
(1) L/S and move.
Bits 0-3 of the instruction are the OP_code, as shown in Table 1.
Table 1: L/S and move instructions
[Table 1 is presented as an image in the original publication and is not reproduced here.]
The LM and SM instructions use format (a) of the L/S and move instructions.
LM (Load MPU): moves floating-point numbers from a block of consecutive addresses in the DRAM into consecutive registers. Cond (bits 24-27 of the instruction) gives the number of values moved as 2^(Cond-1); DRAM_start (bits 12-23) is the start address in the DRAM; Reg_start (bits 4-11) is the starting register address.
SM (Store MPU): writes a block of data from the general-purpose register set to consecutive DRAM addresses. The fields have the same meaning as in LM.
The LMR and SMR instructions use format (b) of the L/S and move instructions.
LMR (Load MPU by Register): writes one floating-point number from the DRAM into register Reg_d (bits 4-9 of the instruction), using the offset held in register Reg_s (bits 10-15). DRAM_start (bits 16-27) is the DRAM base address, Reg_s holds the DRAM offset, and Reg_d is the destination register, i.e. *Reg_d = *(DRAM_start + *Reg_s).
SMR (Store MPU by Register): writes the floating-point number in register Reg_d to the DRAM at the offset held in register Reg_s. The fields have the same meaning as in LMR.
The MOV instruction uses format (c) of the L/S and move instructions.
MOV: move. The value in the source register Reg_s (bits 12-19 of the instruction) is moved into the destination register Reg_d (bits 4-11). One contiguous range of addresses is assigned to the general-purpose register set; the remaining addresses denote special registers. Thus, when the addresses of Reg_s and Reg_d both lie in the general-purpose register set, data is exchanged between two general-purpose registers; when the address of Reg_s or Reg_d is a special-register address, data is exchanged between a general-purpose register and a special register, or between two special registers.
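A compact software model of the LM and LMR addressing semantics above; the array names, the word type, and the field decoding are assumptions made for the sketch, not part of the specification.

```c
#include <stdint.h>

/* Illustrative software model of LM and LMR addressing (registers and DRAM
 * modelled as arrays of 48-bit values held in uint64_t words).             */
static uint64_t dram[4096];   /* data RAM, word-addressed   */
static uint64_t reg[64];      /* general-purpose registers  */

/* LM: move 2^(cond-1) consecutive DRAM words into consecutive registers. */
static void exec_lm(unsigned cond, unsigned dram_start, unsigned reg_start)
{
    unsigned n = 1u << (cond - 1);
    for (unsigned k = 0; k < n; k++)
        reg[reg_start + k] = dram[dram_start + k];
}

/* LMR: *Reg_d = *(DRAM_start + *Reg_s), offset taken from register Reg_s. */
static void exec_lmr(unsigned dram_start, unsigned reg_s, unsigned reg_d)
{
    reg[reg_d] = dram[dram_start + (unsigned)reg[reg_s]];
}
```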
(2) jump instruction
Bits 0-3 of the instruction are the OP_code, as shown in Table 2; bits 4-15 are IRAM_addr; bits 16-19 are Cond, as shown in Table 3; bits 20-23 (of which only bits 20-21 are valid) are FPU_num.
Table 2: Jump instructions
[Table 2 is presented as an image in the original publication and is not reproduced here.]
JMP (Jump): jump to the instruction at IRAM_addr in the IRAM.
CJ (Conditional Jump): when the Cond condition is satisfied, jump to the instruction at IRAM_addr in the IRAM. The condition is judged on the result of the FPU indicated by FPU_num.
B (Branch): here IRAM_addr holds an offset; the jump target is the current instruction address plus the offset held in IRAM_addr.
CB (Conditional Branch): when the Cond condition is satisfied, jump to the current address plus the offset. FPU_num and Cond have the same meaning as in CJ.
Table 3: Condition codes (Cond) of the jump instructions
Cond    Meaning
0001    Zero
0010    Positive
0011    Negative
0100    Equal
0101    Greater than
0110    Less than
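The same condition encoding restated as an enumeration for reference; the identifier names are assumptions.

```c
/* Condition codes from Table 3; names are illustrative only. */
enum mpu_cond {
    COND_ZERO     = 0x1,  /* 0001 */
    COND_POSITIVE = 0x2,  /* 0010 */
    COND_NEGATIVE = 0x3,  /* 0011 */
    COND_EQUAL    = 0x4,  /* 0100 */
    COND_GREATER  = 0x5,  /* 0101 */
    COND_LESS     = 0x6   /* 0110 */
};
```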
(3) floating-point operation instruction
Bits 0-3 of the instruction are the OP_code, as shown in Table 4; bits 4-7 are Fm; bits 8-11 are Fn; bits 12-15 are Fd; bits 16-19 (of which only bits 16-17 are valid) are FPU_num.
Table 4: Floating-point operation instructions
[Table 4 is presented as an image in the original publication and is not reproduced here.]
In the floating-point operation instructions, FPU_num is the number of the FPU, indicating which FPU performs the operation; Fd is the destination register, and when a floating-point operation takes three operands, Fd also supplies one of the source operands; Fm and Fn are source registers, and single-operand floating-point operations use the floating-point number in register Fm.
FABS (Floating-point Absolute Value): floating-point absolute value.
FCMP (Floating-point Compare): floating-point comparison.
FCMPZ (Floating-point Compare with Zero): comparison of a floating-point number with 0.
FCPY (Floating-point Copy): floating-point copy.
FNEG (Floating-point Negate): floating-point negation.
FADD (Floating-point Addition): floating-point addition.
FSUB (Floating-point Subtract): floating-point subtraction.
FMUL (Floating-point Multiply): floating-point multiplication.
FDIV (Floating-point Divide): floating-point division.
FSQRT (Floating-point Square Root): floating-point square root.
FNMUL (Floating-point Negated Multiply): negated product.
FMAC (Floating-point Multiply and Accumulate): multiply-add (Fd = Fd + Fm × Fn).
FMSB (Floating-point Multiply and Subtract): multiply-subtract (Fd = Fd - Fm × Fn).
NOP: no operation.
(4) mathematical function instruction
Bits 0-3 of the instruction are the OP_code, as shown in Table 5; bits 4-11 are Reg_addr_S; bits 12-19 are Reg_addr_D; bits 20-23 (of which only bits 20-21 are valid) are FPU_num.
Table 5: Mathematical function instructions
[Table 5 is presented as an image in the original publication and is not reproduced here.]
In the mathematical function instructions, FPU_num is the number of the FPU, indicating which FPU performs the operation; Reg_addr_S is the address of the register holding the argument; Reg_addr_D is the address of the register that receives the function result.
FSINF (Floating-point Sine Function): sine function.
FCOSF (Floating-point Cosine Function): cosine function.
FTANF (Floating-point Tangent Function): tangent function.
FARCSINF (Floating-point Arc-Sine Function): arcsine function.
FARCCOSF (Floating-point Arc-Cosine Function): arccosine function.
FARCTANF (Floating-point Arc-Tangent Function): arctangent function.
(5) SIMD instruction
Bits 0-3 of the instruction are the OP_code, as shown in Table 6; bits 4-7 are Fm; bits 8-11 are Fn; bits 12-15 are Fd.
Table 6: SIMD instructions
[Table 6 is presented as an image in the original publication and is not reproduced here.]
SABV (SIMD Absolute Value): absolute value.
SCMP (SIMD Compare): comparison.
SCMPEZ (SIMD Compare with Zero): comparison with 0.
SCPY (SIMD Copy): copy.
SNEG (SIMD Negate): negation.
SADD (SIMD Addition): floating-point addition.
SSUB (SIMD Subtract): floating-point subtraction.
SMUL (SIMD Multiply): floating-point multiplication.
SDIV (SIMD Divide): floating-point division.
SSQRT (SIMD Square Root): floating-point square root.
SNMUL (SIMD Negated Multiply): negated product.
SMAC (SIMD Multiply and Accumulate): multiply-add (Fd = Fd + Fm × Fn).
SMSB (SIMD Multiply and Subtract): multiply-subtract (Fd = Fd - Fm × Fn).
NOP: no operation.
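Since a SIMD instruction applies the same floating-point operation across the four register groups (one per FPU1 module), the following sketch models an SMAC over four lanes; the lane indexing and the helper names are assumptions made for the example.

```c
/* Illustrative model of a 4-lane SIMD multiply-add (SMAC):
 * the same Fd = Fd + Fm * Fn is applied in each of the four register
 * groups, one lane per FPU1 module. Registers are modelled as doubles.  */
#define NUM_GROUPS 4
#define REGS_PER_GROUP 16

static double regfile[NUM_GROUPS][REGS_PER_GROUP];

static void exec_smac(unsigned fd, unsigned fm, unsigned fn)
{
    for (unsigned lane = 0; lane < NUM_GROUPS; lane++)
        regfile[lane][fd] += regfile[lane][fm] * regfile[lane][fn];
}
```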
(6) matrix operation instruction
Each complete matrix operation instruction consists of 2 or 3 instruction words. For matrix operations that involve a real number: when I = 1, the instruction has 3 words and the 3rd word is the real number participating in the operation; when I = 0, the instruction has 2 words and the real number is taken from the special register real_num. All other matrix instructions have 2 words. Format (a) of the matrix operation instruction in Fig. 6 is the 1st word of the instruction; format (b) or (c) is the 2nd word (when the matrix instruction is an elementary matrix transformation, i.e. when OP_code1 is 7, the format is (c); all other instructions use format (b)). In format (b), A, B, and D are 5 bits each; in format (c), the low 5 bits of A, B, C, and D are valid data and the 6th bit is 0. The real number participating in the operation is the 3rd word of the matrix operation instruction.
In the 1st word of a matrix operation instruction, bits 0-3 are OP_code1, bits 4-15 are DRAM_start, and bits 16-27 are DRAM_result. In the 2nd word: for format (b), bits 0-3 are OP_code2; bits 4-7 are the low 4 bits of A; bits 8-11 are the low 4 bits of B; bits 12-15 are the low 4 bits of D; bit 16 is the 5th bit of A; bit 17 is the 5th bit of B; bit 18 is the 5th bit of D; bit 19 is I; and bits 20-31 are MB_start. For format (c), bits 0-3 are OP_code2; bit 7 is I; bits 8-13 (of which only bits 8-12 are valid) are D; bits 14-19 (of which only bits 14-18 are valid) are C; bits 20-25 (of which only bits 20-24 are valid) are B; and bits 26-31 (of which only bits 26-30 are valid) are A.
Type is 0110. DRAM_result is the start address in the DRAM where the result of the matrix operation is stored; DRAM_start is the start address of matrix A in the DRAM; MB_start is the start address of the 2nd matrix of the operation; OP_code1 gives the class of matrix operation and, combined with OP_code2 in the 2nd word, specifies the concrete matrix operation to perform, as shown in Table 7. OP_code1 and OP_code2 are given in hexadecimal; numbers not listed are undefined opcodes.
Table 7: Matrix operation instructions
[Table 7 is presented as an image in the original publication and is not reproduced here.]
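To make the field layout of these instructions concrete, the sketch below packs the two words of a format (b) matrix instruction; the function name and argument order are assumptions, and placing the Type field in bits 28-31 of the 1st word follows the statement that Type occupies the top 4 bits of each instruction.

```c
#include <stdint.h>

/* Illustrative packing of a two-word matrix instruction, format (b).
 * Word 1: bits 28-31 Type (0110), 16-27 DRAM_result, 4-15 DRAM_start,
 *         0-3 OP_code1.
 * Word 2: bits 20-31 MB_start, 19 I, bits 18/17/16 = bit 5 of D/B/A,
 *         12-15 / 8-11 / 4-7 = low 4 bits of D/B/A, 0-3 OP_code2.       */
static void pack_matrix_insn_b(uint32_t word[2],
                               unsigned op1, unsigned op2,
                               unsigned dram_start, unsigned dram_result,
                               unsigned mb_start, unsigned a, unsigned b,
                               unsigned d, unsigned i_flag)
{
    word[0] = (0x6u << 28) | ((dram_result & 0xFFFu) << 16)
            | ((dram_start & 0xFFFu) << 4) | (op1 & 0xFu);

    word[1] = ((mb_start & 0xFFFu) << 20) | ((i_flag & 1u) << 19)
            | (((d >> 4) & 1u) << 18) | (((b >> 4) & 1u) << 17)
            | (((a >> 4) & 1u) << 16)
            | ((d & 0xFu) << 12) | ((b & 0xFu) << 8) | ((a & 0xFu) << 4)
            | (op2 & 0xFu);
}
```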
MGNM (Matrix Generate Null Matrix): generate an all-zero matrix of size A × B.
MGOM (Matrix Generate One's Matrix): generate an all-one matrix of size A × B.
MGIM (Matrix Generate Identity Matrix): generate an identity matrix of size A × A.
MTRN (Matrix Transposition): matrix transpose; the original matrix size is A × B.
MRE (Matrix Row Extract): extract a matrix row; the matrix size is A × B and row D is extracted.
MCE (Matrix Column Extract): extract a matrix column; the matrix size is A × B and column D is extracted.
MCPY (Matrix Copy): copy a matrix of size A × B.
MMRA (Matrix M Row Addition): sum of one row; the matrix size is A × B and the sum of row D is computed.
MNCA (Matrix N Column Addition): sum of one column; the matrix size is A × B and the sum of column D is computed.
MRA (Matrix Row Addition): sum by rows; the matrix size is A × B and the sum of every row is computed.
MCA (Matrix Column Addition): sum by columns; the matrix size is A × B and the sum of every column is computed.
MARN (Matrix Add Real Number): matrix plus real number; the matrix size is A × B and each element of the matrix is increased by the real number, which is the 3rd word of the matrix operation instruction or the value in the special register.
MSRN (Matrix Subtract Real Number): matrix minus real number; the matrix size is A × B and each element of the matrix is decreased by the real number, which is the 3rd word of the matrix operation instruction or the value in the special register.
MMRN (Matrix Multiply Real Number): matrix times real number; the matrix size is A × B and each element of the matrix is multiplied by the real number, which is the 3rd word of the matrix operation instruction or the value in the special register.
MDRN (Matrix Divide Real Number): matrix divided by real number; the matrix size is A × B and each element of the matrix is divided by the real number, which is the 3rd word of the matrix operation instruction or the value in the special register.
MAM (Matrix Add Matrix): matrix addition; both matrices are of size A × B and corresponding elements are added.
MSM (Matrix Subtract Matrix): matrix subtraction; both matrices are of size A × B and corresponding elements are subtracted.
MDMM (Matrix Dot Multiply Matrix): matrix dot product; both matrices are of size A × B and corresponding elements are multiplied.
MDD (Matrix Dot Divide): matrix dot division; both matrices are of size A × B and corresponding elements are divided.
MMM (Matrix Multiply Matrix): matrix multiplication; matrix 1 is of size A × B, matrix 2 is of size B × D, and the result is of size A × D.
METRS (Matrix Elementary Transformation, Row Switching): exchange two rows; the matrix size is A × B, the data of row C is written to row D and the data of row D is written to row C.
METRM (Matrix Elementary Transformation, Row Multiplication): row multiplication; the matrix size is A × B, each element of row D is multiplied by the real number and added to the element in the same column of row C, and the result is written to row C.
METRA (Matrix Elementary Transformation, Row Addition): row addition; the matrix size is A × B, each element of row D is added to the element in the same column of row C, and the result is written to row C.
METCS (Matrix Elementary Transformation, Column Switching): exchange two columns; the matrix size is A × B, the data of column C is written to column D and the data of column D is written to column C.
METCM (Matrix Elementary Transformation, Column Multiplication): column multiplication; the matrix size is A × B, each element of column D is multiplied by the real number and added to the element in the same row of column C, and the result is written to column C.
METCA (Matrix Elementary Transformation, Column Addition): column addition; the matrix size is A × B, each element of column D is added to the element in the same row of column C, and the result is written to column C.
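For clarity, the following reference model shows two of these operations (MMM and MRA) on row-major arrays; it is a behavioral sketch under assumed function signatures, not a description of the hardware datapath.

```c
/* Behavioral sketch of two matrix instructions on row-major data.
 * MMM: C = M1 (a x b) * M2 (b x d), result a x d.
 * MRA: out[i] = sum of row i of an a x b matrix.                    */
static void exec_mmm(const double *m1, const double *m2, double *c,
                     unsigned a, unsigned b, unsigned d)
{
    for (unsigned i = 0; i < a; i++)
        for (unsigned j = 0; j < d; j++) {
            double acc = 0.0;
            for (unsigned k = 0; k < b; k++)
                acc += m1[i * b + k] * m2[k * d + j];
            c[i * d + j] = acc;
        }
}

static void exec_mra(const double *m, double *out, unsigned a, unsigned b)
{
    for (unsigned i = 0; i < a; i++) {
        double s = 0.0;
        for (unsigned j = 0; j < b; j++)
            s += m[i * b + j];
        out[i] = s;
    }
}
```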
Fig. 7 shows the structure of the matrix processor core. It consists of the instruction fetch unit, decoding unit 1, decoding unit 2, the data read/write unit, the general-purpose register set, the floating-point unit, and the control module. The interfaces between the MPU core and the outside are: the interface between the instruction fetch unit and the IRAM, the interface between the control module and the CPU, and the interface between the data read/write unit and the DRAM.
The instruction fetch unit receives the fetch enable signal sent by the control module, reads instructions from the IRAM in a loop starting from the beginning, sends them to decoding unit 1, and executes jump instructions. Decoding unit 1 receives the instructions sent from the fetch unit and decodes them according to their class: matrix operation and mathematical function instructions are converted into SIMD or floating-point operation instructions and written to decoding unit 2, while L/S and move instructions are sent to the data read/write unit. The data read/write unit receives the data address and enable signal sent by decoding unit 1; it reads data from the DRAM into the general-purpose register set, writes data from the general-purpose register set back to the DRAM, transfers data between registers, and writes general-purpose register values into the special register. Decoding unit 2 receives the SIMD and floating-point operation instructions sent by decoding unit 1, decodes them into floating-point operation instructions, and passes them to the floating-point unit. The floating-point unit comprises 4 FPU1 modules and 1 FPU2 module; the destination and source register addresses of each FPU1 are controlled by decoding unit 2. FPU1 performs extended single-precision floating-point operations, and FPU2 performs a four-input extended single-precision addition. The control module controls the operation of the matrix processor and sends an interrupt signal to the external CPU when a computation completes or an exception occurs.
Fig. 8 shows the structure of an embedded system realized with the matrix processor, comprising an embedded CPU, a BUS, an SDRAM (memory), the matrix processor, a register group, and an interrupt generating device.
The CPU controls the matrix processor by writing control parameters to the register group. The interrupt generating device is placed between the CPU and the register group; interrupt requests from the matrix processor reach the CPU through the register group and the interrupt generating device, realizing communication and interaction with the CPU.
In this example, the Advanced High-performance Bus (AHB) defined by the Advanced Microcontroller Bus Architecture (AMBA) 2.0 protocol is adopted as the bus standard of the BUS. The CPU is derived by tailoring the LEON2 core and mainly comprises a processor core (chiefly an integer unit), separate instruction and data caches, an interrupt controller, a debug support unit (DSU), timers, a universal asynchronous receiver-transmitter (UART), and a memory controller.
The normal working process of the matrix processor embedded system is as follows:
1) After the system powers up, the CPU resets, reads boot instructions from Flash, and completes processor initialization. It writes configuration parameters to the register group to configure the matrix processor. Finally, it loads the operating system from the SDRAM and starts executing application programs.
2) The CPU writes the matrix processor's instruction sequence and the start and end addresses of the required data in the SDRAM to the register group, sends a start-work signal to the register group, and then releases the bus. After the matrix processor receives the start-work signal, it requests and takes the bus.
3) After taking the bus, the matrix processor reads the instruction sequence and data from the SDRAM according to the addresses the CPU wrote to the register group, writes them to the IRAM and DRAM respectively, and then releases the bus. The CPU takes the bus again and can carry out other tasks unrelated to the matrix processor.
4) After the instruction sequence and data have been written, the matrix processor core starts working. The instruction fetch unit reads instructions from the IRAM in order and sends them to decoding unit 1; if it receives a jump instruction from decoding unit 1, it performs the corresponding instruction jump. Decoding unit 1 processes the received instructions as follows: L/S and move instructions are sent to the data read/write unit; jump instructions are sent back to the fetch unit; matrix and mathematical function instructions are decoded into SIMD or floating-point operation instructions and output to decoding unit 2; SIMD and floating-point operation instructions are passed straight through to decoding unit 2. The data read/write unit, according to Load instructions, reads data from the DRAM and writes it to the general-purpose register set. Decoding unit 2 decodes the instructions sent by decoding unit 1 and sends them to the floating-point unit. The floating-point unit reads the corresponding floating-point data from the general-purpose registers according to the floating-point computation instructions supplied by decoding unit 2, performs the corresponding floating-point operations, and writes the results back to the general-purpose register set. The data read/write unit, according to Store instructions, writes data from the MPU general-purpose register set back to the DRAM.
5) When all instructions in the IRAM have completed, the matrix processor writes the relevant parameters to the register group through the control module and raises the corresponding interrupt through the interrupt generating device. Based on the received interrupt and the overall task to be accomplished, the CPU decides whether the matrix processor should continue with further computations; if all computations are finished, it directs the matrix processor to write the results in the DRAM to the designated address region of the SDRAM.
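Viewed from the CPU side, the normal-mode sequence above might look like the following driver sketch; the register-group field names, the completion flag, and the busy-wait are hypothetical, and only the overall order of steps comes from the text.

```c
#include <stdint.h>

/* Hypothetical view of the register group as seen by the embedded CPU;
 * the field names and the 'done' flag are illustrative only.             */
struct mpu_regs {
    volatile uint32_t insn_start, insn_end;   /* instruction sequence in SDRAM */
    volatile uint32_t data_start, data_end;   /* input data in SDRAM           */
    volatile uint32_t result_addr;            /* where results go in SDRAM     */
    volatile uint32_t start;                  /* start-work signal             */
    volatile uint32_t done;                   /* set via the interrupt device  */
};

static void run_matrix_job(struct mpu_regs *r,
                           uint32_t insn_start, uint32_t insn_end,
                           uint32_t data_start, uint32_t data_end,
                           uint32_t result_addr)
{
    /* Step 2: describe the job and signal the matrix processor to start. */
    r->insn_start  = insn_start;  r->insn_end = insn_end;
    r->data_start  = data_start;  r->data_end = data_end;
    r->result_addr = result_addr;
    r->start = 1;                 /* matrix processor now takes the bus    */

    /* Steps 3-5 run inside the matrix processor; the CPU is free to do
     * unrelated work and here simply waits for the completion notice.     */
    while (!r->done)
        ;                         /* or service the interrupt instead      */
}
```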
The debug-mode process of the matrix processor embedded system is as follows:
1) After the system powers up, the CPU resets, reads boot instructions from Flash, and completes processor initialization. It writes configuration parameters to the register group to configure the matrix processor. Finally, it loads the operating system from the SDRAM and starts executing application programs.
2) Under CPU control, the matrix processor's instructions and the data required for the computation are written to the IRAM and DRAM.
3) The CPU can send two kinds of working signals: a single-step signal and a continuous-run signal. When the matrix processor receives a single-step signal, it completes one complete instruction at a time; when it receives a continuous-run signal, it carries out the work of step 4) of the normal working mode. During the operation of the matrix processor, the CPU can read and write the IRAM and DRAM at any time to check whether the intermediate results are correct.

Claims (9)

1. A matrix processor, characterized in that it comprises an external data interface, an IRAM, a DRAM, and a matrix processor core;
the external data interface connects the IRAM and DRAM of the matrix processor with an external memory, writes the matrix processor's instructions, and exchanges data with the outside;
the IRAM and DRAM serve as the buffers of the matrix processor; the IRAM receives the instruction sequence written by an external module; the DRAM receives the matrices or other data written by the external module as well as the results written back by the matrix processor core, to be used by the matrix processor or read by the external module, thereby completing the data exchange between the matrix processor and the external module;
the matrix processor core performs instruction fetch, decoding, computation, result write-back, and control.
2. The matrix processor according to claim 1, characterized in that the external data interface, IRAM, DRAM, and matrix processor core are jointly connected to a register group, and the register group stores the system information and interaction information of the external data interface, IRAM, DRAM, and matrix processor core.
3. The matrix processor according to claim 2, characterized in that the external data interface, IRAM, DRAM, and matrix processor core are jointly connected to an interrupt generating device, and interrupt requests from the external data interface, IRAM, DRAM, and matrix processor core are output to an external CPU through the register group and the interrupt generating device.
4. The matrix processor according to claim 1, characterized in that the matrix processor core comprises an instruction fetch unit, a first decoding unit, a second decoding unit, a data read/write unit, a general-purpose register set, a floating-point unit, and a control module; the IRAM, the instruction fetch unit, the first decoding unit, the second decoding unit, and the floating-point unit are connected in sequence; the floating-point unit, the general-purpose register set, the data read/write unit, and the DRAM are connected in sequence; and the first decoding unit is connected to the data read/write unit.
5. The matrix processor according to claim 4, characterized in that the instruction set used by the matrix processor comprises: L/S and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single-instruction-multiple-data instructions, and matrix operation instructions;
the L/S and move instructions perform data reads and writes between the matrix processor's buffers and registers, and between registers;
the jump instructions change the instruction execution order;
the floating-point operation instructions perform basic floating-point arithmetic, including absolute value, comparison, addition, subtraction, multiplication, division, square root, and multiply-add operations;
the mathematical function instructions compute elementary mathematical functions, including trigonometric functions, inverse trigonometric functions, logarithmic functions, and exponential functions;
the SIMD instructions perform the same operations as the floating-point operation instructions, but on several different floating-point numbers in parallel;
the matrix operation instructions perform basic and simple matrix computations, including matrix generation, matrix transpose, extraction of matrix rows or columns, summation over matrix rows or columns, addition, subtraction, multiplication, and division of a matrix by a real number, matrix addition, subtraction, and multiplication, and elementary matrix transformations.
6. according to claim 4 or 5 described a kind of matrix processors, it is characterized in that, get the instruction fetch enable signal that refers to that unit reception control module sends, from IRAM, begin the reading command that circulates, send instruction and give first decoding unit, and the completion jump instruction; First decoding unit receives to ask for and refers to the instruction that the unit sends; Classification according to instruction is deciphered; Convert matrix operation and mathematical function operational order to SIMD or the floating-point operation instruction writes second decoding unit, L/S and move are sent into the unit that reads and writes data; The unit that reads and writes data receives data address and the enable signal that first decoding unit sends, and accomplishes that reading of data writes general purpose register set from DRAM, and the data in the general purpose register set are write DRAM; Data between the register of general purpose register set shift; Second decoding unit receives SIMD and the floating-point operation instruction that first decoding unit sends, and instruction is decoded as the floating-point operation instruction, gives FPU Float Point Unit; FPU Float Point Unit comprises the first floating-point operation module and one second floating-point operation module of four parallel connections, and the first floating-point operation module of these four parallel connections is connected in series to the second floating-point operation module; The purpose of each first floating-point operation module and source-register address are controlled by second decoding unit; The first floating-point operation module is accomplished the computing of expansion single-precision floating point, and the second floating-point operation module is accomplished 4 input additions of expansion single precision; The operation of control module gating matrix processor is given outer CPU when calculate to accomplish or take place to send look-at-me when unusual.
7. The matrix processor according to claim 6, characterized in that the matrix processor further comprises a dedicated register which, when the matrix processor executes an operation instruction in which a real number participates, holds that real number.
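A minimal sketch of how the dedicated register of claim 7 might behave, assuming a matrix-times-real instruction and a row-major element layout; the function names and the float type are illustrative only.

/* The real operand is latched once and then reused across every element of a
 * matrix-by-real operation.  Names and layout are assumptions. */
static float scalar_reg;                 /* the dedicated register */

void set_scalar(float r)        { scalar_reg = r; }

void matrix_scale(float *m, int rows, int cols)
{
    for (int i = 0; i < rows * cols; i++)
        m[i] *= scalar_reg;              /* every element uses the held real number */
}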
8. An embedded system built around the matrix processor according to claim 4, characterized in that it comprises a CPU, a bus, an SDRAM, the matrix processor, a register group, and an interrupt generating device; the CPU, the SDRAM, and the external data interface of the matrix processor are connected to the bus; the CPU is connected to the register group and to the interrupt generating device through two data lines respectively, and the register group is connected to the interrupt generating device and to the matrix processor.
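From the CPU side, the register group of claim 8 is naturally modelled as a memory-mapped block. The sketch below is an assumption in every detail the claim leaves open: the register names, field widths, offsets, and base address are hypothetical; the claim only fixes the topology.

/* Hypothetical CPU view of the register group of claim 8. */
#include <stdint.h>

typedef struct {
    volatile uint32_t instr_start;   /* SDRAM start address of the instruction sequence */
    volatile uint32_t instr_end;     /* SDRAM end address of the instruction sequence   */
    volatile uint32_t data_start;    /* SDRAM start address of the operand data         */
    volatile uint32_t data_end;      /* SDRAM end address of the operand data           */
    volatile uint32_t control;       /* bit 0: start signal (assumed)                    */
    volatile uint32_t status;        /* completion / exception flags (assumed)           */
} mp_regs_t;

#define MP_REGS ((mp_regs_t *)0x40000000u)   /* assumed base address */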
9. the method for work of embedded system according to claim 8 is characterized in that, may further comprise the steps:
1) after system powered on, cpu reset read operation boot instruction from Flash, accomplish the initialization of processor; Configuration parameter is write registers group, matrix processor is configured; At last, the beginning executive utility is prepared by load operation system from SDRAM;
2) CPU writes registers group with the instruction sequence and the initial sum termination address of calculating desired data in SDRAM of matrix processor, and sends the signal of starting working and give registers group, discharges bus then; After matrix processor received the signal of starting working, application took bus;
3) after matrix processor takies bus, write instruction sequence and initial sum termination address reading of data from SDRAM of data in the registers group according to CPU, write IRAM and DRAM respectively, discharge bus then;
4) instruction sequence and data write completion after, the matrix processor core is started working; Get finger unit reading command from IRAM successively, send to first decoding unit; If receive the instruction jump instruction of first decoding unit input, then accomplish the corresponding instruction redirect; Translating the instruction that first decoding unit will receive handles as follows: L/S and move send to the unit that reads and writes data; Jump instruction sends to the value unit; Matrix and mathematical function instruction are decoded as SIMD or second decoding unit is exported in the floating-point operation instruction; Straight-through second decoding unit of exporting to of SIMD and floating-point operation instruction; The unit that reads and writes data instructs according to Load, reads the data among the DRAM, writes general purpose register set; The instruction that second decoding unit sends first decoding unit is decoded and is sent to FPU Float Point Unit; Corresponding floating data is read in the Floating-point Computation instruction that FPU Float Point Unit is sent into according to second decoding unit from general-purpose register, accomplish corresponding floating-point operation, and operation result writes back general purpose register set; The unit that reads and writes data instructs according to Store, and the data in the general purpose register set of matrix processor core are write back DRAM;
5) instruction among the IRAM is all accomplished, and matrix processor writes registers group through control module with relevant parameters, produces corresponding the interruption through the interruption generating device; CPU confirms according to interruption that receives and the overall task of being accomplished whether matrix processor proceeds other computings, accomplishes if calculate all, and the gating matrix processor writes the result of calculation among the DRAM in the assigned address zone of SDRAM.
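A CPU-side sketch of steps 2) and 5) of claim 9, reusing the hypothetical mp_regs_t layout sketched under claim 8. The start bit, the DONE flag, and wait_for_interrupt() are assumptions; the claim only requires that the CPU program the register group, send a start signal, and react to the completion interrupt.

#include <stdint.h>

extern void wait_for_interrupt(void);   /* assumed platform-specific helper */

void run_matrix_job(uint32_t instr_start, uint32_t instr_end,
                    uint32_t data_start,  uint32_t data_end)
{
    /* step 2): program the SDRAM addresses in the register group, then start */
    MP_REGS->instr_start = instr_start;
    MP_REGS->instr_end   = instr_end;
    MP_REGS->data_start  = data_start;
    MP_REGS->data_end    = data_end;
    MP_REGS->control     = 1u;          /* assumed start bit; the CPU then releases the bus */

    /* steps 3) and 4) run inside the matrix processor: it copies the
     * instruction sequence into IRAM and the data into DRAM, then executes */

    /* step 5): wait for the interrupt raised when all IRAM instructions finish */
    wait_for_interrupt();
    if (MP_REGS->status & 1u) {         /* assumed DONE flag */
        /* decide whether to start another computation or to direct the
         * processor to copy its DRAM results to the designated SDRAM region */
    }
}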
CN201110303919.4A 2011-10-10 2011-10-10 Matrix processor as well as instruction set and embedded system thereof Expired - Fee Related CN102360344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110303919.4A CN102360344B (en) 2011-10-10 2011-10-10 Matrix processor as well as instruction set and embedded system thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110303919.4A CN102360344B (en) 2011-10-10 2011-10-10 Matrix processor as well as instruction set and embedded system thereof

Publications (2)

Publication Number Publication Date
CN102360344A 2012-02-22
CN102360344B CN102360344B (en) 2014-03-12

Family

ID=45585673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110303919.4A Expired - Fee Related CN102360344B (en) 2011-10-10 2011-10-10 Matrix processor as well as instruction set and embedded system thereof

Country Status (1)

Country Link
CN (1) CN102360344B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1342934A (en) * 2000-08-23 2002-04-03 任天堂株式会社 Method and device for pre-storage data in voiceband storage
CN1864131A (en) * 2003-10-01 2006-11-15 先进微装置公司 System and method for handling exceptional instructions in a trace cache based processor
CN101526895A (en) * 2009-01-22 2009-09-09 杭州中天微系统有限公司 High-performance low-power-consumption embedded processor based on command dual-transmission
CN201716564U (en) * 2010-06-25 2011-01-19 中国科学院沈阳自动化研究所 Processor architecture special for high-performance programmable logic controller (PLC)
CN102073543A (en) * 2011-01-14 2011-05-25 上海交通大学 General processor and graphics processor fusion system and method
CN102184092A (en) * 2011-05-04 2011-09-14 西安电子科技大学 Special instruction set processor based on pipeline structure

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281559A (en) * 2013-07-09 2015-01-14 罗伯特·博世有限公司 Model calculation method and device used for performing function model based on data
CN104394302A (en) * 2014-11-28 2015-03-04 深圳职业技术学院 Real-time video defogging system based on FPGA
WO2017124647A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Matrix calculation apparatus
WO2017124649A1 (en) * 2016-01-20 2017-07-27 北京中科寒武纪科技有限公司 Data read/write scheduler for vector operation, and reservation station
US11734383B2 (en) 2016-01-20 2023-08-22 Cambricon Technologies Corporation Limited Vector and matrix computing device
CN108595371A (en) * 2016-01-20 2018-09-28 北京中科寒武纪科技有限公司 For the digital independent of vector operation, write-in and read-write scheduler and reservation station
CN108491359A (en) * 2016-04-22 2018-09-04 北京中科寒武纪科技有限公司 Submatrix arithmetic unit and method
CN111857820A (en) * 2016-04-26 2020-10-30 中科寒武纪科技股份有限公司 Apparatus and method for performing matrix addition/subtraction operation
CN111857819B (en) * 2016-04-26 2024-05-03 中科寒武纪科技股份有限公司 Apparatus and method for performing matrix add/subtract operation
CN107315574A (en) * 2016-04-26 2017-11-03 北京中科寒武纪科技有限公司 A kind of apparatus and method for performing matrix multiplication
CN111857819A (en) * 2016-04-26 2020-10-30 中科寒武纪科技股份有限公司 Apparatus and method for performing matrix addition/subtraction operation
CN107315574B (en) * 2016-04-26 2021-01-01 安徽寒武纪信息科技有限公司 Apparatus and method for performing matrix multiplication operation
CN107678781B (en) * 2016-08-01 2021-02-26 北京百度网讯科技有限公司 Processor and method for executing instructions on processor
CN107678781A (en) * 2016-08-01 2018-02-09 北京百度网讯科技有限公司 Processor and the method for execute instruction on a processor
WO2018141233A1 (en) * 2017-02-01 2018-08-09 Huawei Technologies Co., Ltd. Ultra lean vector processor
US11847452B2 (en) 2017-03-20 2023-12-19 Intel Corporation Systems, methods, and apparatus for tile configuration
CN110337635B (en) * 2017-03-20 2023-09-19 英特尔公司 System, method and apparatus for dot product operation
CN110337635A (en) * 2017-03-20 2019-10-15 英特尔公司 System, method and apparatus for dot product operations
US11714642B2 (en) 2017-03-20 2023-08-01 Intel Corporation Systems, methods, and apparatuses for tile store
CN107391447A (en) * 2017-07-26 2017-11-24 成都网烁信息科技有限公司 A kind of computer acceleration system and method
US11762631B2 (en) 2017-10-30 2023-09-19 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN109034383B (en) * 2017-10-30 2021-09-21 上海寒武纪信息科技有限公司 Neural network processor and method for executing matrix multiplication instruction by using processor
CN109034383A (en) * 2017-10-30 2018-12-18 上海寒武纪信息科技有限公司 Neural network processor and the method for executing matrix multiple instruction using processor
US11922132B2 (en) 2017-10-30 2024-03-05 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
US11836497B2 (en) 2018-02-05 2023-12-05 Shanghai Cambricon Information Technology Co., Ltd Operation module and method thereof
WO2019148781A1 (en) * 2018-02-05 2019-08-08 上海寒武纪信息科技有限公司 Operation module and method
CN108777155A (en) * 2018-08-02 2018-11-09 北京知存科技有限公司 Flash chip
CN110046105A (en) * 2019-04-26 2019-07-23 中国科学院微电子研究所 A kind of 3D NAND Flash
CN110990060A (en) * 2019-12-06 2020-04-10 北京瀚诺半导体科技有限公司 Embedded processor, instruction set and data processing method of storage and computation integrated chip
CN111242293B (en) * 2020-01-13 2023-07-18 腾讯科技(深圳)有限公司 Processing component, data processing method and electronic equipment
CN111242293A (en) * 2020-01-13 2020-06-05 腾讯科技(深圳)有限公司 Processing component, data processing method and electronic equipment
CN112527240A (en) * 2020-12-22 2021-03-19 中国电子科技集团公司第四十七研究所 Floating point arithmetic device matched with 80C186CPU
CN112527240B (en) * 2020-12-22 2023-11-14 中国电子科技集团公司第四十七研究所 Floating point operation device matched with 80C186CPU
CN115995249A (en) * 2023-03-24 2023-04-21 南京大学 Matrix transposition operation device based on DRAM
CN116679988A (en) * 2023-08-02 2023-09-01 武汉芯必达微电子有限公司 Hardware acceleration unit, hardware acceleration method, chip and storage medium
CN116679988B (en) * 2023-08-02 2023-10-27 武汉芯必达微电子有限公司 Hardware acceleration unit, hardware acceleration method, chip and storage medium

Also Published As

Publication number Publication date
CN102360344B (en) 2014-03-12

Similar Documents

Publication Publication Date Title
CN102360344B (en) Matrix processor as well as instruction set and embedded system thereof
CN106940815B (en) Programmable convolutional neural network coprocessor IP core
EP3602278B1 (en) Systems, methods, and apparatuses for tile matrix multiplication and accumulation
CN103019647B (en) Floating-point accumulation/gradual decrease operational method with floating-point precision maintaining function
CN113050990A (en) Apparatus, method and system for instructions for a matrix manipulation accelerator
KR20170110690A (en) A vector processor configured to operate on variable length vectors using one or more complex arithmetic instructions;
CN102981797B (en) Trigonometric function arithmetic device based on combination of feedback of coordinated rotation digital computer (CORDIC) algorithm and pipeline organization
GB2474901A (en) Multiply-accumulate instruction which adds or subtracts based on a predicate value
WO2014051720A1 (en) Accelerated interlane vector reduction instructions
CN110909883A (en) System and method for executing instructions specifying a tri-slice logical operation
CN112579159A (en) Apparatus, method and system for instructions for a matrix manipulation accelerator
CN111767079A (en) Apparatus, method, and system for transpose instruction for matrix manipulation accelerator
CN111752618A (en) Cross-flow pipeline of floating-point adder
CN110909882A (en) System and method for performing horizontal tiling
CN111611202A (en) Systolic array accelerator system and method
CN113918883A (en) Data processing method, device and equipment and computer readable storage medium
Yang et al. RiBoSOM: Rapid bacterial genome identification using self-organizing map implemented on the synchoros SiLago platform
CN111767512A (en) Discrete cosine transform/inverse discrete cosine transform DCT/IDCT system and method
CN114691217A (en) Apparatus, method, and system for an 8-bit floating-point matrix dot-product instruction
CN112434256B (en) Matrix multiplier and processor
EP3819788A1 (en) Data processing system and data processing method
CN116710912A (en) Matrix multiplier and control method thereof
CN102043609B (en) Floating-point coprocessor and corresponding configuration and control method
CN101930356B (en) Method for group addressing and read-write controlling of register file for floating-point coprocessor
CN111752605A (en) fuzzy-J bit position using floating-point multiply-accumulate results

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20140312; termination date: 20161010)