CN102360344B

CN102360344B - Matrix processor as well as instruction set and embedded system thereof

Info

Publication number: CN102360344B
Application number: CN201110303919.4A
Authority: CN
Inventors: 张斌; 梅魁志; 郑南宁; 董培祥; 张书锋; 李宇海; 赵晨; 殷浩
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2011-10-10
Filing date: 2011-10-10
Publication date: 2014-03-12
Anticipated expiration: 2031-10-10
Also published as: CN102360344A

Abstract

The invention provides a matrix processor, together with its instruction set and an embedded system. The matrix processor comprises an external data interface, an IRAM (instruction RAM), a DRAM (data RAM), and a matrix processor core. The external data interface connects the IRAM and DRAM of the matrix processor to an external memory, so that matrix processor instructions can be written and data can be exchanged with the outside. The IRAM and DRAM serve as buffers of the matrix processor: the IRAM receives the instruction sequence written by an external module, and the DRAM receives matrices or other data written by the external module as well as computing results written by the matrix processor core, so that the data and results can be used by the matrix processor or read by the external module, completing the data exchange between the two. The matrix processor core performs instruction fetch, decoding, computation, result write-back, and control. The matrix processor provided by the invention can independently complete various matrix operations and other mathematical operations.

Description

Matrix processor and instruction set thereof and embedded system

[technical field]

The present invention relates to the technical field of processors, and in particular to a matrix processor, its instruction set, and an embedded system.

[background technology]

Matrix operations are a fundamental problem in scientific and engineering computing. They are not only a branch of mathematics but also an important mathematical tool for many science and engineering disciplines, and they are indispensable in fields such as physics, mechanics, computer science, and aerospace. In computer science and technology in particular, many areas rely on matrix operations, for example digital image processing, computer graphics, pattern recognition, machine vision, artificial intelligence, scientific computing, and the design and analysis of general algorithms.

Traditionally, matrix operations are mostly carried out serially on a processor, which severely limits computing speed. Implementing matrix operations in hardware can raise the speed, but most existing hardware implementations are dedicated structures designed for one specific operation, such as matrix inversion or matrix multiplication; they require large hardware resources and are inflexible. Moreover, complete algorithms that involve matrix operations generally contain several kinds of matrix operations as well as non-matrix operations. When a hardware circuit for one particular matrix operation is added to a computing system, it spends a long time interacting with, and waiting for, the rest of the system, so the efficiency gained from the dedicated matrix hardware module is very limited.

There is therefore a strong demand for a matrix operation unit that can independently execute a complete algorithm.

[summary of the invention]

The object of the present invention is to provide a matrix processor and an embedded system thereof that can independently complete various matrix operations and other mathematical operations.

To achieve this object, the present invention adopts the following technical solution:

A matrix processor comprises an external data interface, an IRAM, a DRAM, and a matrix processor core.

The external data interface connects the IRAM and DRAM of the matrix processor to an external memory; it writes matrix processor instructions and exchanges data with the outside.

The IRAM and DRAM serve as buffers of the matrix processor. The IRAM receives the instruction sequence written by an external module. The DRAM receives matrices or other data written by the external module as well as computing results written by the matrix processor core, to be read by the matrix processor or the external module, thereby completing the data exchange between the matrix processor and the external module.

The matrix processor core performs instruction fetch, decoding, computation, result write-back, and control.

In a further improvement, the external data interface, IRAM, DRAM, and matrix processor core are all connected to a register group, which stores the system information and interaction information of the external data interface, IRAM, DRAM, and matrix processor core.

In a further improvement, the external data interface, IRAM, DRAM, and matrix processor core are all connected to an interrupt generator; interrupt requests from the external data interface, IRAM, DRAM, and matrix processor core are output to an external control module (an external CPU) through the register group and the interrupt generator.

In a further improvement, the matrix processor core comprises an instruction fetch unit, a first decoding unit, a second decoding unit, a data read/write unit, a general-purpose register group, a floating-point unit, and a control unit. The IRAM, instruction fetch unit, first decoding unit, second decoding unit, and floating-point unit are connected in sequence; the floating-point unit, general-purpose register group, data read/write unit, and DRAM are connected in sequence; and the first decoding unit is connected to the data read/write unit.

In a further improvement, the units of the matrix processor core operate as follows. The instruction fetch unit receives the fetch enable signal sent by the control unit, begins reading instructions from the IRAM in a loop, sends them to the first decoding unit, and carries out jump instructions. The first decoding unit receives instructions from the instruction fetch unit and decodes them by class: matrix operation and mathematical function instructions are converted into SIMD or floating-point operation instructions and written to the second decoding unit, while load/store and move instructions are sent to the data read/write unit. The data read/write unit receives the data address and enable signal sent by the first decoding unit; it reads data from the DRAM into the general-purpose register group, writes data from the general-purpose register group to the DRAM, and transfers data between registers of the general-purpose register group. The second decoding unit receives the SIMD and floating-point operation instructions sent by the first decoding unit, decodes them into floating-point operations, and passes them to the floating-point unit. The floating-point unit comprises four parallel first floating-point operation modules and one second floating-point operation module, with the four parallel first modules connected in series to the second module. The destination and source register addresses of each first floating-point operation module are controlled by the second decoding unit. The first floating-point operation module performs extended single-precision floating-point operations, and the second floating-point operation module performs four-input extended single-precision addition. The control unit controls the operation of the matrix processor and sends an interrupt signal to the external CPU when computation completes or an exception occurs.

In a further improvement, the matrix processor also comprises a special register, which holds the real number when the matrix processor executes an operation instruction that involves a real number.

In a further improvement, the instruction set used by the matrix processor comprises load/store (L/S) and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single instruction multiple data (SIMD) instructions, and matrix operation instructions.

The load/store and move instructions read and write data between the matrix processor buffers and the registers, and between registers.

The jump instructions change the order of instruction execution.

The floating-point operation instructions perform basic floating-point arithmetic, including absolute value, comparison, addition, subtraction, multiplication, division, square root, and multiply-add.

The mathematical function instructions compute elementary mathematical functions, including trigonometric, inverse trigonometric, logarithmic, and exponential functions.

The SIMD instructions operate on several floating-point numbers in parallel; the operations they perform are the same as those of the floating-point operation instructions.

The matrix operation instructions perform basic, simple matrix operations, including matrix generation, matrix transposition, extraction of matrix rows and columns, summation along matrix rows and columns, addition, subtraction, multiplication, and division of a matrix by a real number, matrix addition, subtraction, and multiplication, and elementary matrix transformations.

An embedded system comprises a CPU, a bus, an SDRAM, a matrix processor, a register group, and an interrupt generator. The CPU, the SDRAM, and the external data interface of the matrix processor are connected to the bus. The CPU is connected to the register group and to the interrupt generator by two data lines respectively, and the register group is connected to the interrupt generator and to the matrix processor.

A working method of the embedded system comprises the following steps:

1) After the system powers on, the CPU resets, reads the boot instructions from Flash, and initializes the processor; it writes configuration parameters into the register group to configure the matrix processor; finally, it loads the operating system from SDRAM and prepares to execute applications.

2) The CPU writes to the register group the start and end addresses in SDRAM of the matrix processor's instruction sequence and of the data required for the computation, sends a start signal to the register group, and then releases the bus. On receiving the start signal, the matrix processor requests the bus.

3) Once the matrix processor holds the bus, it reads the instruction sequence and the data from SDRAM according to the start and end addresses that the CPU wrote into the register group, writes them into the IRAM and DRAM respectively, and then releases the bus.

4) After the instruction sequence and data have been written, the matrix processor core starts working. The instruction fetch unit reads instructions from the IRAM in sequence and sends them to the first decoding unit; if it receives a jump instruction from the first decoding unit, it performs the corresponding jump. The first decoding unit handles received instructions as follows: load/store and move instructions are sent to the data read/write unit; jump instructions are sent to the instruction fetch unit; matrix and mathematical function instructions are decoded into SIMD or floating-point operation instructions and output to the second decoding unit; SIMD and floating-point operation instructions are passed straight through to the second decoding unit. On a Load instruction, the data read/write unit reads data from the DRAM and writes it into the general-purpose register group. The second decoding unit decodes the instructions sent by the first decoding unit and sends them to the floating-point unit. The floating-point unit reads the corresponding floating-point data from the general-purpose registers according to the instruction from the second decoding unit, performs the operation, and writes the result back to the general-purpose register group. On a Store instruction, the data read/write unit writes data from the general-purpose register group of the matrix processor core back to the DRAM.

5) When all instructions in the IRAM have been executed, the control unit of the matrix processor writes the corresponding parameters into the register group, and the interrupt generator produces the corresponding interrupt. Based on the interrupt received and the overall progress of the task, the CPU decides whether the matrix processor should carry out further operations; if all computation is complete, it directs the matrix processor to write the results in the DRAM to the designated address region of SDRAM.

An instruction set is the set of instructions a processor uses for computation and system control. Every processor is designed with a set of instructions matched to its hardware circuits. How advanced the instruction set is also bears on processor performance: the instruction set is both an effective means of improving processor efficiency and an important indicator of processor performance.

Because of the functions it performs, the matrix processor has an instruction set that differs from that of a general-purpose CPU, but the design of the instruction set still follows the principles of reduced instruction set computing (RISC): (1) few instruction types, regular formats, and uniform instruction length; (2) simplified addressing modes; (3) extensive use of register-to-register operations, with RAM accessed only by specific operations; (4) a simplified processor structure; (5) enhanced processor parallelism.

To obtain a matrix processor instruction set that can complete various matrix operations and other mathematical operations, the instruction set should have six parts: load/store (L/S) and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single instruction multiple data (SIMD) instructions, and matrix operation instructions.

(1) Load/store (L/S) and move instructions read and write data between the matrix processor buffers and the registers, and between registers.

(2) Jump instructions change the order of instruction execution.

(3) Floating-point operation instructions perform basic floating-point arithmetic, including absolute value, comparison, addition, subtraction, multiplication, division, square root, multiply-add, and similar operations.

(4) Mathematical function instructions compute certain elementary mathematical functions, including trigonometric, inverse trigonometric, logarithmic, and exponential functions.

(5) SIMD (single instruction multiple data) instructions operate on several floating-point numbers in parallel; the operations they perform are the same as those of the floating-point operation instructions.

(6) Matrix operation instructions perform basic, simple matrix operations, including matrix generation, matrix transposition, extraction of matrix rows and columns, summation along matrix rows and columns, addition, subtraction, multiplication, and division of a matrix by a real number, matrix addition, subtraction, and multiplication, and elementary matrix transformations. By combining these operations, the matrix processor can carry out complex algorithms.

The present invention proposes a method and flow for designing a matrix processor from its instruction set. First, according to actual needs, determine which instructions the matrix processor must implement, its computational precision, its operating modes, the maximum supported matrix size, the sizes of the IRAM and DRAM, the composition of the arithmetic unit, and the size of the general-purpose register group, and design the instruction set format from these conditions. Then design the structure, workflow, and exception handling of the matrix processor.

Following the above instruction set and design method, and for a specific usage environment, the present invention implements an extended single-precision matrix processor and its working mechanism:

The specific instructions to be implemented are: (1) load/store and move instructions: LM, SM, LMR, SMR, MOV; (2) jump instructions: JMP, JL, B, BL; (3) floating-point operation instructions: FABV, FCMP, FCMPZ, FCPY, FNEG, FADD, FSUB, FMUL, FDIV, FSQRT, FNMUL, FMAC, FMSB, NOP; (4) mathematical function instructions: FSINF, FCOSF, FTANF, FARCSINF, FARCCOSF, FARCTANF; (5) SIMD instructions: SABV, SCMP, SCMPEZ, SCPY, SNEG, SADD, SSUB, SMUL, SDIV, SSQRT, SNMUL, SMAC, SMSB, NOP; (6) matrix operation instructions: MGNM, MGOM, MGIM, MTRN, MRE, MCE, MCPY, MMRA, MNCA, MRA, MCA, MARN, MSRN, MMRN, MDRN, MAM, MSM, MDMM, MDD, MMM, METRS, METRM, METRA, METCS, METCM, METCA.

Computational precision: when data is exchanged with the outside, floating-point numbers are single-precision numbers conforming to the IEEE 754 standard. Internal computation uses extended single-precision floating-point numbers conforming to the IEEE 754 standard, with 1 sign bit, 8 exponent bits, and 39 mantissa bits.

Two operating modes: working mode and debug mode. The mode is selected mainly through the bus interface: when the bus interface is a master interface, the processor is in working mode; when the bus interface is a slave interface, it is in debug mode.

The largest supported operation is on 32 × 32 floating-point matrices.

The IRAM size is 32 Kbits; the DRAM size is 192 Kbits (it can hold at most four 32 × 32 matrices).

The arithmetic unit consists of four parallel floating-point operation modules of identical function (FPU1) connected in series to one further floating-point operation module (FPU2). FPU1 performs absolute value, floating-point comparison, copy, negation, addition, subtraction, multiplication, division, square root, negated product, multiply-add, multiply-subtract, and no-operation. FPU2 performs four-input floating-point addition.

The general-purpose register group contains 64 registers of 48 bits each. It can be used as a whole or in groups: it divides into four groups of 16 registers, and each group is a register file with one write port and three read ports.

Compared with the prior art, the present invention has the following advantages: the matrix processor can independently complete various matrix operations and other mathematical operations, and by combining these operations it can carry out a variety of complex algorithms. The processor performs floating-point operations quickly, is highly flexible, and can independently execute a complete algorithm.

[description of drawings]

Fig. 1 is the basic structural diagram of the matrix processor;

Fig. 2 is the floating-point data format of the matrix processor;

Fig. 3 shows how the matrix processor is connected to the bus;

Fig. 4 is the structural diagram of the floating-point unit;

Fig. 5 is the structure of the general-purpose register group;

Fig. 6 is the instruction set format of the matrix processor;

Fig. 7 is the structural diagram of the matrix processor core;

Fig. 8 is the structural diagram of the matrix processor system.

[embodiment]

The present invention is described in detail below with reference to the drawings and specific embodiments.

So that the matrix processor instruction set can carry out matrix computation and support the computation and control of a complete algorithm, it should have six parts: load/store (L/S) and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single instruction multiple data (SIMD) instructions, and matrix operation instructions.

According to actual needs, the specific instructions implemented are: (1) load/store and move instructions: LM, SM, LMR, SMR, MOV; (2) jump instructions: JMP, JL, B, BL; (3) floating-point operation instructions: FABV, FCMP, FCMPZ, FCPY, FNEG, FADD, FSUB, FMUL, FDIV, FSQRT, FNMUL, FMAC, FMSB, NOP; (4) mathematical function instructions: FSINF, FCOSF, FTANF, FARCSINF, FARCCOSF, FARCTANF; (5) SIMD instructions: SABV, SCMP, SCMPEZ, SCPY, SNEG, SADD, SSUB, SMUL, SDIV, SSQRT, SNMUL, SMAC, SMSB, NOP; (6) matrix operation instructions: MGNM, MGOM, MGIM, MTRN, MRE, MCE, MCPY, MMRA, MNCA, MRA, MCA, MARN, MSRN, MMRN, MDRN, MAM, MSM, MDMM, MDD, MMM, METRS, METRM, METRA, METCS, METCM, METCA.

Fig. 1 shows the basic structure of the matrix processor. The external data interface connects to the IRAM and DRAM of the matrix processor; it writes matrix processor instructions and exchanges data with the outside, and it can either be connected directly to the CPU or be attached to the system bus. The IRAM and DRAM serve as buffers of the matrix processor. The IRAM receives the instruction sequence written by an external module. The DRAM receives matrices or other data written by the external module as well as computing results written by the matrix processor core, to be read by the matrix processor or the external module, thereby completing the data exchange between the two. The width and depth of the IRAM and DRAM are chosen according to actual requirements. The matrix processor core performs instruction fetch, decoding, computation, write-back, and control.

The precision of the data computed by the matrix processor is as follows. When data is exchanged with the outside, floating-point numbers are single-precision numbers conforming to the IEEE 754 standard. Internal computation uses extended single-precision floating-point numbers conforming to the IEEE 754 standard, with 1 sign bit, 8 exponent bits, and 39 mantissa bits, as shown in Fig. 2. The sign bit (sign) gives the sign of the floating-point number, the exponent (exponent) is the exponent of the binary floating-point number, and the mantissa (fraction) is its fractional part. The value of an extended single-precision floating-point number is $\text{value} = (-1)^{\text{sign}} \times 1.f \times 2^{\text{exp} - \text{bias}}$, where $f$ is the mantissa, exp is the exponent, and bias is the exponent offset, which is 127 for single precision.
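The format described above can be illustrated with a short Python sketch. It assumes the 48 bits are laid out, from the most significant bit down, as sign, exponent, then mantissa (Fig. 2 is not reproduced here, so this ordering is an assumption), and it handles normalized numbers only. The function names `single_to_extended` and `extended_value` are hypothetical and do not come from the patent.

```python
import struct

# Field widths of the internal format as stated in the text: 1 + 8 + 39 = 48 bits.
SIGN_BITS, EXP_BITS, FRAC_BITS = 1, 8, 39
BIAS = 127


def single_to_extended(x: float) -> int:
    """Widen an IEEE 754 single to the 48-bit word by zero-padding the fraction.

    Assumed bit layout (MSB first): sign | exponent | mantissa.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits32 >> 31
    exp = (bits32 >> 23) & 0xFF
    frac23 = bits32 & 0x7FFFFF
    frac39 = frac23 << (FRAC_BITS - 23)  # 16 extra low-order mantissa bits
    return (sign << 47) | (exp << FRAC_BITS) | frac39


def extended_value(word: int) -> float:
    """Evaluate (-1)^sign * 1.f * 2^(exp - bias) for a normalized 48-bit word."""
    sign = (word >> 47) & 1
    exp = (word >> FRAC_BITS) & 0xFF
    frac = word & ((1 << FRAC_BITS) - 1)
    significand = 1 + frac / (1 << FRAC_BITS)
    return (-1) ** sign * significand * 2.0 ** (exp - BIAS)
```

Because the exponent width is unchanged, widening a single-precision input only appends zero bits to the mantissa; the extra 16 bits matter for intermediate results inside the processor.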

Operating modes: working mode and debug mode. The external module is taken to be an embedded CPU connected to the matrix processor through a bus, as shown in Fig. 3. The two modes are selected by changing the bus interface. When the RAM of the matrix processor is connected to the bus through a bus master interface, the processor is in working mode: the CPU writes data to the control registers to start the matrix processor, the matrix processor then reads the required data from memory by itself and performs the computation, and when the computation is complete it sends an interrupt to notify the CPU. In this mode the CPU cannot read or write the IRAM and DRAM of the matrix processor. When the RAM of the matrix processor is connected to the bus through a bus slave interface, the processor is in debug mode, and the CPU can read and write the IRAM and DRAM of the matrix processor.

The largest supported operation is on 32 × 32 floating-point matrices.

The IRAM size is 32 Kbits (1024 × 32 bits). The DRAM size is 192 Kbits (4 × 32 × 32 × 48 bits, holding at most four 32 × 32 matrices). Matrix data is written into the data RAM contiguously, row by row, starting at the matrix start address.
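The sizes above, and the row-by-row layout, can be checked with a few lines of Python. This is a sketch under two assumptions that the text does not state: data RAM addresses count 48-bit words, and the rows of a matrix are packed with a stride equal to its column count. The helper `element_address` is a hypothetical name.

```python
WORD_BITS = 48      # width of one internal floating-point number
MAX_DIM = 32        # largest supported matrix is 32 x 32
MATRIX_SLOTS = 4    # the data RAM holds at most four such matrices

IRAM_BITS = 1024 * 32
DRAM_WORDS = MATRIX_SLOTS * MAX_DIM * MAX_DIM
DRAM_BITS = DRAM_WORDS * WORD_BITS


def element_address(matrix_start: int, row: int, col: int, n_cols: int) -> int:
    """Word address of element (row, col) of a matrix stored row by row.

    Assumes word addressing and a row stride of n_cols (both assumptions).
    """
    if not (0 <= col < n_cols <= MAX_DIM and 0 <= row < MAX_DIM):
        raise ValueError("index outside the supported 32 x 32 range")
    addr = matrix_start + row * n_cols + col
    if not 0 <= addr < DRAM_WORDS:
        raise ValueError("address outside the data RAM")
    return addr
```

With these figures the data RAM holds 4096 words, which is exactly the range of a 12-bit address such as the DRAM_start field used by the load/store instructions.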

The floating-point unit, shown in Fig. 4, consists of four parallel floating-point operation modules of identical function (FPU1) connected in series to one further floating-point operation module (FPU2). FPU1 performs absolute value, comparison of two floating-point numbers, comparison of a floating-point number with 0, copy, negation, addition, subtraction, multiplication, division, square root, negated product, multiply-add, multiply-subtract, and no-operation. FPU1 has two kinds of input: the operation valid signals from the decoding unit, and the operands from the general-purpose register group and the special register. The operation valid signals from the decoding unit comprise one signal for each of the operations listed above, plus a direct-output valid signal. The operands come from the general-purpose register group and the special register: normally, the three read ports of the general-purpose register group (read port 1, read port 2, read port 3) correspond to the three operand input ports of FPU1 (operand input port 1, operand input port 2, operand input port 3); when the special-register input valid signal is asserted, operand input port 2 instead takes the output data of the special register. FPU2 performs four-input floating-point addition. An output control module receives the outputs of the four FPU1 modules and the output of FPU2 and selects the appropriate result as the output.
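To picture how four parallel FPU1 modules feeding a four-input adder can be used, the following Python sketch models a four-element dot product: each lane multiplies one pair of operands and FPU2 sums the four products. It is only an illustration. It uses ordinary Python floats rather than the 48-bit internal format, the order in which FPU2 adds its inputs is an assumption, and the names `fpu1_lanes`, `fpu2_add4`, and `dot4` are hypothetical.

```python
from typing import Callable, List, Sequence


def fpu1_lanes(op: Callable[[float, float], float],
               a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Apply the same operation on four independent lanes, one per FPU1."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("exactly four lanes are modelled")
    return [op(x, y) for x, y in zip(a, b)]


def fpu2_add4(lanes: Sequence[float]) -> float:
    """Four-input addition as performed by FPU2 (pairwise order is assumed)."""
    w, x, y, z = lanes
    return (w + x) + (y + z)


def dot4(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two 4-vectors: four FPU1 multiplies, then one FPU2 add."""
    products = fpu1_lanes(lambda x, y: x * y, a, b)
    return fpu2_add4(products)
```

A 32-element row-by-column product, as needed for matrix multiplication, would then be eight such steps with the partial sums accumulated; how the actual hardware schedules this is not described in the text.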

Fig. 5 shows the structure of the general-purpose register group. The general-purpose register group contains 64 registers of 48 bits each. It can be used as a whole or in groups: it divides into four groups of 16 registers, each group is a register file with one write port and three read ports, and each group corresponds to one FPU1. The address of the general-purpose register group is 6 bits wide and has two parts: the high 4 bits are the address within a group, and the low 2 bits are the group number. When the register group is used as a whole, addresses run from 0 to 63; when it is used as four separate groups, the in-group address runs from 0 to 15 and the group number from 0 to 3. Fig. 5(a) shows the structure of one group of the general-purpose register group; it consists of three register banks of identical structure and function. The inputs of one group are the write information, comprising the write address, write enable, and write data, and the read information, comprising the read addresses. The outputs are the read ports of the three register banks (read port 1, read port 2, read port 3). Fig. 5(b) shows the structure of the whole general-purpose register group; it consists of the four groups, four multiplexers (MUX1, MUX2, MUX3, MUX4), and one 2-to-4 decoder. MUX1 selects, according to a control signal, the correct one of several write-address inputs and passes it to the general-purpose register group as the write address. The in-group address is fed to groups 1 to 4 simultaneously; the group number is fed to the 2-to-4 decoder, whose outputs are the write enable signals (for group numbers 00, 01, 10, and 11, the outputs are the write enable signals of groups 1 to 4 respectively). MUX2 selects, according to the same control signal, the correct one of several write-data inputs as the data written to the groups. MUX3 selects, according to a control signal, the correct one of several read-address inputs as the read address of the groups. MUX4 selects, according to a control signal, the correct one of the output data of the different groups as the output data of the general-purpose register group.
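The address split and the 2-to-4 decoder described above can be sketched in Python as follows. The function names are hypothetical, and groups are numbered 0 to 3 here (the text also refers to them as groups 1 to 4).

```python
from typing import List, Tuple


def split_register_address(addr: int) -> Tuple[int, int]:
    """Split a 6-bit register address into (group number, in-group address).

    The high 4 bits are the in-group address, the low 2 bits the group number.
    """
    if not 0 <= addr <= 63:
        raise ValueError("register address must fit in 6 bits")
    return addr & 0b11, addr >> 2


def join_register_address(group: int, in_group: int) -> int:
    """Inverse of split_register_address."""
    if not (0 <= group <= 3 and 0 <= in_group <= 15):
        raise ValueError("group must be 0..3 and in-group address 0..15")
    return (in_group << 2) | group


def write_enables(group: int) -> List[int]:
    """2-to-4 decoder: one-hot write enable for the four register groups."""
    if not 0 <= group <= 3:
        raise ValueError("group must be 0..3")
    return [int(g == group) for g in range(4)]
```

One consequence of putting the group number in the low bits is that consecutive whole-file addresses 0, 1, 2, 3 fall into four different groups, so a run of four consecutive registers is spread across the groups attached to the four FPU1 modules.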

From the above conditions, the format of the matrix processor instructions is determined as shown in Fig. 6. The matrix processor instruction set is divided into six classes: (1) load/store and move instructions: load/store instructions exchange data between the matrix processor data RAM and the general-purpose register group, and move instructions exchange data between registers within the MPU general-purpose register group and between the general-purpose register group and the special registers; (2) jump instructions, which change the instruction address and include conditional and unconditional jumps; (3) floating-point operation instructions, which perform floating-point arithmetic, including absolute value, comparison, negation, addition, subtraction, multiplication, division, square root, and multiply-accumulate; (4) mathematical function instructions, which compute mathematical functions, including trigonometric and inverse trigonometric functions; (5) SIMD instructions, which perform single instruction multiple data operations, completing several floating-point operations at once; (6) matrix operation instructions, which perform matrix operations, including generating an all-zeros matrix, an all-ones matrix, or an identity matrix, transposition, row extraction, column extraction, matrix copy, the sum of a given row or column, summation by rows, summation by columns, addition, subtraction, multiplication, and division of a matrix by a real number, matrix addition and subtraction, element-wise matrix product, matrix multiplication, and elementary matrix transformations.

The instruction class is given by the Type field, which is the highest 4 bits of each instruction. The Type values of the load/store and move, jump, floating-point operation, mathematical function, SIMD, and matrix operation instructions are 0001, 0010, 0011, 0100, 0101, and 0110 respectively.
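As a sketch of how a first-stage decoder might use the Type field, the Python below extracts the top 4 bits of an instruction word and looks up where the instruction goes next, following the routing described in the working method. It assumes a 32-bit instruction word with bit 0 as the least significant bit (the IRAM is organized as 1024 × 32 bits, but the bit numbering convention is an assumption). The table and function names are hypothetical.

```python
from typing import Dict

# Type codes as listed in the text (highest 4 bits of each instruction).
INSTRUCTION_TYPES: Dict[int, str] = {
    0b0001: "load/store and move",
    0b0010: "jump",
    0b0011: "floating-point operation",
    0b0100: "mathematical function",
    0b0101: "SIMD",
    0b0110: "matrix operation",
}

# Where the first decoding unit sends each class, per the described workflow.
ROUTING: Dict[str, str] = {
    "load/store and move": "data read/write unit",
    "jump": "instruction fetch unit",
    "floating-point operation": "second decoding unit (passed through)",
    "SIMD": "second decoding unit (passed through)",
    "mathematical function": "second decoding unit (expanded to SIMD/FP first)",
    "matrix operation": "second decoding unit (expanded to SIMD/FP first)",
}


def instruction_type(word: int) -> str:
    """Return the class name for a 32-bit instruction word (assumed width)."""
    type_code = (word >> 28) & 0xF
    if type_code not in INSTRUCTION_TYPES:
        raise ValueError(f"unknown Type field {type_code:04b}")
    return INSTRUCTION_TYPES[type_code]


def route(word: int) -> str:
    """Destination of an instruction after the first decoding unit."""
    return ROUTING[instruction_type(word)]
```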

(1) Load/store and move instructions

Bits 0-3 of the instruction are OP_code, as shown in Table 1.

Table 1: Load/store and move instructions

LM and SM use format (a) of the load/store and move instructions.

LM (Load MPU): moves the floating-point numbers at a run of consecutive DRAM addresses into consecutive registers. Cond (bits 24-27 of the instruction) gives the number of values moved, which is $2^{\text{Cond}-1}$; DRAM_start (bits 12-23 of the instruction) is the start address in the DRAM; and Reg_start (bits 4-11 of the instruction) is the start address in the registers.

SM (Store MPU): writes a run of data from the general-purpose register group to consecutive DRAM addresses. The fields have the same meaning as in LM.
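The field positions given for LM and SM can be turned into a small encoder and decoder. This Python sketch assumes a 32-bit instruction word with bit 0 as the least significant bit and Type in bits 28-31. The OP_code values themselves come from Table 1, which is not reproduced here, so the value used in any call is a placeholder supplied by the caller. The function names are hypothetical.

```python
from typing import Dict

LS_MOVE_TYPE = 0b0001  # Type code of load/store and move instructions


def encode_lm_sm(op_code: int, reg_start: int, dram_start: int, cond: int) -> int:
    """Pack the LM/SM fields into a 32-bit word (format (a))."""
    if not (0 <= op_code < 16 and 0 <= reg_start < 256
            and 0 <= dram_start < 4096 and 0 <= cond < 16):
        raise ValueError("field value out of range")
    return ((LS_MOVE_TYPE << 28) | (cond << 24) | (dram_start << 12)
            | (reg_start << 4) | op_code)


def decode_lm_sm(word: int) -> Dict[str, int]:
    """Unpack the LM/SM fields from a 32-bit word."""
    return {
        "type": (word >> 28) & 0xF,         # bits 28-31
        "cond": (word >> 24) & 0xF,         # bits 24-27, encodes the count
        "dram_start": (word >> 12) & 0xFFF, # bits 12-23
        "reg_start": (word >> 4) & 0xFF,    # bits 4-11
        "op_code": word & 0xF,              # bits 0-3
    }
```

The 12-bit DRAM_start field addresses 4096 locations, which matches a data RAM of four 32 × 32 matrices if each location holds one floating-point number.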

LMR and SMR use format (b) of the load/store and move instructions.

LMR (Load MPU by Register): writes one floating-point number from the DRAM into register Reg_d (bits 4-9 of the instruction), using the offset held in register Reg_s (bits 10-15 of the instruction). DRAM_start (bits 16-27 of the instruction) is the DRAM base address, Reg_s is the address of the register that holds the DRAM offset, and Reg_d is the address of the register written, that is: *Reg_d = *(DRAM_start + *Reg_s).

SMR (Store MPU by Register): writes the floating-point number in register Reg_d to the DRAM, using the offset held in register Reg_s. The fields have the same meaning as in LMR.
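The register-indirect addressing of LMR and SMR can be modelled in a few lines of Python. In this sketch the data RAM and the register group are plain lists, and the offset held in Reg_s is treated as an integer; how the hardware represents that offset in a 48-bit register is not stated in the text, so this is an assumption. The function names `lmr` and `smr` are hypothetical.

```python
from typing import List


def lmr(dram: List[float], regs: List[float],
        dram_start: int, reg_s: int, reg_d: int) -> None:
    """*Reg_d = *(DRAM_start + *Reg_s): load one value via a register offset."""
    offset = int(regs[reg_s])  # assumed: the register holds an integer offset
    regs[reg_d] = dram[dram_start + offset]


def smr(dram: List[float], regs: List[float],
        dram_start: int, reg_s: int, reg_d: int) -> None:
    """*(DRAM_start + *Reg_s) = *Reg_d: store one value via a register offset."""
    offset = int(regs[reg_s])
    dram[dram_start + offset] = regs[reg_d]
```

Because the offset comes from a register, a program can compute an element index at run time, for example to walk down a column of a row-major matrix, which the fixed-address LM and SM forms cannot do.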

MOV uses format (c) of the load/store and move instructions.

MOV: move instruction. The value in source register Reg_s (bits 12-19 of the instruction) is moved into destination register Reg_d (bits 4-11 of the instruction). One contiguous range of the register address space is assigned to the general-purpose register group; all remaining addresses are special registers. Thus, when the addresses of Reg_s and Reg_d both lie in the general-purpose register group, data is exchanged between two general-purpose registers; when the address of Reg_s or Reg_d is a special register address, data is exchanged between a general-purpose register and a special register, or between two special registers.

(2) jump instruction

Bits 0-3 of the instruction are OP_code, as shown in Table 2; bits 4-15 are IRAM_addr; bits 16-19 are Cond, as shown in Table 3; bits 20-23 (of which only bits 20-21 are valid) are FPU_num.

Table 2 Jump instructions

JMP (Jump): jumps to address IRAM_addr in IRAM and fetches the next instruction there.

CJ (Conditional Jump): when condition Cond is met, jumps to address IRAM_addr in IRAM and fetches the next instruction there. The condition is evaluated on the result of the FPU identified by FPU_num.

B (Branch): here IRAM_addr holds an offset, and the next instruction is fetched from the current instruction address plus that offset.

CB (Conditional Branch): when condition Cond is met, the next instruction is fetched from the current address plus the offset. FPU_num and Cond have the same meaning as in CJ.
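The difference between the absolute jumps (JMP, CJ) and the relative branches (B, CB) can be sketched as a next-address function. The name `next_fetch_address` is hypothetical, and two points are assumptions not stated in the text: that a condition which is not met falls through to the next word (current address plus 1), and that the offset is treated as an unsigned value.

```python
def next_fetch_address(op, current, iram_addr, cond_met=True):
    """Return the IRAM address of the next instruction fetch.

    op        -- one of "JMP", "CJ", "B", "CB"
    current   -- address of the jump instruction itself
    iram_addr -- the IRAM_addr field (absolute target or offset)
    cond_met  -- whether the Cond condition holds (used by CJ and CB only)
    """
    fall_through = current + 1  # assumption: sequential fetch advances by one word
    if op == "JMP":
        return iram_addr
    if op == "CJ":
        return iram_addr if cond_met else fall_through
    if op == "B":
        return current + iram_addr
    if op == "CB":
        return current + iram_addr if cond_met else fall_through
    raise ValueError("not a jump instruction: " + op)
```

For example, with the instruction at address 10 and IRAM_addr equal to 4, JMP fetches next from address 4 while B fetches next from address 14.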

Table 3 Condition codes of the jump instructions

Cond	Meaning
0001	Zero
0010	Positive
0011	Negative
0100	Equal
0101	Greater than
0110	Less than

(3) floating-point operation instruction

Bits 0-3 of the instruction are OP_code, as shown in Table 4; bits 4-7 are Fm; bits 8-11 are Fn; bits 12-15 are Fd; bits 16-19 (of which only bits 16-17 are valid) are FPU_num.

Table 4 Floating-point operation instructions

In floating-point operation instructions, FPU_num is the number of the FPU and indicates which FPU performs the operation; Fd is the destination register and, in three-operand operations, is also one of the source operand registers; Fm and Fn are source registers, and single-operand operations use the floating-point number in register Fm.

FABS (Floating-point Absolute Value): floating-point absolute value.

FCMP (Floating-point Compare): floating-point comparison.

FCMPZ (Floating-point Compare with Zero): floating-point comparison with 0.

FCPY (Floating-point Copy): floating-point copy.

FNEG (Floating-point Negate): floating-point negation.

FADD (Floating-point Addition): floating-point addition.

FSUB (Floating-point Subtract): floating-point subtraction.

FMUL (Floating-point Multiply): floating-point multiplication.

FDIV (Floating-point Divide): floating-point division.

FSQRT (Floating-point Square Root): floating-point square root.

FNMUL (Floating-point Negated Multiply): negated product.

FMAC (Floating-point Multiply and Accumulate): multiply-accumulate (Fd = Fd + Fm × Fn).

FMSB (Floating-point Multiply and Subtract): multiply-subtract (Fd = Fd - Fm × Fn).

NOP: no operation.
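As an illustration of how Fd serves as both source and destination in the three-operand instructions, the following sketch applies FMAC, FMSB and the single-operand FABS to a small register file. The function `execute_fp`, the use of mnemonic strings in place of the Table 4 opcode values (which are not reproduced here), and the LSB-first bit numbering are assumptions.

```python
def bits(word, lo, hi):
    """Return bits lo..hi (inclusive) of word, taking bit 0 as the LSB (assumption)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def execute_fp(mnemonic, word, regs):
    """Apply one floating-point instruction to the register list regs, in place."""
    fm = bits(word, 4, 7)    # source register Fm
    fn = bits(word, 8, 11)   # source register Fn
    fd = bits(word, 12, 15)  # destination register Fd
    if mnemonic == "FMAC":
        regs[fd] = regs[fd] + regs[fm] * regs[fn]  # Fd = Fd + Fm * Fn
    elif mnemonic == "FMSB":
        regs[fd] = regs[fd] - regs[fm] * regs[fn]  # Fd = Fd - Fm * Fn
    elif mnemonic == "FABS":
        regs[fd] = abs(regs[fm])                   # single operand: uses Fm only
    else:
        raise ValueError("mnemonic not covered by this sketch: " + mnemonic)


regs = [0.0] * 16
regs[1], regs[2], regs[3] = 2.0, 3.0, 10.0
word = (1 << 4) | (2 << 8) | (3 << 12)  # Fm = 1, Fn = 2, Fd = 3

execute_fp("FMAC", word, regs)
after_fmac = regs[3]
execute_fp("FMSB", word, regs)
after_fmsb = regs[3]
```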

(4) mathematical function instruction

Bits 0-3 of the instruction are OP_code, as shown in Table 5; bits 4-11 are Reg_addr_S; bits 12-19 are Reg_addr_D; bits 20-23 (of which only bits 20-21 are valid) are FPU_num.

Table 5 Mathematical function instructions

In mathematical function instructions, FPU_num is the number of the FPU and indicates which FPU performs the operation; Reg_addr_S is the address of the register holding the argument; Reg_addr_D is the address of the register that receives the function result.

FSINF (Floating-point Sine Function): sine function.

FCOSF (Floating-point Cosine Function): cosine function.

FTANF (Floating-point Tangent Function): tangent function.

FARCSINF (Floating-point Arc-Sine Function): arcsine function.

FARCCOSF (Floating-point Arc-Cosine Function): arccosine function.

FARCTANF (Floating-point Arc-Tangent Function): arctangent function.

(5) SIMD instruction

Bits 0-3 of the instruction are OP_code, as shown in Table 6; bits 4-7 are Fm; bits 8-11 are Fn; bits 12-15 are Fd.

Table 6 SIMD instructions

SABV (SIMD Absolute Value): absolute value.

SCMP (SIMD Compare): comparison.

SCMPEZ (SIMD Compare with Zero): comparison with 0.

SCPY (SIMD Copy): copy.

SNEG (SIMD Negate): negation.

SADD (SIMD Addition): floating-point addition.

SSUB (SIMD Subtract): floating-point subtraction.

SMUL (SIMD Multiply): floating-point multiplication.

SDIV (SIMD Divide): floating-point division.

SSQRT (SIMD Square Root): floating-point square root.

SNMUL (SIMD Negated Multiply): negated product.

SMAC (SIMD Multiply and Accumulate): multiply-accumulate (Fd = Fd + Fm × Fn).

SMSB (SIMD Multiply and Subtract): multiply-subtract (Fd = Fd - Fm × Fn).

NOP: no operation.

(6) matrix operation instruction

Each complete matrix operation instruction consists of 2 or 3 words. For a matrix operation that involves a real number, the instruction has 3 words when I=1, the 3rd word being the real number that takes part in the operation; when I=0, the instruction has 2 words and the real number is held in the special register real_num. All other instructions have 2 words. Matrix operation instruction (a) in Fig. 1 is called the 1st word of a matrix operation instruction; matrix operation instruction (b) or (c) in Fig. 1 is called the 2nd word (when the instruction is a matrix elementary transformation instruction, that is, when OP_code1 is 7, the format is (c); all other instructions use format (b)). In (b), A, B and D are 5 bits each; in (c), the low 5 bits of A, B, C and D are valid data and the 6th bit is 0. The real number that takes part in the operation is called the 3rd word of the matrix operation instruction.

In the 1st word of a matrix operation instruction, bits 0-3 are OP_code1, bits 4-15 are DRAM_start, and bits 16-27 are DRAM_result. In the 2nd word, for format (b), bits 0-3 are OP_code2; bits 4-7 are the low 4 bits of A; bits 8-11 are the low 4 bits of B; bits 12-15 are the low 4 bits of D; bit 16 is the 5th bit of A; bit 17 is the 5th bit of B; bit 18 is the 5th bit of D; bit 19 is I; and bits 20-31 are MB_start. For format (c), bits 0-3 are OP_code2; bit 7 is I; bits 8-13 (of which only bits 8-12 are valid) are D; bits 14-19 (of which only bits 14-18 are valid) are C; bits 20-25 (of which only bits 20-24 are valid) are B; and bits 26-31 (of which only bits 26-30 are valid) are A.
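Because format (b) splits each 5-bit dimension into a low nibble and a separate high bit, a short encode/decode sketch may help. The function names are hypothetical, and bit 0 is assumed to be the least significant bit of the word.

```python
def encode_word2_b(op_code2, a, b, d, i, mb_start):
    """Pack the 2nd word of a matrix instruction in format (b)."""
    return (
        (op_code2 & 0xF)
        | ((a & 0xF) << 4)           # low 4 bits of A
        | ((b & 0xF) << 8)           # low 4 bits of B
        | ((d & 0xF) << 12)          # low 4 bits of D
        | (((a >> 4) & 1) << 16)     # 5th bit of A
        | (((b >> 4) & 1) << 17)     # 5th bit of B
        | (((d >> 4) & 1) << 18)     # 5th bit of D
        | ((i & 1) << 19)            # I flag
        | ((mb_start & 0xFFF) << 20) # MB_start, 12 bits
    )


def decode_word2_b(word):
    """Unpack a format (b) word, rejoining each 5-bit value from its two parts."""
    return {
        "op_code2": word & 0xF,
        "A": ((word >> 4) & 0xF) | (((word >> 16) & 1) << 4),
        "B": ((word >> 8) & 0xF) | (((word >> 17) & 1) << 4),
        "D": ((word >> 12) & 0xF) | (((word >> 18) & 1) << 4),
        "I": (word >> 19) & 1,
        "MB_start": (word >> 20) & 0xFFF,
    }


# Hypothetical values: a 17 x 5 matrix, D = 31, real number supplied as a 3rd word (I = 1).
word = encode_word2_b(op_code2=2, a=17, b=5, d=31, i=1, mb_start=0x123)
fields = decode_word2_b(word)
```

With 5 bits per dimension, A, B and D can each range from 0 to 31.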

Type is 0110; DRAM_result is the start address in DRAM at which the matrix operation result is stored; DRAM_start is the start address of matrix A in DRAM; MB_start is the start address of the 2nd matrix in the matrix operation; OP_code1 is the type of matrix operation and, combined with OP_code2 in the 2nd word of the matrix operation instruction, identifies the specific matrix operation, as shown in Table 7. OP_code1 and OP_code2 are given in hexadecimal, and values not listed are undefined instruction encodings.

Table 7 Matrix operation instructions

MGNM (Matrix Generate Null Matrix): generates an all-zeros matrix of size A × B.

MGOM (Matrix Generate One's Matrix): generates an all-ones matrix of size A × B.

MGIM (Matrix Generate Identity Matrix): generates an identity matrix of size A × A.

MTRN (Matrix Transposition): matrix transpose; the original matrix has size A × B.

MRE (Matrix Row Extract): extracts a matrix row; the matrix has size A × B, and row D of the matrix is extracted.

MCE (Matrix Column Extract): extracts a matrix column; the matrix has size A × B, and column D of the matrix is extracted.

MCPY (Matrix Copy): copies a matrix of size A × B.

MMRA (Matrix M Row Addition): sum of M rows; the matrix has size A × B, and the sum of D rows of the matrix is computed.

MNCA (Matrix N Column Addition): sum of N columns; the matrix has size A × B, and the sum of D columns of the matrix is computed.

MRA (Matrix Row Addition): row-wise summation; the matrix has size A × B, and the sum of each row of the matrix is computed.

MCA (Matrix Column Addition): column-wise summation; the matrix has size A × B, and the sum of each column of the matrix is computed.

MARN (Matrix Add Real Number): adds a real number to a matrix; the matrix has size A × B, and the real number is added to each element. The real number is the 3rd word of the matrix operation instruction or the value in the special register.

MSRN (Matrix Subtract Real Number): subtracts a real number from a matrix; the matrix has size A × B, and the real number is subtracted from each element. The real number is the 3rd word of the matrix operation instruction or the value in the special register.

MMRN (Matrix Multiply Real Number): multiplies a matrix by a real number; the matrix has size A × B, and each element is multiplied by the real number. The real number is the 3rd word of the matrix operation instruction or the value in the special register.

MDRN (Matrix Divide Real Number): divides a matrix by a real number; the matrix has size A × B, and each element is divided by the real number. The real number is the 3rd word of the matrix operation instruction or the value in the special register.

MAM (Matrix Add Matrix): matrix addition; both matrices have size A × B, and corresponding elements of the two matrices are added.

MDM (Matrix Subtract Matrix): matrix subtraction; both matrices have size A × B, and corresponding elements of the two matrices are subtracted.

MDMM (Matrix Dot Multiply Matrix): element-wise matrix multiplication; both matrices have size A × B, and corresponding elements of the two matrices are multiplied.

MDD (Matrix Dot Divide): element-wise matrix division; both matrices have size A × B, and corresponding elements of the two matrices are divided.

MMM (Matrix Multiply Matrix): matrix multiplication; matrix 1 has size A × B, matrix 2 has size B × D, and the result has size A × D.
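The dimension rule for MMM (A × B times B × D gives A × D) can be checked with a minimal sketch built from repeated multiply-accumulate steps. The function name `mmm` and the nested-list matrix representation are assumptions for illustration; the text does not describe how the processor schedules the operation internally.

```python
def mmm(m1, m2):
    """Multiply an A x B matrix by a B x D matrix, returning an A x D matrix."""
    a, b = len(m1), len(m1[0])
    d = len(m2[0])
    if len(m2) != b:
        raise ValueError("inner dimensions must match (columns of m1 = rows of m2)")
    result = [[0.0] * d for _ in range(a)]
    for i in range(a):
        for j in range(d):
            for k in range(b):
                # One multiply-accumulate per term: acc = acc + m1[i][k] * m2[k][j]
                result[i][j] += m1[i][k] * m2[k][j]
    return result


# A = 2, B = 2, D = 1
product = mmm([[1.0, 2.0], [3.0, 4.0]], [[5.0], [6.0]])
```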

METRS (Matrix Elementary Transformation, Row Switching): swaps two rows; the matrix has size A × B, the data of row C is written to row D, and the data of row D is written to row C.

METRM (Matrix Elementary Transformation, Row Multiplication): row multiplication; the matrix has size A × B, and each element of row D is multiplied by the real number, added to the element of row C in the same column, and written to row C.

METRA (Matrix Elementary Transformation, Row Addition): row addition; the matrix has size A × B, and each element of row D is added to the element of row C in the same column and written to row C.

METCS (Matrix Elementary Transformation, Column Switching): swaps two columns; the matrix has size A × B, the data of column C is written to column D, and the data of column D is written to column C.

METCM (Matrix Elementary Transformation, Column Multiplication): column multiplication; the matrix has size A × B, and each element of column D is multiplied by the real number, added to the element of column C in the same row, and written to column C.

METCA (Matrix Elementary Transformation, Column Addition): column addition; the matrix has size A × B, and each element of column D is added to the element of column C in the same row and written to column C.
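The three row transformations can be sketched as in-place operations on a nested-list matrix; the column variants follow the same pattern with rows and columns exchanged. The function names and the 0-based row indices are assumptions, since the text does not say whether C and D count from 0 or 1.

```python
def metrs(m, c, d):
    """Row switching: exchange rows C and D."""
    m[c], m[d] = m[d], m[c]


def metrm(m, c, d, real):
    """Row multiplication: row C = row D * real + row C."""
    m[c] = [x_d * real + x_c for x_d, x_c in zip(m[d], m[c])]


def metra(m, c, d):
    """Row addition: row C = row D + row C."""
    m[c] = [x_d + x_c for x_d, x_c in zip(m[d], m[c])]


m = [[1.0, 2.0], [3.0, 4.0]]
metrs(m, 0, 1)       # m is now [[3, 4], [1, 2]]
metra(m, 0, 1)       # row 0 = row 1 + row 0 -> [4, 6]
metrm(m, 1, 0, 2.0)  # row 1 = row 0 * 2 + row 1 -> [9, 14]
```

Row multiplication followed by row switching is the usual building block of Gaussian elimination, which is a likely reason these operations get dedicated instructions.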

Fig. 7 shows the structure of the matrix processor core, which consists of a fetch unit, decoding unit 1, decoding unit 2, a data read/write unit, a general-purpose register group, a floating-point operation unit and a control unit. The interfaces between the MPU core and the outside are the interface between the fetch unit and IRAM, the interface between the data read/write unit and DRAM, and the interface between the control unit and the CPU.

The fetch unit receives the instruction fetch enable signal from the control unit, begins reading instructions from IRAM in a loop, sends them to decoding unit 1, and carries out jump instructions. Decoding unit 1 receives the instructions sent by the fetch unit and decodes them according to their class: matrix operation and mathematical function instructions are converted into SIMD or floating-point operation instructions and written to decoding unit 2, while L/S and move instructions are sent to the data read/write unit. The data read/write unit receives the data address and enable signal from decoding unit 1 and reads data from DRAM into the general-purpose register group, writes data from the general-purpose register group to DRAM, transfers data between registers, and writes general-purpose register values into special registers. Decoding unit 2 receives the SIMD and floating-point operation instructions sent by decoding unit 1, decodes them into floating-point operation instructions, and passes them to the floating-point operation unit. The floating-point operation unit comprises four FPU1 modules and one FPU2 module; the destination and source register addresses of each FPU1 are controlled by decoding unit 2. FPU1 performs extended single-precision floating-point operations, and FPU2 performs extended single-precision 4-input addition. The control unit controls the operation of the matrix processor and sends an interrupt signal to the external CPU when computation completes or an exception occurs.
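One plausible use of the four FPU1 modules feeding the 4-input adder FPU2 is a four-element dot product, which is the inner step of a matrix multiplication. The sketch below models that data flow only; the function names are hypothetical, and the text does not state that dot products are scheduled exactly this way.

```python
def fpu1_multiply(x, y):
    """Stand-in for one FPU1 performing a floating-point multiplication."""
    return x * y


def fpu2_add4(a, b, c, d):
    """Stand-in for FPU2: a single 4-input floating-point addition."""
    return a + b + c + d


def dot4(xs, ys):
    """Four FPU1 products computed side by side, then reduced by FPU2."""
    if len(xs) != 4 or len(ys) != 4:
        raise ValueError("this sketch models exactly four FPU1 lanes")
    products = [fpu1_multiply(x, y) for x, y in zip(xs, ys)]
    return fpu2_add4(*products)


result = dot4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])
```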

Fig. 8 shows the structure of an embedded system built around the matrix processor, comprising an embedded CPU, a bus (BUS), SDRAM (memory), the matrix processor, a register group and an interrupt generator.

The CPU controls the matrix processor by writing control parameters to the register group. The interrupt generator is placed between the CPU and the register group, and interrupt requests from the matrix processor reach the CPU through the register group and the interrupt generator, which provides communication and interaction with the CPU.

In this example, the Advanced High-performance Bus (AHB) defined by the Advanced Microcontroller Bus Architecture (AMBA) 2.0 protocol is used as the bus standard for BUS. The CPU mainly comprises a processor core (chiefly an integer unit), separate instruction and data caches, an interrupt controller, a debug support unit (DSU), timers, a universal asynchronous receiver/transmitter (UART) and a memory controller, and is obtained by trimming down LEON2.

The normal working mode of the matrix processor embedded system proceeds as follows:

1) After the system powers on, the CPU resets, reads the boot instructions from Flash, and completes processor initialization. It writes configuration parameters to the register group to configure the matrix processor. Finally, it loads the operating system from SDRAM and prepares to run the application.

2) The CPU writes to the register group the start and end addresses, in SDRAM, of the matrix processor's instruction sequence and of the data required for computation, sends a start signal to the register group, and then releases the bus. On receiving the start signal, the matrix processor requests the bus.

3) Once the matrix processor holds the bus, it reads from SDRAM according to the start and end addresses of the instruction sequence and data that the CPU wrote to the register group, writes them into IRAM and DRAM respectively, and then releases the bus. The CPU takes the bus again and can carry out other tasks unrelated to the matrix processor.

4) After the instruction sequence and data have been written, the matrix processor core starts working. The fetch unit reads instructions from IRAM in order and sends them to decoding unit 1; if it receives a jump instruction from decoding unit 1, it performs the corresponding jump. Decoding unit 1 handles the instructions it receives as follows: L/S and move instructions are sent to the data read/write unit; jump instructions are sent to the fetch unit; matrix and mathematical function instructions are decoded into SIMD or floating-point operation instructions and output to decoding unit 2; SIMD and floating-point operation instructions are passed straight through to decoding unit 2. The data read/write unit, in response to Load instructions, reads data from DRAM and writes it to the general-purpose register group. Decoding unit 2 decodes the instructions sent by decoding unit 1 and sends them to the floating-point operation unit. The floating-point operation unit, following the floating-point instructions from decoding unit 2, reads the corresponding floating-point data from the general-purpose registers, performs the floating-point operation, and writes the result back to the general-purpose register group. The data read/write unit, in response to Store instructions, writes the data in the MPU general-purpose register group back to DRAM.

5) When all instructions in IRAM have completed, the matrix processor's control unit writes the corresponding parameters into the register group, and the interrupt generator raises the corresponding interrupt. Based on the interrupt received and the overall task to be completed, the CPU decides whether the matrix processor should proceed with further computation; if all computation is complete, it directs the matrix processor to write the results held in DRAM into the designated address region of SDRAM.

The debug mode of the matrix processor embedded system proceeds as follows:

2) The CPU directs the writing of the matrix processor's instructions and the data required for computation into IRAM and DRAM.

3) The CPU can send two kinds of working signal: a single-step run signal and a continuous run signal. On receiving a single-step run signal, the matrix processor executes one complete instruction each time; on receiving a continuous run signal, it performs the work of step 4) of the normal working mode. While the matrix processor is working, the CPU can read and write IRAM and DRAM at any time to check whether intermediate results are correct.

Claims

1. the method for work of embedded system, is characterized in that, described embedded system comprises CPU, bus, SDRAM, matrix processor, register group and interrupts generator; The external data interface of CPU, SDRAM, matrix processor is connected to bus; CPU is connected respectively register group and is interrupted generator, described register group disconnecting generator and matrix processor by two data lines;

Described matrix processor comprises external data interface, IRAM, DRAM, matrix processor core; IRAM, DRAM and the external memory storage of described external data interface connection matrix processor, complete writing and carrying out exchanges data with outside of matrix processor instruction; Described IRAM and DRAM, be equivalent to the buffer memory of matrix processor; IRAM receives the instruction sequence that CPU writes; The result of calculation that the matrix that DRAM reception CPU writes or other data, receiving matrix processor core write, reads for matrix processor or CPU, completes the exchanges data of matrix processor and CPU; Described matrix processor core, writes back and controls for fetching, decoding, computing, result;

Described matrix processor core comprises fetching unit, the first decoding unit, the second decoding unit, the unit that reads and writes data, general purpose register set, Float Point Unit and control module; IRAM, fetching unit, the first decoding unit, the second decoding unit, Float Point Unit connect successively; Float Point Unit, general purpose register set, the unit that reads and writes data, DRAM connect successively; Unit reads and writes data described in the first decoding unit connects;

The working method of the embedded system comprises the following steps:

4) after the instruction sequence and data have been written, the matrix processor core starts working; the fetch unit reads instructions from IRAM in order and sends them to the first decoding unit; if it receives a jump instruction from the first decoding unit, it performs the corresponding jump; the first decoding unit handles the instructions it receives as follows: L/S and move instructions are sent to the data read/write unit; jump instructions are sent to the fetch unit; matrix and mathematical function instructions are decoded into SIMD or floating-point operation instructions and output to the second decoding unit; SIMD and floating-point operation instructions are passed straight through to the second decoding unit; the data read/write unit, in response to Load instructions, reads data from DRAM and writes it to the general-purpose register group; the second decoding unit decodes the instructions sent by the first decoding unit and sends them to the floating-point operation unit; the floating-point operation unit, following the floating-point instructions from the second decoding unit, reads the corresponding floating-point data from the general-purpose registers, performs the floating-point operation, and writes the result back to the general-purpose register group; the data read/write unit, in response to Store instructions, writes the data in the general-purpose register group of the matrix processor core back to DRAM;

2. the method for work of embedded system according to claim 1, it is characterized in that, described external data interface, IRAM, DRAM, a register group of the common connection of matrix processor core, described register group is deposited system information and the interactive information of external data interface, IRAM, DRAM, matrix processor core.

3. the method for work of embedded system according to claim 2, it is characterized in that, described external data interface, IRAM, DRAM, an interruption generator of the common connection of matrix processor core, the interrupt request of external data interface, IRAM, DRAM, matrix processor core is exported to CPU by register group and interruption generator.

4. the method for work of embedded system according to claim 1, it is characterized in that, the instruction set that described matrix processor is used comprises: L/S and move, jump instruction, floating-point operation instruction, mathematical function instruction, single instruction multiple data instruction, matrix operation instruction;

Described single instruction multiple data instruction, completes the concurrent operation of different floating numbers, and the computing the completing computing that instruction comprises with floating-point operation is identical;

5. the method for work of embedded system according to claim 4, is characterized in that, the instruction fetch enable signal that fetching unit reception control unit sends starts the reading command that circulates from IRAM, sends instruction to the first decoding unit, and completes jump instruction; The first decoding unit receives the instruction sending from fetching unit, according to the classification of instruction, carry out decoding, by matrix operation with mathematical function operational order converts SIMD to or floating-point operation instruction writes the second decoding unit, L/S and move are sent into the unit that reads and writes data; The unit that reads and writes data receives data address and the enable signal that the first decoding unit sends, and completes reading out data from DRAM and writes general purpose register set, and the data in general purpose register set are write to DRAM, and the data that complete between the register of general purpose register set shift; The second decoding unit receives SIMD and the floating-point operation instruction that the first decoding unit sends, and by instruction decoding, is floating-point operation instruction, gives Float Point Unit; Float Point Unit comprises four the first floating-point operation modules in parallel and a second floating-point operation module, and these four the first floating-point operation modules in parallel are connected in series to the second floating-point operation module; The object of each the first floating-point operation module and source-register address are controlled by the second decoding unit; The first floating-point operation module completes the computing of expansion single-precision floating point, and the second floating-point operation module completes 4 input additions of expansion single precision; The operation of control module gating matrix processor, sends look-at-me to CPU when calculating completes or occurs when abnormal.

6. by the method for work of embedded system claimed in claim 5, it is characterized in that, described matrix processor also comprises a specified register, when described specified register carries out at matrix processor the operational order that has real number participation, preserves this real number.