CN102360344B - Matrix processor as well as instruction set and embedded system thereof - Google Patents

Matrix processor as well as instruction set and embedded system thereof

Info

Publication number
CN102360344B
CN102360344B CN201110303919.4A CN201110303919A CN102360344B CN 102360344 B CN102360344 B CN 102360344B CN 201110303919 A CN201110303919 A CN 201110303919A CN 102360344 B CN102360344 B CN 102360344B
Authority
CN
China
Prior art keywords
instruction
matrix
floating
data
matrix processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201110303919.4A
Other languages
Chinese (zh)
Other versions
CN102360344A (en)
Inventor
张斌
梅魁志
郑南宁
董培祥
张书锋
李宇海
赵晨
殷浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201110303919.4A priority Critical patent/CN102360344B/en
Publication of CN102360344A publication Critical patent/CN102360344A/en
Application granted granted Critical
Publication of CN102360344B publication Critical patent/CN102360344B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Advance Control (AREA)

Abstract

The invention provides a matrix processor, together with its instruction set and an embedded system that uses it. The matrix processor comprises an external data interface, an IRAM (instruction RAM), a DRAM (data RAM) and a matrix processor core. The external data interface connects the IRAM and the DRAM of the matrix processor to an external memory, so that instructions can be written into the matrix processor and data can be exchanged with the outside. The IRAM and the DRAM act as buffers of the matrix processor: the IRAM receives the instruction sequence written by an external module, and the DRAM receives matrices or other data written by the external module as well as computing results written by the matrix processor core, so that these can be used by the matrix processor or read by the external module, completing the data exchange between the two. The matrix processor core performs instruction fetch, decoding, computation, result write-back and control. The matrix processor provided by the invention can independently complete various matrix operations and other mathematical operations.

Description

Matrix processor and instruction set thereof and embedded system
[technical field]
The present invention relates to the technical field of processors, and in particular to a matrix processor, its instruction set, and an embedded system.
[background technology]
Matrix operations are a fundamental problem in scientific and engineering computing. They belong not only to mathematics but are also an important mathematical tool in many scientific and engineering disciplines, including physics, mechanics, computer science and aerospace. In computer science and technology in particular, many fields rely on matrix operations, such as digital image processing, computer graphics, pattern recognition, machine vision, artificial intelligence, scientific computing, and general algorithm design and analysis.
Traditionally, matrix operations are mostly performed serially on a general-purpose processor, which severely limits computing speed. Implementing matrix operations in hardware can increase speed, but most existing hardware implementations are dedicated structures designed for one specific operation, such as matrix inversion or matrix multiplication; they require large hardware resources and are inflexible. Moreover, complete algorithms that involve matrix operations generally combine several matrix operations with other, non-matrix operations. When a hardware circuit for one particular matrix operation is added to a computing system, it spends a long time interacting with and waiting for the other parts of the system, so the efficiency gained by adding such a special-purpose module is very limited.
There is therefore a strong demand for a matrix operation unit that can independently execute a complete algorithm.
[summary of the invention]
The object of the present invention is to provide a matrix processor and an embedded system thereof that can independently complete various matrix operations and other mathematical operations.
To solve the above problem, the present invention adopts the following technical scheme:
A matrix processor comprises an external data interface, an IRAM, a DRAM and a matrix processor core.
The external data interface connects the IRAM and DRAM of the matrix processor to an external memory; through it, matrix processor instructions are written and data is exchanged with the outside.
The IRAM and DRAM act as the buffers of the matrix processor. The IRAM receives the instruction sequence written by an external module. The DRAM receives the matrices or other data written by the external module and the computing results written by the matrix processor core, and is read by the matrix processor or the external module, thereby completing the data exchange between the matrix processor and the external module.
The matrix processor core performs instruction fetch, decoding, computation, result write-back and control.
In a further improvement, the external data interface, IRAM, DRAM and matrix processor core are all connected to a register group, which stores the system information and interaction information of the external data interface, IRAM, DRAM and matrix processor core.
In a further improvement, the external data interface, IRAM, DRAM and matrix processor core are all connected to an interrupt generator; their interrupt requests are output to an external control module (an external CPU) through the register group and the interrupt generator.
In a further improvement, the matrix processor core comprises an instruction fetch unit, a first decoding unit, a second decoding unit, a data read/write unit, a general-purpose register set, a floating-point unit and a control unit. The IRAM, instruction fetch unit, first decoding unit, second decoding unit and floating-point unit are connected in sequence; the floating-point unit, general-purpose register set, data read/write unit and DRAM are connected in sequence; and the first decoding unit is connected to the data read/write unit.
In a further improvement, the instruction fetch unit receives the fetch enable signal sent by the control unit, reads instructions from the IRAM in a loop, sends them to the first decoding unit, and carries out jump instructions. The first decoding unit receives the instructions sent by the instruction fetch unit and decodes them by class: matrix operation and mathematical function instructions are converted into SIMD or floating-point operation instructions and written to the second decoding unit, while L/S and move instructions are sent to the data read/write unit. The data read/write unit receives the data address and enable signal sent by the first decoding unit; it reads data from the DRAM into the general-purpose register set, writes data from the general-purpose register set to the DRAM, and transfers data between registers of the general-purpose register set. The second decoding unit receives the SIMD and floating-point operation instructions sent by the first decoding unit, decodes them into floating-point operations, and passes them to the floating-point unit. The floating-point unit comprises four first floating-point operation modules in parallel and one second floating-point operation module; the four parallel first modules are connected in series to the second module. The destination and source register addresses of each first floating-point operation module are controlled by the second decoding unit. The first floating-point operation modules perform extended single-precision floating-point operations, and the second floating-point operation module performs 4-input extended single-precision addition. The control unit controls the operation of the matrix processor and sends an interrupt signal to the external CPU when a calculation completes or an exception occurs.
In a further improvement, the matrix processor also comprises a specified register, which holds the real number when the matrix processor executes an operation instruction that involves a real-number operand.
In a further improvement, the instruction set used by the matrix processor comprises: L/S and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single instruction multiple data (SIMD) instructions, and matrix operation instructions.
The L/S and move instructions read and write data between the matrix processor buffers and the registers, and between registers.
The jump instructions change the order of instruction execution.
The floating-point operation instructions perform basic floating-point arithmetic, including absolute value, comparison, addition, subtraction, multiplication, division, square root and multiply-add.
The mathematical function instructions compute elementary mathematical functions, including trigonometric, inverse trigonometric, logarithmic and exponential functions.
The SIMD instructions perform operations on several floating-point numbers in parallel; the operations are the same as those of the floating-point operation instructions.
The matrix operation instructions perform basic, simple matrix operations, including matrix generation, matrix transposition, row and column extraction, summation along rows or columns, addition, subtraction, multiplication and division of a matrix by a real number, matrix addition, subtraction and multiplication, and elementary matrix transformations.
An embedded system comprises a CPU, a bus, an SDRAM, a matrix processor, a register group and an interrupt generator. The CPU, the SDRAM and the external data interface of the matrix processor are connected to the bus. The CPU is connected to the register group and to the interrupt generator by two data lines respectively; the register group is connected to the interrupt generator and to the matrix processor.
A working method of the embedded system comprises the following steps:
1) After the system powers on, the CPU resets, reads the boot instructions from Flash and initializes the processor. It writes configuration parameters to the register group to configure the matrix processor. Finally, it loads the operating system from the SDRAM and prepares to run the application.
2) The CPU writes to the register group the start and end addresses, in the SDRAM, of the matrix processor's instruction sequence and of the data required for the calculation, sends a start signal to the register group, and then releases the bus. On receiving the start signal, the matrix processor requests the bus.
3) Once the matrix processor holds the bus, it reads the instruction sequence and data from the SDRAM according to the start and end addresses that the CPU wrote in the register group, writes them into the IRAM and DRAM respectively, and then releases the bus.
4) After the instruction sequence and data have been written, the matrix processor core starts working. The instruction fetch unit reads instructions from the IRAM in sequence and sends them to the first decoding unit; if it receives a jump instruction from the first decoding unit, it performs the corresponding jump. The first decoding unit handles each received instruction as follows: L/S and move instructions are sent to the data read/write unit; jump instructions are sent to the instruction fetch unit; matrix and mathematical function instructions are decoded into SIMD or floating-point operation instructions and output to the second decoding unit; SIMD and floating-point operation instructions are passed straight through to the second decoding unit. On a Load instruction, the data read/write unit reads data from the DRAM and writes it to the general-purpose register set. The second decoding unit decodes the instructions sent by the first decoding unit and sends them to the floating-point unit. According to the floating-point instruction from the second decoding unit, the floating-point unit reads the corresponding floating-point data from the general-purpose registers, performs the operation, and writes the result back to the general-purpose register set. On a Store instruction, the data read/write unit writes data from the general-purpose register set of the matrix processor core back to the DRAM.
5) When all instructions in the IRAM have been executed, the control unit of the matrix processor writes the corresponding parameters to the register group, and the interrupt generator produces the corresponding interrupt. Based on the interrupt received and the overall task progress, the CPU decides whether the matrix processor should carry out further operations; if the calculation is complete, it directs the matrix processor to write the results in the DRAM to the designated address region of the SDRAM.
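The routing performed by the first decoding unit in step 4 can be sketched as follows. This is only an illustrative model: the names `InstrClass` and `route_first_decoder` and the unit labels are hypothetical, and the actual expansion of matrix and mathematical function instructions into SIMD or floating-point instructions is not modeled.

```python
from enum import Enum, auto


class InstrClass(Enum):
    """The six instruction classes named in the text (labels are illustrative)."""
    LS_MOVE = auto()
    JUMP = auto()
    FLOAT = auto()
    MATH = auto()
    SIMD = auto()
    MATRIX = auto()


def route_first_decoder(instr_class):
    """Return (destination unit, expanded_first) for one instruction class.

    expanded_first is True when the first decoding unit converts the
    instruction into SIMD or floating-point instructions before forwarding.
    """
    if instr_class is InstrClass.LS_MOVE:
        return ("data_rw_unit", False)
    if instr_class is InstrClass.JUMP:
        return ("fetch_unit", False)
    if instr_class in (InstrClass.MATRIX, InstrClass.MATH):
        return ("second_decoder", True)
    # SIMD and floating-point instructions pass straight through.
    return ("second_decoder", False)
```

For example, `route_first_decoder(InstrClass.MATRIX)` reports that the instruction goes to the second decoding unit after expansion, while a jump goes back to the instruction fetch unit.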
An instruction set is the set of instructions a processor uses to compute and to control the system. Each processor is designed with an instruction set that matches its hardware circuits. How advanced the instruction set is also affects processor performance: it is both an effective means of improving processor efficiency and an important indicator of processor performance.
Because of the functions it performs, the matrix processor has an instruction set different from that of a general-purpose CPU, but its design still follows the principles of the reduced instruction set computer (RISC): (1) few instruction types, a regular format and a uniform instruction length; (2) simplified addressing modes; (3) extensive use of register-to-register operations, with RAM accessed only by specific instructions; (4) a simplified processor structure; (5) enhanced parallelism in the processor.
To realize a matrix processor instruction set that can complete various matrix operations and other mathematical operations, the instruction set should have six parts: L/S and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single instruction multiple data (SIMD) instructions, and matrix operation instructions.
(1) L (Load)/S (Store) and move instructions read and write data between the matrix processor buffers and the registers, and between registers.
(2) Jump instructions change the order of instruction execution.
(3) Floating-point operation instructions perform basic floating-point arithmetic, including absolute value, comparison, addition, subtraction, multiplication, division, square root and multiply-add.
(4) Mathematical function instructions compute some elementary mathematical functions, including trigonometric, inverse trigonometric, logarithmic and exponential functions.
(5) SIMD (single instruction multiple data) instructions perform operations on several floating-point numbers in parallel; the operations are the same as those of the floating-point operation instructions.
(6) Matrix operation instructions perform basic, simple matrix operations, including matrix generation, matrix transposition, row and column extraction, summation along rows or columns, addition, subtraction, multiplication and division of a matrix by a real number, matrix addition, subtraction and multiplication, and elementary matrix transformations. By combining these operations, the matrix processor can complete complex algorithms.
The present invention proposes a method and flow for designing a matrix processor from its instruction set. First, according to actual needs, determine which instructions the matrix processor must implement, its computational precision, its working modes, the maximum supported matrix size, the sizes of the IRAM and DRAM, the composition of the arithmetic unit and the size of the general-purpose register set, and design the instruction format from these conditions. Then design the structure, workflow and exception handling of the matrix processor.
Following the above instruction set and design method, and for a concrete usage environment, the present invention realizes a matrix processor that computes in extended single precision, together with its working mechanism:
The instructions to be implemented are: (1) L/S and move instructions: LM, SM, LMR, SMR, MOV; (2) jump instructions: JMP, JL, B, BL; (3) floating-point operation instructions: FABV, FCMP, FCMPZ, FCPY, FNEG, FADD, FSUB, FMUL, FDIV, FSQRT, FNMUL, FMAC, FMSB, NOP; (4) mathematical function instructions: FSINF, FCOSF, FTANF, FARCSINF, FARCCOSF, FARCTANF; (5) SIMD instructions: SABV, SCMP, SCMPEZ, SCPY, SNEG, SADD, SSUB, SMUL, SDIV, SSQRT, SNMUL, SMAC, SMSB, NOP; (6) matrix operation instructions: MGNM, MGOM, MGIM, MTRN, MRE, MCE, MCPY, MMRA, MNCA, MRA, MCA, MARN, MSRN, MMRN, MDRN, MAM, MSM, MDMM, MDD, MMM, METRS, METRM, METRA, METCS, METCM, METCA.
Computational accuracy: while carrying out exchanges data with outside, floating number precision is the single precision floating datum that meets IEEE754 standard; During internal calculation, use the single precision extended floating-point numbers that meets ieee standard, wherein sign bit is 1,8 of exponent bits, 39 of truth of a matter positions.
Two kinds of mode of operations: mode of operation and debugging mode.Work and debugging mode mainly change by bus interface.It when bus interface is host device interface, is mode of operation; Bus interface is debugging mode when from equipment interface.
Support is 32 * 32 floating number matrix operation to the maximum.
IRAM size is 32Kbits; DRAM is 192Kbits (can deposit the matrix of 4 32 * 32 at most).
Arithmetic element is connected in series to other 1 floating-point operation module (FPU2) by 4 in parallel floating-point operation modules (FPU1) that complete identical function and realizes.Wherein, FPU1 complete take absolute value, floating number comparison, copy, get opposite number, add, subtract, multiplication and division, evolution, product get negative, take advantage of add, take advantage of subtract, blank operation.FPU2 completes 4 input floating adds.
In general purpose register set, each register is 48bits, 64 altogether, and can be used as an integral body and use, also can divide into groups to use, be divided into four groups, 16 every group, the register file mode of third reading is write in every group of register employing one.
With respect to prior art, the present invention has the following advantages: matrix processor of the present invention can the various matrix operations of complete independently and other mathematical operations, and by combining these different computings, matrix processor can complete Various Complex algorithm.It is fast that this processor completes flop operating speed, has very strong dirigibility, computing that can complete independently total algorithm.
[accompanying drawing explanation]
Fig. 1 is the basic design drawing of matrix processor;
Fig. 2 is the floating point data format of matrix processor;
Fig. 3 is the connected mode of matrix processor and bus;
Fig. 4 is floating-point operation structural drawing;
Fig. 5 is the structure of general-purpose register;
Fig. 6 is matrix processor instruction set form;
Fig. 7 is the structural drawing of matrix processor core;
Fig. 8 is the structural drawing of matrix processor system.
[embodiment]
The present invention is described in detail below with reference to the drawings and specific embodiments.
For the matrix processor instruction set to perform matrix computations and to implement the computation and control of complete algorithms, it should have six parts: L/S and move instructions, jump instructions, floating-point operation instructions, mathematical function instructions, single instruction multiple data (SIMD) instructions, and matrix operation instructions.
According to actual needs, the instructions implemented are: (1) L/S and move instructions: LM, SM, LMR, SMR, MOV; (2) jump instructions: JMP, JL, B, BL; (3) floating-point operation instructions: FABV, FCMP, FCMPZ, FCPY, FNEG, FADD, FSUB, FMUL, FDIV, FSQRT, FNMUL, FMAC, FMSB, NOP; (4) mathematical function instructions: FSINF, FCOSF, FTANF, FARCSINF, FARCCOSF, FARCTANF; (5) SIMD instructions: SABV, SCMP, SCMPEZ, SCPY, SNEG, SADD, SSUB, SMUL, SDIV, SSQRT, SNMUL, SMAC, SMSB, NOP; (6) matrix operation instructions: MGNM, MGOM, MGIM, MTRN, MRE, MCE, MCPY, MMRA, MNCA, MRA, MCA, MARN, MSRN, MMRN, MDRN, MAM, MSM, MDMM, MDD, MMM, METRS, METRM, METRA, METCS, METCM, METCA.
Fig. 1 shows the basic design structure of the matrix processor. The external data interface connects the IRAM and DRAM of the matrix processor; through it, matrix processor instructions are written and data is exchanged with the outside. It can be connected either directly to the CPU or to the system bus. The IRAM and DRAM act as the buffers of the matrix processor. The IRAM receives the instruction sequence written by an external module. The DRAM receives the matrices or other data written by the external module and the computing results written by the matrix processor core, and is read by the matrix processor or the external module, thereby completing the data exchange between the matrix processor and the external module. The width and depth of the IRAM and DRAM are determined according to actual requirements. The matrix processor core performs instruction fetch, decoding, computation, write-back and control.
The precision of the data computed by the matrix processor is as follows. When exchanging data with the outside, floating-point numbers are single-precision numbers conforming to the IEEE 754 standard. Internal calculations use an extended single-precision format conforming to the IEEE 754 standard, with 1 sign bit, 8 exponent bits and 39 mantissa bits, as shown in Fig. 2. The sign bit (sign) gives the sign of the floating-point number; the exponent (exponent) is the exponent of the binary floating-point number; the mantissa (fraction) is the fractional part of the binary floating-point number. The value of an extended single-precision number is $value = (-1)^{sign} \times 1.f \times 2^{exp - bias}$, where f is the mantissa, exp is the exponent and bias is the exponent offset; for single precision, bias is 127.
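As an illustration of the two formats, the following sketch widens an IEEE 754 single-precision number into a 48-bit word with 1 sign bit, 8 exponent bits and 39 mantissa bits, and evaluates the value formula given above. The bit positions (sign in bit 47, exponent in bits 39-46, mantissa in bits 0-38), the zero padding of the low mantissa bits, and the reuse of bias 127 are assumptions made for this example; Fig. 2 is not reproduced here. Only normal numbers are handled (no zero, subnormal, infinity or NaN).

```python
import struct

EXP_BITS = 8
FRAC_BITS = 39      # mantissa width of the internal format
BIAS = 127          # assumed equal to the single-precision bias


def single_to_ext48(x):
    """Widen an IEEE 754 single into the assumed 48-bit layout."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF                   # 23-bit single-precision fraction
    # Pad the fraction with 16 zero bits to fill the 39-bit mantissa field.
    return (sign << 47) | (exp << FRAC_BITS) | (frac << (FRAC_BITS - 23))


def ext48_value(word):
    """Evaluate (-1)^sign * 1.f * 2^(exp - bias) for a normal number."""
    sign = (word >> 47) & 1
    exp = (word >> FRAC_BITS) & 0xFF
    frac = word & ((1 << FRAC_BITS) - 1)
    return (-1) ** sign * (1 + frac / 2 ** FRAC_BITS) * 2.0 ** (exp - BIAS)
```

Because the exponent field has the same width in both formats, widening in this sketch only extends the mantissa; values that are exact in single precision, such as 1.5, round-trip unchanged.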
Working modes: working mode and debug mode. The external module is taken to be an embedded CPU, connected to the matrix processor by a bus, as shown in Fig. 3. Switching between working mode and debug mode is done by changing the bus interface. When the RAMs of the matrix processor are connected to the bus through a bus master interface, the processor is in working mode: the CPU writes data to the control registers to tell the matrix processor when to start; the matrix processor then reads the required data from memory by itself and performs the computation, and when the computation finishes it sends an interrupt to notify the CPU. In this mode the CPU cannot read or write the IRAM and DRAM of the matrix processor. When the RAMs of the matrix processor are connected to the bus through a bus slave interface, the processor is in debug mode, and the CPU can read and write the IRAM and DRAM of the matrix processor.
Floating-point matrix operations on matrices of up to 32 × 32 are supported.
The IRAM size is 32 Kbits (1024 × 32 bits). The DRAM size is 192 Kbits (4 × 32 × 32 × 48 bits; it can hold at most four 32 × 32 matrices). Matrix data is written into the data RAM row by row at consecutive addresses, beginning at the matrix start address.
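The sizes and the row-by-row layout can be checked with a short calculation. The sketch below assumes that the DRAM is word-addressed with one 48-bit element per address, which the text does not state explicitly; under that assumption the DRAM holds 4096 words, which matches a 12-bit start-address field. The function names are hypothetical.

```python
MAX_DIM = 32          # largest supported matrix is 32 x 32
WORD_BITS = 48        # one internal floating-point number
NUM_MATRICES = 4      # the DRAM holds at most four such matrices
IRAM_WORDS = 1024
INSTR_BITS = 32


def dram_bits():
    """Total DRAM capacity in bits."""
    return NUM_MATRICES * MAX_DIM * MAX_DIM * WORD_BITS


def iram_bits():
    """Total IRAM capacity in bits."""
    return IRAM_WORDS * INSTR_BITS


def element_address(start, n_cols, row, col):
    """Word address of element (row, col) of a matrix stored row by row
    at consecutive addresses from `start` (assumed word addressing)."""
    return start + row * n_cols + col
```

With these definitions, `dram_bits()` is 192 × 1024 bits and `iram_bits()` is 32 × 1024 bits, in line with the stated 192 Kbits and 32 Kbits.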
As shown in Fig. 4, the floating-point unit consists of four parallel floating-point operation modules (FPU1) with identical functions, connected in series to one further floating-point operation module (FPU2). FPU1 performs absolute value, comparison of two floating-point numbers, comparison of a floating-point number with 0, copy, negation, addition, subtraction, multiplication, division, square root, negated product, multiply-add, multiply-subtract and no-operation. FPU1 has two kinds of input: operation valid signals from the decoding unit, and operands from the general-purpose register set and the specified register. The operation valid signals from the decoding unit cover absolute value, comparison of two floating-point numbers, comparison of a floating-point number with 0, copy, negation, addition, subtraction, multiplication, division, square root, negated product, multiply-add, multiply-subtract and no-operation, plus a direct-output valid signal. The operands come from the general-purpose register set and the specified register: normally, the three read ports of the general-purpose register set (read port 1, read port 2, read port 3) correspond to the three operand input ports of FPU1 (operand input port 1, operand input port 2, operand input port 3); when the specified-register input valid signal is asserted, operand input port 2 instead takes the output data of the specified register. FPU2 performs 4-input floating-point addition. The output control module receives the outputs of the four FPU1 modules and the output of FPU2 and selects the appropriate result as the output.
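One way to picture this arrangement is a 4-element dot product: each of the four FPU1 modules multiplies one pair of operands, and FPU2 adds the four products. The sketch below models only that dataflow. It uses ordinary Python floats rather than the 48-bit internal format, and the pairwise order of addition inside `fpu2_add4` is an assumption, since the text does not describe how FPU2 is built internally.

```python
def fpu1_mul(a, b):
    """One FPU1 lane performing a multiplication."""
    return a * b


def fpu2_add4(p0, p1, p2, p3):
    """FPU2: 4-input floating-point addition (summation order assumed)."""
    return (p0 + p1) + (p2 + p3)


def dot4(a, b):
    """Dot product of two 4-element vectors using 4 FPU1 lanes and FPU2."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("dot4 expects exactly four elements per operand")
    products = [fpu1_mul(x, y) for x, y in zip(a, b)]  # four lanes in parallel
    return fpu2_add4(*products)
```

For example, `dot4([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0])` computes 5 + 12 + 21 + 32.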
Fig. 5 shows the structure of the general-purpose register set. The set contains 64 registers of 48 bits each. It can be used as a whole or divided into four groups of 16 registers; each group is a register file with one write port and three read ports and corresponds to one FPU1. A general-purpose register address is 6 bits wide and has two parts: the upper 4 bits are the address within a group, and the lower 2 bits are the group number. When the registers are used as a whole, addresses run from 0 to 63; when the set is used as four groups, the in-group address runs from 0 to 15 and the group number from 0 to 3. Fig. 5(a) shows the structure of one group of the general-purpose register set; it is made up of 3 register banks with identical structure and function. The inputs of a group are the write information (write address, write valid signal and write data) and the read information (read addresses); the outputs are the read ports of the 3 register banks (read port 1, read port 2, read port 3).
Fig. 5(b) shows the structure of the whole general-purpose register set; it consists of the 4 groups, 4 multiplexers (MUX1, MUX2, MUX3, MUX4) and one 2-to-4 decoder. MUX1 selects, according to a control signal, the correct write address from the different write-address inputs. The in-group address is fed to groups 1-4 simultaneously; the group number is fed to the 2-to-4 decoder, whose outputs are the write valid signals (for group numbers 00, 01, 10 and 11, the output is the write valid signal of group 1, 2, 3 and 4 respectively). MUX2 selects, according to the same control signal, the correct write data from the different write-data inputs. MUX3 selects, according to a control signal, the correct read address from the different read-address inputs. MUX4 selects, according to a control signal, the correct output from the output data of the different groups as the output data of the general-purpose register set.
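The address split and the 2-to-4 decoder can be sketched in a few lines. The function names are hypothetical; the split itself (upper 4 bits as in-group address, lower 2 bits as group number) follows the description above.

```python
def split_reg_addr(addr):
    """Split a 6-bit general-purpose register address.

    Returns (in_group_address, group_number): the upper 4 bits select one of
    16 registers inside a group, the lower 2 bits select one of 4 groups.
    """
    if not 0 <= addr < 64:
        raise ValueError("general-purpose register address must be 0-63")
    return addr >> 2, addr & 0b11


def write_enables(group_number):
    """2-to-4 decoder: one-hot write valid signals for the four groups."""
    return tuple(int(g == group_number) for g in range(4))
```

Under this split, consecutive whole-set addresses 0, 1, 2 and 3 fall into four different groups at the same in-group address, so four neighbouring registers can each be served by a different FPU1.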
Based on the above conditions, the format of the matrix processor instructions is determined as shown in Fig. 6. The instruction set is divided into six classes: (1) L/S and move instructions: L/S instructions exchange data between the matrix processor data RAM and the general-purpose register set, and move instructions exchange data between registers of the MPU general-purpose register set and between the general-purpose register set and the specified register; (2) jump instructions change the instruction address and include conditional and unconditional jumps; (3) floating-point operation instructions perform arithmetic on floating-point numbers, including absolute value, comparison, negation, addition, subtraction, multiplication, division, square root and multiply-accumulate; (4) mathematical function instructions compute mathematical functions, including trigonometric and inverse trigonometric functions; (5) SIMD instructions perform single instruction multiple data operations, carrying out several floating-point operations at once; (6) matrix operation instructions perform matrix operations, including generating an all-zeros matrix, an all-ones matrix or an identity matrix, transposition, row extraction, column extraction, matrix copy, the sum of a given row or column, row-wise summation, column-wise summation, addition, subtraction, multiplication and division of a matrix by a real number, matrix addition and subtraction, element-wise matrix product, matrix multiplication and elementary matrix transformations.
The instruction class is indicated by the Type field, which occupies the highest 4 bits of every instruction. The Type values for L/S and move, jump, floating-point operation, mathematical function, SIMD and matrix operation instructions are 0001, 0010, 0011, 0100, 0101 and 0110 respectively.
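A decoder for the Type field might look like the sketch below. It assumes a 32-bit instruction word (inferred from the 1024 × 32 bits IRAM organization) with bit 0 as the least significant bit, so that the highest 4 bits are bits 28-31; the function name and the class labels are illustrative only.

```python
TYPE_NAMES = {
    0b0001: "L/S and move",
    0b0010: "jump",
    0b0011: "floating-point operation",
    0b0100: "mathematical function",
    0b0101: "SIMD",
    0b0110: "matrix operation",
}


def instr_type(word):
    """Return the instruction class encoded in bits 28-31 of a 32-bit word."""
    type_field = (word >> 28) & 0xF
    return TYPE_NAMES.get(type_field, "unknown")
```

For example, a word whose top four bits are 0110 is classified as a matrix operation instruction, and any Type value outside the six listed codes is reported as unknown.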
(1) L/S and move.
Bits 0-3 of the instruction are the OP_code, as shown in Table 1.
Table 1 L/S and move instructions [table image not reproduced]
LM and SM use format (a) of the L/S and move instructions.
LM (Load MPU): moves floating-point numbers from a run of consecutive addresses in the DRAM into consecutive registers. Cond (bits 24-27 of the instruction) encodes the number of values moved, $2^{cond-1}$; DRAM_start (bits 12-23) is the start address in the DRAM; Reg_start (bits 4-11) is the start register address.
SM (Store MPU): writes a run of data from the general-purpose register set to consecutive addresses in the DRAM. The fields have the same meaning as in LM.
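The field layout of format (a) and the block transfer it describes can be sketched as follows. The DRAM and register set are modeled as a plain list and dictionary, the instruction is assumed to be a 32-bit word with bit 0 as the least significant bit, and the OP_code value is left uninterpreted because Table 1 is not reproduced here. The function names are hypothetical.

```python
def decode_lm(word):
    """Extract the format (a) fields of an LM or SM instruction word."""
    return {
        "op_code": word & 0xF,              # bits 0-3
        "reg_start": (word >> 4) & 0xFF,    # bits 4-11
        "dram_start": (word >> 12) & 0xFFF, # bits 12-23
        "cond": (word >> 24) & 0xF,         # bits 24-27
    }


def transfer_count(cond):
    """Number of values moved: 2 ** (cond - 1); cond = 0 is not covered."""
    if cond < 1:
        raise ValueError("cond must be at least 1 in this sketch")
    return 2 ** (cond - 1)


def execute_lm(word, dram, regs):
    """Copy consecutive DRAM words into consecutive registers (LM)."""
    f = decode_lm(word)
    for i in range(transfer_count(f["cond"])):
        regs[f["reg_start"] + i] = dram[f["dram_start"] + i]
```

With Cond = 3, DRAM_start = 16 and Reg_start = 5, for instance, four values are copied from DRAM addresses 16-19 into registers 5-8. An SM sketch would be the same loop with the assignment reversed.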
The LMR and SMR instructions use format (b) of the L/S and move instructions.
LMR (Load MPU by Register): writes one floating-point number from DRAM into register Reg_d (bits 4-9 of the instruction), using the offset held in register Reg_s (bits 10-15). DRAM_start (bits 16-27) is the DRAM base address, Reg_s is the address of the register that holds the DRAM offset, and Reg_d is the address of the register written, that is: *Reg_d = *(DRAM_start + *Reg_s).
SMR (Store MPU by Register): writes the floating-point number in register Reg_d to DRAM, using the offset held in register Reg_s. The instruction fields have the same meaning as in LMR.
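The register-indirect addressing of LMR and SMR can be sketched in Python as follows. The sketch models DRAM and the register group as lists and converts the offset register to an integer before indexing; how the hardware represents the offset is not stated in the text, so that conversion is an assumption, and the function names are hypothetical.

```python
def lmr(dram: list, regs: list, dram_start: int, reg_s: int, reg_d: int) -> None:
    """*Reg_d = *(DRAM_start + *Reg_s)"""
    offset = int(regs[reg_s])  # assumption: the offset is stored as a whole number
    regs[reg_d] = dram[dram_start + offset]


def smr(dram: list, regs: list, dram_start: int, reg_s: int, reg_d: int) -> None:
    """*(DRAM_start + *Reg_s) = *Reg_d"""
    offset = int(regs[reg_s])
    dram[dram_start + offset] = regs[reg_d]


dram = [0.0] * 16
dram[5] = 2.5
regs = [0.0] * 8
regs[1] = 3                # offset register
lmr(dram, regs, 2, 1, 4)   # loads dram[2 + 3] into regs[4]
print(regs[4])  # 2.5
```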
The MOV instruction uses format (c) of the L/S and move instructions.
MOV: move. Copies the value in source register Reg_s (bits 12-19 of the instruction) to destination register Reg_d (bits 4-11). One contiguous range of register addresses is assigned to the general-purpose register group, and all remaining addresses are special registers. Therefore, when Reg_s and Reg_d are both addresses in the general-purpose register group, data is transferred between two general-purpose registers; when Reg_s or Reg_d is a special register address, data is transferred between a general-purpose register and a special register, or between two special registers.
(2) Jump instructions
Bits 0-3 of the instruction are OP_code, as shown in Table 2; bits 4-15 are IRAM_addr; bits 16-19 are Cond, as shown in Table 3; bits 20-23 (of which only bits 20-21 are used) are FPU_num.
Table 2. Jump instructions
JMP (Jump): jumps to address IRAM_addr in IRAM and fetches the next instruction from there.
CJ (Conditional Jump): when the Cond condition is met, jumps to address IRAM_addr in IRAM and fetches the next instruction from there. The condition is evaluated on the result of the FPU selected by FPU_num.
B (Branch): IRAM_addr holds an offset; the next instruction is fetched from the current instruction address plus that offset.
CB (Conditional Branch): when the Cond condition is met, the next instruction is fetched from the current instruction address plus the offset. FPU_num and Cond have the same meaning as in CJ.
Table 3. Cond field of the jump instructions
Cond Meaning
0001 Zero
0010 Positive
0011 Negative
0100 Equal
0101 Greater than
0110 Less than
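The four jump instructions and the Cond codes of Table 3 can be combined into a small next-address function, sketched here in Python. The text does not specify how an FPU exposes its result, so the `FpuStatus` record (last arithmetic value plus last comparison outcome) is purely hypothetical, as are the fall-through step of one address and the treatment of the branch offset as unsigned.

```python
from dataclasses import dataclass


@dataclass
class FpuStatus:
    value: float = 0.0  # last arithmetic result (hypothetical)
    cmp: int = 0        # last comparison outcome: -1, 0 or 1 (hypothetical)


# Cond codes from Table 3, mapped onto the hypothetical status record.
COND = {
    0b0001: lambda s: s.value == 0,  # zero
    0b0010: lambda s: s.value > 0,   # positive
    0b0011: lambda s: s.value < 0,   # negative
    0b0100: lambda s: s.cmp == 0,    # equal
    0b0101: lambda s: s.cmp > 0,     # greater than
    0b0110: lambda s: s.cmp < 0,     # less than
}


def next_pc(op, pc, iram_addr, cond=0, status=None):
    """Return the address of the next instruction to fetch."""
    taken = op in ("JMP", "B") or COND[cond](status)
    if not taken:
        return pc + 1  # assumption: sequential fetch advances by one address
    # JMP/CJ use IRAM_addr as an absolute address; B/CB use it as an offset.
    return iram_addr if op in ("JMP", "CJ") else pc + iram_addr


print(next_pc("CJ", 10, 40, 0b0011, FpuStatus(value=-1.0)))  # 40
```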
(3) Floating-point operation instructions
Bits 0-3 of the instruction are OP_code, as shown in Table 4; bits 4-7 are Fm; bits 8-11 are Fn; bits 12-15 are Fd; bits 16-19 (of which only bits 16-17 are used) are FPU_num.
Table 4. Floating-point operation instructions
[Table 4: image in the original publication]
In the floating-point operation instructions, FPU_num is the number of the FPU that performs the operation. Fd is the destination register; in three-operand operations, Fd is also one of the source operand registers. Fm and Fn are source registers; single-operand operations use the floating-point number in Fm.
FABS (Floating-point Absolute Value): absolute value of a floating-point number.
FCMP (Floating-point Compare): comparison of two floating-point numbers.
FCMPZ (Floating-point Compare with Zero): comparison of a floating-point number with 0.
FCPY (Floating-point Copy): copy of a floating-point number.
FNEG (Floating-point Negate): negation of a floating-point number.
FADD (Floating-point Addition): floating-point addition.
FSUB (Floating-point Subtract): floating-point subtraction.
FMUL (Floating-point Multiply): floating-point multiplication.
FDIV (Floating-point Divide): floating-point division.
FSQRT (Floating-point Square Root): floating-point square root.
FNMUL (Floating-point Negated Multiply): negated product.
FMAC (Floating-point Multiply and Accumulate): multiply-add (Fd = Fd + Fm × Fn).
FMSB (Floating-point Multiply and Subtract): multiply-subtract (Fd = Fd - Fm × Fn).
NOP: no operation.
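The register semantics of the arithmetic mnemonics above can be sketched as a Python dispatch table. Only FMAC and FMSB have their formulas given in the text; the operand order used here for FSUB (Fm - Fn) and FDIV (Fm / Fn) is an assumption, the comparison instructions are omitted because their result format is not described, and `execute_fp` is a hypothetical name.

```python
import math

# Each entry maps (old Fd, Fm, Fn) to the new value of Fd.
FP_OPS = {
    "FABS": lambda d, m, n: abs(m),
    "FCPY": lambda d, m, n: m,
    "FNEG": lambda d, m, n: -m,
    "FADD": lambda d, m, n: m + n,
    "FSUB": lambda d, m, n: m - n,        # assumed operand order
    "FMUL": lambda d, m, n: m * n,
    "FDIV": lambda d, m, n: m / n,        # assumed operand order
    "FSQRT": lambda d, m, n: math.sqrt(m),
    "FNMUL": lambda d, m, n: -(m * n),
    "FMAC": lambda d, m, n: d + m * n,    # Fd = Fd + Fm x Fn
    "FMSB": lambda d, m, n: d - m * n,    # Fd = Fd - Fm x Fn
}


def execute_fp(op: str, regs: list, fd: int, fm: int, fn: int = 0) -> None:
    """Apply one floating-point instruction to the register list in place."""
    regs[fd] = FP_OPS[op](regs[fd], regs[fm], regs[fn])


regs = [1.0, 2.0, 3.0, 0.0]
execute_fp("FMAC", regs, fd=0, fm=1, fn=2)
print(regs[0])  # 7.0
```

Single-operand operations ignore Fn in this sketch, which matches the statement that they use the value in Fm.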
(4) Mathematical function instructions
Bits 0-3 of the instruction are OP_code, as shown in Table 5; bits 4-11 are Reg_addr_S; bits 12-19 are Reg_addr_D; bits 20-23 (of which only bits 20-21 are used) are FPU_num.
Table 5. Mathematical function instructions
[Table 5: image in the original publication]
In the mathematical function instructions, FPU_num is the number of the FPU that performs the operation; Reg_addr_S is the address of the register holding the argument; Reg_addr_D is the address of the register that receives the function result.
FSINF (Floating-point Sine Function): sine function.
FCOSF (Floating-point Cosine Function): cosine function.
FTANF (Floating-point Tangent Function): tangent function.
FARCSINF (Floating-point Arc-Sine Function): arcsine function.
FARCCOSF (Floating-point Arc-Cosine Function): arccosine function.
FARCTANF (Floating-point Arc-Tangent Function): arctangent function.
(5) SIMD instructions
Bits 0-3 of the instruction are OP_code, as shown in Table 6; bits 4-7 are Fm; bits 8-11 are Fn; bits 12-15 are Fd.
Table 6. SIMD instructions
[Table 6: image in the original publication]
SABV (SIMD Absolute Value): absolute value.
SCMP (SIMD Compare): comparison.
SCMPEZ (SIMD Compare with Zero): comparison with 0.
SCPY (SIMD Copy): copy.
SNEG (SIMD Negate): negation.
SADD (SIMD Addition): floating-point addition.
SSUB (SIMD Subtract): floating-point subtraction.
SMUL (SIMD Multiply): floating-point multiplication.
SDIV (SIMD Divide): floating-point division.
SSQRT (SIMD Square Root): floating-point square root.
SNMUL (SIMD Negated Multiply): negated product.
SMAC (SIMD Multiply and Accumulate): multiply-add (Fd = Fd + Fm × Fn).
SMSB (SIMD Multiply and Subtract): multiply-subtract (Fd = Fd - Fm × Fn).
NOP: no operation.
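A SIMD instruction applies one operation to several groups of operands at once. The Python sketch below shows SMAC under two assumptions that the text does not confirm: that there are four lanes (one per FPU1), and that Fm, Fn, and Fd each name the first of four consecutive registers. The function name and register layout are hypothetical.

```python
LANES = 4  # assumption: one lane per FPU1


def smac(regs: list, fd: int, fm: int, fn: int, lanes: int = LANES) -> None:
    """Per lane i: regs[fd+i] = regs[fd+i] + regs[fm+i] * regs[fn+i]."""
    for i in range(lanes):  # the hardware would run the lanes in parallel
        regs[fd + i] = regs[fd + i] + regs[fm + i] * regs[fn + i]


regs = [1.0, 2.0, 3.0, 4.0,      # Fm group
        10.0, 20.0, 30.0, 40.0,  # Fn group
        0.0, 0.0, 0.0, 0.0]      # Fd group
smac(regs, fd=8, fm=0, fn=4)
print(regs[8:12])  # [10.0, 40.0, 90.0, 160.0]
```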
(6) Matrix operation instructions
Each complete matrix operation instruction consists of 2 or 3 words. For matrix operations that involve a real number, the instruction has 3 words when I=1, the 3rd word being the real operand; when I=0, the instruction has 2 words and the real operand is held in the special register real_num. All other matrix instructions have 2 words. Format (a) of the matrix operation instructions in Fig. 1 is called the 1st word of a matrix operation instruction; format (b) or (c) in Fig. 1 is called the 2nd word (format (c) is used when the instruction is an elementary matrix transformation, i.e. when OP_code1 is 7; all other instructions use format (b)). In (b), A, B, and D are 5 bits each; in (c), A, B, C, and D each occupy 6 bits, of which the low 5 bits are valid data and the 6th bit is 0. The real operand is called the 3rd word of a matrix operation instruction.
In the 1st word of a matrix operation instruction, bits 0-3 are OP_code1, bits 4-15 are DRAM_start, and bits 16-27 are DRAM_result. In the 2nd word, format (b), bits 0-3 are OP_code2; bits 4-7 are the low 4 bits of A; bits 8-11 are the low 4 bits of B; bits 12-15 are the low 4 bits of D; bit 16 is the 5th bit of A; bit 17 is the 5th bit of B; bit 18 is the 5th bit of D; bit 19 is I; and bits 20-31 are MB_start. In the 2nd word, format (c), bits 0-3 are OP_code2; bit 7 is I; bits 8-13 (of which only bits 8-12 are used) are D; bits 14-19 (of which only bits 14-18 are used) are C; bits 20-25 (of which only bits 20-24 are used) are B; and bits 26-31 (of which only bits 26-30 are used) are A.
Type is 0110. DRAM_result is the start address in DRAM at which the matrix operation result is stored; DRAM_start is the start address of matrix A in DRAM; MB_start is the start address of the 2nd matrix of the operation. OP_code1 is the class of matrix operation; combined with OP_code2 in the 2nd word, it identifies the specific matrix operation, as shown in Table 7, where OP_code1 and OP_code2 are given in hexadecimal and codes not listed are undefined instruction encodings.
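Because A, B, and D are split across two bit ranges in format (b) of the 2nd word, a short encode/decode sketch may help. The Python below follows the bit positions given in the text and assumes that bit 0 is the least significant bit; the function names are hypothetical.

```python
def encode_word2_b(op_code2, a, b, d, i, mb_start):
    """Pack the fields of the 2nd word, format (b)."""
    return (
        (op_code2 & 0xF)             # bits 0-3
        | ((a & 0xF) << 4)           # bits 4-7: low 4 bits of A
        | ((b & 0xF) << 8)           # bits 8-11: low 4 bits of B
        | ((d & 0xF) << 12)          # bits 12-15: low 4 bits of D
        | (((a >> 4) & 1) << 16)     # bit 16: 5th bit of A
        | (((b >> 4) & 1) << 17)     # bit 17: 5th bit of B
        | (((d >> 4) & 1) << 18)     # bit 18: 5th bit of D
        | ((i & 1) << 19)            # bit 19: I
        | ((mb_start & 0xFFF) << 20) # bits 20-31: MB_start
    )


def decode_word2_b(word):
    """Unpack the 2nd word, format (b), rejoining the split 5-bit fields."""
    return {
        "op_code2": word & 0xF,
        "A": ((word >> 4) & 0xF) | (((word >> 16) & 1) << 4),
        "B": ((word >> 8) & 0xF) | (((word >> 17) & 1) << 4),
        "D": ((word >> 12) & 0xF) | (((word >> 18) & 1) << 4),
        "I": (word >> 19) & 1,
        "MB_start": (word >> 20) & 0xFFF,
    }


word2 = encode_word2_b(op_code2=2, a=17, b=5, d=31, i=1, mb_start=0x123)
print(decode_word2_b(word2))
```

With 5-bit A, B, and D, the largest matrix dimension expressible in this format is 31.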
Table 7. Matrix operation instructions
[Table 7: image in the original publication]
MGNM (Matrix Generate Null Matrix): generates an all-zeros matrix of size A × B.
MGOM (Matrix Generate Ones Matrix): generates an all-ones matrix of size A × B.
MGIM (Matrix Generate Identity Matrix): generates an identity matrix of size A × A.
MTRN (Matrix Transposition): transposes a matrix; the original matrix is of size A × B.
MRE (Matrix Row Extract): extracts a row; the matrix is of size A × B and row D is extracted.
MCE (Matrix Column Extract): extracts a column; the matrix is of size A × B and column D is extracted.
MCPY (Matrix Copy): copies a matrix of size A × B.
MMRA (Matrix M Row Addition): sums one row; the matrix is of size A × B and the sum of row D is computed.
MNCA (Matrix N Column Addition): sums one column; the matrix is of size A × B and the sum of column D is computed.
MRA (Matrix Row Addition): sums by rows; the matrix is of size A × B and the sum of each row is computed.
MCA (Matrix Column Addition): sums by columns; the matrix is of size A × B and the sum of each column is computed.
MARN (Matrix Add Real Number): adds a real number to a matrix; the matrix is of size A × B and the real number is added to every element. The real number is the 3rd word of the matrix operation instruction or the value in the special register.
MSRN (Matrix Subtract Real Number): subtracts a real number from a matrix; the matrix is of size A × B and the real number is subtracted from every element. The real number is the 3rd word of the matrix operation instruction or the value in the special register.
MMRN (Matrix Multiply Real Number): multiplies a matrix by a real number; the matrix is of size A × B and every element is multiplied by the real number. The real number is the 3rd word of the matrix operation instruction or the value in the special register.
MDRN (Matrix Divide Real Number): divides a matrix by a real number; the matrix is of size A × B and every element is divided by the real number. The real number is the 3rd word of the matrix operation instruction or the value in the special register.
MAM (Matrix Add Matrix): matrix addition; both matrices are of size A × B and corresponding elements are added.
MDM (Matrix Subtract Matrix): matrix subtraction; both matrices are of size A × B and corresponding elements are subtracted.
MDMM (Matrix Dot Multiply Matrix): element-wise matrix multiplication; both matrices are of size A × B and corresponding elements are multiplied.
MDD (Matrix Dot Divide): element-wise matrix division; both matrices are of size A × B and corresponding elements are divided.
MMM (Matrix Multiply Matrix): matrix multiplication; matrix 1 is of size A × B, matrix 2 is of size B × D, and the result is of size A × D.
METRS (Matrix Elementary Transformation, Row Switching): swaps two rows; the matrix is of size A × B, the data of row C is written to row D, and the data of row D is written to row C.
METRM (Matrix Elementary Transformation, Row Multiplication): row multiply; the matrix is of size A × B, each element of row D is multiplied by the real number and added to the element in the same column of row C, and the result is written to row C.
METRA (Matrix Elementary Transformation, Row Addition): row add; the matrix is of size A × B, each element of row D is added to the element in the same column of row C, and the result is written to row C.
METCS (Matrix Elementary Transformation, Column Switching): swaps two columns; the matrix is of size A × B, the data of column C is written to column D, and the data of column D is written to column C.
METCM (Matrix Elementary Transformation, Column Multiplication): column multiply; the matrix is of size A × B, each element of column D is multiplied by the real number and added to the element in the same row of column C, and the result is written to column C.
METCA (Matrix Elementary Transformation, Column Addition): column add; the matrix is of size A × B, each element of column D is added to the element in the same row of column C, and the result is written to column C.
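The three elementary row transformations can be sketched in Python on a list-of-rows matrix. The sketch reads METRM as "row C = row C + real × row D", and uses 0-based row indices; whether the hardware numbers rows from 0 or 1 is not stated, so the indexing is an assumption and the function names are hypothetical. The column variants would apply the same operations to columns.

```python
def metrs(matrix, c, d):
    """Row switching: swap rows C and D in place."""
    matrix[c], matrix[d] = matrix[d], matrix[c]


def metrm(matrix, c, d, real):
    """Row multiplication: row C = row C + real * row D."""
    matrix[c] = [x + real * y for x, y in zip(matrix[c], matrix[d])]


def metra(matrix, c, d):
    """Row addition: row C = row C + row D."""
    matrix[c] = [x + y for x, y in zip(matrix[c], matrix[d])]


m = [[1.0, 2.0],
     [3.0, 4.0]]
metrm(m, c=0, d=1, real=2.0)
print(m)  # [[7.0, 10.0], [3.0, 4.0]]
```

Sequences of these three operations are the building blocks of Gaussian elimination, which is presumably why they are grouped as one instruction class.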
Fig. 7 shows the structure of the matrix processor core. It consists of a fetch unit, decoding unit 1, decoding unit 2, a data read/write unit, a general-purpose register group, a floating-point unit, and a control unit. The interfaces between the MPU core and the outside are the interface between the fetch unit and IRAM, the interface between the data read/write unit and DRAM, and the interface between the control unit and the CPU.
On receiving the instruction fetch enable signal from the control unit, the fetch unit begins reading instructions from IRAM in a loop, sends them to decoding unit 1, and carries out jump instructions. Decoding unit 1 receives the instructions from the fetch unit and decodes them by class: matrix operation and mathematical function instructions are converted into SIMD or floating-point operation instructions and written to decoding unit 2, while L/S and move instructions are sent to the data read/write unit. The data read/write unit receives the data addresses and enable signals from decoding unit 1 and performs the transfers: reading data from DRAM into the general-purpose register group, writing data from the general-purpose register group to DRAM, moving data between registers, and writing general-purpose register values to the special registers. Decoding unit 2 receives the SIMD and floating-point operation instructions from decoding unit 1, decodes them into floating-point operation instructions, and passes them to the floating-point unit. The floating-point unit comprises four FPU1 modules and one FPU2 module; the destination and source register addresses of each FPU1 are controlled by decoding unit 2. FPU1 performs extended single-precision floating-point operations, and FPU2 performs four-input extended single-precision addition. The control unit controls the operation of the matrix processor and sends an interrupt signal to the external CPU when a computation completes or an exception occurs.
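One way four FPU1 modules and a four-input FPU2 adder could cooperate is a dot product computed four terms at a time, which is the inner loop of a matrix multiplication. The Python sketch below is only an illustration of that idea: the text does not describe how matrix instructions are actually decomposed, ordinary Python floats stand in for extended single precision, and all function names are hypothetical.

```python
def fpu1_mul(a, b):
    """One FPU1 lane performing a multiplication."""
    return a * b


def fpu2_add4(p0, p1, p2, p3):
    """FPU2: four-input addition."""
    return p0 + p1 + p2 + p3


def dot(x, y):
    """Dot product in chunks of 4: four FPU1 multiplies, then one FPU2 add."""
    acc = 0.0
    for i in range(0, len(x), 4):
        xs = list(x[i:i + 4])
        ys = list(y[i:i + 4])
        pad = 4 - len(xs)        # pad a short final chunk with zeros
        xs += [0.0] * pad
        ys += [0.0] * pad
        products = [fpu1_mul(a, b) for a, b in zip(xs, ys)]
        acc += fpu2_add4(*products)  # accumulation across chunks is an extra add in this sketch
    return acc


def matmul(a, b):
    """(A x B) times (B x D) gives (A x D), one dot product per result element."""
    columns = list(zip(*b))
    return [[dot(row, col) for col in columns] for row in a]


print(matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
# [[19.0, 22.0], [43.0, 50.0]]
```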
Figure 8 shows the structure of an embedded system built around the matrix processor, comprising an embedded CPU, a bus (BUS), SDRAM (memory), the matrix processor, a register group, and an interrupt generator.
The CPU controls the matrix processor by writing control parameters to the register group. The interrupt generator sits between the CPU and the register group; interrupt requests from the matrix processor reach the CPU through the register group and the interrupt generator, which provides communication and interaction with the CPU.
In this example, the Advanced High-performance Bus (AHB) defined by the Advanced Microcontroller Bus Architecture (AMBA) 2.0 specification is used as the bus standard for BUS. The CPU, obtained by trimming down LEON2, mainly comprises a processor core (principally an integer unit), separate instruction and data caches, an interrupt controller, a debug support unit (DSU), timers, a universal asynchronous receiver/transmitter (UART), and a memory controller.
The normal working mode of the matrix processor embedded system proceeds as follows:
1) After the system powers on, the CPU resets, reads the boot instructions from Flash, and completes processor initialization. It writes the configuration parameters to the register group to configure the matrix processor. Finally, it loads the operating system from SDRAM and prepares to run application programs.
2) The CPU writes to the register group the start and end addresses, in SDRAM, of the matrix processor instruction sequence and of the data required for the computation, sends the start signal to the register group, and then releases the bus. On receiving the start signal, the matrix processor requests the bus.
3) Once the matrix processor holds the bus, it reads the instruction sequence and data from SDRAM according to the start and end addresses the CPU wrote to the register group, writes them to IRAM and DRAM respectively, and then releases the bus. The CPU takes the bus again and can carry out other tasks unrelated to the matrix processor.
4) After the instruction sequence and data have been written, the matrix processor core starts working. The fetch unit reads instructions from IRAM in order and sends them to decoding unit 1; if it receives a jump instruction from decoding unit 1, it performs the corresponding jump. Decoding unit 1 handles the instructions it receives as follows: L/S and move instructions are sent to the data read/write unit; jump instructions are sent to the fetch unit; matrix and mathematical function instructions are decoded into SIMD or floating-point operation instructions and output to decoding unit 2; SIMD and floating-point operation instructions are passed straight through to decoding unit 2. On a Load instruction, the data read/write unit reads data from DRAM and writes it to the general-purpose register group. Decoding unit 2 decodes the instructions from decoding unit 1 and sends them to the floating-point unit. The floating-point unit reads the corresponding floating-point data from the general-purpose registers according to the instruction from decoding unit 2, performs the operation, and writes the result back to the general-purpose register group. On a Store instruction, the data read/write unit writes data from the MPU general-purpose register group back to DRAM.
5) When all instructions in IRAM have completed, the control unit of the matrix processor writes the corresponding parameters to the register group, and the interrupt generator raises the corresponding interrupt. Based on the interrupt received and the overall progress of the task, the CPU decides whether the matrix processor should continue with further computation; if all computation is complete, it directs the matrix processor to write the results in DRAM to the designated address region of SDRAM.
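The hand-off of the bus between the CPU and the matrix processor in steps 2) to 5) can be traced with a toy Python model. Everything here is a simplification for illustration: SDRAM is a flat list, the end addresses are treated as exclusive, instructions are plain strings interpreted by a caller-supplied function, and the class and method names are hypothetical.

```python
class EmbeddedSystem:
    def __init__(self, sdram):
        self.sdram = sdram
        self.register_group = {}
        self.bus_owner = "CPU"
        self.iram, self.dram = [], []
        self.interrupt = False

    def cpu_start(self, instr_range, data_range):
        # Step 2: CPU writes the address ranges and start signal, then releases the bus.
        self.register_group.update(instr=instr_range, data=data_range, start=True)
        self.bus_owner = None

    def mpu_load(self):
        # Step 3: the matrix processor takes the bus and fills IRAM and DRAM.
        self.bus_owner = "MPU"
        lo, hi = self.register_group["instr"]
        self.iram = self.sdram[lo:hi]
        lo, hi = self.register_group["data"]
        self.dram = self.sdram[lo:hi]
        self.bus_owner = "CPU"  # bus released; the CPU may do unrelated work

    def mpu_run(self, execute):
        # Steps 4-5: run every instruction, then flag completion via an interrupt.
        for instruction in self.iram:
            execute(instruction, self.dram)
        self.register_group["done"] = True
        self.interrupt = True

    def write_back(self, sdram_addr):
        # Step 5: results in DRAM are copied to the designated SDRAM region.
        self.sdram[sdram_addr:sdram_addr + len(self.dram)] = self.dram


def execute(instruction, dram):
    # Stand-in for the whole decode/compute pipeline (hypothetical instruction).
    if instruction == "double":
        dram[:] = [2.0 * v for v in dram]


system = EmbeddedSystem(["double", "double", 1.0, 2.0, 0.0, 0.0])
system.cpu_start((0, 2), (2, 4))
system.mpu_load()
system.mpu_run(execute)
system.write_back(4)
print(system.sdram[4:6], system.interrupt)  # [4.0, 8.0] True
```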
The debug mode of the matrix processor embedded system proceeds as follows:
1) After the system powers on, the CPU resets, reads the boot instructions from Flash, and completes processor initialization. It writes the configuration parameters to the register group to configure the matrix processor. Finally, it loads the operating system from SDRAM and prepares to run application programs.
2) The CPU writes the matrix processor instructions and the data required for the computation into IRAM and DRAM.
3) The CPU can issue two kinds of run signal: single-step and continuous. On receiving the single-step signal, the matrix processor executes one complete instruction each time; on receiving the continuous-run signal, it performs the work of step 4) of the normal working mode. While the matrix processor is running, the CPU can read and write IRAM and DRAM at any time to check whether intermediate results are correct.

Claims (6)

1. the method for work of embedded system, is characterized in that, described embedded system comprises CPU, bus, SDRAM, matrix processor, register group and interrupts generator; The external data interface of CPU, SDRAM, matrix processor is connected to bus; CPU is connected respectively register group and is interrupted generator, described register group disconnecting generator and matrix processor by two data lines;
Described matrix processor comprises external data interface, IRAM, DRAM, matrix processor core; IRAM, DRAM and the external memory storage of described external data interface connection matrix processor, complete writing and carrying out exchanges data with outside of matrix processor instruction; Described IRAM and DRAM, be equivalent to the buffer memory of matrix processor; IRAM receives the instruction sequence that CPU writes; The result of calculation that the matrix that DRAM reception CPU writes or other data, receiving matrix processor core write, reads for matrix processor or CPU, completes the exchanges data of matrix processor and CPU; Described matrix processor core, writes back and controls for fetching, decoding, computing, result;
Described matrix processor core comprises fetching unit, the first decoding unit, the second decoding unit, the unit that reads and writes data, general purpose register set, Float Point Unit and control module; IRAM, fetching unit, the first decoding unit, the second decoding unit, Float Point Unit connect successively; Float Point Unit, general purpose register set, the unit that reads and writes data, DRAM connect successively; Unit reads and writes data described in the first decoding unit connects;
The method of work of described embedded system comprises the following steps:
1) After the system powers on, the CPU resets, reads the boot instructions from Flash, and completes processor initialization; it writes the configuration parameters to the register group to configure the matrix processor; finally, it loads the operating system from SDRAM and prepares to run application programs;
2) The CPU writes to the register group the start and end addresses, in SDRAM, of the matrix processor instruction sequence and of the data required for the computation, sends the start signal to the register group, and then releases the bus; on receiving the start signal, the matrix processor requests the bus;
3) Once the matrix processor holds the bus, it reads the instruction sequence and data from SDRAM according to the start and end addresses the CPU wrote to the register group, writes them to IRAM and DRAM respectively, and then releases the bus;
4) After the instruction sequence and data have been written, the matrix processor core starts working; the fetch unit reads instructions from IRAM in order and sends them to the first decoding unit; if it receives a jump instruction from the first decoding unit, it performs the corresponding jump; the first decoding unit handles the instructions it receives as follows: L/S and move instructions are sent to the data read/write unit; jump instructions are sent to the fetch unit; matrix and mathematical function instructions are decoded into SIMD or floating-point operation instructions and output to the second decoding unit; SIMD and floating-point operation instructions are passed straight through to the second decoding unit; on a Load instruction, the data read/write unit reads data from DRAM and writes it to the general-purpose register group; the second decoding unit decodes the instructions from the first decoding unit and sends them to the floating-point unit; the floating-point unit reads the corresponding floating-point data from the general-purpose registers according to the instruction from the second decoding unit, performs the operation, and writes the result back to the general-purpose register group; on a Store instruction, the data read/write unit writes data from the general-purpose register group of the matrix processor core back to DRAM;
5) When all instructions in IRAM have completed, the control unit of the matrix processor writes the corresponding parameters to the register group, and the interrupt generator raises the corresponding interrupt; based on the interrupt received and the overall progress of the task, the CPU decides whether the matrix processor should continue with further computation; if all computation is complete, it directs the matrix processor to write the results in DRAM to the designated address region of SDRAM.
2. the method for work of embedded system according to claim 1, it is characterized in that, described external data interface, IRAM, DRAM, a register group of the common connection of matrix processor core, described register group is deposited system information and the interactive information of external data interface, IRAM, DRAM, matrix processor core.
3. the method for work of embedded system according to claim 2, it is characterized in that, described external data interface, IRAM, DRAM, an interruption generator of the common connection of matrix processor core, the interrupt request of external data interface, IRAM, DRAM, matrix processor core is exported to CPU by register group and interruption generator.
4. the method for work of embedded system according to claim 1, it is characterized in that, the instruction set that described matrix processor is used comprises: L/S and move, jump instruction, floating-point operation instruction, mathematical function instruction, single instruction multiple data instruction, matrix operation instruction;
Described L/S and move, the data that complete between matrix processor buffer memory and register, register read and write;
Described jump instruction, completes the change of instruction execution sequence;
Described floating-point operation instruction, completes basic floating point arithmetic, comprise ask absolute value, comparison, add, subtract, multiplication and division, evolution, multiply-add operation;
Described mathematical function instruction, completes the computing of elementary mathematics function, comprises trigonometric function, inverse trigonometric function, logarithmic function, exponential function;
Described single instruction multiple data instruction, completes the concurrent operation of different floating numbers, and the computing the completing computing that instruction comprises with floating-point operation is identical;
Described matrix operation instruction, complete some of matrix basic and simple calculations, comprise that the ranks of matrix generation, matrix transpose, matrix extract, by matrix ranks sue for peace, matrix and the addition subtraction multiplication and division of real number be, the plus-minus of matrix is taken advantage of, matrix elementary transformation.
5. the method for work of embedded system according to claim 4, is characterized in that, the instruction fetch enable signal that fetching unit reception control unit sends starts the reading command that circulates from IRAM, sends instruction to the first decoding unit, and completes jump instruction; The first decoding unit receives the instruction sending from fetching unit, according to the classification of instruction, carry out decoding, by matrix operation with mathematical function operational order converts SIMD to or floating-point operation instruction writes the second decoding unit, L/S and move are sent into the unit that reads and writes data; The unit that reads and writes data receives data address and the enable signal that the first decoding unit sends, and completes reading out data from DRAM and writes general purpose register set, and the data in general purpose register set are write to DRAM, and the data that complete between the register of general purpose register set shift; The second decoding unit receives SIMD and the floating-point operation instruction that the first decoding unit sends, and by instruction decoding, is floating-point operation instruction, gives Float Point Unit; Float Point Unit comprises four the first floating-point operation modules in parallel and a second floating-point operation module, and these four the first floating-point operation modules in parallel are connected in series to the second floating-point operation module; The object of each the first floating-point operation module and source-register address are controlled by the second decoding unit; The first floating-point operation module completes the computing of expansion single-precision floating point, and the second floating-point operation module completes 4 input additions of expansion single precision; The operation of control module gating matrix processor, sends look-at-me to CPU when calculating completes or occurs when abnormal.
6. by the method for work of embedded system claimed in claim 5, it is characterized in that, described matrix processor also comprises a specified register, when described specified register carries out at matrix processor the operational order that has real number participation, preserves this real number.
CN201110303919.4A 2011-10-10 2011-10-10 Matrix processor as well as instruction set and embedded system thereof Expired - Fee Related CN102360344B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110303919.4A CN102360344B (en) 2011-10-10 2011-10-10 Matrix processor as well as instruction set and embedded system thereof

Publications (2)

Publication Number Publication Date
CN102360344A CN102360344A (en) 2012-02-22
CN102360344B true CN102360344B (en) 2014-03-12

Family

ID=45585673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110303919.4A Expired - Fee Related CN102360344B (en) 2011-10-10 2011-10-10 Matrix processor as well as instruction set and embedded system thereof

Country Status (1)

Country Link
CN (1) CN102360344B (en)

Families Citing this family (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102013213414A1 (en) * 2013-07-09 2015-01-15 Robert Bosch Gmbh Method and apparatus for performing a model calculation of a data-based function model
CN104394302A (en) * 2014-11-28 2015-03-04 深圳职业技术学院 Real-time video defogging system based on FPGA
CN108595371B (en) * 2016-01-20 2019-11-19 北京中科寒武纪科技有限公司 For the reading data of vector operation, write-in and read-write scheduler and reservation station
US10762164B2 (en) 2016-01-20 2020-09-01 Cambricon Technologies Corporation Limited Vector and matrix computing device
CN106991077A (en) * 2016-01-20 2017-07-28 南京艾溪信息科技有限公司 A kind of matrix computations device
CN108491359B (en) * 2016-04-22 2019-12-24 北京中科寒武纪科技有限公司 Submatrix operation device and method
CN107315574B (en) * 2016-04-26 2021-01-01 安徽寒武纪信息科技有限公司 Apparatus and method for performing matrix multiplication operation
CN111857820B (en) * 2016-04-26 2024-05-07 中科寒武纪科技股份有限公司 Apparatus and method for performing matrix add/subtract operation
CN107678781B (en) * 2016-08-01 2021-02-26 北京百度网讯科技有限公司 Processor and method for executing instructions on processor
US20180217838A1 (en) * 2017-02-01 2018-08-02 Futurewei Technologies, Inc. Ultra lean vector processor
US11263008B2 (en) 2017-03-20 2022-03-01 Intel Corporation Systems, methods, and apparatuses for tile broadcast
CN107391447A (en) * 2017-07-26 2017-11-24 成都网烁信息科技有限公司 A kind of computer acceleration system and method
CN107895191B (en) 2017-10-30 2022-02-22 上海寒武纪信息科技有限公司 Information processing method and related product
CN108388446A (en) * 2018-02-05 2018-08-10 上海寒武纪信息科技有限公司 Computing module and method
CN108777155A (en) * 2018-08-02 2018-11-09 北京知存科技有限公司 Flash chip
US11996105B2 (en) 2018-09-13 2024-05-28 Shanghai Cambricon Information Technology Co., Ltd. Information processing method and terminal device
CN110046105B (en) * 2019-04-26 2021-10-22 中国科学院微电子研究所 3D NAND Flash
CN110990060B (en) * 2019-12-06 2022-03-22 北京瀚诺半导体科技有限公司 Embedded processor, instruction set and data processing method of storage and computation integrated chip
CN111242293B (en) * 2020-01-13 2023-07-18 腾讯科技(深圳)有限公司 Processing component, data processing method and electronic equipment
CN112527240B (en) * 2020-12-22 2023-11-14 中国电子科技集团公司第四十七研究所 Floating point operation device matched with 80C186CPU
CN115995249B (en) * 2023-03-24 2023-07-21 南京大学 Matrix transposition operation device based on DRAM
CN116679988B (en) * 2023-08-02 2023-10-27 武汉芯必达微电子有限公司 Hardware acceleration unit, hardware acceleration method, chip and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1342934A (en) * 2000-08-23 2002-04-03 任天堂株式会社 Method and device for pre-storage data in voiceband storage
CN1864131A (en) * 2003-10-01 2006-11-15 先进微装置公司 System and method for handling exceptional instructions in a trace cache based processor
CN101526895A (en) * 2009-01-22 2009-09-09 杭州中天微系统有限公司 High-performance low-power-consumption embedded processor based on command dual-transmission
CN201716564U (en) * 2010-06-25 2011-01-19 中国科学院沈阳自动化研究所 Processor architecture special for high-performance programmable logic controller (PLC)
CN102073543A (en) * 2011-01-14 2011-05-25 上海交通大学 General processor and graphics processor fusion system and method
CN102184092A (en) * 2011-05-04 2011-09-14 西安电子科技大学 Special instruction set processor based on pipeline structure

Also Published As

Publication number Publication date
CN102360344A (en) 2012-02-22

Similar Documents

Publication Publication Date Title
CN102360344B (en) Matrix processor as well as instruction set and embedded system thereof
EP3602278B1 (en) Systems, methods, and apparatuses for tile matrix multiplication and accumulation
US7567996B2 (en) Vector SIMD processor
CN103793203B (en) Reducing power consumption in a fused multiply-add (FMA) unit responsive to input data values
CN100495326C (en) Array multiplication with reduced bandwidth requirement
CN113050990A (en) Apparatus, method and system for instructions for a matrix manipulation accelerator
CN110705703B (en) Sparse neural network processor based on systolic array
JPH10124484A (en) Data processor and data processing system
CN103019647A (en) Floating-point accumulation/gradual decrease operational method with floating-point precision maintaining function
WO2010051298A2 (en) Instruction and logic for performing range detection
WO2010111249A2 (en) System and method for achieving improved accuracy from efficient computer architectures
EP4318275A1 (en) Matrix multiplier and method for controlling matrix multiplier
Stepchenkov et al. Energy efficient speed-independent 64-bit fused multiply-add unit
CN114691217A (en) Apparatus, method, and system for an 8-bit floating-point matrix dot-product instruction
CN111381808A (en) Multiplier, data processing method, chip and electronic equipment
Osinin A modular-logarithmic coprocessor concept
CN101930356B (en) Method for group addressing and read-write controlling of register file for floating-point coprocessor
CN102043609B (en) Floating-point coprocessor and corresponding configuration and control method
CN111752605A (en) fuzzy-J bit position using floating-point multiply-accumulate results
Wirawan et al. Parallel DNA sequence alignment on the cell broadband engine
EP3819788A1 (en) Data processing system and data processing method
CN112074810A (en) Parallel processing apparatus
CN112506468B (en) RISC-V general processor supporting high throughput multi-precision multiplication operation
US5206826A (en) Floating-point division cell
JPS6165336A (en) High-speed arithmetic system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2014-03-12

Termination date: 2016-10-10