CN102750133B - 32-bit triple-issue digital signal processor supporting SIMD - Google Patents

Info

Publication number
CN102750133B
Authority
CN
China
Prior art keywords
unit
data
instruction
address
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210205812.0A
Other languages
Chinese (zh)
Other versions
CN102750133A (en)
Inventor
屈凌翔
张庆文
黄嵩人
杨晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 58 Research Institute
Original Assignee
CETC 58 Research Institute
Application filed by CETC 58 Research Institute
Priority to CN201210205812.0A
Publication of CN102750133A
Application granted
Publication of CN102750133B

Landscapes

  • Advance Control (AREA)

Abstract

The invention discloses a 32-bit triple-issue digital signal processor (DSP) supporting SIMD (Single Instruction, Multiple Data). It comprises three pipelines that issue in parallel: a data access pipeline, an integer arithmetic pipeline and a vector arithmetic pipeline. Each pipeline has its own decoding and execution units and supports SIMD operation. The processor is mainly composed of a program memory interface unit, a data memory interface unit, an instruction fetch unit, a pipeline control unit, a system bus, a data access pipeline unit, an integer arithmetic pipeline unit, a vector arithmetic pipeline unit, data registers, address registers, vector registers, a coprocessor interface unit and a floating-point arithmetic unit, all connected by circuitry. Parallel execution of the three pipelines improves the parallel processing capability of the DSP. In addition, the processor executes four groups of 16-bit multiply-add operations in parallel in a single cycle, and can execute five groups of data operations and one group of data access operations simultaneously, which enhances the data processing capability of the DSP.

Description

32-bit triple-issue digital signal processor supporting SIMD
Technical field
The present invention relates to digital signal processors, and specifically to a 32-bit triple-issue DSP supporting SIMD.
Background art
A DSP processes digital signals in real time and has data processing capability far beyond that of a general-purpose processor; it plays an important role in fields such as data communication and multimedia processing. With the rapid development of communication and multimedia technology, the demands on DSP data processing capability keep rising. The main ways to raise that capability are increasing the DSP clock frequency, adopting a multi-core architecture to raise the processing capability of the whole circuit, and adopting a multiple-issue architecture to raise the parallel processing capability of the DSP core. As DSP clock frequencies climb, the cost and difficulty of raising them further also climb; and although a multi-core architecture lowers the requirements on each DSP core, it greatly increases the design difficulty of the whole SoC circuit. Therefore, alongside higher clock frequencies and multi-core architectures, multiple-issue architectures are becoming increasingly common. The present invention adopts a triple-issue structure in which 3 pipelines execute in parallel. In particular, to improve multiply-accumulate capability, a separate vector processing pipeline is designed; it executes completely independently of the other two pipelines and mainly supports parallel data operations.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and improve the parallel computing capability and multiply-add capability of the processor by providing a 32-bit triple-issue digital signal processor supporting SIMD, aimed at multimedia processing and data communication.
According to the technical scheme provided by the invention, the 32-bit triple-issue digital signal processor supporting SIMD comprises 3 pipelines that issue in parallel: a data access pipeline, an integer arithmetic pipeline and a vector arithmetic pipeline. Each pipeline has independent decoding and execution units and supports SIMD operation.
The data access pipeline comprises an access decoding unit, a storage control unit, an address arithmetic unit and an address generation unit. The access decoding unit and the storage control unit are connected to the address generation unit and the address arithmetic unit by control signals and control their operation. The address generation unit and the address arithmetic unit are connected to the address registers and data registers by a data bus; they read data from the address registers and data registers, perform computation, and either write the result back to an address register or generate a memory access address that is sent to the data memory interface. The access decoding unit decodes the access-class instructions issued by the instruction fetch unit. The storage control unit controls the operation of the address arithmetic unit and the address generation unit according to the access decoding result. The address arithmetic unit performs calculations on address data from the address registers, data registers or immediates, and stores the result in an address register. The address generation unit generates the memory access address according to the addressing mode and related information. Only the data access pipeline reads and writes external memory; the other two pipelines operate only on registers.
The integer arithmetic pipeline comprises an ALU decoding unit, an arithmetic control unit, a bit processing unit, an ALU and a coprocessor interface unit. The ALU decoding unit and the arithmetic control unit are connected to the ALU and the bit processing unit by control signals and control their operation. The ALU and the bit processing unit are connected to the data registers by a data bus; they read data from the data registers, process it, and write the result back to the data registers. The bit processing unit executes bit operations, including bit insertion, bit extraction, bit replacement and shift operations. The coprocessor interface unit handles data exchange with the coprocessor.
The vector arithmetic pipeline comprises a multiply-add decoding unit, a multiply-add control unit and a vector arithmetic unit. The multiply-add decoding unit and the multiply-add control unit are connected to the vector arithmetic unit by control signals and control its operation. The vector arithmetic unit is connected to the data registers and vector registers by a data bus; it reads data from the data registers or vector registers, processes it, and writes the result back to the data registers or vector registers. The vector arithmetic unit contains an operand extraction unit, four 16*32 multiplication units and two 64-bit accumulators, and mainly executes parallel multiply, multiply-add and multiply-add-subtract operations with SIMD support.
The processor further comprises:
An instruction fetch unit, shared by the 3 pipelines. It controls the program counter to implement sequential program execution and jumps, and predicts data conflicts. It reads program data through the program memory interface into its instruction buffer, examines the instruction codes in the buffer, and, according to the result, sends 32-bit or 16-bit instruction codes to the corresponding pipelines;
A floating-point unit, which acts as a coprocessor and is connected to the integer arithmetic pipeline through the coprocessor interface unit. It has an independent floating-point instruction set, performs floating-point operations through the integer arithmetic pipeline, and returns results to the data registers through the integer arithmetic pipeline;
A register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers, sixteen 32-bit vector registers and special function registers. The address registers serve address arithmetic and memory address generation, the data registers serve integer and vector arithmetic, and the vector registers support the SIMD operations of vector arithmetic. All of them allow different pipelines to read and write data in parallel through different ports;
The program memory interface is connected to the instruction fetch unit by an address bus and a program bus; it receives the fetch address sent by the instruction fetch unit and returns instruction data to the instruction fetch unit over the program bus. The instruction fetch unit is connected by an instruction bus to the access decoding unit, the ALU decoding unit and the multiply-add decoding unit, and can send 3 instructions to these 3 decoding units in parallel. The data registers, vector registers and address registers are connected to the data memory interface unit by a data bus and exchange data with the external data memory through the data memory interface. The pipeline control unit is connected to each execution unit by pipeline control signals; it controls the execution of each execution unit and receives feedback from each of them.
The 32-bit triple-issue digital signal processor supporting SIMD has dedicated multimedia processing instructions and dedicated Viterbi decoding instructions, supports 16-bit instructions and floating-point instructions, and supports both fixed-point and floating-point operations; the 16-bit instructions are a subset of the 32-bit instructions. It has independent data and program caches.
Each pipeline has its own dedicated instructions, and two bits in the instruction opcode are reserved to distinguish the three instruction classes. In the instruction fetch stage, the instruction fetch unit pre-decodes these two bits to determine the instruction class and then sends each instruction to the corresponding pipeline.
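As an illustration only, the two-bit pre-decode can be sketched in Python. The description states only that two opcode bits distinguish the three instruction classes; the position of those bits (`CLASS_SHIFT`) and the mapping of codes to pipelines below are assumptions made for the sketch.

```python
from enum import Enum


class Pipeline(Enum):
    DATA_ACCESS = "data access"
    INTEGER = "integer arithmetic"
    VECTOR = "vector arithmetic"


# Assumed encoding: the source does not give the bit position or the codes.
CLASS_SHIFT = 0
CLASS_MASK = 0b11
CLASS_TO_PIPELINE = {
    0b00: Pipeline.DATA_ACCESS,
    0b01: Pipeline.INTEGER,
    0b10: Pipeline.VECTOR,
}


def predecode(instruction_word: int) -> Pipeline:
    """Return the pipeline selected by the 2-bit class field of an instruction."""
    class_bits = (instruction_word >> CLASS_SHIFT) & CLASS_MASK
    if class_bits not in CLASS_TO_PIPELINE:
        raise ValueError(f"unassigned class field {class_bits:#04b}")
    return CLASS_TO_PIPELINE[class_bits]
```

Two bits give four codes for three classes, so one code is left unassigned in this sketch and is rejected.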
The present invention adopts a hierarchical memory organization. The first level, closest to the execution units, is the internal register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers and sixteen 32-bit vector registers; the second level is the cache; the third level is the data memory. Two 32-bit address registers can form a 64-bit address register pair, two 32-bit vector registers can form a 64-bit vector register pair, and two 32-bit data registers can form a 64-bit data register pair; 4 vector registers or 4 data registers can form a 128-bit data register group to support SIMD operations.
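The register pairing and grouping can be pictured with a short Python sketch. The word order inside a pair or group is not stated in the description; the sketch assumes the first register holds the least significant 32 bits, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def combine(registers):
    """Concatenate 32-bit register values into one wide value.

    Assumption: registers[0] is the least significant word.
    """
    value = 0
    for index, word in enumerate(registers):
        value |= (word & MASK32) << (32 * index)
    return value


def split(value, count):
    """Split a wide value back into `count` 32-bit words, low word first."""
    return [(value >> (32 * i)) & MASK32 for i in range(count)]


# Sixteen 32-bit data registers, four of them loaded with sample values.
data_regs = [0] * 16
data_regs[0:4] = [0x11111111, 0x22222222, 0x33333333, 0x44444444]

pair = combine(data_regs[0:2])    # 64-bit register pair
group = combine(data_regs[0:4])   # 128-bit register group
```

The same helpers apply unchanged to address register pairs and to vector register pairs and groups.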
The present invention adopts a 5-stage pipeline structure: instruction fetch, decode, execute 1, execute 2 and write-back. The decode and execute stages each have 3 independent sets of components.
The 3 pipelines share the instruction fetch unit and run the decode, execute 1, execute 2 and write-back stages independently and in parallel. The data access pipeline handles address arithmetic, memory access and unconditional jumps; the integer arithmetic pipeline handles addition and subtraction, logic operations, comparison, shifts, floating-point operations, bit manipulation and conditional jumps; the vector arithmetic pipeline mainly executes single or multiple parallel multiply, multiply-add, multiply-add-subtract and multiply-subtract-add operations.
The instruction fetch unit has an instruction pre-decode function and uses the pre-decode result to decide whether an instruction is issued to the data access pipeline, the integer arithmetic pipeline or the vector arithmetic pipeline. The instruction fetch unit contains a 16*16-bit instruction buffer; it examines the instructions at the exit of the buffer to determine the width and number of instructions to issue, and can issue at most three 32-bit instructions at a time. When the instruction buffer holds 128 bits of data or less, the instruction fetch unit reads 128 bits of program data through the program memory interface.
The vector arithmetic pipeline has its own instruction decoding and instruction control units, an independent vector arithmetic unit and a dedicated instruction set. The vector arithmetic unit comprises an operand extraction unit, four 16*32 multiplication units and two 64-bit accumulators (ACC). The dedicated instruction set contains SIMD instructions and supports parallel multiply, multiply-add, multiply-subtract and multiply-add-subtract operations, executing in parallel at most 4 groups of 16-bit or 2 groups of 32-bit such operations.
The present invention has vector-processing-class instructions dedicated to the vector processing pipeline. These instructions operate on the 128-bit data register groups Xn and XVn, so that a single instruction can execute 4 groups of 16-bit multiply-add operations in parallel. Xn is a data register group made up of four 32-bit data registers, and XVn is a data register group made up of four 32-bit vector registers.
The present invention is a low-power, high-speed, high-performance digital signal processor for embedded systems, mainly used in embedded applications such as wireless communication, image processing and real-time control. It adopts a superscalar RISC instruction architecture and supports parallel issue of 3 instructions in a single clock cycle, decoding of 16-bit and 32-bit instructions, single-cycle 4-MAC operation, a rich set of DSP addressing modes, SIMD operation and floating-point operation.
The advantages of the present invention are as follows. It is a 32-bit fixed-point/floating-point signal processor supporting single instruction, multiple data (SIMD) and triple issue: it issues different instructions in parallel to the corresponding execution units, runs 3 pipelines in parallel, and also supports SIMD-class instructions. Parallel execution of the 3 pipelines improves the parallel processing capability of the DSP. An independent vector arithmetic unit is added that executes 4 groups of 16-bit multiply-add operations in parallel in a single cycle; together with the integer arithmetic unit and the data access unit, which execute in parallel with the vector pipeline, the invention can carry out 5 groups of data operations and 1 group of data access operations at the same time, which raises the data processing capability of the DSP.
Brief description of the drawings
Fig. 1 is a basic structural block diagram of the DSP of the present invention.
Fig. 2 is a diagram of the multi-level memory hierarchy of the DSP of the present invention.
Fig. 3 is a structural diagram of the instruction fetch unit of the DSP of the present invention.
Fig. 4 is a basic block diagram of the vector arithmetic unit of the DSP of the present invention.
Fig. 5 is a data flow diagram of the multi-port register file of the DSP of the present invention.
Embodiment
The invention is further described below with reference to the drawings and embodiments.
The digital signal processor of the present invention adopts a triple-issue, 5-stage pipeline structure. It has 3 pipelines that execute in parallel, each with 5 stages: instruction fetch, decode, execute 1, execute 2 and write-back. The instruction fetch unit is shared by the 3 pipelines, while the decode and execute stages have 3 independent sets of components.
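The stage sequence can be illustrated with a small Python timing sketch. It assumes an idealized pipeline in which every instruction advances one stage per cycle with no stalls; the description does not discuss stall behavior, and the function names are hypothetical.

```python
STAGES = ("fetch", "decode", "execute 1", "execute 2", "write-back")


def stage_at(fetch_cycle, cycle):
    """Stage occupied at `cycle` by an instruction fetched at `fetch_cycle`.

    Returns None when the instruction has not entered or has already left
    the pipeline. Assumes one stage per cycle and no stalls.
    """
    offset = cycle - fetch_cycle
    if 0 <= offset < len(STAGES):
        return STAGES[offset]
    return None


def occupancy(fetch_cycles, cycle):
    """Map each in-flight instruction index to its stage at `cycle`."""
    result = {}
    for index, fetch_cycle in enumerate(fetch_cycles):
        stage = stage_at(fetch_cycle, cycle)
        if stage is not None:
            result[index] = stage
    return result
```

For example, three instructions fetched together in cycle 0 (one per pipeline) all reach execute 1 in cycle 2, while an instruction fetched in cycle 1 is still in decode.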
The instruction set of the digital signal processor of the present invention contains 16-bit and 32-bit instructions; the 16-bit instructions are a subset of the 32-bit instructions. Besides general DSP instructions, the instruction set includes dedicated multimedia processing instructions, dedicated Viterbi decoding instructions, vector processing instructions and floating-point instructions. The dedicated multimedia instructions include vector averaging, matrix operations and byte extraction, and accelerate multimedia processing. The dedicated Viterbi decoding instructions include bit interleaving, bit separation and Viterbi traceback, and accelerate Viterbi decoding. The vector-processing-class instructions are mainly used for SIMD operation; they operate on 128-bit data and support single-cycle 4-MAC operation. The floating-point instructions support floating-point operations.
As shown in Fig. 1, the present invention comprises 3 pipelines that issue in parallel: a data access pipeline, an integer arithmetic pipeline and a vector arithmetic pipeline. Each pipeline has independent decoding and execution units and supports SIMD operation.
The present invention is mainly formed by connecting the following through circuitry: a program memory interface unit, a data memory interface unit, an instruction fetch unit, a pipeline control unit, a system bus, a data access pipeline unit, an integer arithmetic pipeline unit, a vector arithmetic pipeline unit, data registers, address registers, vector registers, a coprocessor interface unit and a floating-point unit. The data access pipeline unit comprises an access decoding unit, a storage control unit, an address arithmetic unit and an address generation unit; the integer arithmetic pipeline unit comprises an ALU decoding unit, an arithmetic control unit, an ALU and a bit processing unit; the vector arithmetic pipeline unit comprises a multiply-add decoding unit, a multiply-add control unit and a vector arithmetic unit. The floating-point unit is connected to the integer arithmetic pipeline as a coprocessor through the coprocessor interface unit. The program memory interface unit and the data memory interface unit are the interfaces for exchanging data with the outside; the program memory interface contains a configurable program cache and the data memory interface contains a configurable data cache. The pipeline control unit handles pipeline state management and exception handling. The general-purpose register file consists of sixteen 32-bit address registers, sixteen 32-bit data registers and sixteen 32-bit vector registers and forms the first memory level, closest to the processor; the second level is the cache and the third level is the data memory.
The connections between the modules of the present invention are shown in Fig. 1. The program memory interface is connected to the instruction fetch unit by an address bus and a program bus; it receives the fetch address sent by the instruction fetch unit and returns instruction data to the instruction fetch unit over the program bus. The instruction fetch unit is connected by an instruction bus to the access decoding unit, the ALU decoding unit, the multiply-add decoding unit and the associated instruction control units, and can send 3 instructions to these 3 decoding units in parallel. The access decoding unit and the storage control unit are connected to the address generation unit and the address arithmetic unit by control signals and control their operation. The address generation unit and the address arithmetic unit are connected to the address registers and data registers by a data bus; they read data from the address registers and data registers, perform computation, and either write the result back to an address register or generate a memory access address that is sent to the data memory interface. The ALU decoding unit and the arithmetic control unit are connected to the ALU and the bit processing unit by control signals and control their operation. The ALU and the bit processing unit are connected to the data registers by a data bus; they read data from the data registers, process it, and write the result back to the data registers. The multiply-add decoding unit and the multiply-add control unit are connected to the vector arithmetic unit by control signals and control its operation. The vector arithmetic unit is connected to the data registers and vector registers by a data bus; it reads data from the data registers or vector registers, processes it, and writes the result back to the data registers or vector registers. The data registers, vector registers and address registers are connected to the data memory interface unit by a data bus and exchange data with the external data memory through the data memory interface. The pipeline control unit is connected to each execution unit by pipeline control signals; it controls the execution of each execution unit and receives feedback from each of them. The floating-point unit is connected to the integer arithmetic pipeline through the coprocessor interface; it reads instructions and data through the integer arithmetic pipeline, carries out the corresponding operations, and returns the results to the data registers through the integer arithmetic pipeline.
The DSP core reads instructions from external memory through the program memory interface into the instruction buffer of the instruction fetch unit, pre-decodes the instructions at the exit of the buffer to determine their type, and sends them to the corresponding pipelines; at most 3 instructions can be sent in parallel. After receiving an instruction from the fetch unit, the decoding unit and instruction control unit of a pipeline decode it, generate the relevant control signals, determine the type and source of the operands, and send the operands to the execution unit. Under the control signals generated by the decoding unit, the execution units of the integer arithmetic pipeline and the vector arithmetic pipeline compute on the operands, and the results are written to the corresponding registers in the write-back stage. The execution unit of the data access pipeline performs address arithmetic or generates the memory address for a data access; in the write-back stage it either saves the address arithmetic result to the corresponding address register or reads or writes external memory at the generated address.
The program memory interface connects the external program bus to the internal instruction fetch unit. It contains a configurable program cache, which can be enabled or disabled. The data memory interface connects the internal general-purpose registers to the external data bus and provides the interface for data exchange; it contains a configurable data cache.
The instruction fetch unit is shared by the 3 pipelines. It reads instructions through the program memory interface, pre-decodes them, and sends them to the corresponding pipelines.
The data access pipeline comprises an access decoding unit, a storage control unit, an address arithmetic unit and an address generation unit. The access decoding unit decodes the access-class instructions issued by the instruction fetch unit. The storage control unit controls the operation of the address arithmetic unit and the address generation unit according to the access decoding result. The address arithmetic unit performs calculations on address data from the address registers, data registers or immediates, and stores the result in an address register. The address generation unit generates the memory access address according to the addressing mode and related information. The data access pipeline handles address computation and data access: it carries out data exchange between the internal general-purpose registers and external memory, and also handles data exchange between the data registers and the address registers.
The integer arithmetic pipeline comprises an ALU decoding unit, an arithmetic control unit, a bit processing unit, an arithmetic logic unit (ALU) and a coprocessor interface unit. The bit processing unit executes bit operation instructions, including bit insertion, bit extraction, bit replacement and shift operations. The coprocessor interface unit handles data exchange with the coprocessor. The integer arithmetic pipeline receives and executes the instructions sent by the fetch unit and writes the results back to the data registers; it mainly handles arithmetic operations such as addition and subtraction, logic operations such as AND, OR and NOT, and bit operations such as shifts.
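To make the bit operations concrete, the following Python sketch shows generic bit extraction and bit insertion on a 32-bit word. The operand convention (a bit position plus a field width) is an assumption; the description names the operations but does not define their operands or encodings.

```python
WORD_MASK = 0xFFFFFFFF


def field_mask(width):
    """Mask with the low `width` bits set."""
    return (1 << width) - 1


def bit_extract(value, position, width):
    """Return the `width`-bit field of `value` starting at bit `position`."""
    return (value >> position) & field_mask(width)


def bit_insert(value, field, position, width):
    """Return `value` with the `width`-bit field at `position` replaced by `field`."""
    mask = field_mask(width) << position
    return ((value & ~mask) | ((field << position) & mask)) & WORD_MASK
```

Here `bit_extract(0xB0, 4, 4)` yields `0xB`, and `bit_insert(0xFFFFFFFF, 0, 8, 8)` clears bits 8 to 15 to give `0xFFFF00FF`.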
The vector arithmetic pipeline comprises a multiply-add decoding unit, a multiply-add control unit and a vector arithmetic unit. The multiply-add decoding unit decodes the instructions sent by the fetch unit and generates the associated control signals; the multiply-add control unit controls the operation of the vector arithmetic unit according to the decoding result. The vector arithmetic unit contains an operand extraction unit, four 16*32 multiplication units and two 64-bit accumulators (ACC), and mainly executes parallel multiply, multiply-add and multiply-add-subtract operations with SIMD support. The vector arithmetic pipeline mainly handles single instructions that perform several operations in parallel, and can execute at most 4 groups of 16-bit multiply-add operations in parallel.
The pipeline control unit generates the control signals that govern the operation of each pipeline. It receives the state of each pipeline in the current stage, evaluates interrupts, traps and similar events, generates the state of each pipeline for the next stage, and controls the operation of each pipeline in that next stage.
As shown in Fig. 2, the present invention adopts a hierarchical memory organization. Closest to the execution units is the internal register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers and sixteen 32-bit vector registers; the second level is the cache; the third level is the data memory. The program memory interface contains a program cache and the data memory interface contains a data cache; each memory interface can be configured to use or bypass its cache. The general-purpose registers can be combined: two 32-bit address registers can form a 64-bit address register pair, two 32-bit vector registers can form a 64-bit vector register pair, and two 32-bit data registers can form a 64-bit data register pair; 4 vector registers or 4 data registers can form a 128-bit data register group to support SIMD operations.
As shown in Fig. 3, the instruction fetch unit of the present invention reads instructions from outside into the instruction buffer (BUF), pre-decodes them, and sends them to the corresponding pipelines. The instruction fetch unit comprises a 256-bit instruction BUF, a pre-decode unit, a fetch request generation unit, a fetch cancel signal generation unit, a fetch address generation unit, a loop instruction processing unit and a branch/jump instruction processing unit. The instruction BUF holds the instructions that have been read in; when it holds fewer than 128 bits of data, the fetch unit generates a fetch request signal and reads 128 bits of instruction data into the instruction BUF. The fetch request generation unit produces the fetch request signal according to the pipeline state and the number of data bits in the instruction BUF. The fetch cancel signal generation unit decides from the pipeline state whether a fetch must be cancelled and, if so, generates the fetch cancel signal. The loop instruction processing unit and the branch/jump instruction processing unit handle loop instructions and branch/jump instructions respectively and calculate the new PC value. The fetch address generation unit generates the fetch address for the next cycle from the previous PC value and the results of the loop and branch/jump instructions. The instruction pre-decode unit comprises integer arithmetic instruction generation logic, vector arithmetic instruction generation logic, data access instruction generation logic and so on; it examines the low-order bits of the instruction BUF to determine the number and width of the instructions to read out, can read out at most three 32-bit instructions at a time, and sends them to the corresponding pipelines.
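The buffering and issue behavior of the instruction BUF can be approximated by the Python model below. It is a sketch under several assumptions: the buffer is modeled as a queue of 16-bit halfwords, a hypothetical marker (bit 0 of the first halfword) tells a 32-bit instruction from a 16-bit one, and routing to pipelines, fetch cancellation, loops and branches are left out.

```python
from collections import deque

BUFFER_HALFWORDS = 16        # 16 x 16 bits = 256-bit instruction BUF
REFILL_THRESHOLD_BITS = 128  # this passage says "fewer than 128 bits"
FETCH_HALFWORDS = 8          # one fetch delivers 128 bits
MAX_ISSUE = 3                # at most three instructions per cycle


def is_32bit(halfword):
    # Hypothetical length marker; the real encoding is not given in the text.
    return halfword & 1 == 1


class FetchBuffer:
    def __init__(self):
        self.halfwords = deque()

    def bits(self):
        return 16 * len(self.halfwords)

    def needs_refill(self):
        return self.bits() < REFILL_THRESHOLD_BITS

    def refill(self, fetched):
        """Append one 128-bit fetch, given as eight 16-bit halfwords."""
        if len(fetched) != FETCH_HALFWORDS:
            raise ValueError("a fetch delivers exactly 128 bits")
        if len(self.halfwords) + len(fetched) > BUFFER_HALFWORDS:
            raise OverflowError("instruction BUF holds at most 256 bits")
        self.halfwords.extend(fetched)

    def issue(self):
        """Remove and return up to three instructions from the buffer exit."""
        issued = []
        while len(issued) < MAX_ISSUE and self.halfwords:
            length = 2 if is_32bit(self.halfwords[0]) else 1
            if len(self.halfwords) < length:
                break  # second half of a 32-bit instruction not fetched yet
            issued.append(tuple(self.halfwords.popleft() for _ in range(length)))
        return issued
```

Because a refill is only requested below 128 bits and adds exactly 128 bits, the 256-bit capacity is never exceeded in this model.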
As shown in Fig. 4, the vector arithmetic unit of the present invention mainly comprises data forwarding logic, an operand extraction unit, four 16*32 multiplication units, 4 stages of CSA and two 64-bit ACCs. The multiply-add decoding unit and the multiply-add control unit decode the instruction and generate the control signals for the vector arithmetic unit. Operands pass through the data forwarding logic and are then sent to the operand extraction unit; they come mainly from the data registers, from immediates or from the result of the previous cycle. An instruction operand is at most 128 bits wide and comes from data register group Xn or XVn. The operand extraction unit extracts operands according to the instruction decoding result and delivers them to the multiplication units and the ACCs. For example, when executing a SIMD instruction that performs 4 groups of 16-bit multiply-add operations in parallel, the operand extraction unit delivers the 64-bit input operands, in 16-bit halfword units, to the 4 multiplication units, which carry out 4 parallel 16-bit multiplications; the 128-bit input operand is then delivered, in 32-bit word units, to the two ACCs, which carry out 4 parallel 32-bit additions. The results of the two ACCs pass through the result generation logic, are combined into a 128-bit vector result, and are written back to the 128-bit data register group Xn or XVn. The vector arithmetic unit can therefore execute 4 groups of 16-bit multiply-add operations of the form 32+16*16 in parallel and combine the four 32-bit results into 128 bits of data written back to a data register group.
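The 4-way multiply-add described above (32+16*16 in each lane) can be modeled functionally in Python. The sketch assumes signed two's-complement operands, results that wrap to 32 bits, and lane 0 in the least significant bits; the description does not specify signedness, overflow handling or lane order, and the function names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def unpack(value, lane_bits, count):
    """Split `value` into `count` unsigned lanes, lane 0 least significant."""
    mask = (1 << lane_bits) - 1
    return [(value >> (lane_bits * i)) & mask for i in range(count)]


def pack(values, lane_bits):
    """Pack integers into lanes of `lane_bits` bits, lane 0 least significant."""
    mask = (1 << lane_bits) - 1
    packed = 0
    for i, lane in enumerate(values):
        packed |= (lane & mask) << (lane_bits * i)
    return packed


def simd_mac4(acc128, a64, b64):
    """Four parallel lanes of acc32 + a16 * b16, returned as a 128-bit value."""
    results = []
    for acc, a, b in zip(unpack(acc128, 32, 4), unpack(a64, 16, 4), unpack(b64, 16, 4)):
        results.append(to_signed(acc, 32) + to_signed(a, 16) * to_signed(b, 16))
    return pack(results, 32)
```

With accumulator lanes 10, 20, 30, 40 and halfword lanes (1, 2, -3, 4) and (5, 6, 7, -8), the four results are 15, 32, 9 and 8, packed into one 128-bit value.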
As shown in Figure 5, the general-purpose register file of the present invention is a multi-port memory bank. The general-purpose registers comprise address registers, data registers, and vector registers; all three register files are multi-port memory banks that support parallel read and write operations. All three groups of registers support data exchange with external memory and also allow different pipelines to read data from them for computation at the same time. The address registers allow the data access pipeline to read data for address arithmetic; the data registers allow the integer arithmetic, vector operation, and data access pipelines to read data for computation; the vector registers allow the vector operation pipeline to read data for computation.
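The read paths listed above can be written down as a small lookup table. The sketch below is only a restatement of that paragraph in code form; the names `READERS` and `check_reads` are hypothetical, and the text does not state that unlisted combinations are forbidden, only that these ones are supported.

```python
# Pipelines that the paragraph above lists as readers of each register bank.
READERS = {
    "address": {"data_access"},
    "data": {"integer", "vector", "data_access"},
    "vector": {"vector"},
}


def check_reads(requests):
    """Return the (pipeline, bank) pairs that the text does not list as supported.

    `requests` is an iterable of (pipeline, bank) pairs made in the same cycle.
    """
    return [(pipe, bank) for pipe, bank in requests if pipe not in READERS[bank]]
```

For instance, a cycle in which the data access pipeline reads an address register while the integer and vector pipelines read data and vector registers yields an empty list.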

Claims (10)

1. A 32-bit triple-issue digital signal processor supporting SIMD, characterized in that it comprises three pipelines that issue in parallel: a data access pipeline, an integer arithmetic pipeline, and a vector operation pipeline, each pipeline having its own independent decoding and execution units and supporting SIMD operation;
The data access pipeline comprises an access decoding unit, a storage control unit, an address arithmetic unit, and an address generation unit; the access decoding unit and the storage control unit are connected to the address generation unit and the address arithmetic unit by control signals and control their operation; the address generation unit and the address arithmetic unit are connected to the address registers and the data registers by a data bus, read data from the address registers and data registers for computation, and either write the result back to the address registers or generate a memory access address that is sent to the data memory interface; the access decoding unit decodes the access-class instructions issued by the instruction fetch unit; the storage control unit controls the operation of the address arithmetic unit and the address generation unit according to the access decode result; the address arithmetic unit performs calculations on address data from the address registers, the data registers, or immediates, and stores the result in the address registers; the address generation unit generates the memory access address according to the addressing mode; only the data access pipeline performs read and write operations on external memory, while the other two pipelines operate only on registers;
The integer arithmetic pipeline comprises an ALU decoding unit, an arithmetic control unit, a bit processing unit, an ALU, and a coprocessor interface unit; the ALU decoding unit and the arithmetic control unit are connected to the ALU and the bit processing unit by control signals and control their operation; the ALU and the bit processing unit are connected to the data registers by a data bus, read data from the data registers for processing, and write the result back to the data registers; the bit processing unit performs bit operations, including bit insertion, bit extraction, bit replacement, and bit shifting; the coprocessor interface unit is responsible for data exchange with the coprocessor;
The vector operation pipeline comprises a multiply-add decoding unit, a multiply-add control unit, and a vector operation unit; the multiply-add decoding unit and the multiply-add control unit are connected to the vector operation unit by control signals and control its operation; the vector operation unit is connected to the data registers and the vector registers by a data bus, reads data from the data registers or vector registers for processing, and writes the result back to the data registers or vector registers; the vector operation unit contains an operand extraction unit, four 16*32 multiplication units, and two 64-bit accumulators, is mainly used to execute parallel multiply, multiply-add, and multiply-add-subtract operations, and supports SIMD;
and further comprising:
an instruction fetch unit, which is shared by the three pipelines, controls the program counter of the instruction fetch unit to implement sequential execution and jumps of the program, and predicts data conflicts; it reads program data through the program memory interface into the instruction buffer of the instruction fetch unit, examines the instruction codes in the instruction buffer, and, according to the result, sends 32-bit or 16-bit instruction codes to the corresponding pipelines;
a floating-point unit, which serves as a coprocessor and is connected to the integer arithmetic pipeline through the coprocessor interface unit; it has an independent floating-point instruction set, performs floating-point operations via the integer arithmetic pipeline, and returns results to the data registers through the integer arithmetic pipeline;
a register file comprising sixteen 32-bit address registers, sixteen 32-bit data registers, sixteen 32-bit vector registers, and special function registers; the address registers are used for address arithmetic and memory access address generation, the data registers for integer arithmetic and vector operations, and the vector registers for supporting the SIMD operations of vector computation, and all of them allow different pipelines to read and write data in parallel through different ports; the program memory interface is connected to the instruction fetch unit by an address bus and a program bus, receives the fetch address sent by the instruction fetch unit, and sends instruction data to the instruction fetch unit over the program bus; the instruction fetch unit is connected by an instruction bus to the access decoding unit, the ALU decoding unit, and the multiply-add decoding unit, and can send three instructions to these three decoding units in parallel; the data registers, vector registers, and address registers are connected to the data memory interface unit by a data bus and exchange data with external data memory through the data memory interface; the pipeline control unit is connected to each execution unit by pipeline control signals, controls the execution of each execution unit, and receives feedback from each execution unit.
2. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 1, characterized in that it has dedicated multimedia processing instructions and dedicated Viterbi decoding instructions, supports 16-bit instructions and floating-point instructions, and supports fixed-point and floating-point operations, wherein the 16-bit instructions are a subset of the 32-bit instructions.
3. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 1, characterized in that it has an independent data cache and an independent program cache.
4. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 1, characterized in that each pipeline has its own dedicated instructions, and two bits in the instruction opcode are specifically designed to distinguish these three classes of instructions; in the fetch stage, the instruction fetch unit pre-decodes these two bits to determine the instruction class and then sends the instruction to the corresponding pipeline according to its class.
5. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 1, characterized in that a hierarchical memory organization is adopted: the first level, closest to the execution units, is the internal register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers, and sixteen 32-bit vector registers; the second level is the cache; the third level is the data memory; two 32-bit address registers can form a 64-bit address register pair, two 32-bit vector registers can form a 64-bit vector register pair, two 32-bit data registers can form a 64-bit data register pair, and four vector registers or four data registers can form a 128-bit data register group, for supporting SIMD operations.
6. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 1, characterized in that a 5-stage pipeline structure is adopted, the stages being fetch, decode, execute 1, execute 2, and write-back, wherein the decode and execute portions each have three independent sets of components.
7. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 6, characterized in that the three pipelines share the instruction fetch unit and execute independently and in parallel in the decode, execute 1, execute 2, and write-back stages; wherein the data access pipeline is responsible for address arithmetic, memory access, and unconditional jumps; the integer arithmetic pipeline is responsible for addition and subtraction, logical operations, comparison operations, shift operations, floating-point operations, bit operations, and conditional jumps; and the vector operation pipeline is mainly responsible for executing single or multiple parallel multiply, multiply-add, multiply-add-subtract, and multiply-subtract-add operations.
8. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 1, characterized in that the instruction fetch unit has an instruction pre-decode function and determines from the pre-decode result whether an instruction is issued to the data access pipeline, the integer arithmetic pipeline, or the vector operation pipeline; the instruction fetch unit contains a 16*16-bit instruction buffer; the instruction fetch unit examines the instructions at the exit of the instruction buffer and determines the width and number of instructions to issue, issuing at most three 32-bit instructions simultaneously; when the data in the instruction buffer is less than or equal to 128 bits, the instruction fetch unit reads in 128 bits of program data through the program memory interface.
9. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 1, characterized in that the vector operation pipeline has independent instruction decoding and instruction control units, an independent vector operation unit, and a dedicated instruction set; the vector operation unit comprises an operand extraction unit, four 16*32 multiplication units, and two 64-bit accumulators; the dedicated instruction set includes SIMD instructions and supports parallel multiply, multiply-add, multiply-subtract, and multiply-add-subtract; at most four 16-bit or two 32-bit multiply, multiply-add, multiply-subtract, or multiply-add-subtract operations are executed in parallel.
10. The 32-bit triple-issue digital signal processor supporting SIMD as claimed in claim 1, characterized in that it has vector-processing-class instructions dedicated to the vector processing pipeline; the vector-processing-class instructions support operations on the 128-bit data register groups Xn and XVn, so that a single instruction can perform four 16-bit multiply-add operations in parallel; wherein Xn is a data register group composed of four 32-bit data registers, and XVn is a data register group composed of four 32-bit vector registers.
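The register pairing and grouping recited in the claims (two 32-bit registers forming a 64-bit pair, four forming a 128-bit group such as Xn or XVn) can be pictured with the following sketch. It is illustrative only: the class `RegisterBank`, the choice of consecutive register indices for a group, and the placement of the lowest-numbered register in the least significant word are all assumptions that the claims do not specify.

```python
MASK32 = 0xFFFFFFFF


class RegisterBank:
    """Sixteen 32-bit registers that can also be accessed as pairs or groups."""

    def __init__(self, count=16):
        self.regs = [0] * count

    def read_group(self, first, words):
        """Concatenate `words` consecutive registers; regs[first] is the low word."""
        value = 0
        for i in range(words):
            value |= (self.regs[first + i] & MASK32) << (32 * i)
        return value

    def write_group(self, first, words, value):
        """Split `value` into 32-bit words and store them in consecutive registers."""
        for i in range(words):
            self.regs[first + i] = (value >> (32 * i)) & MASK32
```

Under these assumptions, `read_group(first, 2)` models a 64-bit register pair and `read_group(first, 4)` models a 128-bit register group.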
CN201210205812.0A 2012-06-20 2012-06-20 32-Bit triple-emission digital signal processor supporting SIMD Active CN102750133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210205812.0A CN102750133B (en) 2012-06-20 2012-06-20 32-Bit triple-emission digital signal processor supporting SIMD

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210205812.0A CN102750133B (en) 2012-06-20 2012-06-20 32-Bit triple-emission digital signal processor supporting SIMD

Publications (2)

Publication Number Publication Date
CN102750133A CN102750133A (en) 2012-10-24
CN102750133B true CN102750133B (en) 2014-07-30

Family

ID=47030356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210205812.0A Active CN102750133B (en) 2012-06-20 2012-06-20 32-Bit triple-emission digital signal processor supporting SIMD

Country Status (1)

Country Link
CN (1) CN102750133B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103218346A (en) * 2012-11-13 2013-07-24 长沙景嘉微电子股份有限公司 Digital signal processor applied to radio-frequency communication receiver
CN104699458A (en) * 2015-03-30 2015-06-10 哈尔滨工业大学 Fixed point vector processor and vector data access controlling method thereof
CN105426161B (en) * 2015-11-12 2017-11-07 天津大学 A kind of decoding circuit of the vectorial coprocessor of POWER instruction set
US9830150B2 (en) * 2015-12-04 2017-11-28 Google Llc Multi-functional execution lane for image processor
CN106991073B (en) * 2016-01-20 2020-06-05 中科寒武纪科技股份有限公司 Data read-write scheduler and reservation station for vector operation
CN111580863B (en) * 2016-01-20 2024-05-03 中科寒武纪科技股份有限公司 Vector operation device and operation method
CN107315575B (en) * 2016-04-26 2020-07-31 中科寒武纪科技股份有限公司 Device and method for executing vector merging operation
CN111176608A (en) * 2016-04-26 2020-05-19 中科寒武纪科技股份有限公司 Apparatus and method for performing vector compare operations
CN112214244A (en) * 2016-08-05 2021-01-12 中科寒武纪科技股份有限公司 Arithmetic device and operation method thereof
CN107748674B (en) * 2017-09-07 2021-08-31 中国科学院微电子研究所 Information processing system oriented to bit granularity
CN108228236A (en) * 2017-12-06 2018-06-29 中国航空工业集团公司西安航空计算技术研究所 A kind of powerful instruction for supporting flowing water emits processing circuit
US10915317B2 (en) * 2017-12-22 2021-02-09 Alibaba Group Holding Limited Multiple-pipeline architecture with special number detection
CN111913746B (en) * 2020-08-31 2022-08-19 中国人民解放军国防科技大学 Design method of low-overhead embedded processor
CN112099762B (en) * 2020-09-10 2024-03-12 上海交通大学 Synergistic processing system and method for rapidly realizing SM2 cryptographic algorithm
CN112230995B (en) * 2020-10-13 2024-04-09 广东省新一代通信与网络创新研究院 Instruction generation method and device and electronic equipment
WO2022141321A1 (en) * 2020-12-30 2022-07-07 深圳市大疆创新科技有限公司 Dsp and parallel computing method therefor
CN115826910B (en) * 2023-02-07 2023-05-02 成都申威科技有限责任公司 Vector fixed point ALU processing system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1188275A (en) * 1996-08-19 1998-07-22 三星电子株式会社 Single-instruction-multiple-data processing with combined scalar/vector operations
CN1349159A (en) * 2001-11-28 2002-05-15 中国人民解放军国防科学技术大学 Vector processing method of microprocessor
US8108652B1 (en) * 2007-09-13 2012-01-31 Ronald Chi-Chun Hui Vector processing with high execution throughput

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Masaki Toyokura et al. A Video DSP with a Macroblock-Level-Pipeline and a SIMD Type Vector-Pipeline Architecture for MPEG2 CODEC. IEEE Journal of Solid-State Circuits, 1994-12-31, Vol. 29, No. 12, pp. 1474-1481. *
王亮 (Wang Liang) et al. 基于对指令数据区分访问的混合cache低功耗策略 (A hybrid-cache low-power strategy based on differentiated access to instructions and data). 《计算机应用研究》 (Application Research of Computers), 2008-06-15, Vol. 25, No. 6, pp. 1894-1896. *

Also Published As

Publication number Publication date
CN102750133A (en) 2012-10-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant