CN102750133A - 32-Bit triple-emission digital signal processor supporting SIMD - Google Patents

Publication number
CN102750133A
Authority
CN
China
Prior art keywords
unit
data
instruction
address
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012102058120A
Other languages
Chinese (zh)
Other versions
CN102750133B (en)
Inventor
屈凌翔
张庆文
黄嵩人
杨晓刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 58 Research Institute
Original Assignee
CETC 58 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 58 Research Institute
Priority to CN201210205812.0A
Publication of CN102750133A
Application granted
Publication of CN102750133B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a 32-bit triple-issue digital signal processor (DSP) supporting SIMD (single instruction, multiple data). It comprises three pipelines that issue in parallel: a data access pipeline, an integer arithmetic pipeline and a vector arithmetic pipeline. Each pipeline has its own decoding and execution units and supports SIMD operation. The processor is mainly composed of a program memory interface unit, a data memory interface unit, an instruction fetch unit, a pipeline control unit, a system bus, a data access pipeline unit, an integer arithmetic pipeline unit, a vector arithmetic pipeline unit, data registers, address registers, vector registers, a coprocessor interface unit and a floating-point arithmetic unit, all interconnected by circuitry. Because the three pipelines execute in parallel, the parallel processing capability of the DSP is improved. In addition, the processor executes four groups of 16-bit multiply-accumulate operations in parallel in a single cycle and can perform five groups of data operations and one group of data access at the same time, which enhances the data processing capability of the DSP.

Description

32-bit triple-issue digital signal processor supporting SIMD
Technical field
The present invention relates to digital signal processors, and specifically to a 32-bit triple-issue DSP that supports SIMD.
Background art
A DSP processes digital signals in real time and offers data processing capability far beyond that of a general-purpose processor; it plays an important role in fields such as data communication and multimedia processing. With the rapid development of communication and multimedia technology, the demands on DSP data processing capability keep rising. The main ways to raise that capability are to increase the DSP clock frequency, to adopt a multi-core architecture that raises the processing capability of the whole circuit, and to adopt a multi-issue architecture that raises the parallel processing capability of the DSP core. As DSP clock frequencies climb, the cost and difficulty of raising them further also climb; and although a multi-core architecture relaxes the requirements on each DSP core, it greatly increases the design difficulty of the whole SoC. Multi-issue architectures are therefore becoming increasingly common alongside higher clock frequencies and multi-core designs. The present invention adopts a 3-issue structure in which 3 pipelines execute in parallel. In particular, to raise multiply-accumulate capability, a separate vector processing pipeline is provided; it executes completely independently of the other two pipelines and mainly supports parallel data operations.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to improve the parallel computing capability and multiply-accumulate capability of the processor by providing a 32-bit triple-issue digital signal processor supporting SIMD, aimed at multimedia processing and data communication.
According to the technical solution provided by the present invention, the 32-bit triple-issue digital signal processor supporting SIMD comprises 3 pipelines that issue in parallel: a data access pipeline, an integer arithmetic pipeline and a vector arithmetic pipeline. Each pipeline has independent decoding and execution units and supports SIMD operation.
The data access pipeline comprises an access decoding unit, a storage control unit, an address arithmetic unit and an address generation unit. The access decoding unit and the storage control unit are connected to the address generation unit and the address arithmetic unit by control signals and control their operation. The address generation unit and the address arithmetic unit are connected to the address registers and data registers by a data bus; they read data from the address registers and data registers, operate on it, and either write the result back to an address register or generate a memory access address that is sent to the data memory interface. The access decoding unit decodes the access-class instructions issued by the instruction fetch unit. The storage control unit controls the address arithmetic unit and the address generation unit according to the access decoding result. The address arithmetic unit computes on address data from the address registers, the data registers or immediate values and stores the result in an address register. The address generation unit generates memory access addresses according to the addressing mode and related information. Only the data access pipeline can read and write external memory; the other two pipelines operate only on registers.
The integer arithmetic pipeline comprises an ALU decoding unit, an arithmetic control unit, a bit processing unit, an ALU and a coprocessor interface unit. The ALU decoding unit and the arithmetic control unit are connected to the ALU and the bit processing unit by control signals and control their operation. The ALU and the bit processing unit are connected to the data registers by a data bus; they read data from the data registers, process it, and write the result back to the data registers. The bit processing unit executes bit operations, including bit insertion, bit extraction, bit replacement and shifts. The coprocessor interface unit handles data exchange with the coprocessor.
The vector arithmetic pipeline comprises a multiply-accumulate decoding unit, a multiply-accumulate control unit and a vector arithmetic unit. The multiply-accumulate decoding unit and the multiply-accumulate control unit are connected to the vector arithmetic unit by control signals and control its operation. The vector arithmetic unit is connected to the data registers and vector registers by a data bus; it reads data from the data registers or vector registers, processes it, and writes the result back to the data registers or vector registers. The vector arithmetic unit contains an operand extraction unit, four 16×32 multiplication units and two 64-bit accumulators, and is mainly used to execute parallel multiply, multiply-accumulate and multiply-add-subtract operations with SIMD support.
The processor further comprises:
An instruction fetch unit, shared by the 3 pipelines. It controls the program counter in the unit to implement sequential execution and jumps, and it predicts data conflicts. It reads program data through the program memory interface into the instruction buffer of the fetch unit, examines the instruction codes in the buffer, and sends 32-bit or 16-bit instruction codes to the corresponding pipelines according to the result.
A floating-point unit, serving as a coprocessor and connected to the integer arithmetic pipeline through the coprocessor interface unit. It has an independent floating-point instruction set, performs floating-point operations by way of the integer arithmetic pipeline, and returns results to the data registers through that pipeline.
A register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers, sixteen 32-bit vector registers and special function registers. The address registers are used for address arithmetic and memory access address generation, the data registers for integer and vector arithmetic, and the vector registers for the SIMD operations of vector arithmetic. All of them allow different pipelines to read and write data in parallel through different ports.
The program memory interface is connected to the instruction fetch unit by an address bus and a program bus; it receives the fetch address sent by the instruction fetch unit and returns instruction data to the fetch unit over the program bus. The instruction fetch unit is connected by an instruction bus to the access decoding unit, the ALU decoding unit and the multiply-accumulate decoding unit, and can send 3 instructions to these 3 decoding units in parallel. The data registers, vector registers and address registers are connected to the data memory interface unit by a data bus and exchange data with external data memory through the data memory interface. The pipeline control unit is connected to each execution unit by pipeline control signals; it controls the execution of each unit and receives feedback from each unit.
The 32-bit triple-issue digital signal processor supporting SIMD has special-purpose multimedia processing instructions and Viterbi decoding instructions; it supports 16-bit instructions and floating-point instructions, and supports both fixed-point and floating-point arithmetic, the 16-bit instructions being a subset of the 32-bit instructions. It has independent data and program caches.
Each pipeline has its own dedicated instructions, and two bits of the instruction opcode are reserved to distinguish the three instruction classes. In the fetch stage, the instruction fetch unit pre-decodes these two bits to determine the instruction class and then sends the instruction to the corresponding pipeline.
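The class-bit pre-decoding described above can be sketched as follows. This is a minimal illustration, not the patented encoding: the bit position (`CLASS_SHIFT`), the mapping in `CLASS_TABLE`, and the rule that each pipeline accepts at most one instruction per cycle are all assumptions, since the text only states that two opcode bits distinguish the three classes.

```python
from enum import Enum


class Pipe(Enum):
    ACCESS = "data access"
    INTEGER = "integer arithmetic"
    VECTOR = "vector arithmetic"


# Hypothetical encoding: the source does not say where the two class
# bits sit or which value selects which pipeline.
CLASS_SHIFT = 30
CLASS_TABLE = {0b00: Pipe.ACCESS, 0b01: Pipe.INTEGER, 0b10: Pipe.VECTOR}


def predecode(word: int) -> Pipe:
    """Return the pipeline selected by the two class bits of a 32-bit word."""
    cls = (word >> CLASS_SHIFT) & 0b11
    if cls not in CLASS_TABLE:
        raise ValueError(f"reserved class bits: {cls:#04b}")
    return CLASS_TABLE[cls]


def dispatch(words):
    """Assign instructions to pipelines in program order.

    Assumption: one instruction per pipeline per cycle, so issue stops
    at the first instruction whose pipeline is already occupied.
    """
    slots = {}
    for w in words:
        pipe = predecode(w)
        if pipe in slots:
            break  # pipeline busy this cycle; remaining words wait
        slots[pipe] = w
    return slots
```

With this hypothetical encoding, three consecutive words of three different classes fill all three slots in one cycle, while two access-class words in a row issue over two cycles.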
The present invention organizes memory hierarchically. The first level, closest to the execution units, is the internal register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers and sixteen 32-bit vector registers; the second level is the cache; the third level is the data memory. Two 32-bit address registers can form a 64-bit address register pair; two 32-bit vector registers can form a 64-bit vector register pair; two 32-bit data registers can form a 64-bit data register pair; and 4 vector registers or 4 data registers can form a 128-bit register group used to support SIMD operations.
The present invention uses a 5-stage pipeline: fetch, decode, execute 1, execute 2 and write-back. The decode and execute stages each have 3 independent sets of hardware.
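To make the five-stage flow concrete, the sketch below tabulates which stage each issue group occupies in each cycle. It assumes an ideal pipeline with one new issue group per cycle and no stalls or cancelled fetches; the stage names and the function `timeline` are illustrative only.

```python
STAGES = ["fetch", "decode", "execute1", "execute2", "writeback"]


def timeline(num_groups: int):
    """Return, per cycle, a dict mapping issue-group index to its stage.

    Assumes group g enters the fetch stage in cycle g and advances one
    stage per cycle with no stalls.
    """
    total_cycles = num_groups + len(STAGES) - 1
    rows = []
    for cycle in range(total_cycles):
        row = {}
        for group in range(num_groups):
            stage_index = cycle - group
            if 0 <= stage_index < len(STAGES):
                row[group] = STAGES[stage_index]
        rows.append(row)
    return rows
```

Under these assumptions, three issue groups complete in seven cycles, and each group can carry up to three instructions, one per pipeline.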
The 3 pipelines share the instruction fetch unit and run independently and in parallel in the decode, execute 1, execute 2 and write-back stages. The data access pipeline handles address arithmetic, memory access and unconditional jumps. The integer arithmetic pipeline handles addition and subtraction, logical operations, comparisons, shifts, floating-point operations, bit operations and conditional jumps. The vector arithmetic pipeline mainly executes single or multiple parallel multiply, multiply-accumulate, multiply-add-subtract and multiply-subtract-add operations.
The instruction fetch unit has an instruction pre-decoding function and, according to the pre-decoding result, decides whether an instruction is issued to the data access pipeline, the integer arithmetic pipeline or the vector arithmetic pipeline. The instruction fetch unit contains a 16×16-bit instruction buffer. It examines the instructions at the buffer outlet and decides the width and number of instructions to issue; at most three 32-bit instructions can be issued simultaneously. When the buffer holds 128 bits or fewer, the instruction fetch unit reads 128 bits of program data through the program memory interface.
The vector arithmetic pipeline has its own instruction decoding and instruction control units, an independent vector arithmetic unit and a dedicated instruction set. The vector arithmetic unit contains an operand extraction unit, four 16×32 multiplication units and two 64-bit accumulators (ACC). The dedicated instruction set includes SIMD instructions and supports parallel multiply, multiply-accumulate, multiply-subtract and multiply-add-subtract operations. At most 4 groups of 16-bit or 2 groups of 32-bit multiply, multiply-accumulate, multiply-subtract or multiply-add-subtract operations can execute in parallel.
The present invention has vector processing instructions dedicated to the vector processing pipeline. These instructions operate on the 128-bit register groups Xn and XVn, so that a single instruction can execute 4 parallel 16-bit multiply-accumulate operations. Xn is a register group formed from four 32-bit data registers, and XVn is a register group formed from four 32-bit vector registers.
The present invention is a low-power, high-speed, high-performance digital signal processor for embedded systems, used mainly in embedded applications such as wireless communication, image processing and real-time control. It adopts a superscalar RISC instruction architecture, issues 3 instructions in parallel in a single clock cycle, decodes 16-bit and 32-bit instructions, performs 4 MAC operations in a single cycle, and supports a rich set of DSP addressing modes, SIMD operations and floating-point arithmetic.
The advantages of the present invention are as follows. It is a 32-bit fixed-point/floating-point signal processor supporting single instruction, multiple data (SIMD) operation and triple issue; it issues different instructions in parallel to the corresponding execution units, supports 3 pipelines operating in parallel, and also supports SIMD-class instructions. Parallel execution of the 3 pipelines improves the parallel processing capability of the DSP. An independent vector arithmetic unit is added that executes 4 groups of 16-bit multiply-accumulate operations in parallel in a single cycle. Together with the integer arithmetic unit and the data access unit, which execute in parallel with the vector pipeline, the invention can perform 5 groups of data operations and 1 group of data access at the same time, which raises the data processing capability of the DSP.
Description of the drawings
Fig. 1 is the basic structural block diagram of the DSP of the present invention.
Fig. 2 is the multi-level memory hierarchy diagram of the DSP of the present invention.
Fig. 3 is the structural diagram of the instruction fetch unit of the DSP of the present invention.
Fig. 4 is the basic block diagram of the vector arithmetic unit of the DSP of the present invention.
Fig. 5 is the data flow diagram of the multi-port register file of the DSP of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawings and embodiments.
The digital signal processor of the present invention adopts a 3-issue, 5-stage pipeline structure with 3 pipelines executing in parallel. Each pipeline has the 5 stages fetch, decode, execute 1, execute 2 and write-back. The fetch unit is shared by the 3 pipelines, while the decode and execute stages have 3 independent sets of hardware.
The instruction set of the digital signal processor of the present invention contains 16-bit and 32-bit instructions, the 16-bit instructions being a subset of the 32-bit instructions. In addition to general DSP instructions, the instruction set includes special-purpose multimedia processing instructions, special-purpose Viterbi decoding instructions, vector processing instructions and floating-point instructions. The multimedia instructions include vector averaging, matrix operations and byte extraction, and are used to accelerate multimedia processing. The Viterbi instructions include bit interleaving, bit de-interleaving and Viterbi traceback, and are used to accelerate Viterbi decoding. The vector processing instructions are mainly used for SIMD operations; they operate on 128-bit data and support 4 MAC operations in a single cycle. The floating-point instructions support floating-point arithmetic.
As shown in Fig. 1, the present invention comprises 3 pipelines that issue in parallel: a data access pipeline, an integer arithmetic pipeline and a vector arithmetic pipeline. Each pipeline has independent decoding and execution units and supports SIMD operation.
The present invention is mainly composed of a program memory interface unit, a data memory interface unit, an instruction fetch unit, a pipeline control unit, a system bus, a data access pipeline unit, an integer arithmetic pipeline unit, a vector arithmetic pipeline unit, data registers, address registers, vector registers, a coprocessor interface unit and a floating-point unit, interconnected by circuitry. The data access pipeline unit comprises an access decoding unit, a storage control unit, an address arithmetic unit and an address generation unit. The integer arithmetic pipeline unit comprises an ALU decoding unit, an arithmetic control unit, an ALU and a bit processing unit. The vector arithmetic pipeline unit comprises a multiply-accumulate decoding unit, a multiply-accumulate control unit and a vector arithmetic unit. The floating-point unit is connected to the integer arithmetic pipeline as a coprocessor through the coprocessor interface unit. The program memory interface unit and the data memory interface unit are the interfaces for exchanging data with the outside; the program memory interface contains a configurable program cache, and the data memory interface contains a configurable data cache. The pipeline control unit handles pipeline state management and exception handling. The general-purpose register file consists of sixteen 32-bit address registers, sixteen 32-bit data registers and sixteen 32-bit vector registers and forms the first level of storage, closest to the processor; the second level is the cache and the third level is the data memory.
The connections between the modules of the present invention are shown in Fig. 1. The program memory interface is connected to the instruction fetch unit by an address bus and a program bus; it receives the fetch address sent by the instruction fetch unit and returns instruction data to the fetch unit over the program bus. The instruction fetch unit is connected by an instruction bus to the access decoding unit, the ALU decoding unit, the multiply-accumulate decoding unit and the associated instruction control units, and can send 3 instructions to these 3 decoding units in parallel. The access decoding unit and the storage control unit are connected to the address generation unit and the address arithmetic unit by control signals and control their operation. The address generation unit and the address arithmetic unit are connected to the address registers and data registers by a data bus; they read data from the address registers and data registers, operate on it, and either write the result back to an address register or generate a memory access address that is sent to the data memory interface. The ALU decoding unit and the arithmetic control unit are connected to the ALU and the bit processing unit by control signals and control their operation. The ALU and the bit processing unit are connected to the data registers by a data bus; they read data from the data registers, process it, and write the result back to the data registers. The multiply-accumulate decoding unit and the multiply-accumulate control unit are connected to the vector arithmetic unit by control signals and control its operation. The vector arithmetic unit is connected to the data registers and vector registers by a data bus; it reads data from the data registers or vector registers, processes it, and writes the result back to the data registers or vector registers. The data registers, vector registers and address registers are connected to the data memory interface unit by a data bus and exchange data with external data memory through the data memory interface. The pipeline control unit is connected to each execution unit by pipeline control signals; it controls the execution of each unit and receives feedback from each unit. The floating-point unit is connected to the integer arithmetic pipeline through the coprocessor interface; it reads instructions and data through the integer arithmetic pipeline, performs the corresponding operations, and returns the results to the data registers through the integer arithmetic pipeline.
The DSP core reads instructions from external memory through the program memory interface into the instruction buffer of the instruction fetch unit. Pre-decoding the instructions at the buffer outlet determines their type, and they are sent to the corresponding pipelines; at most 3 instructions can be sent in parallel. After the decoding unit and instruction control unit of a pipeline receive an instruction from the fetch unit, they decode it, generate the relevant control signals, determine the types and sources of the operands, and pass the operands to the execution unit. The execution units of the integer arithmetic pipeline and the vector arithmetic pipeline compute on the operands under the control signals generated by the decoding unit and deliver the result to the relevant register in the write-back stage. The execution unit of the data access pipeline performs address arithmetic or generates the memory access address of a data access operation; in the write-back stage it either saves the address arithmetic result to the corresponding address register or reads or writes external memory at the generated address.
The program memory interface connects the external program bus to the internal instruction fetch unit. It contains a configurable program cache that can be enabled or disabled. The data memory interface connects the internal general-purpose registers to the external data bus and provides the path for data exchange. It contains a configurable data cache.
The instruction fetch unit is shared by the 3 pipelines. It reads instructions through the program memory interface, pre-decodes them, and sends them to the corresponding pipelines.
The data access pipeline comprises an access decoding unit, a storage control unit, an address arithmetic unit and an address generation unit. The access decoding unit decodes the access-class instructions issued by the instruction fetch unit. The storage control unit controls the address arithmetic unit and the address generation unit according to the access decoding result. The address arithmetic unit computes on address data from the address registers, the data registers or immediate values and stores the result in an address register. The address generation unit generates memory access addresses according to the addressing mode and related information. The data access pipeline is responsible for address computation and data access: it exchanges data between the internal general-purpose registers and external memory, and also between the data registers and the address registers.
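As a rough illustration of how an address generation unit might turn an address register and an addressing mode into a memory access address, consider the sketch below. The three modes shown (`base_offset`, `post_increment`, `pre_increment`) are common DSP addressing modes chosen for illustration; the source does not enumerate the modes this processor supports, so the mode names and the function `gen_address` are assumptions.

```python
MASK32 = 0xFFFFFFFF  # address registers are 32 bits wide


def gen_address(aregs, base: int, mode: str, offset: int = 0) -> int:
    """Return the access address; update the base register if the mode says so.

    `aregs` is a mutable list of sixteen 32-bit address register values.
    The modes are hypothetical examples, not taken from the patent.
    """
    if mode == "base_offset":
        # Address is base + offset; the register is left unchanged.
        return (aregs[base] + offset) & MASK32
    if mode == "post_increment":
        # Use the old base as the address, then write back base + offset.
        addr = aregs[base]
        aregs[base] = (addr + offset) & MASK32
        return addr
    if mode == "pre_increment":
        # Write back base + offset first, then use it as the address.
        aregs[base] = (aregs[base] + offset) & MASK32
        return aregs[base]
    raise ValueError(f"unknown addressing mode: {mode}")
```

In the two incrementing modes the updated value is written back to the address register, which matches the description of address arithmetic results being stored in an address register.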
The integer arithmetic pipeline comprises an ALU decoding unit, an arithmetic control unit, a bit processing unit, an arithmetic logic unit (ALU) and a coprocessor interface unit. The bit processing unit executes bit operation instructions, including bit insertion, bit extraction, bit replacement and shifts. The coprocessor interface unit handles data exchange with the coprocessor. The integer arithmetic pipeline receives and executes the instructions sent by the fetch unit and writes the results back to the data registers. It is mainly responsible for arithmetic operations such as addition and subtraction, logical operations such as AND, OR and NOT, and bit operations such as shifts.
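The bit extraction and bit insertion operations of the bit processing unit can be sketched on 32-bit values as follows. The function names and the (position, width) parameterization are assumptions for illustration; the source does not give the instruction formats.

```python
MASK32 = 0xFFFFFFFF


def _check_field(pos: int, width: int) -> None:
    if not (0 < width <= 32 and pos >= 0 and pos + width <= 32):
        raise ValueError("bit field must lie within a 32-bit word")


def bit_extract(value: int, pos: int, width: int) -> int:
    """Return the unsigned field of `width` bits starting at bit `pos`."""
    _check_field(pos, width)
    return (value >> pos) & ((1 << width) - 1)


def bit_insert(value: int, field: int, pos: int, width: int) -> int:
    """Return `value` with `width` bits at `pos` replaced by the low bits of `field`."""
    _check_field(pos, width)
    mask = ((1 << width) - 1) << pos
    return ((value & ~mask) | ((field << pos) & mask)) & MASK32
```

For example, `bit_extract(0xABCD1234, 8, 8)` yields `0x12`, and `bit_insert(0xABCD1234, 0xFF, 8, 8)` yields `0xABCDFF34`.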
The vector arithmetic pipeline comprises a multiply-accumulate decoding unit, a multiply-accumulate control unit and a vector arithmetic unit. The multiply-accumulate decoding unit decodes the instructions sent by the fetch unit and generates the associated control signals. The multiply-accumulate control unit controls the vector arithmetic unit according to the decoding result. The vector arithmetic unit contains an operand extraction unit, four 16×32 multiplication units and two 64-bit accumulators (ACC), and is mainly used to execute parallel multiply, multiply-accumulate and multiply-add-subtract operations with SIMD support. The vector arithmetic pipeline is mainly responsible for operations in which a single instruction performs several computations in parallel, and can execute at most 4 groups of 16-bit multiply-accumulate operations in parallel.
The pipeline control unit generates the control signals that govern the operation of each pipeline. It receives the current state of each pipeline, evaluates interrupts, traps and similar events, generates the next state of each pipeline, and controls the operation of each pipeline in the next stage.
As shown in Fig. 2, the present invention organizes memory hierarchically. Closest to the execution units is the internal register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers and sixteen 32-bit vector registers; the second level is the cache; the third level is the data memory. The program memory interface contains the program cache and the data memory interface contains the data cache; each memory interface can be configured to use or bypass its cache. The general-purpose registers can be combined: two 32-bit address registers can form a 64-bit address register pair; two 32-bit vector registers can form a 64-bit vector register pair; two 32-bit data registers can form a 64-bit data register pair; and 4 vector registers or 4 data registers can form a 128-bit register group used to support SIMD operations.
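The register pairing and grouping can be pictured as concatenating consecutive 32-bit registers into a 64-bit or 128-bit value. The sketch below assumes that groups start at an index aligned to the group size and that the lowest-numbered register holds the least significant bits; neither detail is stated in the source, and the names `pack_group` and `unpack_group` are illustrative.

```python
MASK32 = 0xFFFFFFFF


def pack_group(regs, first: int, count: int) -> int:
    """Concatenate `count` consecutive 32-bit registers into one value.

    Assumptions (not from the source): `first` is aligned to `count`,
    and regs[first] supplies the least significant 32 bits.
    """
    if count not in (1, 2, 4) or first % count != 0:
        raise ValueError("group must be 1, 2 or 4 aligned registers")
    value = 0
    for i in range(count):
        value |= (regs[first + i] & MASK32) << (32 * i)
    return value


def unpack_group(value: int, count: int):
    """Split a packed value back into `count` 32-bit register values."""
    return [(value >> (32 * i)) & MASK32 for i in range(count)]
```

A 64-bit pair uses `count=2` and a 128-bit group such as Xn or XVn uses `count=4`.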
As shown in Fig. 3, the instruction fetch unit of the present invention reads instructions from outside into the instruction buffer (BUF), pre-decodes them, and sends them to the corresponding pipelines. The instruction fetch unit comprises a 256-bit instruction BUF, a pre-decoding unit, a fetch request generation unit, a fetch cancel signal generation unit, a fetch address generation unit, a loop instruction processing unit and a branch/jump instruction processing unit. The instruction BUF holds the instructions that have been read in; when the data in the BUF falls below 128 bits, the fetch unit generates a fetch request signal and reads 128 bits of instruction data into the BUF. The fetch request generation unit produces the fetch request signal according to the pipeline state and the number of data bits in the BUF. The fetch cancel signal generation unit decides from the pipeline state whether a fetch should be cancelled and, if so, generates the fetch cancel signal. The loop instruction processing unit and the branch/jump instruction processing unit handle loop instructions and branch/jump instructions respectively and compute the new PC value. The fetch address generation unit generates the fetch address for the next cycle from the previous PC value and the results of the loop and branch/jump instructions. The pre-decoding unit contains integer arithmetic instruction generation logic, vector arithmetic instruction generation logic, data access instruction generation logic and so on. It mainly examines the low-order bits of the instruction BUF to determine the number and width of the instructions to read out; at most three 32-bit instructions can be read simultaneously and sent to the corresponding pipelines.
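The buffer occupancy rule can be modelled with a simple counter. The sketch below follows this paragraph's wording (refill when fewer than 128 bits remain; the summary phrases the threshold as 128 bits or fewer). The class `FetchBuffer` and its methods are illustrative names, and the model tracks only bit counts, not instruction contents.

```python
BUF_BITS = 256     # capacity of the instruction BUF
FETCH_BITS = 128   # bits read per fetch request
MAX_ISSUE = 3      # at most three instructions issued per cycle


class FetchBuffer:
    """Occupancy-only model of the instruction buffer."""

    def __init__(self):
        self.bits = 0  # valid instruction bits currently buffered

    def needs_refill(self) -> bool:
        # Refill threshold as worded in this paragraph: below 128 bits.
        return self.bits < FETCH_BITS

    def refill(self) -> bool:
        """Read 128 bits if the threshold is met; report whether a fetch happened."""
        if self.needs_refill():
            self.bits += FETCH_BITS  # cannot exceed BUF_BITS: at most 127 + 128
            return True
        return False

    def issue(self, lengths):
        """Consume up to three instructions given their widths (16 or 32 bits)."""
        issued = []
        for n in lengths[:MAX_ISSUE]:
            if n not in (16, 32):
                raise ValueError("instruction width must be 16 or 32 bits")
            if n > self.bits:
                break  # not enough buffered bits for this instruction
            self.bits -= n
            issued.append(n)
        return issued
```

With this threshold a refill can never overflow the 256-bit buffer, since occupancy before a fetch is at most 127 bits.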
As shown in Figure 4, the vector operation unit of the present invention mainly comprises data forwarding logic, an operand extraction unit, four 16*32 multiplication units, a 4-level CSA, and two 64-bit ACCs. The multiply-add decoding unit and the multiply-add control unit decode the instruction and generate the control signals of the vector operation unit. Operands that have passed through the data forwarding logic are sent to the operand extraction unit. Operands come mainly from the data registers, from immediates, or from the operation result of the previous cycle; an instruction operand is at most 128 bits wide and comes from the data register group Xn or XVn. The operand extraction unit extracts operands according to the instruction decode result and delivers the extracted operands to the multiplication units and the ACCs. For example, when a SIMD instruction that supports four parallel 16-bit multiply-adds is executed, the operand extraction unit delivers the 64-bit input operands, in units of 16-bit halfwords, to the four multiplication units, which perform four parallel 16-bit multiplications; it then delivers the 128-bit input operand, in units of 32-bit words, to the two ACCs, which perform four parallel 32-bit additions. The results of the two ACCs are processed and combined by the result generation logic into a 128-bit vector operation result, which is written back to the 128-bit data register group Xn or XVn. The vector operation unit can therefore execute four 16-bit multiply-adds in parallel (32+16*16), and the four 32-bit results are combined into 128 bits of data that are written back to the data register group.
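The four-way 16-bit multiply-add described above can be illustrated with a small bit-level model in Python. This is a sketch under stated assumptions, not the hardware datapath: the function names are hypothetical, the 16-bit lanes are assumed to be signed, lane 0 is assumed to sit in the least significant bits, and each 32-bit sum is assumed to wrap around rather than saturate. The patent text does not specify any of these details.

```python
MASK32 = 0xFFFFFFFF


def to_signed16(x: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement value (assumption)."""
    return x - 0x10000 if x & 0x8000 else x


def split_lanes(value: int, lane_bits: int, lanes: int) -> list:
    """Split a wide operand into lanes; lane 0 is the least significant."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask for i in range(lanes)]


def join_lanes(lanes: list, lane_bits: int) -> int:
    """Pack lanes back into one wide value; lane 0 is the least significant."""
    mask = (1 << lane_bits) - 1
    out = 0
    for i, lane in enumerate(lanes):
        out |= (lane & mask) << (i * lane_bits)
    return out


def simd_mac_4x16(acc128: int, a64: int, b64: int) -> int:
    """Compute four parallel 32 + 16*16 operations on packed operands."""
    a = [to_signed16(x) for x in split_lanes(a64, 16, 4)]   # operand extraction
    b = [to_signed16(x) for x in split_lanes(b64, 16, 4)]
    c = split_lanes(acc128, 32, 4)
    # Four multiplications followed by four 32-bit additions (wrapping).
    result = [(c[i] + a[i] * b[i]) & MASK32 for i in range(4)]
    return join_lanes(result, 32)                            # 128-bit result
```

A caller would pack the operands with `join_lanes`, call `simd_mac_4x16`, and unpack the 128-bit result with `split_lanes(result, 32, 4)`.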
As shown in Figure 5, the general-purpose register file of the present invention is a multi-port memory bank. The general-purpose registers comprise address registers, data registers, and vector registers; all three register files are multi-port memory banks that support parallel read and write operations. All three groups of registers support data exchange with external memory and also allow different pipelines to read data from them simultaneously for computation. The address registers allow the data access pipeline to read data for address arithmetic; the data registers allow the integer arithmetic, vector operation, and data access pipelines to read data for computation; the vector registers allow the vector operation pipeline to read data for computation.
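The read relationships between the three pipelines and the three register files can be summarised as a small lookup table. The Python sketch below simply encodes the paragraph above; the identifiers (`READ_ACCESS`, `can_read`, `readable_files`, and the pipeline and register file names) are hypothetical labels chosen for illustration and do not appear in the patent.

```python
# Which pipelines may read operands from which register file, as stated above.
READ_ACCESS = {
    "address": {"data_access"},
    "data": {"integer", "vector", "data_access"},
    "vector": {"vector"},
}


def can_read(pipeline: str, register_file: str) -> bool:
    """Return True if the pipeline reads operands from the register file."""
    return pipeline in READ_ACCESS[register_file]


def readable_files(pipeline: str) -> list:
    """List the register files a pipeline reads from, in alphabetical order."""
    return sorted(name for name, pipes in READ_ACCESS.items() if pipeline in pipes)
```

Under this table the data register file is the only one shared by all three pipelines, which is why it needs the most read ports.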

Claims (10)

1. A 32-bit triple-issue digital signal processor supporting SIMD, characterized in that it comprises three pipelines issued in parallel: a data access pipeline, an integer arithmetic pipeline, and a vector operation pipeline; each pipeline has independent decoding and execution units and supports SIMD operation;
Said data access pipeline comprises an access decoding unit, a storage control unit, an address arithmetic unit, and an address generation unit; the access decoding unit and the storage control unit are connected to the address generation unit and the address arithmetic unit through control signals and control their operation; the address generation unit and the address arithmetic unit are connected to the address registers and the data registers through a data bus, read data from the address registers and the data registers for computation, and either write the result back to the address registers or generate a memory access address that is delivered to the data memory interface; the access decoding unit decodes the access-class instructions issued by the instruction fetch unit; the storage control unit controls the operation of the address arithmetic unit and the address generation unit according to the access decode result; the address arithmetic unit computes on address data from the address registers, the data registers, or immediates, and the result is stored in an address register; the address generation unit generates the memory access address according to the addressing mode and the like; only the data access pipeline can read and write external memory, and the other two pipelines operate only on registers;
Said integer arithmetic pipeline comprises an ALU decoding unit, an arithmetic control unit, a bit processing unit, an ALU, and a coprocessor interface unit; the ALU decoding unit and the arithmetic control unit are connected to the ALU and the bit processing unit through control signals and control their operation; the ALU and the bit processing unit are connected to the data registers through a data bus, read data from the data registers for processing, and write the results back to the data registers; the bit processing unit performs bit operations, including bit insertion, bit extraction, bit replacement, and shift operations; the coprocessor interface unit is responsible for exchanging data with the coprocessor;
Said vector operation pipeline comprises a multiply-add decoding unit, a multiply-add control unit, and a vector operation unit; the multiply-add decoding unit and the multiply-add control unit are connected to the vector operation unit through control signals and control its operation; the vector operation unit is connected to the data registers and the vector registers through a data bus, reads data from the data registers or the vector registers for processing, and writes the results back to the data registers or the vector registers; the vector operation unit contains an operand extraction unit, four 16*32 multiplication units, and two 64-bit accumulators, and is mainly used to perform parallel multiply, multiply-add, and multiply-add-subtract operations with SIMD support;
The processor further comprises:
An instruction fetch unit, which is the fetch unit shared by the three pipelines; it controls the program counter in this unit to realize sequential execution and jumps of the program, and makes predictions on data conflicts; the instruction fetch unit reads program data into its instruction buffer through the program memory interface, examines the instruction codes in the instruction buffer, and sends 32-bit or 16-bit instruction codes to the corresponding pipelines according to the result;
A floating-point unit, which serves as a coprocessor and is connected to the integer arithmetic pipeline through the coprocessor interface unit; it has an independent floating-point instruction set, performs floating-point operations by way of the integer arithmetic pipeline, and returns results to the data registers through the integer arithmetic pipeline;
A register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers, sixteen 32-bit vector registers, and special function registers; the address registers are used for address arithmetic and memory access address generation, the data registers are used for integer arithmetic and vector operation, and the vector registers are used to support the SIMD operation of vector computation; all of them allow different pipelines to read and write data in parallel through different ports at the same time;
The program memory interface is connected to the instruction fetch unit through an address bus and a program bus, receives the fetch address sent by the instruction fetch unit, and sends instruction data to the instruction fetch unit over the program bus; the instruction fetch unit is connected through an instruction bus to the access decoding unit, the ALU decoding unit, and the multiply-add decoding unit, and can send three instructions to these three decoding units in parallel; said data registers, vector registers, and address registers are connected to the data memory interface unit through a data bus and exchange data with external data memory through the data memory interface; the pipeline control unit is connected to each execution unit through pipeline control signals, controls the execution of each execution unit, and receives feedback from each execution unit.
2. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 1, characterized in that it has dedicated multimedia processing instructions and dedicated Viterbi decoding instructions, supports 16-bit instructions and floating-point instructions, and supports fixed-point and floating-point operations, wherein the 16-bit instructions are a subset of the 32-bit instructions.
3. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 1, characterized in that it has an independent data cache and an independent program cache.
4. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 1, characterized in that each pipeline has its own dedicated instructions, and two bits of the instruction opcode are specifically designed to distinguish these three classes of instructions; in the instruction fetch stage, the instruction fetch unit pre-decodes these two bits to determine the instruction class and then sends the instruction to the corresponding pipeline.
5. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 1, characterized in that it adopts a hierarchical memory organization: the first level, closest to the execution units, is the internal register file, comprising sixteen 32-bit address registers, sixteen 32-bit data registers, and sixteen 32-bit vector registers; the second level is the cache; the third level is the data memory; two 32-bit address registers can form a 64-bit address register pair; two 32-bit vector registers can form a 64-bit vector register pair; two 32-bit data registers can form a 64-bit data register pair; four vector registers or data registers can form a 128-bit data register group, used to support SIMD operation.
6. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 1, characterized in that it adopts a 5-stage pipeline structure consisting of fetch, decode, execute 1, execute 2, and write back, wherein the decode and execute stages each have three independent sets of components.
7. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 6, characterized in that the three pipelines share the instruction fetch unit and execute independently in parallel in the decode, execute 1, execute 2, and write-back stages; wherein the data access pipeline is responsible for address arithmetic, memory access, and unconditional jumps; the integer arithmetic pipeline is responsible for addition and subtraction, logical operations, compare operations, shift operations, floating-point operations, bit operations, and conditional jumps; and the vector operation pipeline is mainly responsible for performing single or multiple parallel multiply, multiply-add, multiply-add-subtract, and multiply-subtract-add operations.
8. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 1, characterized in that said instruction fetch unit has an instruction pre-decoding function and decides, according to the pre-decode result, whether an instruction is issued to the data access pipeline, the integer arithmetic pipeline, or the vector operation pipeline; the instruction fetch unit contains a 16*16-bit instruction buffer; the instruction fetch unit examines the instructions at the outlet of the instruction buffer and decides the width and number of instructions to issue, with at most three 32-bit instructions issued simultaneously; when the data in the instruction buffer is less than or equal to 128 bits, the instruction fetch unit reads in 128 bits of program data through the program memory interface.
9. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 1, characterized in that said vector operation pipeline has independent instruction decode and instruction control units, an independent vector operation unit, and a dedicated instruction set; the vector operation unit contains an operand extraction unit, four 16*32 multiplication units, and two 64-bit ACCs; the dedicated instruction set includes SIMD instructions and supports parallel multiply, multiply-add, multiply-subtract, multiply-add-subtract, and similar operations; at most four 16-bit or two 32-bit multiply, multiply-add, multiply-subtract, or multiply-add-subtract operations can be executed in parallel.
10. The 32-bit triple-issue digital signal processor supporting SIMD according to claim 1, characterized in that it has vector-processing-class instructions dedicated to the vector processing pipeline; the vector-processing-class instructions support operations on the 128-bit data register groups Xn and XVn, so that a single instruction can execute four parallel 16-bit multiply-add operations; wherein Xn is a data register group formed of four 32-bit data registers, and XVn is a data register group formed of four 32-bit vector registers.
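The register pairs and 128-bit register groups named in claims 5 and 10 can be pictured with a short Python model. It is a sketch only: the class `RegisterBank` and its methods are hypothetical, and two details are assumptions that the claims do not state, namely that group n occupies the four consecutive registers 4n to 4n+3 and that the lowest-numbered register holds the least significant 32-bit word.

```python
WORD_BITS = 32
WORD_MASK = 0xFFFFFFFF


class RegisterBank:
    """Sixteen 32-bit registers that can be viewed as 64-bit pairs or 128-bit groups."""

    def __init__(self, count: int = 16):
        self.regs = [0] * count

    def read_wide(self, first: int, words: int) -> int:
        """Concatenate `words` consecutive registers, lowest register in the low bits."""
        value = 0
        for i in range(words):
            value |= self.regs[first + i] << (i * WORD_BITS)
        return value

    def write_wide(self, first: int, words: int, value: int) -> None:
        """Scatter a wide value across `words` consecutive 32-bit registers."""
        for i in range(words):
            self.regs[first + i] = (value >> (i * WORD_BITS)) & WORD_MASK

    def read_group(self, n: int) -> int:
        """Read a 128-bit group; the 4n..4n+3 numbering is an assumption."""
        return self.read_wide(4 * n, 4)

    def write_group(self, n: int, value: int) -> None:
        """Write a 128-bit group; the 4n..4n+3 numbering is an assumption."""
        self.write_wide(4 * n, 4, value)
```

With sixteen registers per bank, this layout yields four 128-bit groups or eight 64-bit pairs, and the same storage remains addressable as individual 32-bit registers.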
CN201210205812.0A 2012-06-20 2012-06-20 32-Bit triple-emission digital signal processor supporting SIMD Active CN102750133B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210205812.0A CN102750133B (en) 2012-06-20 2012-06-20 32-Bit triple-emission digital signal processor supporting SIMD

Publications (2)

Publication Number Publication Date
CN102750133A true CN102750133A (en) 2012-10-24
CN102750133B CN102750133B (en) 2014-07-30

Family

ID=47030356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210205812.0A Active CN102750133B (en) 2012-06-20 2012-06-20 32-Bit triple-emission digital signal processor supporting SIMD

Country Status (1)

Country Link
CN (1) CN102750133B (en)

Also Published As

Publication number Publication date
CN102750133B (en) 2014-07-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant