CN101021778A - Computing group structure for superlong instruction word and instruction flow multidata stream fusion - Google Patents

Computing group structure for superlong instruction word and instruction flow multidata stream fusion Download PDF

Info

Publication number
CN101021778A
CN101021778A CN200710034567A
Authority
CN
China
Prior art keywords
data
instruction
simd
group
microcontroller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710034567
Other languages
Chinese (zh)
Other versions
CN100456230C (en)
Inventor
邢座程
杨学军
张民选
蒋江
张承义
马驰远
李勇
陈海燕
高军
李晋文
衣晓飞
张明
穆长富
阳柳
曾献君
倪晓强
唐遇星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CNB2007100345670A priority Critical patent/CN100456230C/en
Publication of CN101021778A publication Critical patent/CN101021778A/en
Application granted granted Critical
Publication of CN100456230C publication Critical patent/CN100456230C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Advance Control (AREA)

Abstract

This invention discloses a computing cluster structure that fuses very long instruction word (VLIW) execution with single-instruction-stream multiple-data-stream (SIMD) execution. The structure comprises a data buffer and a microcontroller, both connected to a main controller, and a computing cluster connected to the data buffer and the microcontroller. The main controller is responsible for moving instructions and data: it loads instructions into the microcontroller, loads data into the data buffer, and receives their output. The data buffer receives data from the main controller, supplies operands to the computing cluster, receives the cluster's results, and outputs the final results to the main controller. The microcontroller receives the VLIW sequence from the main controller, decodes it, and broadcasts it to each processing unit for parallel execution. The computing cluster consists of a number of identical processing units, each containing several processing elements; all units execute the same instruction sequence, but their data come from different locations in the data buffer.

Description

Computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution
Technical field
The present invention relates generally to instruction-processing techniques in microprocessor design, and in particular to a computing cluster structure that fuses very long instruction word and single-instruction-stream multiple-data-stream execution in processors intended for compute-intensive applications.
Background technology
Instruction-level parallelism (ILP) is the main avenue of parallelism exploitation in microprocessor design: independent instructions can execute in parallel, raising the processor's execution efficiency. Representative techniques are superscalar execution and the very long instruction word (VLIW) technique. A VLIW machine provides multiple functional units in hardware that can execute instructions in parallel; the compiler is responsible for packing operations into a sequence of very long instruction words, each containing several primitive operations that can execute in parallel and map directly onto the functional units during execution. ILP is thereby exploited without complex hardware dependence-detection and issue logic. At the same time, compute-intensive applications such as multimedia and scientific computing also contain abundant data-level parallelism (DLP): data of the same type or structure often undergo the same operation or chain of operations. Dedicated instruction-processing techniques can exploit the DLP in such programs effectively and thus raise the processor's execution performance.
First, the exploitation of DLP can be converted into the exploitation of ILP. Software pipelining is an effective compiler scheduling method for ILP: the compiler unrolls a loop and reschedules instructions from different iterations that have no data dependences, forming a new loop body, thereby increasing the number of instructions the processor can execute in parallel and resolving data dependences. In programs with abundant DLP, the dependences between instructions are mainly the data dependences between memory accesses and computation, which software pipelining can dissolve with comparative ease. Software pipelining can be combined with superscalar or VLIW techniques to exploit ILP.
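The conversion of DLP into ILP by loop unrolling can be sketched in a few lines. The following is an illustrative model only, not the patent's compiler: splitting one serial accumulator into four breaks the add-to-add dependence chain, so a machine with four functional units could execute the four partial sums in parallel. All names are hypothetical.

```python
def serial_sum(xs):
    total = 0
    for x in xs:          # every add depends on the previous one
        total += x
    return total

def unrolled_sum(xs, ways=4):
    # one accumulator per notional functional unit; the adds in a
    # single unrolled iteration are mutually independent
    accs = [0] * ways
    body = len(xs) - len(xs) % ways
    for i in range(0, body, ways):
        for k in range(ways):
            accs[k] += xs[i + k]
    tail = sum(xs[body:])      # leftover iterations
    return sum(accs) + tail    # reduce the partial sums

data = list(range(1, 101))
assert unrolled_sum(data) == serial_sum(data) == 5050
```

The rewritten loop computes the same result; what changes is only how many of its operations are independent of one another, which is exactly what a VLIW compiler schedules onto parallel functional units.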
Vector processing is an effective way to raise a processor's bulk-data throughput. It exploits temporal overlap to convert data parallelism into instruction parallelism: by vectorizing loop statements that apply the same operation, it not only reduces the program's code size but also hides the control dependences between loop iterations inside the vector instructions, improving hardware execution efficiency. Vector chaining, built on vector processing, further reduces the storage required for intermediate results and eases the allocation pressure on vector registers.
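The benefit of vector chaining mentioned above can be illustrated with a small sketch. This is not the patent's hardware; generators merely stand in for a chained vector pipeline in which each element of one vector operation's result feeds the next operation directly, so the full intermediate vector is never materialized.

```python
def vmul(xs, ys):
    # "vector multiply" producing its result elements lazily
    return (x * y for x, y in zip(xs, ys))

def vadd(xs, ys):
    # "vector add", chained onto the previous operation's output
    return (x + y for x, y in zip(xs, ys))

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
c = [10, 10, 10, 10]

# d = a*b + c without ever storing the full a*b intermediate
d = list(vadd(vmul(a, b), c))
assert d == [15, 22, 31, 42]
```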
Single-instruction-stream multiple-data-stream (Single Instruction, Multiple Data, abbreviated SIMD) execution is a resource-replication technique that exploits DLP by providing multiple parallel processing units, or by splitting one processing unit into several data paths. The control signals of a single instruction drive multiple arithmetic units simultaneously, while the data processed come from multiple data streams. Intel's IA-32 and IA-64 instruction sets both provide SIMD instruction extensions, which improve the performance of numerical applications.
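The SIMD principle just described, one instruction's control signals driving several lanes that each read their own data stream, can be modeled minimally as follows. The lane model is an illustrative assumption, not any real instruction set.

```python
def simd_execute(opcode, lanes):
    """Apply one decoded instruction to every lane's private operands."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    op = ops[opcode]                       # decoded once, shared by all lanes
    return [op(a, b) for (a, b) in lanes]  # each lane has its own data

# four lanes, one instruction stream
lanes = [(1, 10), (2, 20), (3, 30), (4, 40)]
assert simd_execute("add", lanes) == [11, 22, 33, 44]
assert simd_execute("mul", lanes) == [10, 40, 90, 160]
```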
As the above shows, current parallelism exploitation for compute-intensive applications such as multimedia and scientific computing targets either ILP or DLP in isolation. As application scale and complexity keep growing, parallelization with these methods alone becomes harder and the performance gains obtained keep shrinking. Supporting the joint exploitation of DLP and ILP in the hardware architecture is a new approach to this class of problems.
Summary of the invention
The technical problem to be solved by the present invention is: in view of the shortcomings of the prior art, to provide a computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution that combines the VLIW, SIMD, and software pipelining techniques, supports the simultaneous exploitation of data-level and instruction-level parallelism, and thereby further improves, along several avenues, the execution performance of compute-intensive application programs on the processor.
To solve the above technical problem, the solution proposed by the present invention is a computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution, characterized in that: it comprises a main controller, a data buffer, a SIMD computing cluster, and a microcontroller; the data buffer and the microcontroller are each connected to the main controller, and the data buffer and the microcontroller are connected to each other through the SIMD computing cluster. The main controller is responsible for preparing instructions and data: it loads the instructions to be executed by the SIMD computing cluster into the storage unit of the microcontroller, controls the starting and pausing of the SIMD computing cluster, loads the required source operands into the data buffer, and receives the final results from the data buffer. The data buffer receives the data transferred from the main controller and stores it in a specific organization, supplies the source operands required by the SIMD computing cluster, receives the cluster's output when computation finishes, and outputs the final results to the main controller. The microcontroller receives the very long instruction word sequence supplied by the main controller, decodes it, and by broadcast dispatches and maps each operation onto the corresponding processing element F of every processing unit of the SIMD computing cluster for parallel execution. The SIMD computing cluster is a set of parallel processing units organized in SIMD fashion; every processing unit has the same structure and contains multiple processing elements; the instruction sequence executed comes entirely from the microcontroller, but the data operated on come from different locations in the data buffer.
The SIMD computing cluster is a group of structurally identical processing units PE that simultaneously execute, in SIMD fashion, the same instruction or instruction sequence from the microcontroller. Each PE contains multiple arithmetic/logic processing elements Fn and local register files; the processing elements Fn support very long instruction word execution and can process several operations of different types in parallel, and each processing element Fn has its own local register file, which directly supplies operands to the element and stores its results.
Compared with the prior art: in compute-intensive application programs such as multimedia and scientific computing, the volume of data to be processed is large, and data of the same type or structure often undergo the same operation or chain of operations. Deploying the hardware structure proposed by the invention in microprocessors for such applications therefore brings the following advantages:
(1) The computing cluster structure fusing VLIW and SIMD proposed by the invention keeps the intermediate results of program computation in local registers inside each PE and in the data buffer, avoiding wasted memory bandwidth; by combining resource replication with compiler scheduling it exploits DLP and ILP simultaneously, improving the execution efficiency of application programs on the processor.
(2) DLP and ILP are exploited at the same time. Because VLIW programs execute in SIMD fashion, the ILP and the DLP in a program are exploited jointly, greatly raising the processor's throughput on this class of applications.
(3) Hardware efficiency is high. Multiple PEs share one set of instruction control logic for fetch, decode, dispatch, and mapping, while their operands come from different data streams; this SIMD execution style uses the control path of a single instruction to achieve the data throughput that would otherwise require a multiprocessor system.
(4) Hardware implementation complexity is low. Because the compiler can determine the latency of every instruction, the work of parallelism exploitation is done entirely by the compiler, avoiding complex hardware dependence-detection logic and pipeline interlock logic.
(5) The memory-bandwidth bottleneck is relieved. Because the local registers inside the PEs are used, the intermediate results produced by instruction operations need not occupy the external data buffer, easing the bandwidth pressure on external storage and speeding up operand reads.
In summary, the proposed hardware structure combines the advantages of VLIW and SIMD in exploiting program parallelism. It is well suited to processors for compute-intensive applications, but is not limited to them; other processors that need to exploit several kinds of parallelism simultaneously can also adopt it.
Description of drawings
Fig. 1 is a schematic diagram of the overall architecture of the present invention;
Fig. 2 is a schematic diagram of the computing cluster structure fusing VLIW and SIMD in the present invention;
Fig. 3 is a schematic flowchart of instruction processing in the present invention.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, the computing cluster structure of the present invention, fusing very long instruction word and single-instruction-stream multiple-data-stream execution, comprises a main controller, a data buffer, a SIMD computing cluster, and a microcontroller. The main controller is responsible for preparing instructions and data: it loads the instructions to be executed by the SIMD computing cluster into the storage unit of the microcontroller, controls the starting and pausing of the SIMD computing cluster, loads the required source operands into the data buffer, and receives the final results from the data buffer. The data buffer receives the data transferred from the main controller and stores it in a specific organization, supplies the source operands required by the SIMD computing cluster, receives the cluster's output when computation finishes, and outputs the final results to the main controller. The microcontroller receives the very long instruction word sequence supplied by the main controller, decodes it, and by broadcast dispatches and maps each operation onto the corresponding processing element F of every processing unit (Processing Element, abbreviated PE) of the SIMD computing cluster for parallel execution. The SIMD computing cluster is a set of parallel processing units organized in SIMD fashion; every processing unit has the same structure and contains multiple processing elements; the instruction sequence executed comes entirely from the microcontroller, but the data operated on come from different locations in the data buffer.
Fig. 2 shows the computing cluster structure fusing VLIW and SIMD. The SIMD computing cluster is a group of structurally identical processing units PE (PE0, PE1, ..., PEN). All PEs have the same structure and simultaneously execute, in SIMD fashion, the same instruction or instruction sequence from the microcontroller. Each PE contains multiple arithmetic/logic processing elements (F1, F2, ..., Fn) and local register files. The processing elements F1, F2, ..., Fn support VLIW execution and can process several operations of different types in parallel (such as addition, multiplication, multiply-add, and logic operations). Each processing element has its own local register file, which directly supplies operands to the element and stores its results. Operands are first read from the data buffer into the local registers; because the main controller has already loaded the operands into the data buffer before starting the microcontroller's decode, dispatch, and mapping, the latency of moving an operand from the data buffer into a PE's local registers is fixed. The local register files inside a processing unit are interconnected by a network and can exchange the temporary data generated during computation; each processing element reads its operands directly from its own local registers, so a processing element never suffers an unpredictable stall caused by operands not being ready. Consequently, the latency of any operation of a very long instruction word is identical on all PEs. It follows that the compiler can fully exploit ILP, generate the very long instruction word sequence, and use software pipelining to restructure the loops in the program and dissolve the data dependences between very long instruction words, without any hardware intervention. Interconnection networks also exist between the PEs, allowing the necessary synchronization and data exchange.
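The broadcast execution just described can be captured in a toy software model. This is an assumption-laden sketch, not the patented hardware: the microcontroller broadcasts one VLIW word to all PEs, and each PE applies the word's operation slots to its own local register file, so all PEs start and finish a word together.

```python
class PE:
    def __init__(self, operands):
        self.regs = dict(operands)        # local register file

    def execute(self, vliw_word):
        # each (dst, fn, srcs) slot models one operation mapped onto one Fi
        for dst, fn, srcs in vliw_word:
            self.regs[dst] = fn(*(self.regs[s] for s in srcs))

def broadcast(vliw_word, pes):
    for pe in pes:                        # same word, different local data
        pe.execute(vliw_word)

# two PEs with different operands, one shared instruction stream
pes = [PE({"a": 2, "b": 3}), PE({"a": 10, "b": 4})]
word = [("c", lambda x, y: x + y, ("a", "b")),   # slot for F1: c = a + b
        ("d", lambda x, y: x * y, ("a", "b"))]   # slot for F2: d = a * b
broadcast(word, pes)
assert [pe.regs["c"] for pe in pes] == [5, 14]
assert [pe.regs["d"] for pe in pes] == [6, 40]
```

Note that results land only in each PE's own `regs`, mirroring the claim that intermediates stay in PE-local registers rather than traveling through the external buffer.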
Take the following pseudo-code program as an example:
// data preparation
load(data1[m], mem1);
load(data2[m], mem2);
// data processing
for (i from 1 to m) {
    func(data1[i], data2[i], data3[i], data4[i]);
    func(data3[i], data4[i], data5[i], data6[i]);
}
// write the data back to external memory or a network interface
send(data5[m], mem3);
send(data6[m], chan0);
// definition of the data-processing function: inputs in_a and in_b, outputs out_c and out_d
func(in_a, in_b, out_c, out_d) {
    // first very long instruction word I1
    OP_11    // executed on element F1
    OP_12    // executed on element F2
    ...
    OP_1n;;  // executed on element Fn; ";;" marks the very long instruction word boundary
    // second very long instruction word I2
    OP_21
    OP_22
    ...
    OP_2n;;
}
This program reads a series of same-structure data (data1, data2), applies a sequence of operations to them in order (the two func functions), and produces a new series of data (data5, data6); its body is a for loop. The second func in the loop has a read-after-write data dependence on the first func. With traditional instruction-processing techniques, execution efficiency, that is, parallelism, is limited by this dependence; moreover, a large amount of temporary data (data3, data4) is produced between the two calls, and if register resources are insufficient it must be swapped out to memory and read back into registers when needed, wasting memory bandwidth. The hardware structure proposed by the present invention effectively avoids these bottlenecks. The program executes on the present invention as follows:
1. Execute the load instructions. The main controller controls the data flow and starts the data buffer, which fetches the data sequences data1 and data2 required by the program from external memory and stores them in the data buffer at fixed index addresses.
2. The main controller sends the very long instruction word sequence of the program (the first func function) to the instruction cache in the microcontroller.
3. The microcontroller decodes the very long instruction words I1, I2, and so on. Each cycle it translates one word into n micro-operations (OP_i1, OP_i2, ..., OP_in), which are dispatched in parallel onto the n processing elements of the N PEs:
(a) data-load operations move the data (in_a, in_b) from the data buffer into the local registers of the PEs;
(b) data-writeback operations move the data (out_c, out_d) from the local registers of the PEs back to the data buffer;
(c) arithmetic/logic operations take their source operands from, and write their results back to, their own local registers;
(d) data can move between local registers through the interconnection network inside a PE, and across PEs through the interconnection network between PEs.
4. Execute the second func function, repeating steps 2 and 3.
5. Execute the send instructions. The main controller controls the data flow and starts the data buffer, which writes the data sequences data5 and data6 back to external memory or to the designated network port.
Every micro-operation has a fixed, predictable execution latency, and all PEs start and finish at the same time: the PEs run in SIMD fashion with respect to one another, while the parallelism inside each PE uses the VLIW style.
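The worked example above can be sketched in executable form under illustrative assumptions: each PE conceptually handles one loop index i, and the first func's outputs (data3, data4) stay in PE-local storage and feed the second func directly, so the read-after-write intermediates never travel through the external buffer. The body of `func` here is a hypothetical stand-in; any two-input, two-output body fits the pattern.

```python
def func(in_a, in_b):
    # hypothetical stand-in for the example's func
    return in_a + in_b, in_a * in_b

def run_on_cluster(data1, data2):
    data5, data6 = [], []
    for a, b in zip(data1, data2):   # conceptually, one PE per index
        d3, d4 = func(a, b)          # intermediates stay PE-local
        c, d = func(d3, d4)          # RAW dependence resolved locally
        data5.append(c)
        data6.append(d)
    return data5, data6

d5, d6 = run_on_cluster([1, 2], [3, 4])
# func(1,3)->(4,3); func(4,3)->(7,12); func(2,4)->(6,8); func(6,8)->(14,48)
assert (d5, d6) == ([7, 14], [12, 48])
```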
Fig. 3 shows the instruction-processing flow of the present invention. After an instruction is decoded by the main controller, it is judged to be either a data-preparation instruction or a data-processing instruction. A data-preparation instruction is sent to the data buffer for execution, where it is judged to be either a write or a read. A read fetches a block of data from external memory according to the data address and length given in the instruction and stores it in the data buffer at fixed index addresses; a write outputs the designated data block in the data buffer to the designated destination (external memory or the network port). For a data-processing instruction, the corresponding very long instruction word sequence is fetched from external memory into the microcontroller, which decodes, dispatches, and executes the instructions one by one. The micro-operations translated from each very long instruction word are dispatched simultaneously to the N PEs and mapped onto the corresponding processing elements F; the required data come from the data buffer, and the PEs execute in SIMD fashion. When one very long instruction word finishes, the microcontroller starts decoding and dispatching the next; when one instruction sequence finishes, the main controller starts reading in and processing the next.
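The decision flow of Fig. 3 can be summarized as a small dispatcher. The instruction records here are hypothetical; the sketch only shows the routing logic: data-preparation instructions go to the buffer (read versus write), everything else goes to the microcontroller.

```python
def route(instr, buffer, external_mem):
    kind = instr["kind"]
    if kind == "read":
        # data preparation: load a block from external memory into the buffer
        addr, n = instr["addr"], instr["len"]
        buffer[instr["index"]] = external_mem[addr:addr + n]
        return "buffer-read"
    if kind == "write":
        # data preparation: write the designated block back out
        external_mem.extend(buffer.pop(instr["index"]))
        return "buffer-write"
    # data processing: hand the VLIW sequence to the microcontroller
    return "microcontroller"

mem = [10, 20, 30, 40]
buf = {}
assert route({"kind": "read", "addr": 1, "len": 2, "index": 0}, buf, mem) == "buffer-read"
assert buf[0] == [20, 30]
assert route({"kind": "vliw"}, buf, mem) == "microcontroller"
```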

Claims (2)

1. A computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution, characterized in that: it comprises a main controller, a data buffer, a SIMD computing cluster, and a microcontroller; the data buffer and the microcontroller are each connected to the main controller, and the data buffer and the microcontroller are connected to each other through the SIMD computing cluster; the main controller is responsible for preparing instructions and data, loads the instructions to be executed by the SIMD computing cluster into the storage unit of the microcontroller, controls the starting and pausing of the SIMD computing cluster, loads the required source operands into the data buffer, and receives the final results from the data buffer; the data buffer receives the data transferred from the main controller, stores it in a specific organization, supplies the source operands required by the SIMD computing cluster, receives the cluster's output when computation finishes, and outputs the final results to the main controller; the microcontroller receives the very long instruction word sequence supplied by the main controller, decodes it, and by broadcast dispatches and maps each operation onto the corresponding processing elements of every processing unit of the SIMD computing cluster for parallel execution; the SIMD computing cluster is a set of structurally identical parallel processing units, each configured with multiple processing elements; the instruction sequence executed comes entirely from the microcontroller, but the data operated on come from different locations in the data buffer.
2. The computing cluster structure fusing very long instruction word and single-instruction-stream multiple-data-stream execution according to claim 1, characterized in that: the SIMD computing cluster is a group of structurally identical processing units PE that simultaneously execute, in SIMD fashion, the same instruction or instruction sequence from the microcontroller; each PE comprises multiple arithmetic/logic processing elements Fn and local register files; the processing elements Fn support very long instruction word execution and can process several operations of different types in parallel; each processing element Fn has its own local register file, which directly supplies operands to the element and stores its results.
CNB2007100345670A 2007-03-19 2007-03-19 Computing group structure for superlong instruction word and instruction flow multidata stream fusion Expired - Fee Related CN100456230C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100345670A CN100456230C (en) 2007-03-19 2007-03-19 Computing group structure for superlong instruction word and instruction flow multidata stream fusion


Publications (2)

Publication Number Publication Date
CN101021778A true CN101021778A (en) 2007-08-22
CN100456230C CN100456230C (en) 2009-01-28

Family

ID=38709553

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100345670A Expired - Fee Related CN100456230C (en) 2007-03-19 2007-03-19 Computing group structure for superlong instruction word and instruction flow multidata stream fusion

Country Status (1)

Country Link
CN (1) CN100456230C (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452394B (en) * 2007-11-28 2012-05-23 无锡江南计算技术研究所 Compiling method and compiler
CN102970049A (en) * 2012-10-26 2013-03-13 北京邮电大学 Parallel circuit based on chien search algorithm and forney algorithm and RS decoding circuit
WO2015173674A1 (en) * 2014-05-12 2015-11-19 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9672043B2 (en) 2014-05-12 2017-06-06 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US9720696B2 (en) 2014-09-30 2017-08-01 International Business Machines Corporation Independent mapping of threads
US9740486B2 (en) 2014-09-09 2017-08-22 International Business Machines Corporation Register files for storing data operated on by instructions of multiple widths
US9934033B2 (en) 2016-06-13 2018-04-03 International Business Machines Corporation Operation of a multi-slice processor implementing simultaneous two-target loads and stores
US9971602B2 (en) 2015-01-12 2018-05-15 International Business Machines Corporation Reconfigurable processing method with modes controlling the partitioning of clusters and cache slices
US9983875B2 (en) 2016-03-04 2018-05-29 International Business Machines Corporation Operation of a multi-slice processor preventing early dependent instruction wakeup
US10037211B2 (en) 2016-03-22 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor with an expanded merge fetching queue
US10037229B2 (en) 2016-05-11 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US10042647B2 (en) 2016-06-27 2018-08-07 International Business Machines Corporation Managing a divided load reorder queue
US10133576B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US10133581B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Linkable issue queue parallel execution slice for a processor
WO2019104638A1 (en) * 2017-11-30 2019-06-06 深圳市大疆创新科技有限公司 Neural network processing method and apparatus, accelerator, system, and mobile device
US10318419B2 (en) 2016-08-08 2019-06-11 International Business Machines Corporation Flush avoidance in a load store unit
US10346174B2 (en) 2016-03-24 2019-07-09 International Business Machines Corporation Operation of a multi-slice processor with dynamic canceling of partial loads
US10761854B2 (en) 2016-04-19 2020-09-01 International Business Machines Corporation Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor
CN113032013A (en) * 2021-01-29 2021-06-25 成都商汤科技有限公司 Data transmission method, chip, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6718457B2 (en) * 1998-12-03 2004-04-06 Sun Microsystems, Inc. Multiple-thread processor for threaded software applications
JP2001175618A (en) * 1999-12-17 2001-06-29 Nec Eng Ltd Parallel computer system
WO2005036384A2 (en) * 2003-10-14 2005-04-21 Koninklijke Philips Electronics N.V. Instruction encoding for vliw processors
US7949856B2 (en) * 2004-03-31 2011-05-24 Icera Inc. Method and apparatus for separate control processing and data path processing in a dual path processor with a shared load/store unit
US9047094B2 (en) * 2004-03-31 2015-06-02 Icera Inc. Apparatus and method for separate asymmetric control processing and data path processing in a dual path processor
CN100357932C (en) * 2006-06-05 2007-12-26 中国人民解放军国防科学技术大学 Method for decreasing data access delay in stream processor

Cited By (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452394B (en) * 2007-11-28 2012-05-23 无锡江南计算技术研究所 Compiling method and compiler
CN102970049A (en) * 2012-10-26 2013-03-13 北京邮电大学 Parallel circuit based on chien search algorithm and forney algorithm and RS decoding circuit
CN102970049B (en) * 2012-10-26 2016-01-20 北京邮电大学 Based on parallel circuit and the RS decoding circuit of money searching algorithm and Fu Ni algorithm
WO2015173674A1 (en) * 2014-05-12 2015-11-19 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9665372B2 (en) 2014-05-12 2017-05-30 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9672043B2 (en) 2014-05-12 2017-06-06 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US9690585B2 (en) 2014-05-12 2017-06-27 International Business Machines Corporation Parallel slice processor with dynamic instruction stream mapping
US9690586B2 (en) 2014-05-12 2017-06-27 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US10157064B2 (en) 2014-05-12 2018-12-18 International Business Machines Corporation Processing of multiple instruction streams in a parallel slice processor
US9740486B2 (en) 2014-09-09 2017-08-22 International Business Machines Corporation Register files for storing data operated on by instructions of multiple widths
US9760375B2 (en) 2014-09-09 2017-09-12 International Business Machines Corporation Register files for storing data operated on by instructions of multiple widths
US11144323B2 (en) 2014-09-30 2021-10-12 International Business Machines Corporation Independent mapping of threads
US9870229B2 (en) 2014-09-30 2018-01-16 International Business Machines Corporation Independent mapping of threads
US10545762B2 (en) 2014-09-30 2020-01-28 International Business Machines Corporation Independent mapping of threads
US9720696B2 (en) 2014-09-30 2017-08-01 International Business Machines Corporation Independent mapping of threads
US9977678B2 (en) 2015-01-12 2018-05-22 International Business Machines Corporation Reconfigurable parallel execution and load-store slice processor
US10983800B2 (en) 2015-01-12 2021-04-20 International Business Machines Corporation Reconfigurable processor with load-store slices supporting reorder and controlling access to cache slices
US10083039B2 (en) 2015-01-12 2018-09-25 International Business Machines Corporation Reconfigurable processor with load-store slices supporting reorder and controlling access to cache slices
US9971602B2 (en) 2015-01-12 2018-05-15 International Business Machines Corporation Reconfigurable processing method with modes controlling the partitioning of clusters and cache slices
US11734010B2 (en) 2015-01-13 2023-08-22 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US10133576B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US10133581B2 (en) 2015-01-13 2018-11-20 International Business Machines Corporation Linkable issue queue parallel execution slice for a processor
US10223125B2 (en) 2015-01-13 2019-03-05 International Business Machines Corporation Linkable issue queue parallel execution slice processing method
US11150907B2 (en) 2015-01-13 2021-10-19 International Business Machines Corporation Parallel slice processor having a recirculating load-store queue for fast deallocation of issue queue entries
US9983875B2 (en) 2016-03-04 2018-05-29 International Business Machines Corporation Operation of a multi-slice processor preventing early dependent instruction wakeup
US10037211B2 (en) 2016-03-22 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor with an expanded merge fetching queue
US10564978B2 (en) 2016-03-22 2020-02-18 International Business Machines Corporation Operation of a multi-slice processor with an expanded merge fetching queue
US10346174B2 (en) 2016-03-24 2019-07-09 International Business Machines Corporation Operation of a multi-slice processor with dynamic canceling of partial loads
US10761854B2 (en) 2016-04-19 2020-09-01 International Business Machines Corporation Preventing hazard flushes in an instruction sequencing unit of a multi-slice processor
US10268518B2 (en) 2016-05-11 2019-04-23 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US10255107B2 (en) 2016-05-11 2019-04-09 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US10042770B2 (en) 2016-05-11 2018-08-07 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US10037229B2 (en) 2016-05-11 2018-07-31 International Business Machines Corporation Operation of a multi-slice processor implementing a load/store unit maintaining rejected instructions
US9940133B2 (en) 2016-06-13 2018-04-10 International Business Machines Corporation Operation of a multi-slice processor implementing simultaneous two-target loads and stores
US9934033B2 (en) 2016-06-13 2018-04-03 International Business Machines Corporation Operation of a multi-slice processor implementing simultaneous two-target loads and stores
US10042647B2 (en) 2016-06-27 2018-08-07 International Business Machines Corporation Managing a divided load reorder queue
US10318419B2 (en) 2016-08-08 2019-06-11 International Business Machines Corporation Flush avoidance in a load store unit
WO2019104638A1 (en) * 2017-11-30 2019-06-06 深圳市大疆创新科技有限公司 Neural network processing method and apparatus, accelerator, system, and mobile device
CN113032013A (en) * 2021-01-29 2021-06-25 成都商汤科技有限公司 Data transmission method, chip, equipment and storage medium
CN113032013B (en) * 2021-01-29 2023-03-28 成都商汤科技有限公司 Data transmission method, chip, equipment and storage medium

Also Published As

Publication number Publication date
CN100456230C (en) 2009-01-28

Similar Documents

Publication Publication Date Title
CN100456230C (en) Computing group structure for superlong instruction word and instruction flow multidata stream fusion
CN108268278B (en) Processor, method and system with configurable spatial accelerator
EP3726389B1 (en) Apparatuses, methods, and systems for memory interface circuit allocation in a configurable spatial accelerator
US10515046B2 (en) Processors, methods, and systems with a configurable spatial accelerator
JP6525286B2 (en) Processor core and processor system
Udupa et al. Software pipelined execution of stream programs on GPUs
US11029958B1 (en) Apparatuses, methods, and systems for configurable operand size operations in an operation configurable spatial accelerator
US20220100680A1 (en) Apparatuses, methods, and systems for a configurable accelerator having dataflow execution circuits
JP2018519602A (en) Block-based architecture with parallel execution of continuous blocks
CN100489830C (en) 64-bit stream processor chip architecture oriented to scientific computing
Karim et al. A multilevel computing architecture for embedded multimedia applications
Lisper Towards parallel programming models for predictability
US11907713B2 (en) Apparatuses, methods, and systems for fused operations using sign modification in a processing element of a configurable spatial accelerator
Tan et al. Optimizing the LINPACK algorithm for large-scale PCIe-based CPU-GPU heterogeneous systems
US20230367604A1 (en) Method of interleaved processing on a general-purpose computing core
Krashinsky Vector-thread architecture and implementation
She et al. OpenCL code generation for low energy wide SIMD architectures with explicit datapath
Sandokji et al. Task scheduling frameworks for heterogeneous computing toward exascale
CN104615496B (en) Parallel expansion method for a reconfigurable structure based on a multi-level heterogeneous architecture
Dey et al. Embedded support vector machine: Architectural enhancements and evaluation
Evripidou et al. Data-flow vs control-flow for extreme level computing
Rutzig Multicore platforms: Processors, communication and memories
Luo et al. HAD: A Prototype Of Dataflow Compute Architecture
Margerm Leveraging Dynamic Task Parallelism in Hardware Accelerators
Berkovich et al. XMT-M: A scalable decentralized processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090128

Termination date: 20110319