CN101021779A - Instruction control method aimed at stream processor - Google Patents


Info

Publication number
CN101021779A
Authority
CN
China
Prior art keywords
instruction
stream
level
program
level program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200710034568
Other languages
Chinese (zh)
Other versions
CN100461094C (en)
Inventor
张民选
邢座程
蒋江
杨学军
齐树波
阳柳
曾献君
马驰远
李勇
陈海燕
高军
李晋文
衣晓飞
张明
穆长富
倪晓强
唐遇星
张承义
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CNB2007100345685A (patent CN100461094C)
Publication of CN101021779A
Application granted
Publication of CN100461094C
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Advance Control (AREA)

Abstract

This invention discloses an instruction control method for a stream processor. The method divides instruction control into a stream-level program and a kernel-level program: the stream-level program schedules data between the compute core and off-chip memory, and the kernel-level program performs the computation on the data. The steps are: (1) initialize the scalar data of the kernel-level program; (2) prepare vector data for the kernel-level program; (3) start the kernel-level program; (4) execute the kernel-level program; (5) store the vector data generated by the kernel-level program to off-chip memory; (6) read the scalar results of the kernel-level program.

Description

An instruction control method for a stream processor
Technical field
The present invention relates to instruction control methods in the field of microprocessor design, and in particular to an instruction control method for stream processors that target compute-intensive, data-parallel workloads.
Background
As the range of computer applications keeps expanding, one typical class of applications, stream applications, is becoming a primary workload for microprocessors. A stream is an uninterrupted, continuous, moving queue of data. Stream applications have the following characteristics. They are compute-intensive: a large number of arithmetic operations is performed on every stream element that is fetched. Their parallelism is mainly data-level: dependencies among the elements of a stream are weak and operations on different elements are relatively independent, so there is abundant data parallelism and good opportunity for latency hiding. They exhibit locality, including data-reuse locality and producer-consumer locality.
Instruction control in a general-purpose microprocessor typically uses a basic five-stage pipeline: instruction fetch, instruction decode/register read, execute/effective address calculation, memory access, and write-back. To exploit instruction-level parallelism, dynamic scheduling techniques such as scoreboarding, dynamic branch prediction, register renaming, and the Tomasulo algorithm are used to reduce the stalls caused by data and control hazards, and a three-level storage hierarchy (LRF, cache, off-chip DRAM) is used to reduce the average memory access time, thereby lowering CPI. Instruction-level parallelism can also be improved by static scheduling with VLIW (very long instruction word). VLIW execution control is relatively simple and needs no dynamic scheduling in hardware, which reduces hardware complexity, but it places high demands on the compiler, whose quality determines how much instruction-level parallelism is exploited. All of these instruction control methods tightly couple memory access with the computation performed on the data.
Computing units occupy only a small proportion of a general-purpose microprocessor. In the Itanium 2 processor, for example, the 12 integer units and 2 floating-point units together with their register files account for only 6% of the chip area, which limits such a structure for compute-intensive stream applications. In stream applications, which are highly data-parallel and compute-intensive, data dependencies are weak and computation is easy to separate from memory access, so general-purpose instruction control methods cannot achieve high performance on them.
Summary of the invention
The technical problem to be solved by the present invention is the following: in view of the problems of the prior art, the invention provides an instruction control method for stream processors that uses two levels of instruction control to separate computation on data from memory access, thereby achieving higher computational performance and higher memory access bandwidth while effectively reducing the bandwidth demanded of off-chip memory.
To solve this problem, the invention proposes an instruction control method for a stream processor in which instruction control is divided into a stream-level program and a kernel-level program. The stream-level program schedules data between the compute core and off-chip memory, and the kernel-level program performs the computation on the data. The specific steps are:
(1) Initialize the scalar data of the kernel-level program: using control-word transfer instructions, the stream-level program initializes the scalar data that the kernel-level program needs during execution into the microcontroller registers. When the kernel-level program executes, communication instructions broadcast this data from the microcontroller registers to the compute clusters. If the kernel-level program needs no scalar initialization, this step can be omitted.
(2) Prepare vector data for the kernel-level program: stream transfer instructions load the vector data to be processed by the compute core, or one part of that data when double buffering is used, from off-chip memory into on-chip memory. Steps (1) and (2) can be performed in parallel.
(3) Start the kernel-level program: after the two preceding steps have finished, start execution of the kernel-level program.
(4) Execute the kernel-level program: if the kernel-level program needs no synchronous communication with the stream-level program and all the data it processes fits in on-chip memory, the stream-level program repeatedly polls whether the kernel-level program has finished while it runs. If the kernel-level program does need synchronous communication with the stream-level program, both programs continue only once the kernel-level program and the stream-level program have each reached the synchronization point. If the stream level uses double buffering to supply data, the stream-level program loads the next part of the data from off-chip memory into on-chip memory while the kernel-level program processes the current part.
(5) Store the vector data generated by the kernel-level program to off-chip memory: after waiting for the kernel-level program to finish, the stream-level program stores the vector data it generated to off-chip memory. If double buffering is used for the generated vector data, the stream-level program can already store parts of it to off-chip memory during step (4), and this step stores only the last part. If the vector data generated by the previous kernel-level program is an intermediate result that the next kernel-level program will use soon and that fits entirely in on-chip memory, this step can be omitted.
(6) Read the scalar results of the kernel-level program: after waiting for the kernel-level program to finish, the stream-level program reads the scalar results from the microcontroller registers using control-word transfer instructions. If the kernel-level program produces no scalar result, this step can be omitted. This step can be performed in parallel with step (5).
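The six steps, including the conditions under which steps (1), (5), and (6) are skipped, can be sketched as a small Python model. This is only an illustration: the `Kernel` class, its field names, and the action labels are hypothetical and do not come from the patent, and the log is sequential even though steps (1)/(2) and (5)/(6) may overlap in the real design.

```python
from dataclasses import dataclass, field


@dataclass
class Kernel:
    """Hypothetical description of one kernel-level program."""
    name: str
    scalar_inputs: dict = field(default_factory=dict)  # step 1 is skipped when empty
    produces_scalar: bool = False                      # step 6 is skipped when False
    output_is_intermediate: bool = False               # step 5 is skipped when True


def run_kernel(kernel):
    """Return the ordered stream-level actions taken for one kernel-level program."""
    log = []
    if kernel.scalar_inputs:
        log.append("1:init_scalars")    # control-word transfer into microcontroller registers
    log.append("2:load_vectors")        # stream transfer, off-chip -> on-chip
    log.append("3:start_kernel")        # only after steps 1 and 2 have finished
    log.append("4:kernel_running")      # stream level polls for completion
    if not kernel.output_is_intermediate:
        log.append("5:store_vectors")   # stream transfer, on-chip -> off-chip
    if kernel.produces_scalar:
        log.append("6:read_scalars")    # control-word transfer out of microcontroller registers
    return log


print(run_kernel(Kernel("fir", {"taps": 8}, produces_scalar=True)))
print(run_kernel(Kernel("stage1", output_is_intermediate=True)))
```

The second call models a kernel whose output stays on chip for the next kernel, so only steps (2) to (4) remain.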
The control of stream-level instructions is divided into three steps:
(1) Logical slot number assignment and dependency generation: the stream-level compiler replaces stream function calls with stream operations. One stream function may be converted into one or more stream operations, and each stream operation is eventually converted into one stream instruction. The compiler then analyzes the dependencies between stream operations in two steps: assigning a logical slot number to each stream operation and generating the dependency masks.
(2) Dynamic generation and sending of stream-level instructions: stream-level instructions can only be generated dynamically, according to how the program executes. A software module running on the host processor is responsible for generating and sending them. This module may send a stream operation to the stream controller only when two conditions hold. First, the logical slot number of the stream operation to be sent must be available: the software module continually queries which logical slot numbers are in use in the stream controller, and the operation can be sent only when the slot number assigned to it is not held by any instruction in the stream controller's instruction queue. Second, if stream operation A depends on stream operation B, and B is sent repeatedly as part of a double buffer, A can be sent only after B has completed its last send.
(3) Issue and execution of stream-level instructions: the stream controller has an M-entry instruction queue that can hold M stream-level instructions at once, where M is generally greater than N, the number of logical slots. Issue and execution proceed in three steps. First, the stream-level instruction enters the instruction queue; on entry, its completion dependencies and issue dependencies are updated according to the stream-level instructions already in the queue. Second, the stream-level instruction is issued from the instruction queue to a functional unit for execution: among the instructions whose issue dependencies and completion dependencies are satisfied and whose required resources are idle, the stream controller dynamically selects the instruction i that entered the queue earliest, issues it to the relevant functional unit, and updates the dependency masks of the queued instructions that have an issue dependency on instruction i. Third, the stream-level instruction completes and leaves the instruction queue: when the resources an instruction needs become idle again, the instruction has finished executing and can retire from the queue. Because several stream-level instructions may finish in the same clock cycle, the stream controller retires the instruction k that entered the queue earliest and updates the dependency masks of the queued instructions that have a completion dependency on instruction k.
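The two sending conditions in step (2) can be illustrated with a minimal Python sketch. It assumes plain cyclic slot assignment in program order; the function names, parameters, and example operation names are hypothetical, not taken from the patent.

```python
def assign_slots(stream_ops, num_slots):
    """Assign logical slot numbers in program order, reusing slots cyclically (assumed policy)."""
    return {op: i % num_slots for i, op in enumerate(stream_ops)}


def can_send(op, slots, queue_slots, multi_sent_deps=(), finished_ops=()):
    """Check the two conditions for sending a stream operation to the stream controller.

    queue_slots     -- logical slot numbers held by instructions still in the queue
    multi_sent_deps -- operations that `op` depends on and that are sent repeatedly
                       as part of a double buffer
    finished_ops    -- those repeatedly sent operations whose last send has completed
    """
    slot_free = slots[op] not in queue_slots                         # condition 1
    deps_done = all(dep in finished_ops for dep in multi_sent_deps)  # condition 2
    return slot_free and deps_done


ops = ["load_a", "load_b", "clustop", "store_c", "load_d"]
slots = assign_slots(ops, num_slots=4)
print(slots)                                          # load_d reuses slot 0
print(can_send("load_d", slots, queue_slots={0, 1}))  # False: slot 0 still held by load_a
print(can_send("load_d", slots, queue_slots={1, 2}))  # True: slot 0 has been released
```

With four slots the fifth operation wraps around to slot 0, so it cannot be sent until the first operation has left the instruction queue.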
The stream-level program can communicate with the kernel-level program by transferring scalar data with control-word transfer instructions. Scalar data that the kernel-level program needs during execution is written into the microcontroller register file by control-word transfer instructions, and the same kind of stream-level instruction reads the scalar results of the kernel-level program into the stream controller register file. Control-word transfer instructions cannot read or write the local registers inside the compute clusters.
The stream-level program can communicate with the kernel-level program by starting its execution with the stream-level instruction clustop. Once stream-level instructions have prepared the input streams and initialized the scalar data for the kernel-level program, the clustop instruction starts the kernel-level program. When the stream controller detects that the kernel-level program has finished, that is, that the clustop instruction has completed, data-transfer stream-level instructions write the vector results that the kernel-level program output to the stream register file into off-chip memory, and control-word transfer instructions read the scalar results from the microcontroller registers into the stream controller registers.
The stream-level program can communicate with the kernel-level program synchronously, which synchronizes the two programs. A stream-level instruction that follows a synchronization instruction in the stream-level program cannot issue before that synchronization instruction. The synchronization instruction can issue when it reaches the head of the instruction queue and the kernel-level program has also reached the synchronization point. When the kernel-level program encounters a synchronization instruction, its whole pipeline stalls. If the stream-level program reaches the synchronization point first, it waits; when the kernel-level program also reaches the synchronization point, the synchronization instruction releases the kernel-level pipeline so that the kernel-level program continues, and the stream-level program continues at the same time. The case in which the kernel-level program reaches the synchronization point first is handled similarly.
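The rendezvous behaviour of synchronous communication, where whichever program arrives first waits for the other, resembles a two-party barrier. The following Python sketch uses threads as a software analogy only; the patent describes hardware pipelines, and the function names here are hypothetical.

```python
import threading

sync_point = threading.Barrier(2)  # both programs must arrive before either continues
events = []
events_lock = threading.Lock()


def record(message):
    with events_lock:
        events.append(message)


def stream_level():
    record("stream: before sync")
    sync_point.wait()              # stream-level program reaches the synchronization point
    record("stream: after sync")


def kernel_level():
    record("kernel: before sync")
    sync_point.wait()              # kernel-level pipeline stalls here until released
    record("kernel: after sync")


threads = [threading.Thread(target=stream_level), threading.Thread(target=kernel_level)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(events)
```

Whatever the thread scheduling, both "before sync" entries are recorded before either "after sync" entry, which is the ordering guarantee the synchronization instruction provides.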
Compared with the prior art, the advantages of the present invention are:
1. Memory access latency is hidden. Program execution is controlled at two levels, the stream level and the kernel level. The stream-level program prepares the batches of data that the kernel-level program needs to process and stores the kernel-level program's results to off-chip memory. Preparing data for the next kernel-level program can therefore proceed in parallel with the execution of the current kernel-level program, and storing the current program's results to off-chip memory can proceed in parallel with the execution of the next one, which hides memory access latency.
2. High computational performance can be obtained. Kernel-level instructions in VLIW format can manage more functional units. Because stream applications are compute-intensive and data-parallel, a structure matched to these characteristics can achieve very high computational performance.
3. Higher memory access bandwidth can be obtained. Two-level instruction control handles memory access and computation separately, and explicit memory accesses to batches of data make effective use of off-chip memory bandwidth.
4. The bandwidth demanded of off-chip memory is effectively reduced. Stream-level instructions manage data transfers between on-chip and off-chip memory, so locality in on-chip memory can be fully exploited, which reduces accesses to off-chip memory.
5. The hardware design is simple. Both the stream-level program and the kernel-level program combine software and hardware, which reduces the complexity of the hardware implementation. At the stream level, the stream-level compiler detects instruction dependencies and assigns logical slot numbers, so the hardware does not analyze dependencies dynamically and only checks whether an instruction's dependencies are satisfied. At the kernel level, instructions use the VLIW format and the dependencies between compute instructions are analyzed statically by the kernel-level compiler, so the hardware does not analyze them dynamically either.
Description of the drawings
Fig. 1 shows the format of a stream-level instruction;
Fig. 2 shows the format of a kernel-level instruction;
Fig. 3 shows the control method for stream-level instructions;
Fig. 4 shows the execution flow of the two-level program.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
As shown in Fig. 4, the instruction control method for a stream processor of the present invention divides instruction control into a stream-level program and a kernel-level program. The stream-level program schedules data between the compute core and off-chip memory, and the kernel-level program performs the computation on the data. The specific steps are:
1. Initialize the scalar data of the kernel-level program. Using control-word transfer instructions, the stream-level program initializes the scalar data that the kernel-level program needs during execution into the microcontroller registers. When the kernel-level program executes, communication instructions broadcast this data from the microcontroller registers to the compute clusters. If the kernel-level program needs no scalar initialization, this step can be omitted.
2. Prepare vector data for the kernel-level program. Stream transfer instructions load the vector data to be processed by the compute core, or one part of that data when double buffering is used, from off-chip memory into on-chip memory. Step 1 and this step can be performed in parallel.
3. Start the kernel-level program. After the two preceding steps have finished, start execution of the kernel-level program.
4. Execute the kernel-level program. If the kernel-level program needs no synchronous communication with the stream-level program and all the data it processes fits in on-chip memory, the stream-level program repeatedly polls whether the kernel-level program has finished while it runs. If the kernel-level program does need synchronous communication with the stream-level program, both programs continue only once the kernel-level program and the stream-level program have each reached the synchronization point. If the stream level uses double buffering to supply data, the stream-level program loads the next part of the data from off-chip memory into on-chip memory while the kernel-level program processes the current part.
5. Store the vector data generated by the kernel-level program to off-chip memory. After waiting for the kernel-level program to finish, the stream-level program stores the vector data it generated to off-chip memory. If double buffering is used for the generated vector data, the stream-level program can already store parts of it to off-chip memory during step 4, and this step stores only the last part. If the vector data generated by the previous kernel-level program is an intermediate result that the next kernel-level program will use soon and that fits entirely in on-chip memory, this step can be omitted.
6. Read the scalar results of the kernel-level program. After waiting for the kernel-level program to finish, the stream-level program reads the scalar results from the microcontroller registers using control-word transfer instructions. If the kernel-level program produces no scalar result, this step can be omitted. This step and step 5 can be performed in parallel.
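The overlap that double buffering allows in steps 2, 4, and 5 can be shown with a simple schedule. The sketch below is a hedged illustration that assumes loading, computing, and storing one part of the data each take one equal time step, which the patent does not state; the function name and action labels are hypothetical.

```python
def double_buffer_schedule(num_chunks):
    """List, per time step, the actions that run concurrently.

    At step t the stream level loads chunk t and stores chunk t-2 while the
    kernel level computes on chunk t-1 (whenever those chunks exist).
    """
    schedule = []
    for t in range(num_chunks + 2):
        actions = []
        if t < num_chunks:
            actions.append(f"load {t}")
        if 0 <= t - 1 < num_chunks:
            actions.append(f"compute {t - 1}")
        if 0 <= t - 2 < num_chunks:
            actions.append(f"store {t - 2}")
        schedule.append(actions)
    return schedule


for step, actions in enumerate(double_buffer_schedule(3)):
    print(step, actions)
```

Under this assumption three chunks finish in 5 time steps instead of the 9 that fully sequential load, compute, and store would need, and only the store of the last chunk remains after the kernel has finished, matching step 5.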
Fig. 3 shows the control method for stream-level instructions. The control of stream-level instructions is divided into three processes:
1. Logical slot number assignment and dependency generation. The stream-level compiler replaces stream function calls with stream operations; one stream function may be converted into one or more stream operations, and each stream operation is eventually converted into one stream instruction. The compiler then analyzes the dependencies between stream operations in two steps: assigning a logical slot number (Logic Issue Slot) to each stream operation and generating the dependency masks (RAWMask and WARMask). The logical slot number is purely a compile-time concept. The stream-level compiler assumes N logical slots, numbered 0 to N-1, and assigns one slot to each stream operation in turn starting from slot 0, reusing the N slots cyclically. It always tries to give adjacent stream operations adjacent slot numbers. Slot numbers are assigned in the order in which the stream operations appear in the stream program, using a modified round-robin policy. After the slot numbers are assigned, the compiler analyzes the dependencies between each stream operation and the earlier stream operations that may still be in the stream controller's instruction queue at the same time once it has been sent. The current stream operation depends on an earlier one if any of the following holds: the two access the same SRF space, off-chip memory, or microcode store; they use the same control register; or they modify or use the same content. These dependencies fall into three classes. RAW (read after write): the earlier operation modifies stream content that the current operation uses. WAR (write after read): the current operation modifies content that the earlier operation needs to read. WAW (write after write): both operations modify the same content. The stream-level compiler represents these three classes of dependency with dependency masks. RAW and WAW are merged and represented by RAWMask; this is a completion dependency, meaning that an instruction can issue only after all the instructions it depends on have finished executing. WARMask represents an issue dependency, meaning that an instruction can issue only after all the instructions it depends on have issued.
2. Dynamic generation and sending of stream-level instructions. Stream-level instructions are generated dynamically because a stream-level program contains jumps between basic blocks, such as loops and branches, and the compiler cannot determine the execution trace in advance; the instructions can therefore only be generated according to how the program executes. The Run-time Dispatcher, a software module running on the host processor, is responsible for generating and sending stream-level instructions. The Run-time Dispatcher may send a stream operation to the stream controller only when two conditions hold. First, the logical slot number of the stream operation to be sent must be available: the Run-time Dispatcher continually queries which logical slot numbers are in use in the stream controller, and the operation can be sent only when the slot number assigned to it is not held by any instruction in the stream controller's instruction queue. Second, if stream operation A depends on stream operation B, and B is sent repeatedly as part of a double buffer (A is not part of this double buffer), A can be sent only after B has completed its last send.
3. Issue and execution of stream-level instructions. The stream controller has an M-entry instruction queue that can hold M stream-level instructions at once (M is generally greater than N). Issue and execution proceed in three steps. First, the stream-level instruction enters the instruction queue. On entry, its completion dependencies and issue dependencies are updated according to the stream-level instructions already in the queue. For example, if the entering instruction has a completion dependency on the instruction with logical slot number j, but that instruction has already finished and left the queue, the completion dependency information of the entering instruction must be updated. Second, the stream-level instruction is issued from the instruction queue to a functional unit for execution. Among the instructions whose issue dependencies and completion dependencies are satisfied and whose required resources are idle, the stream controller dynamically selects the instruction i that entered the queue earliest, issues it to the relevant functional unit, and updates the dependency masks of the queued instructions that have an issue dependency on instruction i. Third, the stream-level instruction completes and leaves the instruction queue. When the resources an instruction needs become idle again, the instruction has finished executing and can retire from the queue. Because several stream-level instructions may finish in the same clock cycle, the stream controller retires the instruction k that entered the queue earliest and updates the dependency masks of the queued instructions that have a completion dependency on instruction k. Issue of stream-level instructions uses a multi-cycle, non-pipelined design, and the number of clock cycles needed to issue an instruction is not fixed: the minimum applies to, for example, a write-register instruction, and the maximum is 8 clock cycles, for example for the clustop instruction that starts a kernel-level program. The number of execution cycles of a stream-level instruction is not fixed either; for an instruction that transfers stream data, it is the number of clock cycles needed to transfer that data.
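The three processes above can be sketched together in Python: building RAWMask and WARMask as bit masks indexed by logical slot number, then entering, issuing, and retiring instructions in a toy queue. This is an illustration under stated assumptions, not the patented hardware: it uses plain cyclic slot assignment, tracks dependencies by named resources, ignores functional-unit availability, and all class, function, and operation names are hypothetical.

```python
def build_masks(ops, num_slots):
    """ops: list of (name, reads, writes) in program order, reads/writes being sets."""
    result = []
    for i, (name, reads, writes) in enumerate(ops):
        raw_mask = war_mask = 0
        # Only the previous num_slots - 1 operations can share the queue with op i,
        # because op i - num_slots is assigned the same logical slot.
        for j in range(max(0, i - num_slots + 1), i):
            _, prev_reads, prev_writes = ops[j]
            bit = 1 << (j % num_slots)
            if prev_writes & (reads | writes):  # RAW or WAW: completion dependency
                raw_mask |= bit
            if prev_reads & writes:             # WAR: issue dependency
                war_mask |= bit
        result.append({"name": name, "slot": i % num_slots,
                       "raw_mask": raw_mask, "war_mask": war_mask})
    return result


class StreamQueue:
    """Toy instruction queue; list order is the order of entry."""

    def __init__(self):
        self.entries = []

    def enter(self, entry):
        live = {e["slot"] for e in self.entries}
        issued = {e["slot"] for e in self.entries if e["issued"]}
        e = dict(entry, issued=False)
        # Drop dependencies on instructions that have already retired or issued.
        e["raw_mask"] = sum(1 << s for s in live if entry["raw_mask"] >> s & 1)
        e["war_mask"] = sum(1 << s for s in live - issued if entry["war_mask"] >> s & 1)
        self.entries.append(e)

    def issue(self):
        """Issue the earliest instruction whose masks are clear; return its name or None."""
        for e in self.entries:
            if not e["issued"] and e["raw_mask"] == 0 and e["war_mask"] == 0:
                e["issued"] = True
                for other in self.entries:
                    other["war_mask"] &= ~(1 << e["slot"])
                return e["name"]
        return None

    def retire(self, name):
        """Remove a finished instruction and clear completion dependencies on it."""
        e = next(x for x in self.entries if x["name"] == name)
        self.entries.remove(e)
        for other in self.entries:
            other["raw_mask"] &= ~(1 << e["slot"])


ops = [
    ("load_a", set(), {"srf_a"}),
    ("clustop", {"srf_a"}, {"srf_b"}),
    ("store_b", {"srf_b"}, {"mem_b"}),
    ("load_a2", set(), {"srf_a"}),
]
masks = build_masks(ops, num_slots=4)
queue = StreamQueue()
for entry in masks:
    queue.enter(entry)
print(queue.issue())   # load_a has no dependencies
print(queue.issue())   # None: every other instruction waits for load_a to complete
queue.retire("load_a")
print(queue.issue())   # clustop (RAW on load_a is now cleared)
print(queue.issue())   # load_a2 (WAW on load_a cleared, WAR on clustop cleared by its issue)
```

In this example `store_b` stays blocked until `clustop` retires, because a completion dependency is cleared only on retirement, whereas the issue dependency of `load_a2` on `clustop` is cleared as soon as `clustop` issues.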
In the present invention, instructions are divided into stream-level instructions and kernel-level instructions. Stream-level instructions mainly supply the kernel-level program with the stream data it processes: load instructions bring stream data from off-chip memory (DRAM) into the on-chip stream register file, and store instructions write the stream data generated by the kernel-level program from the on-chip stream register file back to off-chip DRAM. Kernel-level instructions use a general VLIW format and control multiple compute clusters that execute in SIMD (single instruction, multiple data) fashion. Each field of the VLIW corresponds to one functional unit of a compute cluster.
Stream-level instruction control is based on logical slots and on cooperation between software and hardware. The compiler generates the issue dependencies and completion dependencies between stream-level instructions, the stream controller checks these dependencies dynamically, and a stream-level instruction is issued to the relevant functional unit once its dependencies are satisfied.
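How one VLIW word drives several compute clusters in SIMD fashion can be illustrated with a short Python sketch. The number of clusters, the functional-unit names, the register names, and the two supported operations are all assumptions made for the example.

```python
def execute_vliw(word, clusters):
    """Apply one VLIW word to every cluster in lock step (SIMD).

    word     -- maps a functional-unit name to (operation, dest, src1, src2)
    clusters -- list of per-cluster register dictionaries
    """
    operations = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    for regs in clusters:
        # Read all sources before writing, so the fields behave as parallel operations.
        results = {dest: operations[op](regs[s1], regs[s2])
                   for op, dest, s1, s2 in word.values()}
        regs.update(results)


# Four clusters hold different data in r0 but execute the same instruction word.
clusters = [{"r0": i, "r1": 10, "r2": 0, "r3": 0} for i in range(4)]
word = {"adder0": ("add", "r2", "r0", "r1"), "mul0": ("mul", "r3", "r0", "r1")}
execute_vliw(word, clusters)
print([(c["r2"], c["r3"]) for c in clusters])  # [(10, 0), (11, 10), (12, 20), (13, 30)]
```

Each field of the word targets one functional unit, and every cluster applies the same fields to its own local data.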
Kernel-level instructions are executed jointly by the microcontroller and the compute clusters. Kernel-level instructions use the general VLIW control method, so their execution does not rely on dynamic scheduling to detect data dependencies; the data dependencies between instructions are resolved by static scheduling in the compiler. Logically, kernel-level instruction execution is divided into three pipelines. The first is the microcontroller instruction pipeline, which executes the instructions in the microcontroller field of the VLIW word. The second is the data input/output pipeline, which executes the instructions in the input/output unit fields of the VLIW word. The third is the compute cluster pipeline, which performs the computation on the input data. The three pipelines share two pipeline stages: the first instruction fetch stage (Fetch1) and the second instruction fetch stage (Fetch2). Kernel-level instructions are stored in the instruction memory of the microcontroller, so the size of a kernel-level program is limited by the on-chip memory capacity. When a kernel-level program is too large, the programmer must explicitly split it into several smaller programs. Pipeline stall control: because the compiler takes the execution latency (in clock cycles) of compute instructions into account when scheduling, the third pipeline cannot cause a pipeline stall. The second pipeline, however, reads data from or writes data to the data buffers of the stream register file, so a read stalls when the data buffer is empty and a write stalls when the data buffer is full, and either case stalls the second pipeline. Because the kernel-level pipelines execute in lock step, such a stall stalls the entire pipeline.
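The stall rule above can be illustrated with a tiny cycle-level model. It is a sketch under assumptions that are not in the original text: one input and one output data buffer of equal capacity, at most one I/O action per VLIW word, and hypothetical names (`simulate_kernel`, `fill_cycles`, `drain_cycles`) for the function and for the cycles at which the stream register file delivers or accepts a word.

```python
def simulate_kernel(io_ops, fill_cycles, drain_cycles, capacity=2, max_cycles=1000):
    """Return (total_cycles, stall_cycles) for a lock-step kernel pipeline.

    io_ops: per-VLIW-word I/O action: "read", "write", or None.
    fill_cycles: cycles at which one word arrives in the input buffer.
    drain_cycles: cycles at which one word leaves the output buffer.
    """
    in_count = 0    # words waiting in the input data buffer
    out_count = 0   # words waiting in the output data buffer
    pc = stalls = cycle = 0
    while pc < len(io_ops) and cycle < max_cycles:
        if cycle in fill_cycles and in_count < capacity:
            in_count += 1
        if cycle in drain_cycles and out_count > 0:
            out_count -= 1
        op = io_ops[pc]
        read_blocked = op == "read" and in_count == 0
        write_blocked = op == "write" and out_count == capacity
        if read_blocked or write_blocked:
            stalls += 1          # lock step: no pipeline advances this cycle
        else:
            if op == "read":
                in_count -= 1
            elif op == "write":
                out_count += 1
            pc += 1
        cycle += 1
    return cycle, stalls


# A read that finds the input buffer empty holds the whole kernel for 3 cycles.
read_case = simulate_kernel(["read", None, "read", None], {0, 5}, set())
# A write that finds the output buffer full holds the whole kernel for 1 cycle.
write_case = simulate_kernel(["write", "write", "write"], set(), {3})
```

Words without an I/O action never stall in this model, which mirrors the statement that statically scheduled compute instructions cannot cause a stall by themselves.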
A stream-level program can communicate with a kernel-level program in three ways. The first is the transfer of scalar data through control-word transfer instructions. Scalar data that a kernel-level program needs during execution must be written into the microcontroller register file by a control-word transfer instruction, and the scalar results of the kernel-level program are read into the stream controller register file by the same kind of stream-level instruction. Control-word transfer instructions cannot read or write the local registers inside the compute clusters. The second is starting the kernel-level program through the stream-level instruction clustop. After stream-level instructions have prepared the input streams and initialized the scalar data for the kernel-level program, the clustop instruction starts the kernel-level program. When the stream controller detects that the kernel-level program has finished, that is, when the clustop instruction completes, a data-transfer stream-level instruction writes the vector results that the kernel-level program output to the stream register file into off-chip memory, and a control-word transfer instruction reads the scalar results of the kernel-level program from the microcontroller registers into the stream controller registers. The third is synchronous communication, which synchronizes the stream-level program with the kernel-level program. A stream-level instruction that follows a synchronization instruction in the stream-level program cannot be issued before that synchronization instruction; the synchronization instruction can be issued only when it has reached the head of the instruction queue and the kernel-level program has also reached the synchronization point. When a kernel-level program encounters a synchronization instruction, the entire pipeline stalls. If the stream-level program reaches the synchronization point first, it waits; when the kernel-level program also reaches the synchronization point, the synchronization instruction releases the kernel-level pipeline, so that the kernel-level program continues to execute and the stream-level program proceeds at the same time. The case in which the kernel-level program reaches the synchronization point first is handled similarly.
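The third communication method behaves like a two-party barrier: whichever program arrives first waits, and both resume together. The sketch below models this with Python threads purely as an analogy; the hardware mechanism is a synchronization instruction and a pipeline stall, not a software barrier, and all names here are hypothetical.

```python
import threading


def run_with_sync_point():
    """Run a stream-level and a kernel-level stand-in that meet at one sync point."""
    log = []
    log_lock = threading.Lock()
    sync_point = threading.Barrier(2)   # two parties: stream level and kernel level

    def record(event):
        with log_lock:
            log.append(event)

    def stream_level_program():
        record("stream: before sync")
        sync_point.wait()               # the earlier arrival waits here
        record("stream: after sync")

    def kernel_level_program():
        record("kernel: before sync")
        sync_point.wait()               # models the stalled kernel pipeline
        record("kernel: after sync")

    threads = [
        threading.Thread(target=stream_level_program),
        threading.Thread(target=kernel_level_program),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


sync_log = run_with_sync_point()
```

Whatever the thread scheduling, both "before sync" events are recorded before either "after sync" event, which is the ordering guarantee the synchronization instruction provides.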
Referring to the stream-level instruction format shown in Figure 1, the issue slot number occupies 5 bits and represents logical slot numbers 0 to 31. The WAR Mask, the issue dependency mask, occupies 32 bits. It represents the issue dependencies between this stream-level instruction and the 32 stream-level instructions preceding it; bit i represents the issue dependency on the nearest preceding stream-level instruction whose logical slot number is i. "1" indicates that an issue dependency on that instruction exists, and "0" indicates that none exists. The RAW Mask, the completion dependency mask, occupies 32 bits. It represents the completion dependencies between this stream-level instruction and the 32 stream-level instructions preceding it; bit i represents the completion dependency on the nearest preceding stream-level instruction whose logical slot number is i. "1" indicates that a completion dependency on that instruction exists, and "0" indicates that none exists. Stream Op contains the specific stream operation information; it occupies between 5 and 80 bits, which means that stream instructions are variable-length. Stream-level instructions do not include arithmetic instructions and fall into four main classes: (1) control-word transfer instructions, which transfer data between registers or between the host processor and the control registers; (2) stream transfer instructions, which transfer stream data between the stream register file and off-chip DRAM, transfer stream data between multiple processors, and load kernel-level programs from the stream register file into the kernel-level instruction memory of the microcontroller; (3) synchronization instructions, whose main purpose is to bring the stream-level program and the kernel-level program to a synchronization point at the same time; (4) instructions that start kernel-level program execution, which start a kernel-level program to perform arithmetic operations on the input streams.
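The field widths given above (5-bit slot number, two 32-bit masks, and a 5- to 80-bit Stream Op) can be exercised with a small pack/unpack sketch. The bit ordering chosen here (slot number in the lowest bits, then WAR Mask, RAW Mask, and Stream Op) is an assumption made for illustration, since Figure 1 is not reproduced in this text; the function names are likewise hypothetical.

```python
SLOT_BITS = 5    # logical slot number, 0..31
MASK_BITS = 32   # width of the WAR (issue) and RAW (completion) masks
HEADER_BITS = SLOT_BITS + 2 * MASK_BITS


def pack_stream_instr(slot, war_mask, raw_mask, stream_op, op_bits):
    """Pack the fields into one integer; returns (word, total_bits)."""
    if not 0 <= slot < (1 << SLOT_BITS):
        raise ValueError("slot number must fit in 5 bits")
    for mask in (war_mask, raw_mask):
        if not 0 <= mask < (1 << MASK_BITS):
            raise ValueError("dependency mask must fit in 32 bits")
    if not 5 <= op_bits <= 80:
        raise ValueError("Stream Op is 5 to 80 bits long")
    if not 0 <= stream_op < (1 << op_bits):
        raise ValueError("stream_op does not fit in op_bits")
    word = slot
    word |= war_mask << SLOT_BITS
    word |= raw_mask << (SLOT_BITS + MASK_BITS)
    word |= stream_op << HEADER_BITS
    return word, HEADER_BITS + op_bits


def unpack_stream_instr(word, total_bits):
    """Inverse of pack_stream_instr; returns (slot, war, raw, op, op_bits)."""
    slot = word & ((1 << SLOT_BITS) - 1)
    war_mask = (word >> SLOT_BITS) & ((1 << MASK_BITS) - 1)
    raw_mask = (word >> (SLOT_BITS + MASK_BITS)) & ((1 << MASK_BITS) - 1)
    stream_op = word >> HEADER_BITS
    return slot, war_mask, raw_mask, stream_op, total_bits - HEADER_BITS


# Slot 3 with an issue dependency on slot 0 and a completion dependency on slot 1.
packed, nbits = pack_stream_instr(3, 0b01, 0b10, stream_op=0x1F, op_bits=5)
fields = unpack_stream_instr(packed, nbits)
```

Under this layout the shortest instruction is 74 bits and the longest is 149 bits, which follows directly from the stated field widths.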
Referring to the kernel-level instruction format shown in Figure 2, the VLIW instruction word is divided into 11 fields. The first 8 fields correspond to the 8 main functional units in a compute cluster: the scratchpad register (Scratchpad), 4 multiply-add units (MULADD), the communication unit (COMM), the condition generation control unit (JB/VAL), and the local condition register file (CC). The last 3 fields are the microcontroller field (Microcontroller), the field for the 8 stream input/output units (DB0:DB7), and 1 reserved field (Res). The field of each unit is further divided into several subfields. Apart from the microcontroller field and the stream input/output field, which are special cases, the other fields generally contain the following subfields: the opcode of the unit, the read address of the condition code register file, the local register file read address (LRFx Rd), and the local register file write address (LRFx Wr). In addition, each local register file write port has a corresponding software pipeline stage number (LRFx Stg) and a crossbar address number (LRFx Bus).

Claims (5)

1. An instruction control method for a stream processor, characterized in that instruction control is divided into a stream-level program and a kernel-level program, the stream-level program being responsible for scheduling data between the compute core and off-chip memory and the kernel-level program performing the computation on the data, with the following specific steps:
(1) Initialize the scalar data of the kernel-level program: the stream-level program uses control-word transfer instructions to initialize, in the microcontroller registers, the scalar data needed during execution of the kernel-level program; when the kernel-level program executes, communication instructions broadcast this data from the microcontroller registers to the compute cluster units. If the kernel-level program needs no scalar data initialization, this step can be omitted;
(2) Prepare vector data for the kernel-level program: stream transfer instructions load the vector data to be processed by the compute core, or, when double buffering is used, one part of that vector data, from off-chip memory into on-chip memory. Steps (1) and (2) can be performed in parallel;
(3) Start the kernel-level program: after the preceding two steps have completed, start execution of the kernel-level program;
(4) Execute the kernel-level program: if the kernel-level program needs no synchronous communication with the stream-level program and all the data it is to process fits in on-chip memory, the stream-level program continuously checks, while the kernel-level program runs, whether the kernel-level program has finished; if the kernel-level program needs synchronous communication with the stream-level program, the kernel-level program and the stream-level program continue to execute only once both have reached the synchronization point; if the stream-level program uses double buffering to supply data to the kernel-level program, then while the kernel-level program processes one part of the data, the stream-level program loads another part of the data from off-chip memory into on-chip memory;
(5) Store the vector data generated by the kernel-level program into off-chip memory: after the stream-level program has waited for the kernel-level program to finish, it stores the generated vector data into off-chip memory; if double buffering is used for the vector data generated by the kernel-level program, the stream-level program can already store part of the generated vector data into off-chip memory during step (4), and in this step it stores the last part of the data into off-chip memory; if the vector data generated by a kernel-level program is an intermediate result that is about to be used by the following kernel-level program and can be kept entirely in on-chip memory, this step can be omitted;
(6) Read the scalar results of the kernel-level program: after waiting for the kernel-level program to finish, the stream-level program reads the scalar results from the microcontroller registers by control-word transfer instructions; if the kernel-level program generates no scalar results, this step can be omitted; this step can be performed in parallel with step (5).
2. The instruction control method for a stream processor according to claim 1, characterized in that the control of the stream-level instructions is divided into three steps:
(1) Allocation of logical slot numbers and generation of dependencies: the stream-level compiler replaces stream function calls with stream operations; one stream function may be converted into one or more stream operations, and each stream operation is ultimately converted into one stream instruction; the compiler also performs dependency analysis between the stream operations in two steps: allocating logical slot numbers to the stream operations and generating the dependency masks;
(2) Dynamic generation and sending of stream-level instructions: stream-level instructions can only be produced dynamically according to how the program executes; a software module running on the host processor is responsible for dynamically generating and sending the stream-level instructions. For this software module to send a stream operation to the stream controller, the following two conditions must be satisfied: first, the logical slot number of the stream operation to be sent is available; the software module continuously queries the usage of the logical slot numbers in the stream controller, and the stream operation can be sent to the stream controller only when the logical slot number allocated to this instruction is not in use in the instruction queue of the stream controller; second, if a stream operation A depends on a stream operation B, and B is sent multiple times as part of a double buffer, A can be sent only after the last transmission of B has completed;
(3) Issue and execution of stream-level instructions: the stream controller has an M-entry instruction queue that can hold M stream-level instructions at the same time, where M is generally greater than N. The issue and execution of a stream-level instruction is divided into three steps. First, the stream-level instruction enters the instruction queue; when it enters, its completion dependencies and issue dependencies must be updated according to the stream-level instructions already present in the instruction queue. Second, the stream-level instruction is issued from the instruction queue to a functional unit for execution; when the issue dependencies and completion dependencies of stream-level instructions are satisfied and the resources they need are idle, the stream controller dynamically selects the stream-level instruction i that entered the instruction queue earliest, issues it to the relevant functional unit for execution, and updates the dependency masks of the instructions in the instruction queue that have an issue dependency on stream-level instruction i. Third, the stream-level instruction completes and leaves the instruction queue; once the resources that the stream-level instruction needs have become idle again, the instruction has finished executing and can retire from the instruction queue. Because several stream-level instructions may finish in the same clock cycle, the stream controller selects the stream-level instruction k that entered the instruction queue earliest to retire, and updates the dependency masks of the stream-level instructions in the instruction queue that have a completion dependency on stream-level instruction k.
3. The instruction control method for a stream processor according to claim 1 or 2, characterized in that the communication between the stream-level program and the kernel-level program can transfer scalar data through control-word transfer instructions: scalar data that the kernel-level program needs during execution must be written into the microcontroller register file by a control-word transfer instruction, and the scalar results of the kernel-level program are read into the stream controller register file by the same kind of stream-level instruction; control-word transfer instructions cannot read or write the local registers inside the compute clusters.
4. The instruction control method for a stream processor according to claim 1 or 2, characterized in that the communication between the stream-level program and the kernel-level program can start the kernel-level program through the stream-level instruction clustop: after stream-level instructions have prepared the input streams and initialized the scalar data for the kernel-level program, the clustop instruction starts the kernel-level program; when the stream controller detects that the kernel-level program has finished, that is, when the clustop instruction completes, a data-transfer stream-level instruction writes the vector results that the kernel-level program output to the stream register file into off-chip memory, and a control-word transfer instruction reads the scalar results of the kernel-level program from the microcontroller registers into the stream controller registers.
5. The instruction control method for a stream processor according to claim 1 or 2, characterized in that the communication between the stream-level program and the kernel-level program can use synchronous communication to synchronize the stream-level program with the kernel-level program: a stream-level instruction that follows a synchronization instruction in the stream-level program cannot be issued before that synchronization instruction, and the synchronization instruction can be issued only when it has reached the head of the instruction queue and the kernel-level program has also reached the synchronization point; when a kernel-level program encounters a synchronization instruction, the entire pipeline stalls; if the stream-level program reaches the synchronization point first, it waits, and when the kernel-level program also reaches the synchronization point, the synchronization instruction releases the kernel-level pipeline, so that the kernel-level program continues to execute and the stream-level program proceeds at the same time; the case in which the kernel-level program reaches the synchronization point first is handled similarly.
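Steps (1) to (6) of claim 1, with double buffering, can be sketched as a host-side driver loop. This is a sequential illustration only, so the overlap of loading and computing is indicated in the log rather than actually executed in parallel. The kernel used here (scale each element and accumulate a sum) and every name in the code are hypothetical stand-ins, not part of the claimed method.

```python
def run_kernel_double_buffered(off_chip_input, chunk_size, scale):
    """Return (off_chip_output, scalar_result, log) for a toy scale-and-sum kernel."""
    log = []
    # Step (1): initialize scalar data in the (modeled) microcontroller registers.
    micro_regs = {"scale": scale, "sum": 0}
    log.append("1: init scalars")

    chunks = [off_chip_input[i:i + chunk_size]
              for i in range(0, len(off_chip_input), chunk_size)]
    on_chip = [None, None]      # two on-chip buffers for double buffering
    off_chip_output = []

    # Step (2): load the first part of the vector data into on-chip memory.
    if chunks:
        on_chip[0] = list(chunks[0])
        log.append("2: load chunk 0")

    # Step (3): start the kernel-level program.
    log.append("3: start kernel")

    for n in range(len(chunks)):
        current = n % 2
        # Step (4): load the next part into the other buffer while the kernel
        # works on the current one (sequential here, overlapped in hardware).
        if n + 1 < len(chunks):
            on_chip[1 - current] = list(chunks[n + 1])
            log.append(f"4: load chunk {n + 1} during chunk {n}")
        result = [x * micro_regs["scale"] for x in on_chip[current]]
        micro_regs["sum"] += sum(result)
        # Steps (4)/(5): store each generated part; the last store is step (5).
        step = "5" if n == len(chunks) - 1 else "4"
        off_chip_output.extend(result)
        log.append(f"{step}: store chunk {n}")

    # Step (6): read the scalar result back from the microcontroller registers.
    scalar_result = micro_regs["sum"]
    log.append("6: read scalar result")
    return off_chip_output, scalar_result, log


output, total, steps = run_kernel_double_buffered([1, 2, 3, 4, 5], 2, 10)
```

The two-buffer index `n % 2` is the essential part of the sketch: the buffer being filled is always the one the kernel is not currently reading.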
CNB2007100345685A 2007-03-19 2007-03-19 Instruction control method aimed at stream processor Expired - Fee Related CN100461094C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100345685A CN100461094C (en) 2007-03-19 2007-03-19 Instruction control method aimed at stream processor

Publications (2)

Publication Number Publication Date
CN101021779A true CN101021779A (en) 2007-08-22
CN100461094C CN100461094C (en) 2009-02-11

Family

ID=38709554

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100345685A Expired - Fee Related CN100461094C (en) 2007-03-19 2007-03-19 Instruction control method aimed at stream processor

Country Status (1)

Country Link
CN (1) CN100461094C (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4576758B2 (en) * 2001-06-21 2010-11-10 ソニー株式会社 Data processing device
KR100463642B1 (en) * 2003-03-06 2004-12-29 한국과학기술원 Apparatus for accelerating multimedia processing by using the coprocessor
WO2004086760A1 (en) * 2003-03-27 2004-10-07 Matsushita Electric Industrial Co., Ltd. Data processing apparatus
US7920584B2 (en) * 2005-05-04 2011-04-05 Arm Limited Data processing system
CN100357932C (en) * 2006-06-05 2007-12-26 中国人民解放军国防科学技术大学 Method for decreasing data access delay in stream processor

Also Published As

Publication number Publication date
CN100461094C (en) 2009-02-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090211

Termination date: 20110319