CN101021779A - Instruction control method aimed at stream processor - Google Patents
- Publication number
- CN101021779A (CN 200710034568 A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- stream
- level
- program
- level program
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Advance Control (AREA)
Abstract
The invention discloses an instruction control method for a stream processor. Instruction control is divided into a stream-level program and a kernel-level program: the stream-level program schedules data between the compute core and off-chip memory, and the kernel-level program performs the computation on the data. The steps are: (1) initializing the scalar data of the kernel-level program; (2) preparing vector data for the kernel-level program; (3) starting execution of the kernel-level program; (4) executing the kernel-level program; (5) storing the vector data generated by the kernel-level program into off-chip memory; (6) reading the scalar result of the kernel-level program.
Description
Technical field
The present invention relates generally to instruction control methods in the field of microprocessor design, and in particular to an instruction control method for stream processors that run compute-intensive, data-parallel applications.
Background technology
With the continuing expansion of computing, one typical class of applications, stream applications, is becoming a basic workload of microprocessors. A stream is an uninterrupted, continuous, moving queue of data. Stream applications have the following characteristics. They are compute-intensive: a large number of arithmetic operations are performed on every stream element that is fetched. Their parallelism is mainly data-level: dependencies between the elements of a data stream are weak and the operations on different elements are relatively independent, so there is abundant data parallelism and good opportunity for latency hiding. They exhibit locality, including data-reuse locality and producer-consumer locality.
Instruction control in a general-purpose microprocessor typically uses a basic five-stage pipeline: instruction fetch, instruction decode/register read, execute/effective-address calculation, memory access, and write-back. To exploit instruction-level parallelism, dynamic scheduling techniques such as scoreboarding, dynamic branch prediction, register renaming, and the Tomasulo algorithm can be used to reduce the stalls caused by data and control hazards, and a three-level storage hierarchy of LRF, cache, and off-chip DRAM is used to reduce the average memory access time, thereby reducing CPI. Instruction-level parallelism can also be improved by the static scheduling of VLIW (very long instruction word). Execution control for VLIW is relatively simple and needs no dynamic scheduling in hardware, which reduces hardware complexity, but this approach places high demands on the compiler, whose quality determines how much instruction-level parallelism is exploited. All of these instruction control methods tightly couple memory access with the computation performed on the data. Computing units occupy only a small proportion of a general-purpose microprocessor; in the Itanium 2 processor, for example, the 12 integer units and 2 floating-point units, together with their register files, account for only 6% of the chip area, and such a structure limits compute-intensive stream applications. Stream applications have high data parallelism and computational intensity and weak data dependencies, so their computation is easily separated from memory access; general-purpose instruction control methods therefore cannot achieve high performance on them.
Summary of the invention
The technical problem to be solved by the present invention is as follows: in view of the problems of the prior art, the invention provides an instruction control method for stream processors that uses two-level instruction control to separate computation on data from memory access, thereby obtaining higher computing performance and higher memory access bandwidth while effectively reducing the bandwidth demand on off-chip memory.
To solve the above technical problem, the solution proposed by the invention is an instruction control method for a stream processor in which instruction control is divided into a stream-level program and a kernel-level program. The stream-level program is responsible for scheduling data between the compute core and off-chip memory, and the kernel-level program performs the computation on the data. The specific steps are:
(1) Initialize the scalar data of the kernel-level program: using control-word transfer instructions, the stream-level program initializes the scalar data needed during kernel-level program execution into the microcontroller registers; when the kernel-level program executes, communication-class instructions broadcast the data from the microcontroller registers to the compute clusters. If the kernel-level program needs no scalar initialization, this step can be omitted;
(2) Prepare vector data for the kernel-level program: using stream-transfer instructions, the vector data to be processed by the compute core, or one part of it when double buffering is used, is loaded from off-chip memory into on-chip memory. Steps (1) and (2) can be performed in parallel;
(3) Start execution of the kernel-level program: after the preceding two steps have completed, the kernel-level program is started;
(4) Execute the kernel-level program: if the kernel-level program needs no synchronous communication with the stream-level program and all of its data to be processed fits in on-chip memory, the stream-level program continuously polls during execution to detect whether the kernel-level program has finished. If the kernel-level program needs synchronous communication with the stream-level program, the two programs continue only once both have reached the synchronization point. If the stream level uses double buffering to supply data, then while the kernel-level program processes one part of the data, the stream-level program loads another part from off-chip memory into on-chip memory;
(5) Store the vector data generated by the kernel-level program into off-chip memory: after the stream-level program has waited for the kernel-level program to finish, it stores the generated vector data into off-chip memory. If double buffering is used for the generated vector data, the stream-level program may already store part of it into off-chip memory during step (4), and this step stores the last part. If the vector data generated by the preceding kernel-level program is an intermediate result that will be used by the following kernel-level program and fits entirely in on-chip memory, this step can be omitted;
(6) Read the scalar result of the kernel-level program: after waiting for the kernel-level program to finish, the stream-level program reads the scalar result from the microcontroller registers using control-word transfer instructions. If the kernel-level program produces no scalar result, this step can be omitted. This step can be performed in parallel with step (5).
The control of the stream-level instructions is divided into three steps:
(1) Allocation of logic slot numbers and generation of dependencies: the stream-level compiler replaces stream function calls with stream operations; one stream function may be converted into one or more stream operations, and each stream operation is ultimately converted into one stream instruction. The compiler also analyzes the dependencies between stream operations in two steps: assigning logic slot numbers to the stream operations and generating dependency masks;
(2) Dynamic generation and sending of stream-level instructions: stream-level instructions can only be generated dynamically according to how the program executes. A software module running on the host processor is responsible for dynamically generating and sending them. A stream operation can be sent by this module to the stream controller only if two conditions hold. First, the logic slot number of the stream operation to be sent is available: the software module continuously queries the usage of logic slot numbers in the stream controller, and the operation can be sent only when the logic slot number assigned to it is not held by any instruction in the stream controller's instruction queue. Second, if stream operation A depends on stream operation B, and B is sent multiple times as part of a double buffer, A can be sent only after B has completed its last send;
(3) Issue and execution of stream-level instructions: the stream controller contains an instruction queue of M entries and can hold M stream-level instructions at once, where M is generally greater than N, the number of logic slots. Issue and execution are divided into three stages. First, a stream-level instruction enters the instruction queue; on entry, its completion dependencies and issue dependencies must be updated according to the stream-level instructions currently in the queue. Second, the instruction is issued from the instruction queue to a functional unit for execution: among the instructions whose issue dependencies and completion dependencies are satisfied and whose required resources are idle, the stream controller dynamically selects the one that entered the queue earliest, instruction i, issues it to the relevant functional unit, and updates the dependency masks of the instructions in the queue that have an issue dependency on instruction i. Third, the instruction completes and leaves the instruction queue: when the resources required by a stream-level instruction become idle again, the instruction has finished executing and can retire from the queue. Because several stream-level instructions may finish in the same clock cycle, the stream controller selects the one that entered the queue earliest, instruction k, to retire, and updates the dependency masks of the instructions in the queue that have a completion dependency on instruction k.
The stream-level program can communicate with the kernel-level program by transferring scalar data with control-word transfer instructions: scalar data needed by the kernel-level program during execution is written into the microcontroller register file by control-word transfer instructions, and the scalar results of the kernel-level program are read into the stream controller register file by the same class of stream-level instructions. Control-word transfer instructions cannot read or write the local registers inside the compute clusters.
The stream-level program can also communicate with the kernel-level program by starting its execution with the stream-level instruction clustop: after stream-level instructions have prepared the input streams and initialized the scalar data for the kernel-level program, the clustop instruction starts the kernel-level program. When the stream controller detects that the kernel-level program has finished, that is, when the clustop instruction completes, data-transfer stream-level instructions write the vector results that the kernel-level program output to the stream register file into off-chip memory, and control-word transfer instructions read the scalar results of the kernel-level program from the microcontroller registers into the stream controller registers.
The stream-level program can also communicate with the kernel-level program synchronously, which synchronizes the two programs. A stream-level instruction that follows a synchronization instruction in the stream-level program cannot be issued before that synchronization instruction; the synchronization instruction can be issued when it reaches the head of the instruction queue and the kernel-level program has also reached the synchronization point. A synchronization instruction encountered in the kernel-level program stalls the whole pipeline. When the stream-level program reaches the synchronization point first, it waits; once the kernel-level program also reaches the synchronization point, the synchronization instruction releases the kernel-level pipeline so that the kernel-level program continues executing, and the stream-level program continues at the same time. The case in which the kernel-level program reaches the synchronization point first is handled similarly.
Compared with the prior art, the advantages of the present invention are:
1. Memory access latency is hidden. Program execution uses two-level control, stream level and kernel level: the stream-level program prepares the batches of data that the kernel-level program needs to process and stores the results of kernel-level execution into off-chip memory. Preparing data for the next kernel-level program can therefore proceed in parallel with execution of the current kernel-level program, and storing the results of the current kernel-level program into off-chip memory can proceed in parallel with execution of the next one, which hides memory access latency.
2. High computing performance can be obtained. VLIW kernel-level instructions can manage more functional units. Because stream applications are compute-intensive and data-parallel, this structure matches their characteristics and can achieve very high computing performance.
3. Higher memory access bandwidth can be obtained. With two-level instruction control, memory access and computation on data are handled separately, and explicit batch accesses make effective use of off-chip memory bandwidth.
4. The bandwidth demand on off-chip memory is effectively reduced. Managing data transfers between on-chip and off-chip memory with stream-level instructions fully exploits the locality of on-chip memory and thus reduces accesses to off-chip memory.
5. The hardware design is simple. Both the stream-level program and the kernel-level program use a combined software-hardware approach, which reduces the complexity of the hardware implementation. At the stream level, the stream-level compiler detects instruction dependencies and allocates logic slot numbers, so the hardware need not analyze dependencies between instructions dynamically; it only checks whether an instruction's dependencies have been satisfied. At the kernel level, the VLIW instruction format is used and dependencies between compute instructions are analyzed statically by the kernel-level compiler, so the hardware likewise need not analyze instruction dependencies dynamically.
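The latency-hiding argument in advantage 1 can be illustrated with a small timing model. The following Python sketch is an idealization and not part of the patent: it assumes the data is split into n equal batches, that loading, kernel execution, and storing each take a fixed number of cycles per batch, that the three activities use independent resources, and that on-chip memory has room for the batches in flight. The function names and cycle counts are hypothetical.

```python
def serial_cycles(n, load, compute, store):
    """Cycles when each batch is loaded, processed, and stored with no overlap."""
    return n * (load + compute + store)


def overlapped_cycles(n, load, compute, store):
    """Cycles when the stream level overlaps transfers with kernel execution.

    Each activity handles batches in order; a batch starts an activity once the
    previous activity has finished with it and the activity's resource is free.
    """
    load_done = compute_done = store_done = 0
    for _ in range(n):
        load_done = load_done + load
        compute_done = max(compute_done, load_done) + compute
        store_done = max(store_done, compute_done) + store
    return store_done


if __name__ == "__main__":
    # Hypothetical per-batch costs: 3 cycles load, 5 cycles kernel, 2 cycles store.
    print(serial_cycles(4, 3, 5, 2))      # 40
    print(overlapped_cycles(4, 3, 5, 2))  # 25
```

Under these assumptions the overlapped schedule is bounded by the slowest activity (here the kernel), so the transfers cost only the initial load and the final store.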
Description of drawings
Fig. 1 shows the format of a stream-level instruction;
Fig. 2 shows the format of a kernel-level instruction;
Fig. 3 shows the control method for stream-level instructions;
Fig. 4 shows the execution flow of the two-level program.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 4, the instruction control method for a stream processor of the present invention divides instruction control into a stream-level program and a kernel-level program. The stream-level program is responsible for scheduling data between the compute core and off-chip memory, and the kernel-level program performs the computation on the data. The specific steps are:
1. Initialize the scalar data of the kernel-level program. Using control-word transfer instructions, the stream-level program initializes the scalar data needed during kernel-level program execution into the microcontroller registers; when the kernel-level program executes, communication-class instructions broadcast the data from the microcontroller registers to the compute clusters. If the kernel-level program needs no scalar initialization, this step can be omitted.
2. Prepare vector data for the kernel-level program. Using stream-transfer instructions, the vector data to be processed by the compute core, or one part of it when double buffering is used, is loaded from off-chip memory into on-chip memory. Step 1 and this step can be performed in parallel.
3. Start execution of the kernel-level program. After the preceding two steps have completed, the kernel-level program is started.
4. Execute the kernel-level program. If the kernel-level program needs no synchronous communication with the stream-level program and all of its data to be processed fits in on-chip memory, the stream-level program continuously polls during execution to detect whether the kernel-level program has finished. If the kernel-level program needs synchronous communication with the stream-level program, the two programs continue only once both have reached the synchronization point. If the stream level uses double buffering to supply data, then while the kernel-level program processes one part of the data, the stream-level program loads another part from off-chip memory into on-chip memory.
5. Store the vector data generated by the kernel-level program into off-chip memory. After the stream-level program has waited for the kernel-level program to finish, it stores the generated vector data into off-chip memory. If double buffering is used for the generated vector data, the stream-level program may already store part of it into off-chip memory during step 4, and this step stores the last part. If the vector data generated by the preceding kernel-level program is an intermediate result that will be used by the following kernel-level program and fits entirely in on-chip memory, this step can be omitted.
6. Read the scalar result of the kernel-level program. After waiting for the kernel-level program to finish, the stream-level program reads the scalar result from the microcontroller registers using control-word transfer instructions. If the kernel-level program produces no scalar result, this step can be omitted. This step and step 5 can be performed in parallel.
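The six steps can be traced with a small sequential simulation. The Python sketch below is illustrative only: the dictionaries standing in for the microcontroller registers and the two on-chip buffers, the example kernel (scale each element and accumulate a sum), and all names are assumptions, and the transfers that the method overlaps with kernel execution are simply interleaved here rather than run concurrently.

```python
def kernel(part, regs):
    """Hypothetical kernel-level program: scale each element, accumulate a sum."""
    out = [x * regs["scale"] for x in part]
    regs["sum"] += sum(out)
    return out


def run_two_level(off_chip_in, scale, chunk=4):
    """Apply the six steps to one kernel; returns (vector out, scalar out, trace)."""
    if not off_chip_in:
        raise ValueError("expects at least one stream element")
    trace = []
    parts = [off_chip_in[i:i + chunk] for i in range(0, len(off_chip_in), chunk)]

    # Step 1: initialize scalar data in the modeled microcontroller registers.
    regs = {"scale": scale, "sum": 0}
    trace.append("1 init scalar")
    # Step 2: load the first part of the vector data into on-chip memory.
    on_chip = {0: parts[0]}          # two buffers, keyed 0 and 1
    trace.append("2 load part 0")
    # Step 3: start the kernel-level program.
    trace.append("3 start kernel")

    off_chip_out = []
    pending = None                   # output part not yet stored off-chip
    for i in range(len(parts)):
        # Step 4: while part i is processed, the stream level loads part i+1
        # into the other buffer and stores the previous output part.
        if i + 1 < len(parts):
            on_chip[(i + 1) % 2] = parts[i + 1]
            trace.append(f"4 load part {i + 1}")
        if pending is not None:
            off_chip_out.extend(pending)
            trace.append(f"4 store part {i - 1}")
        pending = kernel(on_chip[i % 2], regs)
        trace.append(f"4 kernel part {i}")

    # Step 5: store the last part of the generated vector data.
    off_chip_out.extend(pending)
    trace.append(f"5 store part {len(parts) - 1}")
    # Step 6: read the scalar result back from the microcontroller registers.
    trace.append("6 read scalar")
    return off_chip_out, regs["sum"], trace


if __name__ == "__main__":
    out, total, trace = run_two_level(list(range(1, 9)), scale=2)
    print(out, total)
    print(trace)
```

With a single part, the loop body runs once and only steps 1, 2, 3, 4 (kernel), 5, and 6 appear in the trace, which corresponds to the case without double buffering.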
Fig. 3 shows the control method for stream-level instructions. The control of stream-level instructions is divided into three processes:
1. Allocation of logic slot numbers and generation of dependencies. The stream-level compiler replaces stream function calls with stream operations; one stream function may be converted into one or more stream operations, and each stream operation is ultimately converted into one stream instruction. The compiler also analyzes the dependencies between stream operations in two steps: assigning a logic slot number (Logic IssueSlot) to each stream operation and generating the dependency masks (RAWMask and WARMask). The logic slot number is purely a compile-time concept. The stream-level compiler assumes there are only N logic slots, numbered 0 to N-1. Starting from slot 0, it assigns a logic slot to each stream operation in turn, reusing the N slots cyclically, and it always tries to assign adjacent slot numbers to adjacent stream operations. Slot numbers are assigned in the order in which the stream operations appear in the serial program, using a modified round-robin strategy. After assigning slot numbers, the compiler analyzes, for each stream operation, its dependencies on the previously sent stream operations that may still be present in the stream controller's instruction queue when it is sent. The current stream operation depends on an earlier one if any of the following holds: the two operations access the same SRF space, off-chip memory, or microcode storage space, or use the same control register, and thereby modify or use the same content. These dependencies fall into three classes: (1) RAW, read after write: the earlier operation modifies stream content that the current operation will use; (2) WAR, write after read: content that the earlier operation needs to read will be modified by the current operation; (3) WAW, write after write: the earlier operation and the current operation both modify the same content. The stream-level compiler represents these three classes of dependency between stream operations with dependency masks. RAW and WAW are merged and represented by RAWMask; this is a completion dependency, meaning precisely that every instruction on which a given instruction depends must have finished executing before that instruction can issue. WARMask represents an issue dependency, meaning precisely that every instruction on which a given instruction depends must have been issued before that instruction can issue.
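The slot allocation and mask generation described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the compiler's actual algorithm: it uses plain round-robin allocation rather than the modified strategy mentioned in the text, it models each stream operation's accesses as sets of named resources, and the class, function, and resource names are hypothetical. Bit i of a mask refers to the most recent earlier operation that holds logic slot i.

```python
from dataclasses import dataclass

N_SLOTS = 32  # assumed number of logic slots N


@dataclass
class StreamOp:
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()
    slot: int = -1
    raw_mask: int = 0   # completion dependencies (RAW and WAW merged)
    war_mask: int = 0   # issue dependencies (WAR)


def assign_slots_and_masks(ops, n_slots=N_SLOTS):
    """Assign slots round-robin in program order and build both masks."""
    latest = {}  # slot number -> most recent operation holding that slot
    for index, op in enumerate(ops):
        op.slot = index % n_slots
        # The slot is being reused, so its previous holder is no longer visible.
        latest.pop(op.slot, None)
        for slot, prev in latest.items():
            if prev.writes & (op.reads | op.writes):  # RAW or WAW
                op.raw_mask |= 1 << slot
            if prev.reads & op.writes:                # WAR
                op.war_mask |= 1 << slot
        latest[op.slot] = op
    return ops


def example_program():
    """Hypothetical stream-level program: two loads, a kernel, a store, a reload."""
    return [
        StreamOp("load_a", writes=frozenset({"a"})),
        StreamOp("load_b", writes=frozenset({"b"})),
        StreamOp("kernel", reads=frozenset({"a", "b"}), writes=frozenset({"c"})),
        StreamOp("store_c", reads=frozenset({"c"})),
        StreamOp("reload_a", writes=frozenset({"a"})),
    ]


if __name__ == "__main__":
    for op in assign_slots_and_masks(example_program()):
        print(op.name, op.slot, bin(op.raw_mask), bin(op.war_mask))
```

In this example the kernel must wait for both loads to complete (RAWMask bits 0 and 1), while the reload of "a" has a completion dependency on the first load (WAW) and only an issue dependency on the kernel that reads "a" (WAR).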
2. Dynamic generation and sending of stream-level instructions. Stream-level instructions are generated dynamically for the following reason: a stream-level program contains jumps between basic blocks, such as loops and branches, so the compiler cannot determine the execution trace in advance, and stream-level instructions can only be generated dynamically according to how the program executes. The Run-time Dispatcher, a software module running on the host processor, is responsible for dynamically generating and sending stream-level instructions. A stream operation can be sent by the Run-time Dispatcher to the stream controller only if two conditions hold. First, the logic slot number of the stream operation to be sent is available: the Run-time Dispatcher continuously queries the usage of logic slot numbers in the stream controller, and the operation can be sent only when the logic slot number assigned to it is not held by any instruction in the stream controller's instruction queue. Second, if stream operation A depends on stream operation B, and B is sent multiple times as part of a double buffer (A is not in this double buffer), A can be sent only after B has completed its last send.
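The two send conditions can be written as a single predicate. The Python sketch below is a minimal illustration; the dictionary layout of a stream operation and the bookkeeping of remaining double-buffer sends are assumptions made for the example, not the Run-time Dispatcher's real data structures.

```python
def may_send(op, slots_in_queue, sends_remaining):
    """Return True if the dispatcher may send op to the stream controller.

    op              -- dict with 'slot' (int) and 'depends_on' (list of op names)
    slots_in_queue  -- set of logic slot numbers held by queued instructions
    sends_remaining -- dict: name of a double-buffered op -> sends still to come
    """
    # Condition 1: the logic slot assigned to op is not held in the queue.
    if op["slot"] in slots_in_queue:
        return False
    # Condition 2: every double-buffered producer has completed its last send.
    return all(sends_remaining.get(name, 0) == 0 for name in op["depends_on"])


if __name__ == "__main__":
    store = {"slot": 3, "depends_on": ["load_b"]}
    print(may_send(store, {0, 1}, {"load_b": 2}))  # False: load_b not finished
    print(may_send(store, {0, 3}, {"load_b": 0}))  # False: slot 3 still in use
    print(may_send(store, {0, 1}, {"load_b": 0}))  # True
```

A dispatcher built on this predicate would poll it repeatedly, which matches the continuous querying of slot usage described above.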
3. Issue and execution of stream-level instructions. The stream controller contains an instruction queue of M entries and can hold M stream-level instructions at once (M is generally greater than N). Issue and execution are divided into three stages. First, a stream-level instruction enters the instruction queue. On entry, its completion dependencies and issue dependencies must be updated according to the stream-level instructions currently in the queue. For example, if the stream-level instruction now entering the queue has a completion dependency on the stream-level instruction with logic slot number j, but that instruction has already finished and left the queue, the completion dependency information of the entering instruction must be updated accordingly. Second, the instruction is issued from the instruction queue to a functional unit for execution. Among the instructions whose issue dependencies and completion dependencies are satisfied and whose required resources are idle, the stream controller dynamically selects the one that entered the queue earliest, instruction i, issues it to the relevant functional unit, and updates the dependency masks of the instructions in the queue that have an issue dependency on instruction i. Third, the instruction completes and leaves the instruction queue. When the resources required by a stream-level instruction become idle again, the instruction has finished executing and can retire from the queue. Because several stream-level instructions may finish in the same clock cycle, the stream controller selects the one that entered the queue earliest, instruction k, to retire, and updates the dependency masks of the instructions in the queue that have a completion dependency on instruction k. Issue of stream-level instructions uses a multi-cycle, non-pipelined design. The number of clock cycles needed to issue a stream-level instruction is not fixed: it ranges from a minimum, for example for a register-write instruction, to a maximum of 8 clock cycles, for example for the clustop instruction that starts a kernel-level program. The number of execution clock cycles is also not fixed; for a stream-level instruction that transfers stream data, it is the number of clock cycles needed to transfer the stream data.
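The enter, issue, and retire stages can be modeled with a short cycle-level simulation. The Python sketch below is a simplified model under several assumptions: issue takes a single cycle, each instruction occupies one named functional unit for a fixed number of cycles, and at most one instruction issues and one retires per cycle. The class names, unit names, and cycle counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueuedInstr:
    name: str
    slot: int
    raw_mask: int       # completion dependencies, one bit per logic slot
    war_mask: int       # issue dependencies, one bit per logic slot
    unit: str           # functional unit the instruction needs
    cycles: int         # assumed execution time in cycles
    issued: bool = False
    remaining: int = 0


class StreamControllerQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = []          # kept in order of entry
        self.busy_units = set()
        self.log = []

    def enter(self, instr):
        assert len(self.queue) < self.capacity, "instruction queue is full"
        present = sum(1 << q.slot for q in self.queue)
        unissued = sum(1 << q.slot for q in self.queue if not q.issued)
        instr.raw_mask &= present    # drop dependencies on retired instructions
        instr.war_mask &= unissued   # drop dependencies on issued instructions
        self.queue.append(instr)

    def step(self):
        for q in self.queue:         # advance instructions that are executing
            if q.issued and q.remaining > 0:
                q.remaining -= 1
        for q in self.queue:         # retire the earliest finished instruction
            if q.issued and q.remaining == 0:
                self.queue.remove(q)
                self.busy_units.discard(q.unit)
                for other in self.queue:
                    other.raw_mask &= ~(1 << q.slot)
                self.log.append(("retire", q.name))
                break
        for q in self.queue:         # issue the earliest ready instruction
            ready = q.raw_mask == 0 and q.war_mask == 0
            if not q.issued and ready and q.unit not in self.busy_units:
                q.issued, q.remaining = True, q.cycles
                self.busy_units.add(q.unit)
                for other in self.queue:
                    other.war_mask &= ~(1 << q.slot)
                self.log.append(("issue", q.name))
                break


def demo():
    sc = StreamControllerQueue(capacity=4)
    sc.enter(QueuedInstr("load_a", 0, 0b000, 0, "memory", 2))
    sc.enter(QueuedInstr("kernel", 1, 0b001, 0, "clusters", 3))
    sc.enter(QueuedInstr("store_c", 2, 0b010, 0, "memory", 2))
    cycles = 0
    while sc.queue and cycles < 100:
        sc.step()
        cycles += 1
    return sc.log, cycles


if __name__ == "__main__":
    print(demo())
```

Because the hardware only clears and tests mask bits, it never has to compare operands to discover dependencies, which is the division of labor between compiler and stream controller that the text describes.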
In the present invention, instructions are divided into stream-level instructions and kernel-level instructions. Stream-level instructions mainly supply the kernel-level program with the stream data it needs to process: load instructions load stream data from off-chip memory (DRAM) into the on-chip stream register file, and store instructions store the stream data generated by the kernel-level program from the on-chip stream register file into off-chip DRAM. Kernel-level instructions use a general VLIW format and control multiple compute clusters that execute in SIMD (single instruction, multiple data) fashion. Each field of the VLIW corresponds to one functional unit of a compute cluster.
Stream-level instructions are controlled by a logic-slot-based method in which software and hardware cooperate. The compiler generates the issue dependencies and completion dependencies between stream-level instructions, and the stream controller detects dynamically whether those dependencies have been satisfied; once they are, the stream-level instruction is issued to the relevant functional unit for execution.
Kernel-level instructions are executed jointly by the microcontroller and the compute clusters. Kernel-level instructions use the general VLIW control method, so their execution does not detect data dependencies by dynamic scheduling; data dependencies between instructions are resolved by static scheduling in the compiler. Kernel-level instruction execution is logically divided into three pipelines: first, the microcontroller instruction pipeline, which executes the instructions in the microcontroller field of the VLIW; second, the data input/output pipeline, which executes the instructions in the input/output unit field of the VLIW; third, the compute cluster pipeline, which performs the computation on the input data. The three pipelines share two pipeline stages, the first instruction-fetch stage (Fetch1) and the second instruction-fetch stage (Fetch2). Kernel-level instructions are stored in the instruction memory inside the microcontroller, so the limited on-chip memory capacity also limits the size of a kernel-level program; when a kernel-level program is too large, the programmer must explicitly split it into several smaller programs. Pipeline stall control: because the compiler takes the execution latency of compute instructions into account when scheduling, the third pipeline never causes a stall. The second pipeline, however, reads data from or writes data to the data buffers of the stream register file; a read stalls when the data buffer is empty and a write stalls when the data buffer is full, and either stalls the second pipeline. Because the kernel-level pipelines execute in lock step, this stalls the whole pipeline.
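The stall rule for the input/output pipeline can be captured in a few lines. The Python sketch below is a toy model, assuming one element is read from each input buffer and one element written to each output buffer per executed cycle; the class and function names and the buffer sizes are hypothetical.

```python
class StreamBuffer:
    """Models one data buffer of the stream register file."""

    def __init__(self, capacity, count=0):
        self.capacity = capacity
        self.count = count


def must_stall(read_buffers, write_buffers):
    """A read from an empty buffer or a write to a full buffer stalls."""
    empty_read = any(b.count == 0 for b in read_buffers)
    full_write = any(b.count >= b.capacity for b in write_buffers)
    return empty_read or full_write


def kernel_cycle(read_buffers, write_buffers):
    """Advance one lock-step cycle; returns True if executed, False if stalled."""
    if must_stall(read_buffers, write_buffers):
        return False             # the whole kernel-level pipeline waits
    for b in read_buffers:
        b.count -= 1
    for b in write_buffers:
        b.count += 1
    return True


if __name__ == "__main__":
    inp, out = StreamBuffer(4, count=2), StreamBuffer(2)
    print([kernel_cycle([inp], [out]) for _ in range(3)])  # [True, True, False]
```

In this model the stall ends as soon as the stream level refills the input buffer or drains the output buffer, at which point all pipelines resume together.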
The stream-level program can communicate with the kernel-level program in three ways. The first is the transfer of scalar data by control-word transfer instructions. Scalar data that the kernel-level program needs during execution is written into the microcontroller register file by control-word transfer instructions, and the scalar results of the kernel-level program are read into the stream controller register file by the same class of stream-level instruction. Control-word transfer instructions cannot read or write the local registers inside the compute clusters. The second is starting the kernel-level program with the stream-level instruction clustop. After stream-level instructions have prepared the input streams and initialized the scalar data for the kernel-level program, the clustop instruction starts execution of the kernel-level program. When the stream controller detects that the kernel-level program has finished, that is, when the clustop instruction completes, data-transfer stream-level instructions write the vector results that the kernel-level program output to the stream register file into off-chip memory, and control-word transfer instructions read the scalar results of the kernel-level program from the microcontroller registers into the stream controller registers. The third is synchronous communication, which synchronizes the stream-level program with the kernel-level program. A stream-level instruction that follows a synchronization instruction in the stream-level program cannot issue before that synchronization instruction; the synchronization instruction can issue only when it has reached the head of the instruction queue and the kernel-level program has also reached the synchronization point. When the kernel-level program encounters a synchronization instruction, its whole pipeline stalls. If the stream-level program reaches the synchronization point first, it waits; when the kernel-level program also reaches the synchronization point, the synchronization instruction releases the kernel-level pipeline, so that the kernel-level program continues executing while the stream-level program also proceeds. The case in which the kernel-level program reaches the synchronization point first is handled similarly.
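The synchronous communication described above behaves like a two-party barrier: whichever program arrives first waits until the other arrives, and then both proceed. The following is a minimal software sketch of that behaviour only, using Python threads in place of the stream controller and the kernel-level pipeline. The function name `run_once` and the event labels are assumptions made for illustration and do not appear in the patent.

```python
import threading

def run_once():
    """Run a 'stream' and a 'kernel' thread that meet at one sync point."""
    barrier = threading.Barrier(2)      # two parties: stream level and kernel level
    log, lock = [], threading.Lock()

    def program(name):
        with lock:
            log.append((name, "reached sync"))
        barrier.wait()                  # the first arrival stalls here until the second arrives
        with lock:
            log.append((name, "resumed"))

    threads = [threading.Thread(target=program, args=(n,))
               for n in ("stream", "kernel")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

events = run_once()
```

Whatever the thread scheduling, both "reached sync" events are recorded before either "resumed" event, which mirrors the rule that neither program passes the synchronization point until both have reached it.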
Referring to the stream-level instruction format shown in Figure 1, the issue slot number occupies 5 bits and denotes a logical slot number from 0 to 31. The WAR Mask is the issue-dependency mask and occupies 32 bits. It represents the issue dependencies between this stream-level instruction and the 32 stream-level instructions preceding it; bit i represents the issue dependency on the nearest preceding stream-level instruction whose logical slot number is i. A "1" indicates that an issue dependency on that instruction exists, and a "0" indicates that none exists. The RAW Mask is the completion-dependency mask and also occupies 32 bits. It represents the completion dependencies between this stream-level instruction and the 32 stream-level instructions preceding it; bit i represents the completion dependency on the nearest preceding stream-level instruction whose logical slot number is i. A "1" indicates that a completion dependency on that instruction exists, and a "0" indicates that none exists. Stream Op contains the specific stream operation information and occupies from 5 to 80 bits; that is, stream instructions are variable-length. Stream-level instructions do not include arithmetic instructions and fall mainly into the following four classes. Control-word transfer instructions transfer data between registers or between the host processor and the control registers. Stream transfer instructions transfer stream data between the stream register file and off-chip DRAM, transfer stream data between multiple processors, and transfer the kernel-level program from the stream register file into the kernel-level instruction memory of the microcontroller. Synchronization instructions make the execution of the stream-level program and of the kernel-level program reach a synchronization point together. Kernel-start instructions start a kernel-level program, which performs arithmetic operations on the input streams.
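To make the field widths concrete, the sketch below packs and unpacks a stream-level instruction with a 5-bit slot number, two 32-bit masks and a variable-length Stream Op. The text gives only the widths; the field order used here (slot, WAR Mask, RAW Mask, Stream Op, most significant first), the class name `StreamInstruction` and the helper names are assumptions for illustration, since the actual layout is defined by Figure 1.

```python
from dataclasses import dataclass

SLOT_BITS = 5    # logical slot number, 0..31
MASK_BITS = 32   # one bit per logical slot

@dataclass(frozen=True)
class StreamInstruction:
    slot: int       # logical slot number
    war_mask: int   # issue-dependency mask (WAR Mask)
    raw_mask: int   # completion-dependency mask (RAW Mask)
    op: int         # Stream Op payload
    op_bits: int    # payload width: 5 to 80 bits (variable-length instruction)

    def __post_init__(self):
        if not 0 <= self.slot < (1 << SLOT_BITS):
            raise ValueError("slot number must fit in 5 bits")
        for mask in (self.war_mask, self.raw_mask):
            if not 0 <= mask < (1 << MASK_BITS):
                raise ValueError("mask must fit in 32 bits")
        if not 5 <= self.op_bits <= 80:
            raise ValueError("Stream Op must be 5 to 80 bits wide")
        if not 0 <= self.op < (1 << self.op_bits):
            raise ValueError("Stream Op does not fit in op_bits")

    @property
    def total_bits(self) -> int:
        return SLOT_BITS + 2 * MASK_BITS + self.op_bits

    def encode(self) -> int:
        # Assumed layout, most significant field first: slot | WAR | RAW | op.
        word = self.slot
        word = (word << MASK_BITS) | self.war_mask
        word = (word << MASK_BITS) | self.raw_mask
        return (word << self.op_bits) | self.op

def decode(word: int, op_bits: int) -> StreamInstruction:
    """Inverse of encode(); the caller must know the Stream Op width."""
    op = word & ((1 << op_bits) - 1)
    word >>= op_bits
    raw = word & ((1 << MASK_BITS) - 1)
    word >>= MASK_BITS
    war = word & ((1 << MASK_BITS) - 1)
    word >>= MASK_BITS
    return StreamInstruction(word, war, raw, op, op_bits)

def depends_on(mask: int, slot: int) -> bool:
    """True if bit `slot` of a dependency mask is set to 1."""
    return bool((mask >> slot) & 1)
```

For example, `StreamInstruction(slot=3, war_mask=0b101, raw_mask=0, op=0x1F, op_bits=5)` has issue dependencies on the instructions in logical slots 0 and 2 and round-trips through `decode(ins.encode(), 5)`.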
Referring to the kernel-level instruction format shown in Figure 2, the VLIW instruction word is divided into 11 fields. The first 8 fields correspond to the 8 main functional units in a compute cluster: the scratchpad register (Scratchpad), 4 multiply-add units (MULADD), the communication unit (COMM), the condition generation control unit (JB/VAL) and the local condition register file (CC). The last 3 fields comprise the microcontroller field (Microcontroller), the field for the 8 stream input/output units (DB0:DB7) and 1 reserved field (Res). The field for each unit is further divided into several subfields. Apart from the microcontroller field and the stream input/output field, which are special cases, the other fields generally contain the following subfields: the opcode of the unit, the read address of the condition code register file, the local register file read address (LRFx Rd) and the local register file write address (LRFx Wr). In addition, each local register file write port has a corresponding software pipeline stage number (LRFx Stg) and crossbar bus address number (LRFx Bus).
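The field structure can be pictured as a record with one entry per functional unit. The sketch below is only a data-structure illustration: the text does not give bit widths, port counts or opcode values for the kernel-level instruction, so the unit names `MULADD0` to `MULADD3`, the use of opcode 0 as an idle value, and the list-based read and write ports are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The 8 cluster-unit fields and the 3 trailing fields named in the text.
CLUSTER_UNIT_FIELDS = ["Scratchpad", "MULADD0", "MULADD1", "MULADD2",
                       "MULADD3", "COMM", "JB/VAL", "CC"]
TRAILING_FIELDS = ["Microcontroller", "DB0:DB7", "Res"]

@dataclass
class LrfWrite:
    address: int   # LRFx Wr: local register file write address
    stage: int     # LRFx Stg: software pipeline stage number
    bus: int       # LRFx Bus: crossbar bus address number

@dataclass
class UnitField:
    opcode: int = 0                 # assumed: 0 means the unit is idle
    cc_read_addr: int = 0           # condition code register file read address
    lrf_reads: List[int] = field(default_factory=list)         # LRFx Rd
    lrf_writes: List[LrfWrite] = field(default_factory=list)   # one per write port

def empty_vliw_word() -> Dict[str, UnitField]:
    """Build a word in which every cluster unit is idle."""
    return {name: UnitField() for name in CLUSTER_UNIT_FIELDS}

word = empty_vliw_word()
word["MULADD0"] = UnitField(opcode=1, lrf_reads=[0, 1],
                            lrf_writes=[LrfWrite(address=2, stage=0, bus=3)])
```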
Claims (5)
1. An instruction control method for a stream processor, characterized in that instruction control is divided into a stream-level program and a kernel-level program, the stream-level program being responsible for scheduling data between the compute core and off-chip memory and the kernel-level program performing the computation on the data, the specific steps being:
(1) Initializing the scalar data of the kernel-level program: by means of control-word transfer instructions, the stream-level program initializes the scalar data needed during execution of the kernel-level program into the microcontroller registers; when the kernel-level program executes, communication instructions broadcast this data from the microcontroller registers to the compute cluster units; if the kernel-level program needs no scalar data to be initialized, this step can be omitted;
(2) Preparing vector data for the kernel-level program: by means of stream transfer instructions, the vector data to be processed by the compute core, or one part of that vector data when double buffering is used, is loaded from off-chip memory into on-chip memory; steps (1) and (2) can be performed in parallel;
(3) Starting execution of the kernel-level program: after the two preceding steps have completed, the kernel-level program is started;
(4) Executing the kernel-level program: if the kernel-level program does not need synchronous communication with the stream-level program and all the data it needs to process has been placed in on-chip memory, then during its execution the stream-level program continually checks whether the kernel-level program has finished; if the kernel-level program needs synchronous communication with the stream-level program, then once the kernel-level program and the stream-level program have both reached the synchronization point, both can continue executing; if the stream level uses double buffering to supply data to the kernel-level program, then while the kernel-level program processes one part of the data, the stream-level program loads another part from off-chip memory into on-chip memory;
(5) Storing the vector data generated by the kernel-level program into off-chip memory: after waiting for the kernel-level program to finish, the stream-level program stores the vector data it generated into off-chip memory; if double buffering is used for the generated vector data, then in step (4) the stream-level program may already store part of the generated vector data into off-chip memory, and in this step it stores the last part; if the vector data generated by the previous kernel-level program is an intermediate result that is about to be used by the following kernel-level program and can be held entirely in on-chip memory, this step can be omitted;
(6) Reading the scalar results of the kernel-level program: after waiting for the kernel-level program to finish, the stream-level program reads the scalar results out of the microcontroller registers by means of control-word transfer instructions; if the kernel-level program generates no scalar result, this step can be omitted; this step can be performed in parallel with step (5).
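The six steps can be sketched as a host-side driver loop. The following is a sequential software illustration under stated assumptions, not the hardware method itself: the function name `run_kernel`, the dictionary-based stand-ins for the microcontroller registers and on-chip memory, and the toy kernel (scale each element and accumulate a sum) are all hypothetical. On real hardware the prefetch in step (4) would overlap with kernel execution; here it is simply ordered before it.

```python
def run_kernel(off_chip_in, scale, chunk=4):
    """Scale a vector by `scale` and return (scaled vector, sum of results)."""
    ucr = {}            # stand-in for the microcontroller registers (scalars)
    srf = {}            # stand-in for on-chip memory, with two buffers per stream
    off_chip_out = []   # stand-in for off-chip memory holding the results

    # Step (1): initialize the scalar data of the kernel-level program.
    ucr["scale"] = scale
    ucr["sum"] = 0

    # Step (2): load the first part of the vector data (double buffering).
    chunks = [off_chip_in[i:i + chunk] for i in range(0, len(off_chip_in), chunk)]
    if chunks:
        srf["in", 0] = chunks[0]

    for i in range(len(chunks)):
        cur, nxt = i % 2, (i + 1) % 2
        # Step (4), stream level: fetch the next part into the other buffer.
        if i + 1 < len(chunks):
            srf["in", nxt] = chunks[i + 1]
        # Steps (3) and (4), kernel level: start and run the kernel on `cur`.
        out = [x * ucr["scale"] for x in srf["in", cur]]
        ucr["sum"] += sum(out)
        srf["out", cur] = out
        # Step (5): store the generated vector data to off-chip memory.
        off_chip_out.extend(srf["out", cur])

    # Step (6): read the scalar result back from the microcontroller registers.
    return off_chip_out, ucr["sum"]

result, total = run_kernel([1, 2, 3, 4, 5], scale=2, chunk=2)
```

With the inputs shown, `result` is `[2, 4, 6, 8, 10]` and `total` is `30`.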
2. The instruction control method for a stream processor according to claim 1, characterized in that the control of the stream-level instructions is divided into three steps:
(1) Allocation of logical slot numbers and generation of dependencies: the stream-level compiler replaces stream function calls with stream operations; one stream function may be converted into one or more stream operations, and each stream operation is ultimately converted into one stream instruction; the compiler also performs dependency analysis between the stream operations, which comprises two steps: allocating a logical slot number to each stream operation and generating the dependency masks;
(2) Dynamic generation and sending of stream-level instructions: stream-level instructions can only be generated dynamically according to how the program executes; a software module running on the host processor is responsible for dynamically generating and sending them; for this software module to send a stream operation to the stream controller, the following two conditions must be satisfied: first, the logical slot number of the stream operation to be sent is available; the software module continually queries the usage of logical slot numbers in the stream controller, and the stream operation can be sent to the stream controller only when the logical slot number allocated to this instruction is not in use by an instruction in the stream controller's instruction queue; second, if a stream operation A depends on a stream operation B, and B is sent multiple times as part of a double buffer, then A can be sent only after B has completed its last transfer;
(3) Issue and execution of stream-level instructions: the stream controller contains an instruction queue of M entries, which can hold M stream-level instructions at the same time, where M is generally greater than N; issue and execution are divided into three steps: first, the stream-level instruction enters the instruction queue; when it does, its completion dependencies and issue dependencies are updated according to the stream-level instructions already present in the queue; second, the stream-level instruction is issued from the instruction queue to a functional unit for execution; when the issue dependencies and completion dependencies of stream-level instructions are satisfied and the resources they need are idle, the stream controller dynamically selects the stream-level instruction i that entered the queue earliest, issues it to the relevant functional unit for execution, and updates the dependency masks of the instructions in the queue that have an issue dependency on instruction i; third, the stream-level instruction finishes executing and leaves the instruction queue; once the resources needed by the stream-level instruction are idle again, the instruction has finished executing and can retire from the instruction queue; however, because several stream-level instructions may finish in the same clock cycle, the stream controller selects the stream-level instruction k that entered the queue earliest to retire, and updates the dependency masks of the instructions in the queue that have a completion dependency on instruction k.
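The issue and retire rules in step (3) can be illustrated with a small cycle-by-cycle simulation. This is a sketch under simplifying assumptions that the claim does not state: all instructions enter the queue at the start, functional-unit resources are always idle, each instruction has a fixed latency, and at most one instruction issues and one retires per cycle. The names `Instr` and `simulate` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    slot: int            # logical slot number (assumed unique within the queue)
    war_mask: int = 0    # issue dependencies: bit i set means wait for slot i to issue
    raw_mask: int = 0    # completion dependencies: bit i set means wait for slot i to retire
    latency: int = 1     # assumed fixed execution time in cycles
    issued: bool = False
    remaining: int = 0

def simulate(program, max_cycles=1000):
    """Return a trace of (cycle, 'issue' or 'retire', name). Mutates `program`."""
    queue = []
    for ins in program:
        # On entry, keep only dependencies on slots that are still in the queue.
        present = 0
        for q in queue:
            present |= 1 << q.slot
        ins.war_mask &= present
        ins.raw_mask &= present
        queue.append(ins)

    trace = []
    for cycle in range(max_cycles):
        if not queue:
            break
        # Retire the earliest-entered instruction that has finished executing.
        done = next((q for q in queue if q.issued and q.remaining == 0), None)
        if done is not None:
            queue.remove(done)
            for q in queue:
                q.raw_mask &= ~(1 << done.slot)
            trace.append((cycle, "retire", done.name))
        # Issue the earliest-entered instruction whose masks are both clear.
        ready = next((q for q in queue
                      if not q.issued and q.war_mask == 0 and q.raw_mask == 0), None)
        if ready is not None:
            ready.issued = True
            ready.remaining = ready.latency
            for q in queue:
                q.war_mask &= ~(1 << ready.slot)
            trace.append((cycle, "issue", ready.name))
        # Advance every executing instruction by one cycle.
        for q in queue:
            if q.issued and q.remaining > 0:
                q.remaining -= 1
    return trace

def example_program():
    return [
        Instr("load_a", slot=0, latency=3),
        Instr("load_b", slot=1, latency=2),
        Instr("kernel", slot=2, raw_mask=0b011, latency=2),   # needs both loads done
        Instr("store", slot=3, raw_mask=0b100, latency=1),    # needs the kernel done
    ]

trace = simulate(example_program())
```

In the resulting trace the kernel-start instruction issues only after both loads have retired, and the store issues only after the kernel has retired, which is the ordering the completion-dependency masks are meant to enforce.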
3. The instruction control method for a stream processor according to claim 1 or 2, characterized in that the stream-level program and the kernel-level program can communicate by transferring scalar data with control-word transfer instructions: scalar data that the kernel-level program needs during execution is written into the microcontroller register file by control-word transfer instructions, and the scalar results of the kernel-level program are read into the stream controller register file by the same class of stream-level instruction; control-word transfer instructions cannot read or write the local registers inside the compute clusters.
4. The instruction control method for a stream processor according to claim 1 or 2, characterized in that the stream-level program and the kernel-level program can communicate by starting the kernel-level program with the stream-level instruction clustop: after stream-level instructions have prepared the input streams and initialized the scalar data for the kernel-level program, the clustop instruction starts execution of the kernel-level program; when the stream controller detects that the kernel-level program has finished, that is, when the clustop instruction completes, data-transfer stream-level instructions write the vector results that the kernel-level program output to the stream register file into off-chip memory, and control-word transfer instructions read the scalar results of the kernel-level program from the microcontroller registers into the stream controller registers.
5. The instruction control method for a stream processor according to claim 1 or 2, characterized in that the stream-level program and the kernel-level program can communicate synchronously, so as to synchronize the stream-level program with the kernel-level program: a stream-level instruction that follows a synchronization instruction in the stream-level program cannot issue before that synchronization instruction, and the synchronization instruction can issue only when it has reached the head of the instruction queue and the kernel-level program has also reached the synchronization point; when the kernel-level program encounters a synchronization instruction, its whole pipeline stalls; if the stream-level program reaches the synchronization point first, it waits, and when the kernel-level program also reaches the synchronization point, the synchronization instruction releases the kernel-level pipeline, so that the kernel-level program continues executing while the stream-level program also proceeds; the case in which the kernel-level program reaches the synchronization point first is handled similarly.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100345685A CN100461094C (en) | 2007-03-19 | 2007-03-19 | Instruction control method aimed at stream processor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2007100345685A CN100461094C (en) | 2007-03-19 | 2007-03-19 | Instruction control method aimed at stream processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101021779A true CN101021779A (en) | 2007-08-22 |
CN100461094C CN100461094C (en) | 2009-02-11 |
Family
ID=38709554
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2007100345685A Expired - Fee Related CN100461094C (en) | 2007-03-19 | 2007-03-19 | Instruction control method aimed at stream processor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100461094C (en) |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4576758B2 (en) * | 2001-06-21 | 2010-11-10 | ソニー株式会社 | Data processing device |
KR100463642B1 (en) * | 2003-03-06 | 2004-12-29 | 한국과학기술원 | Apparatus for accelerating multimedia processing by using the coprocessor |
WO2004086760A1 (en) * | 2003-03-27 | 2004-10-07 | Matsushita Electric Industrial Co., Ltd. | Data processing apparatus |
US7920584B2 (en) * | 2005-05-04 | 2011-04-05 | Arm Limited | Data processing system |
CN100357932C (en) * | 2006-06-05 | 2007-12-26 | 中国人民解放军国防科学技术大学 | Method for decreasing data access delay in stream processor |
Also Published As
Publication number | Publication date |
---|---|
CN100461094C (en) | 2009-02-11 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
C17 | Cessation of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20090211 Termination date: 20110319 |