CN105893319A - Multi-lane/multi-core system and method - Google Patents


Info

Publication number
CN105893319A
CN105893319A (application CN201410781446.2A)
Authority
CN
China
Prior art keywords
track
processor core
instruction
data
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410781446.2A
Other languages
Chinese (zh)
Inventor
林正浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinhao Bravechips Micro Electronics Co Ltd
Original Assignee
Shanghai Xinhao Bravechips Micro Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xinhao Bravechips Micro Electronics Co Ltd
Priority to CN201410781446.2A
Priority to PCT/CN2015/096769 (WO2016091164A1)
Publication of CN105893319A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 — Digital computers in general; Data processing equipment in general
    • G06F15/16 — Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 — Interprocessor communication
    • G06F15/167 — Interprocessor communication using a common memory, e.g. mailbox
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 — Arrangements for program control, e.g. control units
    • G06F9/06 — Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 — Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 — Concurrent instruction execution, e.g. pipeline or look ahead

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Advance Control (AREA)

Abstract

The present invention provides a multi-lane/multi-core system and method. The system comprises multiple lanes/processor cores, each of which can execute the same or different instructions and access a memory. The system may further comprise a loop controller and a data engine that perform data read/write operations on the memory while a loop program executes, so that no explicit data-access instruction need appear in the loop program. The system can also perform post-processing operations on the execution results of the multiple lanes/processor cores and access the memory; the post-processing may be carried out by the lanes/processor cores themselves or by a dedicated post processor. By performing map/reduce operations with the multi-lane/multi-core system and post processor provided by the present invention, the large number of memory accesses in conventional multi-lane/multi-core systems can be avoided, significantly improving performance.

Description

Multi-lane/multi-core system and method
Technical field
The present invention relates to the fields of computers, communications, and integrated circuits.
Background technology
To improve the operational efficiency of multi-processor-core systems, some computing tasks (such as statistics gathering or matrix operations) can be divided into two stages. The first-stage operations are mapped (map) onto multiple processor cores for parallel execution, shortening the execution time through a high degree of parallelism; the second-stage operations then aggregate (reduce) the execution results of the first stage, thereby obtaining the final result. This approach is known as map/reduce.
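The two-stage pattern described above can be illustrated with a minimal, hypothetical sketch (function names are not from the patent): the map stage is independent per element, so each element could run on its own lane in parallel, while the reduce stage aggregates the intermediate results into the final answer.

```python
# Minimal sketch of the two-stage map/reduce pattern.
from functools import reduce

def map_stage(data, fn):
    # On a multi-lane system, each element would be handled by its own lane.
    return [fn(x) for x in data]

def reduce_stage(intermediate, op, init):
    # Second stage: aggregate the intermediate results.
    return reduce(op, intermediate, init)

squares = map_stage([1, 2, 3, 4], lambda x: x * x)   # [1, 4, 9, 16]
total = reduce_stage(squares, lambda a, b: a + b, 0)  # 30
```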
A common current method for handling map/reduce problems connects memory and a large number of computers/processors through a network. After completing the map operation, each computer/processor stores its intermediate results in memory; during the subsequent reduce operation, the corresponding intermediate results are read back from memory to complete the remaining work. However, this approach requires frequent, time-consuming memory accesses, and transferring data over the network introduces long delays, so efficiency is low. If an ordinary single-instruction single-data (SISD) processor or a multi-core processor is used instead, the limited number of arithmetic logic units (ALUs) or processor cores makes truly large-scale parallel execution of the map stage impossible.
Another method uses a lane processor to implement map/reduce. A single-instruction multiple-data (SIMD) graphics processing unit (GPU) is a common lane processor. A GPU has a large number of lanes, divided into groups of several lanes each. All lanes in a lane group can execute the same instruction on the same or different data, improving parallel-execution efficiency, which meets the requirements of the map stage well.
Refer to Fig. 1, which shows an embodiment of a GPU according to the prior art. Lanes 11, 12, 13, and 14 form a lane group; they execute the same instruction simultaneously and share the same memory (not shown in Fig. 1). There is no direct data path connecting the lanes; instead, one lane writes data to memory and another lane reads that data back from memory, realizing data transfer between the two lanes.
Such a GPU uses a dedicated SIMD instruction set (for example, the PTX instruction set) different from that of a general-purpose processor (CPU), and a dedicated development environment (for example, the CUDA environment) different from the CPU programming environment. This increases the difficulty of programming, and the conversion between the two instruction sets is awkward, which also affects overall system efficiency.
In addition, because the lanes in a GPU lane group must execute the same instruction, execution efficiency is necessarily lower than it would be if each lane could execute a different instruction. Although a GPU achieves high execution efficiency in the map stage, entering the reduce stage still incurs delays and stalls caused by frequent memory accesses, and simultaneous memory accesses from multiple lanes place high demands on bandwidth.
The present invention discloses a brand-new lane-processor system architecture that fundamentally solves all of the above problems.
Summary of the invention
The present invention proposes a multi-lane/multi-core system comprising a plurality of lanes/processor cores. Each lane/processor core has a distinct lane/processor-core number, and each can execute the same or different instructions and access memory. The system also performs post-processing operations on the execution results of the plurality of lanes/processor cores, and accesses memory.
Optionally, in the system, a global bus among the plurality of lanes/processor cores transfers register data to carry out cross-lane/cross-core register-value moves or computing operations.
Optionally, in the system, the lanes/processor cores of the multi-lane/multi-core system are divided into a plurality of lane/processor-core groups. The bus switches on the global bus within each group are turned on, and the bus switches on the global bus between groups are turned off, so that each group can simultaneously carry out cross-lane/cross-core register-value moves or computing operations among its members. When different lanes/processor cores simultaneously execute the same aggregation-degree setting instruction, the corresponding bus switches are configured, realizing the corresponding degree of aggregation among the lanes/processor cores.
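The bus-switch configuration can be modeled with a small sketch, assuming (hypothetically) one switch between each pair of adjacent lanes: switches inside a group are on, switches at group boundaries are off.

```python
def group_switches(num_lanes, group_size):
    """Return on/off states for the (num_lanes - 1) bus switches between
    adjacent lanes: ON inside a group, OFF between groups.
    Illustrative model only; the patent does not specify this topology."""
    return [(i + 1) % group_size != 0 for i in range(num_lanes - 1)]

# 8 lanes grouped in fours: the switch between lanes 3 and 4 is off,
# splitting the global bus into two independent aggregation groups.
states = group_switches(8, 4)  # [True, True, True, False, True, True, True]
```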
Optionally, in the system, when different lanes/processor cores simultaneously execute the same inter-lane/inter-core operation instruction, the source and target lanes/processor cores are determined from the respective lane/processor-core numbers; the register value of the source lane/processor core is delivered to the target lane/processor core over the inter-lane/inter-core path, and the target lane/processor core performs the post-processing operation.
Optionally, in the system, an instruction moves the lane/processor-core number of each lane/processor core into that lane/processor core's general-purpose registers.
Optionally, in the system, different lanes/processor cores compute different data addresses from the same instruction according to their different lane/processor-core numbers.
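How one shared instruction can yield a distinct address per lane can be sketched as follows; the names and the linear base-plus-offset formula are illustrative assumptions, not the patent's actual address computation.

```python
def lane_data_address(base, lane_id, element_size):
    """Each lane folds its own lane number into the address computation,
    so the SAME instruction produces a DIFFERENT address on every lane."""
    return base + lane_id * element_size

# Four lanes executing one load instruction each fetch a distinct word.
addrs = [lane_data_address(0x1000, lane, 4) for lane in range(4)]
# -> [0x1000, 0x1004, 0x1008, 0x100C]
```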
Optionally, the system further comprises one or more post processors. Each post processor is connected to a plurality of lanes/processor cores, receives the execution results of those lanes/processor cores, and aggregates the results.
Optionally, in the system, the post processor stores the aggregation result directly into memory.
Optionally, in the system, the post processor sends the aggregation result back into the registers of at least one lane/processor core.
Optionally, in the system, the degree of aggregation of the post-processing operation is determined by an instruction.
Optionally, in the system, the post processor performs the aggregation operation by executing a post-processing instruction.
Optionally, in the system, the post processors are connected by a transfer bus; each post processor, by executing the post-processing instruction, aggregates the execution result of its corresponding lane/processor core with the output of the adjacent post processor.
Optionally, in the system, the post processors are connected by a tree-shaped bus. The first-level post processors, by executing the post-processing instruction, aggregate the execution results of two corresponding lanes/processor cores and pass the aggregation result, together with the post-processing instruction or its decoded result, level by level to the remaining post processors; each remaining level, by executing the post-processing instruction, aggregates again the aggregation results of two previous-level post processors.
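The tree-shaped aggregation can be sketched in software: each level pairwise combines the results of the previous level until one value remains. This is a minimal model, assuming a power-of-two number of lanes and an associative combining operation.

```python
def tree_reduce(values, op):
    """Pairwise reduction mimicking the tree-shaped post-processor bus:
    level 1 combines lane pairs; each further level combines two results
    of the previous level. Assumes len(values) is a power of two."""
    level = list(values)
    while len(level) > 1:
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Eight lane results summed in log2(8) = 3 tree levels.
total = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)  # 36
```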
Optionally, in the system, the paths between lanes/processor cores and post processors, and between post processors, are configurable.
Optionally, in the system, by configuring these paths to be connected or disconnected, a plurality of post processors realize group-wise aggregation after grouping.
Optionally, in the system, by configuring these paths to be connected or disconnected, the post processors realize aggregation operations of different degrees of aggregation.
Optionally, in the system, a general judging module produces, from the control signal it has output, the next control signal for continuing from the current state, and also receives a next control signal for not continuing from the current state. According to the feedback from the system running under the control of the output control signal, the general judging module selects one of these two next control signals to output, so that the system continues to run.
The general judging module comprises at least an arithmetic unit, a register, and a selector. The register stores the current control signal and outputs it to control the system's operation. The arithmetic unit produces, from the state of the current control signal, the next control signal for continuing from the current state, and sends that next control signal to the selector. According to the feedback from the system running under the control of the current control signal, the selector chooses between the next control signal produced by the arithmetic unit and the received next control signal for not continuing from the current state, and updates the register with the selected result.
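The register/arithmetic-unit/selector structure can be modeled as a small state machine; all names here are illustrative, and the successor function stands in for whatever the arithmetic unit computes.

```python
class GeneralJudgingModule:
    """Sketch of the judging module: the register holds the current control
    signal, the arithmetic unit computes its 'continue' successor, and the
    selector latches either that successor or an externally supplied
    'do not continue' signal, depending on run feedback."""
    def __init__(self, start, successor_fn):
        self.register = start              # current control signal
        self.successor_fn = successor_fn   # arithmetic unit

    def step(self, continue_feedback, branch_signal):
        nxt = self.successor_fn(self.register)  # continue-from-current-state
        # Selector: run feedback decides which next control signal is latched.
        self.register = nxt if continue_feedback else branch_signal
        return self.register

m = GeneralJudgingModule(0, lambda s: s + 1)
m.step(True, 99)   # -> 1 (continue)
m.step(True, 99)   # -> 2 (continue)
m.step(False, 99)  # -> 99 (feedback says: take the received signal)
```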
Optionally, the system comprises one or more loop controllers, each corresponding to one loop body in an instruction sequence, counting the number of executions of that loop body and determining whether the loop has finished; and one or more data engines, divided into groups each containing at least one data engine. Each group of data engines corresponds to one loop controller, computes the addresses of the data used in the loop body, and controls memory to complete the data-access operations. The loop count in a loop controller is set by an instruction; each time the corresponding loop instruction is reached, the loop count is decremented by one; after all iterations of the loop instruction have completed, the loop count is reset to its originally set value.
Optionally, in the system, whenever the loop instruction is reached, the data engine updates the data address and fetches the corresponding data at the new address in preparation for use by a lane/processor core. If the execution result of the loop instruction indicates the loop continues, the data engine obtains the new data address by adding the address step to the current data address; if it indicates the loop ends, the data engine resets the data address to its originally set value as the new data address.
Optionally, in the system, the data engine further comprises a FIFO buffer. Once the data engine has been configured, it fetches the data at the configured data address and stores it into the FIFO buffer for lanes/processor cores to use. After each fetch completes, the data address is updated and the data at the new address is fetched into the FIFO buffer. Whenever the loop instruction is reached, the FIFO buffer discards its oldest entry, so that the next-oldest entry becomes the new oldest. If the execution result of the loop instruction indicates the loop ends, the data engine resets the data address to its originally set value as the new data address and empties the FIFO buffer.
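A minimal behavioral sketch of the loop-driven data engine with its FIFO buffer, under stated assumptions: memory is modeled as a dict, and the prefetch depth, class name, and method names are illustrative rather than taken from the patent.

```python
from collections import deque

class DataEngine:
    """Prefetches data sequentially into a FIFO; advances the address by
    `step` while the loop continues, and resets the address and empties
    the FIFO when the loop ends."""
    def __init__(self, memory, base, step, depth=2):
        self.memory, self.base, self.step = memory, base, step
        self.addr = base
        self.fifo = deque()
        for _ in range(depth):       # fill the buffer ahead of use
            self._prefetch()

    def _prefetch(self):
        self.fifo.append(self.memory[self.addr])
        self.addr += self.step

    def on_loop_instruction(self, loop_continues):
        if loop_continues:
            value = self.fifo.popleft()  # oldest entry is consumed
            self._prefetch()             # keep the buffer topped up
            return value
        self.addr = self.base            # loop ended: reset and flush
        self.fifo.clear()
        return None

mem = {0: 10, 4: 20, 8: 30, 12: 40}
engine = DataEngine(mem, base=0, step=4, depth=2)
engine.on_loop_instruction(True)   # -> 10
engine.on_loop_instruction(True)   # -> 20
engine.on_loop_instruction(False)  # loop ends: FIFO emptied, address reset
```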
Optionally, in the system, the data engine further comprises a fusion module. After receiving the data and corresponding address that a lane/processor core writes to memory, the fusion module first reads the previously stored data at that address from memory, performs a computing operation on it together with the data sent by the lane/processor core, and then writes the result back to memory at that address.
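The fusion module's read-modify-write behavior can be sketched as follows; memory is modeled as a dict, and the accumulate operation is an illustrative choice of combining function, not one specified by the patent.

```python
class FusionModule:
    """On a lane's write, read the old value at the address, combine it with
    the incoming value, and write the fused result back to the same address."""
    def __init__(self, memory, combine=lambda old, new: old + new):
        self.memory = memory
        self.combine = combine

    def lane_write(self, addr, value):
        old = self.memory.get(addr, 0)                # read previously stored data
        self.memory[addr] = self.combine(old, value)  # write back the fused result
        return self.memory[addr]

mem = {0x20: 5}
fusion = FusionModule(mem)
fusion.lane_write(0x20, 7)  # memory[0x20] becomes 5 + 7 = 12
```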
Optionally, in the system, each lane/processor core executes the same program, and a post processor compares the execution results of the lanes/processor cores to judge whether any lane/processor core is working abnormally, thereby realizing self-test of the system; when an abnormal lane/processor core exists, its lane/processor-core number is determined.
Optionally, in the system, the lane/processor-core number of an abnormal lane/processor core is stored in a lane allocator; when allocating lanes/processor cores, the lane allocator bypasses the abnormal lane/processor core, thereby realizing self-repair of the system.
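The self-test and self-repair idea can be sketched in a few lines, assuming (as one illustrative policy, not the patent's specified one) a majority vote over lane results to flag abnormal lanes:

```python
def find_faulty_lanes(results):
    """Self-test sketch: every lane ran the same program, so a lane whose
    result differs from the majority result is flagged as abnormal."""
    majority = max(set(results), key=results.count)
    return [i for i, r in enumerate(results) if r != majority]

def allocate_lanes(num_lanes, faulty, needed):
    """Self-repair sketch: the lane allocator skips lanes recorded as faulty."""
    healthy = [i for i in range(num_lanes) if i not in set(faulty)]
    return healthy[:needed]

faulty = find_faulty_lanes([42, 42, 7, 42])  # lane 2 disagrees -> [2]
lanes = allocate_lanes(4, faulty, needed=3)  # [0, 1, 3]: lane 2 bypassed
```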
The present invention also proposes a multi-lane/multi-core execution method. Each lane/processor core has a distinct lane/processor-core number and can execute the same or different instructions; post-processing operations are then performed on the execution results of the plurality of lanes/processor cores, with access to memory.
Optionally, in the method, register data is transferred over the global bus among the plurality of lanes/processor cores to carry out cross-lane/cross-core register-value moves or computing operations.
Optionally, in the method, the lanes/processor cores of the multi-lane/multi-core system are divided into a plurality of lane/processor-core groups. The bus switches on the global bus within each group are turned on, and the bus switches on the global bus between groups are turned off, so that each group can simultaneously carry out cross-lane/cross-core register-value moves or computing operations among its members. When different lanes/processor cores simultaneously execute the same aggregation-degree setting instruction, the corresponding bus switches are configured, realizing the corresponding degree of aggregation among the lanes/processor cores.
Optionally, in the method, when different lanes/processor cores simultaneously execute the same inter-lane/inter-core operation instruction, the source and target lanes/processor cores are determined from the respective lane/processor-core numbers; the register value of the source lane/processor core is delivered to the target lane/processor core over the inter-lane/inter-core path, and the target lane/processor core performs the post-processing operation.
Optionally, in the method, an instruction moves the lane/processor-core number of each lane/processor core into that lane/processor core's general-purpose registers.
Optionally, in the method, different lanes/processor cores compute different data addresses from the same instruction according to their different lane/processor-core numbers.
Optionally, in the method, one or more post processors are connected to the plurality of lanes/processor cores, receive the execution results of the plurality of lanes/processor cores, and aggregate those results.
Optionally, in the method, the post processor stores the aggregation result directly into memory.
Optionally, in the method, the post processor sends the aggregation result back into the registers of at least one lane/processor core.
Optionally, in the method, the degree of aggregation of the post-processing operation is determined by an instruction.
Optionally, in the method, the post processor performs the aggregation operation by executing a post-processing instruction.
Optionally, in the method, the post processors are connected by a transfer bus; each post processor, by executing the post-processing instruction, aggregates the execution result of its corresponding lane/processor core with the output of the adjacent post processor.
Optionally, in the method, the post processors are connected by a tree-shaped bus. The first-level post processors, by executing the post-processing instruction, aggregate the execution results of two corresponding lanes/processor cores and pass the aggregation result, together with the post-processing instruction or its decoded result, level by level to the remaining post processors; each remaining level, by executing the post-processing instruction, aggregates again the aggregation results of two previous-level post processors.
Optionally, in the method, the paths between lanes/processor cores and post processors, and between post processors, are configurable.
Optionally, in the method, by configuring these paths to be connected or disconnected, a plurality of post processors realize group-wise aggregation after grouping.
Optionally, in the method, by configuring these paths to be connected or disconnected, the post processors realize aggregation operations of different degrees of aggregation.
Optionally, the method further comprises a control method. From the control signal that has been output, the next control signal for continuing from the current state is produced, and a next control signal for not continuing from the current state is received; then, according to the feedback from the system running under the control of the output control signal, one of these two next control signals is selected as output so that the system continues to run. The control method at least includes: storing the current control signal and using it to control system operation; producing, from the state of the current control signal, the next control signal for continuing from the current state; and, according to the feedback from the system running under the control of the current control signal, selecting between that next control signal and the received next control signal for not continuing from the current state, and updating the current control signal with the selected result.
Optionally, the method further comprises: counting the number of executions of a loop body with one or more loop controllers to determine whether the loop has finished, each loop controller corresponding to one loop body in the instruction sequence; and computing, with one or more data engines corresponding to the loop controllers, the addresses of the data used in each loop controller's loop body, controlling memory to complete the data-access operations. The loop count in a loop controller is set by an instruction; each time the corresponding loop instruction is reached, the loop count is decremented by one; after all iterations of the loop instruction have completed, the loop count is reset to its originally set value.
Optionally, in the method, whenever the loop instruction is reached, the data engine updates the data address and fetches the corresponding data at the new address in preparation for use by a lane/processor core. If the execution result of the loop instruction indicates the loop continues, the data engine obtains the new data address by adding the address step to the current data address; if it indicates the loop ends, the data engine resets the data address to its originally set value as the new data address.
Optionally, in the method, a FIFO buffer is used to buffer data. Once the data engine has been configured, it fetches the data at the configured data address and stores it into the FIFO buffer for lanes/processor cores to use. After each fetch completes, the data address is updated and the data at the new address is fetched into the FIFO buffer. Whenever the loop instruction is reached, the FIFO buffer discards its oldest entry, so that the next-oldest entry becomes the new oldest. If the execution result of the loop instruction indicates the loop ends, the data engine resets the data address to its originally set value as the new data address and empties the FIFO buffer.
Optionally, in the method, a fusion module merges the data in memory with the data sent by a lane/processor core. After receiving the data and corresponding address that a lane/processor core writes to memory, the fusion module first reads the previously stored data at that address from memory, performs a computing operation on it together with the data sent by the lane/processor core, and then writes the result back to memory at that address.
Optionally, in the method, each lane/processor core executes the same program, and a post processor compares the execution results of the lanes/processor cores to judge whether any lane/processor core is working abnormally, thereby realizing self-test of the system; when an abnormal lane/processor core exists, its lane/processor-core number is determined.
Optionally, in the method, the lane/processor-core number of an abnormal lane/processor core is stored, and the abnormal lane/processor core is bypassed when lanes/processor cores are allocated, thereby realizing self-repair of the system.
The present invention also proposes a method of executing a program on a multi-lane/multi-core system using normalized lane/processor-core numbers, each lane/processor core corresponding to one normalized lane/processor-core number.
Optionally, in the method, when a plurality of lanes/processor cores execute a loop program, each loop iteration triggers an update of the data address, and memory reads or writes are performed at the new data address, thereby avoiding explicit data-access instructions in the loop program.
Optionally, in the method, the plurality of lanes/processor cores execute the same data-engine setting instruction in parallel; the data engine computes and produces one or more data addresses from the configuration information, and performs memory reads or writes at those data addresses.
Optionally, in the method, the data engine computes the starting data address for each lane/processor core at least from that lane/processor core's corresponding normalized lane/processor-core number and the address gap.
Optionally, in the method, the data engine computes the data address for each loop iteration at least from the address step corresponding to each lane/processor core.
Optionally, in the method, when loops are nested in multiple levels, the plurality of lanes/processor cores execute the same loop-setting instruction in parallel, configuring the loop controller; the configuration information in the loop-setting instruction at least includes the loop count of the loop body. The plurality of lanes/processor cores also execute the same loop instruction in parallel; when the loop instruction is executed, the loop controller performs the corresponding count. Through this count: if the number of completed iterations is less than the loop count of the loop body, the loop controller directs reading of the loop body's initial instruction for execution, thereby repeating the loop body; if the number of completed iterations equals the loop count of the loop body, the loop controller directs reading of the next sequential instruction after the loop body for execution, thereby ending execution of the loop body.
Optionally, in the method, each loop controller cooperates with one or more data engines; each iteration triggers the data engines to compute new data addresses and perform memory reads or writes at those new data addresses, thereby avoiding explicit data-access instructions in the loop program.
Optionally, in the method, execution of the loop instruction serves as the trigger condition.
Optionally, in the method, a program that needs to be executed multiple times is unfolded in two dimensions for parallel execution by a plurality of lanes/processor cores; the number of executions is the degree of parallelism. The two-dimensional unfolding includes space unfolding and time unfolding. In space unfolding, a plurality of lanes/processor cores simultaneously execute the same instruction on different data, so the program is unfolded along the spatial dimension and each lane/processor core executes the program in parallel. In time unfolding, when the number of available lanes/processor cores is less than the degree of parallelism, the plurality of lanes/processor cores execute the program multiple times, so the program is unfolded along the time dimension and the lanes/processor cores execute the program serially in successive rounds.
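The space/time trade-off described above reduces to simple arithmetic, sketched below; the function name is illustrative, and the sketch assumes the lanes are used fully in every round except possibly the last.

```python
import math

def unfold(parallelism, available_lanes):
    """Two-dimensional unfolding: the space dimension is how many lanes run
    the program at once; the time dimension is how many serial rounds are
    needed when the required parallelism exceeds the available lanes."""
    space = min(parallelism, available_lanes)
    time = math.ceil(parallelism / available_lanes)
    return space, time

unfold(16, 8)  # -> (8, 2): 8 lanes, 2 serial rounds
unfold(4, 8)   # -> (4, 1): enough lanes, pure space unfolding
```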
Optionally, in the process, track allotter gradually deducts according to the space resources number of PROGRAMMED REQUESTS Available resources space number, will be mapped as the use to time resource to the requirement of space resources.
Optionally, in the process, by the track normalizing that allotter output reference track/processor core is corresponding Change track number, thus affect the starting data address of data engine in benchmark track/processor core, be defined as sky Between launch space starting point;Track/processor check figure definition current spatial can be used to launch with track allotter Space scale, and calculate time scale to control time/space conversion when track/processor core performs.
Optionally, in the process, by adjusting the normalization track number of benchmark track/processor core, really Determine the expansion execution of epicycle time is which part in space development.
Optionally, in the process, the program being run or job sequence degree of parallelism demand is clearly provided; When running described program or job sequence, by described multilane/multiple nucleus system according to described degree of parallelism demand certainly Dynamic distribution track/processor core;When available track/processor core number cannot meet described degree of parallelism demand, The described program of execution is circulated several times, to meet described degree of parallelism demand by described multilane/multiple nucleus system.
Optionally, in the method, when compiling the program the compiler determines the maximum degree of parallelism with which the loop body can be executed in parallel, and generates a set-parallelism instruction containing this maximum degree of parallelism; when the multi-lane/multi-core system executes the loop, the lane allocator executes the set-parallelism instruction and allocates lanes/processor cores according to the number available, determining how many lanes/processor cores participate in parallel execution and how many times those lanes/processor cores execute the loop.
Optionally, in the method, the remaining degree of parallelism is obtained by subtracting from the program's degree of parallelism the number of iterations already expanded and executed in parallel; when the remaining degree of parallelism is less than the number of available lanes/processor cores, the lane allocator allocates the corresponding number of lanes/processor cores to execute the program in parallel; after that round of execution completes, the whole program has been executed.
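The space/time folding described in the preceding paragraphs can be sketched as follows. This is an illustrative model only; the function name and the representation of a round as (spatial offset, lanes used) are assumptions, not taken from the patent.

```python
def schedule(parallelism, available_lanes):
    """Fold a spatial parallelism demand onto limited lanes over time.

    Returns a list of rounds; each round is (spatial_offset, lanes_used),
    where spatial_offset is the normalized lane number given to the
    reference lane for that round of temporal expansion.
    """
    rounds = []
    remaining = parallelism
    offset = 0
    while remaining > 0:
        lanes = min(available_lanes, remaining)  # last round may be partial
        rounds.append((offset, lanes))
        offset += lanes
        remaining -= lanes
    return rounds

# A demand of 10 on 4 available lanes takes 3 rounds: 4 + 4 + 2 iterations.
print(schedule(10, 4))
```

In the last round the remaining degree of parallelism (2) is below the available lane count, matching the final "Optionally" clause above.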
Those skilled in the art will, guided by the description, claims and drawings of the present invention, be able to understand and appreciate other aspects covered by the present invention.
Beneficial Effects
The multi-lane system of the present invention, which performs map/aggregation operations using inter-lane processing and post-processors, avoids the large number of memory access operations required in conventional multi-lane systems and thus significantly improves performance. A program explicitly declares its degree of parallelism; the processor allocates execution resources as requested and, when the demanded parallelism cannot be met at once, automatically loops over the instruction segment to satisfy the program's requirement. The processor system uses inter-lane processing or post-processing for self-test and self-repair, reducing test cost and increasing system yield and reliability.
Other advantages and applications of the present invention will be apparent to those skilled in the art.
Brief Description of the Drawings
Fig. 1 is an embodiment of a graphics processor implemented according to the prior art;
Fig. 2 is an embodiment of the extended instruction formats of the present invention;
Fig. 3 is an embodiment of the inter-lane global bus of the present invention;
Fig. 4 is an embodiment of a multi-lane system comprising improved lanes according to the present invention;
Fig. 5 is an embodiment of performing an aggregation operation on the lanes that performed the map operation, using the inter-lane interconnect bus and the existing computing resources of each lane;
Fig. 6 is an embodiment in which the inter-lane instruction aggregation degree is '2';
Fig. 7 is an embodiment in which the inter-lane instruction aggregation degree is '4';
Fig. 8 is an embodiment of a post-processor with a tree-structured bus;
Fig. 9 is an embodiment of a post-processor with a transfer-bus structure;
Fig. 10A is an embodiment of a reconfigurable tree bus;
Fig. 10B is an embodiment of a configuration of the reconfigurable tree bus;
Fig. 10C is another embodiment of a configuration of the reconfigurable tree bus;
Fig. 10D is another embodiment of a configuration of the reconfigurable tree bus;
Fig. 11 is an embodiment of a reconfigurable transfer bus;
Fig. 12A is an embodiment of a concrete structure with a reconfigurable aggregation degree;
Fig. 12B is an embodiment of the configuration when the aggregation degree is '2';
Fig. 12C is an embodiment of the configuration when the aggregation degree is '4';
Fig. 12D is an embodiment of the configuration when the aggregation degree is '8';
Fig. 13A is an embodiment of a multi-lane system with improved loop efficiency;
Fig. 13B is an embodiment of the multi-lane system with improved loop efficiency applied to matrix multiplication;
Fig. 14A is an embodiment of a matrix multiplication;
Fig. 14B is a schematic diagram of the step-by-step generation of the matrix multiplication result;
Fig. 15 is an embodiment of an instruction sequence implementing the matrix multiplication.
Detailed Description
The multi-lane system and method proposed by the present invention are described in further detail below with reference to the drawings and specific embodiments. The advantages and features of the present invention will become apparent from the following description and claims. It should be noted that the drawings are all in a greatly simplified form and not drawn to scale, serving only to illustrate the embodiments of the present invention conveniently and clearly.
It should be noted that, in order to describe the present disclosure clearly, multiple embodiments are enumerated to further explain different implementations of the present invention; these embodiments are illustrative rather than exhaustive. In addition, for brevity, content already covered in earlier embodiments is often omitted in later embodiments; content not described in a later embodiment may therefore be understood with reference to the earlier ones.
Although the present invention may be extended in various forms of modification and substitution, the description lists concrete embodiments and describes them in detail. It should be understood that the inventor's intent is not to limit the invention to the specific embodiments illustrated; on the contrary, the intent is to cover all improvements, equivalent transformations and modifications made within the spirit or scope defined by the claims. The same part numbers may be used throughout the drawings to denote the same or similar parts.
In addition, although this description uses a multi-lane system comprising lanes as its example, the technical solution of the present invention may also be applied to a multi-core system comprising any suitable processor (Processor). For example, the processor may be a processor core (Processor Core), a general processor (General Processor), a central processing unit (CPU), a microcontroller (MCU), a digital signal processor (DSP), a graphics processor core (GPU Core), a system on a chip (SOC), an application-specific integrated circuit (ASIC), and so on.
The instruction address (Instruction Address) in the present invention refers to the storage address of an instruction in main memory, i.e. the address at which the instruction can be found in main memory; the data address (Data Address) refers to the storage address of data in main memory, i.e. the address at which the data can be found in main memory. For simplicity, the virtual address is here assumed to be equal to the physical address; the method of the present invention is equally applicable where address translation is required. In the present invention, the current instruction refers to the instruction currently being executed or fetched by a lane; the current instruction block refers to the instruction block containing the instruction currently being executed by a lane.
In the present invention, when executing the same data access instruction, each lane of the multi-lane system can compute a different data address from its own lane number, so that each lane accesses different data, implementing SIMD operation. Unlike the prior art, in which there is either no lane number at all or the lane number comes from hardwiring, the lane number participating in data address computation in the multi-lane system of the present invention can come from a pre-configured register value or a dynamically configured register value.
Specifically, according to the technical solution of the present invention, the lane number (Lane Number, LN) can be written into a dedicated lane register in each lane at system initialization. In this way, once a lane of the multi-lane system fails (for example, due to a manufacturing defect), lane numbers can be reassigned to the remaining lanes in the system and written into the lane registers of the corresponding lanes, substituting a redundant lane for the failed lane and improving the yield and reliability of the multi-lane system. In addition, according to the technical solution of the present invention, lane numbers can also be dynamically assigned to each lane by programming while the multi-lane system is running, improving the flexibility of the multi-lane system.
In the present invention, a new extended instruction can be obtained by extending an existing CPU instruction set, for moving (move) the lane number from the dedicated register into a general register of the lane to carry out subsequent operations. Unlike existing multi-lane systems, the present invention does not necessarily require the extended instruction to be compiled by a new development environment (such as CUDA). Specifically, the extended instruction can have the same instruction format as existing CPU instructions. Of course, the present invention can also run extended instructions compiled by a new development environment.
Please refer to Fig. 2, which shows an embodiment of the extended instruction formats of the present invention. In this embodiment the textbook DLX instruction set is used as the CPU instruction set for illustration; other CPU instruction sets can be handled with the same method and are not repeated here. Instruction format A shows the format of a typical register-type (R-Type) instruction in the DLX instruction set, in which instruction field 21 is the operation code (OPcode), instruction fields 22 and 24 are the two source register numbers, instruction field 23 is the destination register number, and instruction fields 25 and 26 are the extended operation code.
According to the technical solution of the present invention, a lane-move instruction can be extended on the basis of the register-type instruction. If the lane numbers of the multi-lane system are fixed in each lane by hardwiring or read-only memory (ROM), then in instruction format A instruction field 21 is the corresponding operation code, instruction field 23 is the destination register number, and instruction fields 22, 24, 25 and 26 are all unused (or reserved for other extensions). In this case, when the instruction decoder of a lane decodes this instruction as a lane move, it can directly store the hardwired lane number into the register pointed to by instruction field 23.
If the multi-lane system stores lane numbers in registers, software first writes the initial lane number into the register of each lane. In this case, when the instruction decoder of a lane decodes this instruction as a lane move, it can directly move the value (the lane number) in the lane register serving as the source register into the general register serving as the destination register, or configure the value in the general register serving as the source register into the lane register serving as the destination register as the lane number. An on-chip controller can also be used to fill in the lane number of each lane.
The multi-lane system of the present invention can also have a global bus for transferring data between all lanes. Please refer to Fig. 3, which shows an embodiment of the inter-lane global bus of the present invention. This embodiment takes four lanes 31, 32, 33 and 34 as its example, whose corresponding lane numbers are '0', '1', '2' and '3' respectively, and each lane comprises independent storage and arithmetic units, such as a register file (RF) and an execution unit (EX). The following description uses processor lanes performing register-register operations as its example, but the same method and system can equally be applied to other types of processors, such as stack-based, accumulator-based and register-memory processors, as the lanes of a multi-lane processor. Unlike the prior art, in this embodiment the register file of each lane has a read port 36 that can be connected by a read switch (such as the read switch 36 corresponding to lane 31) to the portion of the global bus 35 corresponding to that lane, and a write port 38 that is also connected to the global bus 35. The portions of the global bus corresponding to the lanes are in turn connected or disconnected by bus switches (such as bus switch 37 corresponding to lane 31), so that the register value of any one lane can be sent to any other lane. In addition, in this embodiment, the data on the global bus 35 can be stored into a register file through the write port (for example, a register value of lane 31 is stored into the register file of lane 33 through the global bus 35), or can be bypassed directly to an execution unit to perform the corresponding computation (for example, a register value of lane 31 is sent through the global bus 35 to the execution unit 39 of lane 34 and added to a register value of lane 34).
Furthermore, the lanes of the multi-lane system can be divided into several lane groups, with the bus switches inside each lane group turned on and the bus switches between lane groups turned off, so that each lane group can simultaneously perform its own cross-lane register moves or computations. Each lane has a reference register, written by the lane allocator. The lane allocator assigns some lanes of the multi-lane system to a thread; the reference register of one lane in the group is set to mark it as the reference lane (Starting Lane), whose operation differs slightly from that of the remaining lanes in the group, and the reference registers of the remaining lanes in the group are set to mark them as non-reference lanes. The examples below assume the leftmost lane in each group is the reference lane.
According to the technical solution of the present invention, the same instruction can simultaneously write a register value of one lane into the same destination register of all other lanes; such an instruction is called a broadcast instruction and can take format A of Fig. 2. In this instruction, instruction field 21 is the broadcast-move operation code, instruction field 22 is the source register number, and instruction field 23 is the destination register number, which may be the same as or different from the source register number. Assume that in Fig. 3 lane 31 is the reference lane. When lanes 31, 32, 33 and 34 execute this instruction, lane 31 decodes it as a broadcast move, checks the reference register of its lane, and, since its value marks it as the reference lane, turns on the read switch 36 corresponding to lane 31; lanes 32, 33 and 34 decode the broadcast move, check the reference registers of their respective lanes, and, since their values mark them as non-reference lanes, turn off their corresponding read switches. If in this broadcast move instruction field 22 is R8 and instruction field 23 is R9, then under the control of each lane's instruction decoder the value of register R8 of lane 31 is read and delivered over the global bus 35 to register R9 of lanes 32, 33 and 34 for storage. In another broadcast mode, instruction fields 25 and 26 of the instruction specify a lane number; the lane in the group having this lane number serves as the source lane (which may or may not be the reference lane), and the lanes in the group not having this lane number act as the remaining lanes do in the example above. The concrete operation process is as described above and is not repeated here. This operation can implement common parameter passing, scalar broadcast, lane-to-lane transfer and similar operations.
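A minimal software model of the broadcast move may clarify the behavior: only the source lane's read switch drives the global bus, and every lane's write port latches the value. The dictionary-based register files and the function name are illustrative assumptions, not part of the patent.

```python
def broadcast_move(lanes, src_lane, src_reg, dst_reg):
    """Copy one register of the source lane into dst_reg of every lane."""
    value = lanes[src_lane][src_reg]   # only the source lane's read switch is on
    for regfile in lanes.values():     # every lane's write port stores the value
        regfile[dst_reg] = value

# Lane 0 (the reference lane) broadcasts R8 into R9 of all four lanes.
lanes = {0: {"R8": 42}, 1: {}, 2: {}, 3: {}}
broadcast_move(lanes, 0, "R8", "R9")
print([lanes[i]["R9"] for i in range(4)])
```

Note that the source lane also receives the value in its destination register, consistent with all lanes decoding the same instruction.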
According to the technical solution of the present invention, a multi-lane parallel data access instruction can also be extended on the basis of the existing instruction set. When the multi-lane system performs SIMD operations, the lanes are expected to access different data addresses simultaneously under the control of the same data access instruction. In the present invention, the data address can be determined from the lane number, so that each lane performs a different access under the control of the same instruction. Taking as an example the case where the data accessed by adjacent lanes are separated by the same address interval Diff (the inter-lane stride), if the data address accessed by the reference lane (hereinafter the starting address, Starting Address) and the address interval are known, each remaining lane only needs to determine the difference between its own lane number and the lane number of the reference lane (hereinafter the starting lane number, Starting LN) to compute its own data address and access its own data.
In the present invention, the difference between the lane number of a lane and the starting lane number is called the normalized lane number (Normalized Lane Number, NLN), i.e. NLN = LN − Starting LN. This normalization can be implemented using the lane-move instruction described above together with a cross-lane computation instruction. Specifically, all lanes first execute the same lane-move instruction, moving their lane numbers into a certain register (for example register R1 of each lane). Then all lanes execute the same broadcast subtract instruction, in which instruction field 21 is the broadcast-subtract operation code, instruction field 22 is R1, instruction field 24 is R1 and instruction field 23 is R2; its meaning is that every lane subtracts the value of register R1 of the reference lane (specified by instruction field 22) from the value of its own register R1 (specified by instruction field 24) and stores the result in its own register R2 (specified by instruction field 23). The decoding of this instruction by each lane and the setting of the inter-lane bus are as described above and are not repeated. Since register R1 of each lane holds that lane's lane number, after the broadcast subtract instruction completes, the value of register R2 of each lane is exactly the normalized lane number of that lane relative to the reference lane. Taking the multi-lane system of the Fig. 3 embodiment as an example, assume that only lanes 33 and 34 are used and lane 33 is the reference lane; after the lane-move instruction has executed, the values of register R1 of lanes 33 and 34 are '2' and '3' respectively. After the cross-lane subtract instruction has then executed, the values of register R2 of lanes 33 and 34 are '0' and '1' respectively, i.e. the corresponding normalized lane numbers.
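The normalization step can be modeled as follows, under the assumption that register R1 of each lane already holds its lane number; the function name and dictionary representation are illustrative.

```python
def broadcast_subtract(r1, reference_lane):
    """R2 = R1(self) - R1(reference lane): the normalized lane number."""
    base = r1[reference_lane]
    return {lane: value - base for lane, value in r1.items()}

# Lanes 33 and 34 of the Fig. 3 example hold lane numbers 2 and 3;
# with lane 2 as the reference lane, their NLNs become 0 and 1.
r1 = {2: 2, 3: 3}
print(broadcast_subtract(r1, 2))
```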
The above broadcast subtract instruction can also be replaced by the combination of a broadcast move instruction and an ordinary subtract instruction: the broadcast move instruction first moves the lane number of the reference lane into every lane, and each lane then subtracts it from its own lane number to obtain its own NLN.
The following description continues with the example of extending a multi-lane parallel data access instruction on the basis of the DLX instruction set. In Fig. 2, instruction format B shows the format of a typical memory-type (M-Type) instruction in the DLX instruction set, in which instruction field 21 is the operation code, instruction field 22 is the base address register number, and instruction field 28 is the address offset (OFFSET). The data address (ADDR) is obtained by adding the base address value (BASE) in the register pointed to by the base address register number to the address offset. For a load instruction, instruction field 23 is the destination register number; for a store instruction, instruction field 23 is the source register number.
According to the technical solution of the present invention, multi-lane parallel data access can be implemented using the existing memory-type instruction format. Specifically, for a data access, the global base address is first added to the address offset to obtain the starting address of the access, which is stored in the register file of each lane; this starting address is called (Base). After the normalized lane number NLN has been computed and written into register R2, all lanes execute the same multiply instruction, multiplying the address interval Diff by the value of register R2; each lane then executes the same add instruction, adding its product (Diff*NLN) to the starting address of the data access, and the result ((Base)+Diff*NLN) is written back into the register. In this way, when all lanes execute the data access instruction, the register pointed to by register number Base in each lane's register file holds a data address that differs according to that lane's NLN value, and each lane performs its own data access. Each lane then executes the same ordinary load (Load) or store (Store) instruction, thereby loading or storing multiple data items. Thereafter, each lane can add a stride value (Stride) to its data access address with the same instruction to produce the next data access address for the next iteration of the program loop.
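The per-lane address computation just described (Base + Diff × NLN, then advanced by Stride each loop iteration) can be sketched as below; the trace-of-rounds representation is an illustrative assumption.

```python
def lane_addresses(base, diff, nlns, stride, iterations):
    """Addresses each lane accesses per iteration under one shared instruction.

    nlns maps lane -> normalized lane number; all lanes execute the same
    multiply/add/stride instructions, yet compute distinct addresses.
    """
    trace = []
    addr = {lane: base + diff * nln for lane, nln in nlns.items()}
    for _ in range(iterations):
        trace.append(dict(addr))                       # addresses this iteration
        addr = {lane: a + stride for lane, a in addr.items()}
    return trace

# Base 0x100, adjacent lanes 4 bytes apart, four lanes, stride 16 per loop.
print(lane_addresses(0x100, 4, {0: 0, 1: 1, 2: 2, 3: 3}, 16, 2))
```

With four lanes each consuming one 4-byte word per iteration, a stride of 16 advances every lane past the block the group just processed, which is the intended loop pattern.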
According to the technical solution of the present invention, the structure of the lane itself can also be improved, so that the multi-lane system can implement not only SIMD operation but also super-SIMD operation, multiple-instruction single-data (MISD) operation and multiple-instruction multiple-data (MIMD) operation, and can thus better perform highly parallel map operations.
Please refer to Fig. 4, which shows an embodiment of a multi-lane system comprising improved lanes according to the present invention. This embodiment uses a multi-lane system comprising 4 lanes as its example. The multi-lane system comprises a filler 402, a tag memory (Tag) 404, a scanner 408, a track table (Track Table) 410, an instruction memory 406, and four lanes 401, 403, 405 and 407. The structure of each lane is identical; taking lane 401 as an example, in addition to an execution unit 411 and a register file 412, it also comprises a tracker (Tracker) 414 and an instruction read buffer 417.
In this embodiment, lanes 401, 403, 405 and 407 share the instructions in the instruction memory 406. The following description uses lane 401 as its example; the relevant operation also applies to the other lanes. When the execution unit 411 of lane 401 executes an instruction, the instruction must first be read from the instruction memory 406 and placed into the instruction read buffer 417. In this embodiment, the instruction read buffer 417 has a smaller capacity and a correspondingly shorter access latency than the instruction memory 406. The instruction memory 406 and the instruction read buffer 417 can be built from any suitable storage devices. The instruction memory 406 is organized by instruction block (Instruction Block); an entry in the tag memory 404 and a row in the track table 410 correspond to an instruction block in 406, and all three are addressed by the buffer address BNX. An entry in the tag memory 404 holds the block address of the corresponding instruction block in 406 (i.e. the high bits of the PC address); a row in the track table 410 (called a track) has multiple entries, each corresponding to an instruction in the corresponding instruction block, both addressed by the intra-block offset BNY of the PC address. For ease of illustration, this instruction memory is organized as direct-mapped, so that the tag (Tag) of the PC address equals the tag in this example, the index (Index) equals the BNX in this example, and the intra-block offset (Offset) equals the BNY in this example. Hereinafter, an address comprising Tag, BNX and BNY is called an instruction address, and an address comprising BNX and BNY is called a buffer address.
The filler 402 fetches an instruction block from lower-level memory and stores it into the instruction memory 406 by the index value, and also writes the address tag of the instruction block into the tag memory 404 by the index value. The scanner 408 examines each instruction in the instruction block being stored into the instruction memory 406 and extracts information such as the instruction type (here assumed to be one of three kinds: non-branch, unconditional branch and conditional branch) and the instruction address, and stores the instruction type into the track table 410 entry pointed to by the buffer address; this entry corresponds to the instruction in the instruction memory 406 from which the information was extracted. If the examined instruction is a branch instruction, the scanner 408 further computes the branch target instruction address by adding the instruction address and the branch offset (Branch Offset) contained in the instruction. The tag portion of the computed branch target instruction address is matched against the tags stored in the tag memory 404. If there is no match, the branch target instruction address is sent to the filler 402, which, as above, fetches the instruction block from lower-level memory, stores it into the instruction memory 406 by the index value, and builds the corresponding tag and track. If there is a match, the BNX value of the matching entry, together with the intra-block offset BNY of the branch target address, is stored as the buffer address of the branch target into the track table entry pointed to by the buffer address of the branch instruction. Thus each entry in a track of the track table 410 corresponds to an instruction in an instruction block of the instruction memory 406; each entry contains at least the instruction type, and an entry for a branch instruction further contains the buffer address of the branch target of that branch instruction. An end track point is further appended at the end of every track, holding the buffer address of the sequentially next instruction block; this buffer address is obtained by adding the instruction block length to the instruction address of the current track to yield the instruction address of the next instruction block, which is then matched in the tag memory 404 as in the example above. The instruction type of the end track point is unconditional branch. The track table 410 therefore stores a network of tracks containing the sequential and branch relationships among all the instructions stored in the instruction memory 406.
A tracker 414 works with the track table 410 and can control the instruction memory 406 to supply instructions to lane 401 for execution. The tracker 414 consists of an incrementer 441, a selector 442 and a register 443. The output of register 443 is a buffer address, whose BNX points to an instruction block of the instruction memory 406 and also to the corresponding track in the track table 410; its BNY selects an instruction from this instruction block for lane 401 to execute, and also reads the corresponding entry of that track. The selector 442 is controlled by the instruction type information in the entry and by the branch decision produced by the branch decision logic of lane 401. When the instruction type output by 410 is non-branch, the selector 442 selects the output of the incrementer 441, which is the buffer address of the current instruction incremented by one; in the next clock cycle, the new buffer address points to the sequentially next instruction after the current instruction, which is read for lane 401 to execute. When the instruction type output by 410 is unconditional branch, the selector 442 selects the branch target buffer address contained in the entry output by the track table 410, and in the next clock cycle the lane executes the branch target instruction.
When the instruction type output by 410 is conditional branch, the selector 442 is controlled by the branch decision provided by lane 401. If the decision is not to branch, the selector 442 selects the output of the incrementer 441 as above, and in the next clock cycle the lane executes the sequentially next instruction; if the decision is to branch, the selector 442 selects the branch target buffer address contained in the entry output by the track table 410, and in the next clock cycle the lane executes the branch target instruction. In this way the tracker 414 cooperates with the track table 410, determining the program flow according to the instruction type and the branch decision signal fed back from lane 401, and continuously supplies instructions to the lane.
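The selector logic of the tracker can be sketched as follows, modeling the buffer address as a plain integer for simplicity (in the patent it actually comprises BNX and BNY); type names are illustrative.

```python
def next_buffer_address(current, entry_type, branch_target, taken):
    """Select the next buffer address as the tracker's selector 442 does."""
    if entry_type == "non-branch":
        return current + 1               # incrementer output: sequential flow
    if entry_type == "unconditional":
        return branch_target             # target stored in the track table entry
    # conditional branch: decided by the lane's branch decision logic
    return branch_target if taken else current + 1

print(next_buffer_address(5, "non-branch", 20, False))   # sequential
print(next_buffer_address(5, "conditional", 20, True))   # taken branch
```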
A plurality of lanes, with their respective trackers, can each independently access an instruction memory with multiple read ports to obtain independent instruction streams for their respective lanes. The embodiment of Fig. 4 implements this function with an instruction memory 406 having a single read port and a plurality of instruction read buffers such as 417. In this implementation, what is stored in the instruction read buffer 417 is actually a subset of the contents of the instruction memory 406 and of the corresponding contents of the track table 410; the tracker 414 is therefore actually connected to the instruction read buffer 417 and works with it. Only when the required data is absent from 417 is the request routed, through an arbiter (not shown in the figure) that arbitrates among the requests from the 4 lanes, to the instruction memory 406 and the track table 410 as shown in Fig. 4, to fetch the instruction block and the corresponding track and store them into 417.
The instruction read buffer 417 contains one or more instruction storage blocks, at least one of which stores an instruction block including the current instruction block. Corresponding to each instruction storage block there are a corresponding BNX memory, a matcher and a track memory. When an instruction block is stored into an instruction storage block of the instruction read buffer 417, its BNX value and the corresponding track are also read into 417. The buffer address output by register 443 of the tracker is sent to the instruction read buffer 417 to be matched against all the BNX values stored therein. If it matches the BNX of one of the instruction storage blocks, an instruction is read from that instruction storage block by the BNY of the buffer address for the lane to execute, and an entry of the corresponding track is also read and supplied to the tracker; the subsequent operation is the same as the tracker reading an entry from the track table in the example above and is not repeated. If it does not match, the buffer address is sent (through arbitration) to the instruction memory 406 and the track table 410, from which the instruction block and the corresponding track are read and stored into the instruction storage block of the instruction read buffer 417 designated by the replacement logic. Execution then proceeds as in the matching case above and is not repeated.
Fig. 4 shows 4 lanes 401, 403, 405 and 407 together with their respective dedicated trackers 414, 416, 418 and 420. Each tracker is connected to the instruction read buffer of its own lane and works under the control of the branch decision produced by that lane. Only when the instruction read buffer of a lane lacks the required data does the tracker of that lane access the instruction memory 406 and the track table 410 to fetch data and fill that lane's instruction read buffer, thereby reducing the bandwidth demand on the memory 406 and the track table 410. Lanes 403, 405 and 407 have the same structure as lane 401; each has an instruction read buffer like 417, an execution unit like 411 and a register file like 412.
As described previously, in the multi-lane system of the present invention a global bus (not shown in Fig. 4) may deliver the data of one lane to the other lanes, so that data in the register files 412 can be exchanged between lanes.
When the multi-lane system of the present invention performs single instruction, single data stream (SISD) processing, only one of the four lanes needs to work. For example, the instruction read buffer 417 of lane 401 outputs instructions to execution unit 411 for execution under the control of tracker 414, while the other three lanes and their instruction read buffers, execution units and trackers may be kept shut down, for example by turning off the clock signals or power supplies of those three lanes' instruction read buffers, execution units, register files and trackers. The global bus between lanes is not used either. In this way the function of an existing SISD processor can be realized.
When the multi-lane system of the present invention performs single instruction, multiple data stream (SIMD) processing, taking the case in which all four lanes are used as an example, the instruction read buffers of the four lanes store the same instruction block, and the respective trackers perform identical actions, each controlling its own instruction read buffer to supply the same instruction to the corresponding execution unit. The register file of each lane may store different data, and the read/write units corresponding to each lane may perform read/write operations on different data addresses. Thus the four lanes execute the same program, but the data used by each lane during execution may differ, realizing the same function as an existing SIMD processor. This embodiment does not use the global bus between lanes. In this way the function of an existing SIMD processor can be realized.
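The SIMD mode just described can be sketched in software: one shared instruction stream is stepped through by every lane, while each lane keeps its own register file. The miniature instruction encoding, register names and lane count below are illustrative assumptions for the sketch, not the hardware of the embodiment.

```python
# Minimal SIMD sketch: four lanes execute the same instruction stream,
# each lane on its own register file (different data per lane).
def run_simd(program, lane_regs):
    """program: list of (op, dst, src1, src2); lane_regs: one dict per lane."""
    for op, dst, a, b in program:
        for regs in lane_regs:          # every lane executes the same instruction
            if op == "add":
                regs[dst] = regs[a] + regs[b]
            elif op == "mul":
                regs[dst] = regs[a] * regs[b]

# Four lanes with different data in r1, identical program below.
lanes = [{"r1": i, "r2": 10, "r3": 0} for i in range(4)]
run_simd([("add", "r3", "r1", "r2"), ("mul", "r3", "r3", "r3")], lanes)
# lane i computes (i + 10) ** 2 into r3
```

Each lane's result differs only because its initial register contents differ, which is exactly the SIMD property described above.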
In the above embodiments, the instruction read buffer of each lane outputs instructions under the control of the read-pointer value of the tracker in the same lane; that is, a corresponding word line is generated from the value of the read pointer to address the memory cells of the instruction read buffer and read out the corresponding content. According to the technical solution of the present invention, the instruction read buffer may be further improved by linking the corresponding memory cells of the instruction read buffers of all lanes together through configurable switches. When the switches are off, each lane controls its own instruction read buffer to supply instructions to its execution unit as described previously. When the switches are on, only the tracker of one lane works (the trackers of the remaining lanes do not), and the word line generated from its read pointer controls the instruction read buffers of all lanes to output the content of the same memory cell. In this case, as long as the instruction read buffers all store identical content, the same instruction can be output to the execution units of all lanes under the control of a single lane's tracker, realizing SIMD operation. This improvement is applicable to any SIMD or super-SIMD operation of the present invention.
The multi-lane system of the present invention may also work in a super-SIMD mode. In ordinary SIMD, multiple lanes execute the same instruction at any given moment. When a branch occurs in the program, the existing SIMD approach has each lane evaluate the branch condition, producing a mask value for each lane, and the lanes then execute in phases under the control of the mask. First, the lanes whose mask is '0' (judged not taken) execute the instructions following the branch instruction (the sequential path), while the lanes whose mask is '1' stall and wait. Then the lanes whose mask is '0' stall, while the remaining lanes whose mask is '1' (judged taken) execute the branch target instruction and its subsequent instructions. With multiple nested branches, the efficiency of the multi-lane processor degrades greatly, so that existing SIMD processors are suitable only for certain mutually independent programs without branches. The super-SIMD mode of the present invention differs from the existing SIMD mode: multiple lanes execute the same program, but each lane has its own independent branch decision mechanism and instruction addressing mechanism and can independently execute different sections or different branches of the same program, so that 100% efficiency is maintained even when executing multiple branches.
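The masked, two-phase branch handling of conventional SIMD described above can be sketched as follows. The sketch only counts the serialized phases of mask-based execution; it does not model the super-SIMD mode, in which each lane would follow its own branch in a single pass. All function and variable names are illustrative.

```python
# Masked SIMD branch handling: lanes whose mask bit is 0 execute the
# fall-through path while mask-1 lanes stall; then the roles swap.
# One divergent branch therefore costs two serialized phases.
def masked_branch(data, cond, taken_fn, not_taken_fn):
    out, phases = [None] * len(data), 0
    mask = [cond(x) for x in data]
    for i, x in enumerate(data):        # phase 1: not-taken lanes only
        if not mask[i]:
            out[i] = not_taken_fn(x)
    phases += 1
    for i, x in enumerate(data):        # phase 2: taken lanes only
        if mask[i]:
            out[i] = taken_fn(x)
    phases += 1
    return out, phases

vals, n = masked_branch([1, -2, 3, -4], lambda x: x < 0, abs, lambda x: x * 2)
# vals == [2, 2, 6, 4]; two phases were serialized for a single branch
```

Nesting branches multiplies the number of such phases, which is the efficiency loss the super-SIMD mode avoids.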
When the multi-lane system of the present invention performs super-SIMD processing, the resources of one thread are used: the four instruction read buffers store the same instruction block and, under the control of their respective trackers, supply identical or different instructions of the same program to the corresponding execution units, while the register file corresponding to each lane may store different data. Because the data used by each lane during execution may differ, the branch decisions of the lanes may differ even when executing the same instruction, causing the four lanes to execute different branches of the program from a branch point onward. Since in this mode each lane executes the program in parallel and independently under the control of its own tracker, its efficiency in the presence of branches far exceeds that of existing SIMD processing, which uses mask registers to handle branches in multiple passes. This embodiment does not use the global bus between lanes. In this way the same function as existing SIMD processing can be realized with very high efficiency.
When the multi-lane system of the present invention performs multiple instruction, single data stream (MISD) processing, the instruction read buffers of the four lanes store different instruction blocks and supply different instructions to the corresponding execution units under the control of the respective trackers. Here, data is read from the data cache (not shown) through the register file of one lane, and the data in that register file is delivered, as described previously, through the inter-lane global bus to the register files of the other three lanes, so that the data in the register files of the four lanes remain consistent. The four lanes can thus simultaneously execute different programs on the same data, realizing the function of an MISD processor.
When the multi-lane system of the present invention performs multiple instruction, multiple data stream (MIMD) processing, the instruction read buffers of the four lanes store different programs and supply different instructions to the corresponding execution units under the control of the respective trackers. Correspondingly, the register files of the four lanes independently read different data from the data cache or write data back to it. This embodiment does not use the global bus between lanes. In this way, different programs can be executed simultaneously on different data sources, the programs and data executed by the lanes being mutually orthogonal, thus realizing the function of an MIMD processor. An MIMD processor is well suited to the map operation of a Map-Reduce workload.
On the other hand, the reduce step is a process of aggregating data, and its degree of parallelism is usually much smaller than that of the map step. The existing method is to perform the reduce with an SISD processor or a multi-core processor: the map results are split off from the computers performing the map, transferred through memories and networks with long latency, and distributed to the computers performing the reduce. For a map operation performed by the multi-lane processor disclosed in this invention, the intermediate map results could be aggregated by the above existing method. However, if the reduce is performed by the same on-chip lanes that performed the map, or by other lanes on the same chip, or by dedicated data processing resources added outside the lanes (hereinafter post processor, Post Processing Unit, PPU), the entire map-reduce operation can be completed on the same chip, eliminating the long latency above and improving operating efficiency. The following disclosure is applicable to multi-lane processors and multi-core processors generally, including all the aforementioned multi-lane processors disclosed in this invention.
Fig. 5 is an embodiment that uses inter-lane buses and the original computing resources of each lane to perform the reduce on the same lanes that performed the map. Here 50, 51, 52 and 53 are four lanes, with 50 the base (starting) lane. Each lane has an arithmetic unit 54 and a register file 55; the register file 55 outputs two operands 56 and 57 to the arithmetic unit 54. Each lane also has a controllable switch 58 that connects a further output 59 of the register file 55 to the inter-lane bus 60. Between the inter-lane buses 60 of adjacent lanes there are controllable switches 61 that can connect the buses 60 of the lanes on demand. The inter-lane bus 60 can send the data on it to an input of the register file 55, and can also bypass it to the input 57 of the arithmetic unit 54. When performing in-lane operations such as map, switches 58 and 61 are open, each lane operates independently, and no data is exchanged between lanes. When performing inter-lane operations such as reduce, the switch 58 of a certain lane (called the source lane) is closed, connecting the output 59 of the register file 55 of that lane to the inter-lane bus 60; the switch 61 between the buses 60 of this lane and an adjacent lane (called the target lane) is likewise closed, and in the adjacent (target) lane the bus 60 is switched onto the input 57 of the arithmetic unit 54. Here 57 is the output of bypass logic, which can select either the operand supplied by the register file 55 or the bus 60; in this example the inter-lane operation instruction controls this bypass logic to select bus 60. The arithmetic unit 54 of the target lane thus processes the operand 56 from the target lane's register file together with the operand 57 from the source lane's register file, and writes the operation result back to the register file 55 of the target lane itself. The switch 58 of the target lane remains open to avoid a collision with the data sent from the source lane. Inter-lane operations are carried out in this manner.
Inter-lane operation instructions added to the original instruction set can control the above switches and bypass mechanisms to complete the inter-lane operations described above. According to the normalized lane number (NLN) of each lane, the same instruction performs different operations in different lanes. An inter-lane instruction may take the register-type instruction format A in Fig. 2, in which field 21 of instruction 20 (an instruction field) is the opcode, field 22 is the first operand register address, field 23 is the third (result) operand register address, field 24 is the second operand register address, field 25 is an otherwise unused field, and field 26 is an auxiliary opcode. In an inter-lane reduce instruction, opcode 21 signifies that an inter-lane reduce operation is to be performed; the specific arithmetic/logic operation may be determined by opcode 21 alone or jointly by fields 21 and 26. Assume the inter-lane instruction executed in this example is a reduce-add. Fields 23 and 24 of the instruction are, as in an ordinary three-operand instruction, the third (result) operand register address and second operand register address of the register file 55 of the lane performing the computation (the target lane in this example). However, the first operand register address in field 22 of an inter-lane instruction points into the register file 55 of the source lane. When the instruction decoder of each lane decodes the instruction as a reduce instruction, it further decodes the aggregation degree in field 25 of the instruction and, combined with the normalized lane number of its lane, determines which lanes are source lanes and which lanes are target lanes.
Here the NLNs of the lanes are defined to increase from left to right. When the aggregation degree is '2', the reduce instruction designates, in each pair of adjacent lanes, the lane whose lowest NLN bit is '1' as the source lane and the lane whose lowest bit is '0' as the target lane. Addressed by field 22 of the inter-lane operation instruction, the corresponding entry of the register file 55 of the source lane is read out and placed on the inter-lane bus 60 through the register file output 59 and the switch 58 of the source lane; at the same time the switch 61 between the source lane and the target lane is closed, so that the register file output 59 of the source lane is sent to the input 57 of the arithmetic unit 54 of the target lane. Meanwhile, the switch 61 between the target lane and its left neighbor, and the switch 61 between the source lane and its right neighbor, are both open, so that the register file output 59 of the source lane reaches only its target lane without affecting other lanes. At the same time, multiple source/target lane pairs execute the same inter-lane instruction in parallel to carry out the reduce operation.
Fig. 6 illustrates the operation of the Fig. 5 embodiment when the aggregation degree of the inter-lane instruction is '2'. If the normalized lane number of lane 50 is '00', the NLNs of the corresponding lanes 51, 52, 53 are '01', '10', '11'. When executing an inter-lane instruction with aggregation degree '2', the instruction decoder of each lane designates the lanes whose lowest NLN bit is '1' as source lanes and the lanes whose lowest bit is '0' as target lanes, and sets the switches 58 and 61 accordingly. The result, as in Fig. 6, is that lane 51 is a source lane: the entry read from its register file 55 under the control of field 22 of the inter-lane operation instruction is sent through output 59 to the input 57 of the arithmetic unit 54 of lane 50, where it is added to the register file entry 56 read in lane 50 under the control of field 24 of the instruction; the operation result is written back to the entry of the register file 55 of lane 50 pointed to by field 23 of the inter-lane operation instruction. Likewise, the register file output 59 of lane 53 is sent to lane 52 and reduced with the operand 56 output by the register file of lane 52, the result being written into the register file 55 of lane 52. This completes the execution of the aggregation-degree-'2' inter-lane instruction over these four lanes, producing two intermediate reduce results in parallel.
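The degree-'2' step above can be sketched as a pure function over the lane values: every lane whose lowest NLN bit is '1' contributes its operand into the adjacent lane whose lowest bit is '0'. The list-of-values model below is an illustrative simplification of the register files and switches.

```python
# Aggregation degree 2: lanes with NLN lowest bit '1' are source lanes;
# each source's operand is added into its left-neighbour target lane
# (NLN with the lowest bit cleared), all pairs in parallel.
def reduce_deg2(lane_vals):
    out = list(lane_vals)
    for nln in range(len(out)):
        if nln & 1:                      # lowest NLN bit '1' -> source lane
            out[nln - 1] += out[nln]     # target lane: same NLN, bit cleared
    return out

partials = reduce_deg2([5, 7, 2, 9])     # lanes '00' and '10' now hold 12 and 11
```

The source-lane entries are left unchanged here; in the hardware they are simply no longer selected by later steps.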
On this basis, an inter-lane reduce instruction of aggregation degree '4' can then be executed. When executing an inter-lane instruction of aggregation degree '4', the instruction decoder of each lane designates the lanes whose two lowest NLN bits are '10' as source lanes and the lanes whose two lowest bits are '00' as target lanes; the other lanes do not participate in the operation. The switch 58 of each non-participating lane remains open. The switches 61 on the data path from a source lane to its target lane are closed (the source lane being to the right of each such switch and the target lane to its left), while the switches 61 not on the path remain open (the source lane to the left of the switch and the target lane to its right). As shown in Fig. 7, the intermediate reduce result produced by the previous aggregation-degree-'2' step, held in the register file of lane 52 and pointed to by field 22 of the inter-lane operation instruction, is now sent to lane 50 and further reduced with the intermediate reduce result stored in the register file 55 there, pointed to by field 23 of the instruction; the result is stored into the entry of the register file 55 of lane 50 pointed to by field 24 of the instruction.
After the two inter-lane instructions of aggregation degrees '2' and '4' have been executed in this way, the map results originally distributed over the four lanes have been reduced to a single result. Likewise, subsequently executing inter-lane instructions of higher aggregation degree can reduce the map results, or other kinds of intermediate execution results, of more lanes. For example, an instruction of aggregation degree '8' takes the lanes whose NLN is '100' as source lanes and aggregates their data into the target lanes whose NLN is '000'; an instruction of aggregation degree '16' takes the lanes whose NLN is '1000' as source lanes and aggregates their data into the target lanes whose NLN is '0000'. By executing n inter-lane instructions with aggregation degrees increasing from '2', the 2^n intermediate results in 2^n lanes can be reduced to a single result held in the register file of the base lane, which a subsequent instruction can then store to memory. The aggregation degree is restricted to powers of two, 2^n with n greater than or equal to 1.
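The full sequence of n steps described above is a binary tree reduction over the lane values; a compact software sketch (again over a plain list standing in for the register files) is:

```python
# Full reduce over 2**n lanes: the step with aggregation degree 2*step
# takes lanes whose NLN equals target|step as sources and folds them into
# targets whose low bits are zero; after n steps lane 0 holds the total.
def reduce_tree(lane_vals):
    vals = list(lane_vals)               # lane count assumed a power of two
    n_lanes = len(vals)
    step = 1
    while step < n_lanes:                # aggregation degrees 2, 4, 8, ...
        for target in range(0, n_lanes, 2 * step):
            vals[target] += vals[target + step]   # source NLN = target | step
        step *= 2
    return vals[0]                       # base lane holds the final result
```

Each pass of the `while` loop corresponds to one inter-lane instruction, and all additions inside a pass happen in parallel in the hardware.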
In Figs. 5, 6 and 7 the register file 55 of each lane is provided with a separate read port 59 to support inter-lane operations; the function of this read port may instead be provided by read port 57 without adding the separate read port 59. The inter-lane instruction may also be a data move instruction, which does not directly drive the arithmetic unit of a lane but moves data from the register file of the source lane to the target lane and stores it in the target lane's register file. A normal operation instruction is then executed in the target lane to reduce the data already in the target lane's register file with the data moved in from the source lane. The inter-lane data move instruction may use instruction format A of Fig. 2; unlike the inter-lane operation instruction above, its opcode 21 does not direct the arithmetic unit 54 of the target lane to perform an operation, but directs the register file 55 of the target lane to store the source-lane data arriving on bus 60. In addition, field 24 of the instruction has no effect, since the arithmetic unit 54 requires no operand; everything else is the same as for the inter-lane operation instruction. The lane processor structure is modified correspondingly: the inter-lane bus 60 is now also connected to an input of the register file 55. In operation, the opcode in field 21 of the inter-lane data move instruction (or fields 21 and 26 acting together) directs each lane to perform a data move operation; the aggregation degree in field 25, acting together with the NLN, selects the source and target lanes and controls the switches 58 and 61 to establish the bus connection between source and target lane; data is read from the source-lane register file entry pointed to by operand address field 22 of the instruction and written into the target-lane register file entry pointed to by result address field 24 of the instruction, completing the inter-lane data move. Bypass logic may be used to select between the inputs of the arithmetic unit 54 and the register file 55, enabling the lane processor to support both inter-lane computation instructions and inter-lane data move instructions.
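The two-instruction alternative just described, an inter-lane move followed by an ordinary local operation, can be sketched as follows; the register names and the spare "tmp" entry are illustrative assumptions.

```python
# Reduce via an inter-lane data move plus an ordinary local add:
# step 1 copies the source lane's register-file entry into a spare
# entry of the target lane; step 2 is a normal in-lane add instruction.
def move_then_add(lane_regs, src_nln, tgt_nln, src_addr, tmp_addr, acc_addr):
    lane_regs[tgt_nln][tmp_addr] = lane_regs[src_nln][src_addr]   # inter-lane move
    tgt = lane_regs[tgt_nln]
    tgt[acc_addr] = tgt[acc_addr] + tgt[tmp_addr]                 # local reduce step

lanes = [{"acc": 5, "tmp": 0}, {"acc": 7, "tmp": 0}]
move_then_add(lanes, src_nln=1, tgt_nln=0,
              src_addr="acc", tmp_addr="tmp", acc_addr="acc")
# lane 0 now holds 12 in "acc"
```

This costs one extra instruction per reduce step compared with the direct inter-lane operation, in exchange for a simpler datapath (the bus feeds only the register file).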
The above inter-lane bus may also be used as a broadcast bus. Referring to Fig. 5, the format of the broadcast instruction is the same as that of the above data move instruction. When each lane decodes the broadcast instruction, the register file address in field 22 of the instruction directs the register file 55 of the base lane (NLN '000') to read out data; the switch 58 of that lane is closed, placing the data on bus 60, and its switch 61 is closed as well. Each of the remaining lanes, whose NLN is not '000', closes its own switch 61 so that the data reaches all lanes, and each writes the data into its own register file at the register file address in field 24 of the instruction, completing the broadcast delivery of the data.
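A software model of this broadcast is a single read in the base lane followed by a write in every lane; the dictionary-per-lane register files below are an illustrative simplification.

```python
# Broadcast sketch: the base lane (NLN 0) reads the source entry onto the
# bus; with all relay switches 61 closed, every lane writes the bus value
# into its own register file at the destination address.
def broadcast(lane_regfiles, src_addr, dst_addr, base=0):
    bus = lane_regfiles[base][src_addr]       # field-22 read in the base lane
    for regs in lane_regfiles:                # data reaches all lanes
        regs[dst_addr] = bus                  # field-24 write in every lane
    return bus

files = [{"r0": 42, "r1": 0}] + [{"r0": 0, "r1": 0} for _ in range(3)]
broadcast(files, "r0", "r1")
```

Note that the base lane also writes its own destination entry, matching the instruction semantics in which every lane performs the field-24 write.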
If data processing resources are added on the inter-lane bus outside the lanes (hereinafter post processor, Post Processing Unit, PPU), so that each lane only needs to perform the map while the reduce is handed over to the PPUs, efficiency can be improved further. The connection between the inter-lane bus and the PPUs can take various topologies; some examples are given below. Fig. 8 is a tree-shaped bus structure. Every two lanes, such as 80 and 81, share one PPU, such as 84, whose operation result is held temporarily in the PPU's output register. Each of the two inputs of a first-level PPU such as 84 or 85 accepts the register file output of one of its two adjacent lanes. A second-level PPU such as 86 accepts the outputs of the first-level PPUs. Each level also has a control register such as 87, and control signals are passed from each level to the next higher-numbered level. These PPUs, buses, and so on are controlled by post-processing instructions.
Post-processing instructions are encoded in the program like other conventional CPU instructions and are supplied to each lane by the instruction memory, the instruction cache, or the instruction read buffer IRB. A post-processing instruction does not affect the conventional processor operation within the lane; it specifically controls the operation performed on the result after the conventional processor has finished processing, and its format may be format A in Fig. 2. When each lane reaches a post-processing instruction, the instruction decoder in the lane decodes it and identifies it as a post-processing instruction from the opcode in field 21; accordingly, using the register file address in field 22, data is read from the lane's register file and sent over the bus to the inputs of the first-level PPUs such as 84 and 85. The post-processing instruction itself is also sent to the first-level PPUs. Each PPU has an opcode decoder that determines, from the encoding in field 21 and/or field 26 of the post-processing instruction, which operation the PPU should perform, such as addition or subtraction. The outputs of PPUs 84 and 85 are sent to the inputs of PPU 86 in the following clock cycle, while the post-processing instruction, after being held in register 87, is also sent to PPU 86 to control its operation. In this way the data to be reduced and the instruction are transmitted level by level through their respective pipelines up to the top-level PPU. The reduce result output by the top-level PPU is written back, under the control of the register file result address in field 24 of the post-processing instruction arriving through the pipeline, to the register file of one lane. That lane may be a default, such as the base lane, or may be specified by fields 25, 26, etc. of the post-processing instruction. One optimization is to perform the decoding that controls the PPU operations in the instruction decoders of the lanes, and to transmit only the decoded control signals through the pipeline to control the operation of each PPU level.
A post-processing instruction may also direct the reduce result to be written directly back to memory. In this case the post-processing instruction takes format B of Fig. 2. When each lane reaches such a post-processing instruction, the instruction decoder in the lane decodes it and identifies it as a post-processing instruction from the opcode in field 21; accordingly, using the register file address in field 23, data is read from the lane's register file and sent over the bus to the inputs of the first-level PPUs such as 84 and 85. At the same time, decoding the encoding in field 21 and/or field 26 determines which operation the PPUs should perform, and the decoded control signals control the operation of the first-level PPUs; thereafter the data being reduced and the reduce control signals are passed level by level along the tree pipeline as in the previous example. Also at decode time, each lane uses the register file address in field 22 of the instruction to read a base address from the register file, which is added to the offset in field 28 of the instruction to produce the memory address. This address is transmitted, together with the reduced data, level by level over the tree-shaped bus up to the top-level PPU. After the top-level PPU there is store logic, which stores the final reduce result to memory at this address. The generation of the memory address may share the lane's original adder in the arithmetic unit, or a dedicated adder may be added so that the post-processing instruction can execute in parallel with other instructions.
Fig. 9 is a relay bus structure, in which each lane 90, 91, 92, 93 has a corresponding PPU such as 94, and each PPU has an output register holding its output temporarily. The output bus 95 of each PPU delivers the data held in its output register to one input of the PPU of the lane to its left; the other input 96 of that PPU comes from the register 97 of its own lane. When the lane processor of this example executes a post-processing instruction, the instruction decoder in each lane decodes it and identifies it as a post-processing instruction from the opcode in field 21; accordingly, using the register file address in field 23, data is read from the lane's register file and held temporarily in the lane's register 97. Meanwhile, the instruction decoder of the lane with the highest NLN (93 in this example) also decodes the type of the post-processing operation and sends it, together with the target register file address, or with the memory address computed as in the previous example, to register 98 for temporary storage. In the following clock cycle, the PPU 94 of that lane performs a null operation on the data at its input 96 (the PPU of the lane with the highest NLN may be fixed to perform the null operation); the net effect is that the data on input 96 of lane 93 is stored into the output register of the PPU 94 of that lane, while the post-processing control signal and memory address in register 98 are passed on through OR logic 99. Thereafter, each time the post-processing control signal is relayed to a lane, the PPU of that lane reduces the data on its two inputs 95 and 96 and stores the result in its output register. When the post-processing control signal reaches the leftmost (base) lane, the output 95 of that lane's PPU is the reduce of the intermediate map (or other) results of all lanes from the highest NLN down to NLN '0'. Thereafter, according to the type of the post-processing instruction as in the previous example, the result is either written back to a register file or written directly back to memory.
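The relay-bus reduce above proceeds one lane per cycle from the highest-NLN lane leftward; the following sketch models that dataflow (the null operation of the rightmost PPU, then a chain of combine steps).

```python
# Relay-bus reduce (Fig. 9 style): the highest-NLN PPU performs a null
# operation on its own lane's register value; each PPU to the left then
# combines the value relayed on bus 95 with its own lane's register 97.
def relay_reduce(lane_vals, op):
    acc = lane_vals[-1]                  # highest-NLN lane: null operation
    for v in reversed(lane_vals[:-1]):   # relay leftward, one lane per cycle
        acc = op(acc, v)                 # inputs 95 (relay) and 96 (local)
    return acc                           # output of the leftmost lane's PPU

result = relay_reduce([2, 4, 6, 8], lambda a, b: a + b)   # 8, 14, 18, 20
```

Latency grows linearly with the lane count, in contrast with the logarithmic depth of the tree bus, but the wiring is a simple nearest-neighbor chain.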
The bus connection topology for the post-processing operation may also be a mixture of the above structures. For example, the lane processors on a lane processor chip may be grouped, say into 16 groups of 32 lanes each. A tree-topology bus may then be used for the reduce post-processors within each group, while a relay-bus topology is used between groups. The buses may be hard-wired; an existing GPU, for instance, has each group of lanes execute the same instruction, and hard-wired bus connections can be used. Alternatively, the lane processor of the present invention, which uses the track table and instruction read buffer IRB, can be freely partitioned within a group of lanes, with different lanes executing different instructions; this requires reconfigurable bus connections that follow the lane partitioning.
Figure 10 shows an embodiment of a reconfigurable tree bus. In the figure, each PPU has an independent path for writing its operation result back to a register file or memory, and each branch of the tree bus can be switched off independently. Fig. 10A shows the reconfigurable tree bus configured to reduce the results of 8 lanes: here the paths from each PPU level to the next level are all connected, while the paths to memory of all PPUs except PPU 106 are disconnected. Thus, after 3 levels of reduction, the reduce result is output from PPU 106 and stored to memory.
Fig. 10B shows the configuration of the reconfigurable tree bus that separately reduces the results of 2 lanes and of the other 6 of the 8 lanes. This example differs from the Fig. 10A embodiment in that the path from PPU 100 to PPU 104 is disconnected, while the path from PPU 100 to memory is connected. Thus, after 1 level of reduction, the reduce result of the 2 lanes is output from PPU 100 and stored to memory, while the other 6 lanes are reduced through 3 levels, their reduce result being output from PPU 106 and stored to memory.
Fig. 10C shows the configuration of the reconfigurable tree bus that separately reduces the results of two groups of 4 lanes out of the 8. This example differs from the Fig. 10A embodiment in that the paths from PPUs 104 and 105 to PPU 106 are disconnected, while the paths from PPUs 104 and 105 to memory are connected. Thus the reduce result of the first 4 lanes is output from PPU 104 and stored to memory, while the reduce result of the other 4 lanes is output from PPU 105 and stored to memory.
Fig. 10D shows the configuration of the reconfigurable tree bus that separately reduces the results of 4 groups of 2 lanes out of the 8. This example differs from the Fig. 10A embodiment in that the paths from PPUs 100, 101, 102, 103 to PPUs 104 and 105 are disconnected, while the paths from PPUs 100, 101, 102, 103 to memory are connected. Thus, after 1 level of reduction, the reduce results are output from PPUs 100, 101, 102 and 103 respectively and stored to memory. In Fig. 10, the on/off state of each post-processing path connection is controlled by the lane allocator; the inter-lane paths in the following embodiments are likewise controlled by the lane allocator, and this is not repeated for each case.
Figure 11 shows an embodiment of the reconfigurable transmission bus. In the figure, each PPU has an independent path for writing its operation result back to the register file or memory, and every segment of the transmission bus can be switched off independently. As shown in Figure 11, the bus partitions the 5 lanes into an aggregation operation over 2 lanes and an aggregation operation over 3 lanes. Here, the transmission bus segment from PPU 113 to PPU 112 is switched off and all other segments are switched on, so that the results of lanes 118 and 119 are aggregated in turn by PPUs 114 and 113, their aggregation result being output from PPU 113 and stored into memory, while the results of lanes 115, 116, and 117 are aggregated in turn by PPUs 112, 111, and 110, their aggregation result being output from PPU 110 and stored into memory.
Figure 12 shows an embodiment of a reconfigurable degree of aggregation; Figure 12A shows its concrete structure. It is assumed here that the minimum degree of aggregation is '2', so the four lanes 120, 121, 122, and 123 need two PPUs, 124 and 125, in total. In this embodiment, the data paths from lane to PPU, from PPU to other PPUs, and from each PPU back to itself all contain switches that can be turned on or off.
Referring to Figure 12B, it shows an embodiment of the configuration when the degree of aggregation is '2'. On the basis of Figure 12A, switches 126, 127, 128, and 129 are all switched on, so that each lane's result can be sent to its corresponding PPU; switches 1211, 1221, and 1231 are all switched off, so that no data is transferred between PPUs; switches 1241 and 1251 are also switched off. This configuration is equivalent to each PPU receiving only the results of its two corresponding lanes, performing an aggregation operation on those two inputs, and outputting the aggregation result of its corresponding two lanes.
Next, referring to Figure 12C, it shows an embodiment of the configuration when the degree of aggregation is '4'. Here, switches 126, 127, 128, and 129 are all switched off, so that the PPUs no longer receive new values from the lanes; switches 1221, 1241, and 1251 are switched on and switches 1211 and 1231 are switched off, so that each PPU receives both the two-lane aggregation result it output in the previous aggregation operation and the two-lane aggregation result output in the previous aggregation operation by an adjacent PPU (in this example, the PPU to its right). This configuration is equivalent to performing a further aggregation operation on two two-lane aggregation results, obtaining a four-lane aggregation result that is output by PPU 124.
Next, referring to Figure 12D, it shows an embodiment of the configuration when the degree of aggregation is '8'. Illustrating a degree of aggregation of '8' requires 8 lanes, but for ease of display only 5 lanes are shown in the figure: the 4 lanes and corresponding PPUs to the left of the dashed line are identical to Figure 12A, and the right side of the dashed line has the same structure, of which only the first lane is shown. The configuration for a degree of aggregation of '8' is still illustrated with the structure of Figure 12A as the example. To the left of the dashed line, switches 126, 127, 128, and 129 are all switched off, so that the PPUs no longer receive new values from the lanes; switches 1221, 1231, 1241, and 1251 are switched on, and switch 1211 is switched off. All switches to the right of the dashed line are configured in the same way. Thus PPU 124, besides receiving the four-lane aggregation result it output in the previous aggregation operation, also receives the corresponding four-lane aggregation result of the previous round sent from the right of the dashed line, and performs a further aggregation operation, obtaining an eight-lane aggregation result that is output by PPU 124.
According to the technical solution of the present invention, configuration structures for aggregation operations over more lanes can be derived by analogy with the above method, and are not repeated here.
Below, the actual operation of the multi-lane system of the present invention is illustrated with matrix multiplication as the example. Both an ordinary GPU lane without the improvements of the present invention and the improved lane shown in Figure 4 are suitable for this embodiment. In this embodiment, matrices A and B each have 4 rows and 4 columns, and the result matrix obtained by multiplying them also has 4 rows and 4 columns. From the first row of matrix A (a00, a01, a02, a03) and the first column of matrix B (b00, b10, b20, b30), the first element (c00) of result matrix C can be obtained by multiply-accumulate; the concrete calculation is: c00 = a00*b00 + a01*b10 + a02*b20 + a03*b30. Executed as serial instructions, this requires at least 4 multiplications and 3 additions, 7 operations in total. Moreover, since an existing processor system generally needs memory to hold the intermediate results of the calculation, and each multiplication/addition operation generally requires 2 extra data reads and 1 data store (assuming the best case in which no memory miss occurs, 4 cycles in total), the time needed to compute one matrix element can be estimated as 7*4 = 28 cycles.
In a multi-lane system, the 4 multiplications and their corresponding data access operations can clearly be assigned to 4 lanes for parallel execution, so that only 1*4 = 4 cycles are needed to complete the multiplications; afterwards, 2 of the 3 additions and their corresponding data access operations are assigned to 2 lanes for parallel execution (1*4 = 4 cycles in total), and then the last addition and its corresponding data access operations are assigned to 1 lane (1*4 = 4 cycles in total). Altogether 12 cycles complete the computation of one matrix element, an improvement in performance over the 28 cycles of serial execution.
Furthermore, the post-processor of the present invention can be used to perform the aggregation operation (i.e., the additions) on the results of the multiplications. Since the post-processor can directly receive the execution result of its corresponding lane, the data accesses corresponding to storing the multiplication results and to the additions can be saved; only the final aggregation result needs to be stored into memory.
Specifically, the data engine in each lane first calculates data addresses according to the corresponding lane number, the data start address, and the address gap between lanes. Here, the start address of matrix A is the address of data a00 and its address gap is '1'; the start address of matrix B is the address of data b00 and its address gap is '4'. Thus in the first lane the addresses of the multiplicand and the multiplier are the addresses of a00 and b00 respectively; in the second lane they are the address of a00 plus '1' and the address of b00 plus '4', i.e., the addresses of a01 and b10; similarly, in the third and fourth lanes the addresses of the multiplicand and multiplier are those of a02 and b20, and of a03 and b30, respectively.
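The per-lane address calculation described above reduces to: operand address = start address + lane number * address gap. A minimal sketch of that formula follows; the function name and the numeric base addresses are illustrative assumptions, not values from the patent.

```python
# Per-lane operand addresses for the c00 example: matrix A has address
# gap '1' between lanes, matrix B has address gap '4'.
def lane_operand_addresses(lane_no, base_a, base_b, gap_a=1, gap_b=4):
    """Return (multiplicand address, multiplier address) for one lane."""
    return base_a + lane_no * gap_a, base_b + lane_no * gap_b

# With illustrative bases base_a = 0 and base_b = 100, lane 1 addresses
# a01 at 1 and b10 at 104, matching the a01/b10 pairing above.
```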
After the data engines complete the calculation of the data addresses, all four lanes, operating in SIMD fashion, read the multiplicand and the multiplier of each lane into registers in parallel in two successive reads. The execution module of each lane then executes the multiply instruction simultaneously, obtaining the four products (a00*b00), (a01*b10), (a02*b20), and (a03*b30), and proceeds to the subsequent aggregation processing. Completing the above multiplications takes 3 cycles in total (2 cycles for the data reads and 1 cycle for the multiplication).
Assume that the multi-lane system uses the transmission-bus post-processor of Figure 9 for the aggregation operation; the post-processors of the four lanes then execute the same addition instruction. That is, the output of the 4th lane is added to '0' in its corresponding post-processor; the result is sent to the post-processor of the 3rd lane and added to the output of the 3rd lane; that result is in turn sent to the post-processor of the 2nd lane, and so on. Thus, after 4 cycles of post-operations, the final result is output from the first lane and stored into memory, completing the calculation of c00. In this way, computing one matrix element (such as c00) takes 6 cycles in total (3 cycles for the multiplications, 4 cycles for the accumulation, and 1 cycle for storing the data), a great improvement in performance over both serial execution (28 cycles) and the conventional multi-lane system (12 cycles).
Assume instead that the multi-lane system uses the tree-shaped bus post-processor of Figure 8 for the aggregation operation; the post-processor then aggregates and accumulates the outputs of the four lanes in two successive pipelined steps. That is, within one cycle the outputs of lanes 1 and 2, and of lanes 3 and 4, are summed separately; in the next cycle the two partial sums are added again, and the final result is stored into memory, completing the calculation of c00. In this way, computing one matrix element (such as c00) takes 4 cycles in total (1 cycle for the multiplication, 2 cycles for the additions, and 1 cycle for storing the data), a further improvement over the three preceding methods (28 cycles, 12 cycles, and 6 cycles respectively).
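The contrast between the two post-processor organizations comes down to aggregation latency: the transmission bus chains one addition per lane, while the tree-shaped bus halves the number of operands each cycle. The following is a hedged latency sketch; the general formulas are inferred from the 4-lane counts above and are not stated as formulas in the text.

```python
import math

def chain_aggregation_cycles(n_lanes):
    """Transmission-bus post-processor: one addition per lane, passed
    along the chain, so latency grows linearly with the lane count."""
    return n_lanes

def tree_aggregation_cycles(n_lanes):
    """Tree-shaped bus post-processor: pairwise additions, one tree
    level per cycle, so latency grows logarithmically."""
    return math.ceil(math.log2(n_lanes))

# For the 4-lane example: 4 accumulation cycles versus 2 addition
# cycles, matching the 6-cycle and 4-cycle totals given above.
```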
According to the technical solution of the present invention, the multi-lane system with the mapping/aggregation structure can also be used to parallelize other operations, such as vector or matrix addition and subtraction, dot products, and so on; the concrete procedures are similar to the preceding example and are not repeated here.
The above embodiment gives a parallel implementation of one matrix-element multiplication using the multi-lane system of the present invention. The instructions of this embodiment could then be executed in a loop by prior-art methods to implement the complete matrix multiplication. According to the technical solution of the present invention, the multi-lane system of the preceding example can be further improved by adding one or more loop control (Loop) modules and having the loop control modules control the functions of the data engines, so that the loop code no longer needs data access instructions, thereby improving loop efficiency.
Referring to Figure 13A, it is an embodiment of the multi-lane system of the present invention with improved loop efficiency. For ease of description, only one lane 140 is shown. In this system, the scanner 408, the instruction memory 406, the data memory 146, the track table 410, the lane allocator 188 with its lane group controllers 189, several loop controllers 130, and the store data engine 170 are shared by all lanes. Each lane 140 has its own tracker 414 and execution unit 147 (containing register file 148), as well as several data engines 150. The tracker 414 reads instructions from the instruction memory 406 for simultaneous decoding by the instruction decoder 149 in each lane 140, controlling the operation of each lane. These instructions are also sent to the lane allocator 188, whose decoder decodes them and, according to the requests and request priorities in the instructions, allocates lane resources to the programs that need them, and assigns a lane group controller 189 to manage the allocated lane resources as a group so as to meet the program's requirements. In this embodiment, a dedicated lane request instruction requests lane resources from the processor system; a dedicated data engine configuration instruction requests, in each lane, one data engine 150 for each data access (Load or Store) and configures its data access stride; and a dedicated loop setup instruction requests one loop controller 130 for each program loop, sets the loop count in that loop controller, and associates the loop controller with the data engines corresponding to all the data accesses in that program loop. All of these requests are served by the lane allocator 188 according to the resources then in its available-resource pool and the priority of the request. Thereafter, a dedicated loop instruction in the program designates a loop controller 130 and determines the direction of program flow (execute the loop or exit the loop) according to the loop count therein. If the loop is executed, the data engines 150 and other units associated with the loop access the data memory at the preset data access stride; if the loop is exited, the loop controller 130 and the data engines 150 are restored to the state set when they were configured, ready for the next loop. Since every lane executes the same instructions, all lanes are at the same loop level at any given moment; and although the data access instructions executed by the lanes are identical, the data address for the same instruction in the corresponding data engine may differ from lane to lane. Thus one loop controller, through the data engines it controls in each lane, can cause the same data access instruction executed in multiple lanes to access a different data memory address in each lane.
Here, the role of the data engine 150 is to calculate the data address in advance, according to the data address stride (increment), before a lane needs the data, and to fetch the data from the data memory 146; executing a data engine setup instruction only once thereby replaces the data access instructions (LD or ST) that would otherwise be executed many times in the loop code, reducing the number of instructions to execute and improving program efficiency. The loop controller 130 provides loop information to the data engines, so that in different loops a data engine 150 can automatically calculate the data address of the next iteration using the stride corresponding to that loop, and access the data memory in advance to obtain the data. The lane allocator 188 and the lane group controller 189 allocate lanes according to the number of lanes the program requires and the number of lanes currently available; when the available lane count is less than the required lane count, the program is run in multiple rounds to realize the function of the complete program.
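The division of labor described here, configure once and then step automatically, can be sketched as follows; the class and method names are assumptions for illustration, not the patent's interfaces.

```python
# Sketch of a load data engine that replaces in-loop LD instructions:
# one setup fixes the start address and stride, after which each loop
# iteration consumes the next address without any load instruction.
class DataEngine:
    def __init__(self, start_addr, stride):
        self.addr = start_addr    # set once by the setup instruction
        self.stride = stride      # per-iteration address increment

    def next_address(self):
        """Return the current address, then step by the stride."""
        a = self.addr
        self.addr += self.stride
        return a
```

A loop controller would trigger `next_address()` once per iteration, so the loop body itself contains no address arithmetic and no load instruction.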
Referring to Figure 13B, it is another embodiment of the multi-lane system of the present invention with improved loop efficiency. For ease of description, only one lane is shown. In this embodiment the instruction memory 406, the track table 410, the data memory 146, the lane allocator 188 with its lane group controllers 189, several loop controllers 130 of identical structure, and the data engine 170 are shared by all lanes. The tracker 141, the execution unit 147 (containing register file 148), and the data engines 150 and 160 belong to a lane and are used exclusively by that lane. In the following embodiments, the lane processors perform SIMD operations, so in actual operation only the tracker 414 of the controlling lane need read the contents of the track table 410 and determine the program flow according to the loop decision, thereby controlling the program execution of multiple lanes. Within one lane, the several data engines it contains may have identical structures or different structures. For example, Figure 13B shows data engines of three different structures: data engines 150 and 160 serve data load instructions, and data engine 170 serves data store instructions. A data engine of structure 150 or 160 can be used for reading data.
In this embodiment, configuration (e.g., setting switches 180, 181, and 182 respectively) can associate the loop controllers with some of the data engines, realizing program-loop control over those data engines' accesses to the data memory. This is illustrated below with the matrix multiplication example of Figure 14 and the corresponding instruction sequence of Figure 15. Referring to Figure 14A, matrices M and N are four rows by four columns and their product is matrix P; the elements of each matrix are expressed as hexadecimal digits 0-F. This example uses 4 lanes, so, as in the earlier embodiments, one element of P can be computed by one parallel mapping and aggregation operation. Therefore 16 such operations in total (as shown in Figure 14B) complete the matrix multiplication (the computed results being P0~PF). According to the rules of matrix multiplication, these 16 operations can be organized as a two-level loop: the inner loop completes the calculation of the four elements of each row (i.e., four consecutive columns in Figure 14B, such as P0~P3, P4~P7, P8~PB, PC~PF), while the outer loop completes the calculation of all four rows of elements (the four discontinuous blocks in Figure 14B). Therefore, for all participating lanes, the two-level loop needs 2 loop controllers in total. In addition, the operation carried out in each lane uses 2 input data (one element each from matrices M and N) and produces 1 output datum (one element of matrix P). Each lane therefore needs two load data engines 150, corresponding to the M and N matrices, and the post-aggregation processor needs a store data engine 170 corresponding to the P matrix.
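The two-level loop structure of Figure 14B can be sketched functionally as below. The sketch serializes what the hardware does in parallel: each inner-loop step stands for one parallel mapping (four products, one per lane) followed by one aggregation. The function name and the nested-list matrix layout are assumptions.

```python
def matmul_4x4(M, N):
    """Two-level loop of Figure 14B: 16 mapping/aggregation operations."""
    P = [[0] * 4 for _ in range(4)]
    for row in range(4):          # outer loop: one loop controller
        for col in range(4):      # inner loop: a second loop controller
            # Mapping: the 4 lanes each form one product in parallel.
            products = [M[row][k] * N[k][col] for k in range(4)]
            # Aggregation: the post-processor sums the lane outputs.
            P[row][col] = sum(products)
    return P
```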
To realize the described functions, the multi-lane system of this embodiment extends the instruction set of the preceding examples. The extension instructions are shown in Figure 15: a parallelism setup instruction (SETWDTH), a normalized lane number setup instruction (SETNLN), a data engine setup instruction (SETDE), a loop setup instruction (SETLOOP), a loop instruction (LOOP), and a space-time loop decision instruction (LOOPTS). It should be noted that Figure 15 merely gives an example of one loop program running on the system of the present invention; for those skilled in the art, changes to the instruction formats and substitutions, adjustments, and improvements of the instructions in this program should all fall within the protection scope of the claims of the present invention.
In this embodiment, the program performing the above matrix multiplication is shown in Figure 15. Each extended instruction takes either format A of Figure 2, comprising an opcode (OP) 21, source operand one (Source 1) 22, destination operand (Dest) 23, source operand two (Source 2) 24, and auxiliary fields (AUX) 25 and 26; or format B of Figure 2, comprising an opcode (OP) 21, source operand one (Source 1) 22, a base register address 23, and an offset 28. Decoding the opcode 21 of an extended instruction reveals which operation the instruction performs.
In this embodiment, when the program starts, lane number 16 executes ordinary instructions that need no parallel execution. Execution then reaches the first instruction in Figure 15, whose mnemonic is SETWDTH. This is a parallelism setup instruction in format B; it is the means by which a computer program of the present disclosure requests hardware lane resources from the multi-lane processor system, in order to set up the lane allocator 188 and the lane group controller 189. The offset 28 of the instruction stores the number of lanes the program requests (e.g., '4', meaning that 4 lanes are needed); the source operand field 22 holds the request number (Request Number), and the base register address field 23 may also be used. Another kind of parallelism setup instruction can also be defined, which takes the base address in the register pointed to by field 22 of format B plus the offset in field 28 as an address, and fetches the required lane count from the cache or memory.
Returning to Figure 13B: in this example, when this instruction is executed, the requested lane count and the lane request number 196 are sent to the lane allocator 188. The lane allocator 188 allocates the lanes then available to satisfy the needed lane count; when the available lane count is less than the required lane count, it can also have the lane group controller 189 execute the program instructions in multiple rounds to realize the function of the complete program. In this example the lane allocator 188 moves the available lane numbers 17, 18, and 19 from the available-resource pool in 188 into the group in 188 that originally recorded the lane occupied by lane 16, names this lane group with the above request Q, and allocates one of the multiple lane group controllers 189 to this group, storing request Q into the request register in that 189 so that subsequent instructions can control this lane group controller 189 via request Q, and so that this 189 controls the operation of lanes 16, 17, 18, and 19 of group Q. The lane allocator 188 stores the available lane count for request Q into the register 191 of the group-Q 189, associating this 189 with the newly allocated lane group, so that every lane in group Q is managed by this group-Q 189.
In this example, the lane group controller 189 is composed of registers 191 and 194, a subtractor 192, a selector 193, and logic 195. Register 191 stores the maximum lane count currently available for use, and its output is sent to one input of the subtractor 192. One input 196 of the selector 193 comes from the requested lane count contained in field 28 of the parallelism setup instruction; when the parallelism setup instruction is executed, the selector 193 selects this requested lane count and sends it to the other input of the subtractor 192. The subtractor 192 then subtracts the available lane count in register 191 from the requested lane count, and the result (i.e., the requested lanes that have not yet been allocated) is written into register 194. The value of register 194 is in turn sent to the other input of the selector 193 and to the logic 195. The logic 195 receives the outputs of register 194 and register 191 and produces 3 outputs: 197, 198, and 199. Because the available lane count may be less than the requested lane count, the available lanes need to be reused to fulfill the program's requirements; in this embodiment the NLN of the base lane is therefore not necessarily '0', but is set by the value of the control lane number 197. In the last round of the loop, the available lane count may exceed the requested lane count; the lane group in the lane allocator 188 then, controlled by the reused lane count 198, returns the surplus lane numbers to the resource pool, using only the lanes still needed to complete the instruction loop. The loop decision 199 is used to decide whether to execute the loop.
The meaning of the loop here is as follows. In the prior art, to improve programming efficiency and code density, a repeatedly executed section of code is expressed as a backward-branching loop; this unfolds the code in time and cannot reduce the time needed for program execution. In contrast, the dedicated parallelism setup instruction SETWDTH disclosed in this embodiment explicitly requests lane resources from the processor system; this unfolds the program in space (across multiple lanes), which can save execution time and can typically replace the outermost loop of a prior-art program. However, the number of lanes available in the processor system may be less than the number of lanes the program requests; the present embodiment then handles this with a space-time two-dimensional unfolding: the instruction segment is first unfolded in space over the available lanes and executed, and the part for which spatial unfolding is insufficient (i.e., the part of the requested lane count beyond the available lane count) is unfolded in time as a loop. This loop is not expressed in the program; it is a loop that the processor system determines, from the space request expressed by the program and the spatial resources available at the time, in order to trade time for space; it is herein called a temporal loop, to distinguish it from the loops explicitly expressed in the program. In this embodiment, the lane group controller 189 controls the execution of the temporal loop.
When the output of register 194 is greater than '0' (meaning the available lane count does not satisfy the requested lane count), the value of 197 is the output of 194, the value of 198 is the output of 191, and the value of 199 is '1', indicating that a subsequent temporal loop needs to be executed. When the output of register 194 equals '0' (meaning the available lane count exactly satisfies the requested lane count), the value of 197 is the output of 194, the value of 198 is the output of 191, and the value of 199 is '0', indicating that no subsequent temporal loop is needed. When the output of register 194 is less than '0' (meaning the available lane count exceeds the requested lane count), the value of 197 is '0', the value of 198 is the sum of the outputs of registers 191 and 194, and the value of 199 is '0', indicating that no subsequent temporal loop needs to be executed. As stated above, in the present embodiment field 28 of the SETWDTH instruction is '4' and the lane allocator 188 allocates 4 lanes for request Q, so the difference in register 194 is '0'. Therefore the value of 197 is '0', the value of 198 is '4', and the value of 199 is '0'.
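The three cases of logic 195 amount to a small truth table over the sign of register 194. The following is a hedged behavioral sketch; the function and argument names are illustrative.

```python
def logic_195(remaining, available):
    """Outputs (control lane number 197, reused lane count 198,
    loop decision 199) from register 194 (remaining request) and
    register 191 (available lanes), per the three cases above."""
    if remaining > 0:            # request not yet satisfied
        return remaining, available, 1
    if remaining == 0:           # request exactly satisfied
        return remaining, available, 0
    # More lanes available than requested: 198 returns the surplus.
    return 0, available + remaining, 0

# SETWDTH example above: 4 lanes requested, 4 allocated, so register
# 194 holds 0 and the outputs are 197 = 0, 198 = 4, 199 = 0.
```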
The lane allocator 188 has the newly allocated lanes 17, 18, and 19 of lane group Q also accept the instructions read from the instruction memory 406 by the tracker of lane 16. As before, the lane allocator 188 also sets the base register of lane 16 within the group to the value marking it as the base lane (Starting Lane), and sets the base registers of the remaining lanes 17, 18, and 19 within the group to the value marking them as non-base lanes.
Returning to Figure 15, the tracker of lane 16 controls lanes 16, 17, 18, and 19 to start parallel execution of the subsequent instructions. The next instruction is the broadcast load instruction BCLD: it controls the base lane to take the original base address in its R18 register plus the offset in the instruction as a data address, fetch a new base address from the memory or cache, and store it into the R28 register of every lane in the group. The three following instructions are ordinary load instructions LD, which use the base address in each lane's R28 plus suitable offsets to store the memory addresses of element N0, element M0, and element P0 in Figure 14A (i.e., the data start addresses of multiplicand matrix N, multiplier matrix M, and result matrix P) into registers R1, R2, and R3 of each lane respectively. The next instruction, with mnemonic MOVLN, is the aforementioned lane move instruction, which moves each lane's lane number from its lane register into register R11 of the register file. The next instruction, with mnemonic SUBSCH, subtracts, in the base lane, the control lane number 197 output by the lane group controller 189 associated with lane group Q (whose value is '0' in this example) from the lane number in register R11 of its register file, and stores the difference into R12 of the base lane (in this example the value in R12 equals the value in R11). The next instruction, with mnemonic BCSUB, is the aforementioned broadcast subtract instruction, which performs the lane normalization operation as before: in each lane, the value of register R12 in the base lane is subtracted from the lane number in that lane's register R11, and the result is stored back into R11. The next instruction, SETNLN, stores the value of register R11 of each lane into the NLN registers of all data engines attached to that lane. After these operations, the normalized lane numbers stored in the data engines of each lane are numbered incrementally from '0' starting at the base lane, i.e., the NLN of lanes 16, 17, 18, and 19 is 0, 1, 2, 3.
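The MOVLN/SUBSCH/BCSUB/SETNLN sequence amounts to subtracting a common offset from every physical lane number. A hedged sketch of the net effect follows; the variable `r12` models the example's R12 register, and the function itself is illustrative.

```python
def normalize_lanes(lane_numbers, base_lane, control_lane_no=0):
    """Map physical lane numbers to normalized lane numbers (NLN)."""
    r12 = base_lane - control_lane_no          # SUBSCH in the base lane
    return [ln - r12 for ln in lane_numbers]   # BCSUB in every lane

# Lanes 16..19 with base lane 16 normalize to NLN 0, 1, 2, 3.
```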
The source operand one field 22 of the data engine setup instruction (SETDE) contains the data start address register number (e.g., R1, R2, R3). The destination operand 23 stores the data source/destination register number used when the data engine accesses data: this register receives the fetched data on a data read, or supplies the data to be sent to memory on a data store. The source operand two field 24 stores the number of the data engine to be set up by this instruction (e.g., data engine DE0, DE1, DE31). The first auxiliary field 25 contains the data address change step (stride) for each data access, and the second auxiliary field 26 contains the address gap between adjacent lanes (Lane Diff) for each execution of this instruction. When the instruction decoder in each lane decodes a data engine setup instruction, it allocates one data engine attached to that lane and associates it with the engine number in field 24 of the instruction (e.g., by storing the number into a number register in that engine), so that subsequent instructions can control the operation of this data engine via this number. The information in the other fields of the setup instruction is also stored into the respective registers in the engine. Taking data engine 150 as an example: the register address in field 23 of the instruction is stored into register 159 in 150, indicating the destination register for the data fetched by the data engine; the stride in field 25 is stored into register 155; the inter-lane address gap (Lane Diff) in field 26 is stored into register 157; and the base address read from the register addressed by field 22 is added to the product of the normalized lane number and the inter-lane gap, the result ((Base) + NLN*LaneDiff) being stored into register 153 as the start address of this data engine in this lane. The above operations may be calculated by dedicated computation resources in each lane, or may be completed by the execution unit in the lane under instruction control (e.g., one multiply-add instruction and one data move); this is not repeated here. The data engine setup instruction further controls the selector 152 to gate the output of register 153, and in the following clock cycle the output of selector 152 is stored into register 151. The load data engine 150 uses the output of 151 as an address to access the data cache 146, and the fetched data is stored into the register of this lane's register file pointed to by register 159. In the load data engine 150, every time register 151 is updated, the data engine is triggered to read the data cache 146 at this updated address and load the data over bus 156 into the register of the lane's register file pointed to by register 159 in 150. Meanwhile the value in register 151 is added to the stride value in register 155 to produce the next data memory address.
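The SETDE start-address rule, (Base) + NLN*LaneDiff, combined with the per-access stride of register 155, determines the whole address sequence a configured engine generates. A sketch under illustrative numbers follows; none of the constants below are from the patent.

```python
def setde_addresses(base, nln, lane_diff, stride, n_accesses):
    """Addresses one lane's data engine generates after SETDE:
    start = base + nln * lane_diff (register 153), then stepped by
    the stride per access (registers 151 and 155)."""
    start = base + nln * lane_diff
    return [start + i * stride for i in range(n_accesses)]

# A lane with NLN = 2, base 0, LaneDiff 1, stride 4 would access
# addresses 2, 6, 10 over three loop iterations.
```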
Each data engine configuration instruction configures one data engine in every lane. That is, the first configuration instruction configures a DE0 data engine in each lane to read the N matrix; the second configures a DE1 data engine to read the M matrix; and the third sets up store data engine 170, attached to post-processor PPU 190, to write the aggregated results back to the P matrix in memory 146. The address-generation parts 171, 172, 173, 174, 175, 176 of engine 170 are identical to the corresponding parts 151, 152, 153, 154, 155, 156 of load data engine 150. The difference is the direction of data flow: in a load data engine, data flows from the data cache to the lane register file, whereas in a store data engine it flows from the register file to the data cache. A further difference is that the register number held in register 179 of store data engine 170 points to a read port of the register file, while register 159 of load data engine 150 points to a write port. A store data engine attached to a post-processor can omit register 179, because its only data source is the post-processor's output.
The data engine number carried in the setup instruction can implicitly (Implicitly) select a load or a store data engine (e.g., numbers DE0–DE15 configure load data engines, DE16–DE23 configure the per-lane store data engines, and DE24–DE31 configure post-processor store data engines). The selection can also be made explicit (Explicit), with separate load data engine setup, lane store engine setup, and post-processor store engine setup instructions achieving the same configuration. This example uses the implicit form of the data engine setup instruction. The corresponding post-processor connections are also established when lane allocator 188 allocates the lanes. When the third data engine setup instruction is executed, lane allocator 188, according to the number DE31 in the instruction, configures the configurable aggregation network of the allocated group of four lanes so that the intermediate results output by all four lanes are aggregated into store data engine 170 attached to post-processor 190, as this instruction specifies.
In this example, the data address of matrix M changes in the outer loop; its address stride (Stride) is '4' (the address increases by '4' each time) and its adjacent-lane address difference (Lane Diff) is '1'. The data address of matrix N changes in the inner loop; its stride is '1', its data address is reset to the initial address when the inner loop completes, and its adjacent-lane difference is '4'. The data address stride of matrix P is always '1'; since every element of P is an aggregated result, there is no adjacent-lane address difference (the value is '0'). The three data engines are configured by three data engine configuration instructions: the first configures each lane with the initial address of the corresponding element in the first column of matrix N in Fig. 14A; the second configures each lane with the initial address of the corresponding element in the first row of matrix M in Fig. 14A; the third configures the store data engine attached to the last post-processor with the initial address of the element in the first row and first column of matrix P in Fig. 14A. Each engine then computes data addresses according to the state of its associated loop controller.
Thus, after the first two configuration instructions have executed, the R5 register of lane 0 (normalized lane number 0) holds element M0 of Fig. 14A and its R4 register holds N0; lane 1 holds M1 in R5 and N4 in R4; lane 2 holds M2 in R5 and N8 in R4; lane 3 holds M3 in R5 and NC in R4. This pattern arises because the DE1 data engine responsible for loading the M matrix has a lane interval of '1', while the DE0 data engine responsible for loading the N matrix has a lane interval of '4'. After the third data engine configuration instruction has executed, the output of register 171 in data engine 170 points to element P0 in data cache 146, ready to write the aggregated result produced by post-processor 190 into data memory 146 through first-in-first-out buffer (FIFO) 176. FIFO 176 holds aggregated results temporarily to avoid read/write conflicts on data cache 146.
A loop setup instruction (SETLOOP) configures a loop controller 130; it uses instruction format B of Fig. 2. Its field 28 holds the number of times the loop executes (e.g., '3'); the target operand field 23 holds the number of the loop controller corresponding to this loop (e.g., loop controller J or K); field 22 holds the number of the data engine associated with this loop. When a decoder decodes a loop setup instruction, lane allocator 188 allocates a loop controller 130 to be shared by this lane group, and associates the loop controller number in the instruction with that controller (e.g., by storing the number in the controller, or by recording in 188 that the number is related to this controller). At the same time the loop count in field 28 of the instruction is stored into register 131 of the allocated loop controller 130, among other operations. Taking Fig. 15 as an example, for the first loop setup instruction, lane allocator 188 in Fig. 13B allocates a loop controller 130 and names it J per the number in instruction field 23.
This loop setup instruction drives control line 137 of loop controller J in Fig. 13B to '0', which through AND gate 136 makes control line 138 '0', so that selector 133 in loop controller J (130) selects the output of register 131. The instruction then asserts register write signal 149 in 130 on the following clock cycle, writing the output of selector 133 into register 134. Control line 137 is afterwards reset to '1'. At the same time, per instruction field 22, switch 181 of the data engine 150 numbered DE0 is closed so that its selector 152 is controlled by control line 138 of loop controller J (130); and switch 183 is closed so that write control signal 158 of register 151 in DE0 data engine 150 is controlled by register write signal 149 of loop controller J. The data engine numbered DE0 thereby follows the actions of loop controller J, i.e., it is associated with J. The loop count '3' now stored in register 134 of loop controller J (130) is 'OR'ed by OR gate 135, making the value on control line 141 '1', which through AND gate 136 makes control line 138 '1' as well. This value makes selector 133 in 130 select the output value '2' of decrementer 132, and also makes selector 152 in 150 select the output of adder 154, whose value is the data address output by register 151 plus the stride stored in register 155; in the DE0 data engine that stride is '1'.
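The count path of a loop controller (register 134 holding the remaining count, decrementer 132, OR gate 135 producing the "count nonzero" line 141, and reload from register 131 on exit) can be sketched as below. This is a simplified illustrative model with assumed names, not the patent's implementation:

```python
class LoopController:
    """Toy model of loop controller 130's count path (registers 131/134)."""

    def __init__(self, count):
        self.programmed = count      # register 131: count from the SETLOOP instruction
        self.remaining = count       # register 134: remaining loop count

    def take_branch(self):
        """Evaluated each time the associated LOOP instruction executes."""
        if self.remaining != 0:      # OR gate 135 -> line 141 = '1': loop taken
            self.remaining -= 1      # decrementer 132 via selector 133
            return True
        self.remaining = self.programmed  # exit: selector 133 reloads register 131
        return False
```

With a programmed count of '3', the branch is taken three times and falls through on the fourth evaluation, so the loop body runs four times per pass and the count is restored for the next pass, consistent with the walkthrough in the text.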
For one loop setup instruction, all data engine numbers appearing in its fields 24, 25 and 26 are associated, by the above process, with the loop controller allocated under the number in field 23 of that instruction; this loop controller then controls the stepping of each associated data engine. If the number of load or store instructions contained in one loop (each such instruction corresponding to one data engine) exceeds the number of data engine numbers that one loop setup instruction can carry, the program can include an additional loop setup instruction with identical fields 23 and 28, whose fields 24, 25 and 26 carry the data engine numbers the first instruction could not accommodate. When lane allocator 188 decodes the second instruction and finds that a loop controller has already been allocated for the controller number in its field 23, it likewise associates the data engines named in that instruction with the same loop controller, and either writes the loop count in that instruction into register 131 of that loop controller 130 again (the value written the second time being identical to the first) or does not write the register at all. In Fig. 15, the second loop setup instruction, following the example above, configures loop controller K and associates data engine DE1 with it. The stride stored in register 155 of the DE1 engine is '4', so the output of its adder 154 is the data address in its register 151 plus '4'.
Next comes an ordinary multiply instruction MUL, which multiplies the contents of the R4 register in each lane by the value in its R5 register and deposits the product in that lane's R6 register. Next is the aggregate-add instruction RDUADD, which sends the value in each lane's R6 register into the post-processing network to be summed. Field 25 of this instruction also points to data engine DE31. Referring to Fig. 13B, when this instruction reaches the DE31 data engine 170 along the post-processing network, the output of post-processor 190 is written into data cache 146 at the P0 address held in register 171. This completes the operation of the first row of Fig. 14B: lane 0 performs M0*N0, and lanes 1, 2 and 3 perform M1*N4, M2*N8 and M3*NC respectively; the post-processor adds the products of the four lanes, and the sum is stored back into the data cache as P0, the first element of matrix P, at the address pointed to by register 171 in store data engine 170. After P0 has been stored, register 171 in DE31 updates: with no data engine configuration instruction in effect, selector 172 now selects the output of adder 174, which is the address of element P0 plus the stride '1' stored in register 175, i.e., the address of element P1, to be used as the store address of the next iteration. The MUL and RDUADD instructions above could also be merged into a single multiply-aggregate-add instruction to save an instruction execution cycle.
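The MUL/RDUADD pair can be sketched as follows (an illustrative model, with assumed names): each lane forms one product into its R6, and the post-processing network reduces the lane products to a single sum that DE31 stores as one P element.

```python
def mul_rduadd(r5, r4):
    """r5, r4: per-lane register values (M and N elements). Returns one P element."""
    r6 = [m * n for m, n in zip(r5, r4)]   # MUL: per-lane product into R6
    return sum(r6)                         # RDUADD: aggregate-add via post-processor 190
```

For the first row of Fig. 14B, `mul_rduadd([M0, M1, M2, M3], [N0, N4, N8, NC])` would yield P0 = M0*N0 + M1*N4 + M2*N8 + M3*NC.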
Returning to Fig. 15, the next instruction is LOOP. The loop instruction LOOP is a special branch instruction whose offset is negative; its format is type B of Fig. 2. Its field 22 holds the number of the loop controller governed by this instruction, J in this case; its field 28 holds the branch offset, which here points back to the MUL instruction in the program (marked T1). Referring to Fig. 13B: while instructions 128 from the lower memory hierarchy are being filled into instruction cache 406, they are also scanned, analyzed and computed upon by scanner 408; each instruction's type and branch target are extracted and stored into the entry of track table 410 corresponding to that instruction in instruction cache 406. In this embodiment the instruction type includes the signals that control selector 139 and selector 442. When the instruction being executed is not a loop instruction, the type signal read from track table 410 makes selector 139 select the TAKEN signal of the lane's branch decision logic 149 to control selector 442 in the tracker. The BRANCH signal from the lane is continuously valid in normal operation, so register 443 updates every clock cycle and supplies an instruction address to instruction cache 406 and the track table, causing 406 to deliver a new instruction to the lanes each cycle for execution. Only when the lane pipeline stalls does this signal become invalid, stopping register 443 from updating and suspending the delivery of new instructions from 406 to the lanes.
When register 443 in the tracker outputs the address of the first LOOP instruction, that loop instruction is read from instruction cache 406 at this address for the lanes to decode and execute; at the same time the type signal of this loop instruction is read from track table 410 at the same address, making selector 139 select the output of selector 168 to control selector 442 in the tracker. Track table 410 simultaneously outputs the branch target address T1 of this loop instruction to one input of selector 442; the other input of selector 442 comes from incrementer 441, whose value is the address of this loop instruction (i.e., the current output of register 443) plus one. The output of selector 168 therefore determines whether the following clock cycle executes the instruction after the current one in sequence (the second LOOP instruction) or the current instruction's branch target T1 (the MUL instruction). The loop instruction is decoded, and the value in its field 22 makes selector 168 select output 141 of loop controller J (130) to control selector 442. When a loop controller 130 was configured as 'J', the select line of selector 168 corresponding to that controller's output signal 141 was configured as 'J' at the same time, so a later instruction carrying the number 'J' matches and selects the output 141 of the 'J' loop controller; selector 168 handles its other inputs the same way, all by matching. As stated earlier, the value on control line 141 is currently '1', meaning the loop is taken; selected through 168 and then 139, it makes selector 442 in the tracker select the T1 address output by track table 410, which becomes the instruction address of the following clock cycle. At the same time, based on the decoding of this loop instruction, or directly on the loop-instruction type signal read from track table 410, register write signal 149 in loop controller J (130) and write signal 158 in data engine DE0 are enabled (Enable).
Thus in the next clock cycle, register 134 in loop controller J (130) updates and stores the new loop count '2'. In each lane, register 151 in the DE0 data engine updates, storing an address greater than the previous value by '1' (the stride), and data is fetched from data cache 146 at this new address and stored into the R4 register of the respective lane. The T1 address is stored into register 443 in the tracker and used to access instruction cache 406 and track table 410 for the MUL instruction and its track table entry. Because the DE1 data engine is not associated with loop controller J, it is unaffected by this loop instruction, so the values the DE1 data engine wrote into each lane's R5 register do not change. At this point the R5 register of lane 0 still holds element M0 of Fig. 14A, while its R4 register holds N1; lane 1 still holds M1 in R5 and now N5 in R4; lane 2 still holds M2 in R5 and N9 in R4; lane 3 still holds M3 in R5 and ND in R4. The MUL and RDUADD instructions of the Fig. 15 loop then execute again, producing, as before, the result M0*N1+M1*N5+M2*N9+M3*ND shown in the second row of Fig. 14B, which is stored at the P1 position of matrix P in data cache 146 indicated by register 171 in the DE31 store data engine of Fig. 13B.
The first LOOP instruction of Fig. 15 then executes again. Since the value in register 134 of loop controller J (130) is now '2', control lines 141 and 138 are '1' and the loop executes as described above. As a result, the data in each lane's R4 register is updated; the execution of MUL and RDUADD produces the result of row P2 of Fig. 14B and writes it back to data memory 146; and the value in register 134 becomes '1'.
The first LOOP instruction of Fig. 15 executes once more. Since the value in register 134 is now '1', control lines 141 and 138 are '1' and the loop again executes as above. The data in each lane's R4 register is updated; the execution of MUL and RDUADD produces the result of row P3 of Fig. 14B and writes it back to data memory 146; and the value in register 134 becomes '0'.
When the first LOOP instruction of Fig. 15 executes yet again, the value in register 134 of loop controller J (130) is '0', so control lines 141 and 138 are '0' and program execution exits the (inner) loop. The '0' on control line 141, through selectors 168 and 139, makes selector 442 in the tracker select the output of incrementer 441, so that in the next cycle register 443 stores the address of the next sequential instruction (the second LOOP instruction of Fig. 15). Meanwhile the '0' on control line 138 makes selector 133 in loop controller J select the loop count '3' stored in register 131, which is stored into register 134 in the next cycle; the '0' on control line 138 also makes selector 152 in each lane's DE0 data engine 150 select the base address in register 153, which is stored into register 151 in the next cycle. The loop controller and data engines associated with this loop instruction (the inner loop) are thereby all restored to their initial states, ready to execute the entire inner loop again.
In the following clock cycle, the DE0 data engine in each lane reads data from data cache 146 at the address in its register 151 and stores it into each lane's R4 register. At the same time, the second LOOP instruction of Fig. 15 is read from instruction cache 406 at the address in register 443, and the corresponding loop-instruction type and branch target T1 are read from track table 410. The execution of the second LOOP instruction is similar to that of the first; the difference is that this instruction executes the outer loop, acting on loop controller K and its associated data engine DE1, without affecting the inner loop's controller J and its associated data engine DE0. Register 134 in loop controller K currently stores the loop count '3' (written by the second SETLOOP instruction of Fig. 15), so control lines 141 and 138 of loop controller K are '1'. The outer loop therefore proceeds as in the inner-loop example above: the loop count in register 134 of loop controller K is decremented to '2'; in each lane, register 151 of the DE1 data engine stores the new data address obtained by adder 154 from the previous address in register 151 plus the stride '4' in register 155, and data is read from data cache 146 at this new address; and register 443 also stores the branch target address of the second LOOP instruction read from track table 410 (which is likewise T1).
In the following clock cycle the MUL instruction of Fig. 15 is read from instruction cache 406 at the address in register 443 for each lane to decode and execute. Lane 0's R5 register now holds element M4 of Fig. 14A and its R4 register holds N0; lane 1 holds M5 in R5 and N4 in R4; lane 2 holds M6 in R5 and N8 in R4; lane 3 holds M7 in R5 and NC in R4. After the MUL and RDUADD instructions of Fig. 15 execute, the result of row P4 of Fig. 14B is produced as before and stored into data cache 146. The first LOOP instruction of Fig. 15 then executes again; since the loop count in loop controller J is now '3', the inner loop executes, jumping back to the MUL instruction. As before, three further inner-loop iterations complete the operations for rows P5, P6 and P7 of Fig. 14B, after which the loop count in loop controller J reaches '0' and the program exits the inner loop.
The second LOOP instruction then executes again. Because register 134 in loop controller K now stores the loop count '2', the outer loop is taken and execution returns to the MUL instruction. This round of the outer loop runs the inner loop three times, executing the instruction segment from MUL to the first LOOP four times in total, computing and storing the four P-matrix elements P8 through PB of Fig. 14B. The second LOOP instruction then executes again; register 134 in loop controller K now stores the loop count '1', so the outer loop is taken once more and execution returns to MUL. This round again runs the inner loop three times, executes the MUL-to-first-LOOP instruction segment four times, and computes and stores the four P-matrix elements PC through PF of Fig. 14B. When the second LOOP instruction executes after that, the loop count stored in register 134 of loop controller K is '0', so the outer loop is exited and the next sequential instruction is executed.
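Under the addressing parameters of this example (DE1: stride 4, lane interval 1; DE0: stride 1, lane interval 4, reset to its base address on inner-loop exit; DE31: stride 1), the double loop just walked through computes a 4×4 row-major matrix product. The sketch below is an illustrative software model of that behavior for the 4×4 case, with assumed names; it is not the patent's hardware:

```python
def matmul_lanes(M, N, lanes=4):
    """Model of the Fig. 15 double loop on `lanes` lanes over flat row-major matrices."""
    P = [0] * (lanes * lanes)
    p_addr = 0                                        # DE31 store address, stride 1
    m_addr = [nln * 1 for nln in range(lanes)]        # DE1 base: (Base)+NLN*1
    for _outer in range(lanes):                       # outer loop K: count 3 -> 4 passes
        n_addr = [nln * 4 for nln in range(lanes)]    # DE0 restored to base on inner exit
        for _inner in range(lanes):                   # inner loop J: count 3 -> 4 passes
            prods = [M[m_addr[l]] * N[n_addr[l]] for l in range(lanes)]  # MUL per lane
            P[p_addr] = sum(prods)                    # RDUADD + DE31 store
            p_addr += 1                               # DE31 steps by 1
            n_addr = [a + 1 for a in n_addr]          # DE0 steps by 1
        m_addr = [a + 4 for a in m_addr]              # DE1 steps by 4
    return P
```

Running this over flat 16-element M and N reproduces the element order P0 through PF of Fig. 14B, i.e., the standard row-by-column product, with each lane supplying one of the four products summed per P element.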
The next instruction has the mnemonic LOOPTS, meaning space-time loop decision. It takes format B of Fig. 2: field 21 is the opcode; field 22 names the lane group the instruction acts on (here Q); field 23 is unused in this instruction; field 28 is the branch target offset, here pointing to T2, the MOVLN instruction in Fig. 15. When this instruction is stored into instruction cache 406, its type and branch target are extracted and computed by scanner 408 and stored into the entry of track table 410 corresponding to this instruction in 406. When a decoder decodes this instruction, the logic controlling selector 168 in Fig. 13B matches the value Q in instruction field 22 and selects the space-time loop decision output 199 of lane-group controller 189 for group Q, which is now '0' (i.e., exit the loop). At the same time, the instruction type read from track table 410 makes selector 139 select the output of selector 168 to control selector 442 in the tracker, which selects the output of incrementer 441 to be stored into register 443. The next sequential instruction SETWDTH then executes; the requested lane count in its field 28 is '1', so lane allocator 188 accordingly dissociates all lanes of this group other than base lane 16, i.e., lanes 17, 18 and 19, together with the loop controllers 130, store data engine 170 and other resources associated with this lane group, and reclaims them into the resource pool to await requests from the base (16) lanes of other threads or of this thread, while base lane 16 continues executing the subsequent single-lane serial instructions.
If, when the program of Fig. 15 is executed, the lane allocator has allocated only two lanes, 16 and 17, then lane-group controller 189 of group Q stores into register 194 the difference '2' obtained by subtracting the available lane count '2' from the requested lane count '4'. The output base lane number 197 of logic 195 is then '2', the lane count 198 used in this iteration is also '2', and the space-time loop decision 199 is '1'. Lane allocator 188 then, per the value '2' of this iteration's lane count 198, directs lanes 16 and 17 to participate in program execution. After the base (16) lane executes the SUBSCH instruction, the value in register R12 is smaller than the value in R11 by '2'. Therefore, after the BCSUB instruction executes, the NLNs of these two lanes are 2 and 3, the lane whose NLN is 2 being the base lane. Because the data engines 150 in the lanes set initial data addresses that depend on NLN ((Base) + NLN*Diff), these two lanes, in continuing to execute the Fig. 15 program up to the space-time loop decision instruction LOOPTS, have actually performed the operations of lanes 2 and 3 for the right half of Fig. 14B, storing incomplete results into the P matrix in data memory 146. When the space-time loop decision instruction is reached, decision output 199 is judged to be '1', so this value, through selectors 168 and 139, makes selector 442 select the branch target T2 now output by track table 410, and in the next cycle execution branches back to the MOVLN instruction of Fig. 15. This loop decision also updates register 194 in group-Q controller 189, storing the difference '0' obtained when the previous value of register 194 (the remaining requested lane count '2') and the value of register 191 (the available lane count '2') are subtracted by subtractor 192 and passed through selector 193. Logic 195 now outputs '0' on 197, '2' on 198, and '0' on 199.
Lane allocator 188 now, per the value '2' of this iteration's lane count 198, again directs lanes 16 and 17 to participate in program execution. After the base lane executes the SUBSCH instruction, the value in register R12 equals the value in R11, so after the BCSUB instruction the NLNs of the two lanes (16 and 17) are 0 and 1, the lane whose NLN is 0 being the base lane. Because the initial data addresses set by the lanes' data engines 150 depend on NLN ((Base) + NLN*Diff), these two lanes, in continuing to execute the Fig. 15 program up to the space-time loop decision instruction LOOPTS, actually perform the operations of lanes 0 and 1 for the left half of Fig. 14B. During this pass of the loop, store data engine 170 operates in read-modify-write mode (Read-Modify-Write): before a result is stored into an entry of data memory 146, the entry's previous content (the incomplete result produced by the earlier pass of the program) is read out and combined in post-processor 190 with the incomplete result of the current pass to form the complete result, which is then written back into data memory 146. Here the post-processor 190 to which store data engine 170 is attached can be a three-input processor, accepting the outputs of the two preceding-level post-processors and an input from data memory 146; or it can be a two-input processor that performs two operations to compute the complete result from three inputs. The Q lane group then executes the space-time loop decision instruction LOOPTS again; since the space-time loop decision value 199 is '0', the loop is exited.
If instead, when the program of Fig. 15 is executed, the lane allocator has allocated three lanes, 16, 17 and 18, then group-Q lane controller 189 stores into register 194 the difference '1' obtained by subtracting the available lane count '3' from the requested lane count '4'. The output base lane number 197 of logic 195 is then '1', this iteration's lane count 198 is '3', and the space-time loop decision 199 is '1'. Lane allocator 188, per the value '3' of this iteration's lane count 198, directs lanes 16, 17 and 18 to participate in program execution. As in the previous example, the NLNs of the three lanes 16, 17 and 18 are now 1, 2 and 3 respectively, so by the time the program reaches the space-time loop decision instruction LOOPTS it has completed the partial operation results of lanes 1, 2 and 3 of Fig. 14B and stored them into the P matrix in data memory 146. Because the space-time loop decision 199 is '1', the loop is taken when the instruction executes. The remaining requested lane count '1' in register 194, minus the available lane count '3', gives the difference '-2', which is stored into register 194. As stated earlier, when the output of register 194 is less than '0', output 197 of logic 195 is '0', output 198 is the sum of the outputs of 191 and 194 (3 + (-2) = 1), and output 199 is '0'. Accordingly, the lane allocator reclaims lanes 17 and 18 into the resource pool per the value of 198, leaving only lane 16 to continue execution, with lane 16's NLN set to the value '0' of 197. Program execution then produces the operation results of lane 0 of Fig. 14B, which are combined by post-processor 190 with the partial results already in the P matrix to produce the complete results stored into the P matrix in data memory 146.
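The 197/198/199 arithmetic of lane-group controller 189 across the three scenarios just described (4, 3 or 2 lanes available against a request for 4) can be sketched as below. This is an illustrative reconstruction from the worked examples, with assumed names, not the patent's circuit:

```python
def lane_group_passes(requested, available):
    """Yield (base_nln_197, lanes_198, loop_199) for each pass of the folded program."""
    rem = requested - available                       # subtractor 192 -> register 194
    while True:
        out197 = rem if rem > 0 else 0                # base NLN for this pass
        out198 = available + (rem if rem < 0 else 0)  # lanes participating this pass
        out199 = 1 if rem > 0 else 0                  # space-time loop decision
        yield out197, out198, out199
        if not out199:                                # LOOPTS exits the loop
            break
        rem -= available                              # LOOPTS updates register 194
```

For requested 4 with 2 available it yields (2, 2, 1) then (0, 2, 0): the first pass runs two lanes as NLN 2 and 3, the second as NLN 0 and 1, matching the example above.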
In Fig. 13B, load data engine 160 can be used interchangeably with 150. Engine 160 additionally contains a first-in-first-out buffer 166 for temporarily holding data read from data cache 146. Data address register 161 in 160 is controlled by the fill-status signal of FIFO 166: when 166 is not full, register 161 may update; when 166 is full, register 161 does not update. And the signal that in engine 150 controls the update of register 151 (produced by the LOOP instruction) instead controls, in 160, the reading of FIFO 166 and the storing of the read data into the register-file entry pointed to by register-address register 169. The remaining components correspond one-to-one in function with those in 150: 161, 162, 163, 164, 165, 167 and 169 correspond respectively to 151, 152, 153, 154, 155, 157 and 159, and are not described again. Once a valid data address is stored into its data address register 161, load data engine 160 starts reading data from data cache 146 at that address. When the data has been successfully read and stored into FIFO 166, register 161 updates, storing the next data address obtained by adding the previous address to the stride in stride register 165 through adder 163, and the next datum is read from data cache 146 accordingly and stored into FIFO 166. This continues until FIFO 166 is full, at which point it stops. Thus when the LOOP instruction directs engine 160 to fill data into the register file, the data comes from FIFO 166, masking the possible access latency of data cache 146. When reads drain FIFO 166 below full, load data engine 160 resumes reading data from data cache 146 to fill 166. When the loop instruction associated with 160 (indirectly, through its loop controller) is judged to exit the loop, the contents of FIFO 166 in 160 are cleared. Further, in this embodiment load data engines 150 and 160 and store data engine 170 share the same basic structure, differing only in the direction of data flow; a single data engine with a configurable data-flow direction could perform the functions of all three kinds of data engine.
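The FIFO-decoupled behavior of load data engine 160 can be sketched as follows (illustrative only; the FIFO depth and all names are assumed): the engine keeps prefetching while FIFO 166 is not full, and loop-driven reads drain the FIFO toward the register file, hiding the cache's access latency.

```python
from collections import deque

class PrefetchLoadEngine:
    """Toy model of load data engine 160 with FIFO 166 (assumed depth 4)."""

    def __init__(self, memory, base, stride, depth=4):
        self.memory = memory          # stands in for data cache 146
        self.addr = base              # data address register 161
        self.stride = stride          # stride register 165
        self.depth = depth
        self.fifo = deque()           # FIFO 166

    def tick(self):
        """One prefetch opportunity: register 161 updates only while 166 is not full."""
        if len(self.fifo) < self.depth:
            self.fifo.append(self.memory[self.addr])
            self.addr += self.stride  # adder 163: next data address

    def read(self):
        """LOOP-triggered read toward the register pointed to by register 169."""
        return self.fifo.popleft()
```

Note the design point the text describes: prefetch stalls on a full FIFO rather than dropping data, and consuming one entry immediately frees a slot for the next prefetch.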
Thus in this embodiment the program states its spatial requirements to the processor system through instructions; the processor system provides the spatial resources available at the time (lanes 140, loop controllers 130, lane-group controllers, etc.) and unrolls in time, through loops beyond those the program itself expresses, whatever spatial requirements it cannot fully satisfy. The operation of a data engine relies on the space (lane interval) and time (stride) increments supplied by the instructions: each data engine is spread spatially across lanes at configuration time, spaced by the lane interval, and thereafter steps automatically in time by its increment upon its trigger condition. The operation of a loop controller relies on the loop count supplied by the instructions, making the associated resources execute the loop the set number of times. A loop can be associated with the memory accesses of data engines, serving as their trigger condition, so that memory accesses follow the loop. A data engine can also spread its initial data address spatially under control of the lane number. The system can further, by setting the controlling lane number and the available lane count, convert the spatial requirements stated by the program into temporal loops, trading time for space. Moreover, data engine configuration and loop controller setup are both performed outside the program's loops, so the number of instructions the system executes to complete the same computation is greatly reduced relative to the prior art.
The system and method in the present embodiment are also applicable to programs that do not require an aggregation operation. That is, in the program of Figure 15, the third SETDE instruction is changed to configure the store data engine 170 in each lane, whose register 179 stores the source register number of the store operation, and the RDUADD instruction is removed from the program. The computation result of each lane is then, when the store data engine 170 of that lane executes the loop instruction, read from the register pointed to by the value in register 179 and stored into the entry of data memory 146 pointed to by the data address held in register 151. On each loop iteration, data engine 170 in the present embodiment also steps and updates the data address in register 151, just as data engine 150 does.
Prior-art CPU programming is based on a single-processor programming model, in which the programmer compiles instruction segments that need to be repeated into loops unrolled in time. Prior-art GPU programming is based on a multiprocessor programming model with a fixed number of lanes; the number of lanes used by a program must correspond to the number of lanes of a particular processor, so programs cannot be general-purpose, nor can they be compatible with CPU programs. The present invention discloses a novel elastic programming model, characterized as a universal programming model in which the processor can autonomously convert space into time. It is applicable to all arithmetic devices containing the space-to-time conversion technology disclosed in this invention, whether single-core (single-lane) or multi-core (multi-lane), including CPUs, DSPs, MCUs, GPUs, GPGPUs, and in-memory arithmetic units. This programming model also allows the serially executed and parallel-executed parts of a program to connect seamlessly: a parallelism-setting instruction informs the processor system of the resources required, and the processor system automatically allocates resources according to the priority of the request and the available resources, so as to satisfy the program's requirements. Based on this programming model, programmers can code in an easier spatially parallel style, and programs written in this model can run on various arithmetic devices whose execution resources differ. This programming model enables instruction extensions based on a CPU instruction set to be applied to inter-core cooperative computing in multi-lane or multi-core processors.
The present embodiment is explained using the multi-lane processor of Fig. 4 as an example, but the methods and systems disclosed in this embodiment can in fact be applied to any processor system with the same effect. As long as there is a single core, the cycle controller 130 and data engine 150 of the present embodiment can be used to execute the loop-setting, data-engine-configuration, and loop instructions of this example. The scanner 408, track table 410, and tracker of the present embodiment are not essential: the PC address output by the instruction addressing unit (PC unit) of a prior-art processor can address instruction memory 406 to supply instructions to the lane, and the output signal of selector 139 in Figure 14B can control this instruction addressing unit as the branch decision signal. The signal controlling selector 139 in Figure 14B can be obtained by instruction decode: when a loop instruction is executed, the output of selector 168 is selected; when other instructions are executed, the branch decision signal TAKEN produced by the branch decision logic 149 in the processor core is selected. If the above single-core (single-lane) system is augmented with a lane group controller 189, it can realize the space/time conversion function described in this embodiment and execute the described parallelism-setting instruction. In a single-core system, replacing register 191 in 189 with a constant '1' allows the system, as described above, to convert the program's space requirement into loops in time and execute the program correctly. For a multi-core (multi-lane) system, a lane allocator 188 must additionally be provided to allocate resources according to program requirements. If inter-lane execution or post-processing instructions are required, inter-lane or post-processing units and buses are provided as disclosed in the embodiments of the present invention.
An embodiment of a map/reduce (mapping/aggregating) algorithm based on the disclosed technology is given below. Take the counting problem in big-data processing as an example: a data file usually contains a large number of records, each record containing certain numerical attributes of some entity. The goal of the counting problem is to compute the value of a function expression, such as a sum or an average, over a numerical attribute of each entity. Specifically, suppose a call detail record (CDR) file contains phone numbers and the number of traffic bytes of each network access, and the total network access traffic of each phone must be computed. When the map/reduce algorithm is used to solve this problem, in the map phase different lanes extract the entity identifier (phone number) and target attribute value (traffic byte count) from different CDR files; in the reduce phase, the attribute values of all identical entity identifiers are combined according to the function expression (for example, by addition).
In actual operation, all lanes participating in this counting algorithm execute the same program segment, while the aggregation module can be formed from adders converging in a tree, each level adding two results from the level above and sending the sum to the level below, for example as in the pseudo-code below:
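The original pseudo-code figure is not reproduced in this text. The sketch below is a hedged reconstruction consistent with the surrounding description only: each "lane" maps one CDR file to per-phone traffic sums, and a binary adder tree merges the partial results, mirroring the tree-shaped aggregation module. All function names are illustrative, not taken from the patent.

```python
def map_lane(cdr_file):
    """Map phase: extract (phone number, traffic bytes) and sum per phone."""
    sums = {}
    for phone, traffic in cdr_file:
        sums[phone] = sums.get(phone, 0) + traffic
    return sums

def aggregate(left, right):
    """One tree-level adder: outputValue = propertyValue1 + propertyValue2."""
    out = dict(left)
    for phone, traffic in right.items():
        out[phone] = out.get(phone, 0) + traffic
    return out

def tree_reduce(partials):
    """Pairwise reduction, one tree level per iteration, as in the adder tree."""
    while len(partials) > 1:
        partials = [aggregate(partials[i], partials[i + 1])
                    for i in range(0, len(partials), 2)]
    return partials[0]
```

With four lanes, `tree_reduce` performs exactly the two levels of aggregation mentioned below.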
Here, the code line `outputValue = propertyValue1 + propertyValue2` implements an aggregation-by-addition function similar to RDUADD in the embodiment of Figure 15. Thus, assuming 4 lanes participate, the above program runs simultaneously in these 4 lanes, performing value extraction on 4 data files simultaneously, and the final result is obtained through two levels of aggregation.
In actual operation, taking the structures of Fig. 4 and Figure 13B as an example, instruction memory 406 stores the above pseudo-code realizing the map function. The trackers of the four lanes 401, 403, 405 and 407 all begin executing the same code from the same instruction, and execute the loop code the same number of times. Clearly, the 4 trackers then operate in complete lockstep, so a single tracker could instead be used to control all 4 lanes. In this example, the data memory stores the contents of multiple CDR files, and the data engines are configured so that each data engine begins accessing the data memory from a different data address corresponding to a different CDR file. Thus, the data engine of each lane obtains the data of a different CDR file from the data memory, extracts the network access traffic information of each phone, and sends it to the aggregation module. The aggregation module executes the above pseudo-code realizing the aggregation function, accumulating the traffic information sent from each lane to obtain the required final traffic sum. In this example, the instructions executed by the lanes at any given moment are identical but the data processed differ, i.e., the described counting function is realized in SIMD fashion.
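The SIMD configuration just described can be sketched as follows. This is a software illustration under stated assumptions: one shared "data memory" list holds several CDR files back to back, each lane's data engine is configured with a different start address, and every lane runs the identical loop body the identical number of times; addresses and record layout are illustrative only.

```python
def run_simd(data_memory, base_addrs, record_count):
    """Every lane executes the same loop body over its own address range."""
    lane_results = []
    for base in base_addrs:                 # one entry per lane
        sums = {}
        for i in range(record_count):       # identical loop count in all lanes
            phone, traffic = data_memory[base + i]
            sums[phone] = sums.get(phone, 0) + traffic
        lane_results.append(sums)
    return lane_results
```

Only the configured base address differs between lanes, which is exactly what makes a single tracker sufficient here.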
In the above example, if the CDR files differ in size (meaning the loop counts of the corresponding loop code need not be identical), then in order to keep all lanes executing identical instructions, the loop count executed by each lane must nevertheless be the same. In a concrete implementation, each lane executes the same loop-setting instruction, so that the same maximum loop count is written into register 131 of every cycle controller used. As a result, except for the lane whose loop count is largest, every other lane must repeatedly perform useless operations. Furthermore, in this case the space each CDR file occupies in data memory must also equal that of the largest CDR file; the extra read/write operations performed on the additionally occupied portion have no effect on the actual result and only waste data memory storage space. Therefore, the above pseudo-code can instead be executed on the multi-lane system of the present invention in MIMD fashion, to improve lane execution efficiency and data memory storage efficiency.
Taking the structures of Fig. 4 and Figure 13B as an example, instruction memory 406 stores the above pseudo-code realizing the map function. The trackers of the four lanes 401, 403, 405 and 407 all begin executing the same code from the same instruction, but the 4 trackers each operate independently; the data engine of each lane obtains the data of a different CDR file from the data memory, extracts the network access traffic information of each phone, and sends it to the aggregation module.
In the present embodiment, although the loop-setting instruction executed by each lane is identical, the loop counts set into the cycle controllers differ. For example, the loop-setting instruction of this example may use a register value, rather than an immediate (such as the '3' in the examples of Figures 13, 14 and 15), as the loop count written into register 131. Specifically, the loop count corresponding to each CDR file is determined at compile time, and this loop count is stored together with the CDR file as data in the data memory; before the cycle controller is set, the loop count is read into the same register of each lane, and when the loop-setting instruction executes, the value of that register is written into register 131.
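The MIMD variant can be contrasted with the SIMD case in a short sketch. Under the assumptions stated in the comments, each lane reads its own loop count (stored alongside its CDR file at compile time) into cycle-controller register 131, so lanes with smaller files finish early instead of running useless iterations; the memory layout here is purely illustrative.

```python
def run_mimd(data_memory, file_table):
    """file_table: per-lane (base_address, loop_count) pairs, as if each
    lane had loaded its own loop count into register 131."""
    lane_results = []
    for base, loop_count in file_table:     # each lane has its own count
        sums = {}
        for i in range(loop_count):         # per-lane loop count
            phone, traffic = data_memory[base + i]
            sums[phone] = sums.get(phone, 0) + traffic
        lane_results.append(sums)
    return lane_results
```

Compared with the SIMD sketch, only the source of the loop count changes; the loop body itself is unchanged, matching the text's claim that the loop-setting instruction is identical across lanes.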
Subsequent operation is similar to the previous example, the difference being that a lane that has finished can suspend operation to reduce power consumption, or can proceed with other follow-up operations in advance. The aggregation module, however, when receiving the operation results sent from the lanes, must wait until the results of all lanes have arrived before performing the subsequent aggregation operation. Clearly, an interlocking method can be used, employing synchronization signals to guarantee the correctness of the aggregation operation. In this case, the instructions executed by the lanes at any given moment differ, and the data processed also differ, i.e., the described counting function is realized in MIMD fashion.
Furthermore, the multi-lane system of the present invention can perform map/reduce operations on streaming data. For example, in one case, CDR data is sent to each lane in data-stream form; when the number of dial-outs of a particular calling number across all CDR data must be counted, each lane can perform information retrieval on a different CDR data input stream (i.e., different data). In this example, since only the extracted occurrence counts need to be added, when each lane runs under its own tracker it need not wait for the other lanes to produce a synchronization signal after completing one execution of the map code; it can at any time perform the corresponding aggregation-add operation with a fusion function. That is, the post-processor reads the previously stored accumulated value from the data memory, adds the extraction result of the lane, and stores the sum back into the data memory, so that the extraction result of each lane can be accumulated at any time and the final result in the data memory is updated in real time.
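The fusion-function accumulation just described is a read-add-write-back on a single accumulator location. The sketch below is an assumption-laden software model: a `Lock` stands in for whatever interlock the hardware would provide to make the read-modify-write safe, and the accumulator name is invented for illustration.

```python
from threading import Lock

class FusionAccumulator:
    def __init__(self):
        self.memory = {"count": 0}   # accumulator location in data memory
        self.lock = Lock()           # stands in for the hardware interlock

    def fuse_add(self, lane_result):
        """Read the previously stored value, add the lane's extraction
        result, and write the sum back (the fusion function)."""
        with self.lock:
            old = self.memory["count"]
            self.memory["count"] = old + lane_result
```

Because each call is self-contained, any lane can merge its partial count the moment it finishes, with no global barrier, which is exactly why no synchronization signal is needed in this example.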
In another example, the contents of the above CDR files must be classified before being counted. For example, the call durations of several specific telephone numbers must be accumulated separately. In this case, each lane must execute a different program, each extracting from the same CDR file the call duration corresponding to a different specific telephone number and sending it to the aggregation module for accumulation. Now, still taking the structures of Fig. 4 and Figure 13B as an example, instruction memory 406 stores multiple segments of pseudo-code realizing the map function. The segments are similar in basic structure but differ in the code that matches the telephone number (such as the getItemId function mentioned above), and can therefore distinguish different telephone numbers in the same CDR file (i.e., identical input data). For example, the code of one segment realizing the map function is as follows:
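The per-number map-function figure is not reproduced in this text; the sketch below shows only the structure the description implies. Each lane runs its own copy specialized to one target number (the matching role the text assigns to getItemId), so the segments differ only in the number they match: same input data, different instructions (MISD). The factory function and record layout are illustrative assumptions.

```python
def make_map_function(target_number):
    """Build one lane's map segment, specialized to one telephone number."""
    def map_segment(cdr_records):
        total_duration = 0
        for number, duration in cdr_records:
            if number == target_number:   # the part that differs per segment
                total_duration += duration
        return total_duration
    return map_segment
```

Two lanes constructed with different target numbers scan the same records but extract different durations, which is the classification-before-counting behavior described above.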
The code of another segment realizing the map function is as follows:

The remaining segments of code realizing the map function are all similar to the above two segments and are not repeated here.
In this example, the aggregation modules are configured into multiple groups, each group aggregating the call duration corresponding to one telephone number. Specifically, each group of aggregation modules must wait until the corresponding operation results of all lanes have arrived before performing the subsequent aggregation operation, and the groups need not complete their aggregation operations simultaneously. In this case, the instructions executed by the lanes at any given moment differ but the data processed are identical, i.e., the described counting function is realized in MISD fashion.
In yet another example, the call durations of the same user in different CDR files must be weighted before being accumulated. Since one cannot first accumulate the user's call durations across all CDR files and then multiply by a weighting coefficient, the weighting multiplication cannot be performed in the aggregation module. Instead, the weighting multiplication can be placed in the map code and performed by each lane separately, with the aggregation module only accumulating the weighted call durations, thereby realizing the required function.
For example, the code realizing the map function corresponding to the present embodiment is as follows:
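The weighted map-function figure is likewise not reproduced here; this hedged sketch follows only the description: each lane multiplies the call durations from its own CDR file by a per-file weight before they reach the aggregation module, which then only needs to add. Weight values and record layout are illustrative assumptions.

```python
def weighted_map(cdr_records, user, weight):
    """One lane's map phase: weight this file's durations for one user."""
    total = 0
    for number, duration in cdr_records:
        if number == user:
            total += duration * weight    # weighting done in the map code
    return total

def aggregate_weighted(partials):
    """Aggregation module: pure addition over the pre-weighted values."""
    return sum(partials)
```

Moving the multiplication into the map phase is what makes the aggregation module's job a plain sum, as the text requires.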
Clearly, the above code can be applied in the preceding three examples to realize SIMD, MIMD, and MISD map/reduce operations on weighted data. The concrete realization methods can be found in the previous embodiments and are not repeated here.
Although the above embodiments all take as an example the processor system of the Fig. 4 embodiment, in which the tracker addresses the track table and the instruction read buffer, the MIMD or MISD operation of the above embodiments can also be realized by giving each lane its own instruction read buffer addressed by its own PC, with each lane's own instruction read buffer supplying instructions to that lane separately. That is, the instruction addressing unit of each lane sends its PC address to its own instruction read buffer; on a hit, the instruction is read directly from the instruction read buffer and executed in the lane; on a miss, the PC address is sent to the instruction memory, the instruction block is read to fill the instruction read buffer, and the instruction is simultaneously forwarded to the lane for execution. The lane-number-based instruction execution disclosed in this invention, together with inter-lane processing, post-processing, loops, loop-associated stepping data access, two-dimensional unrolling, space/time conversion, and similar methods, and devices such as the data engine, cycle controller, lane controller, and lane allocator, can be used in any multi-lane system.
Post-processing or inter-lane processing can also be used for on-chip self-test of a multi-lane processor chip. The following uses the tree-shaped post-processing bus of Fig. 8 as an example; the same method can be implemented on the inter-lane transfer bus of Fig. 3 or the bus of Fig. 9 disclosed in this invention. Lanes 80, 81, 82 and 83 all execute the same test vector (program); this vector can be fed in from outside by a chip tester or by a system containing the multi-lane processor chip under test, or read from on-chip memory by the on-chip test controller of the multi-lane processor chip, or generated on chip by some algorithm. The test vector contains post-processing instructions that cause the results produced by the 4 lanes executing the test vector to be sent to post-processors 84, 85 and 86 for comparison, and the operation results of the post-processors to be sent either to the on-chip test controller, or to an off-chip tester or system, for judgment. The post-processors in this example are augmented with a test-specific function: under control of the test controller, one input of a post-processor can be switched through to its output while the other input is ignored. Alternatively, as in the embodiment of Figure 10, the connecting buses between post-processors can be selectively turned off. The results of several lanes are compared in the post-processors (or subtracted, checking whether the difference is '0') to produce a comparison value: a comparison value of '1' means the compared lanes produced identical execution results; a comparison value of '0' means further testing is needed.
Under the self-test default condition, each post-processor can pass its comparison value and one of its two inputs (for example, the left input) on to the next-stage post-processor. First assume all 4 lanes work normally; then every lane executing the same test vector produces identical execution results, so the comparison value read from post-processor 86 is '1', showing that all 4 lanes work normally. Now assume lane 83 malfunctions; the comparison value read from post-processor 86 is then '0', showing that at least one lane is abnormal. The on-chip or off-chip test controller then further controls post-processor 84 to pass the result of lane 80 straight through (bypassing the comparison) and post-processor 85 to pass the result of lane 82 straight through; the comparison value now read from post-processor 86 is '1', showing that lanes 80 and 82 are normal. Comparing the results of lanes 80 and 81 in the same way, the comparison value read from post-processor 86 is still '1', and it can then be determined that lane 83 is abnormal. The above embodiment can detect the abnormality of at least one lane among multiple lanes; by analogy, the results of executing the same program on the same data in different lanes (processor cores) can be compared over multiple rounds to locate multiple abnormal lanes (cores). This method has one blind spot: it cannot discern the case where all lanes of the processor have the same fault. For that, a small number of expected results of the vector, input from off chip or stored on chip, can be compared with the corresponding results of the execution of at least one lane as a means of discrimination.
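The fault-location logic of the comparison tree can be summarized in a short model. This sketch models only the comparisons, not the bus or bypass hardware, and assumes at most one faulty lane (the case the text walks through); with more faulty lanes, further comparison rounds are needed, as the text notes.

```python
def compare(a, b):
    """One post-processor comparison: '1' if the results match, else '0'."""
    return 1 if a == b else 0

def locate_faulty_lane(results):
    """Given the lanes' outputs for the same test vector, return the index
    of the lane that disagrees with all others, or None if all agree."""
    for i, r in enumerate(results):
        agreements = sum(compare(r, other) for other in results)
        if agreements == 1:          # agrees only with itself
            return i
    return None
```

In the text's scenario, lanes 80, 81 and 82 agree and lane 83 disagrees, so the sequence of bypass comparisons converges on lane 83; the model above reaches the same conclusion by counting agreements.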
Storing the lane numbers of abnormal lanes in lane allocator 188, so that abnormal lane numbers do not appear in the resource pool, enables the multi-lane (multi-core) processor chip to be repaired. For an abnormal lane, allocator 188 can shift the lane numbers stored in the lane registers of that lane and of every lane with a larger lane number one lane to the right. The lane numbers in these registers then skip the abnormal lane when the next lane number (NLN) is generated, so that NLN-based instructions can still execute continuously. If the post-processors are further configured to ignore the output of the abnormal lane, a multi-lane processor containing abnormal lanes can still perform the aggregation operations disclosed in this invention. This method can also be used for lane allocation, assigning several non-adjacent lanes to a program or thread requiring multi-lane space resources. If a multi-lane processor is required to have a certain number of lanes, redundant lanes can be added at design time, so that after lane numbers are reassigned following self-test, enough lanes remain to meet the requirement even after abnormal lanes are excluded. The same test method can be used to test inter-lane processing, or the post-processors and their connections; the principle is identical and is not repeated here.
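The renumbering step can be sketched as a simple remapping. This is a hedged software illustration only: given the physical lane count and the set of lane numbers recorded as faulty, logical lane numbers are reassigned contiguously so that next-lane-number generation skips faulty lanes; the data structures are invented for illustration.

```python
def renumber_lanes(physical_lane_count, faulty):
    """Return a map from logical lane number to physical lane, leaving
    lanes recorded as faulty out of the resource pool."""
    mapping = {}
    logical = 0
    for phys in range(physical_lane_count):
        if phys in faulty:
            continue                  # faulty lane excluded from the pool
        mapping[logical] = phys       # healthy lanes shift to fill the gap
        logical += 1
    return mapping
```

With a redundant fifth physical lane and lane 1 faulty, four contiguous logical lanes are still available, matching the redundancy argument above.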
When applied during chip production testing, the above self-test can run the test vectors once on all lanes of the chip in parallel; thereafter, the on-chip or off-chip test controller can direct inter-lane operations or post-processing operations that compare the execution results of the lanes with one another to locate abnormal lanes, reducing test cost. The abnormal lane numbers can be recorded in on-chip non-volatile memory to improve chip yield. The test can also be performed automatically at system boot or periodically during operation, recording abnormal lane numbers in on-chip memory, so that faults arising during processor use can be self-repaired, increasing reliability. The on-chip test vector generator can be realized with one or more random number generators, or with incrementers that exhaustively enumerate whole instructions or each field of an instruction.
Although the embodiments of the invention describe only structural features and/or processes of the present invention, it should be understood that the claims of the present invention are not limited to the described features and processes. On the contrary, the described features and processes are merely several examples realizing the claims of the present invention.
It should be appreciated that the various components listed in the above embodiments are enumerated only for convenience of description; other components may be included, or some components may be combined or omitted. The various components may be distributed over multiple systems, may be physical or virtual, and may be realized in hardware (such as an integrated circuit), in software, or by a combination of the two.
Obviously, in light of the description of the above preferred embodiments, however fast the technology of this field may develop, and whatever advances, at present difficult to predict, may be obtained in the future, the present invention may have its corresponding parameters and configurations adapted, adjusted and improved by those of ordinary skill in the art according to the principles of the present invention, and all such replacements, adjustments and improvements shall belong to the protection scope of the appended claims of the present invention.

Claims (53)

1. A multi-lane/multi-core system, characterized by comprising multiple lanes/processor cores, each lane/processor core having a different lane/processor core number, each lane/processor core being capable of executing identical or different instructions and accessing memory; the system further performs post-processing operations on the execution results of the multiple lanes/processor cores, and accesses memory.
2. The system as claimed in claim 1, characterized in that a global bus is provided between the multiple lanes/processor cores for transferring data in registers, so as to perform cross-lane/cross-core register value moves or calculation operations.
3. The system as claimed in claim 2, characterized in that:
the lanes/processor cores in the multi-lane/multi-core system are divided into multiple lane/processor core groups; the bus switches of the global bus within each lane/processor core group are turned on, and the bus switches of the global bus between lane/processor core groups are turned off, so that the lane/processor core groups simultaneously perform cross-lane/cross-core register value moves or calculation operations within themselves;
when different lanes/processor cores simultaneously execute the same aggregation-degree-setting instruction, the corresponding bus switches are configured to realize the aggregation degree between the corresponding lanes/processor cores.
4. The system as claimed in claim 3, characterized in that when different lanes/processor cores simultaneously execute the same inter-lane/inter-core operation instruction, the source lane/processor core and the target lane/processor core are determined according to the respective lane/processor core numbers, the register value of the source lane/processor core is sent to the target lane/processor core over the inter-lane/inter-core bus, and the target lane/processor core performs the post-processing operation.
5. The system as claimed in claim 1, characterized in that an instruction moves the lane/processor core number of each lane/processor core into a general-purpose register of that lane/processor core.
6. The system as claimed in claim 5, characterized in that different lanes/processor cores compute different data addresses from the same instruction according to their different lane/processor core numbers.
7. The system as claimed in claim 1, characterized in that the system further comprises one or more post-processors, each post-processor being connected to multiple lanes/processor cores, receiving the execution results of the multiple lanes/processor cores, and aggregating the execution results.
8. The system as claimed in claim 7, characterized in that:
the post-processor stores the aggregation result directly into memory; or
the post-processor sends the aggregation result back into a register of at least one lane/processor core.
9. system as claimed in claim 7, it is characterised in that determined the polymerization of post-processing operation by instruction Degree.
10. The system as claimed in claim 9, characterized in that the post-processor performs the aggregation operation by executing a post-processing instruction.
11. The system as claimed in claim 10, characterized in that:
the post-processors are connected by a transfer bus, and each post-processor, by executing the post-processing instruction, aggregates the execution result of its corresponding lane/processor core with the output of the adjacent post-processor; or
the post-processors are connected by a tree-shaped bus, wherein each first-level post-processor, by executing the post-processing instruction, aggregates the execution results of its two corresponding lanes/processor cores, and passes the aggregation result, together with the post-processing instruction or its decoded result, level by level to the remaining post-processors of each level; the remaining post-processors of each level, by executing the post-processing instruction, further aggregate the aggregation results of their two corresponding previous-level post-processors.
12. The system as claimed in claim 11, characterized in that the paths between lanes/processor cores and post-processors, and between post-processors, are configurable;
by configuring the paths to be turned on or off, the multiple post-processors are grouped to realize group-wise aggregation operations; or
by configuring the paths to be turned on or off, the post-processors realize aggregation operations of different aggregation degrees.
13. The system as claimed in claim 1, characterized in that a general judgment module produces, according to the currently output control signal, the next control signal for continuing execution from the current state, and receives the next control signal for not continuing execution from the current state; according to the feedback from the operation of the system controlled by the currently output control signal, the general judgment module selects one of these two next control signals as its output to control the continued operation of the system;
the general judgment module at least includes an arithmetic unit, a register, and a selector, wherein:
the register stores the current control signal and outputs the current control signal to control the operation of the system;
the arithmetic unit produces, according to the state of the current control signal, the next control signal for continuing execution from the current state, and sends that next control signal to the selector;
the selector selects, according to the feedback from the operation of the system under control of the current control signal, between the next control signal produced by the arithmetic unit and the received next control signal for not continuing execution from the current state, and updates the selection result into the register.
14. The system as claimed in claim 1, characterized by further comprising:
one or more cycle controllers, wherein each cycle controller corresponds to a loop body in the instruction sequence, counts the number of executions of the loop body, and determines whether the loop has finished;
one or more data engines, divided into groups, each group including at least one data engine;
each group of data engines corresponding to a cycle controller, for computing the data addresses used in the loop body and controlling the memory to complete data access operations;
wherein the loop count in a cycle controller is set by an instruction;
each time the corresponding loop instruction is reached, the loop count is decremented by one; and
after all loops corresponding to the loop instruction have completed, the loop count is reset to its originally set value.
15. The system as claimed in claim 14, characterized in that whenever the loop instruction is reached, the data engine updates the data address and obtains the corresponding data according to the new data address, ready for use by the lane/processor core;
if the execution result of the loop instruction indicates the loop continues, the data engine obtains the new data address by adding the address step length to the data address; and
if the execution result of the loop instruction indicates the loop ends, the data engine resets the data address to its originally set value as the new data address.
16. The system as claimed in claim 14, characterized in that said data engine further comprises a first-in-first-out (FIFO) buffer;
once the data engine has been configured, it fetches the corresponding data according to the configured data address and stores it in said FIFO buffer for use by the lane/processor core;
after each data fetch completes, the data address is updated and the corresponding data is fetched according to the new data address and stored in said FIFO buffer; whenever said loop instruction is executed, said FIFO buffer discards its oldest entry, so that the second-oldest entry becomes the new oldest entry; and
if the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its original set value as said new data address, and empties said FIFO buffer.
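A minimal sketch of the FIFO-buffered data engine of claim 16, assuming a dictionary stands in for the memory the engine reads on the core's behalf; all names are hypothetical:

```python
from collections import deque

# Sketch of the FIFO-buffered data engine of claim 16 (names illustrative).
class FifoDataEngine:
    def __init__(self, memory, base_addr, step, depth=4):
        self.memory, self.base_addr, self.step = memory, base_addr, step
        self.addr = base_addr
        self.fifo = deque()
        for _ in range(depth):        # prefetch as soon as configured
            self._fetch()

    def _fetch(self):
        self.fifo.append(self.memory[self.addr])
        self.addr += self.step        # update the address after each fetch

    def on_loop_instruction(self, loop_continues):
        if not loop_continues:        # loop ended: reset address, empty FIFO
            self.addr = self.base_addr
            self.fifo.clear()
            return None
        value = self.fifo.popleft()   # discard oldest; next-oldest becomes head
        self._fetch()                 # keep the FIFO full
        return value
```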
17. The system as claimed in claim 14, characterized in that said data engine further comprises a fusion module; after said fusion module receives the data and the corresponding address that the lane/processor core writes to memory, it first reads the previously stored data from memory according to this address, performs a computation on it together with the data sent by the lane/processor core, and then writes the result back to memory at this address.
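The fusion module of claim 17 is a read-modify-write at the target address. An informal sketch, assuming addition as the combining operation and a dictionary as the backing memory (both assumptions, not specified here):

```python
import operator

# Sketch of the fusion module of claim 17 (names illustrative): on a core's
# write, read the previously stored value, combine it with the incoming data,
# and write the result back to the same address.
class FusionModule:
    def __init__(self, memory, op=operator.add):
        self.memory = memory   # backing store, e.g. a dict of addr -> value
        self.op = op           # combining operation (addition assumed here)

    def write(self, addr, data):
        old = self.memory.get(addr, 0)          # read previously stored data
        self.memory[addr] = self.op(old, data)  # compute, then write back
```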
18. The system as claimed in claim 7, characterized in that:
each said lane/processor core executes the same program, and a post-processor compares the execution results of the lanes/processor cores to judge whether any of said lanes/processor cores is a faulty lane/processor core behaving abnormally, thereby realizing self-test of said system;
when a faulty lane/processor core exists, the lane/processor core number of the faulty lane/processor core is determined.
19. The system as claimed in claim 18, characterized in that:
the lane/processor core number of the faulty lane/processor core is stored in the allocator;
the allocator skips said faulty lane/processor core when allocating lanes/processor cores, thereby realizing self-repair of said system.
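The self-test/self-repair scheme of claims 18–19 can be sketched informally: every lane runs the same program, a lane whose result deviates from the majority is recorded as faulty, and the allocator skips it. Majority voting is one plausible comparison strategy, assumed here rather than stated in the claims; all names are hypothetical:

```python
from collections import Counter

# Sketch of the self-test / self-repair of claims 18-19 (names illustrative).
def self_test(results):
    """results: lane number -> execution result of the common program.
    Returns the set of lane numbers whose result deviates from the majority."""
    majority, _ = Counter(results.values()).most_common(1)[0]
    return {lane for lane, r in results.items() if r != majority}

def allocate(num_lanes, faulty):
    """The allocator hands out only lanes not marked faulty (self-repair)."""
    return [lane for lane in range(num_lanes) if lane not in faulty]
```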
20. A multi-lane/multi-core execution method, characterized in that each lane/processor core has a different lane/processor core number, and each lane/processor core can execute the same or different instructions; post-processing operations are then performed on the execution results of said plurality of lanes/processor cores, and memory is accessed.
21. The method as claimed in claim 20, characterized in that data in registers is transferred over a global bus between said plurality of lanes/processor cores, so as to perform cross-lane/processor-core register-value move or computation operations.
22. The method as claimed in claim 21, characterized in that:
the lanes/processor cores in the multi-lane/multi-core system are divided into a plurality of lane/processor core groups; the bus switches on the global bus within each lane/processor core group are turned on, while the bus switches on the global bus between lane/processor core groups are turned off, so that each lane/processor core group can simultaneously perform cross-lane/processor-core register-value move or computation operations within itself;
different lanes/processor cores simultaneously execute the same aggregation-degree setting instruction, which configures the corresponding said bus switches, realizing the aggregation degree between the corresponding lanes/processor cores.
23. The method as claimed in claim 22, characterized in that: when different lanes/processor cores simultaneously execute the same inter-lane/processor-core operation instruction, the source lane/processor core and the target lane/processor core are determined according to their respective said lane/processor core numbers; the register value of the source lane/processor core is sent to the target lane/processor core over said inter-lane/processor-core bus, and the post-processing operation is performed by said target lane/processor core.
24. The method as claimed in claim 20, characterized in that an instruction moves the lane/processor core number of each lane/processor core into a general-purpose register of that lane/processor core.
25. The method as claimed in claim 24, characterized in that different lanes/processor cores compute different data addresses from the same instruction according to their different lane/processor core numbers.
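Claims 24–25 describe SIMD-style addressing: the same instruction yields a different address on each lane because each lane's register holds its own lane number. An informal sketch, with a linear base-plus-stride address form assumed for illustration:

```python
# Sketch of claims 24-25 (names illustrative): one address-computation
# instruction, distinct addresses per lane via the lane number.
def data_address(base, lane_number, stride):
    return base + lane_number * stride

# Four lanes executing the identical instruction each address their own slice:
addresses = [data_address(0x1000, lane, 8) for lane in range(4)]
```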
26. The method as claimed in claim 20, characterized in that one or a plurality of post-processors is connected to a plurality of lanes/processor cores, receives the execution results of said plurality of lanes/processor cores, and aggregates said execution results.
27. The method as claimed in claim 26, characterized in that:
the aggregation result is stored directly into memory by said post-processor; or
the aggregation result is sent back by said post-processor into a register of at least one lane/processor core.
28. The method as claimed in claim 26, characterized in that the aggregation degree of the post-processing operation is determined by an instruction.
29. The method as claimed in claim 28, characterized in that said post-processor performs the aggregation operation by executing a post-processing instruction.
30. The method as claimed in claim 29, characterized in that:
said post-processors are connected by a pass-through bus; each post-processor, by executing said post-processing instruction, performs the aggregation operation on the execution result of its corresponding lane/processor core together with the output of the adjacent post-processor; or
said post-processors are connected by a tree-shaped bus; wherein a first-level post-processor, by executing said post-processing instruction, performs the aggregation operation on the execution results of its two corresponding lanes/processor cores, and this aggregation result, together with the post-processing instruction or its decoded form, is passed level by level to the remaining post-processors at each level; each of the remaining post-processors, by executing said post-processing instruction, further aggregates the aggregation results of its two corresponding previous-level post-processors.
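The tree-shaped variant of claim 30 is a pairwise tree reduction: level one combines pairs of lane results, and each further level combines pairs of previous-level outputs until one value remains. A minimal sketch, assuming a power-of-two number of lane results (names hypothetical):

```python
# Sketch of the tree-shaped post-processor aggregation of claim 30
# (names illustrative; assumes len(lane_results) is a power of two).
def tree_aggregate(lane_results, op):
    level = list(lane_results)
    while len(level) > 1:
        # each post-processor at this level combines two inputs from below
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

The same `op` stands in for the post-processing instruction, so the one sketch covers sums, maxima, or any other associative aggregation.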
31. The method as claimed in claim 30, characterized in that the paths between lanes/processor cores and post-processors, and between post-processors, are configurable;
by configuring said paths on or off, a plurality of post-processors can be grouped to realize group-wise aggregation operations; or
by configuring said paths on or off, the post-processors can realize aggregation operations of different aggregation degrees.
32. The method as claimed in claim 20, characterized in that it further comprises a control method; said control method produces, from the current state according to the control signal already output, the next control signal for when execution continues from the current state, and receives the next control signal for when execution does not continue from the current state; then, according to the execution feedback produced by the system running under the already-output control signal, one of these two next control signals is selected as output so as to control the system to continue running;
said control method at least includes:
storing the current control signal, and using the current control signal to control system operation;
producing, from the current state according to said current control signal, the next control signal for when execution continues;
under the control of said current control signal, selecting, according to said system's execution feedback, between the next control signal for when execution continues from the current state and the received next control signal for when execution does not continue from the current state, and updating the selection result as the current control signal.
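Claim 32 describes a small state machine: the stored current control signal drives the system, a sequential successor is produced locally, a non-sequential successor (e.g. a branch target) is received, and run-time feedback selects which becomes the new current signal. A minimal sketch, treating control signals as integer addresses for illustration (an assumption, not from the claims):

```python
# Sketch of the control method of claim 32 (names illustrative).
class Controller:
    def __init__(self, start):
        self.current = start                # stored current control signal

    def step(self, feedback_continue, received_next):
        produced_next = self.current + 1    # next signal if execution continues
        # select according to the system's execution feedback,
        # and update the selection result as the current control signal
        self.current = produced_next if feedback_continue else received_next
        return self.current
```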
33. The method as claimed in claim 20, characterized in that it further comprises:
counting the number of executions of a loop body with one or a plurality of loop controllers to determine whether the loop has finished; each loop controller corresponds to one loop body in the instruction sequence;
computing, with one or a plurality of data engines corresponding to said loop controllers, the addresses of the data used in the corresponding loop body, and controlling the memory to complete data access operations;
the loop count in a loop controller is set by an instruction;
each time the corresponding loop instruction is executed, said loop count is decremented by one; and
after all iterations corresponding to said loop instruction have completed, said loop count is reset to its original set value.
34. The method as claimed in claim 33, characterized in that whenever said loop instruction is executed, the data engine updates the data address and fetches the corresponding data according to the new data address, ready for use by the lane/processor core;
if the execution result of the loop instruction indicates that the loop continues, the data engine obtains said new data address by adding the address step to the data address; and
if the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its original set value as said new data address.
35. The method as claimed in claim 33, characterized in that a FIFO is used to buffer data;
once the data engine has been configured, it fetches the corresponding data according to the configured data address and stores it in said FIFO buffer for use by the lane/processor core;
after each data fetch completes, the data address is updated and the corresponding data is fetched according to the new data address and stored in said FIFO buffer; whenever said loop instruction is executed, said FIFO buffer discards its oldest entry, so that the second-oldest entry becomes the new oldest entry; and
if the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its original set value as said new data address, and empties said FIFO buffer.
36. The method as claimed in claim 33, characterized in that a fusion module merges the data in memory with the data sent by the lane/processor core; after said fusion module receives the data and the corresponding address that the lane/processor core writes to memory, it first reads the previously stored data from memory according to this address, performs a computation on it together with the data sent by the lane/processor core, and then writes the result back to memory at this address.
37. The method as claimed in claim 20, characterized in that:
the same program is executed by each lane/processor core, and a post-processor compares the execution results of the lanes/processor cores to judge whether any of said lanes/processor cores is a faulty lane/processor core behaving abnormally, thereby realizing self-test of said system;
when a faulty lane/processor core exists, the lane/processor core number of the faulty lane/processor core is determined.
38. The method as claimed in claim 37, characterized in that:
the lane/processor core number of the faulty lane/processor core is stored, and said faulty lane/processor core is skipped when allocating lanes/processor cores, thereby realizing self-repair of said system.
39. A method of executing a program on a multi-lane/multi-core system using normalized lane/processor core numbers, characterized in that each lane/processor core corresponds to one normalized lane/processor core number.
40. The method as claimed in claim 39, characterized in that: when a plurality of lanes/processor cores execute a loop program, each loop iteration triggers an update of the data address, and a memory read or write is performed according to said new data address, thereby avoiding explicit data access instructions in the loop program.
41. The method as claimed in claim 40, characterized in that:
said plurality of lanes/processor cores execute the same data engine setting instruction in parallel; said data engine computes and produces one or a plurality of data addresses according to said configuration information, and performs memory reads or writes according to said data addresses.
42. The method as claimed in claim 41, characterized in that:
said data engine computes the initial data address corresponding to each lane/processor core according to at least the normalized lane/processor core number corresponding to each lane/processor core and the address gap.
43. The method as claimed in claim 41, characterized in that:
said data engine computes the data address corresponding to each loop iteration according to at least the address step corresponding to each lane/processor core.
44. The method as claimed in claim 40, characterized in that, when there is multi-level loop nesting, said plurality of lanes/processor cores execute the same loop setting instruction in parallel, and the loop controller is configured by the instruction; the configuration information contained in said loop setting instruction at least includes the loop count of the loop body;
said plurality of lanes/processor cores also execute the same loop instruction in parallel; when said loop instruction is executed, said loop controller performs the corresponding counting; by said counting:
if the number of completed iterations is less than the loop count of the loop body, the loop controller controls reading of the initial instruction of the loop body for execution, thereby repeating said loop body;
if the number of completed iterations equals the loop count of the loop body, the loop controller controls reading of the instruction at the sequential address after the loop body for execution, thereby ending execution of said loop body.
45. The method as claimed in claim 44, characterized in that each loop controller cooperates with one or a plurality of data engines; each loop iteration triggers said data engine to compute a new data address, and a memory read or write is performed according to said new data address, thereby avoiding explicit data access instructions in the loop program.
46. The method as claimed in claim 45, characterized in that execution of said loop instruction serves as the trigger condition.
47. The method as claimed in claim 39, characterized in that a program that needs to be executed a plurality of times is two-dimensionally unrolled and then executed in parallel by a plurality of lanes/processor cores; said number of executions is the degree of parallelism; said two-dimensional unrolling includes space unrolling and time unrolling; wherein:
in space unrolling, a plurality of lanes/processor cores simultaneously execute the same instructions on different data, so that said program is unrolled in the spatial dimension and each lane/processor core executes said program in parallel;
in time unrolling, when the number of available lanes/processor cores is less than said degree of parallelism, said plurality of lanes/processor cores execute said program a plurality of times, so that said program is unrolled in the time dimension and said plurality of lanes/processor cores execute said program serially in succession.
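The two-dimensional unrolling of claim 47 can be sketched informally: a program that must run `parallelism` times is spread across `num_lanes` lanes in space, and across as many rounds as needed in time; all names are hypothetical:

```python
import math

# Sketch of the two-dimensional unrolling of claim 47 (names illustrative):
# the space dimension is the set of lanes active in one round, and the time
# dimension is the number of serial rounds.
def unroll(parallelism, num_lanes):
    rounds = math.ceil(parallelism / num_lanes)   # time dimension
    schedule = []
    for r in range(rounds):
        # program instances handled in this round (space dimension)
        first = r * num_lanes
        schedule.append(list(range(first, min(first + num_lanes, parallelism))))
    return schedule
```

With parallelism 10 on 4 lanes, the program unrolls into three serial rounds of 4, 4, and 2 parallel instances.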
48. The method as claimed in claim 47, characterized in that:
the lane allocator progressively deducts the number of available resources from the space-resource count requested by the program, thereby mapping the demand for space resources into use of time resources.
49. The method as claimed in claim 48, characterized in that:
the lane allocator outputs the normalized lane number corresponding to the base lane/processor core, which determines the starting data address of the data engine in the base lane/processor core and defines the space starting point of the space unrolling;
the lane allocator defines the space scale of the current space unrolling by the number of available lanes/processor cores, and computes a time scale to control time/space conversion during lane/processor core execution.
50. The method as claimed in claim 49, characterized in that: by adjusting the normalized lane number of the base lane/processor core, it is determined which part of the space unrolling the current round of time unrolling executes.
51. The method as claimed in claim 47, characterized in that:
the degree-of-parallelism demand is explicitly given by the program or instruction sequence to be run;
when said program or instruction sequence is run, said multi-lane/multi-core system automatically allocates lanes/processor cores according to said degree-of-parallelism demand;
when the number of available lanes/processor cores cannot meet said degree-of-parallelism demand, said multi-lane/multi-core system executes said program in a loop several times to meet said degree-of-parallelism demand.
52. The method as claimed in claim 51, characterized in that:
when compiling said program, the compiler determines the maximum degree of parallelism for parallel execution of the loop body, and produces a degree-of-parallelism setting instruction containing said maximum degree of parallelism;
when said multi-lane/multi-core system executes said loop program, the lane allocator executes the degree-of-parallelism setting instruction, allocates lanes/processor cores according to the number of available lanes/processor cores, determines the number of lanes/processor cores participating in parallel execution, and determines the number of times these lanes/processor cores execute the loop of said program.
53. The method as claimed in claim 52, characterized in that the remaining degree of parallelism is obtained by subtracting from the program's degree of parallelism the number of instances already unrolled and executed in parallel; when the remaining degree of parallelism is less than the number of available lanes/processor cores, the lane allocator allocates a corresponding number of lanes/processor cores to execute said program in parallel; and after this round of execution completes, said program has been fully executed.
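The allocation arithmetic of claims 52–53 can be sketched informally: each round deducts the number of instances executed in parallel from the remaining degree of parallelism, and the final round uses the remainder; all names are hypothetical:

```python
# Sketch of the remaining-parallelism allocation of claims 52-53
# (names illustrative): each round the allocator hands out at most the
# available number of lanes; the last round takes the remainder.
def allocation_rounds(parallelism, available_lanes):
    rounds = []
    remaining = parallelism
    while remaining > 0:
        lanes = min(remaining, available_lanes)
        rounds.append(lanes)
        remaining -= lanes   # remaining degree of parallelism
    return rounds
```

For a degree of parallelism of 10 on 4 available lanes, this yields rounds of 4, 4, and 2, after which the program has been fully executed.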
CN201410781446.2A 2014-12-12 2014-12-12 Multi-lane/multi-core system and method Pending CN105893319A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410781446.2A CN105893319A (en) 2014-12-12 2014-12-12 Multi-lane/multi-core system and method
PCT/CN2015/096769 WO2016091164A1 (en) 2014-12-12 2015-12-09 Multilane/multicore system and method


Publications (1)

Publication Number Publication Date
CN105893319A true CN105893319A (en) 2016-08-24

Family

ID=56106715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410781446.2A Pending CN105893319A (en) 2014-12-12 2014-12-12 Multi-lane/multi-core system and method

Country Status (2)

Country Link
CN (1) CN105893319A (en)
WO (1) WO2016091164A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11397624B2 (en) * 2019-01-22 2022-07-26 Arm Limited Execution of cross-lane operations in data processing systems
TWI825315B (en) * 2020-05-08 2023-12-11 安圖斯科技股份有限公司 Assigning method and assigning system for graphic resource
CN113722085B (en) * 2020-05-26 2024-04-30 安图斯科技股份有限公司 Distribution method and distribution system of graphic resources
CN112307431B (en) * 2020-11-09 2023-10-27 哲库科技(上海)有限公司 VDSP, data processing method and communication equipment
CN114816734B (en) * 2022-03-28 2024-05-10 西安电子科技大学 Cache bypass system based on memory access characteristics and data storage method thereof

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5642442A (en) * 1995-04-10 1997-06-24 United Parcel Services Of America, Inc. Method for locating the position and orientation of a fiduciary mark
CN101299199A (en) * 2008-06-26 2008-11-05 上海交通大学 Heterogeneous multi-core system based on configurable processor and instruction set extension
US7535844B1 (en) * 2004-01-28 2009-05-19 Xilinx, Inc. Method and apparatus for digital signal communication
CN101477512A (en) * 2009-01-16 2009-07-08 中国科学院计算技术研究所 Processor system and its access method
CN101561766A (en) * 2009-05-26 2009-10-21 北京理工大学 Low-expense block synchronous method supporting multi-core assisting thread
CN101719105A (en) * 2009-12-31 2010-06-02 中国科学院计算技术研究所 Optimization method and optimization system for memory access in multi-core system
CN102362256A (en) * 2010-04-13 2012-02-22 华为技术有限公司 Method and device for processing common data structure
CN102576314A (en) * 2009-07-27 2012-07-11 先进微装置公司 Mapping processing logic having data parallel threads across processors
TW201301032A (en) * 2011-06-24 2013-01-01 Kenneth Cheng-Hao Lin High-performance cache system and method
CN102880594A (en) * 2012-10-17 2013-01-16 电子科技大学 Parallel matrix full-selected primary element Gauss-Jordan inversion algorithm based on multi-core DSP (Digital Signal Processor)
CN103365821A (en) * 2013-06-06 2013-10-23 北京时代民芯科技有限公司 Address generator of heterogeneous multi-core processor
CN103383654A (en) * 2012-05-03 2013-11-06 百度在线网络技术(北京)有限公司 Method and device for adjusting mappers to execute on multi-core machine
CN103731386A (en) * 2014-01-02 2014-04-16 北京邮电大学 High-speed modulation method based on GPP and SIMD technologies
US20140122841A1 (en) * 2012-10-31 2014-05-01 International Business Machines Corporation Efficient usage of a register file mapper and first-level data register file
US8749561B1 (en) * 2003-03-14 2014-06-10 Nvidia Corporation Method and system for coordinated data execution using a primary graphics processor and a secondary graphics processor
CN104050092A (en) * 2013-03-15 2014-09-17 上海芯豪微电子有限公司 Data caching system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU, HUAHAI: "Research on Key Technologies of Multi-CPU Multi-GPU Collaborative Parallel Rendering within a Node", China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107179895B (en) * 2017-05-17 2020-08-28 北京中科睿芯科技有限公司 Method for accelerating instruction execution speed in data stream structure by applying composite instruction
CN107179895A (en) * 2017-05-17 2017-09-19 北京中科睿芯科技有限公司 A kind of method that application compound instruction accelerates instruction execution speed in data flow architecture
CN109189476B (en) * 2018-09-19 2021-10-29 郑州云海信息技术有限公司 Control flow execution method, device, equipment and medium of FPGA
CN109189476A (en) * 2018-09-19 2019-01-11 郑州云海信息技术有限公司 The control stream of FPGA executes method, apparatus, equipment and medium
CN109669682A (en) * 2018-12-18 2019-04-23 上海交通大学 Mapping method based on general reconfigurable processor DBSS and MBSS
TWI805731B (en) * 2019-04-09 2023-06-21 韓商愛思開海力士有限公司 Multi-lane data processing circuit and system
CN111860804B (en) * 2019-04-27 2022-12-27 中科寒武纪科技股份有限公司 Fractal calculation device and method, integrated circuit and board card
CN111860804A (en) * 2019-04-27 2020-10-30 中科寒武纪科技股份有限公司 Fractal calculation device and method, integrated circuit and board card
US11841822B2 (en) 2019-04-27 2023-12-12 Cambricon Technologies Corporation Limited Fractal calculating device and method, integrated circuit and board card
US12026606B2 (en) 2019-04-27 2024-07-02 Cambricon Technologies Corporation Limited Fractal calculating device and method, integrated circuit and board card
US12093811B2 (en) 2019-04-27 2024-09-17 Cambricon Technologies Corporation Limited Fractal calculating device and method, integrated circuit and board card
CN111158757B (en) * 2019-12-31 2021-11-30 中昊芯英(杭州)科技有限公司 Parallel access device and method and chip
CN111158757A (en) * 2019-12-31 2020-05-15 深圳芯英科技有限公司 Parallel access device and method and chip
CN114328592A (en) * 2022-03-16 2022-04-12 北京奥星贝斯科技有限公司 Aggregation calculation method and device
CN114328592B (en) * 2022-03-16 2022-05-06 北京奥星贝斯科技有限公司 Aggregation calculation method and device
CN115269455A (en) * 2022-09-30 2022-11-01 湖南兴天电子科技股份有限公司 Disk data read-write control method and device based on FPGA and storage terminal
CN115269455B (en) * 2022-09-30 2022-12-23 湖南兴天电子科技股份有限公司 Disk data read-write control method and device based on FPGA and storage terminal

Also Published As

Publication number Publication date
WO2016091164A1 (en) 2016-06-16

Similar Documents

Publication Publication Date Title
CN105893319A (en) Multi-lane/multi-core system and method
CN103635875B (en) For by using by can subregion engine instance the memory segment that is performed come support code block of virtual core
CN103547993B (en) By using the virtual core by divisible engine instance come execute instruction sequence code block
CN109597646A (en) Processor, method and system with configurable space accelerator
Teflioudi et al. Distributed matrix completion
CN102902512B (en) A kind of multi-threading parallel process method based on multi-thread programming and message queue
CN103218208B (en) For implementing the system and method for the memory access operation being shaped
CN108804220A (en) A method of the satellite task planning algorithm research based on parallel computation
CN110476174A (en) Including neural network processor internuncial between device
CN104424158A (en) General unit-based high-performance processor system and method
CN105190541A (en) A method for executing blocks of instructions using a microprocessor architecture having a register view, source view, instruction view, and a plurality of register templates
Uchida et al. An efficient GPU implementation of ant colony optimization for the traveling salesman problem
CN108268278A (en) Processor, method and system with configurable space accelerator
CN108509270A (en) The high performance parallel implementation method of K-means algorithms on a kind of domestic 26010 many-core processor of Shen prestige
CN104035751A (en) Graphics processing unit based parallel data processing method and device
CN103562866A (en) Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
CN106227507A (en) Calculating system and controller thereof
CN105468439B (en) The self-adaptive parallel method of neighbours in radii fixus is traversed under CPU-GPU isomery frame
KR20130090147A (en) Neural network computing apparatus and system, and method thereof
CN101855614A (en) Have the hierarchy type microcode store more than core processor
CN102508820B (en) Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)
CN101717817A (en) Method for accelerating RNA secondary structure prediction based on stochastic context-free grammar
CN108205704A (en) A kind of neural network chip
Pan et al. GPU-based parallel collision detection for real-time motion planning
CN105373367A (en) Vector single instruction multiple data-stream (SIMD) operation structure supporting synergistic working of scalar and vector

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice

Addressee: SHANGHAI XINHAO MICROELECTRONICS Co.,Ltd.

Document name: the First Notification of an Office Action

CB02 Change of applicant information

Address after: 201203 501, No. 14, Lane 328, Yuqing Road, Pudong New Area, Shanghai

Applicant after: SHANGHAI XINHAO MICROELECTRONICS Co.,Ltd.

Address before: 200092, B, block 1398, Siping Road, Shanghai, Yangpu District 1202

Applicant before: SHANGHAI XINHAO MICROELECTRONICS Co.,Ltd.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160824