CN105893319A - Multi-lane/multi-core system and method - Google Patents
Multi-lane/multi-core system and method
- Publication number
- CN105893319A CN105893319A CN201410781446.2A CN201410781446A CN105893319A CN 105893319 A CN105893319 A CN 105893319A CN 201410781446 A CN201410781446 A CN 201410781446A CN 105893319 A CN105893319 A CN 105893319A
- Authority
- CN
- China
- Prior art keywords
- lane
- processor core
- instruction
- data
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/167—Interprocessor communication using a common memory, e.g. mailbox
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
Abstract
The present invention provides a multi-lane/multi-core system and method. The system comprises a plurality of lanes/processor cores, each of which can execute the same instruction or different instructions and access a memory. The system may further comprise a loop controller and a data engine that perform data read/write operations on the memory while a loop program executes, so that explicit data access instructions need not appear in the loop program. The system can also perform post-processing operations on the execution results of the plurality of lanes/processor cores and access the memory; the post-processing may be carried out either by the lanes/processor cores themselves or by a dedicated post-processor. By performing map/reduce operations on the multi-lane/multi-core system with the post-processor provided by the present invention, the large number of memory accesses required by a conventional multi-lane/multi-core system can be avoided, significantly improving performance.
Description
Technical field
The present invention relates to the fields of computing, communications, and integrated circuits.
Background technology
To improve the efficiency of a multi-processor-core system, certain computations (such as statistics gathering or matrix operations) can be divided into two stages. The first-stage operations are mapped (map) onto a plurality of processor cores for parallel execution, shortening the execution time through a high degree of parallelism; the second-stage operations then aggregate (reduce) the execution results of the first stage to obtain the final result. This approach is known as map/reduce.
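For illustration only, the two-stage split described above can be sketched in Python, with a thread pool standing in for the parallel processor cores (the function name and example computation are assumptions, not part of the invention):

```python
from multiprocessing.dummy import Pool  # thread pool stands in for the parallel cores

def sum_of_squares(values, num_lanes=4):
    """Stage 1 (map): each 'lane' squares one element in parallel.
    Stage 2 (reduce): the partial results are aggregated into one value."""
    with Pool(num_lanes) as pool:
        partials = pool.map(lambda x: x * x, values)  # map stage
    return sum(partials)                              # reduce stage

total = sum_of_squares([1, 2, 3, 4])  # 1 + 4 + 9 + 16 = 30
```

The map stage scales with the number of lanes; the reduce stage is where, as discussed below, conventional systems lose efficiency to memory traffic.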
One common way to handle map/reduce problems is to connect memory and a large number of computers/processors through a network. After finishing the map operations, the computers/processors store the intermediate results in the memory; when the subsequent reduce operations are performed, the corresponding intermediate results are read back from the memory to complete the remaining work. In such an implementation, however, time-consuming memory accesses occur frequently, and transferring data over the network introduces long delays, so the overall efficiency is poor. If an ordinary single-instruction single-data (SISD) processor or a multi-core processor is used instead, the limited number of arithmetic logic units (ALUs) or processor cores makes truly large-scale parallel execution of the map stage impossible.
Another approach is to use a lane processor to implement map/reduce. A single-instruction multiple-data (SIMD) graphics processing unit (GPU) is a common lane processor. A GPU has a large number of lanes, grouped several lanes to a group. All lanes in a lane group can execute the same instruction on identical or different data, improving parallel execution efficiency and meeting the requirements of the map stage well.
Referring to Fig. 1, which shows an embodiment of a GPU implemented according to the prior art, lanes 11, 12, 13, and 14 may form a lane group that executes the same instruction simultaneously and shares the same memory (not shown in Fig. 1). There is no direct data path between lanes for transferring data; instead, one lane writes data to the memory and another lane reads that data back from the memory, thereby accomplishing a data transfer between the two lanes.
Such a GPU uses a dedicated SIMD instruction set (for example, the PTX instruction set) different from that of a general-purpose processor (CPU), and a dedicated development environment (for example, the CUDA environment) different from the CPU's programming environment. This increases the difficulty of programming, and the conversion between the two instruction sets is awkward, which also hurts overall system efficiency.
In addition, because the different lanes in a GPU lane group must execute the same instruction, execution efficiency is necessarily lower than if each lane could execute a different instruction. Although the GPU achieves high execution efficiency in the map stage, on entering the reduce stage it still incurs delays and stalls from frequent memory accesses, and having multiple lanes access the memory simultaneously places high demands on bandwidth.

The present invention discloses a brand-new lane processor system architecture that fundamentally solves all of the above problems.
Summary of the invention
The present invention proposes a multi-lane/multi-core system comprising a plurality of lanes/processor cores. Each lane/processor core has a distinct lane/processor core number, and each can execute identical or different instructions and access a memory. The system also performs post-processing operations on the execution results of the plurality of lanes/processor cores, and accesses the memory.
Optionally, in the system, a global bus runs between the plurality of lanes/processor cores for transferring register data, so that cross-lane/cross-core register value moves or computation operations can be performed.

Optionally, in the system, the lanes/processor cores are divided into a plurality of lane/processor core groups. The bus switches on the global bus within each group are turned on, while the bus switches on the global bus between groups are turned off, so that each group can simultaneously perform cross-lane/cross-core register value moves or computation operations among its own members. When different lanes/processor cores simultaneously execute the same aggregation-degree-setting instruction, the corresponding bus switches are configured accordingly, establishing the desired degree of aggregation among the corresponding lanes/processor cores.
Optionally, in the system, when different lanes/processor cores simultaneously execute the same inter-lane/inter-core operation instruction, the source and target lanes/processor cores are determined from the respective lane/processor core numbers; the register value of the source lane/processor core is delivered to the target lane/processor core over the inter-lane/inter-core path, and the target lane/processor core performs the post-processing operation.
Optionally, in the system, an instruction moves the lane/processor core number of each lane/processor core into that lane's/core's general-purpose register.

Optionally, in the system, different lanes/processor cores compute different data addresses from the same instruction by virtue of their different lane/processor core numbers.
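As a minimal illustration of the preceding paragraph, the same address-computation "instruction" yields a different address in each lane because it folds in the lane number; the specific formula and element size here are assumptions:

```python
def lane_data_address(base_addr, lane_number, element_size=4):
    """Every lane executes the same address computation, but the lane number
    held in its general-purpose register makes the result differ per lane
    (the exact formula here is an assumption for illustration)."""
    return base_addr + lane_number * element_size

# Four lanes executing the same computation produce four distinct addresses.
addresses = [lane_data_address(0x1000, lane) for lane in range(4)]
```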
Optionally, the system further comprises one or more post-processors. Each post-processor is connected with a plurality of lanes/processor cores, receives their execution results, and aggregates those results.

Optionally, in the system, the post-processor stores the aggregation result directly into the memory.

Optionally, in the system, the post-processor sends the aggregation result back into the register of at least one lane/processor core.

Optionally, in the system, the degree of aggregation of the post-processing operation is determined by an instruction.

Optionally, in the system, the post-processor performs the aggregation operation by executing a post-processing instruction.

Optionally, in the system, the post-processors are connected by a transmission bus; by executing the post-processing instruction, each post-processor aggregates the execution results of its corresponding lanes/processor cores with the output of the adjacent post-processor.
Optionally, in the system, the post-processors are connected by a tree-shaped bus. A first-level post-processor executes the post-processing instruction to aggregate the execution results of two corresponding lanes/processor cores, and passes the aggregation result together with the post-processing instruction (or its decoded form) down level by level to the remaining post-processors; each remaining post-processor, by executing the post-processing instruction, further aggregates the aggregation results of the two corresponding previous-level post-processors.
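The tree-shaped aggregation described above can be sketched as a pairwise reduction, where each level of the tree combines pairs of results from the level below (the combining operation, here addition, is only an example):

```python
def tree_reduce(lane_results, op=lambda a, b: a + b):
    """First-level 'post-processors' combine pairs of lane results; each
    further level combines pairs of the previous level's outputs, until a
    single aggregated value remains."""
    level = list(lane_results)
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd leftover forwarded to the next level
        level = nxt
    return level[0]

total = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])  # three levels of pairwise sums
```

With 2^k lanes this takes k levels of post-processors rather than a serial pass, which is the point of the tree-shaped bus.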
Optionally, in the system, the paths between lanes/processor cores and post-processors, and between post-processors, are configurable.

Optionally, in the system, by turning these paths on or off, a plurality of post-processors can perform aggregation operations group by group on grouped lanes.

Optionally, in the system, by turning these paths on or off, the post-processors can realize aggregation operations of different degrees of aggregation.
Optionally, in the system, a general-purpose decision module produces, from the control signal it outputs, the next control signal for the case where execution continues in the current state, and receives the next control signal for the case where it does not; according to the run-time feedback produced while the system runs under the output control signal, the decision module selects one of these two next control signals to output, so that the system keeps running. The decision module comprises at least an arithmetic unit, a register, and a selector. The register stores the current control signal and outputs it to control system operation. The arithmetic unit produces, from the state of the current control signal, the next control signal for the continue-in-current-state case, and sends it to the selector. The selector, according to the feedback from the system running under the current control signal, chooses between the next control signal produced by the arithmetic unit and the received next control signal for the not-continue case, and updates the register with the chosen result.
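A software sketch of the decision module described above (not part of the claims; the "+step" sequential rule and all names are assumptions, modeled loosely on a program counter choosing between fall-through and a branch target):

```python
class DecisionModule:
    """Register holds the current control signal; an 'arithmetic unit'
    derives the sequential next signal; a selector picks between that and an
    externally received signal based on run-time feedback."""
    def __init__(self, initial, step=1):
        self.current = initial  # the register
        self.step = step

    def advance(self, feedback_continue, received_next=None):
        sequential = self.current + self.step                        # arithmetic unit
        self.current = sequential if feedback_continue else received_next  # selector
        return self.current                                          # register update

dm = DecisionModule(0)
a = dm.advance(True)        # feedback says continue: sequential 1
b = dm.advance(False, 10)   # feedback says redirect: received signal 10
c = dm.advance(True)        # sequential again: 11
```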
Optionally, the system comprises one or more loop controllers, each corresponding to one loop body in the instruction sequence, for counting the executions of that loop body and determining whether the loop has finished; and one or more data engines divided into groups, each group comprising at least one data engine. Each group of data engines corresponds to one loop controller and is used to compute the addresses of the data used in the loop body and to control the memory to complete the data access operations. The loop count in a loop controller is set by an instruction; each time the corresponding loop instruction is reached, the loop count is decremented by one; after all iterations governed by the loop instruction have completed, the loop count is restored to its originally configured value.
Optionally, in the system, each time the loop instruction is reached, the data engine updates the data address and fetches the corresponding data at the new address, ready for use by the lanes/processor cores. If the execution result of the loop instruction indicates that the loop continues, the data engine obtains the new data address by adding an address step to the current data address; if it indicates that the loop has ended, the data engine resets the data address to its originally configured value as the new data address.
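The loop controller and data engine described above can be sketched together as follows; a hypothetical sketch under the stated counting and stepping rules, with all class and method names assumed:

```python
class LoopController:
    """Counts executions of a loop body: the loop count is set by an
    instruction, decremented on each loop-back, and restored to the
    configured value when the loop finishes."""
    def __init__(self, count):
        self.initial = count
        self.remaining = count

    def on_loop_instruction(self):
        self.remaining -= 1
        if self.remaining == 0:
            self.remaining = self.initial  # restore configured loop count
            return False                   # loop has ended
        return True                        # loop continues


class DataEngine:
    """Holds a data address for the loop body: on loop-continue it advances
    by the address step; on loop-end it resets to the configured start."""
    def __init__(self, start, step):
        self.start, self.step = start, step
        self.addr = start

    def on_loop_instruction(self, loop_continues):
        self.addr = self.addr + self.step if loop_continues else self.start
        return self.addr


# A 3-iteration loop touching addresses 0x100, 0x104, 0x108,
# with no explicit load/store instruction in the loop body itself.
ctrl = LoopController(3)
engine = DataEngine(start=0x100, step=4)
addrs = [engine.addr]
while True:
    cont = ctrl.on_loop_instruction()
    engine.on_loop_instruction(cont)
    if not cont:
        break
    addrs.append(engine.addr)
```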
Optionally, in the system, the data engine further comprises a first-in first-out (FIFO) buffer. Once the data engine has been configured, it fetches the data at the configured data address and stores it in the FIFO buffer for use by the lanes/processor cores; after each fetch completes, the data address is updated and the data at the new address is fetched and stored into the FIFO buffer. Each time the loop instruction is reached, the FIFO buffer discards its oldest entry, so that the next-oldest entry becomes the new oldest entry. If the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its originally configured value as the new data address and empties the FIFO buffer.
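A sketch of the FIFO-buffered variant, assuming a small prefetch depth and modeling memory as a list (the refill-after-reset behavior and all names are assumptions):

```python
from collections import deque

class PrefetchingDataEngine:
    """Once configured, fetches ahead into a FIFO; each loop iteration
    consumes the oldest entry while the fetch address runs ahead by the
    step; on loop end the address resets to the start and the FIFO is
    emptied and refilled."""
    def __init__(self, memory, start, step, depth=2):
        self.memory, self.start, self.step, self.depth = memory, start, step, depth
        self.reset()

    def reset(self):
        self.addr = self.start
        self.fifo = deque()
        self._fill()

    def _fill(self):
        while len(self.fifo) < self.depth:
            self.fifo.append(self.memory[self.addr])  # prefetch ahead
            self.addr += self.step

    def on_loop_instruction(self, loop_continues):
        if not loop_continues:
            self.reset()              # restore start address, empty/refill FIFO
            return None
        value = self.fifo.popleft()   # oldest entry is consumed
        self._fill()                  # next-oldest becomes the new head
        return value


memory = [10, 20, 30, 40, 50, 60]
engine = PrefetchingDataEngine(memory, start=0, step=1)
values = [engine.on_loop_instruction(True) for _ in range(3)]
```

The prefetch depth hides the memory latency: by the time a lane asks for a value, it is already waiting at the head of the FIFO.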
Optionally, in the system, the data engine further comprises a fusion module. After receiving the data that a lane/processor core writes toward the memory, together with the corresponding address, the fusion module first reads the previously stored data at that address from the memory, performs a computation on it with the data sent by the lane/processor core, and then writes the result back into the memory at that address.
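The fusion module's read-modify-write behavior can be sketched as follows; addition is only an assumed example of the combining computation, and all names are hypothetical:

```python
class FusionModule:
    """On a lane write, first reads the value already stored at the target
    address, combines it with the incoming value, and writes the combined
    result back to the same address."""
    def __init__(self, memory, combine=lambda old, new: old + new):
        self.memory = memory
        self.combine = combine

    def fused_write(self, addr, value):
        old = self.memory[addr]                       # read previously stored data
        self.memory[addr] = self.combine(old, value)  # compute, then write back
        return self.memory[addr]


mem = [0, 0, 0, 0]
fusion = FusionModule(mem)
fusion.fused_write(2, 5)
fusion.fused_write(2, 7)   # two lane writes accumulate at the same address
```

This lets several lanes accumulate into one memory location without each lane issuing its own read-modify-write sequence.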
Optionally, in the system, every lane/processor core executes the same program, and a post-processor compares the execution results of the lanes/processor cores to judge whether any lane/processor core is malfunctioning, thereby realizing self-test of the system. When a faulty lane/processor core exists, its lane/processor core number is determined.

Optionally, in the system, the lane/processor core number of the faulty lane/processor core is stored in a lane allocator; when allocating lanes/processor cores, the lane allocator skips the faulty lane/processor core, thereby realizing self-repair of the system.
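The self-test comparison above can be sketched as follows; majority voting is an assumption on our part (the text only says the results are compared), and the allocator step is simplified to a list filter:

```python
from collections import Counter

def self_test(lane_outputs):
    """All lanes ran the same program on the same data, so their results
    should agree; any lane whose result differs from the majority is
    flagged as faulty, and its lane number is reported."""
    majority, _ = Counter(lane_outputs).most_common(1)[0]
    return [i for i, out in enumerate(lane_outputs) if out != majority]

faulty = self_test([42, 42, 41, 42])               # lane 2 disagrees
usable = [i for i in range(4) if i not in faulty]  # allocator skips lane 2
```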
The present invention also proposes a multi-lane/multi-core execution method. Each lane/processor core has a distinct lane/processor core number and can execute identical or different instructions; post-processing operations are then performed on the execution results of the plurality of lanes/processor cores, and the memory is accessed.
Optionally, in the method, register data are transferred over a global bus between the plurality of lanes/processor cores to perform cross-lane/cross-core register value moves or computation operations.

Optionally, in the method, the lanes/processor cores of the multi-lane/multi-core system are divided into a plurality of lane/processor core groups. The bus switches on the global bus within each group are turned on and those between groups are turned off, so that each group can simultaneously perform cross-lane/cross-core register value moves or computation operations among its own members. When different lanes/processor cores simultaneously execute the same aggregation-degree-setting instruction, the corresponding bus switches are configured accordingly, establishing the desired degree of aggregation among the corresponding lanes/processor cores.
Optionally, in the method, when different lanes/processor cores simultaneously execute the same inter-lane/inter-core operation instruction, the source and target lanes/processor cores are determined from the respective lane/processor core numbers; the register value of the source lane/processor core is delivered to the target lane/processor core over the inter-lane/inter-core path, and the target lane/processor core performs the post-processing operation.
Optionally, in the method, an instruction moves the lane/processor core number of each lane/processor core into that lane's/core's general-purpose register.

Optionally, in the method, different lanes/processor cores compute different data addresses from the same instruction by virtue of their different lane/processor core numbers.

Optionally, in the method, one or more post-processors are connected with the plurality of lanes/processor cores, receive their execution results, and aggregate those results.
Optionally, in the method, the post-processor stores the aggregation result directly into the memory.

Optionally, in the method, the post-processor sends the aggregation result back into the register of at least one lane/processor core.

Optionally, in the method, the degree of aggregation of the post-processing operation is determined by an instruction.

Optionally, in the method, the post-processor performs the aggregation operation by executing a post-processing instruction.
Optionally, in the method, the post-processors are connected by a transmission bus; by executing the post-processing instruction, each post-processor aggregates the execution results of its corresponding lanes/processor cores with the output of the adjacent post-processor.

Optionally, in the method, the post-processors are connected by a tree-shaped bus. A first-level post-processor executes the post-processing instruction to aggregate the execution results of two corresponding lanes/processor cores, and passes the aggregation result together with the post-processing instruction (or its decoded form) down level by level to the remaining post-processors; each remaining post-processor, by executing the post-processing instruction, further aggregates the aggregation results of the two corresponding previous-level post-processors.

Optionally, in the method, the paths between lanes/processor cores and post-processors, and between post-processors, are configurable.

Optionally, in the method, by turning these paths on or off, a plurality of post-processors perform aggregation operations group by group on grouped lanes.

Optionally, in the method, by turning these paths on or off, the post-processors realize aggregation operations of different degrees of aggregation.
Optionally, the method further comprises a control method. The control method produces, from the output control signal, the next control signal for the case where execution continues in the current state, and receives the next control signal for the case where it does not; then, according to the feedback from the system running under the output control signal, it selects one of these two next control signals to output, so that the system keeps running. The control method at least includes: storing the current control signal and using it to control system operation; producing from the state of the current control signal the next control signal for the continue-in-current-state case; and, according to the feedback from the system running under the current control signal, choosing between that next control signal and the received next control signal for the not-continue case, and updating the current control signal with the chosen result.
Optionally, the method further comprises: counting the executions of a loop body with one or more loop controllers to determine whether the loop has finished, each loop controller corresponding to one loop body in the instruction sequence; and computing, with one or more data engines corresponding to the loop controllers, the addresses of the data used in the corresponding loop bodies, and controlling the memory to complete the data access operations. The loop count in a loop controller is set by an instruction; each time the corresponding loop instruction is reached, the loop count is decremented by one; after all iterations governed by the loop instruction have completed, the loop count is restored to its originally configured value.
Optionally, in the method, each time the loop instruction is reached, the data engine updates the data address and fetches the corresponding data at the new address, ready for use by the lanes/processor cores. If the execution result of the loop instruction indicates that the loop continues, the data engine obtains the new data address by adding an address step to the current data address; if it indicates that the loop has ended, the data engine resets the data address to its originally configured value as the new data address.
Optionally, in the method, a FIFO buffer is used to buffer the data. Once the data engine has been configured, it fetches the data at the configured data address and stores it in the FIFO buffer for use by the lanes/processor cores; after each fetch completes, the data address is updated and the data at the new address is fetched and stored into the FIFO buffer. Each time the loop instruction is reached, the FIFO buffer discards its oldest entry, so that the next-oldest entry becomes the new oldest entry. If the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its originally configured value as the new data address and empties the FIFO buffer.
Optionally, in the method, a fusion module merges the data in the memory with the data sent by the lanes/processor cores. After receiving the data that a lane/processor core writes toward the memory, together with the corresponding address, the fusion module first reads the previously stored data at that address from the memory, performs a computation on it with the data sent by the lane/processor core, and then writes the result back into the memory at that address.
Optionally, in the method, every lane/processor core executes the same program, and a post-processor compares the execution results of the lanes/processor cores to judge whether any lane/processor core is malfunctioning, thereby realizing self-test of the system; when a faulty lane/processor core exists, its lane/processor core number is determined.

Optionally, in the method, the lane/processor core number of the faulty lane/processor core is stored, and the faulty lane/processor core is skipped when lanes/processor cores are allocated, thereby realizing self-repair of the system.
The present invention further proposes a method of executing a program on a multi-lane/multi-core system using normalized lane/processor core numbers, in which each lane/processor core corresponds to a normalized lane/processor core number.
Optionally, in the method, when a plurality of lanes/processor cores execute a loop program, each loop iteration triggers an update of the data address, and memory read or write operations are performed according to the new data address, so that explicit data access instructions need not appear in the loop program.

Optionally, in the method, the plurality of lanes/processor cores execute the same data engine configuration instruction in parallel; according to the configuration information, the data engine computes and produces one or more data addresses, and performs memory read or write operations according to those data addresses.
Optionally, in the process, described data engine is corresponding according at least to each track/processor core
Normalization track/processor core number and address gaps, calculate data corresponding to each track/processor core and initiate
Address.
Optionally, in the process, described data engine is corresponding according at least to each track/processor core
Address step size, calculates data address corresponding during circulation every time.
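The address generation just described can be sketched as a small model. This is an illustrative interpretation only, not the patent's implementation: the class name `DataEngine` and all parameter names (`base`, `lane_diff`, `stride`, `nln`) are assumptions, chosen to mirror the quantities named in the text (starting address, address difference, address stride, normalized lane number).

```python
class DataEngine:
    """Illustrative per-lane address generator: start address from the
    normalized lane number and address difference, then one stride per
    loop iteration."""

    def __init__(self, base, lane_diff, stride, nln):
        self.start = base + nln * lane_diff   # per-lane starting data address
        self.stride = stride                  # added on every loop iteration
        self.addr = self.start

    def next_address(self):
        """Return the current data address, then advance by the stride."""
        a = self.addr
        self.addr += self.stride
        return a

# Four lanes with normalized lane numbers 0..3, adjacent lanes 4 bytes
# apart, each lane stepping by 16 bytes per loop iteration:
engines = [DataEngine(base=0x1000, lane_diff=4, stride=16, nln=n) for n in range(4)]
first = [e.next_address() for e in engines]
second = [e.next_address() for e in engines]
```

Under these assumed numbers, the first iteration touches addresses 0x1000, 0x1004, 0x1008, 0x100C and the second 0x1010, 0x1014, 0x1018, 0x101C, with no data access instruction appearing in the loop body itself.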
Optionally, in the method, when there are multiple levels of loop nesting, the plurality of lanes/processor cores execute the same loop configuration instruction in parallel to configure a loop controller; the configuration information contained in the loop configuration instruction includes at least the loop count of the loop body. The plurality of lanes/processor cores also execute the same loop instruction in parallel; when the loop instruction is executed, the loop controller performs a corresponding count. Based on this count: if the number of completed iterations is less than the loop count of the loop body, the loop controller controls reading of the initial instruction of the loop body for execution, thereby repeating the loop body; if the number of completed iterations equals the loop count of the loop body, the loop controller controls reading of the next sequential instruction after the loop body for execution, thereby terminating execution of the loop body.
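The counting decision made by the loop controller can be modeled as below. This is a behavioral sketch under stated assumptions: the method name `on_loop_instruction` and the address values are invented for illustration; the patent does not specify this interface.

```python
class LoopController:
    """Illustrative model of the loop controller described above: configured
    with a loop count, it counts each executed loop instruction and decides
    whether to jump back to the loop body or fall through."""

    def __init__(self, loop_count, body_start, after_loop):
        self.loop_count = loop_count    # from the loop configuration instruction
        self.done = 0                   # iterations completed so far
        self.body_start = body_start    # address of the loop body's first instruction
        self.after_loop = after_loop    # next sequential address after the body

    def on_loop_instruction(self):
        """Called when the lanes execute the loop instruction; returns next PC."""
        self.done += 1
        if self.done < self.loop_count:
            return self.body_start      # repeat the loop body
        return self.after_loop          # loop finished, continue sequentially

lc = LoopController(loop_count=3, body_start=0x40, after_loop=0x80)
trace = [lc.on_loop_instruction() for _ in range(3)]
```

With a loop count of 3, the first two executions of the loop instruction return to the body and the third falls through, giving `trace == [0x40, 0x40, 0x80]`.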
Optionally, in the method, each loop controller cooperates with one or more data engines; each loop iteration triggers the data engines to compute new data addresses, and memory reads or writes are performed according to the new data addresses, so that explicit data access instructions in the loop program can be avoided.
Optionally, in the method, execution of the loop instruction serves as the trigger condition.
Optionally, in the method, a program that needs to be executed a plurality of times is expanded two-dimensionally and executed in parallel by a plurality of lanes/processor cores; the number of required executions is the degree of parallelism. The two-dimensional expansion includes spatial expansion and temporal expansion. In spatial expansion, a plurality of lanes/processor cores simultaneously execute the same instructions on different data, so that the program is expanded along the spatial dimension and each lane/processor core executes the program in parallel. In temporal expansion, when the number of available lanes/processor cores is less than the degree of parallelism, the plurality of lanes/processor cores execute the program a plurality of times, so that the program is expanded along the time dimension and the plurality of lanes/processor cores execute the program serially in successive rounds.
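The space/time split described above can be sketched as a minimal calculation; `expand` and its argument names are illustrative, not taken from the patent:

```python
import math

def expand(parallelism, lanes):
    """Split `parallelism` required executions into a spatial width
    (lanes running in parallel) and a temporal depth (serial rounds)."""
    rounds = math.ceil(parallelism / lanes)   # time dimension
    width = min(parallelism, lanes)           # space dimension
    return width, rounds

# 10 required executions on a 4-lane system: 4 lanes, 3 serial rounds
# (the last round only partially occupied).
assert expand(10, 4) == (4, 3)
# Enough lanes available: pure spatial expansion, a single round.
assert expand(3, 4) == (3, 1)
```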
Optionally, in the method, a lane allocator progressively deducts the number of available spatial resources from the number of spatial resources requested by the program, mapping the demand for spatial resources onto the use of temporal resources.
Optionally, in the method, the lane allocator outputs the normalized lane number corresponding to a base lane/processor core, thereby determining the starting data address of the data engine in the base lane/processor core, which is defined as the spatial starting point of the spatial expansion; the lane allocator defines the spatial scale of the current spatial expansion according to the number of available lanes/processor cores, and computes the temporal scale to control time/space conversion when the lanes/processor cores execute.
Optionally, in the method, by adjusting the normalized lane number of the base lane/processor core, it is determined which part of the spatial expansion the current round of temporal expansion executes.
Optionally, in the method, the program or instruction sequence to be run explicitly states its degree-of-parallelism demand; when the program or instruction sequence is run, the multi-lane/multi-core system automatically allocates lanes/processor cores according to the demand; when the number of available lanes/processor cores cannot meet the demand, the multi-lane/multi-core system executes the program in several successive rounds to satisfy the demand.
Optionally, in the method, when compiling the program, a compiler determines the maximum degree of parallelism with which the loop body can be executed in parallel, and generates a set-parallelism instruction containing this maximum degree of parallelism; when the multi-lane/multi-core system executes the loop program, the lane allocator executes the set-parallelism instruction, allocates lanes/processor cores according to the number available, and determines the number of lanes/processor cores participating in parallel execution as well as the number of times these lanes/processor cores execute the loop of the program.
Optionally, in the method, the number of executions already expanded and performed in parallel is deducted from the degree of parallelism of the program to obtain the remaining degree of parallelism; when the remaining degree of parallelism is less than the number of available lanes/processor cores, the lane allocator allocates a corresponding number of lanes/processor cores to execute the program in parallel, and after this round of execution finishes, the program has been completely executed.
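The round-by-round deduction of remaining parallelism can be sketched as follows; the function name and return shape are invented for illustration:

```python
def allocate(parallelism, available_lanes):
    """Illustrative lane-allocator loop: deduct the lanes used each round
    from the remaining degree of parallelism until the demand is met.
    Returns the number of lanes occupied in each round."""
    remaining = parallelism
    rounds = []
    while remaining > 0:
        used = min(remaining, available_lanes)
        rounds.append(used)
        remaining -= used
    return rounds

# Degree of parallelism 10 on 4 available lanes: two full rounds, then a
# final round in which only 2 lanes are allocated.
assert allocate(10, 4) == [4, 4, 2]
```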
Under the teaching of the present description, the claims and the drawings, those skilled in the art will be able to understand that the present invention also encompasses other aspects.
Beneficial effects
The multi-lane system of the present invention, with inter-lane processing and a post-processor performing mapping/aggregation operations, avoids the large number of memory accesses of a traditional multi-lane system and thus significantly improves performance. A program explicitly states its degree of parallelism, the processor allocates execution resources as requested, and when the degree of parallelism cannot be met, the instruction segment is automatically executed in cycles to satisfy the program's requirement. The processor system uses inter-lane processing or post-processing to perform self-test and self-repair, reducing test cost and increasing system yield and reliability.
Other advantages and applications of the present invention will be apparent to those skilled in the art.
Description of the drawings
Fig. 1 is an embodiment of a graphics processing unit according to the prior art;
Fig. 2 is an embodiment of the extended instruction format of the present invention;
Fig. 3 is an embodiment of the inter-lane global bus of the present invention;
Fig. 4 is an embodiment of a multi-lane system of the present invention containing improved lanes;
Fig. 5 is an embodiment of performing an aggregation operation on the lanes that performed the map operation, using the inter-lane connection bus and the original computing resources of each lane;
Fig. 6 is an embodiment with an inter-lane instruction aggregation degree of '2';
Fig. 7 is an embodiment with an inter-lane instruction aggregation degree of '4';
Fig. 8 is an embodiment of a post-processor with a tree-structured bus;
Fig. 9 is an embodiment of a post-processor with a transmission bus structure;
Fig. 10A is an embodiment of a reconfigurable tree bus;
Fig. 10B is an embodiment of a configuration of the reconfigurable tree bus;
Fig. 10C is another embodiment of a configuration of the reconfigurable tree bus;
Fig. 10D is another embodiment of a configuration of the reconfigurable tree bus;
Fig. 11 is an embodiment of a reconfigurable transmission bus;
Fig. 12A is an embodiment of a concrete structure with reconfigurable aggregation degree;
Fig. 12B is an embodiment of the configuration when the aggregation degree is '2';
Fig. 12C is an embodiment of the configuration when the aggregation degree is '4';
Fig. 12D is an embodiment of the configuration when the aggregation degree is '8';
Fig. 13A is an embodiment of a multi-lane system with improved loop efficiency;
Fig. 13B is an embodiment of a multi-lane system with improved loop efficiency applied to matrix multiplication;
Fig. 14A is an embodiment of matrix multiplication;
Fig. 14B is a schematic diagram of the stepwise generation of the matrix multiplication result;
Fig. 15 is an embodiment of an instruction sequence realizing said matrix multiplication.
Detailed description of the invention
The multi-lane system and method proposed by the present invention are described in further detail below with reference to the drawings and specific embodiments. The advantages and features of the present invention will be apparent from the following description and claims. It should be noted that the drawings all use a greatly simplified form and imprecise proportions, solely for the purpose of conveniently and clearly aiding the illustration of the embodiments of the present invention.
It should be noted that, in order to clearly illustrate the content of the present invention, multiple embodiments are given to further explain the different implementations of the present invention; these multiple embodiments are enumerative rather than exhaustive. In addition, for brevity, content already mentioned in an earlier embodiment is often omitted in a later embodiment; content not mentioned in a later embodiment may therefore be found by reference to the earlier embodiments.
Although this invention may be extended through modification in various forms and through substitution, the description also lists concrete implementation examples and explains them in detail. It should be understood that the inventors' intention is not to limit the invention to the particular embodiments illustrated; on the contrary, the intention is to protect all improvements, equivalent transformations and modifications made within the spirit or scope defined by the claims. The same component numbers may be used throughout the drawings to denote the same or similar parts.
In addition, this description takes a multi-lane system comprising lanes as an example, but the technical solution of the present invention can also be applied to a multi-core system comprising any suitable processor (Processor). For example, the processor may be a processor core (Processor Core), a general-purpose processor (General Processor), a central processing unit (CPU), a microcontroller (MCU), a digital signal processor (DSP), a graphics processor core (GPU Core), a system on a chip (SOC), an application-specific integrated circuit (ASIC), and so on.
An instruction address (Instruction Address) in the present invention refers to the storage address of an instruction in main memory, i.e., the address at which the instruction can be found in main memory; a data address (Data Address) refers to the storage address of a datum in main memory, i.e., the address at which the datum can be found in main memory. Here, for simplicity, it is assumed that the virtual address equals the physical address; the method of the present invention is equally applicable where address mapping is required. In the present invention, the current instruction may refer to the instruction currently being executed or fetched by a lane; the current instruction block may refer to the instruction block containing the instruction currently being executed by a lane.
In the present invention, when executing the same data access instruction, each lane in the multi-lane system can compute a different data address according to its own lane number, thereby accessing different data to realize SIMD operation. Unlike the prior art, where either no lane number exists or the lane number is usually hardwired, the lane number participating in the data address computation in the multi-lane system of the present invention may come from a preconfigured register value or from a dynamically configured register value.
Specifically, according to the technical solution of the present invention, a lane number (Lane Number, LN) can be filled into a dedicated lane register of each lane at system initialization. In this way, once a lane in the multi-lane system fails (for example, due to a manufacturing defect), lane numbers can be redistributed among the remaining lanes of the system and written into the lane registers of the corresponding lanes, substituting a redundant lane for the faulty lane and improving the yield and reliability of the multi-lane system. In addition, according to the technical solution of the present invention, a lane number can also be dynamically assigned to each lane by programming while the multi-lane system is running, improving the flexibility of the multi-lane system.
In the present invention, a new extended instruction can be obtained by extending an existing CPU instruction set, and used to move (move) the lane number from the dedicated register into a general-purpose register of the lane for subsequent operations. Unlike existing multi-lane systems, the present invention does not necessarily require a new development environment (such as CUDA) to compile the extended instruction. Specifically, the extended instruction can have the same instruction format as existing CPU instructions. Of course, the present invention can also run extended instructions compiled by a new development environment.
Please refer to Fig. 2, which is an embodiment of the extended instruction format of the present invention. In this embodiment the textbook DLX instruction set is taken as the example CPU instruction set; other CPU instruction sets can be handled in the same way, which is not repeated here. Instruction format A shows the format of a typical register-type (R-Type) instruction in the DLX instruction set, in which instruction field 21 is the operation code (OPcode), instruction fields 22 and 24 are the two source register numbers, instruction field 23 is the destination register number, and instruction fields 25 and 26 are the extended operation code.
According to the technical solution of the present invention, a lane-move instruction can be extended on the basis of the register-type instruction. If the lane number of the multi-lane system is fixed in each lane by hardwiring or read-only memory (ROM), then in instruction format A, instruction field 21 is the corresponding operation code, instruction field 23 is the destination register number, and instruction fields 22, 24, 25 and 26 are all unused (or reserved for other extension purposes). In this case, when the instruction decoder of a lane decodes this instruction as a lane move, it can directly store the hardwired lane number into the register pointed to by instruction field 23.
If the multi-lane system stores the lane number in a register, the original lane number is first written into the register of each lane by software. In this case, when the instruction decoder of a lane decodes this instruction as a lane move, the value (the lane number) in the lane register serving as the source register can be moved directly into the general-purpose register serving as the destination register, or the value in the general-purpose register serving as the source register can be configured as the lane number into the lane register serving as the destination register. The lane number of each lane can also be filled in under the control of an on-chip controller.
The multi-lane system of the present invention can also have a global bus for transmitting data between all lanes. Please refer to Fig. 3, which is an embodiment of the inter-lane global bus of the present invention. This embodiment takes four lanes 31, 32, 33 and 34 as an example, whose corresponding lane numbers are '0', '1', '2' and '3' respectively; each lane contains an independent buffer and arithmetic unit, such as a register file (RF) and an execution unit (EX). The following description takes processor lanes performing register-register operations as an example, but the same method and system can also be applied to multi-lane processors whose lanes are other types of processors, such as stack-based, accumulator-based and register-memory processors. Unlike the prior art, in this embodiment the register file of each lane has a read port that can be connected, under the control of a read switch (such as read switch 36 corresponding to lane 31), to the portion of global bus 35 corresponding to that lane; a write port 38 is also connected to global bus 35. The portions of the global bus corresponding to the individual lanes are connected or disconnected by bus switches (such as bus switch 37 corresponding to lane 31), so that a register value of any one lane can be sent to any other lane. In addition, in this embodiment, data on global bus 35 can be stored into a register file through said write port (for example, a register value of lane 31 is stored into the register file of lane 33 through global bus 35), or bypassed directly to an execution unit to perform a corresponding computing operation (for example, a register value of lane 31 is delivered through global bus 35 to execution unit 39 of lane 34 to be added to a register value of lane 34).
Further, the lanes of the multi-lane system can be divided into several lane groups, with the bus switches of the global bus inside each lane group turned on and the bus switches between lane groups turned off, so that each lane group can simultaneously carry out its own cross-lane register moves or computing operations. Each lane has a base register, written by the lane allocator. The lane allocator assigns some lanes of the multi-lane system to a thread; the base register of one lane in the group is set to the base lane (Starting Lane) value, and the operation of that lane differs slightly from the remaining lanes in the group; the base registers of the remaining lanes in the group are set to the non-base-lane value. The examples below assume that the leftmost lane in each group is the base lane.
According to the technical solution of the present invention, the same instruction can simultaneously write a certain register value of one lane into the same destination register of all other lanes; such an instruction is called a broadcast instruction and can take format A of Fig. 2. In this instruction, field 21 is the broadcast-data-move operation code, field 22 is the source register number, and field 23 is the destination register number, which may be the same as or different from the source register number. Suppose in Fig. 3 that lane 31 is the base lane; when lanes 31, 32, 33 and 34 execute this instruction, lane 31 decodes the instruction as a broadcast data move, checks the base register of the lane, and, because its value indicates a base lane, turns on read switch 36 corresponding to lane 31; lanes 32, 33 and 34 decode the broadcast data move, check the base registers of their respective lanes, and, because their values indicate non-base lanes, turn off their corresponding read switches. If in this broadcast data move instruction field 22 is R8 and the destination register field is R9, then when the instruction decoders of the lanes control the read, the value of register R8 of lane 31 is sent over global bus 35 to register R9 of lanes 32, 33 and 34 for storage. Another broadcast mode can designate a lane number in instruction fields 25 and 26; the lane in the group having that lane number acts as the source lane (it may or may not be the base lane), and the lanes in the group not having that lane number act as the remaining lanes of the above example. The specific operation process is the same as above and is not repeated here. This operation can realize functions such as common parameter passing, scalar transmission and lane-to-lane transfer.
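The bus-level behavior of the broadcast data move can be sketched as below. This is a software model under stated assumptions: the register files are modeled as dictionaries, the read/bus switches are collapsed into a single bus variable, and all names are illustrative.

```python
def broadcast_move(lanes, src_lane, src_reg, dst_reg):
    """Illustrative broadcast data move: the source lane's read switch
    drives the global bus, and every lane latches the bus value into its
    destination register (which may equal the source register)."""
    bus = lanes[src_lane][src_reg]      # source lane drives the global bus
    for rf in lanes:
        rf[dst_reg] = bus               # each lane writes from the bus

# Four lanes; lane 0 plays the role of the base lane of Fig. 3.
lanes = [{"R8": v, "R9": 0} for v in (11, 22, 33, 44)]
broadcast_move(lanes, src_lane=0, src_reg="R8", dst_reg="R9")
# R9 of every lane now holds lane 0's R8 value (11); each lane's own R8
# is untouched.
```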
According to the technical solution of the present invention, a multi-lane parallel data access instruction can also be extended on the basis of an existing instruction set. When a multi-lane system performs SIMD operation, the lanes are expected to access different data addresses simultaneously under the control of the same data access instruction. In the present invention, the data address can be determined according to the lane number, realizing different operations in each lane under the control of the same instruction. Taking as an example the case where the data accessed by adjacent lanes are separated by the same address difference Diff, if the data address accessed by the base lane (hereinafter called the starting address, Starting Address) and the address difference are known, then each remaining lane only needs the difference between its own lane number and the lane number of the base lane (hereinafter called the base lane number, Starting LN) to compute its own data address and access its own data.
In the present invention, the difference between the lane number of each lane and the base lane number is called the normalized lane number (Normalized Lane Number, NLN), i.e., NLN = LN - Starting LN. This normalization operation can be realized with the lane-move instruction and cross-lane computation instructions described above. Specifically, all lanes first execute the same lane-move instruction, moving their lane numbers into a certain register (for example register R1 of each lane). Then, all lanes execute the same broadcast subtraction instruction, in which instruction field 21 is the broadcast-subtract operation code, instruction field 22 is R1, instruction field 24 is R1, and instruction field 23 is R2; its meaning is that in every lane the value of the base lane's register R1 (specified by instruction field 22) is subtracted from the value of that lane's own register R1 (specified by instruction field 24), and the result is stored into register R2 of every lane (specified by instruction field 23). The decoding of this instruction by each lane and the setting of the inter-lane bus are as described above and are not repeated. If register R1 of each lane holds that lane's lane number, then after this broadcast subtraction instruction completes, the value of register R2 in each lane is exactly the normalized lane number of that lane relative to the base lane. Taking the multi-lane system of the Fig. 3 embodiment as an example, suppose only lanes 33 and 34 are used and lane 33 is the base lane; after the above lane-move instruction has executed, the values in registers R1 of lanes 33 and 34 are '2' and '3' respectively. After the above cross-lane subtraction instruction has then executed, the values in registers R2 of lanes 33 and 34 are '0' and '1' respectively, i.e., the corresponding normalized lane numbers.
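The normalization of this example reduces to a single subtraction per lane; a minimal sketch, with the broadcast-subtract collapsed into ordinary Python arithmetic (function and argument names are illustrative):

```python
def normalize(lane_numbers, starting_ln):
    """NLN = LN - Starting LN for each lane: every lane subtracts the base
    lane's number (broadcast to all lanes) from its own lane number."""
    return [ln - starting_ln for ln in lane_numbers]

# Lanes 33 and 34 of the Fig. 3 example hold lane numbers 2 and 3 in R1;
# lane 33 (lane number 2) is the base lane, so R2 receives 0 and 1.
assert normalize([2, 3], starting_ln=2) == [0, 1]
```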
The above broadcast subtraction instruction can also be realized by the cooperation of a broadcast data move instruction and an ordinary subtraction instruction: the lane number of the base lane is first moved into every lane by the broadcast data move instruction, and each lane then subtracts it from its own lane number to obtain its own NLN.
The following description still takes extending a multi-lane parallel data access instruction on the basis of the DLX instruction set as an example. In Fig. 2, instruction format B shows the format of a typical memory-type (M-Type) instruction in the DLX instruction set, in which instruction field 21 is the operation code, instruction field 22 is the base address register number, and instruction field 28 is the address offset (OFFSET). The data address (ADDR) is obtained by adding the base address value (BASE) in the register pointed to by the base address register number to the address offset. For a data load instruction, instruction field 23 is the destination register number; for a data store instruction, instruction field 23 is the source register number.
According to the technical solution of the present invention, multi-lane parallel data access can be realized using the existing memory-type instruction format. Specifically, for one data access, the global base address is first added to the address offset Offset to obtain the starting address of the access, which is stored in the register file of each lane; this starting address is called (BASE). After the normalized lane number NLN has been computed and written into register R2, all lanes execute the same multiplication instruction, multiplying the address difference Lane Diff by the value in register R2; after each lane obtains its result (Lane Diff * NLN), the same addition instruction is executed, adding the multiplication result to the starting address of the data access, and the result ((BASE) + Diff * NLN) is written back to the register. In this way, when all lanes execute a data access instruction, each can perform its own data access at the address stored in the register pointed to by register number Base in its own register file, different in each lane because of the different NLN values. If each lane then executes the same ordinary load (Load) or store (Store) instruction, loading or storing of multiple data is realized. Thereafter, each lane can, with the same instruction, add a stride value (Stride) to its above data access address to produce the next data access address for the next iteration of the program loop.
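The multiply/add sequence above can be emulated at the register level as follows. This is a sketch under stated assumptions: the dictionaries stand in for per-lane register files, and the register names and mnemonics in the comments are illustrative rather than actual DLX encodings.

```python
def emulate(lanes, lane_diff):
    """Run the same two-instruction sequence on every lane's register file:
    multiply the address difference by NLN (in R2), then add the product to
    the starting address held in register Base."""
    for rf in lanes:                        # same instructions, every lane
        rf["R3"] = lane_diff * rf["R2"]     # MUL  R3, Diff, R2
        rf["Base"] = rf["Base"] + rf["R3"]  # ADD  Base, Base, R3

# Four lanes with NLN 0..3 in R2 and a shared starting address 0x2000:
lanes = [{"R2": n, "Base": 0x2000} for n in range(4)]
emulate(lanes, lane_diff=8)
# each lane's Base register now holds a distinct data address:
# 0x2000, 0x2008, 0x2010, 0x2018
```

A subsequent ordinary load or store through register `Base` would then access four different addresses under one shared instruction, which is the point of the scheme.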
According to the technical solution of the present invention, the structure of the lane itself can also be improved, so that the multi-lane system can realize not only SIMD operation but also super-SIMD operation, multiple-instruction single-data-stream (MISD) operation and multiple-instruction multiple-data-stream (MIMD) operation, and thus better perform map operations of high parallelism.
Please refer to Fig. 4, which is an embodiment of a multi-lane system of the present invention containing improved lanes. This embodiment takes a multi-lane system containing 4 lanes as an example. The multi-lane system comprises a filler 402, a tag memory (Tag) 404, a scanner 408, a track table (Track Table) 410, an instruction memory 406, and four lanes 401, 403, 405 and 407. The structure of each lane is identical; taking lane 401 as an example, besides execution unit 411 and register file 412, it also contains a tracker (Tracker) 414 and an instruction read buffer 417.
In this embodiment, lanes 401, 403, 405 and 407 share the instructions in instruction memory 406. The following description takes lane 401 as an example; the relevant structure and operation also apply to the other lanes. When execution unit 411 of lane 401 executes an instruction, the instruction must first be read from instruction memory 406 and placed into instruction read buffer 417. In this embodiment, the capacity of instruction read buffer 417 is smaller than that of instruction memory 406, and its access latency is correspondingly shorter. Instruction memory 406 and instruction read buffer 417 may be formed of any suitable storage devices. Instruction memory 406 is organized by instruction block (Instruction Block); an entry in tag memory 404 and a row in track table 410 correspond to an instruction block in 406, and all three are addressed by the buffer address BNX. An entry in tag memory 404 holds the block address of the corresponding instruction block in 406 (i.e., the high-order part of the PC address); a row in track table 410 (called a track) has a plurality of entries, each entry corresponding to one instruction in the corresponding instruction block, both being addressed by the intra-block offset BNY of the PC address. For ease of illustration, the instruction memory is organized in a direct-mapped manner, so that the tag (Tag) in the PC address equals the tag in this example, the index (Index) equals the BNX in this example, and the intra-block offset (Offset) equals the BNY in this example. Hereinafter, an address comprising Tag, BNX and BNY is called an instruction address, and an address comprising BNX and BNY is called a buffer address.
Filler 402 fetches an instruction block from a lower-level memory and stores it into instruction memory 406 according to the index value, and also fills the address tag of the instruction block into tag memory 404 according to the index value. Scanner 408 examines each instruction in the instruction block being stored into instruction memory 406 and extracts information such as the instruction type (here assumed to be one of three kinds: non-branch, unconditional branch and conditional branch) and the instruction address, and stores the instruction type into the entry of track table 410 pointed to by the buffer address, the entry corresponding to the instruction in instruction memory 406 from which the information was extracted. If the instruction being examined is a branch instruction, scanner 408 further computes the branch target instruction address by adding the instruction address and the branch offset (Branch Offset) contained in the instruction. The tag portion of the computed branch target instruction address is matched against the tags stored in tag memory 404. If there is no match, the branch target instruction address is sent to filler 402, which, as above, fetches the instruction block from the lower-level memory, stores it into instruction memory 406 according to the index value, and builds the corresponding tag and track. If there is a match, the BNX value of the matching entry, together with the intra-block offset BNY of the branch target address, is stored as the buffer address of the branch target into the track table entry pointed to by the buffer address of the branch instruction. Thus each entry of a track in track table 410 corresponds to one instruction in an instruction block of instruction memory 406: each entry contains at least the instruction type, and an entry for a branch instruction further contains the buffer address of the branch target of that branch instruction. An end track point is additionally appended at the end of every track, containing the buffer address of the next sequential instruction block; this buffer address is obtained by matching in tag memory 404, as in the above example, the instruction address of the next instruction block, which is the instruction address of the current track plus the instruction block length. The instruction type of the end track point is unconditional branch. Track table 410 therefore stores a network of tracks containing the sequential and branch relations between all instructions stored in instruction memory 406.
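The scanner's construction of one track can be sketched as follows. This is a loose behavioral model, not the patent's hardware: the tuple encoding of instructions, word-granular addressing, the type strings and the fixed block length of 8 are all assumptions made for illustration.

```python
def scan_block(block_addr, instrs, block_len=8):
    """Illustrative scanner: build one track-table row for a block.
    `instrs` is a list of (type, branch_offset_or_None) tuples. For a
    branch, the target = instruction address + branch offset; an end
    track point targeting the next sequential block is appended."""
    track = []
    for i, (typ, off) in enumerate(instrs):
        entry = {"type": typ}
        if typ in ("uncond_branch", "cond_branch"):
            entry["target"] = block_addr + i + off   # word-addressed target
        track.append(entry)
    # end track point: next sequential block, typed as unconditional branch
    track.append({"type": "uncond_branch", "target": block_addr + block_len})
    return track

t = scan_block(0x100, [("non_branch", None), ("cond_branch", 6)])
# t[1] records a conditional branch targeting 0x107 (0x100 + 1 + 6);
# t[2] is the end track point targeting the next block at 0x108.
```

In the patent the target would further be translated from an instruction address into a buffer address (BNX, BNY) by matching in the tag memory; that translation step is omitted here for brevity.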
A tracker 414 cooperating with track table 410 can control instruction memory 406 to supply instructions to lane 401 for execution. Tracker 414 consists of incrementer 441, selector 442, and register 443. The output of register 443 is a buffer address: its BNX portion points to an instruction block in instruction memory 406 and also to the corresponding track in track table 410, while its BNY portion selects one instruction from that block for lane 401 to execute and also reads out the corresponding entry of that track. Selector 442 is controlled by the instruction type information in the entry and by the branch decision produced by the branch decision logic in lane 401. When the instruction type output by 410 is non-branch, selector 442 selects the output of incrementer 441, which is the buffer address of the current instruction incremented by one; in the following clock cycle the new buffer address points to the sequentially next instruction, which is read out for lane 401 to execute. When the instruction type output by 410 is unconditional branch, selector 442 selects the branch-target buffer address contained in the entry output by track table 410, and in the following clock cycle the lane executes the branch-target instruction. When the instruction type output by 410 is conditional branch, selector 442 is controlled by the branch decision provided by lane 401: if the branch is judged not taken, selector 442 selects the incrementer 441 output as before, and in the following clock cycle the lane executes the sequentially next instruction; if the branch is judged taken, selector 442 selects the branch-target buffer address contained in the entry output by track table 410, and in the following clock cycle the lane executes the branch-target instruction. Tracker 414 thus works together with track table 410, determining the program flow from the instruction type and the branch decision signal fed back by lane 401, and continuously supplies instructions to the lane.
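The selector behavior just described can be sketched as a small decision function. This is a hedged illustration with invented names, not the patent's implementation; it only shows how the three instruction types steer the next buffer address:

```python
# Sketch of selector 442's choice between the incrementer path (BNY + 1)
# and the branch-target buffer address read from the track table entry.
def next_buffer_address(bnx, bny, entry_type, target=None, taken=False):
    if entry_type == "non-branch":
        return (bnx, bny + 1)                 # incrementer 441 path
    if entry_type == "uncond":
        return target                         # track-table target path
    if entry_type == "cond":
        return target if taken else (bnx, bny + 1)
    raise ValueError(entry_type)

print(next_buffer_address(2, 5, "non-branch"))           # (2, 6)
print(next_buffer_address(2, 5, "cond", (7, 0), True))   # (7, 0)
print(next_buffer_address(2, 5, "cond", (7, 0), False))  # (2, 6)
```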
A plurality of lanes, each with its own tracker, could independently access a single instruction memory having a plurality of read ports to obtain independent instruction streams for the respective lanes. The embodiment of Fig. 4 instead realizes this function with an instruction memory 406 having a single read port together with a plurality of instruction read buffers (IRB) 417. In this implementation, what is stored in IRB 417 is a subset of the contents of instruction memory 406 and of the corresponding contents of track table 410; tracker 414 is therefore actually coupled to IRB 417 and works together with it. Only when the required data is absent from 417 is the request sent, through an arbiter (not shown) that arbitrates among the requests from the four lanes, to instruction memory 406 and track table 410 as shown in Fig. 4, so that the instruction block and the corresponding track are fetched and stored into 417.
IRB 417 contains one or more instruction storage blocks, at least one of which stores an instruction block containing the current instruction. Corresponding to each instruction storage block there is a BNX memory, a matcher, and a track memory. When an instruction block is stored into an instruction storage block of IRB 417, its BNX value and corresponding track are also read into 417. The buffer address output by register 443 of the tracker is sent to IRB 417 and matched against all BNX values stored there. If it matches the BNX of one of the instruction storage blocks, the BNY portion reads one instruction from that block for the lane to execute, and also reads one entry from the corresponding track, which is supplied to the tracker; subsequent operation is the same as the tracker reading an entry from the track table in the preceding example and is not repeated. If there is no match, the buffer address is sent (through arbitration) to instruction memory 406 and track table 410, from which the instruction block and the corresponding track are read out and stored into the instruction storage block of IRB 417 designated by its replacement logic. Execution then proceeds as in the matching case above and is not repeated.
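The IRB lookup path above — BNX match serving the instruction and track entry locally, a miss deferring to the arbiter — can be sketched as follows. The class and field names are assumptions for illustration only:

```python
# Sketch of IRB 417's lookup: buffer address BNX is matched against the BNX
# tags of the cached instruction storage blocks; a hit returns the instruction
# and track entry, a miss would arbitrate for instruction memory 406 / track
# table 410 and fill a block chosen by replacement logic.
class IRB:
    def __init__(self):
        self.blocks = {}              # BNX -> (instruction block, track)

    def fill(self, bnx, instrs, track):
        self.blocks[bnx] = (instrs, track)

    def lookup(self, bnx, bny):
        if bnx in self.blocks:        # BNX matcher hit
            instrs, track = self.blocks[bnx]
            return ("hit", instrs[bny], track[bny])
        return ("miss", None, None)   # would go through the arbiter

irb = IRB()
irb.fill(3, ["add", "sub", "beq"], ["non-branch", "non-branch", "cond"])
print(irb.lookup(3, 2))   # ('hit', 'beq', 'cond')
print(irb.lookup(8, 0))   # ('miss', None, None)
```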
Fig. 4 shows four lanes 401, 403, 405, 407 and their respective dedicated trackers 414, 416, 418, and 420. Each tracker is coupled to the IRB of its own lane, and they work together under the control of the branch decision produced in that lane. Only when the IRB of a lane lacks the required data does the tracker of that lane access instruction memory 406 and track table 410 to fetch data and fill the lane's IRB, thereby reducing the bandwidth demand on memory 406 and track table 410. Lanes 403, 405, and 407 have the same structure as lane 401, each having an IRB like 417, an execution unit like 411, and a register file like 412.
As described in the preceding examples, in the multi-lane system of the present invention the data in one lane may be delivered to the other lanes through a global bus (not shown in Fig. 4), so that data in register files 412 may be exchanged between lanes.
When the multi-lane system of the present invention performs single-instruction single-data (SISD) processing, only one of the four lanes needs to work. For example, the IRB 417 of lane 401 outputs instructions to execution unit 411 for execution under the control of tracker 414, while the other three lanes and their corresponding IRBs, execution units, and trackers may be shut down, for example by gating off the clock signals or power supplies of those three lanes' IRBs, execution units, register files, and trackers. The global bus between lanes is also unused. In this way the function of an existing SISD processor can be realized.
When the multi-lane system of the present invention performs single-instruction multiple-data (SIMD) processing — taking the use of all four lanes as an example — the IRBs of the four lanes store the same instruction blocks, and the respective trackers perform identical actions, each controlling its own IRB to supply the same instructions to the corresponding execution unit. The register files of the lanes may store different data, and the load/store unit corresponding to each lane may perform read/write operations on different data addresses. Thus the four lanes execute the same program while the data used by each lane may differ, realizing the same function as an existing SIMD processor. This embodiment does not use the global bus between lanes. In this way the function of an existing SIMD processor can be realized.
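The SIMD mode above — one instruction stream, per-lane data — can be illustrated with a toy program. The two-instruction "ISA" here is invented purely for demonstration and is not the patent's instruction set:

```python
# Sketch: every lane's IRB holds the same program; each lane's register file
# holds different data, so the same instructions produce different results.
program = [("mul", 0, 0), ("addi", 0, 10)]    # square r0, then add 10

def run_lane(regs):
    for op, r, arg in program:
        if op == "mul":
            regs[r] = regs[r] * regs[arg]     # r0 = r0 * r0
        elif op == "addi":
            regs[r] = regs[r] + arg           # r0 = r0 + imm
    return regs

lanes = [run_lane([d]) for d in (1, 2, 3, 4)]  # per-lane register files
print(lanes)   # [[11], [14], [19], [26]]
```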
In the above embodiments, the IRB in each lane outputs instructions under the control of the read-pointer value of the tracker in the same lane; that is, a corresponding word line is generated according to the read-pointer value to address the storage cells in the IRB and read out their contents. According to the technical scheme of the present invention, the IRB may be further improved by linking the corresponding storage cells of the IRBs of all lanes together through configurable switches. When the switches are off, each lane, as in the preceding examples, supplies instructions from its own IRB to its corresponding execution unit under its own control. When the switches are on, only the tracker of one lane works (the trackers of the remaining lanes are idle), and the word line generated according to its read pointer controls the IRBs of all lanes to output the contents of the same storage cell. As long as every IRB stores identical contents, the same instruction can then be output to the execution units of all lanes under the control of the tracker of that one lane, realizing SIMD operation. This improvement is applicable to any SIMD or super-SIMD operation of the present invention.
The multi-lane system of the present invention can also work in a super-SIMD mode. In ordinary SIMD, at any given moment all lanes execute the same instruction. When a branch occurs in the program, the existing SIMD mode has each lane evaluate its own branch condition, producing a per-lane mask value, and the lanes then execute in stages under the control of the mask. First, the lanes whose mask value is 0 (judged not taken) execute the not-taken instructions (i.e., the instructions sequentially following the branch instruction), while the lanes whose mask value is 1 stall and wait. Then the mask-0 lanes stall and wait, while the remaining lanes whose mask value is 1 (judged taken) execute the branch-target instruction and its subsequent instructions. If there are multiple levels of branches, the efficiency of the multi-lane processor degrades severely, so that existing SIMD processors are suited only to certain mutually independent programs without branches. The super-SIMD mode of the present invention differs from the existing SIMD mode: the lanes execute the same program, but each lane has its own independent branch decision mechanism and instruction addressing mechanism, and can independently execute different sections or different branches of the same program, so that efficiency can remain 100% even when multiple branches are executed.
When the multi-lane system of the present invention performs super-SIMD processing, the resources of one thread are used. The four IRBs store the same instruction blocks and, under the control of their respective trackers, supply the same or different instructions of the same program to the corresponding execution units, while the register files corresponding to the lanes may store different data. Because the data used by each lane in executing the program may differ, even when executing the same instruction the branch decisions of the respective lanes may differ, causing the four lanes to execute different branches of the program from a branch point onward. Because in this mode each lane executes the program in parallel and independently under the control of its own tracker, the efficiency in the presence of branches far exceeds that of existing SIMD processing, which handles branches in multiple passes using mask registers. This embodiment does not use the global bus between lanes. In this way, the same function as existing SIMD processing can be realized with very high efficiency.
When the multi-lane system of the present invention performs multiple-instruction single-data (MISD) processing, the IRBs of the four lanes store different instruction blocks and, under the control of their respective trackers, supply different instructions to the corresponding execution units. Here, data is read from the data cache (not shown) through the register file of one lane, and the data in that register file is delivered, as in the preceding examples, through the global bus between lanes to the register files of the other three lanes, so that the data in the register files of all four lanes stays consistent. The four lanes can thus execute different programs simultaneously on the same data, realizing the function of an MISD processor.
When the multi-lane system of the present invention performs multiple-instruction multiple-data (MIMD) processing, the IRBs of the four lanes store different programs and, under the control of their respective trackers, supply different instructions to the corresponding execution units. Correspondingly, the register files of the four lanes independently read different data from the data cache or write data back to the data cache. This embodiment does not use the global bus between lanes. In this way, different programs can be executed simultaneously based on different data sources — the program and data executed by each lane are mutually orthogonal — realizing the function of an MIMD processor. An MIMD processor is well suited to performing the map operation of a Map-Reduce job.
On the other hand, reduction is a process of aggregating data, and its degree of parallelism is generally smaller than that of the map process. The existing method is to perform the reduction with an SISD processor or a multi-core processor: the map results are split off from the computers that performed the map and distributed, through memory and network having long latency, to the computers performing the reduction. For map operations performed by the various lane processors disclosed in this invention, the intermediate map results could be reduced by the above existing method. But if the reduction is instead performed by the very lanes on the lane processor chip that executed the map operation, or by other lanes on the same chip, or by dedicated data processing resources added beyond the lanes (hereinafter PPU, Post Processing Unit), the Map-Reduce job can be completed entirely on the same chip, eliminating the above long latency and improving operating efficiency. The following disclosure applies to various lane processors and multi-core processors, including all the lane processors disclosed above in this invention.
Fig. 5 is an embodiment that uses the inter-lane buses and the original computing resources of each lane to perform the reduction on the same lanes that performed the map operation. Here 50, 51, 52, and 53 are four lanes, and 50 is the base (starting) lane. Each lane has an arithmetic unit 54 and a register file 55; register file 55 outputs two operands 56 and 57 to arithmetic unit 54. Each lane also has a controllable switch 58 that connects another output 59 of register file 55 to the inter-lane bus 60. Between the inter-lane buses 60 of adjacent lanes there are controllable switches 61 that can connect the buses 60 together on demand. The inter-lane bus 60 can send its data to an input of register file 55, and can also bypass it to the input 57 of arithmetic unit 54. When performing in-lane operations such as map, switches 58 and 61 are open, so each lane operates independently and no data is exchanged between lanes. When performing inter-lane operations such as reduction, the switch 58 of certain lanes (called source lanes) is closed, connecting the output 59 of the register file 55 of that lane onto the inter-lane bus 60, and the switch 61 between the buses 60 of that lane and the adjacent lane (called the target lane) is also closed, so that bus 60 is switched onto the input 57 of arithmetic unit 54 of the adjacent (target) lane. Input 57 is the output of bypass logic, which can select either the operand provided by register file 55 or bus 60; in this example the inter-lane operation instruction controls this bypass logic to select bus 60. The arithmetic unit 54 of the target lane can then operate on the operand 56 from the target lane's register file and the operand 57 from the source lane's register file, and write the operation result back into the target lane's own register file 55. At this time the switch 58 of the target lane is open, to avoid collision with the data sent from the source lane. Inter-lane operation is carried out in this manner.
Inter-lane operation instructions added to the original instruction set can control the above switches and bypass mechanism to complete the above inter-lane operations. According to the normalized lane number (NLN) of each lane, the same inter-lane instruction performs different operations in different lanes. An inter-lane instruction may take register-type instruction format A of Fig. 2, in which field 21 (an instruction field) of instruction 20 is the opcode, field 22 is the first operand register address, field 23 is the third (result) operand register address, field 24 is the second operand register address, field 25 is an empty field, and field 26 is an auxiliary opcode. In an inter-lane reduction instruction, opcode 21 signifies that an inter-lane reduction is to be performed; the specific arithmetic or logic operation it performs may be determined by opcode 21 alone, or by fields 21 and 26 together. Assume the inter-lane instruction executed in this example is a reduction add. Fields 23 and 24 of the instruction, as in an ordinary three-operand instruction, are the third (result) operand register address and the second operand register address in the register file 55 of the lane performing the operation (the target lane in this example). The first operand register address in field 22 of an inter-lane instruction, however, points into the register file 55 of the source lane. When the instruction decoder in each lane decodes the instruction as a reduction instruction, it further decodes the degree of aggregation in field 25 of the instruction and, combined with the normalized lane number of its own lane, determines which lanes are source lanes and which lanes are target lanes.
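The per-lane role decision just described — each decoder combining its own NLN with the instruction's degree of aggregation — can be sketched as a pure function. This is an illustrative model under the bit patterns stated in the surrounding text, not the patent's decoder:

```python
# Sketch: decide each lane's role from its normalized lane number (NLN)
# and the instruction's degree of aggregation (a power of two >= 2).
# Degree 2: NLN low bit 1 -> source, 0 -> target.
# Degree 4: NLN low bits '10' -> source, '00' -> target; others idle. Etc.
def lane_role(nln, degree):
    half = degree // 2
    low = nln % degree               # the log2(degree) lowest NLN bits
    if low == half:
        return "source"
    if low == 0:
        return "target"
    return "idle"                    # lane does not participate

print([lane_role(n, 2) for n in range(4)])
# ['target', 'source', 'target', 'source']
print([lane_role(n, 4) for n in range(4)])
# ['target', 'idle', 'source', 'idle']
```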
Define here that the NLNs of the lanes increase from left to right. When the degree of aggregation is '2', the reduction instruction specifies that, of two adjacent lanes, the lane whose NLN lowest bit is '1' is the source lane and the lane whose lowest bit is '0' is the target lane. The entry of the register file 55 of the source lane addressed by field 22 of the inter-lane operation instruction is read out and placed, through output 59 of that register file and the closed switch 58 of the source lane, onto the inter-lane bus 60. At the same time the switch 61 between the source lane and the target lane is closed, so that the register file output 59 of the source lane is sent to input 57 of arithmetic unit 54 of the target lane. Meanwhile the switch 61 between the target lane and the lane to its left, and the switch 61 between the source lane and the lane to its right, are both open, so that the register file output 59 of the source lane reaches only its target lane and does not affect other lanes. At the same time, multiple source/target lane pairs execute the same inter-lane instruction in parallel to carry out reductions.
Fig. 6 illustrates the operation of the Fig. 5 embodiment when the degree of aggregation of the inter-lane instruction is '2'. Suppose the normalized lane number of lane 50 is '00'; the normalized lane numbers (NLN) of the corresponding lanes 51, 52, 53 are then '01', '10', '11'. When executing an inter-lane instruction whose degree of aggregation is '2', the instruction decoder in each lane designates the lanes whose NLN lowest bit is '1' as source lanes and the lanes whose NLN lowest bit is '0' as target lanes, and sets the switches 58 and 61 accordingly. The result, as in Fig. 6, is that lane 51 is a source lane: the entry of its register file 55 read out under the control of field 22 of the inter-lane operation instruction is delivered through output 59 to input 57 of arithmetic unit 54 in lane 50, and is added to the register file entry 56 of lane 50 read out under the control of field 24 of the inter-lane operation instruction; the operation result is written back into the entry of register file 55 of lane 50 pointed to by field 23 of the inter-lane operation instruction. Meanwhile, the register file output 59 of lane 53 is likewise sent into lane 52 to be reduced with the operand 56 output by the register file of lane 52, and the result is written into register file 55 of lane 52. This completes the execution of the degree-'2' inter-lane instruction for these four lanes, producing two intermediate reduction results in parallel.
On this basis, an inter-lane reduction instruction whose degree of aggregation is '4' can then be executed. When executing an inter-lane instruction of degree '4', the instruction decoder of each lane designates the lanes whose two lowest NLN bits are '10' as source lanes and the lanes whose two lowest bits are '00' as target lanes; the other lanes do not participate in the operation. The switches 58 of the non-participating lanes are all open. The switches 61 on the path from a source lane to its target lane are closed (the source lane being on the right of the switch and the target lane on the left), while the switches 61 not on the path are open (the source lane being on the left of the switch and the target lane on the right). As shown in Fig. 7, the intermediate reduction result produced by the previous degree-'2' step, stored in the entry of the register file of lane 52 pointed to by field 22 of the inter-lane operation instruction, is sent to lane 50 and further reduced with the intermediate reduction result stored in its register file 55 and pointed to by field 23 of the instruction; the result is stored into the entry of register file 55 of lane 50 pointed to by field 24 of the instruction.
After the two inter-lane instructions of degree '2' and degree '4' have been executed in this way, the map results originally distributed over the four lanes have been reduced to a single reduction result. Likewise, inter-lane instructions of higher degree executed afterwards can reduce the map results, or other types of intermediate execution results, of more lanes. For example, an instruction of degree '8' takes the lanes whose three lowest NLN bits are '100' as source lanes, aggregating their data into the target lanes whose three lowest bits are '000'; an instruction of degree '16' takes the lanes whose four lowest NLN bits are '1000' as source lanes, aggregating their data into the target lanes whose four lowest bits are '0000'. Executing n inter-lane reduction instructions with degrees increasing from '2' can reduce the 2^n intermediate results of 2^n lanes to a single result left in the register file of the base lane, which subsequent instructions can store to memory. The degree of aggregation is only allowed to be 2^n, with n greater than or equal to '1'.
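The full n-step folding above can be sketched end to end: each pass models one inter-lane instruction of degree 2, 4, ..., 2^n, adding each source lane's value into its target lane, until the base lane (NLN 0) holds the total. The sample data is invented:

```python
# Sketch of the log2-step inter-lane reduction: after passes of degree
# 2, 4, ..., 2^n, the 2^n per-lane map results are folded into lane 0.
def interlane_reduce(values):
    vals = list(values)                  # one intermediate result per lane
    degree = 2
    while degree <= len(vals):
        half = degree // 2
        for target in range(0, len(vals), degree):
            source = target + half       # lane with NLN low bits '10...0'
            vals[target] += vals[source] # reduction add into the target lane
        degree *= 2
    return vals[0]                       # final result in the base lane

print(interlane_reduce([3, 1, 4, 1, 5, 9, 2, 6]))  # 31
```

Note the degree doubles each pass, so 2^n lanes need only n inter-lane instructions rather than 2^n - 1 sequential additions.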
In Figs. 5, 6, and 7 the register file 55 of each lane is provided with a separate read port 59 to support inter-lane operation; the function of this read port may also be provided by read port 57, without adding the separate read port 59. An inter-lane instruction may also be a data move, which does not directly drive the arithmetic unit of a lane to operate, but moves the data in the register file of the source lane to the target lane and stores it in its register file. A normal operation instruction is then executed in the target lane to reduce the data already in the target lane's register file with the data moved in from the source lane. Such an inter-lane data move instruction may use instruction format A of Fig. 2; unlike the inter-lane operation instruction above, its opcode 21 does not control the arithmetic unit 54 of the target lane to operate, but controls the register file 55 of the target lane to store the source-lane data arriving on bus 60. Field 24 of the instruction has no effect here, because arithmetic unit 54 now needs no operand; the rest is the same as the inter-lane operation instruction. The lane processor structure is also modified correspondingly: the inter-lane bus 60 is now connected to an input of register file 55. Its operation is as follows: the opcode in field 21 of the inter-lane data move instruction (or fields 21 and 26 acting together) directs each lane to perform a data move operation; the degree of aggregation in field 25 and the NLN together select the source and target lanes and control the switches 58 and 61 to establish the bus connection between source and target lanes; data is read from the source-lane register file entry pointed to by operand address field 22 of the instruction and written into the target-lane register file entry pointed to by result address field 23 of the instruction, completing the inter-lane data move. Bypass logic may select between the inputs of arithmetic unit 54 and register file 55, so that the lane processor supports both inter-lane operation-class instructions and inter-lane data-move-class instructions.
The above inter-lane bus may also serve as a broadcast bus. Referring to Fig. 5, the format of a broadcast instruction is the same as that of the data move instruction above. When each lane decodes the broadcast instruction, the register file address in field 22 of the instruction controls register file 55 of the base lane (NLN '000') to read data, controls the switch 58 of that lane to close so that the data is placed on bus 60, and also closes its switch 61. The remaining lanes, whose NLN is not '000', likewise close their respective switches 61, so that the data reaches all lanes and, according to the register file address in field 24 of the instruction, is written into the respective register files, completing the broadcast delivery of the data.
If dedicated data processing resources (hereinafter post processing unit, Post Processing Unit, PPU) are added to the inter-lane buses outside the lanes, each lane need only perform the map operation while the reduction is handed over to the PPUs, further improving efficiency. The connection between the inter-lane buses and the PPUs may take various topological structures; some examples are given below. Fig. 8 is a tree-shaped bus structure. Every two lanes, such as 80 and 81, share one PPU such as 84, whose operation result is held temporarily in an output register inside the PPU. The two inputs of a first-level (layer) PPU such as 84 or 85 each accept the register file output of one of its two adjacent lanes. A second-level PPU such as 86 accepts the outputs of the first-level PPUs. Each level also has a control register such as 87, through which the level with the smaller number passes control signals to the level one greater in number. These PPUs and buses are controlled by post-processing instructions.
Post-processing instructions are encoded in the program like other conventional CPU instructions and are supplied to each lane by the instruction memory, the instruction cache, or the instruction read buffer IRB. A post-processing instruction does not affect the conventional processor operation in the lane; it specifically controls the post-processing operation performed on the results of the conventional processor operation, and its format may be format A of Fig. 2. When each lane reaches a post-processing instruction, the instruction decoder in that lane decodes it, judges it to be a post-processing instruction from the opcode in field 21, and accordingly, per the register file address in field 22, reads data from its own register file and sends it over the bus to the inputs of the first-level PPUs such as 84 and 85. The post-processing instruction itself is simultaneously directed to the first-level PPUs. A PPU has an opcode decoder, which determines from the encoding in field 21 and/or field 26 of the post-processing instruction which operation the PPU should perform, such as addition or subtraction. The outputs of PPUs 84 and 85 are sent to the inputs of PPU 86 in the following clock cycle, while the post-processing instruction, after being held in register 87, is also directed to PPU 86 to control its operation. Data and instruction thus advance level by level through their respective pipelines until the top-level PPU. The reduction result output by the top-level PPU is written back, under the control of the register file result address in field 24 of the post-processing instruction delivered through the pipeline, into the register file of one lane. That lane may be a default such as the base lane, or may be specified by fields 25, 26, etc. of the post-processing instruction. One optimization is to perform the decoding that controls the PPU operations in the instruction decoders of the lanes, and pass only the decoded control signals through the pipeline to control the operation of each PPU level.
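The tree pipeline above halves the number of partial results at each level, one level per clock cycle. A hedged sketch (invented names; the combining operation, here addition, stands in for whatever opcode fields 21/26 select):

```python
# Sketch of the tree-shaped PPU reduction: level-1 PPUs combine pairs of
# adjacent lane outputs, level-2 PPUs combine level-1 outputs, and so on,
# advancing one pipeline level per clock cycle.
def tree_ppu_reduce(lane_outputs, op=lambda a, b: a + b):
    level = list(lane_outputs)
    cycles = 0
    while len(level) > 1:                 # one PPU level per cycle
        level = [op(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
        cycles += 1
    return level[0], cycles               # top-level PPU output, latency

result, latency = tree_ppu_reduce([1, 2, 3, 4, 5, 6, 7, 8])
print(result, latency)   # 36 3
```

Eight lanes thus take log2(8) = 3 PPU levels, and because the levels are pipelined, a new post-processing instruction could enter the tree every cycle.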
A post-processing instruction can also direct the reduction result to be written directly back to memory. In this case the post-processing instruction takes format B of Fig. 2. When each lane reaches such a post-processing instruction, the instruction decoder in that lane decodes it, judges it to be a post-processing instruction from the opcode in field 21, and accordingly, per the register file address in field 23, reads data from its own register file and sends it over the bus to the inputs of the first-level PPUs such as 84 and 85. At the same time, the encoding in field 21 and/or field 26 is decoded to determine which operation the PPUs should perform, and the decoded control signals control the operation of the first-level PPUs. Thereafter the data being reduced and the reduction controls are passed level by level along the tree pipeline, as in the example above. Also at decode time, each lane uses the register file address in field 22 of the instruction to read a base address from the register file and adds it to the offset in field 28 of the instruction to produce the memory address. This address, together with the data being reduced, is passed level by level over the tree-shaped bus up to the top-level PPU. After the top-level PPU there is store logic, which stores the final reduction result to memory according to this address. The generation of the memory address may share the original adder in the arithmetic unit of a lane, or a dedicated adder may be added so that the post-processing instruction can execute in parallel with other instructions.
Fig. 9 shows a transmission bus structure, in which each of the lanes 90, 91, 92, 93 has a corresponding PPU, e.g. 94, and each PPU has an output register that temporarily holds its output. The output bus 95 of each PPU delivers the data held in its output register to one input of the PPU of the lane to its left; the other input 96 of that PPU comes from register 97 in its own lane. When the lane processors in this example execute a post-processing instruction, the instruction decoder in each lane decodes it, determines from the operation code in field 21 that it is a post-processing instruction, and accordingly reads data from the respective register file according to the register file address in field 23 and sends it to register 97 of the respective lane for temporary storage. At the same time, the instruction decoder of the lane with the highest NLN (lane 93 in this example) also decodes the type of the post-operation and sends it, together with the destination register file address, or with the store address computed as in the preceding example, to register 98 for temporary storage. In the following clock cycle, the PPU 94 of this lane performs a no-op on the data from its input 96 (it may be stipulated that the PPU in the lane with the highest NLN always performs a no-op); the actual effect is that the data held in register 97 of lane 93 is stored into the output register of PPU 94 of that lane. At the same time, the post-operation control signal and the store address held in register 98 are passed through OR logic 99 and stored into the register 98 of the next lane. Thereafter, whenever the post-operation control signal reaches a lane, the PPU of that lane performs the aggregation operation on the data at its two inputs 95 and 96 and stores the result into its output register. Thus, when the post-operation control signal reaches the leftmost (base) lane, the output 85 of that lane's PPU is the aggregation result of the intermediate results produced by the mapping or other operations of all lanes, from the lane with the highest NLN down to the lane whose NLN is '0'. Thereafter, according to the type of the post-operation instruction, the result is either written back to the register file or written directly back to memory, as in the preceding example.
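The right-to-left chained aggregation over the transmission bus of Fig. 9 can be modeled in ordinary software: the highest-NLN lane's PPU effectively adds its value to '0', and each PPU to the left aggregates the incoming partial result with its own lane's held value. A minimal behavioral sketch, assuming addition as the aggregation operation (this models only the data flow, not the hardware):

```python
def transmission_bus_aggregate(lane_values):
    """Model of the Fig. 9 chain: partial results flow right-to-left,
    one lane per step. lane_values[-1] is the highest-NLN lane."""
    acc = 0                          # no-op: highest-NLN lane aggregates with '0'
    for v in reversed(lane_values):  # output register passed to the lane on the left
        acc = acc + v                # aggregation in each lane's PPU
    return acc                       # result available at the base lane's PPU output
```

With four lanes the chain takes four steps, matching the four post-operation cycles described later for the matrix example.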
The bus connection topology for post-operations may also be a mixture of the above structures. For example, the lane processors on a lane processor chip may be grouped, e.g. 16 groups of 32 lanes each. A post-aggregation processor with a tree-topology bus may be used within each group, while a post-aggregation processor with a transmission-bus topology is used between groups. The buses may be hard-wired; an existing GPU, for instance, is essentially a group of lanes executing the same instruction and may be connected with hard-wired buses. By contrast, the lane processors of the present invention, which use a track table and an instruction read buffer (IRB), can be freely grouped, and different lanes may execute different instructions; this requires connection by a reconfigurable bus that follows the lane grouping.
Figure 10 shows an embodiment of a reconfigurable tree-shaped bus. In the figure, each PPU has an independent path for writing its operation result back to the register file or memory, and each branch of the tree-shaped bus can be switched off independently. Figure 10A shows the reconfigurable tree-shaped bus configured to perform an aggregation operation on the results of all 8 lanes. Here, the paths from each level of PPUs to the next level are all switched on, and the paths to memory from all PPUs other than PPU 106 are switched off. Thus, after 3 levels of aggregation, the aggregation result is output from PPU 106 and stored into memory.
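The 3-level aggregation of Figure 10A can be sketched as a pairwise tree reduction: each pass models one PPU level, halving the number of partial results. A minimal sketch, assuming addition as the aggregation operation and a power-of-two lane count:

```python
def tree_aggregate(lane_values):
    """Pairwise tree reduction: 8 lanes -> 4 -> 2 -> 1 across 3 PPU levels."""
    level = list(lane_values)
    while len(level) > 1:  # each pass corresponds to one level of PPUs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]        # output of the top-level PPU (106 in Figure 10A)
```

The number of passes grows only logarithmically with the lane count, which is why the tree variant needs fewer aggregation cycles than the transmission-bus chain.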
Figure 10B shows the configuration of the reconfigurable tree-shaped bus when aggregation operations are performed separately on the results of 2 and of 6 of the 8 lanes. The difference from the embodiment of Figure 10A is that the path from PPU 100 to PPU 104 is switched off, while the path from PPU 100 to memory is switched on. Thus, after 1 level of aggregation, the aggregation result of the 2 lanes is output from PPU 100 and stored to memory, while the other 6 lanes pass through 3 levels of aggregation and their aggregation result is output from PPU 106 and stored to memory.
Figure 10 C shows that one carries out weighing of converging operationJu Hecaozuo respectively to two groups of each results of 4 in 8 tracks
The configuring condition of Broussonetia papyrifera shape bus.This example is with the difference of Figure 10 A embodiment, PPU 104,105
The path being sent to PPU 106 is disconnected, and the path that PPU 104,105 is sent to memorizer is switched on.So,
The polymerization result in first 4 track stores to memorizer from PPU 104 output, other 4 tracks
Polymerization result is from PPU 105 output storage to memorizer.
Figure 10 D shows that one carries out weighing of converging operationJu Hecaozuo respectively to 4 groups of each results of 2 in 8 tracks
The configuring condition of Broussonetia papyrifera shape bus.This example is with the difference of Figure 10 A embodiment, PPU 100,101,
102,103 paths being sent to PPU 104,105 respectively are disconnected, and PPU 100,101,102,103
The path being sent to memorizer is switched on.So, through 1 grade be polymerized after, each polymerization result respectively from PPU 100,
101,102, and 103 output storages are in memorizer.In above-mentioned Figure 10, each post processing path connects on-off
Drive and controlled by track allotter.Between each track in following embodiment, path is also controlled by track allotter,
Illustrate the most one by one.
Figure 11 shows an embodiment of a reconfigurable transmission bus. In the figure, each PPU has an independent path for writing its operation result back to the register file or memory, and each segment of the transmission bus can be switched off independently. As shown in Figure 11, the bus partitions 5 lanes into a 2-lane and a 3-lane aggregation operation. Here, the transmission bus from PPU 113 to PPU 112 is switched off and the other transmission buses are all switched on, so that the results of lanes 118 and 119 are aggregated in turn by PPUs 114 and 113, their aggregation result being output from PPU 113 and stored to memory; meanwhile the results of lanes 115, 116 and 117 are aggregated in turn by PPUs 112, 111 and 110, their aggregation result being output from PPU 110 and stored to memory.
Figure 12 is an embodiment of a reconfigurable aggregation degree. Figure 12A shows its concrete structure. It is assumed here that the minimum aggregation degree is '2'; then for the four lanes 120, 121, 122 and 123, two PPUs 124 and 125 are needed in total. In the present embodiment, the data paths from lane to PPU, from PPU to other PPUs, and from a PPU back to itself all have switches that can be turned on or off.
Referring to Figure 12B, an embodiment of the configuration when the aggregation degree is '2'. Here, on the basis of Figure 12A, switches 126, 127, 128 and 129 are all switched on, so that the result of each lane can be sent to the corresponding PPU; switches 1211, 1221 and 1231 are all switched off, so that no data is passed between PPUs; switches 1241 and 1251 are also switched off. This configuration is equivalent to each PPU receiving only the results of its two corresponding lanes, performing an aggregation operation on those two inputs, and outputting the aggregation result of its corresponding two lanes.
Next, referring to Figure 12C, an embodiment of the configuration when the aggregation degree is '4'. Here, switches 126, 127, 128 and 129 are all switched off, so that the PPUs no longer receive new values from the lanes; switches 1221, 1241 and 1251 are switched on and switches 1211 and 1231 are switched off, so that each PPU receives the two-lane aggregation result of its own previous aggregation operation together with the two-lane aggregation result of the previous aggregation operation output by an adjacent PPU (the PPU to its right in this example). This configuration is equivalent to performing a further aggregation operation on two two-lane aggregation results, producing a four-lane aggregation result output by PPU 124.
Next, referring to Figure 12D, an embodiment of the configuration when the aggregation degree is '8'. Illustrating an aggregation degree of '8' requires 8 lanes, but for ease of display only 5 lanes are shown in the figure. The 4 lanes and corresponding PPUs on the left of the dashed line are identical to Figure 12A; the right side of the dashed line has the same structure, but only its first lane is shown. The configuration for an aggregation degree of '8' is still illustrated using the structure of Figure 12A as an example. On the left of the dashed line, switches 126, 127, 128 and 129 are all switched off, so that the PPUs no longer receive new values from the lanes; switches 1221, 1231, 1241 and 1251 are switched on and switch 1211 is switched off. All switches on the right of the dashed line are configured in the same way. Thus PPU 124, besides receiving the four-lane aggregation result of its own previous aggregation operation, also receives the corresponding four-lane aggregation result of the previous aggregation sent from the right of the dashed line, and performs a further aggregation operation, producing an eight-lane aggregation result output by PPU 124. According to the technical solution of the present invention, configurations for aggregation operations over more lanes can be derived by analogy with the above method and are not repeated here.
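The reconfigurable aggregation degree of Figure 12 amounts to grouping the lane results and aggregating each group independently. A minimal behavioral sketch, assuming addition as the aggregation operation:

```python
def aggregate_by_degree(lane_values, degree):
    """Group lane results and aggregate each group, as in Figures 12B-12D:
    degree 2 yields one result per PPU; degree len(lane_values) yields one total."""
    assert len(lane_values) % degree == 0, "lane count must be a multiple of degree"
    return [sum(lane_values[i:i + degree])
            for i in range(0, len(lane_values), degree)]
```

For 8 lanes, degree '2' produces four partial results, degree '4' two, and degree '8' a single total, mirroring the switch configurations of Figures 12B through 12D.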
The actual operation of the multi-lane system of the present invention is illustrated below, taking matrix multiplication as an example. Both an unimproved lane, as in a common GPU, and the improved lane shown in Figure 4 are suitable for this embodiment. In the present embodiment, matrices A and B are both 4 rows by 4 columns, and the result matrix of their multiplication is likewise 4 rows by 4 columns. Multiply-add operations on the first row of matrix A (a00, a01, a02, a03) and the first column of matrix B (b00, b10, b20, b30) yield the first element (c00) of the result matrix C; the concrete computation is: c00 = a00*b00 + a01*b10 + a02*b20 + a03*b30. Executed as serial instructions, this requires at least 4 multiplications and 3 additions, 7 operations in total. In addition, since an existing processor system generally needs memory to hold the intermediate results of the computation, each multiplication/addition operation generally requires 2 extra data reads and 1 data store (assuming, in the best case, that no memory miss occurs, 4 cycles in total); the time needed to compute one matrix element can therefore be estimated as 7*4 = 28 cycles.
In a multi-lane system, the 4 multiplications and their corresponding data access operations can clearly be assigned to 4 lanes for parallel execution, so that only 1*4 = 4 cycles are needed to complete the multiplications; afterwards, 2 of the 3 additions and their corresponding data access operations are assigned to 2 lanes for parallel execution (1*4 = 4 cycles in total), and then the last addition and its corresponding data access operation are assigned to 1 lane (1*4 = 4 cycles in total). One matrix element can thus be computed in 12 cycles altogether, an improvement over the 28 cycles of serial execution.
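The cycle estimates above can be checked arithmetically; a small bookkeeping sketch under the text's own assumption of 4 cycles per operation (1 execute, 2 data reads, 1 data store):

```python
CYCLES_PER_OP = 4  # 1 execute + 2 data reads + 1 data store, no memory misses

# Serial: 4 multiplications + 3 additions = 7 operations back to back.
serial_cycles = 7 * CYCLES_PER_OP

# 4-lane parallel: 4 multiplies in parallel, then 2 additions in parallel,
# then the final addition -- three sequential parallel steps.
parallel_cycles = 3 * CYCLES_PER_OP
```

This reproduces the 28-cycle serial and 12-cycle multi-lane estimates quoted in the text.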
Further, the post-processor of the present invention can be used to perform the aggregation operation (i.e. the additions) on the results of the multiplications. Here, since the post-processors can receive the execution results of the corresponding lanes directly, the data accesses corresponding to storing the multiplication results and performing the additions can be saved; only the final aggregation result needs to be stored into memory.
Specifically, the data engine in each lane first computes the data addresses according to the corresponding lane number, the data start address, and the address stride between lanes. Here, the start address of matrix A is the address of data a00, with a corresponding address stride of '1', and the start address of matrix B is the address of data b00, with a corresponding address stride of '4'. Thus, in the first lane the addresses of the multiplicand and multiplier are the addresses of a00 and b00 respectively; in the second lane they are the address of a00 plus '1' and the address of b00 plus '4', i.e. the addresses of a01 and b10; similarly, in the third and fourth lanes the addresses of the multiplicand and multiplier are the addresses of a02 and b20, and of a03 and b30, respectively.
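The per-lane address computation described above is simply start address plus lane number times stride. A minimal sketch (the base addresses used in the test are illustrative placeholders, not values from the patent):

```python
def lane_addresses(lane, a_base, b_base, a_stride=1, b_stride=4):
    """Per-lane operand addresses: start address plus lane number times stride.
    Stride '1' walks a row of A; stride '4' walks a column of 4x4 matrix B."""
    return a_base + lane * a_stride, b_base + lane * b_stride
```

With hypothetical base addresses 0 for a00 and 100 for b00, lane 0 reads a00/b00, lane 1 reads a01/b10, lane 2 reads a02/b20 and lane 3 reads a03/b30.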
After the data engines complete the address computation, all four lanes read, twice in succession in SIMD fashion, the multiplicand and the multiplier corresponding to each lane into registers in parallel. The execution modules of all lanes then execute the multiply instruction simultaneously, producing the four products (a00*b00), (a01*b10), (a02*b20), (a03*b30) for the subsequent aggregation processing. Completing the above multiplications takes 3 cycles in total (2 cycles for the data reads and 1 cycle for the multiplication).
Assume first that the multi-lane system uses the transmission-bus post-processor described in Figure 9 to perform the aggregation operation; the post-processors of the four lanes then execute the same addition instruction. That is, the output of the 4th lane is added to '0' in its corresponding post-processor; the result is sent to the post-processor corresponding to the 3rd lane and added to the output of the 3rd lane; that result is in turn sent to the post-processor corresponding to the 2nd lane, and so on. Thus, after 4 cycles of post-operation, the final result is output from the first lane and stored into memory, completing the computation of c00 and its storage to memory. By this method, computing one matrix element (e.g. c00) takes 6 cycles in total (1 cycle of multiplication, 4 cycles of accumulation, and 1 cycle to store the data); performance improves considerably over both serial execution (28 cycles) and the traditional multi-lane system (12 cycles).
Assume instead that the multi-lane system uses the tree-shaped bus post-processor described in Figure 8 to perform the aggregation operation; the post-processors then accumulate the results output by the four lanes in two successive pipelined aggregation steps. That is, in one cycle the outputs of lanes 1 and 2, and of lanes 3 and 4, are summed separately and simultaneously; in the next cycle the two sums are added together, and the final result is stored into memory, completing the computation of c00 and its storage to memory. By this method, computing one matrix element (e.g. c00) takes 4 cycles in total (1 cycle of multiplication, 2 cycles of addition, and 1 cycle to store the data), a further improvement over the above three methods (28 cycles, 12 cycles and 6 cycles respectively).
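The per-element cycle counts quoted for the four approaches can be collected in one place (a bookkeeping sketch of the text's own numbers, with the data reads assumed to be hidden by the data engines in the two post-processor variants):

```python
def element_cycles(method):
    """Cycles to compute one element of C, per the counts quoted in the text."""
    if method == "serial":
        return 7 * 4        # 7 operations, 4 cycles each
    if method == "multilane":
        return 3 * 4        # 3 sequential parallel steps, 4 cycles each
    if method == "transmission_bus":
        return 1 + 4 + 1    # multiply + 4-step chained accumulation + store
    if method == "tree_bus":
        return 1 + 2 + 1    # multiply + 2 tree levels of addition + store
    raise ValueError(method)
```

The comparison makes the trend explicit: each added post-processing structure removes more of the intermediate data traffic from the critical path.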
According to the technical solution of the present invention, a multi-lane system with the described mapping/aggregation structure can also parallelize other operations, such as vector or matrix addition and subtraction, dot products, etc.; the concrete operation process is similar to the preceding examples and is not repeated here.
The above embodiment gives a parallel implementation of one matrix element's multiplication using the multi-lane system of the present invention. The instructions of that embodiment could then be executed in a loop by prior-art methods to realize the complete matrix multiplication. According to the technical solution of the present invention, the multi-lane system of the preceding example can instead be improved by adding one or more loop (Loop) control modules and letting the loop control modules control the functions of the data engines, so that the loop code no longer needs data access instructions, thereby improving loop efficiency.
Referring to Figure 13A, an embodiment of the multi-lane system of the present invention with improved loop efficiency. For convenience of description, only one lane 140 is shown. In the described system, the scanner 408, instruction memory 406, data memory 146, track table 410, lane allocator 188 and its lane group controllers 189, several loop controllers 130, and the store data engine 170 are shared by all lanes. Each lane 140 has its own tracker 414 and execution unit 147 (containing register file 148), as well as several data engines 150.
The tracker 414 reads instructions from the instruction memory 406 for simultaneous decoding by the instruction decoder 149 in each lane 140, controlling the operation of each lane. The above instructions are also sent to the lane allocator 188, whose decoder decodes them and allocates lane resources to the programs that request resources, according to the requests in the instructions and the request priorities, assigning a lane group controller 189 to manage the allocated lane resources as a group so as to satisfy the program's requirements. In this embodiment, a special lane request instruction requests lane resources from the processor system; a special data engine configuration instruction requests, in each lane, one data engine 150 for each data access (Load or Store) and configures the data access stride; and a special loop set instruction requests one loop controller 130 for each program loop, sets the loop count in that loop controller, and associates the loop controller with the data engines corresponding to all data accesses in that program loop. All these requests are granted by the lane allocator 188 according to the resources in its available resource pool at the time and the priority of the request. Thereafter, a special loop instruction in the program designates a loop controller 130 and determines the program flow (execute the loop or exit the loop) according to the loop count therein. While the loop executes, the data engines 150 etc. associated with it access the data memory with the preset data access stride; when the loop is exited, the loop controller 130 and data engines 150 etc. are restored to the state set when they were configured, ready for the next loop. Since all lanes execute the same instructions, all lanes are at the same loop level at any given moment; and although the data access instructions executed by the lanes are identical, the data address for the same instruction in the corresponding data engine may differ from lane to lane. Thus one loop controller, governing the data engines it controls in each lane, can make the same data access instruction executed across multiple lanes access a different address of the data memory in each lane.
Here, the role of the data engine 150 is, before a lane needs the data, to compute the data address in advance according to the data address stride (per iteration) and fetch the data from the data memory 146 in advance, so that a data engine set instruction executed only once replaces the data access instructions (LD or ST) executed many times in loop code, reducing the number of instructions to be executed and improving program efficiency. The loop controller 130 provides loop information to the data engines, so that a data engine 150 can, across different iterations, use the data address stride corresponding to the loop to automatically compute the data address of the following iteration and access the data memory in advance to obtain the data. The lane allocator 188 and lane group controller 189 allocate lanes according to the number of lanes the program requires and the number of lanes currently available; and, when the available number of lanes is less than the required number, they divide the program into multiple rounds of execution so as to realize the function of the complete program.
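Dividing the program into rounds when fewer lanes are available than requested can be sketched as follows (a hypothetical helper illustrating the scheduling arithmetic, not the patent's hardware):

```python
def plan_rounds(requested_lanes, available_lanes):
    """Temporal-loop plan: how many lanes of work run in each round
    when the request exceeds the available lane count."""
    rounds = []
    remaining = requested_lanes
    while remaining > 0:
        rounds.append(min(available_lanes, remaining))  # lanes used this round
        remaining -= available_lanes
    return rounds
```

For example, a request for 10 lanes on a 4-lane machine runs as three rounds of 4, 4 and 2 lanes; a request that fits runs in a single round.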
Referring to Figure 13B, another embodiment of the multi-lane system of the present invention with improved loop efficiency. For convenience of description, only one lane is shown. In the present embodiment the instruction memory 406, track table 410, data memory 146, lane allocator 188 and its lane group controllers 189 are identical in structure to those above; several loop controllers 130 and the data engine 170 are shared by all lanes. The tracker 141, execution unit 147 (containing register file 148) and data engines 150, 160 belong to a lane and are used by that lane alone. In the following example the lane processor performs SIMD operations, so in actual operation the tracker 414 of the control lane alone may read the contents of the track table 410 and determine the program flow according to the loop decision, controlling the program execution of multiple lanes. Within a lane, the several data engines it contains may be of the same structure or of different structures. For example, Figure 13B shows data engines of three different structures, in which data engines 150 and 160 serve data read instructions and data engine 170 serves data store instructions. A data engine of structure 150 or 160 may be used for reading data.
In the present embodiment, configuration (e.g. configuring switches 180, 181 and 182 respectively) makes it possible to associate the loop controller with some of the data engines, so that program loop control governs the accesses to the data memory. This is illustrated below with the matrix multiplication example of Figure 14 and the corresponding instruction sequence of Figure 15. Referring to Figure 14A, matrices M and N are four rows by four columns, and the result of their multiplication is matrix P; the elements of each matrix are denoted by the hexadecimal digits 0-F. This example uses 4 lanes, so, as in the earlier embodiments, one element of P can be computed by one parallel mapping and aggregation operation. Accordingly, 16 such operations in total (as Figure 14B shows) complete the matrix multiplication (the computed results being P0~PF). According to the rules of matrix multiplication, these 16 operations can be organized as a two-level loop. The inner loop completes the computation of the four elements of one row (i.e. four consecutive columns in Figure 14B, such as P0~P3, P4~P7, P8~PB, PC~PF), while the outer loop completes the computation of all four rows (the four non-contiguous blocks in Figure 14B). Therefore, for all participating lanes, the two-level loop needs 2 loop controllers in total. In addition, the operation carried out by each lane uses 2 input data in total (i.e. one element each from matrices M and N) and produces one output data (i.e. one element of matrix P). Each lane therefore needs two load data engines 150, corresponding to the M and N matrices, and the post-aggregation processor needs a store data engine 170 corresponding to the P matrix.
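The two-level loop structure described above can be sketched as ordinary software, with the per-element list comprehension standing in for the 4-lane mapping and the sum for the aggregation (a behavioral sketch only; the hardware performs the mapping and aggregation in parallel):

```python
def matmul4(M, N):
    """4x4 matrix multiply as 16 map+aggregate operations in a two-level loop."""
    P = [[0] * 4 for _ in range(4)]
    for row in range(4):          # outer loop: the four rows of P (one loop controller)
        for col in range(4):      # inner loop: four elements per row (second controller)
            products = [M[row][k] * N[k][col] for k in range(4)]  # 4-lane mapping
            P[row][col] = sum(products)                           # aggregation
    return P
```

Each of the 16 inner-loop iterations corresponds to one parallel mapping-and-aggregation operation of Figure 14B.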
To realize the described functions, the multi-lane system of the present embodiment extends the instruction set of the preceding example. The extension instructions are shown in Figure 15: the parallelism set instruction (SETWDTH), the normalized lane number set instruction (SETNLN), the data engine set instruction (SETDE), the loop set instruction (SETLOOP), the loop instruction (LOOP) and the space-time loop decision instruction (LOOPTS). It should be noted that Figure 15 merely gives an example of one loop program running on the system of the present invention; for those skilled in the art, changes to the instruction formats and replacements, adjustments and improvements of the instructions in this program should all fall within the protection scope of the claims of the present invention.
In the present embodiment, the program that performs the above matrix multiplication is shown in Figure 15. The format of each extended instruction is either format A of Fig. 2, comprising operation code (OP) 21, source operand one (Source 1) 22, destination operand (Dest) 23, source operand two (Source 2) 24 and auxiliary fields (AUX) 25, 26; or format B of Fig. 2, comprising operation code (OP) 21, source operand one (Source 1) 22, base address register address 23 and offset 28. Decoding the operation code 21 of an extended instruction reveals which operation the instruction performs.
In the present embodiment, when the program starts, lane No. 16 executes ordinary instructions that do not need parallel execution. Execution then reaches the first instruction in Figure 15, whose mnemonic is SETWDTH. This is a parallelism set instruction in format B; it is the means by which a software program explicitly requests hardware lane resources from the multi-lane processor system, and it serves to set up the lane allocator 188 and a lane group controller 189. The offset 28 in the instruction holds the number of lanes the program requests (e.g. '4', meaning 4 lanes are needed). It suffices for the source operand field 22 to hold the request number (Request Number); the base address register field need not be used. Another kind of parallelism set instruction may also be defined, which adds the base address in the register pointed to by field 22 of format B to the offset in field 28 and uses the sum as an address to fetch the required number of lanes from the cache or memory.
Returning to Figure 13B: in this example, when this instruction is executed, the lane request number and the requested number of lanes 196 are sent to the lane allocator 188. The lane allocator 188 allocates from the lanes available at the time to satisfy the needed number of lanes; when the available number of lanes is less than the required number, it can also make the lane group controller 189 execute the program instructions in multiple rounds to realize the function of the complete program. In this example the lane allocator 188 moves the available lanes No. 17, 18 and 19 from its available resource pool into the group in 188 that originally recorded the lane occupied by lane 16, names this lane group with the above request Q, assigns one of the multiple lane group controllers 189 to this group, and stores request Q into the request register in that 189, so that subsequent instructions can control this lane group controller 189 through request Q, and through 189 control the operation of lanes No. 16, 17, 18 and 19 of group Q. The number of lanes granted to request Q by 188 is stored in register 191 of the group-Q controller 189; this newly assigned 189 is thereby associated with this group of lanes, placing each lane of group Q under the management of the group-Q 189.
In this example, the lane group controller 189 consists of registers 191 and 194, subtractor 192, selector 193 and logic 195. Register 191 holds the maximum number of lanes currently available for use, and its output is sent to one input of subtractor 192. One input 196 of selector 193 comes from the requested number of lanes contained in field 28 of the parallelism set instruction; when the parallelism set instruction is executed, selector 193 selects this requested number of lanes and sends it to the other input of subtractor 192. Subtractor 192 then subtracts the available number of lanes in register 191 from the requested number of lanes, and the result (i.e. the number of requested lanes not yet granted) is written into register 194. The value of register 194 is in turn sent to the other input of selector 193 and to logic 195. Logic 195 receives the outputs of register 194 and register 191 and produces 3 outputs: 197, 198 and 199. Because the available number of lanes may be less than the requested number, the available lanes need to be reused to fulfill the program's requirement; in the present embodiment the NLN of the base lane is therefore not necessarily '0', but is set by the value of the control lane number 197. On the last iteration, the available number of lanes may exceed the requested number; the reused lane count 198 then makes the lane group controlled in the lane allocator 188 return the excess lane numbers to the resource pool, using only the lanes still needed to complete the instruction loop. The loop decision 199 is used to decide whether to execute the loop.
The meaning of the loop here is as follows. In the prior art, to improve programming efficiency and code density, a repeatedly executed section of code is expressed as a loop with a backward branch; this unfolds the code in time and cannot reduce the time the program's execution requires. What the present embodiment discloses, by contrast, is a special parallelism set instruction SETWDTH that explicitly requests lane resources from the processor system, unfolding the program in space (over multiple lanes); this can save execution time and typically replaces the outermost loop of a prior-art program. However, the number of lanes available in the processor system may be less than the number the program requests. The present embodiment handles this with a space-time two-dimensional unfolding: the instruction segment is first unfolded in space over the available lanes and executed, and the part for which spatial unfolding is insufficient (i.e. the part of the requested lanes beyond the available lanes) is unfolded in time with a loop. This loop is not expressed in the program; it is a loop that the processor system decides upon, trading time for space according to the spatial resources the program requests and those available at the time. It is referred to herein as a temporal loop, to distinguish it from the loops expressed in the program. In the present embodiment the lane group controller 189 controls the execution of the temporal loop.
When the output of register 194 is greater than '0' (meaning the available number of lanes does not satisfy the requested number), the value of 197 is the output of 194, the value of 198 is the output of 191, and the value of 199 is '1', indicating that a subsequent temporal loop needs to be executed. When the output of register 194 equals '0' (meaning the available number of lanes exactly satisfies the requested number), the value of 197 is the output of 194, the value of 198 is the output of 191, and the value of 199 is '0', indicating that no subsequent temporal loop is needed. When the output of register 194 is less than '0' (meaning the available number of lanes exceeds the requested number), the value of 197 is '0', the value of 198 is the sum of the outputs of registers 191 and 194, and the value of 199 is '0', indicating that no subsequent temporal loop needs to be executed. As stated above, in the present embodiment field 28 of the SETWDTH instruction is '4', the lane allocator 188 has allocated 4 lanes for request Q, and the difference in register 194 is '0'. Therefore the value of 197 is '0', the value of 198 is '4', and the value of 199 is '0'.
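The three cases of logic 195 can be summarized as a small function (a behavioral sketch, not the hardware; `remaining` models register 194, i.e. requested minus granted lanes, and `available` models register 191):

```python
def lane_group_logic(remaining, available):
    """Outputs (197, 198, 199): control lane number, reused lane count,
    and the temporal-loop decision, per the three cases in the text."""
    if remaining > 0:                       # grant falls short of the request
        return remaining, available, 1      # 199 = 1: run a temporal loop
    if remaining == 0:                      # grant exactly satisfies the request
        return remaining, available, 0
    return 0, available + remaining, 0      # surplus lanes returned to the pool
```

For the SETWDTH example in the text (4 lanes requested, 4 granted, remaining 0) this yields 197 = '0', 198 = '4' and 199 = '0'.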
The lane allocator 188 makes the newly allocated lanes 17, 18 and 19 of lane group Q also accept the instructions that the tracker of lane 16 reads from the instruction buffer 406. As described above, the lane allocator 188 also sets the value of the base register of lane No. 16 within the group, making it the base lane (Starting Lane), and sets the values of the base registers of the remaining lanes 17, 18 and 19 within the group as non-base lanes.
Returning to Figure 15, the tracker of lane 16 controls lanes 16, 17, 18, 19 to start executing these subsequent instructions in parallel. The next one is a broadcast load instruction BCLD; this instruction adds the offset in the instruction to the original base address in register R18 of the reference lane to form a data address, fetches a new base address from memory or cache at that address, and stores it into register R28 of every lane in the group. The three instructions that follow are ordinary load instructions LD; they add appropriate offsets to the base address in R28 of each lane to form the storage addresses of the N0 element, the M0 element and the P0 element in Figure 14A, i.e. the starting data addresses of multiplicand matrix N, multiplier matrix M and result matrix P, and store them into registers R1, R2 and R3 of each lane respectively. The next instruction, with mnemonic MOVLN, is the aforementioned lane move instruction, which moves the lane number of each lane from its lane register into register R11 of the register file. The instruction after that, with mnemonic SUBSCH, subtracts, in the reference lane, the control lane number 197 output by the lane group controller 189 associated with lane group Q (its value here being '0') from the lane number in register R11 of the reference lane's register file, and stores the difference into R12 (in this example the value in R12 equals the value in R11 of the reference lane). The next instruction, with mnemonic BCSUB, is the aforementioned broadcast subtract instruction; as described above it performs the lane normalization operation, subtracting the value in register R12 of the reference lane from the lane number in register R11 of each lane and storing the result back into R11. Next, the SETNLN instruction stores the value in register R11 of each lane into the NLN registers of all the data engines attached to its own lane. After these operations the normalized lane numbers stored in each data engine of each lane are arranged in increasing order starting from '0' at the reference lane, i.e. the NLN of lanes 16, 17, 18, 19 is 0, 1, 2, 3 respectively.
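The MOVLN/SUBSCH/BCSUB/SETNLN sequence above can be sketched as follows; the function name is illustrative, and only the arithmetic (R12 = reference lane number minus 197, then NLN = lane number minus R12 in every lane) comes from the description.

```python
def normalized_lane_numbers(physical_lanes, ref_lane, ctrl_197):
    """Sketch of lane normalization: MOVLN puts each physical lane number
    in R11; SUBSCH in the reference lane computes R12 = R11 - 197; BCSUB
    then computes NLN = R11 - R12 in every lane."""
    r12 = ref_lane - ctrl_197                      # SUBSCH in reference lane
    return [lane - r12 for lane in physical_lanes] # BCSUB per lane

# Full allocation of the embodiment: lanes 16..19, 197 = '0'
print(normalized_lane_numbers([16, 17, 18, 19], 16, 0))  # [0, 1, 2, 3]
```

The same arithmetic produces the NLN values 2 and 3 in the later two-lane example, where 197 is '2'.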
A data engine set instruction (SETDE) contains in its source operand one field 22 the number of the register holding the data starting address (e.g. R1, R2, R3); target operand field 23 holds the source/destination register number for the data when the data engine accesses data, this register receiving the fetched data on a data read, or supplying the data to be sent to memory on a data store; source operand two field 24 holds the number of the data engine this instruction configures (e.g. data engine DE0, DE1, DE31); the first auxiliary field 25 contains the data address change step (stride) on each data access; and the second auxiliary field 26 contains the address gap between adjacent lanes (Lane Diff) for this instruction. When the instruction decoder in each lane decodes a data engine set instruction, it allocates one data engine attached to that lane and associates this engine with the engine number in field 24 of the instruction (e.g. by storing this number into a numbering register in the engine), so that subsequent instructions can control the operation of this data engine through this number. At the same time the data engine stores the information in the other fields of the set instruction into the corresponding registers in the engine. Taking data engine 150 as an example, the register address in field 23 of the instruction is stored into register 159 in 150 to indicate the destination register for the data fetched by the data engine; the stride in field 25 is stored into register 155; the inter-lane address gap (Lane Diff) in field 26 is stored into register 157; and the base address read from the register addressed by field 22 is added to the product of the normalized lane number and the inter-lane gap, the result ((Base)+NLN*LaneDiff) being stored into register 153 as the starting address of this data engine in this lane. The above operations may be performed with dedicated computing resources in each lane, or completed by the execution units in the lanes under instruction control (e.g. a multiply-add instruction and a data move), which is not repeated here. The data engine set instruction further controls selector 152 to select the output of register 153; in the following clock cycle the output of selector 152 is stored into register 151. Load data engine 150 accesses data cache 146 with the output of 151 as the address, and the fetched data is stored into the register of this lane's register file pointed to by register 159. Every time register 151 in load data engine 150 is updated, it triggers this data engine to read data from data cache 146 at this updated value as the address, over load data bus 156, into the register of the lane's register file pointed to by register 159 in 150. At the same time the value in register 151 is added to the stride value in register 155 to produce the next data memory address.
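The address path of load data engine 150 just described can be modeled as a small class. The class and method names are illustrative; the register roles (153 holds (Base)+NLN*LaneDiff, 151 is the live address, 155 the stride) follow the description above.

```python
class LoadEngineModel:
    """Sketch of load data engine 150's address generation: register 153
    holds the per-lane start address (Base)+NLN*LaneDiff; register 151
    supplies each access address and then steps by the stride in 155."""
    def __init__(self, base, nln, lane_diff, stride):
        self.stride = stride                  # register 155
        self.start = base + nln * lane_diff   # register 153
        self.addr = self.start                # register 151
    def access(self):
        """Return this access's address, then pre-compute the next one."""
        a = self.addr
        self.addr += self.stride              # adder 154 -> register 151
        return a
    def reset(self):
        self.addr = self.start                # selector 152 re-selects 153

# Lane with NLN=2 reading matrix N (base 0, LaneDiff '4', stride '1')
eng = LoadEngineModel(base=0, nln=2, lane_diff=4, stride=1)
print([eng.access() for _ in range(3)])  # [8, 9, 10]
```

The `reset` path corresponds to control line 138 going to '0' at loop exit, when selector 152 re-selects register 153, as described further below for the LOOP instruction.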
Each data engine configuration instruction configures one data engine in each lane. That is, the first data engine instruction configures the DE0 data engine in each lane to read the N matrix; the second instruction configures the DE1 data engine to read the M matrix; the third data engine set instruction configures the store (Store) data engine 170 attached to post-processor PPU 190 to write the aggregated results back to the P matrix in memory 146. The address generation parts 171, 172, 173, 174, 175, 176 of 170 are identical to the corresponding parts 151, 152, 153, 154, 155, 156 of load (Load) data engine 150. The difference lies in the direction of data flow: in a load data engine the data flows from the data cache to the lane register file, while in a store data engine the data flows from the register file to the data cache. A further difference is that the register number stored in register 179 of store data engine 170 points to a read port of the register file, whereas register 159 in load data engine 150 points to a write port of the register file. The store data engine attached to the post-processor may omit register 179, because its only data source is the output of the post-processor.
The data engine number in field 23 of a data engine set instruction can implicitly (Implicitly) select a load or a store data engine (e.g. numbers DE0-DE15 configure load data engines, numbers DE16-DE23 configure the store data engines in each lane, and numbers DE24-DE31 configure post-processor store data engines). It is also possible to provide an explicit (Explicit) load data engine set instruction, a lane store engine set instruction and a post-processor store engine set instruction to the same effect. This example uses the implicit data engine set instruction. At the same time, when lane allocator 188 allocates lanes it also establishes the corresponding post-processor connection relations.
When the third data engine set instruction is executed, according to the number DE31 in instruction field 23, lane allocator 188 assigns for this instruction the store data engine 170 attached to the post-processor 190 that aggregates the intermediate results output by all four lanes over the configurable aggregation network of the four lanes in this group.
In this example the data address of matrix M changes in the outer loop; its address stride (Stride) is '4' (i.e. the address increases by '4' each time) and the address gap between adjacent lanes (Lane Diff) is '1'. The data address of matrix N changes in the inner loop; its address stride is '1', its data address is reset to the starting address after the inner loop completes, and its adjacent-lane address gap is '4'. The data address stride of matrix P is always '1'. Since the elements of matrix P are all results after aggregation, there is no adjacent-lane address gap (the value is '0'). The three data engines are configured by the three data engine configuration instructions respectively: the first instruction configures for each lane the starting addresses of the first-column elements of the N matrix in Figure 14A; the second instruction configures for each lane the starting addresses of the first-row elements of the M matrix in Figure 14A; the third instruction configures for the post-processor the starting address of the first-row, first-column element of the P matrix in Figure 14A; and all of them are made to compute data addresses according to the state of their respective cycle controllers. Thus, before the two loop set instructions that follow are executed, register R5 of lane 0 (normalized lane number) holds the M0 element of Figure 14A and register R4 holds the N0 element of Figure 14A; lane 1 holds the M1 element in its R5 register and the N4 element in its R4 register; lane 2 holds M2 in R5 and N8 in R4; lane 3 holds M3 in R5 and NC in R4. It is so because the lane gap of the DE1 data engine responsible for loading the M matrix is '1', while the lane gap of the DE0 data engine responsible for loading the N matrix is '4'. After the third data engine configuration instruction has been executed, the output of register 171 in data engine 170 points to the P0 element in data buffer 146, ready to write the aggregation results produced by post-processor 190 into data memory 146 through first-in-first-out buffer (FIFO) 176. FIFO 176 buffers the aggregation results to avoid read/write conflicts in data buffer 146.
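The per-lane starting registers just listed follow directly from the (Base)+NLN*LaneDiff rule. The sketch below reproduces them with flat element indices (N0 at index 0, N4 at index 4, and so on); the function name and the flat indexing are assumptions for illustration.

```python
def lane_start_indices(base_index, lane_diff, nlns):
    """Element index at which each lane's engine starts: base + NLN*LaneDiff."""
    return [base_index + n * lane_diff for n in nlns]

nlns = [0, 1, 2, 3]
# DE0 loads matrix N with LaneDiff '4': lanes start at N0, N4, N8, NC
print(lane_start_indices(0, 4, nlns))  # [0, 4, 8, 12]
# DE1 loads matrix M with LaneDiff '1': lanes start at M0, M1, M2, M3
print(lane_start_indices(0, 1, nlns))  # [0, 1, 2, 3]
```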
A loop set instruction (SETLOOP) configures a cycle controller 130; it has instruction format B of Figure 2. Its field 28 holds the number of times the loop executes (e.g. '3'); target operand field 23 holds the number of the cycle controller corresponding to this loop (e.g. cycle controller J, K); field 22 holds the numbers of the data engines associated with this loop. When the decoder decodes a loop set instruction, lane allocator 188 allocates for it a cycle controller 130 to be shared by this lane group, and associates the cycle controller number in the instruction with this cycle controller (e.g. by storing this number into the controller, or by recording in 188 that this number relates to this controller). At the same time the loop count in field 28 of the instruction is stored into register 131 and the like of the allocated cycle controller 130. Taking Figure 15 as an example, for the first loop set instruction, lane allocator 188 in Figure 13B allocates a cycle controller 130 and names it J according to field 23 of the instruction. This loop set instruction sets control line 137 of cycle controller J in Figure 13B to '0', making control line 138 '0' through AND gate 136, so that selector 133 in cycle controller J (130) selects the output of register 131. This loop set instruction makes register write signal 149 in 130 write the output of selector 133 into register 134 in the following clock cycle. Afterwards control line 137 is reset to '1'; at the same time, according to field 22 of the instruction, switch 181 of the data engine 150 numbered DE0 is closed so that selector 152 is controlled by control line 138 of cycle controller J (130), and switch 183 is closed so that write control signal 158 of register 151 in DE0 data engine 150 is controlled by register write signal 149 of cycle controller J. The data engine numbered DE0 thus follows the actions of cycle controller J, i.e. is associated with J. At this point the loop count '3' stored in register 134 of cycle controller J (130), after an 'OR' operation through OR gate 135, makes the value of control line 141 '1', which through AND gate 136 makes control line 138 also '1'. This value controls selector 133 in 130 to select the output value '2' of decrementer 132, and also controls selector 152 in 150 to select the output of adder 154, whose value is the data address output by register 151 plus the stride stored in register 155; in the DE0 data engine this stride is '1'.
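The count/decrement/reload behavior of cycle controller 130 can be sketched as follows. The class name and the boolean return convention are assumptions; the behavior (branch back while register 134 is non-zero, reload it from register 131 at exit) follows the description.

```python
class CycleControllerModel:
    """Sketch of cycle controller 130: register 131 keeps the reload count,
    register 134 the live count. A LOOP instruction branches back while 134
    is non-zero (via decrementer 132); at zero the loop exits and selector
    133 reloads 134 from 131."""
    def __init__(self, count):
        self.reload = count   # register 131
        self.count = count    # register 134
    def loop(self):
        """True if the LOOP instruction branches back to its target."""
        if self.count != 0:        # control lines 141/138 are '1'
            self.count -= 1        # selector 133 picks decrementer 132
            return True
        self.count = self.reload   # selector 133 picks register 131
        return False

j = CycleControllerModel(3)
print([j.loop() for _ in range(4)])  # [True, True, True, False]
```

With a count of '3' the loop body therefore runs four times in total (the initial pass plus three branches back), matching the walk-through of the inner loop below.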
For one loop set instruction, all data engine numbers appearing in fields 24, 25, 26 of the instruction are, by the above process, associated with the cycle controller assigned the number in field 23 of this instruction, and this cycle controller controls the stepping of each associated data engine. If the number of load or store instructions contained in a loop (each such instruction corresponding to one data engine) exceeds the number of data engine numbers that field 22 of one loop set instruction can hold, a further loop set instruction with identical fields 23 and 28 can be added to the program, its fields 24, 25, 26 holding the data engine numbers the first instruction could not accommodate. Lane allocator 188 decodes the second instruction, finds that a cycle controller has already been assigned for the cycle controller number in field 23 of this instruction, and likewise associates the data engines pointed to by the data engine numbers contained in the instruction with this cycle controller; the loop count value in this instruction is either written into register 131 of this cycle controller again (the value written the second time being identical to the first) or not written at all. In Figure 15 the second loop set instruction likewise configures cycle controller K as in the example above and associates data engine DE1 with it. The stride stored in register 155 of the DE1 engine is '4', so the output value of its adder 154 is the data address in its register 151 plus '4'.
Next follows an ordinary multiply instruction MUL, which multiplies the contents of register R4 in each lane with the value in register R5 and stores the product into register R6 of its own lane. Next again is the aggregation add instruction RDUADD, which sends the value in register R6 of each lane into the post-processing network to be added. Field 25 of this instruction also points to data engine DE31. Referring to Figure 13B, when this instruction reaches the DE31 data engine 170 along the post-processing network, the output of post-processor 190 is written into data buffer 146 at the P0 address held in register 171. This completes the operation of the first row in Figure 14B: lane 0 completes the M0*N0 operation, lanes 1, 2, 3 complete the M1*N4, M2*N8 and M3*NC operations respectively, the post-processor adds the products of the 4 lanes, and the resulting value is stored back into the data cache as the first element P0 of the P matrix pointed to by register 171 in store data engine 170 of DE31. After P0 is stored, register 171 in DE31 is updated; since no data engine configuration instruction is now in effect, selector 172 selects the output of adder 174, which is the address of the P0 element plus the stride '1' stored in register 175, i.e. the address of the P1 element, to be used as the store address of the next loop iteration. The above MUL and RDUADD instructions may also be fused into one multiply-aggregate-add instruction to save an instruction execution cycle.
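The MUL/RDUADD pair just described amounts to a per-lane multiply followed by a cross-lane reduction; a minimal sketch, with illustrative names and sample numbers:

```python
def mul_rduadd(r5_per_lane, r4_per_lane):
    """Sketch of the MUL + RDUADD pair: each lane multiplies its R5 and R4
    registers (MUL), then the post-processing network adds the per-lane
    products (RDUADD) into one aggregated value."""
    products = [m * n for m, n in zip(r5_per_lane, r4_per_lane)]  # MUL per lane
    return sum(products)                                          # RDUADD

# Illustrative values for the four lanes' (R5, R4) pairs
print(mul_rduadd([1, 2, 3, 4], [5, 6, 7, 8]))  # 70
```

With the register contents listed above this produces exactly M0*N0+M1*N4+M2*N8+M3*NC, the P0 element of Figure 14B.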
Returning to Figure 15, a LOOP instruction follows. The loop instruction LOOP is a special kind of branch instruction whose offset is a negative value; its format is type B of Figure 2, its field 22 containing the number of the cycle controller controlled by this instruction, which is J in this instruction, and its field 28 containing the branch offset, which in this example points to the MUL instruction in the program (marked T1). Referring to Figure 13B, while instructions 128 from the memory hierarchy are being stored into instruction buffer 406, these instructions are also scanned and analyzed by scanner 408, which calculates and extracts the instruction types and branch targets and stores them into the entries of track table 410 corresponding to the instructions in instruction buffer 406. In this embodiment the instruction type contains the signals controlling selector 139 and selector 141. When the instruction being executed is not a loop instruction, the corresponding type signal read from track table 410 controls selector 139 to select the TAKEN signal of branch decision logic 149 in the lane to control selector 442 in the tracker. The BRANCH signal from the lane is normally continuously valid, so register 443 updates every clock cycle and supplies instruction addresses to instruction buffer 406 and the track table, so that 406 provides a new instruction to the lane every cycle for execution; only when the pipeline in the lane stalls is this signal invalid, stopping register 443 from updating and suspending the supply of new instructions from 406 to the lane.
When register 443 in the tracker outputs the address of the first LOOP instruction, this loop instruction is read from instruction buffer 406 at this address for the lane to decode and execute, while at the same time the type signal of this loop instruction is read from track table 410 at this address to control selector 139 to select the output from selector 168 to control selector 442 in the tracker. Track table 410 simultaneously also outputs the branch target address T1 of this loop instruction to one input of selector 442; the other input of selector 442 comes from the output of incrementer 441, whose value is the address of this loop instruction (i.e. the current output of register 443) plus one. Therefore the output of selector 168 determines whether the following clock cycle executes the next sequential instruction after the current one (the second LOOP instruction) or the branch target instruction T1 of the current instruction (the MUL instruction). This loop instruction is decoded, and the value of field 22 in the instruction controls selector 168 to select output 141 of cycle controller J (130) to control selector 442. When a cycle controller 130 is configured as 'J', the control line by which selector 168 selects output signal 141 from this 130 is configured as 'J' at the same time; later instructions matching number 'J' therefore select output 141 of cycle controller 'J'. Selector 168 treats its other inputs in the same way, all based on matching. The value of control line 141 at this time is '1' as stated above, meaning the loop is to be executed; selected through 168 and then 139, selector 442 in the tracker selects the T1 address output by track table 410, making it the instruction address of the following clock cycle. At the same time, according to the decoding of this loop instruction, or directly according to the loop instruction type signal read from track table 410, the enable (Enable) signals are asserted: signal 149 in cycle controller J (130) and signal 158 in the DE0 data engine 150.
Thus in the next clock cycle, register 134 in cycle controller J (130) updates and stores the new loop count '2'; register 151 in the DE0 data engine of each lane updates, storing a new address greater than the previous value by '1' (the stride value), and data is fetched from data cache 146 at this address and stored into register R4 of the respective lane; the T1 address is stored into register 443 in the tracker to access instruction cache 406 and track table 410, reading the MUL instruction and its corresponding track table entry. Because the DE1 data engine is not associated with cycle controller J, it is unaffected by this loop instruction, so the values written into the R5 registers of each lane by the DE1 data engine do not change. Hence register R5 of lane 0 still holds the M0 element of Figure 14A while register R4 now holds the N1 element; lane 1 still holds M1 in R5 and now N5 in R4; lane 2 still holds M2 in R5 and now N9 in R4; lane 3 still holds M3 in R5 and now ND in R4. When the MUL and RDUADD instructions in the Figure 15 loop are executed again, the result M0*N1+M1*N5+M2*N9+M3*ND shown in the second row of Figure 14B is, as before, stored at the P1 position of the P matrix in data cache 146 indicated by register 171 in the DE31 store data engine of Figure 13B.
Thereafter the first LOOP instruction in Figure 15 is executed again; since the value in register 134 of cycle controller J (130) is now '2', control lines 141 and 138 are '1', and the loop is executed as described above. As a result the data in register R4 of each lane is updated; the execution of the MUL and RDUADD instructions produces the result of the P2 row in Figure 14B and writes it back to data memory 146, and the value in register 134 becomes '1'. The first LOOP instruction in Figure 15 is then executed once more; since the value in register 134 is now '1', control lines 141 and 138 are '1', and the loop is again executed as described above. As a result the data in register R4 of each lane is updated; the execution of the MUL and RDUADD instructions produces the result of the P3 row in Figure 14B, which is written back to data memory 146, and the value in register 134 becomes '0'.
When the first LOOP instruction in Figure 15 is executed yet again, the value in register 134 of cycle controller J (130) is now '0', and control lines 141 and 138 are '0', making program execution exit the loop (the inner loop). The process is as follows: the '0' on control line 141, through selectors 168 and 139, controls selector 442 in the tracker to select the output of incrementer 441, so that in the next cycle register 443 stores the address of the next sequential instruction (i.e. the second LOOP instruction in Figure 15). At the same time the '0' on control line 138 controls selector 133 in cycle controller J to select the loop count '3' stored in register 131, which is stored into register 134 in the next cycle; the '0' on control line 138 also controls selector 152 in the DE0 data engine 150 of each lane to select the base address in register 153, which is stored into register 151 in the next cycle. The cycle controller and data engines associated with this loop instruction (the inner loop) are thereby all restored to their original state, ready to execute the entire inner loop again.
In the following clock cycle the DE0 data engine in each lane reads data from data cache 146 at the data address in its register 151 and stores it into register R4 of each lane. At the same time, the second LOOP instruction in Figure 15 is read from instruction buffer 406 at the address in register 443, and the corresponding loop instruction type and branch target T1 are read from track table 410. The execution of the second LOOP instruction is similar to that of the first; the difference is that this instruction executes the outer loop, acting on cycle controller K and its associated data engine DE1, and does not affect cycle controller J of the inner loop or its associated data engine DE0. The loop count now stored in register 134 of cycle controller K is '3' (written there by the second SETLOOP instruction in Figure 15), so control lines 141 and 138 of cycle controller K are '1'. The outer loop is therefore executed as in the inner loop example above: the loop count in register 134 of cycle controller K is reduced to '2'; in each lane, register 151 of the DE1 data engine stores the new data address obtained by adder 154 adding the data address in the former register 151 to the stride '4' in register 155, and data is read from data cache 146 at this new data address; register 443 also stores the branch target address of the second LOOP instruction read from track table 410 (likewise T1).
In the following clock cycle the MUL instruction in Figure 15 is read from instruction buffer 406 at the address in register 443 for each lane to decode and execute. Register R5 of lane 0 now holds the M4 element of Figure 14A and register R4 holds the N0 element; lane 1 holds M5 in R5 and N4 in R4; lane 2 holds M6 in R5 and N8 in R4; lane 3 holds M7 in R5 and NC in R4. Therefore after the MUL and RDUADD instructions in Figure 15 are executed, the result of the P4 row in Figure 14B is produced as before and stored into data cache 146. The first LOOP instruction in Figure 15 is then executed again; because the loop count in cycle controller J is now '3', the inner loop is executed, jumping back to the MUL instruction. Executing the inner loop 3 more times as before completes the operations of the P5, P6 and P7 rows in Figure 14B, at which point the loop count in cycle controller J is reduced to '0' and the program exits the inner loop.
The second LOOP instruction is then executed again; because the loop count stored in register 134 of cycle controller K is now '2', the outer loop is executed, returning to the MUL instruction. This round of the outer loop executes the inner loop 3 times, executing the instruction segment from the MUL instruction to the first LOOP instruction 4 times in total, computing and storing the four P matrix elements from P8 to PB in Figure 14B. The second LOOP instruction is executed again; the loop count stored in register 134 of cycle controller K is now '1', so the outer loop is executed once more, returning to the MUL instruction. This round of the outer loop executes the inner loop 3 times, executing the instruction segment from the MUL instruction to the first LOOP instruction 4 times in total, computing and storing the four P matrix elements from PC to PF in Figure 14B. When the second LOOP instruction is executed thereafter, the loop count stored in register 134 of cycle controller K is now '0', so the outer loop is exited and the next sequential instruction is executed.
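Taken together, the nested J/K loops above perform a lane-parallel matrix multiply. A compact emulation under the strides given earlier (DE0/N: stride '1', LaneDiff '4', reset at inner-loop exit; DE1/M: stride '4', LaneDiff '1'; DE31/P: stride '1') is sketched below; the flat-array representation and all names are illustrative.

```python
def emulate_lane_matmul(m_flat, n_flat, lanes=4):
    """Emulate the Figure 15 nested loops on flat 4x4 matrices: the inner
    loop (controller J) steps DE0/N by 1 and emits one P element per
    iteration; the outer loop (controller K) steps DE1/M by 4 and resets
    DE0 to its start address."""
    p = []
    de1 = [nln * 1 for nln in range(lanes)]        # M addresses, LaneDiff '1'
    for _outer in range(4):                        # cycle controller K
        de0 = [nln * 4 for nln in range(lanes)]    # N addresses, LaneDiff '4'
        for _inner in range(4):                    # cycle controller J
            # MUL in every lane, RDUADD across lanes, store through DE31
            p.append(sum(m_flat[de1[l]] * n_flat[de0[l]] for l in range(lanes)))
            de0 = [a + 1 for a in de0]             # DE0 stride '1'
        de1 = [a + 4 for a in de1]                 # DE1 stride '4'
    return p

m = list(range(16))                           # arbitrary test matrix
identity = [1 if i % 5 == 0 else 0 for i in range(16)]
print(emulate_lane_matmul(m, identity) == m)  # True: M * I == M
```

Element r*4+c of the result is the dot product of row r of M with column c of N, i.e. the P matrix of Figure 14B.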
The next instruction's mnemonic is LOOPTS, meaning temporal-spatial loop decision. This instruction takes the form of B in Figure 2: its field 21 is the opcode; its field 22 indicates the lane group this instruction acts upon (here Q); its field 23 is not needed in this instruction; its field 28 is the branch target offset, here pointing to T2, i.e. the MOVLN instruction in Figure 15. When this instruction is stored into instruction buffer 406, its instruction type and branch target are extracted and calculated by scanner 408 and stored into the entry of track table 410 corresponding to this instruction in 406. When the decoder decodes this instruction, the control logic of selector 168 in Figure 13B matches the value Q in field 22 of the instruction and selects the temporal-spatial loop decision 199 output by the lane group controller 189 of Q (now '0', i.e. exit the loop). At the same time the instruction type read from track table 410 controls selector 139 to select the output of selector 168 to control selector 442 in the tracker, which selects the output of incrementer 441 to be stored into register 443. Thereafter the next sequential instruction SETWIDTH is executed; the requested lane count in field 28 of this instruction is '1', so lane allocator 188 accordingly dissociates the lanes of this lane group other than reference lane 16, namely the other lanes 17, 18 and 19, together with the associated cycle controllers 130, store data engine 170 and other resources of this lane group, and returns them to the resource pool to await requests from other threads or from this thread, while the reference (16) lane continues to execute the subsequent single-lane serial instructions.
If, when executing the program of Figure 15, the lane allocator had allocated only two lanes, 16 and 17, then lane group controller 189 of Q would store into register 194 the difference '2' obtained by subtracting the available lane count '2' from the requested lane count '4'. The reference lane number 197 output by logic 195 is then '2', the lane count used in this pass 198 is also '2', and the temporal-spatial loop decision 199 is '1'. Lane allocator 188 then controls lanes 16 and 17 to participate in program execution according to the value '2' of this-pass lane count 198. After the reference (16) lane executes the SUBSCH instruction, the value in register R12 is smaller than the value in R11 by '2'. Therefore after the BCSUB instruction is executed, the NLN of these 2 lanes is 2 and 3, the lane whose NLN is 2 being the reference lane. Because the data engines 150 in the lanes set starting data addresses related to NLN ((Base)+NLN*Diff), these two lanes, when executing the program of Figure 15 up to the temporal-spatial loop decision instruction LOOPTS, actually perform the operations of lanes 2 and 3 in the right half of Figure 14B, storing the incomplete results into the P matrix in data memory 146. When the temporal-spatial loop decision instruction is reached, because the temporal loop decision 199 is still '1', this value, through selectors 168 and 139, controls selector 442 to select the branch target T2 now output by track table 410, and in the next cycle execution branches back to the MOVLN instruction in Figure 15. This loop decision also causes register 194 in the lane group controller 189 of Q to update, storing the difference '0' obtained after subtractor 192 subtracts the value of register 191 (the available lane count '2') from the former value of register 194 (the remaining requested lane count '2'), selected through selector 193. Logic 195 then outputs the value '0' for 197, '2' for 198, and '0' for 199.
Lane allocator 188 now controls lanes 16 and 17 to participate in program execution according to the value '2' of this-pass lane count 198. After the reference lane executes the SUBSCH instruction, the value in register R12 is identical to the value in R11. Therefore after the BCSUB instruction is executed, the NLN of these 2 lanes (16 and 17) is 0 and 1, the lane whose NLN is 0 being the reference lane. Because the data engines 150 in the lanes set starting data addresses related to NLN ((Base)+NLN*Diff), these two lanes, when continuing to execute the program of Figure 15 up to the temporal-spatial loop decision instruction LOOPTS, actually perform the operations of lanes 0 and 1 in the left half of Figure 14B. During execution of the temporal loop, store data engine 170 operates in read-modify-write (Read-Modify-Write) mode: before a result is stored into an entry of data memory 146, the previously written content of that entry (the incomplete result produced by the previous pass of the program) is first read out, combined in post-processor 190 with the incomplete result of the current pass to produce a complete result, and then written back into data memory 146. Here the post-processor 190 to which store data engine 170 is attached may be a three-input processor, accepting the outputs of the two preceding post-processor stages and an input from data memory 146; or it may be a two-input processor, performing the operation twice to compute the complete result from the 3 inputs. Thereafter lane group Q executes the temporal-spatial loop decision instruction LOOPTS again, and because the temporal-spatial loop decision value 199 is '0', the loop is exited.
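The temporal loop just described, including the read-modify-write accumulation of partial results, can be sketched end to end. All names, the flat-array form, and the use of addition as the combining operation are illustrative assumptions; the pass-slicing logic follows the 194/197/198/199 description above.

```python
def temporal_matmul(m_flat, n_flat, lanes_avail, lanes_req=4):
    """Sketch of the temporal loop: when fewer lanes are granted than
    requested, the lane group repeats the program, each pass covering a
    slice of NLN values (197 gives the first NLN, 198 the pass width).
    The store engine runs in read-modify-write mode, accumulating the
    partial sums of each pass into the P matrix."""
    p = [0] * 16                      # P matrix entries in data memory 146
    diff = lanes_req - lanes_avail    # register 194
    while True:
        if diff > 0:                  # more passes needed (199 = '1')
            first, used, again = diff, lanes_avail, True
        else:                         # final pass (199 = '0')
            first, used, again = 0, lanes_avail + diff, False
        nlns = range(first, first + used)
        for r in range(4):
            for c in range(4):
                partial = sum(m_flat[r*4 + n] * n_flat[n*4 + c] for n in nlns)
                p[r*4 + c] += partial          # read-modify-write accumulate
        if not again:
            return p
        diff -= lanes_avail                    # subtractor 192 updates 194

m = list(range(16))
n = [1 if i % 5 == 0 else 0 for i in range(16)]   # identity, row-major
# 2 granted lanes need two passes (NLN 2,3 then 0,1) yet give the same result
print(temporal_matmul(m, n, lanes_avail=2) == m)  # True
```

Running this with 2, 3 or 4 available lanes yields identical P matrices, which is the point of the temporal loop: fewer physical lanes are traded for more passes in time.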
When and for example fruit performs Figure 15 Program, track allotter is assigned with three tracks 16,17 and 18, then Q
The difference that track group controller 189 will ask number of track-lines ' 4 ' to obtain after deducting available number of track-lines ' 3 ' is ' 1 '
It is stored in depositor 194.Now the output reference track number 197 of logic 195 is ' 1 ', this recycling car
Number of channels 198 is ' 3 ', and space-time cycle criterion 199 is ' 1 '.Now track allotter 188 is according to this circulation
Use number of track-lines 198 intermediate value ' 3 ' to control track 16,17 and 18 and be involved in program execution.Such as precedent now
Article three, the NLN in track 16,17 and 18 is 1 respectively, 2,3, and the program that therefore performs is to space-time cycle criterion
It is to complete the partial arithmetic result in 1,2, No. 3 tracks in Figure 14 B to be stored in number during instruction LOOPTS instruction
According to P matrix in memorizer 146.Because space-time cycle criterion 199 is ' 1 ', idle loop when therefore performing.
It is '-2 ' that the now request of the residue in depositor 194 number of track-lines ' 1 ' deducts the difference that available number of track-lines ' 3 ' obtains
It is stored in depositor 194.As it was previously stated, when depositor 194 output is less than ' 0 ', the output of logic 195
197 value for ' 0 ', 198 values be 191 export with 194 output be added and (3+ (-2)=1), and
The value of 199 is ' 0 ', accordingly track allotter according to 198 value by 17,18 tracks regain resource pools, only
Stay 16 tracks to continue executing with, and the value that 16 tracks are with 197 arranges track NLN for ' 0 '.Program is held
Row produce 0 track in Figure 14 B operation result and with the post-treated device of the partial results in P matrix 190
Computing produces complete result and is stored in P matrix in data storage 146.
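The pass-by-pass arithmetic of the lane group controller described above (subtractor 192, registers 191/194, outputs 197/198/199) can be sketched as follows. This is an illustrative model inferred from the two worked examples, not the circuit itself; the function name and return shape are assumptions.

```python
def lane_group_passes(requested, available):
    """Model of lane group controller 189: unfold a request for
    `requested` lanes onto `available` physical lanes, pass by pass.
    Returns one (base_NLN_197, lanes_used_198, loop_again_199) per pass."""
    passes = []
    remaining = requested                       # register 194 holds the residue
    while True:
        diff = remaining - available            # subtractor 192
        base_nln = max(diff, 0)                 # output 197: first NLN this pass
        lanes_used = min(remaining, available)  # output 198: lanes joining this pass
        loop_again = 1 if diff > 0 else 0       # output 199: space-time loop decision
        passes.append((base_nln, lanes_used, loop_again))
        if loop_again == 0:
            break
        remaining = diff                        # residue written back to register 194
    return passes
```

For the second worked example, `lane_group_passes(4, 3)` yields a first pass covering NLNs 1 through 3 on three lanes, then a final pass covering NLN 0 on one lane, matching the text.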
In Fig. 13B, load data engine 160 can be used interchangeably with 150. Compared with 150, 160 additionally has a first-in-first-out buffer (FIFO) 166 for temporarily holding data read from data memory 146. In 160, updating of data address register 161 is controlled by the fill-status signal of FIFO 166: when 166 is not full, register 161 may update; when 166 is full, register 161 does not update. The signal produced by the LOOP instruction, which in 150 controls the updating of register 161, in 160 instead controls reading of FIFO 166 and stores the data read out into the register-file entry pointed to by register address register 169 in 160. The other components correspond one to one in function with the corresponding components in 150: 161, 162, 163, 164, 165, 167 and 169 correspond respectively to 151, 152, 153, 154, 155, 157 and 159, and are not described again. Once a valid data address has been stored into its data address register 161, load data engine 160 starts reading data from data memory 146 at that address. When a datum has been successfully read and stored into FIFO 166, register 161 updates: the previous data address is added in adder 163 to the step size held in step register 165 to form the next data address, from which the next datum is read from data memory 146 and stored into FIFO 166. Operation continues in this way, stopping once FIFO 166 is full. Thus, when the LOOP instruction directs 160 to fill data into the register file, the data comes from FIFO 166, thereby hiding the possible access latency of data memory 146. When reading data out leaves 166 no longer full, load data engine 160 resumes reading data from data memory 146 to fill 166. When the execution result of the loop instruction associated with 160 (indirectly, through the loop controller) is judged to exit the loop, the contents of FIFO 166 in 160 are cleared. In addition, in this embodiment the basic structures of load data engines 150 and 160 and store data engine 170 are identical; only the directions of data flow differ. A single data engine whose data flow direction is configurable could therefore perform the functions of all three kinds of data engine.
Therefore, in this embodiment the program states its space requirement to the processor system through instructions; the processor system provides the space resources available at the time (lanes 140, loop controller 130, lane group controller, etc.), and unfolds in time, outside the loops expressed by the program itself, the portion of the space requirement that cannot be fully satisfied. The operation of a data engine depends on the space (lane interval) and time (step size) increments provided by instructions: each data engine is unfolded spatially by lane at configuration time, and thereafter steps automatically in time by the given increment upon its trigger condition. The operation of the loop controller depends on the loop count provided by instructions, making the associated resources execute the loop the set number of times. A loop can further be associated with the memory accesses of a data engine, serving as the trigger condition for that data engine, so that memory accesses follow the loop. A data engine can also expand its initial data address spatially under control of the lane number. In addition, the system can convert a program's space requirement into space-time loops by setting the control lane number and the available lane count, trading time for space. Since data engine configuration and loop controller setup are both carried out outside the program's loop, the number of instructions the system executes to complete the same computation is greatly reduced compared with the prior art.
The system and method of this embodiment also apply to programs that do not require aggregation operations. In the program of Fig. 15, the third instruction SETDE is changed to configure the store data engine 170 in each lane, whose register 179 holds the source register number of the store operation, and the RDUADD instruction is removed from the program. The result computed in each lane is then, when the loop instruction is executed, read by that lane's store data engine 170 from the register pointed to by the value in register 179 and stored into the entry of data memory 146 pointed to by the data address held in register 151. On every pass of the loop in this embodiment, data engine 170 also steps and updates the data address in register 151, just as data engine 150 does.
The CPU programming of the prior art is based on a uniprocessor programming model: the programmer compiles the instruction segments that need to be repeated into loops, unfolding them in time. The GPU programming of the prior art is based on a multiprocessor programming model with a fixed lane count: the number of lanes used by a program must correspond to the lane count of a particular processor, so programs cannot be made general, nor compatible with CPU programs. The present invention discloses an elastic, novel programming model, characterized as a universal programming model in which the processor can autonomously convert space into time. It can be applied to all single-lane (single-core) or multi-lane (multi-core) computing devices incorporating the space-to-time conversion technology disclosed in this invention, including CPUs, DSPs, MCUs, GPUs, GPGPUs, in-memory computing units, and so on. This programming model also allows the serially executed and parallel executed parts of a program to connect seamlessly: parallelism-setting instructions inform the processor system of the resources required, and the processor system automatically allocates resources according to the priority of the request and the available resources so as to meet the program's requirements. Based on this programming model, programmers can code in an easier, spatially parallel manner, and a program written in this model is portable across various computing devices whose execution resources differ. This programming model enables instruction extensions based on a CPU instruction set to be applied to cooperative computation among the lanes of a multi-lane processor or the cores of a multi-core processor.
This embodiment is explained using the multi-lane processor of Fig. 4 as an example, but the methods and systems disclosed in this embodiment can in fact be applied to any processor system with the same effect. As long as there is even a single core, the loop controller 130, data engine 150 and the like of this embodiment can be used to execute the loop setup, data engine configuration and loop instructions of this example. The scanner 408, track table 410 and tracker in this embodiment are also not essential: the PC address output by an instruction addressing unit (PC unit) as in prior-art processors can address instruction buffer 406 to supply instructions to a lane, and the output signal of selector 139 in Fig. 14B can serve as the branch decision signal controlling that instruction addressing unit. The signal controlling selector 139 in Fig. 14B can be obtained by instruction decoding: when a loop instruction is executed, the output of selector 168 is selected; when the remaining instructions are executed, the branch decision signal TAKEN produced by the branch decision logic 149 in the processor core is selected. Adding a lane group controller 189 to the above single-core (single-lane) system realizes the space/time switching function described in this embodiment and executes the described parallelism-setting instructions. If register 191 in 189 is replaced with the constant '1' in the single-core system, that system can, as described above, convert the program's space requirement into space-time loops and execute the program correctly. In a multi-core (multi-lane) system, a lane allocator 188 must also be provided to allocate resources according to the program's requirements. If inter-lane execution of instructions or post-processing instructions is required, inter-lane or post-processing units and buses are provided per the embodiments disclosed in this invention.
An embodiment of a map/aggregate algorithm based on the technology of the present invention is given below. Take counting in big-data processing algorithms as an example: a data file usually contains a large number of records, each containing some numeric attributes of some entity. The goal of the counting problem is to compute the value of a function expression over some numeric attribute of each entity, such as a sum or an average. Specifically, for example, a call detail record (CDR) file contains phone numbers and the traffic byte count of each network access, and the total network access traffic of each phone needs to be computed. When the map/aggregate algorithm is used to solve this problem, in the map phase different lanes extract the entity identifiers (phone numbers) and target attribute values (traffic byte counts) from different CDR files; in the aggregate phase, the attribute values of all identical entity identifiers are combined according to the function expression (for example, by addition).
In actual operation, all lanes participating in this counting algorithm execute the same program segment, and the aggregation module can be composed of adders converging in a tree, each level adding pairs of results from the previous level and sending the sums on to the next level, as in the pseudocode below:
Here the code `outputValue = propertyValue1 + propertyValue2` realizes an aggregation-addition function similar to that of RDUADD in the embodiment of Fig. 15. Thus, assuming that 4 lanes participate in the run, the above program runs in these 4 lanes simultaneously, values can be extracted from 4 data files at the same time, and the final result is obtained through two levels of aggregation.
In actual operation, taking the structures shown in Fig. 4 and Fig. 13B as an example, instruction memory 406 stores the above pseudocode implementing the map function. The trackers of the four lanes 401, 403, 405 and 407 all start executing the same code from the same instruction, and execute loop code with the same loop count. Clearly the 4 trackers then work in complete lockstep, so a single tracker could also be used to control all 4 lanes. In this example, the data memory stores the contents of several CDR files, and the data engines are configured so that each data engine starts accessing the data memory from a different data address corresponding to a different CDR file. Thus the data engine of each lane obtains the data of a different CDR file from the data memory, extracts the network access traffic information of each phone and sends it to the aggregation module. The aggregation module executes the above pseudocode implementing the aggregation function, accumulating the traffic information sent from each lane to obtain the required total traffic. In this example, the instructions executed in every lane at the same moment are identical, but the data processed differ, realizing the described counting function in SIMD fashion.
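The SIMD counting flow just described — identical map code per lane, different CDR data, per-phone totals combined by the aggregation module — can be sketched as follows. The record layout (`(phone, bytes)` pairs) and function names are assumptions for illustration, not the patent's.

```python
def map_lane(records):
    """Map phase, run identically in every lane on its own CDR data:
    extract (phone, bytes) pairs and pre-sum the traffic per phone."""
    totals = {}
    for phone, nbytes in records:
        totals[phone] = totals.get(phone, 0) + nbytes
    return totals

def aggregate(partials):
    """Aggregation module: accumulate the per-phone totals of all lanes."""
    result = {}
    for part in partials:
        for phone, nbytes in part.items():
            result[phone] = result.get(phone, 0) + nbytes
    return result
```

Each lane runs `map_lane` on a different file; `aggregate` plays the role of the adder tree, here generalized to per-phone sums.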
In the above example, if the CDR files differ in size (meaning the loop counts of the corresponding loop code need not be identical), then in order to keep all lanes executing identical instructions, the loop count executed by every lane must be the same. In implementation, the same loop-setup instruction is executed for each lane, so that the same maximum loop count is written into the registers 131 of all the loop controllers used. Thus, apart from the lane with the most loop iterations, all the other lanes have to repeatedly perform useless operations. Moreover, in this case the space each CDR file occupies in the data memory must also equal that of the largest CDR file; the extra read/write operations performed on the additionally occupied portion do not affect the actual result, but simply waste data memory storage space. Therefore, the above pseudocode can instead be executed in MIMD fashion on the multi-lane system of the present invention, to improve lane execution efficiency and data memory storage efficiency.
Taking the structures shown in Fig. 4 and Fig. 13B as an example, instruction memory 406 stores the above pseudocode implementing the map function. The trackers of the four lanes 401, 403, 405 and 407 all start executing the same code from the same instruction, but the 4 trackers each work independently; the data engine of each lane obtains the data of a different CDR file from the data memory, extracts the network access traffic information of each phone and sends it to the aggregation module.
In this embodiment, although the loop-setup instructions executed by the lanes are identical, the loop counts set into the loop controllers differ. For example, the loop-setup instruction in this example may no longer use an immediate (such as '3' in the examples of Figs. 13, 14 and 15) as the loop count, but instead write a register value into register 131 as the loop count. Specifically, the loop count corresponding to each CDR file is determined at compile time and stored together with the CDR file as data in the data memory. Before the loop controller is set, this loop count is read into the same register in each lane, and when the loop-setup instruction is executed, the value of this register is written into register 131.
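The register-sourced loop count can be sketched as follows; this is a schematic model, and the pairing of loop count with file contents is an assumption about the data layout.

```python
def run_lane(data_memory, file_id):
    """Each CDR file is stored together with its compile-time loop count;
    the count is read into a lane register and then written into the loop
    controller's register 131 by the loop-setup instruction."""
    trip_count, records = data_memory[file_id]
    register_131 = trip_count             # loop-setup instruction writes 131
    total = 0
    for i in range(register_131):         # loop controller iterates 131 times
        total += records[i][1]            # consume one (phone, bytes) record
    return total
```

Lanes given different files thus iterate different numbers of times while executing the same code.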
Subsequent operation is similar to the previous example, the difference being that a lane that has finished execution can suspend work to reduce power consumption, or proceed in advance with other subsequent operations. The aggregation module, on receiving the computation results sent from the lanes, must wait until the results of all lanes have arrived before carrying out the subsequent aggregation operation. Clearly, an interlocking method can be used, with synchronization signals guaranteeing the correctness of the aggregation operation. In this case, the instructions executed by the lanes at the same moment differ, and the data processed also differ, i.e., the described counting function is realized in MIMD fashion.
In addition, the multi-lane system of the present invention can perform map/aggregate operations on streaming data. For example, in one case CDR data is sent to each lane in data-stream form; when the number of dial-outs of a certain specific outgoing number in all the CDR data is to be counted, each lane can perform information extraction on a different CDR data input stream (i.e., different data). In this example, since the counts extracted for this information only need to be added, each lane, operating under its own tracker, need not wait for the other lanes to produce a synchronization signal after once completing execution of the map section of the code; it can at any time perform the corresponding aggregation-add operation with fusion. That is, the post-processor reads the previously stored accumulated value from the data memory, adds it to the extraction result of that lane, and stores it back into the data memory, so that the extraction result of each lane can be accumulated at any time, updating the final result in the data memory in real time.
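The read-modify-write accumulation just described can be sketched as follows; a dictionary stands in for data memory 146 and the names are illustrative.

```python
def fused_accumulate(data_memory, key, lane_value):
    """Post-processor with fusion: read the running total already stored,
    add this lane's extraction result, and write the sum straight back,
    so lanes can accumulate at any time without cross-lane sync."""
    data_memory[key] = data_memory.get(key, 0) + lane_value  # read-modify-write
    return data_memory[key]
```

Any lane may call this whenever its map section finishes; the stored total is always current.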
In another example, the contents of the above CDR files need to be classified before being counted: for example, the call durations of several specific telephone numbers are to be counted separately. In this case each lane needs to execute a different program, each extracting from the same CDR file the call durations corresponding to a different specific telephone number and sending them to the aggregation module for accumulation. Still taking the structures shown in Fig. 4 and Fig. 13B as an example, instruction memory 406 stores multiple segments of pseudocode implementing the map function. The basic structure of each segment is similar, but the telephone-number-matching code (such as the getItemId function described above) differs, so the different telephone numbers in the same CDR file (i.e., in identical input data) can be distinguished. For example, the code of one segment implementing the map function is as follows:
The code of another segment implementing the map function is as follows:
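The two code segments referenced above are not reproduced in this text; a hedged sketch of their shared shape — identical apart from the telephone number each one matches — might be:

```python
def make_mapper(target_number):
    """Each map segment filters the same CDR records but matches a
    different telephone number (the getItemId-style test); only the
    matched number differs from segment to segment."""
    def mapper(records):
        return sum(duration for number, duration in records
                   if number == target_number)
    return mapper
```

Two lanes running `make_mapper("0100")` and `make_mapper("0200")` over the same records correspond to two of the segments described.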
The remaining code segments implementing the map function are all similar to the above two and are not repeated here.
In this example, the aggregation modules are configured as multiple groups, each group aggregating the call durations corresponding to one kind of telephone number. Specifically, each aggregation module group must wait until the corresponding results from all lanes have arrived before carrying out the subsequent aggregation operation, and the groups do not necessarily complete their aggregation operations at the same time. In this case, the instructions executed by the lanes at the same moment differ, but the data processed are identical, i.e., the described counting function is realized in MISD fashion.
In yet another example, the call durations of the same user in different CDR files need to be weighted before being accumulated. Since one cannot first accumulate the user's call durations across all CDR files and then multiply by the weighting coefficients, the weighting multiplication cannot be performed in the aggregation module. Instead, the weighting multiplication can be placed in the map code and performed by each lane separately, with the aggregation module accumulating only the already-weighted call durations; the required function is thereby realized. For example, the code implementing the map function corresponding to the present embodiment is as follows:
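That code is not preserved here; a minimal sketch of a per-lane weighted map, under assumed record and weight layouts, might be:

```python
def weighted_map(records, weight):
    """Per-lane map with weighting: multiply each call duration by this
    CDR file's weight before sending it on, because in general
    sum(w_i * d_i) != w * sum(d_i) across files with different weights."""
    return sum(duration * weight for _, duration in records)
```

The aggregation module then only adds the already-weighted durations from each lane.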
Clearly, the above code can be applied in the preceding three examples to realize SIMD, MIMD and MISD map/aggregate operations on weighted data. The specific implementation method can refer to the preceding embodiments and is not repeated here.
Although the above embodiments all take as their example the processor system of the Fig. 4 embodiment, in which the tracker addresses the track table and the instruction read buffer, the MIMD or MISD operation of the above embodiments can also be realized by adding to each lane an instruction read buffer addressed by that lane's own PC address, or by addressing each lane's own instruction read buffer to supply instructions to each lane separately. That is, the instruction addressing unit of each lane sends its PC address to its own instruction read buffer; on a hit, the instruction is read directly from the instruction read buffer and executed in the lane; on a miss, the PC address is sent to the instruction memory, the instruction block read out fills the instruction read buffer, and the instruction is at the same time forwarded to the lane for execution. The lane-number-based instruction execution disclosed in this invention, the inter-lane processing, post-processing, loops, loop-associated stepping data access, two-dimensional unfolding, space/time conversion and other such methods, and the data engine, loop controller, lane controller, lane allocator and other such devices can be used in any multi-lane system.
Inter-lane processing or post-processing can also be used for on-chip self-test of multi-lane processor chips. The following takes the tree-shaped post-processing bus of Fig. 8 as an example; the same method can be implemented on, for example, the inter-lane bus of Fig. 3 or the transmission bus of Fig. 9 disclosed in this invention. Lanes 80, 81, 82 and 83 all execute the same test vector (program). This vector can be fed in from outside by a chip tester or by a system containing the multi-lane processor chip under test, or read from on-chip memory by the on-chip test controller of the multi-lane processor chip, or generated on chip by some algorithm. The test vector contains post-processing instructions that cause the results produced by the 4 lanes executing the test vector to be sent to post-processors 84, 85 and 86 for comparison, and the operation results of the post-processors are sent either to the on-chip test controller or to the off-chip tester or system for judgment. The post-processors in this example are augmented with a test-specific function: under the control of the test controller, one input of a post-processor can be switched through to its output while the other input is ignored; alternatively, as in the embodiment of Fig. 10, the connecting buses between post-processors can be selectively turned off. The results sent back from several lanes are compared in the post-processors (or subtracted, checking whether the difference is '0') to produce a comparison value: if the comparison value is '1', the compared lane results are identical; if the comparison value is '0', further testing is needed. In the self-test default state, each post-processor can pass the comparison value and one of its two inputs (for example, the left input) on to the next-level post-processor. First assume all 4 lanes work normally; then the results of every lane executing the same test vector are identical, so the comparison value read from post-processor 86 is '1', showing that all 4 lanes work normally. Now assume instead that lane 83 does not work normally; the comparison value read from post-processor 86 is then '0', showing that at least one lane is abnormal. The on-chip or off-chip test controller then further controls post-processor 84 to pass through the result of lane 80 and post-processor 85 to pass through the result of lane 82; the comparison value now read from post-processor 86 is '1', showing that lanes 80 and 82 are normal. Next the results of lanes 80 and 81 are compared; the comparison value read from post-processor 86 is still '1', so it can be determined that lane 83 is abnormal. The above embodiment can detect at least one abnormal lane among a plurality of lanes; by analogy, the results of different lanes (processor cores) executing the same program on the same data can be compared over multiple rounds to locate and pin down a plurality of abnormal lanes (cores). This method has one blind spot: it cannot distinguish the case in which all lanes of the processor have the same fault. For this, a small number of expected results, input from off chip or stored on chip, can be compared with the corresponding results of the test vector as executed by at least one lane, as a means of discrimination.
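The bisection procedure above can be modeled as follows; this is an illustrative Python model of the Fig. 8 tree, and the AND-combination of child comparison values is an assumption about how the comparison value read at 86 summarizes the tree.

```python
def pp(a, b, ca=1, cb=1, bypass=None):
    """One post-processor in test mode: forwards one input and produces
    a comparison value; a bypassed comparison is forced to '1'."""
    fwd = b if bypass == 'R' else a       # default forwards the left input
    ok = 1 if bypass else int(a == b)     # bypass ignores the other input
    return fwd, ok & ca & cb

def tree_compare(r, bypass84=None, bypass85=None):
    """r = results of lanes 80..83; returns the comparison value read from
    post-processor 86 (84 compares lanes 80/81, 85 compares lanes 82/83)."""
    v84, c84 = pp(r[0], r[1], bypass=bypass84)
    v85, c85 = pp(r[2], r[3], bypass=bypass85)
    _, c86 = pp(v84, v85, c84, c85)
    return c86
```

With lane 83 faulty, `tree_compare([7, 7, 7, 9])` reads '0'; bypassing so that 86 compares lanes 80 and 82, and then lanes 80 and 81, each reads '1', isolating lane 83.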
By storing the lane numbers of abnormal lanes in lane allocator 188, so that the abnormal lane numbers no longer appear in the resource pool, the multi-lane (multi-core) processor chip can be repaired. For an abnormal lane, 188 can shift its lane number, and every lane number greater than it held in the lane registers of the lanes, down by one lane. The lane numbers in these registers then skirt the abnormal lane when the NLNs are produced, while the NLNs remain consecutive, so instructions based on NLN can still execute correctly. If the post-processors are further configured to ignore the output of the abnormal lane, a multi-lane processor containing an abnormal lane can still perform the aggregation operations disclosed in this invention. This method can also be used for lane allocation, distributing several non-adjacent lanes to one program or thread that requires multi-lane space resources. If the multi-lane processor is required to have a certain number of lanes, redundant lanes can be added in the design, so that the lane numbers are reset after self-test and, with the abnormal lanes excluded, a sufficient number of lanes still remain to meet the requirement. The same test method can be used to test the inter-lane processing or the post-processors and their connections; the principle is the same and is not repeated here.
During chip production testing, the above self-test can run a test vector once on all lanes of the chip in parallel; thereafter, under the control of the on-chip or off-chip test controller, inter-lane operations or post-processing operations compare the execution results of the lanes with one another to locate abnormal lanes, reducing test cost. The abnormal lane numbers can be recorded in on-chip non-volatile memory to improve chip yield. The test can also be performed automatically at system boot or during operation, recording abnormal lane numbers in on-chip memory, so that faults arising while the processor is in use can be self-repaired, increasing reliability. The on-chip test vector generator can be implemented with one or more random number generators, or with incrementers that exhaustively enumerate whole instructions or each field of an instruction.
Although the embodiments of the invention describe only structural features and/or processes of the invention, it should be understood that the claims of the invention are not limited to the described features and processes. On the contrary, the described features and processes are merely several examples implementing the claims of the invention.
It should be understood that the multiple components listed in the above embodiments are enumerated only for convenience of description; other components may be included, or some components may be combined or omitted. The multiple components may be distributed across multiple systems, may be physically present or virtual, and may be implemented in hardware (for example, integrated circuits), in software, or by a combination of the two.
Obviously, in light of the description of the above preferred embodiments, no matter how fast the technology in this field develops and whatever presently unforeseeable advances may be made in the future, those of ordinary skill in the art may, according to the principles of the present invention, make corresponding adaptive replacements, adjustments and improvements to the relevant parameters and configurations; all such replacements, adjustments and improvements shall fall within the protection scope of the claims of the present invention.
Claims (53)
1. multilane/multiple nucleus system, it is characterised in that comprise multiple track/processor core, Mei Geche
Road/processor core has different track/processor cores number, and each track/processor core can perform identical or not
Same instruction also accesses memorizer;The execution result of described a plurality of track/processor cores is also entered by described system
Row post-processing operation, and access memorizer.
2. the system as claimed in claim 1, it is characterised in that have complete between the plurality of track/processor core
Office's bus, for transmitting the data in depositor, by carry out standdle carrier road/processor core register value move or in terms of
Calculate operation.
3. system as claimed in claim 2, it is characterised in that:
Track/processor core in multilane/multiple nucleus system is divided into a plurality of track/processor core group, Mei Geche
Bus switch conducting in global bus within road/processor core group, the overall situation between track/processor core group
Bus switch in bus disconnects so that each track/processor core group carries out standdle carrier road/place simultaneously therein
Reason device core register value moves or calculates operation;
Different track/processor cores perform the same degree of polymerization when arranging instruction simultaneously, to corresponding described bus
Switch configures, it is achieved the degree of polymerization between corresponding track/processor core.
4. system as claimed in claim 3, it is characterised in that;Different track/processor cores perform same simultaneously
Article one, between track/processor core during operational order, according to respective described track/processor core number determine track, source/
Processor core and target track/processor core, by bus between described track/processor core by track, source/processor
The register value of core delivers to target track/processor core, described target track/processor core carry out post processing behaviour
Make;
When different track/processor cores perform operational order between same track/processor core, according to each simultaneously
Described track/processor core number determines track, source/processor core and target track/processor core, by described track
Between/processor core, the register value of track, source/processor core is delivered to target track/processor core, by described target
Track/processor core carries out post-processing operation.
5. the system as claimed in claim 1, it is characterised in that by instruction by each track/processor core
Track/processor core number moves in the general register of this track/processor core.
6. The system as claimed in claim 5, characterized in that different lanes/processor cores compute different data addresses from the same instruction according to their different lane/processor-core numbers.
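Read as software, claims 5 and 6 describe each lane deriving a distinct address from one shared instruction via its private lane number. A minimal sketch of that idea (the function name and the base/stride parameters are illustrative, not taken from the patent):

```python
def lane_data_address(base: int, lane_id: int, stride: int) -> int:
    """Each lane computes its own address from the same instruction
    operands (base, stride) plus its private lane number."""
    return base + lane_id * stride

# Four lanes executing the same "compute address" instruction:
addresses = [lane_data_address(0x1000, lane, 4) for lane in range(4)]
```

Each lane reads the same `base` and `stride` but lands on a different word, which is how one instruction drives per-lane data access.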
7. The system as claimed in claim 1, characterized in that the system further comprises one or more post-processors; each post-processor is connected to a plurality of lanes/processor cores, receives the execution results of those lanes/processor cores, and aggregates the execution results.
8. The system as claimed in claim 7, characterized in that:
the post-processor stores the aggregation result directly into memory; or
the post-processor sends the aggregation result back into a register of at least one lane/processor core.
9. The system as claimed in claim 7, characterized in that the degree of aggregation of the post-processing operation is determined by an instruction.
10. The system as claimed in claim 9, characterized in that the post-processor performs the aggregation operation by executing a post-processing instruction.
11. The system as claimed in claim 10, characterized in that:
the post-processors are connected by a transmission bus; each post-processor, by executing the post-processing instruction, aggregates the execution result of its corresponding lane/processor core with the output of an adjacent post-processor; or
the post-processors are connected by a tree-shaped bus; wherein a first-level post-processor, by executing the post-processing instruction, aggregates the execution results of its two corresponding lanes/processor cores, and passes this aggregation result, together with the post-processing instruction or its decoded result, level by level to the remaining post-processors at each level; each remaining post-processor, by executing the post-processing instruction, further aggregates the aggregation results of its two corresponding post-processors in the previous level.
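The tree-shaped post-processor arrangement of claim 11 is, in software terms, a pairwise reduction tree: level one combines lane results two at a time, and each further level combines two results from the level below. A hedged Python stand-in for the hardware (`op` models the post-processing instruction; names are illustrative):

```python
def tree_aggregate(values, op):
    """Pairwise reduction tree over lane results."""
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(op(level[i], level[i + 1]))  # combine a pair
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through unchanged
        level = nxt
    return level[0]

# Summing the results of eight lanes:
total = tree_aggregate([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)
```

With N lanes the tree needs only log2(N) levels, which is the latency advantage over the linear transmission-bus alternative in the same claim.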
12. The system as claimed in claim 11, characterized in that the paths between lanes/processor cores and post-processors, and between post-processors, are configurable;
by configuring each path to be on or off, the plurality of post-processors are grouped, realizing group-wise aggregation; or
by configuring each path to be on or off, the post-processors realize aggregation operations of different degrees of aggregation.
13. The system as claimed in claim 1, characterized in that a general judgment module produces, from the control signal already output, the next control signal for continuing execution from the current state, and also receives the next control signal for not continuing execution from the current state; according to feedback from the system running under the control of the already-output control signal, the general judgment module selects one of these two next control signals to output, controlling the system to continue running;
the general judgment module at least includes an arithmetic unit, a register, and a selector; wherein:
the register stores the current control signal and outputs it to control the operation of the system;
the arithmetic unit produces, from the current control signal, the next control signal for continuing execution from the current state, and sends it to the selector;
the selector, according to the feedback from the system running under the control of the current control signal, selects between the next control signal produced by the arithmetic unit and the received next control signal for not continuing execution from the current state, and updates the selection result into the register.
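Claim 13's judgment module is essentially a two-way next-state selector: a register holds the current control signal, an arithmetic unit computes the sequential successor, and a selector picks between that and an externally supplied signal based on run feedback. A behavioural sketch (class and parameter names are illustrative assumptions):

```python
class JudgmentModule:
    """Register + arithmetic unit + selector, per claim 13."""

    def __init__(self, initial: int):
        self.current = initial  # the register holding the control signal

    def step(self, take_branch: bool, branch_signal: int) -> int:
        sequential = self.current + 1  # arithmetic unit: next-in-sequence signal
        # Selector: run feedback (take_branch) chooses which signal wins
        self.current = branch_signal if take_branch else sequential
        return self.current

jm = JudgmentModule(initial=0)
jm.step(take_branch=False, branch_signal=100)          # sequential path
state = jm.step(take_branch=True, branch_signal=100)   # feedback forces the received signal
```

This is the same structure as a program counter with branch selection, generalized to arbitrary control signals.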
14. The system as claimed in claim 1, characterized in that it further comprises:
one or more loop controllers, each loop controller corresponding to one loop body in the instruction sequence, for counting the executions of that loop body and determining whether the loop has finished; and
one or more data engines, divided into groups, each group including at least one data engine; each group of data engines corresponds to one loop controller, for computing the addresses of the data used in the loop body and controlling the memory to complete the data access operations;
the loop count in a loop controller is set by an instruction;
each time the corresponding loop instruction is executed, the loop count is decremented by one; and
after all iterations corresponding to the loop instruction have completed, the loop count is reset to its originally set value.
15. The system as claimed in claim 14, characterized in that whenever the loop instruction is executed, the data engine updates the data address and fetches the corresponding data according to the new data address, ready for use by the lane/processor core;
if the execution result of the loop instruction indicates that the loop continues, the data engine obtains the new data address by adding an address step to the data address; and
if the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its originally set value as the new data address.
16. The system as claimed in claim 14, characterized in that the data engine further comprises a first-in-first-out (FIFO) buffer;
once the data engine has been set up, it fetches the corresponding data according to the configured data address and stores it into the FIFO buffer for use by the lane/processor core;
each time a data fetch completes, the data address is updated and the corresponding data is fetched according to the new data address and stored into the FIFO buffer; whenever the loop instruction is executed, the FIFO buffer discards its earliest-stored data, so that the next-earliest-stored data becomes the new earliest-stored data; and
if the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its originally set value as the new data address and empties the FIFO buffer.
17. The system as claimed in claim 14, characterized in that the data engine further comprises a fusion module; after receiving the data that a lane/processor core writes to memory together with the corresponding address, the fusion module first reads the previously stored data from memory at that address, performs a computation on it with the data sent by the lane/processor core, and then writes the result back to memory at that address.
18. The system as claimed in claim 7, characterized in that:
each of the lanes/processor cores executes the same program, and a post-processor compares the execution results of the lanes/processor cores to determine whether any lane/processor core in the system is operating abnormally, thereby realizing self-test of the system;
when an abnormal lane/processor core exists, the lane/processor-core number of the abnormal lane/processor core is determined.
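The self-test of claim 18 amounts to majority voting over identical runs: all lanes execute the same program, and any lane whose result disagrees with the majority is flagged by its lane number. A sketch under that majority-vote assumption (the claim only says results are "compared"; the voting policy here is an illustrative choice):

```python
from collections import Counter

def self_test(results):
    """All lanes ran the same program; take the majority result as
    correct and report dissenting lane numbers as abnormal."""
    majority, _ = Counter(results).most_common(1)[0]
    return [lane for lane, r in enumerate(results) if r != majority]

bad_lanes = self_test([42, 42, 41, 42])  # lane 2 disagrees with the rest
```

The returned lane numbers are exactly what claim 19's allocator needs in order to bypass faulty lanes.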
19. The system as claimed in claim 18, characterized in that:
the lane/processor-core number of the abnormal lane/processor core is stored in an allocator;
when allocating lanes/processor cores, the allocator bypasses the abnormal lane/processor core, thereby realizing self-repair of the system.
20. A multi-lane/multi-core execution method, characterized in that each lane/processor core has a different lane/processor-core number, and each lane/processor core can execute the same or different instructions; a post-processing operation is then performed on the execution results of the plurality of lanes/processor cores, and memory is accessed.
21. The method as claimed in claim 20, characterized in that register data is transmitted over a global bus between the plurality of lanes/processor cores, to perform cross-lane/processor-core register value moves or computations.
22. The method as claimed in claim 21, characterized in that:
the lanes/processor cores in the multi-lane/multi-core system are divided into a plurality of lane/processor-core groups; the bus switches on the global bus within each lane/processor-core group are turned on, and the bus switches on the global bus between lane/processor-core groups are turned off, so that each lane/processor-core group simultaneously performs cross-lane/processor-core register value moves or computations within itself;
when different lanes/processor cores simultaneously execute the same aggregation-degree set instruction, the corresponding bus switches are configured accordingly, realizing the corresponding degree of aggregation among the lanes/processor cores.
23. The method as claimed in claim 22, characterized in that when different lanes/processor cores simultaneously execute the same inter-lane/processor-core operation instruction, the source lane/processor core and the target lane/processor core are determined according to the respective lane/processor-core numbers; the register value of the source lane/processor core is sent to the target lane/processor core over the inter-lane/processor-core bus, and the target lane/processor core performs the post-processing operation.
24. The method as claimed in claim 20, characterized in that an instruction moves the lane/processor-core number of each lane/processor core into a general-purpose register of that lane/processor core.
25. The method as claimed in claim 24, characterized in that different lanes/processor cores compute different data addresses from the same instruction according to their different lane/processor-core numbers.
26. The method as claimed in claim 20, characterized in that one or more post-processors are connected to a plurality of lanes/processor cores, receive the execution results of those lanes/processor cores, and aggregate the execution results.
27. The method as claimed in claim 26, characterized in that:
the post-processor stores the aggregation result directly into memory; or
the post-processor sends the aggregation result back into a register of at least one lane/processor core.
28. The method as claimed in claim 26, characterized in that the degree of aggregation of the post-processing operation is determined by an instruction.
29. The method as claimed in claim 28, characterized in that the post-processor performs the aggregation operation by executing a post-processing instruction.
30. The method as claimed in claim 29, characterized in that:
the post-processors are connected by a transmission bus; each post-processor, by executing the post-processing instruction, aggregates the execution result of its corresponding lane/processor core with the output of an adjacent post-processor; or
the post-processors are connected by a tree-shaped bus; wherein a first-level post-processor, by executing the post-processing instruction, aggregates the execution results of its two corresponding lanes/processor cores, and passes this aggregation result, together with the post-processing instruction or its decoded result, level by level to the remaining post-processors at each level; each remaining post-processor, by executing the post-processing instruction, further aggregates the aggregation results of its two corresponding post-processors in the previous level.
31. The method as claimed in claim 30, characterized in that the paths between lanes/processor cores and post-processors, and between post-processors, are configurable;
by configuring each path to be on or off, the plurality of post-processors are grouped, realizing group-wise aggregation; or
by configuring each path to be on or off, the post-processors realize aggregation operations of different degrees of aggregation.
32. The method as claimed in claim 20, characterized in that it further comprises a control method; the control method produces, from the control signal already output, the next control signal for continuing execution from the current state, and receives the next control signal for not continuing execution from the current state; then, according to feedback from the system running under the control of the already-output control signal, it selects one of these two next control signals to output, controlling the system to continue running;
the control method at least includes:
storing the current control signal, and using the current control signal to control system operation;
producing, from the current control signal, the next control signal for continuing execution from the current state; and
under the control of the current control signal, according to the system's running feedback, selecting between the next control signal for continuing execution from the current state and the received next control signal for not continuing execution from the current state, and updating the selection result as the current control signal.
33. The method as claimed in claim 20, characterized in that it further comprises:
counting the executions of a loop body by one or more loop controllers to determine whether the loop has finished, each loop controller corresponding to one loop body in the instruction sequence;
computing, by one or more data engines corresponding to each loop controller, the addresses of the data used in the corresponding loop body, and controlling the memory to complete the data access operations;
the loop count in a loop controller is set by an instruction;
each time the corresponding loop instruction is executed, the loop count is decremented by one; and
after all iterations corresponding to the loop instruction have completed, the loop count is reset to its originally set value.
34. The method as claimed in claim 33, characterized in that whenever the loop instruction is executed, the data engine updates the data address and fetches the corresponding data according to the new data address, ready for use by the lane/processor core;
if the execution result of the loop instruction indicates that the loop continues, the data engine obtains the new data address by adding an address step to the data address; and
if the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its originally set value as the new data address.
35. The method as claimed in claim 33, characterized in that a first-in-first-out (FIFO) buffer is used to buffer data;
once the data engine has been set up, it fetches the corresponding data according to the configured data address and stores it into the FIFO buffer for use by the lane/processor core;
each time a data fetch completes, the data address is updated and the corresponding data is fetched according to the new data address and stored into the FIFO buffer; whenever the loop instruction is executed, the FIFO buffer discards its earliest-stored data, so that the next-earliest-stored data becomes the new earliest-stored data; and
if the execution result of the loop instruction indicates that the loop has ended, the data engine resets the data address to its originally set value as the new data address and empties the FIFO buffer.
36. The method as claimed in claim 33, characterized in that a fusion module merges the data in memory with the data sent by the lane/processor core; after receiving the data that a lane/processor core writes to memory together with the corresponding address, the fusion module first reads the previously stored data from memory at that address, performs a computation on it with the data sent by the lane/processor core, and then writes the result back to memory at that address.
37. The method as claimed in claim 20, characterized in that:
each lane/processor core executes the same program, and a post-processor compares the execution results of the lanes/processor cores to determine whether any lane/processor core is operating abnormally, thereby realizing self-test of the system;
when an abnormal lane/processor core exists, the lane/processor-core number of the abnormal lane/processor core is determined.
38. The method as claimed in claim 37, characterized in that the lane/processor-core number of the abnormal lane/processor core is stored, and the abnormal lane/processor core is bypassed when allocating lanes/processor cores, thereby realizing self-repair of the system.
39. A method of executing a program on a multi-lane/multi-core system using normalized lane/processor-core numbers, characterized in that each lane/processor core corresponds to a normalized lane/processor-core number.
40. The method as claimed in claim 39, characterized in that when a plurality of lanes/processor cores execute a loop program, each loop iteration triggers an update of the data address, and memory reads or writes are performed according to the new data address, thereby avoiding the appearance of explicit data access instructions in the loop program.
41. The method as claimed in claim 40, characterized in that the plurality of lanes/processor cores execute the same data-engine set instruction in parallel; the data engine computes and produces one or more data addresses according to the configuration information, and performs memory reads or writes according to those data addresses.
42. The method as claimed in claim 41, characterized in that the data engine computes the starting data address corresponding to each lane/processor core according to at least the normalized lane/processor-core number corresponding to each lane/processor core and an address gap.
43. The method as claimed in claim 41, characterized in that the data engine computes the data address corresponding to each iteration according to at least the address step corresponding to each lane/processor core.
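Claims 42 and 43 split address generation into two parts: a per-lane starting address derived from the normalized lane number and an address gap, then a per-iteration advance by an address step. A sketch of both rules together (the base/gap/step values and function names are illustrative):

```python
def lane_start_address(base: int, norm_lane_id: int, gap: int) -> int:
    """Claim-42 style: starting address from the normalized lane
    number and the address gap between lanes."""
    return base + norm_lane_id * gap

def iteration_addresses(start: int, step: int, iterations: int):
    """Claim-43 style: each iteration advances by the address step."""
    return [start + i * step for i in range(iterations)]

# Lane with normalized number 2, gap 0x100 between lanes, step 4 per iteration:
start = lane_start_address(0x8000, 2, 0x100)
addrs = iteration_addresses(start, 4, 3)
```

Together the two rules give each lane its own stripe of memory, walked element by element as the loop runs, with no load/store address arithmetic in the lane's own instruction stream.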
44. The method as claimed in claim 40, characterized in that, when loops are nested in multiple levels, the plurality of lanes/processor cores execute the same loop set instruction in parallel, configuring the loop controller by instruction; the configuration information contained in the loop set instruction at least includes the loop count of the loop body;
the plurality of lanes/processor cores also execute the same loop instruction in parallel; when the loop instruction is executed, the loop controller performs the corresponding counting; according to this count:
if the number of completed iterations is less than the loop count of the loop body, the loop controller directs fetching of the first instruction of the loop body for execution, thus repeating the loop body; and
if the number of completed iterations equals the loop count of the loop body, the loop controller directs fetching of the instruction at the next sequential address after the loop body for execution, thus ending execution of the loop body.
45. The method as claimed in claim 44, characterized in that each loop controller cooperates with one or more data engines; each iteration triggers the data engines to compute new data addresses, and memory reads or writes are performed according to the new data addresses, thereby avoiding the appearance of explicit data access instructions in the loop program.
46. The method as claimed in claim 45, characterized in that execution of the loop instruction is the trigger condition.
47. The method as claimed in claim 39, characterized in that a program that needs to be executed multiple times is executed in parallel by a plurality of lanes/processor cores after two-dimensional expansion; the number of times it is to be executed is the degree of parallelism; the two-dimensional expansion includes spatial expansion and temporal expansion; wherein:
spatial expansion means that a plurality of lanes/processor cores simultaneously execute the same instructions on different data, so that the program is expanded in the spatial dimension and each lane/processor core executes the program in parallel; and
temporal expansion means that when the number of available lanes/processor cores is less than the degree of parallelism, the plurality of lanes/processor cores execute the program multiple times, so that the program is expanded in the time dimension and the plurality of lanes/processor cores execute the program serially in succession.
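The two-dimensional expansion of claim 47 assigns each of the `parallelism` program instances a position along two axes: which lane runs it (spatial) and in which round (temporal). A compact sketch of that mapping (function and parameter names are illustrative):

```python
def expand_2d(parallelism: int, lanes: int):
    """Assign each program instance a (round, lane) slot: up to `lanes`
    instances run side by side per round (spatial expansion); extra
    instances spill into later rounds (temporal expansion)."""
    return [(i // lanes, i % lanes) for i in range(parallelism)]

# Degree of parallelism 10 on 4 available lanes: three rounds, last one partial.
slots = expand_2d(parallelism=10, lanes=4)
```

When `parallelism <= lanes` everything lands in round 0 and the expansion is purely spatial; otherwise the same lanes re-execute the program in successive rounds, which is the temporal expansion the claim describes.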
48. The method as claimed in claim 47, characterized in that the lane allocator successively deducts from the available spatial resource count according to the spatial resource count requested by the program, mapping the demand for spatial resources onto the use of temporal resources.
49. The method as claimed in claim 48, characterized in that:
the lane allocator outputs the normalized lane number corresponding to the reference lane/processor core, thereby determining the starting data address of the data engine in the reference lane/processor core, which defines the spatial starting point of the spatial expansion; and
the lane allocator uses the available lane/processor-core count to define the spatial scale of the current spatial expansion, and computes a temporal scale to control time/space conversion during lane/processor-core execution.
50. The method as claimed in claim 49, characterized in that adjusting the normalized lane number of the reference lane/processor core determines which part of the spatial expansion the current round of temporal expansion executes.
51. The method as claimed in claim 47, characterized in that:
the degree-of-parallelism demand is explicitly specified by the program or instruction sequence being run;
when running the program or instruction sequence, the multi-lane/multi-core system automatically allocates lanes/processor cores according to the degree-of-parallelism demand; and
when the number of available lanes/processor cores cannot meet the degree-of-parallelism demand, the multi-lane/multi-core system executes the program several times in a loop to meet the demand.
52. The method as claimed in claim 51, characterized in that:
when compiling the program, the compiler determines the maximum degree of parallelism for parallel execution of the loop body, and produces a parallelism set instruction containing that maximum degree of parallelism; and
when the multi-lane/multi-core system executes the loop program, the lane allocator executes the parallelism set instruction, allocates lanes/processor cores according to the available lane/processor-core count, determines the number of lanes/processor cores participating in parallel execution, and determines the number of times those lanes/processor cores execute the program loop.
53. The method as claimed in claim 52, characterized in that the remaining degree of parallelism is obtained by subtracting, from the program's degree of parallelism, the number of instances already expanded and executed in parallel; when the remaining degree of parallelism is less than the number of available lanes/processor cores, the lane allocator allocates the corresponding number of lanes/processor cores to execute the program in parallel, and once this execution completes, the entire program has finished executing.
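The allocator behaviour of claims 52 and 53 is a simple deduction loop: run full rounds of `available_lanes` instances, subtracting each round from the remaining degree of parallelism, until the remainder fits in one final, possibly partial, round. A sketch of that schedule (names are illustrative):

```python
def schedule_rounds(parallelism: int, available_lanes: int):
    """Lane-allocator sketch: deduct the instances executed in parallel
    each round from the remaining degree of parallelism until none
    remain; returns the lane count used in each round."""
    rounds = []
    remaining = parallelism
    while remaining > 0:
        lanes_this_round = min(remaining, available_lanes)
        rounds.append(lanes_this_round)
        remaining -= lanes_this_round
    return rounds

rounds = schedule_rounds(parallelism=10, available_lanes=4)
```

Here a compiler-declared parallelism of 10 on 4 lanes yields two full rounds and a final round using only 2 lanes, after which, as claim 53 puts it, the entire program has finished executing.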
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410781446.2A CN105893319A (en) | 2014-12-12 | 2014-12-12 | Multi-lane/multi-core system and method |
PCT/CN2015/096769 WO2016091164A1 (en) | 2014-12-12 | 2015-12-09 | Multilane/multicore system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410781446.2A CN105893319A (en) | 2014-12-12 | 2014-12-12 | Multi-lane/multi-core system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105893319A true CN105893319A (en) | 2016-08-24 |
Family
ID=56106715
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410781446.2A Pending CN105893319A (en) | 2014-12-12 | 2014-12-12 | Multi-lane/multi-core system and method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN105893319A (en) |
WO (1) | WO2016091164A1 (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107179895A (en) * | 2017-05-17 | 2017-09-19 | 北京中科睿芯科技有限公司 | A kind of method that application compound instruction accelerates instruction execution speed in data flow architecture |
CN109189476A (en) * | 2018-09-19 | 2019-01-11 | 郑州云海信息技术有限公司 | The control stream of FPGA executes method, apparatus, equipment and medium |
CN109669682A (en) * | 2018-12-18 | 2019-04-23 | 上海交通大学 | Mapping method based on general reconfigurable processor DBSS and MBSS |
CN111158757A (en) * | 2019-12-31 | 2020-05-15 | 深圳芯英科技有限公司 | Parallel access device and method and chip |
CN111860804A (en) * | 2019-04-27 | 2020-10-30 | 中科寒武纪科技股份有限公司 | Fractal calculation device and method, integrated circuit and board card |
CN114328592A (en) * | 2022-03-16 | 2022-04-12 | 北京奥星贝斯科技有限公司 | Aggregation calculation method and device |
CN115269455A (en) * | 2022-09-30 | 2022-11-01 | 湖南兴天电子科技股份有限公司 | Disk data read-write control method and device based on FPGA and storage terminal |
TWI805731B (en) * | 2019-04-09 | 2023-06-21 | 韓商愛思開海力士有限公司 | Multi-lane data processing circuit and system |
US11841822B2 (en) | 2019-04-27 | 2023-12-12 | Cambricon Technologies Corporation Limited | Fractal calculating device and method, integrated circuit and board card |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11397624B2 (en) * | 2019-01-22 | 2022-07-26 | Arm Limited | Execution of cross-lane operations in data processing systems |
TWI825315B (en) * | 2020-05-08 | 2023-12-11 | 安圖斯科技股份有限公司 | Assigning method and assigning system for graphic resource |
CN113722085B (en) * | 2020-05-26 | 2024-04-30 | 安图斯科技股份有限公司 | Distribution method and distribution system of graphic resources |
CN112307431B (en) * | 2020-11-09 | 2023-10-27 | 哲库科技(上海)有限公司 | VDSP, data processing method and communication equipment |
CN114816734B (en) * | 2022-03-28 | 2024-05-10 | 西安电子科技大学 | Cache bypass system based on memory access characteristics and data storage method thereof |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5642442A (en) * | 1995-04-10 | 1997-06-24 | United Parcel Services Of America, Inc. | Method for locating the position and orientation of a fiduciary mark |
CN101299199A (en) * | 2008-06-26 | 2008-11-05 | 上海交通大学 | Heterogeneous multi-core system based on configurable processor and instruction set extension |
US7535844B1 (en) * | 2004-01-28 | 2009-05-19 | Xilinx, Inc. | Method and apparatus for digital signal communication |
CN101477512A (en) * | 2009-01-16 | 2009-07-08 | 中国科学院计算技术研究所 | Processor system and its access method |
CN101561766A (en) * | 2009-05-26 | 2009-10-21 | 北京理工大学 | Low-expense block synchronous method supporting multi-core assisting thread |
CN101719105A (en) * | 2009-12-31 | 2010-06-02 | 中国科学院计算技术研究所 | Optimization method and optimization system for memory access in multi-core system |
CN102362256A (en) * | 2010-04-13 | 2012-02-22 | 华为技术有限公司 | Method and device for processing common data structure |
CN102576314A (en) * | 2009-07-27 | 2012-07-11 | 先进微装置公司 | Mapping processing logic having data parallel threads across processors |
TW201301032A (en) * | 2011-06-24 | 2013-01-01 | Kenneth Cheng-Hao Lin | High-performance cache system and method |
CN102880594A (en) * | 2012-10-17 | 2013-01-16 | 电子科技大学 | Parallel matrix full-selected primary element Gauss-Jordan inversion algorithm based on multi-core DSP (Digital Signal Processor) |
CN103365821A (en) * | 2013-06-06 | 2013-10-23 | 北京时代民芯科技有限公司 | Address generator of heterogeneous multi-core processor |
CN103383654A (en) * | 2012-05-03 | 2013-11-06 | 百度在线网络技术(北京)有限公司 | Method and device for adjusting mappers to execute on multi-core machine |
CN103731386A (en) * | 2014-01-02 | 2014-04-16 | 北京邮电大学 | High-speed modulation method based on GPP and SIMD technologies |
US20140122841A1 (en) * | 2012-10-31 | 2014-05-01 | International Business Machines Corporation | Efficient usage of a register file mapper and first-level data register file |
US8749561B1 (en) * | 2003-03-14 | 2014-06-10 | Nvidia Corporation | Method and system for coordinated data execution using a primary graphics processor and a secondary graphics processor |
CN104050092A (en) * | 2013-03-15 | 2014-09-17 | 上海芯豪微电子有限公司 | Data caching system and method |
- 2014-12-12: CN application CN201410781446.2A filed (publication CN105893319A, status: Pending)
- 2015-12-09: WO application PCT/CN2015/096769 filed (publication WO2016091164A1, Application Filing)
Non-Patent Citations (1)
Title |
---|
Liu Huahai: "Research on Key Technologies of Intra-Node Multi-CPU/Multi-GPU Cooperative Parallel Rendering", China Doctoral Dissertations Full-text Database, Information Science and Technology Series * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107179895B (en) * | 2017-05-17 | 2020-08-28 | 北京中科睿芯科技有限公司 | Method for accelerating instruction execution speed in data stream structure by applying composite instruction |
CN107179895A (en) * | 2017-05-17 | 2017-09-19 | 北京中科睿芯科技有限公司 | A method of applying compound instructions to accelerate instruction execution speed in a dataflow architecture
CN109189476B (en) * | 2018-09-19 | 2021-10-29 | 郑州云海信息技术有限公司 | Control flow execution method, device, equipment and medium of FPGA |
CN109189476A (en) * | 2018-09-19 | 2019-01-11 | 郑州云海信息技术有限公司 | Control flow execution method, apparatus, device and medium for an FPGA
CN109669682A (en) * | 2018-12-18 | 2019-04-23 | 上海交通大学 | Mapping method based on general reconfigurable processor DBSS and MBSS |
TWI805731B (en) * | 2019-04-09 | 2023-06-21 | 韓商愛思開海力士有限公司 | Multi-lane data processing circuit and system |
CN111860804B (en) * | 2019-04-27 | 2022-12-27 | 中科寒武纪科技股份有限公司 | Fractal calculation device and method, integrated circuit and board card |
CN111860804A (en) * | 2019-04-27 | 2020-10-30 | 中科寒武纪科技股份有限公司 | Fractal calculation device and method, integrated circuit and board card |
US11841822B2 (en) | 2019-04-27 | 2023-12-12 | Cambricon Technologies Corporation Limited | Fractal calculating device and method, integrated circuit and board card |
US12026606B2 (en) | 2019-04-27 | 2024-07-02 | Cambricon Technologies Corporation Limited | Fractal calculating device and method, integrated circuit and board card |
US12093811B2 (en) | 2019-04-27 | 2024-09-17 | Cambricon Technologies Corporation Limited | Fractal calculating device and method, integrated circuit and board card |
CN111158757B (en) * | 2019-12-31 | 2021-11-30 | 中昊芯英(杭州)科技有限公司 | Parallel access device and method and chip |
CN111158757A (en) * | 2019-12-31 | 2020-05-15 | 深圳芯英科技有限公司 | Parallel access device and method and chip |
CN114328592A (en) * | 2022-03-16 | 2022-04-12 | 北京奥星贝斯科技有限公司 | Aggregation calculation method and device |
CN114328592B (en) * | 2022-03-16 | 2022-05-06 | 北京奥星贝斯科技有限公司 | Aggregation calculation method and device |
CN115269455A (en) * | 2022-09-30 | 2022-11-01 | 湖南兴天电子科技股份有限公司 | Disk data read-write control method and device based on FPGA and storage terminal |
CN115269455B (en) * | 2022-09-30 | 2022-12-23 | 湖南兴天电子科技股份有限公司 | Disk data read-write control method and device based on FPGA and storage terminal |
Also Published As
Publication number | Publication date |
---|---|
WO2016091164A1 (en) | 2016-06-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105893319A (en) | Multi-lane/multi-core system and method | |
CN103635875B (en) | Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines | |
CN103547993B (en) | Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines | |
CN109597646A (en) | Processors, methods and systems with a configurable spatial accelerator | |
Teflioudi et al. | Distributed matrix completion | |
CN102902512B (en) | A multi-threaded parallel processing method based on multi-thread programming and message queues | |
CN103218208B (en) | System and method for implementing shaped memory access operations | |
CN108804220A (en) | A method for satellite task planning based on parallel computing | |
CN110476174A (en) | Connectivity between devices including a neural network processor | |
CN104424158A (en) | General unit-based high-performance processor system and method | |
CN105190541A (en) | A method for executing blocks of instructions using a microprocessor architecture having a register view, source view, instruction view, and a plurality of register templates | |
Uchida et al. | An efficient GPU implementation of ant colony optimization for the traveling salesman problem | |
CN108268278A (en) | Processors, methods and systems with a configurable spatial accelerator | |
CN108509270A (en) | A high-performance parallel implementation method of the K-means algorithm on the domestic Sunway 26010 many-core processor | |
CN104035751A (en) | Graphics processing unit based parallel data processing method and device | |
CN103562866A (en) | Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines | |
CN106227507A (en) | Computing system and controller thereof | |
CN105468439B (en) | An adaptive parallel method for fixed-radius neighbor traversal under a CPU-GPU heterogeneous framework | |
KR20130090147A (en) | Neural network computing apparatus and system, and method thereof | |
CN101855614A (en) | Multi-core processor having a hierarchical microcode store | |
CN102508820B (en) | Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graphics Processing Unit) | |
CN101717817A (en) | Method for accelerating RNA secondary structure prediction based on stochastic context-free grammar | |
CN108205704A (en) | A neural network chip | |
Pan et al. | GPU-based parallel collision detection for real-time motion planning | |
CN105373367A (en) | Vector single-instruction multiple-data (SIMD) operation structure supporting cooperative scalar and vector operation | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
DD01 | Delivery of document by public notice | ||
Addressee: SHANGHAI XINHAO MICROELECTRONICS Co.,Ltd.
Document name: First Notification of an Office Action
|
CB02 | Change of applicant information | ||
Address after: 501, No. 14, Lane 328, Yuqing Road, Pudong New Area, Shanghai 201203
Applicant after: SHANGHAI XINHAO MICROELECTRONICS Co.,Ltd.
Address before: Room 1202, Block B, No. 1398 Siping Road, Yangpu District, Shanghai 200092
Applicant before: SHANGHAI XINHAO MICROELECTRONICS Co.,Ltd.
|
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20160824 |