CN105468568B - Efficient coarse-grained reconfigurable computing system - Google Patents
- Publication number: CN105468568B
- Application number: CN201510779977.2A
- Authority: CN (China)
- Legal status: Expired - Fee Related (the listed status is an assumption and is not a legal conclusion)
Classifications
- G06F15/7867: Architectures of general purpose stored program computers comprising a single central processing unit with reconfigurable architecture
- G06F13/1663: Access to shared memory
- G06F15/7871: Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
- G06F2213/1602: Memory access type
Abstract
The invention discloses a coarse-grained reconfigurable computing system for executing the serial portion and the parallel portion of an application's source code, where the parallel portion is converted into configuration information. The system comprises a general-purpose processor core, a coarse-grained reconfigurable array, a main memory, a shared memory, and a configuration information memory. The coarse-grained reconfigurable array executes the parallel portion and comprises multiple execution units arranged in an array. Each execution unit includes three multiplexers, an arithmetic unit, and a register file; the multiplexers receive input data, and the arithmetic unit performs the operation and outputs the result to outside the array, to any execution unit in the next row, and to the register file. The system suits a wide range of application types, has low hardware cost while maintaining good performance, and saves configuration time, which improves efficiency.
Description
Technical field
The present invention relates to the field of processor architecture design, and in particular to an efficient coarse-grained reconfigurable computing system.
Background
As microelectronic process technology has advanced, semiconductor devices have approached their physical limits. Moore's Law, which guided the semiconductor industry for many years, has ceased to hold, and microprocessor clock frequencies are difficult to raise further. At the same time, process advances have rapidly increased on-chip integration density. Processor architecture development has therefore shifted from pursuing higher clock frequencies to making better use of increasingly abundant on-chip resources.
A reconfigurable computing architecture, also called a reconfigurable architecture or reconfigurable system, differs from the traditional von Neumann architecture: it changes circuit function by statically or dynamically changing the circuit structure and interconnections, in clear contrast to a von Neumann architecture, which changes function by changing the executed instruction stream. Static reconfigurable architectures are currently represented mainly by FPGAs and are relatively mature. This document concerns dynamic coarse-grained reconfigurable architectures. "Dynamic" means the architecture can change circuit structure and function during computation, which is more flexible than a statically reconfigurable FPGA. "Coarse-grained" means function changes at a granularity of at least one byte (8 bits), rather than the bit-level fine-grained reconfiguration of an FPGA. The advantage of coarse granularity is that the amount of configuration information drops sharply, which reduces the cost of reconfiguration; this is the fundamental reason coarse-grained architectures are better suited than FPGAs to dynamic reconfiguration.
Early research on reconfigurable systems began in the 1960s, but the technology of the time could not integrate enough resources on a chip, so progress was slow. Recent advances in semiconductor process technology allow very rich on-chip resources, and coarse-grained reconfigurable systems have regained attention. Coarse-grained reconfigurable architectures offer higher performance than general-purpose processors and more flexibility than application-specific integrated circuits, and have become a research focus. Researchers hope that exploring reconfigurable architectures will address the resource-utilization, communication, and power problems facing current computing architectures.
Although research on coarse-grained reconfigurable architectures began in the 1960s, it only became a hot topic at the beginning of this century, when a series of such architectures appeared, and work has continued since. Garp is an early coarse-grained reconfigurable architecture proposed at the University of California, Berkeley; its system block diagram is shown in Fig. 1. It consists of a MIPS processor plus a 32-bit reconfigurable computing array and mainly targets compute-intensive applications (see: Callahan T.J., Hauser J.R., Wawrzynek J. The Garp architecture and C compiler [J]. Computer, 2000, 33(4): 62-69). As Fig. 1 shows, the reconfigurable array is connected to the main processor and memory through a crossbar switch (CrossBar), so the array can read data quickly from the cache. However, no memory is dedicated to the reconfigurable array, which may limit data-transfer performance and thus the performance of the whole system.
The PipeRench architecture from Carnegie Mellon University organizes multiple execution units (PEs) into a pipelined structure, with an interconnection network linking the pipeline stages (see: Goldstein S.C., Schmit H., Budiu M., Cadambi S., Moe M., Taylor R.R. PipeRench: a reconfigurable architecture and compiler [J]. Computer, 2000, 33(4): 70-77). As shown in Figs. 2 and 3, the advantage of PipeRench is very efficient communication between pipeline stages. However, judging from the published data, it has no general-purpose processor core, which limits its use in applications beyond stream processing.
MorphoSys, from the University of California, Irvine, is another widely noted coarse-grained reconfigurable architecture; its system block diagram is shown in Fig. 4 (see: Singh H., Lee M.-H., Lu G., Bagherzadeh N., Kurdahi F.J., Filho E.M.C. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications [J]. IEEE Transactions on Computers, 2000, 49(5): 465-481). MorphoSys consists of a simple RISC general-purpose processor core and an 8x8 reconfigurable array (RC Array). The array has a small Frame Buffer as its local memory, and the Frame Buffer communicates with external main memory via DMA. Configuration information is stored in the Context Memory. The data path between the reconfigurable array and the Frame Buffer is 64 bits wide, and the data path to the Context Memory is 256 bits wide, which partly overcomes insufficient internal data-transfer bandwidth. However, the array uses a partial-mesh interconnect, whose transfer efficiency is inadequate.
As shown in Fig. 5, the PACT XPP from PACT organizes processing units into processing-unit clusters (PACs). Four PACs are connected through a supervising configuration manager, and each PAC has a data path to the outside. The drawback of this structure is that PACT XPP can only serve as an accelerator for compute-intensive applications; system control code and other kinds of code lack the support of a general-purpose processor core.
As shown in Fig. 6, the ADRES architecture from IMEC uses a more elaborate organization: the top-layer processing units form a VLIW processor, and the remaining processing units form the reconfigurable array. This tightly coupled architecture has a simple structure and simple data paths. Its drawbacks are that the tightly coupled VLIW design makes the companion compiler difficult to develop, and the lack of array-local memory makes certain kinds of applications inefficient.
MORPHEUS, developed in Europe in recent years, is a very complex reconfigurable processor, but it cannot be called a coarse-grained reconfigurable processor because it integrates on one chip a general-purpose processor core, XPP coarse-grained units, and medium-grained and even fine-grained FPGA units; its excessive architectural complexity limits its application (see: Thoma F., Kuhnle M., Bonnot P., Panainte E.M., Bertels K., Goller S., Schneider A., Guyetant S., Schuler E., Muller-Glaser K.D., Becker J. MORPHEUS: Heterogeneous Reconfigurable Computing. International Conference on Field Programmable Logic and Applications (FPL'07). 2007: 409-414).
EGRA is an expression-grained reconfigurable architecture (see: Ansaloni G., Bonzini P., Pozzi L. EGRA: A Coarse Grained Reconfigurable Architectural Template [J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2011, 19(6): 1062-1074); its block diagram is shown in Fig. 7. It groups ALU clusters, memory-access units (Mem), multipliers (Mult), and similar units together. It is still positioned as an accelerator that is hard to use standalone, and the layout of its unit types is fixed, which may reduce configuration flexibility.
Besides these architectures proposed by researchers abroad, domestic institutions have also studied coarse-grained reconfigurable architectures in depth in recent years. For example, REmus II, proposed by Tsinghua University and others, is a heterogeneous reconfigurable computing architecture with two 16x16 arrays and dedicated accelerator units; it achieves good speedups, but its hardware complexity is too high.
It can be seen that the prior art described above has the following defects:
1. Memory wall: the system architecture is unbalanced, the bandwidth of the path to main memory is insufficient, local memory inside the reconfigurable array is missing, and memory access is inflexible, so memory access becomes the system bottleneck; even with many computing units, the achievable speedup is limited;
2. Low configuration efficiency: although most reconfigurable architectures support dynamic reconfiguration, a single configuration takes too many cycles, so dynamic reconfiguration cannot be performed frequently;
3. Inflexible interconnect: most designs use a mesh network or a variant of it; units near the center of the array have difficulty accessing memory, and the unit types in the mesh may differ and fail to match application requirements, which hinders the mapping of more applications onto the reconfigurable architecture.
Therefore, those skilled in the art have sought to develop a coarse-grained reconfigurable computing system that ensures efficient data access for the reconfigurable array.
Summary of the invention
To achieve the above object, the present invention provides a coarse-grained reconfigurable computing system for executing the serial portion and the parallel portion of an application's source code, the parallel portion being converted into configuration information. The system comprises a general-purpose processor core, a coarse-grained reconfigurable array, a main memory, a shared memory, and a configuration information memory. The general-purpose processor core is connected to and communicates with the coarse-grained reconfigurable array, the main memory, the shared memory, and the configuration information memory; the shared memory and the configuration information memory can both exchange data with the main memory. The general-purpose processor core executes the serial portion and instructs the coarse-grained reconfigurable array to execute the parallel portion. The main memory stores the configuration information, the input data required to execute the parallel portion, and the output data produced by executing it. The shared memory obtains the input data from the main memory for the coarse-grained reconfigurable array to read, and receives the array's operation results so that they can be stored to the main memory as the output data. The configuration information memory obtains the configuration information from the main memory for the coarse-grained reconfigurable array to read.
The coarse-grained reconfigurable array comprises m × n execution units arranged in m rows and n columns;
Each execution unit comprises a first multiplexer, a second multiplexer, a third multiplexer, an arithmetic unit, and a register file. In any execution unit of row i, 1 ≤ i ≤ m:
The first inputs of the first, second, and third multiplexers all receive the input data;
The second inputs of the first, second, and third multiplexers are connected to the first, second, and third outputs of the local register file, respectively;
When 2 ≤ i ≤ m, the third inputs of the first, second, and third multiplexers are each connected through a row crossbar switch to the outputs of the arithmetic units of the execution units in row i-1; when i = 1, the third inputs of the three multiplexers are left unconnected;
The control inputs of the first, second, and third multiplexers all receive selection signals from the configuration information;
The output of the first multiplexer is connected to the first input of the arithmetic unit, the output of the second multiplexer to its second input, and the output of the third multiplexer to its third input;
The control input of the arithmetic unit receives the operation instruction from the configuration information; the arithmetic unit operates on its first, second, and third inputs according to the operation instruction and outputs the result from its output to outside the array, to any execution unit in row i+1, and to the register file.
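The execution-unit structure described above can be sketched as a small behavioral model in Python. This is only an illustration under stated assumptions: the select codes `EXTERNAL`, `REGFILE`, and `PREV_ROW`, the operation names in `OPS`, the 16-entry register file, and the `step` method are hypothetical and do not come from the patent, which specifies only the three multiplexer inputs, the three-input arithmetic unit, and the destinations of its result.

```python
from typing import Callable, Dict

# Hypothetical select codes, one per multiplexer input described in the text.
EXTERNAL, REGFILE, PREV_ROW = 0, 1, 2

# Hypothetical operation set; the source does not enumerate the operations.
OPS: Dict[str, Callable[[int, int, int], int]] = {
    "add": lambda a, b, c: a + b,
    "mul": lambda a, b, c: a * b,
    "mac": lambda a, b, c: a * b + c,
}


class ExecutionUnit:
    """Behavioral sketch of one execution unit in row `row` (1-based)."""

    def __init__(self, row: int, num_regs: int = 16) -> None:
        self.row = row
        self.regs = [0] * num_regs  # local register file (size is an assumption)

    def _mux(self, select, external, reg_value, prev_row_value):
        # Input 1: external input data; input 2: local register file;
        # input 3: a result from row i-1 (unconnected when i = 1).
        if select == EXTERNAL:
            return external
        if select == REGFILE:
            return reg_value
        if select == PREV_ROW:
            if self.row == 1 or prev_row_value is None:
                raise ValueError("third multiplexer input is unconnected")
            return prev_row_value
        raise ValueError(f"unknown select code {select}")

    def step(self, op, selects, external, reg_reads, prev_row, write_reg=None):
        # Each of the three multiplexers feeds one arithmetic-unit input.
        operands = [
            self._mux(selects[k], external[k], self.regs[reg_reads[k]], prev_row[k])
            for k in range(3)
        ]
        result = OPS[op](*operands)
        if write_reg is not None:
            self.regs[write_reg] = result  # optional write-back to the register file
        return result  # also available to row i+1 and to outside the array
```

For example, a row-1 unit can add two external operands, and a row-2 unit can then select that result through `PREV_ROW` as one operand of a multiply-accumulate.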
Further, the m × n execution units are connected through m+1 row crossbar switches, a first column crossbar switch, and a second column crossbar switch, all of which transmit data;
Each row of execution units lies between two row crossbar switches: the first inputs and third inputs of the first, second, and third multiplexers of the n execution units are all connected to one of the two row crossbar switches, and the outputs of the arithmetic units of the n execution units are all connected to the other;
The first column crossbar switch is connected to the first inputs of the first, second, and third multiplexers of every execution unit and to the output of every execution unit; the second column crossbar switch is connected to the control inputs of the first, second, and third multiplexers of every execution unit and to the control input of the arithmetic unit of every execution unit;
The first column crossbar switch is connected to the shared memory, and the second column crossbar switch is connected to the configuration information memory.
Further, the row crossbar switches, the first column crossbar switch, and the second column crossbar switch each consist of address lines and data lines.
Further, the general-purpose processor core is connected to the coarse-grained reconfigurable array, the main memory, the shared memory, and the configuration information memory through a Wishbone bus.
Further, the shared memory and the configuration information memory both exchange data with the main memory through DMA transfers.
Further, the general-purpose processor core is the open-source OR1200 processor core.
Further, m is 8 and n is 8.
Further, the configuration information of the array gives each execution unit one configuration word, and the configuration word is 40 bits long.
Further, in the configuration word of an execution unit:
Bit 39 is reserved;
Bit 38, when 1, indicates that the configuration word is a valid configuration word;
Bits 37-32 give the number of the execution unit;
Bits 31-26 give the type of arithmetic or logic operation of the operation instruction of the execution unit's arithmetic unit;
Bit 25 indicates the type of the first input of the arithmetic unit: either the input comes from this execution unit, or it comes from another execution unit in the soft-error-tolerant coarse-grained reconfigurable array;
Bits 24-21 identify the input indicated by bit 25; when bit 25 is 1, bits 24-21 give the register number in this execution unit's register file;
Bits 20-19 indicate the type of the second input of the arithmetic unit: the input comes from this execution unit, the input comes from another execution unit in the soft-error-tolerant coarse-grained reconfigurable array, the input is the immediate of a two-input instruction, or the input is the immediate of a three-input instruction;
Bits 18-10 give the register-file number of the input indicated by bits 20-19, or the immediate of a three-input instruction;
Bit 9 indicates, when the first, second, and third inputs of the arithmetic unit all carry inputs, the type of the third input: either the input comes from this execution unit, or it comes from another execution unit in the soft-error-tolerant coarse-grained reconfigurable array;
Bits 8-5 give the register-file number of the input indicated by bit 9;
Bit 4 indicates the output type of the arithmetic unit's operation result: when bit 4 is 1, the result is written to this execution unit's register file; otherwise it is output to other execution units in the soft-error-tolerant coarse-grained reconfigurable array;
Bits 3-0 give the register number when the operation result is written to this execution unit's register file.
Further, when bits 20-19 indicate that the second input is the immediate of a two-input instruction, bits 18-10, bit 9, and bits 8-5 together represent the second input.
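The bit layout above can be illustrated with a small decoder. The following Python sketch extracts each field from a 40-bit configuration word by position only; the field names, the `ConfigWord` type, and the `decode` function are hypothetical, and the sketch deliberately leaves the two-bit code in bits 20-19 uninterpreted because the source does not state which value selects which input type. Bits 18-5 are also exposed as a single 14-bit value, matching the statement that bits 18-10, bit 9, and bits 8-5 together carry the immediate of a two-input instruction.

```python
from dataclasses import dataclass

WORD_BITS = 40


def field(word: int, hi: int, lo: int) -> int:
    """Return bits hi..lo (inclusive) of word as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


@dataclass(frozen=True)
class ConfigWord:
    valid: bool           # bit 38
    pe_id: int            # bits 37-32
    opcode: int           # bits 31-26
    in1_type: int         # bit 25
    in1_sel: int          # bits 24-21
    in2_type: int         # bits 20-19 (encoding not given in the source)
    in2_sel: int          # bits 18-10
    in3_type: int         # bit 9
    in3_sel: int          # bits 8-5
    out_to_regfile: bool  # bit 4
    out_reg: int          # bits 3-0
    imm14: int            # bits 18-5 read as one value (two-input immediate)


def decode(word: int) -> ConfigWord:
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("configuration word must fit in 40 bits")
    return ConfigWord(
        valid=bool(field(word, 38, 38)),
        pe_id=field(word, 37, 32),
        opcode=field(word, 31, 26),
        in1_type=field(word, 25, 25),
        in1_sel=field(word, 24, 21),
        in2_type=field(word, 20, 19),
        in2_sel=field(word, 18, 10),
        in3_type=field(word, 9, 9),
        in3_sel=field(word, 8, 5),
        out_to_regfile=bool(field(word, 4, 4)),
        out_reg=field(word, 3, 0),
        imm14=field(word, 18, 5),
    )
```

The field widths sum to 40 bits (1 + 1 + 6 + 6 + 1 + 4 + 2 + 9 + 1 + 4 + 1 + 4), and the 6-bit execution-unit number is enough to address the 64 units of an 8 × 8 array.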
Further, the soft-error-tolerant coarse-grained reconfigurable array is implemented in the SystemC language.
The present invention has the following advantages:
1. The structure includes both a general-purpose processor core and a coarse-grained reconfigurable array, so it applies to a wider range of application types;
2. A shared memory is provided that the reconfigurable array reads and writes directly; its internal port is up to 128 bits wide, wider than the 64-bit width of architectures such as MorphoSys, so it supplies more data per access, and it is extensible, giving better performance;
3. Like the shared memory, the configuration memory uses an asymmetric design, which keeps the connection to the bus interface convenient while ensuring efficient data access by the reconfigurable array; the interface through which the array reads configuration information is up to 320 bits wide, giving good performance at moderate complexity;
4. Internal data interfaces make wide use of high-speed crossbar switch (Crossbar) connections, and overly complex connections use hierarchical crossbars, which gives good performance without excessive hardware complexity;
5. With carefully designed configuration information, the system supports differential configuration: each reconfiguration reconfigures only the execution units (PEs) that have changed and leaves the remaining PEs untouched, which saves configuration time and improves efficiency.
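The differential configuration named in advantage 5 can be illustrated with a short Python sketch. It assumes a configuration is modeled as a mapping from execution-unit number to configuration word; the functions `diff_config` and `apply_config` are hypothetical names, and the patent does not describe how the hardware detects or encodes the changed units.

```python
from typing import Dict


def diff_config(current: Dict[int, int], new: Dict[int, int]) -> Dict[int, int]:
    """Return only the configuration words that differ from the current ones."""
    return {pe: word for pe, word in new.items() if current.get(pe) != word}


def apply_config(current: Dict[int, int], delta: Dict[int, int]) -> Dict[int, int]:
    """Apply a differential update; execution units not in delta keep their word."""
    updated = dict(current)
    updated.update(delta)
    return updated
```

In this model, the number of words transferred per reconfiguration equals the number of changed execution units rather than the size of the array, which is the source of the time saving the text describes.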
The concept, specific structure, and technical effects of the present invention are further described below with reference to the drawings, so that its purpose, features, and effects can be fully understood.
Brief description of the drawings
Fig. 1 is a block diagram of the coarse-grained reconfigurable architecture of a first prior-art design.
Fig. 2 is a block diagram of the coarse-grained reconfigurable array of a second prior-art design.
Fig. 3 is a block diagram of an execution unit in the array shown in Fig. 2.
Fig. 4 is a block diagram of the coarse-grained reconfigurable architecture of a third prior-art design.
Fig. 5 shows block diagrams of the coarse-grained reconfigurable architecture of a fourth prior-art design and of the reconfigurable array within it.
Fig. 6 is a block diagram of the coarse-grained reconfigurable architecture of a fifth prior-art design.
Fig. 7 shows the expression-grained reconfigurable architecture of a sixth prior-art design.
Fig. 8 is a block diagram of the coarse-grained reconfigurable computing system of the present invention.
Fig. 9 is a block diagram of an example of the coarse-grained reconfigurable array in the system shown in Fig. 8.
Fig. 10 shows the structure of an execution unit in the coarse-grained reconfigurable array shown in Fig. 9.
Fig. 11 is a schematic of how two adjacent rows of execution units communicate through a row crossbar switch in the soft-error-tolerant coarse-grained reconfigurable array shown in Fig. 9.
Fig. 12 shows an exemplary structure of the configuration information memory in the system shown in Fig. 8.
Fig. 13 is a schematic of how the configuration information memory in the system shown in Fig. 8 communicates with each execution unit of the coarse-grained reconfigurable array through the second column crossbar switch.
Fig. 14 shows the configuration word of an execution unit.
Fig. 15 shows an exemplary structure of the shared memory in the system shown in Fig. 8.
Fig. 16 is a schematic of how the shared memory in the system shown in Fig. 8 communicates with each execution unit of the coarse-grained reconfigurable array through the first column crossbar switch.
Detailed description of the embodiments
As shown in Fig. 8, in a preferred embodiment, the coarse-grained reconfigurable computing system of the invention comprises a general-purpose processor core 101, a coarse-grained reconfigurable array (RCA) 104, and three memories. In this embodiment, the general-purpose processor core 101 is the open-source OR1200 processor core; the coarse-grained reconfigurable array 104 is an 8 × 8 array of execution units (its structure is described later); and the three memories are the main memory 102, the shared memory 103, and the configuration information memory 105. The general-purpose processor core 101 is connected to and communicates with the coarse-grained reconfigurable array 104, the main memory 102, the shared memory 103, and the configuration information memory 105; the shared memory 103 and the configuration information memory 105 can both exchange data with the main memory 102.
The system executes the serial portion and the parallel portion of an application's source code: the general-purpose processor core 101 executes the serial portion directly, while the parallel portion is converted into configuration information and executed by the coarse-grained reconfigurable array 104 under the direction of processor core 101. The main memory 102 stores the instructions and data needed to run the application, including the configuration information, the input data required by the parallel portion, and the output data produced by it. The shared memory 103 is read and written directly by the coarse-grained reconfigurable array 104: it obtains the input data from the main memory for the array to read, and receives the array's operation results so that they can be stored to the main memory 102 as the output data of the parallel portion. The configuration information memory 105 obtains the configuration information from the main memory 102 for the coarse-grained reconfigurable array 104 to read.
In this embodiment, the coarse-grained reconfigurable array 104 is connected to the general-purpose processor core 101 through a Wishbone bus. The main memory 102 is a static random-access memory (SRAM) connected to processor core 101 through the Wishbone bus. The shared memory 103 is the local memory of array 104, and its address space is divided into 8 banks. It is connected to processor core 101 through the Wishbone bus and to the soft-error-tolerant coarse-grained reconfigurable array 104 through a 128-bit crossbar switch, so every processing unit in array 104 can access it; it also exchanges data with main memory 102 through DMA transfers. The configuration information memory 105 is connected to processor core 101 through the Wishbone bus and to the soft-error-tolerant coarse-grained reconfigurable array 104 through a 320-bit crossbar switch, so every processing unit in array 104 can access it; it also exchanges data with main memory 102 through DMA transfers. The configuration information of array 104 gives each execution unit in array 104 a 40-bit configuration word, which is described in detail below.
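The widths quoted above fit together arithmetically: eight 40-bit configuration words occupy exactly 320 bits, the width of the crossbar between the configuration information memory and the array. The Python sketch below packs and unpacks one such 320-bit group. It is an assumption, not something the source states, that one transfer carries the eight words of a single row, and that word j occupies bits 40j to 40j+39; the names `pack_row` and `unpack_row` are hypothetical.

```python
from typing import List

WORD_BITS = 40                       # configuration word width from the text
PES_PER_ROW = 8                      # execution units per row in the 8 x 8 array
ROW_BITS = WORD_BITS * PES_PER_ROW   # 320, the stated crossbar width


def pack_row(words: List[int]) -> int:
    """Pack eight 40-bit words into one 320-bit integer (word j at bits 40j..40j+39)."""
    if len(words) != PES_PER_ROW:
        raise ValueError("expected one word per execution unit in the row")
    row = 0
    for j, word in enumerate(words):
        if not 0 <= word < (1 << WORD_BITS):
            raise ValueError("configuration word must fit in 40 bits")
        row |= word << (j * WORD_BITS)
    return row


def unpack_row(row: int) -> List[int]:
    """Split a 320-bit integer back into eight 40-bit configuration words."""
    mask = (1 << WORD_BITS) - 1
    return [(row >> (j * WORD_BITS)) & mask for j in range(PES_PER_ROW)]
```

Under this assumption, configuring all 64 execution units would take eight such transfers, one per row.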
The coarse-grained reconfigurable computing system of the invention executes an application program as follows:
1) The source code of the application program is decomposed into a serial part and a parallel part; the serial part is executed on the OR1200, and the parallel part is accelerated on array 104;
2) The parallel part from step 1) is converted into configuration information, either by a dedicated software tool or by hand, and stored in a file;
3) When the system powers up, the OR1200 reads the configuration information from that file, first loading it into main storage 102 and then writing it from main storage 102 into configuration information memory 105;
4) The OR1200 starts running the serial code. When the program reaches a parallel part: i) the required data are transferred from main storage 102 to shared memory 103; ii) a control command is written to array 104, giving the number (id) of the configuration information to be executed; iii) array 104 loads the configuration information from configuration information memory 105 according to the number sent by the OR1200 and configures itself as the retrieved configuration information requires; iv) array 104 is started, each execution unit performs its operation, and the operation results are stored into the register file, passed to the next row of execution units, or stored into shared memory 103;
5) When array 104 finishes, the OR1200 issues further commands, which include writing the results back from shared memory 103 to main storage 102, updating the configuration of array 104 with new configuration information, and loading new operand data from main storage 102 into shared memory 103;
6) The OR1200 either starts array 104 again for further processing, or exits from array 104 and continues executing the remaining serial code.
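The data movement in steps 3) to 5) can be sketched with a small software model. This is only an illustrative sketch: the class `ToySystem`, its method names, and the `kernel` callback that stands in for the configured array are assumptions made for the example and do not appear in the patent.

```python
from dataclasses import dataclass, field


@dataclass
class ToySystem:
    """Toy model of the three memories touched in steps 3) to 5)."""
    main_storage: dict = field(default_factory=dict)    # main storage 102
    shared_memory: dict = field(default_factory=dict)   # shared memory 103
    config_memory: dict = field(default_factory=dict)   # configuration information memory 105

    def power_up(self, config_file):
        # Step 3: file -> main storage 102 -> configuration information memory 105.
        self.main_storage["config"] = dict(config_file)
        self.config_memory = dict(self.main_storage["config"])

    def run_parallel(self, config_id, data_key, kernel):
        # Step 4 i): copy the operands from main storage to shared memory.
        self.shared_memory["in"] = list(self.main_storage[data_key])
        # Step 4 ii)-iii): the array looks up its configuration by id.
        config = self.config_memory[config_id]
        # Step 4 iv): the array runs; `kernel` is a stand-in for the configured PEs.
        self.shared_memory["out"] = [kernel(config, x) for x in self.shared_memory["in"]]
        # Step 5: write the results back to main storage.
        result_key = data_key + "_result"
        self.main_storage[result_key] = list(self.shared_memory["out"])
        return self.main_storage[result_key]


system = ToySystem()
system.main_storage["samples"] = [1, 2, 3]
system.power_up({7: {"scale": 3}})
print(system.run_parallel(7, "samples", lambda cfg, x: cfg["scale"] * x))
```

The model deliberately ignores timing and bus widths; it only shows which memory each step reads from and writes to.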
Fig. 9 shows the structure of the soft-error-tolerant coarse-grained reconfigurable array 104 in the present embodiment, in which 8 × 8 execution units (PEs) are interconnected by 9 row crossbar switches (crossbars), a first column crossbar switch, and a second column crossbar switch. Specifically, the 8 rows of execution units alternate with the 9 row crossbar switches, so that each row of execution units lies between two row crossbar switches and is connected to both. The first column crossbar switch is arranged at one end of the 9 row crossbar switches and is connected to every execution unit; the second column crossbar switch is arranged at the other end of the 9 row crossbar switches and is likewise connected to every execution unit. The row crossbar switches, the first column crossbar switch, and the second column crossbar switch consist of address lines and data lines and operate under control signals such as enable signals. They carry the input data and configuration information of array 104 to each execution unit, carry the operation results of the execution units within array 104, and carry those results out of array 104.
Thus any execution unit in rows 2 to 8 can receive data or configuration instructions from the row crossbar switch above it, the first column crossbar switch, and the second column crossbar switch. These include the input data and configuration information of array 104 and the operation results output by other execution units (specifically, the result output by any execution unit in the row above). It outputs its own operation result to the row crossbar switch below it and to the first column crossbar switch. Any execution unit in row 1 likewise receives data from the row crossbar switch above it, the first column crossbar switch, and the second column crossbar switch, including the input data and configuration information of array 104, and outputs its operation result to the row crossbar switch below it and to the first column crossbar switch. Note that the first and second column crossbar switches are each connected to every execution unit; for clarity of illustration, Fig. 9 shows only their connections to the nearest execution units schematically.
All 8 × 8 execution units of the soft-error-tolerant coarse-grained reconfigurable array 104 have the same construction. The structure of an execution unit is described below, taking as an example the execution unit in row 5, column 8 of Fig. 9 (the PE drawn in bold in Fig. 9).
As shown in Fig. 10, the execution unit includes three multiplexers, one arithmetic unit, and a register file. The three multiplexers are the first multiplexer MUX A, the second multiplexer MUX B, and the third multiplexer MUX C, and the arithmetic unit is the ALU. MUX A, MUX B, and MUX C each have three input terminals. The first input terminal of each is connected to the first column crossbar switch to receive the input data of array 104. The second input terminals are connected respectively to the three output terminals of the register file in the execution unit, to receive results of the execution unit's previous operations that are stored in the register file. The third input terminals are each connected, through the row crossbar switch above the execution unit, to the output terminal of the arithmetic unit in any execution unit of the row above (row 4), to receive the result of the last operation of any execution unit in that row. Note that this describes the third input terminals of the three multiplexers of the execution unit in row 5, column 8. The third input terminals of the multiplexers of all execution units in rows 2 to 8 are connected in the same way. The third input terminals of the multiplexers of the row-1 execution units have no preceding row from which to receive operation results, so they can be regarded as left unconnected.
The control terminals of the first, second, and third multiplexers are all connected to the second column crossbar switch so that they receive, through it, the selection signals in the configuration information of array 104. Specifically, the control terminal of the first multiplexer receives selection signal Sel_A, that of the second multiplexer receives Sel_B, and that of the third multiplexer receives Sel_C. The output terminal of the first multiplexer is connected to the first input terminal of the arithmetic unit, the output terminal of the second multiplexer to its second input terminal, and the output terminal of the third multiplexer to its third input terminal. The selection signals in the configuration information of array 104 therefore determine the outputs of the three multiplexers, that is, the three inputs Input A, Input B, and Input C of the arithmetic unit. The control terminal of the arithmetic unit receives the operation instruction Op Code in the configuration information of array 104, and the arithmetic unit operates on Input A, Input B, and Input C at its first, second, and third input terminals according to Op Code to obtain the operation result. The output terminal of the arithmetic unit is connected to the row crossbar switch below the execution unit, so that the result is passed through that row crossbar switch to the execution units of row i+1. It is also connected to the register file of the execution unit, so that the result can be written to the register file, and to the first column crossbar switch, so that the result can be output from array 104 (that is, to shared memory 103).
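The datapath of one execution unit described above can be modelled in a few lines. This is a behavioural sketch only: the numeric encodings of the selection signals, the op-code names, and the 16-entry register file are assumptions chosen for the example, since the text does not state them.

```python
from dataclasses import dataclass, field

# Assumed encodings of Sel_A / Sel_B / Sel_C (not given in the patent).
SRC_ARRAY_INPUT, SRC_REGFILE, SRC_PREV_ROW = 0, 1, 2

# Hypothetical op codes standing in for the Op Code field.
OPS = {
    "add": lambda a, b, c: a + b,
    "mul": lambda a, b, c: a * b,
    "mac": lambda a, b, c: a * b + c,  # a three-input operation
}


@dataclass
class ProcessingElement:
    # Register file size of 16 is an assumption for this sketch.
    regfile: list = field(default_factory=lambda: [0] * 16)

    def mux(self, sel, array_input, reg_index, prev_row_value):
        """Model MUX A, B, or C: select one of the three input terminals."""
        if sel == SRC_ARRAY_INPUT:
            return array_input
        if sel == SRC_REGFILE:
            return self.regfile[reg_index]
        if sel == SRC_PREV_ROW:
            if prev_row_value is None:
                raise ValueError("no previous-row value (third input unconnected)")
            return prev_row_value
        raise ValueError(f"unknown selection signal {sel}")

    def step(self, op, sels, array_inputs, reg_indices, prev_row_values, dest_reg=None):
        """Select Input A/B/C, apply the operation, optionally store the result."""
        a, b, c = (
            self.mux(s, x, r, p)
            for s, x, r, p in zip(sels, array_inputs, reg_indices, prev_row_values)
        )
        result = OPS[op](a, b, c)
        if dest_reg is not None:
            self.regfile[dest_reg] = result
        return result


pe = ProcessingElement()
pe.regfile[3] = 5
# Input A from the array input, Input B from register 3, Input C from the row above.
print(pe.step("mac", (SRC_ARRAY_INPUT, SRC_REGFILE, SRC_PREV_ROW),
              (2, 0, 0), (0, 3, 0), (None, None, 7), dest_reg=1))
```

Passing `None` as the previous-row value models a row-1 execution unit, whose third multiplexer inputs are unconnected.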
The arithmetic units in any row of execution units $P_{i,0}, \ldots, P_{i,7}$ ($i = 1, \ldots, 8$) release their operation results to the row crossbar switch below that row. For $1 \le i \le 7$, the operation results of $P_{i,0}, \ldots, P_{i,7}$ are passed on to the execution units $P_{i+1,0}, \ldots, P_{i+1,7}$ of the next row; the implementation is shown in Fig. 11. Specifically, the arithmetic units in $P_{i,0}, \ldots, P_{i,7}$ release their operation results $R_0, \ldots, R_7$ into the row crossbar switch. The execution units $P_{i+1,0}, \ldots, P_{i+1,7}$ (rows 2 to 8) fetch those results from the row crossbar switch above them, and specified paths inside that crossbar deliver them to one or more of the three inputs Input A ($A_0, \ldots, A_7$), Input B ($B_0, \ldots, B_7$), and Input C ($C_0, \ldots, C_7$) of their arithmetic units.
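The routing through a row crossbar switch can be pictured as a lookup from each destination input to a source column. The following is a minimal sketch; the function name `route_row_crossbar` and the representation of the routing table as a dictionary are assumptions made for illustration, not the patent's implementation.

```python
def route_row_crossbar(results, routes):
    """Deliver results R0..R7 of row i to selected inputs of row i+1.

    results: list of the 8 operation results of row i, indexed by column.
    routes:  dict mapping (destination column, input name) to a source column,
             where the input name is "A", "B", or "C".
    Returns a dict with the operand delivered to each listed destination input.
    """
    return {dest: results[src] for dest, src in routes.items()}


row_results = [10, 11, 12, 13, 14, 15, 16, 17]
routes = {
    (0, "A"): 3,  # Input A of column 0 takes R3
    (0, "B"): 3,  # one result may feed several inputs
    (5, "C"): 0,  # Input C of column 5 takes R0
}
print(route_row_crossbar(row_results, routes))
```

Because any source column may be routed to any destination input, a result can fan out to several execution units in the next row.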
It can be seen that, in an execution unit, each of the operands Input A, Input B, and Input C can come from one of three sources: i) shared memory 103; ii) the local register file, that is, the register file of this execution unit; iii) the output of an execution unit in the row above (execution units in the first row do not have this source). The configuration information determines which source each of Input A, Input B, and Input C selects. Likewise, the operation result of each execution unit has three possible destinations: i) shared memory 103; ii) the local register file; iii) the inputs of the PEs in the next row (execution units in the last row do not have this destination).
In the present embodiment, the structure of configuration information memory 105 is shown in Fig. 12. Its internal units, such as unit 302, are organized at a granularity of 320 bits. It uses two sets of ports in an asymmetric design. The interaction with general-purpose processor core 101 (OR1200) and main storage 102 uses a 32-bit Wishbone bus interface to maintain good compatibility; this set of ports includes wb_addr_i, wb_data_i, wb_data_o, ..., wb_we_i, and wb_ack_o. The interaction with coarse-grained reconfigurable array 104 uses port 301, a read port dev_ctx_data_o that is 320 bits wide. Port 301 is connected to the second column crossbar switch (see Fig. 13), through which configuration information can be sent to the 64 execution units PE0-PE63, so as to guarantee the speed of dynamic reconfiguration. The address of each 320-bit unit of configuration information memory 105 is called the id of the configuration information and serves as the key for retrieving it; the retrieved 320-bit configuration information is output through port 301 to coarse-grained reconfigurable array 104. In addition, ports dev_ctx_id_i, dev_rd_en_i, and dev_ack_o carry, respectively, the address of the (320-bit) configuration information to be accessed, the read-enable flag, and the output acknowledge flag.
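The asymmetry between the two port sets can be illustrated with a small model in which the processor side writes 32 bits at a time and the array side reads a whole 320-bit unit. This is a sketch under stated assumptions: the word-addressing scheme (ten consecutive 32-bit words per unit, lowest word first) and the names `ConfigMemory`, `wb_write`, and `dev_read` are hypothetical; the patent does not describe the address mapping.

```python
UNIT_BITS = 320
WORD_BITS = 32
WORDS_PER_UNIT = UNIT_BITS // WORD_BITS  # ten 32-bit writes fill one unit


class ConfigMemory:
    """Toy model of configuration information memory 105."""

    def __init__(self, num_units):
        self.units = [0] * num_units  # each entry holds one 320-bit unit

    def wb_write(self, word_addr, data):
        """32-bit write from the processor side (word addressing is assumed)."""
        unit, word = divmod(word_addr, WORDS_PER_UNIT)
        shift = word * WORD_BITS
        mask = ((1 << WORD_BITS) - 1) << shift
        self.units[unit] = (self.units[unit] & ~mask) | ((data & 0xFFFFFFFF) << shift)

    def dev_read(self, ctx_id):
        """320-bit read from the array side, addressed by configuration id."""
        return self.units[ctx_id]


mem = ConfigMemory(num_units=4)
for w in range(WORDS_PER_UNIT):
    mem.wb_write(1 * WORDS_PER_UNIT + w, w + 1)  # fill unit with id 1
print(hex(mem.dev_read(1)))
```

One wide read replaces ten narrow ones, which is the reason given in the text for the 320-bit port on the array side.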
Accordingly, the format of the configuration information needs to be described at several levels:
1) Coarse-grained reconfigurable array (RCA) level: the configuration information in configuration information memory 105 is stored in units of 320 bits. The unit size is 320 bits because each execution unit takes 40 bits of configuration information, so 320 bits hold the configuration information for exactly 8 execution units, that is, one row of execution units. A complete reconfiguration therefore reads 8 units of configuration information from the configuration information memory; this can be regarded as reading one row at a time, 8 rows in total;
2) Differential configuration: if every dynamic reconfiguration had to update the configuration registers of all 64 execution units, it would take 8 cycles in total, which both wastes configuration memory space and reduces system performance. The invention therefore supports differential configuration: each dynamic reconfiguration updates only the configuration registers of those execution units among the 64 whose configuration changes (not shown; each execution unit has one 40-bit configuration register). Each dynamic reconfiguration thus reads configuration information for 0 to 64 execution units and takes 0 to 8 cycles. If only a few execution units need to be reconfigured, reconfiguration is much faster;
3) Configuration word format: each execution unit corresponds to one 40-bit configuration word. The configuration word is as long as 40 bits precisely in order to support differential configuration: the number of the corresponding execution unit must be placed at the head of each configuration word to identify which execution unit the configuration information belongs to;
4) Separating the configuration information of different reconfigurations: the configuration information needed for one dynamic reconfiguration occupies 0 to 8 units of the configuration memory, so its length varies, and a mechanism is needed to tell the configuration information of different reconfigurations apart. The invention uses the following scheme. At the start of each reconfiguration, configuration information is read in 320-bit units starting from the given configuration information id, until one or more of the 40-bit groups at the tail of a 320-bit unit that has been read are all zero. This marks the end of the configuration information needed for this reconfiguration, much like a C string terminated by 0. If no unit with an all-zero tail is encountered, reading stops after at most 8 units of 320 bits, at which point enough configuration information has been read to configure all 64 PEs.
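The termination rule in item 4) can be sketched as a short reader loop. The sketch below assumes that the eight 40-bit groups are packed head first (the first group in the most significant bits of the unit); this packing order and the helper names `pack_unit`, `split_unit`, and `read_reconfiguration` are assumptions for illustration only.

```python
WORD_BITS = 40       # one configuration word per execution unit
WORDS_PER_UNIT = 8   # 8 x 40 bits = one 320-bit unit
MAX_UNITS = 8        # 8 units cover all 64 execution units


def pack_unit(words):
    """Pack up to eight 40-bit words into a 320-bit unit, zero-padding the tail."""
    words = list(words) + [0] * (WORDS_PER_UNIT - len(words))
    unit = 0
    for w in words:
        unit = (unit << WORD_BITS) | w
    return unit


def split_unit(unit):
    """Split a 320-bit unit into its eight 40-bit groups, head first."""
    mask = (1 << WORD_BITS) - 1
    return [(unit >> (WORD_BITS * (WORDS_PER_UNIT - 1 - k))) & mask
            for k in range(WORDS_PER_UNIT)]


def read_reconfiguration(units, start_id):
    """Collect configuration words from start_id until an all-zero tail is seen."""
    words = []
    for ctx_id in range(start_id, min(start_id + MAX_UNITS, len(units))):
        group = split_unit(units[ctx_id])
        while group and group[-1] == 0:   # drop the all-zero tail groups
            group.pop()
        words.extend(group)
        if len(group) < WORDS_PER_UNIT:   # a zero tail ends this reconfiguration
            break
    return words


memory = [pack_unit(range(1, 9)), pack_unit([9, 10]), pack_unit([11])]
print(read_reconfiguration(memory, 0))  # words of the first reconfiguration
print(read_reconfiguration(memory, 2))  # a later, much shorter reconfiguration
```

In this model a reconfiguration whose word count is an exact multiple of 8 (but fewer than 64) needs a following all-zero unit as its terminator, just as a C string needs room for its trailing 0.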
Accordingly, the configuration word of each execution unit in the present embodiment is 40 bits long, as shown in Fig. 14. Specifically:
Field 401: bit 39, resv, is a reserved bit;
Field 402: bit 38, valid, is the valid bit; a value of 1 means the configuration word is a valid configuration word, otherwise it is an invalid configuration word;
Field 403: bits 37-32, pe_id, give the number of this execution unit;
Field 404: bits 31-26, op, give the type of arithmetic or logic operation of the operation instruction for the arithmetic unit of this execution unit;
Field 405: bit 25, A type, indicates the type of the input Input A at the first input terminal of the arithmetic unit of this execution unit. The types of Input A are: from this execution unit; from another execution unit in array 104 (that is, an execution unit in the row above);
Field 406: bits 24-21, input A, specify the input Input A of the type indicated by bit 25 (A type); when A type is 1, bits 24-21 give the register file number of this execution unit;
Field 407: bits 20-19, B type, indicate the type of the input Input B at the second input terminal of the arithmetic unit of this execution unit. The types of Input B are: from this execution unit; from another execution unit in array 104 (that is, an execution unit in the row above); a two-input-instruction immediate; a three-input-instruction immediate;
Field 408: bits 18-10, input B, give, for the Input B indicated by bits 20-19 (B type), either the register file number it comes from or the three-input-instruction immediate;
Field 409: bit 9, C type, indicates the type of the input Input C at the third input terminal when the first, second, and third input terminals of the arithmetic unit of this execution unit all have inputs. The types of Input C are: from this execution unit; from another execution unit in array 104 (that is, an execution unit in the row above);
Field 410: bits 8-5, input C, give the register file number that the Input C indicated by bit 9 (C type) comes from;
Field 411: bit 4, R type, gives the output type of the operation result of the arithmetic unit of this execution unit; when it is 1, the operation result is output to the register file of this execution unit, otherwise it is output to another execution unit in array 104 (that is, an execution unit in the next row);
Field 412: bits 3-0, result, give the register file number when the operation result is output to the register file of this execution unit.
When bits 20-19 (B type) indicate that Input B is a two-input-instruction immediate, bits 18-5 together represent that input.
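The bit layout above maps directly onto shift-and-mask code. The following sketch packs and unpacks a 40-bit configuration word using the bit positions listed in fields 401 to 412; the Python field names (for example `a_type` for "A type") and the sample op-code value are assumptions, since the patent does not give numeric encodings.

```python
# (name, highest bit, lowest bit), following fields 401-412.
FIELDS = [
    ("resv", 39, 39), ("valid", 38, 38), ("pe_id", 37, 32), ("op", 31, 26),
    ("a_type", 25, 25), ("input_a", 24, 21), ("b_type", 20, 19),
    ("input_b", 18, 10), ("c_type", 9, 9), ("input_c", 8, 5),
    ("r_type", 4, 4), ("result", 3, 0),
]
FIELD_NAMES = {name for name, _, _ in FIELDS}


def encode(**values):
    """Pack named field values into a 40-bit configuration word."""
    unknown = set(values) - FIELD_NAMES
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    word = 0
    for name, hi, lo in FIELDS:
        width = hi - lo + 1
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << lo
    return word


def decode(word):
    """Unpack a 40-bit configuration word into a dict of field values."""
    return {name: (word >> lo) & ((1 << (hi - lo + 1)) - 1) for name, hi, lo in FIELDS}


def two_input_immediate(word):
    """Bits 18-5 read as one 14-bit immediate (two-input instructions)."""
    return (word >> 5) & 0x3FFF


# op=5 is a made-up op code used only to exercise the field.
word = encode(valid=1, pe_id=39, op=5, a_type=1, input_a=3, r_type=1, result=2)
print(f"{word:010x}", decode(word)["pe_id"])
```

Keeping the layout in one table makes it easy to check that the widths add up to 40 bits and that the two-input immediate spans exactly input B, C type, and input C.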
As shown in Fig. 15, to improve access speed, shared memory 103 in the present embodiment uses a multi-bank design. Every 16-bit storage unit corresponds to one bank, and there are 8 banks in total, each with its own access port. The 8 access ports are merged into one 128-bit port, so each shared-memory access can read up to 8 16-bit operands. Shared memory 103 can also be expanded further.
Like configuration information memory 105, shared memory 103 uses two asymmetric sets of ports. The interface between shared memory 103 and general-purpose processor core 101 and main storage 102 is a 32-bit Wishbone interface; this set of ports 501 includes wb_addr_i, wb_data_i, wb_data_o, ..., wb_we_i, and wb_ack_o. The interface between shared memory 103 and coarse-grained reconfigurable array 104 connects to the array through the first column crossbar switch; this set of ports 502 includes bank0_addr_i, bank0_data_i, bank0_data_o, ..., bank7_addr_i, bank7_data_i, and bank7_data_o. That is, each bank i (i = 0, ..., 7) of the 8 banks of shared memory 103 has at least three ports: the address input port banki_addr_i, the data input port banki_data_i, and the data output port banki_data_o. As shown in Fig. 16, each bank i (i = 0, ..., 7) therefore serves one column of execution units $PE_{0i}, \ldots, PE_{7i}$ for reading and writing data. Each column has 8 execution units, one per row, and their priority for accessing shared memory 103 decreases as the row number increases. A smaller row number means a higher priority, because the computation in earlier rows is more urgent: if it cannot access memory in time, the later computation is delayed. When execution units in different rows of the same column access memory at the same time, access is granted to the execution unit with the lowest row number. The invention thus uses a hierarchical interconnect structure for accessing the shared memory, which achieves a reasonably good trade-off between complexity and performance.
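The fixed-priority rule for each bank can be written as a one-line arbiter. This is a minimal sketch: the function names and the representation of requests as row numbers grouped by column are assumptions made for the example.

```python
def arbitrate_bank(requesting_rows):
    """Grant one bank to the requester with the lowest row number.

    requesting_rows: row numbers (0..7) of the execution units in one column
    that request the bank in the same cycle. Returns the granted row, or
    None when nobody requests the bank.
    """
    rows = list(requesting_rows)
    return min(rows) if rows else None


def arbitrate(requests_by_column):
    """Arbitrate all banks at once; bank i serves column i independently."""
    return {col: arbitrate_bank(rows) for col, rows in requests_by_column.items()}


# Rows 5, 2, and 7 of column 0 collide; column 3 is idle; column 7 has one requester.
print(arbitrate({0: [5, 2, 7], 3: [], 7: [0]}))
```

Since each bank only arbitrates among the 8 execution units of its own column, the 8 banks can grant up to 8 accesses in parallel without a full 64-input arbiter.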
The preferred embodiments of the invention have been described in detail above. It should be understood that a person of ordinary skill in the art can make many modifications and variations according to the concept of the invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain under the concept of the invention, on the basis of the prior art, through logical analysis, reasoning, or limited experimentation shall fall within the scope of protection defined by the claims.
Claims (10)
1. A coarse-grained reconfigurable computing system for executing a serial part and a parallel part of the source code of an application program, the parallel part being converted into configuration information, characterized in that the system comprises a general-purpose processor core, a coarse-grained reconfigurable array, a main storage, a shared memory, and a configuration information memory; the general-purpose processor core is connected to and communicates with the coarse-grained reconfigurable array, the main storage, the shared memory, and the configuration information memory, and the shared memory and the configuration information memory can each exchange data with the main storage; the general-purpose processor core is used to execute the serial part and to instruct the coarse-grained reconfigurable array to execute the parallel part; the main storage is used to store the configuration information, the input data needed to execute the parallel part, and the output data produced by executing the parallel part; the shared memory is used to obtain the input data from the main storage for the coarse-grained reconfigurable array to read, and for the coarse-grained reconfigurable array to write its operation results so that the operation results are stored to the main storage as the output data; the configuration information memory is used to obtain the configuration information from the main storage for the coarse-grained reconfigurable array to read;
the coarse-grained reconfigurable array comprises m × n execution units arranged in m rows and n columns;
each execution unit comprises a first multiplexer, a second multiplexer, a third multiplexer, an arithmetic unit, and a register file; in any execution unit of row i, 1 ≤ i ≤ m,
the first input terminals of the first multiplexer, the second multiplexer, and the third multiplexer are all used to receive the input data;
the second input terminals of the first multiplexer, the second multiplexer, and the third multiplexer are connected respectively to the first, second, and third output terminals of the local register file;
when 2 ≤ i ≤ m, the third input terminals of the first multiplexer, the second multiplexer, and the third multiplexer are each connected through a row crossbar switch to the output terminal of the arithmetic unit in an execution unit of row i-1; when i = 1, the third input terminals of the first multiplexer, the second multiplexer, and the third multiplexer are all left unconnected;
the control terminals of the first multiplexer, the second multiplexer, and the third multiplexer are all used to receive the selection signals in the configuration information;
the output terminal of the first multiplexer is connected to the first input terminal of the arithmetic unit, the output terminal of the second multiplexer is connected to the second input terminal of the arithmetic unit, and the output terminal of the third multiplexer is connected to the third input terminal of the arithmetic unit;
the control terminal of the arithmetic unit is used to receive the operation instruction in the configuration information; the arithmetic unit performs an operation according to the inputs at its first, second, and third input terminals and the operation instruction, and the operation result obtained is output from its output terminal to outside the array, to any execution unit of row i+1, and to the register file.
2. The coarse-grained reconfigurable computing system as claimed in claim 1, wherein the m × n execution units are connected by m+1 row crossbar switches, a first column crossbar switch, and a second column crossbar switch for transmitting data;
each row of execution units is located between two of the row crossbar switches, wherein the third input terminals of the first multiplexer, the second multiplexer, and the third multiplexer of the n execution units of the row are all connected to one of the two row crossbar switches, and the output terminals of the arithmetic units of the n execution units are all connected to the other of the two row crossbar switches;
the first column crossbar switch is connected to the first input terminals of the first, second, and third multiplexers of each execution unit and to the output terminal of each execution unit; the second column crossbar switch is connected to the control terminals of the first, second, and third multiplexers of each execution unit and to the control terminal of the arithmetic unit of each execution unit;
the first column crossbar switch is connected to the shared memory, and the second column crossbar switch is connected to the configuration information memory.
3. The coarse-grained reconfigurable computing system as claimed in claim 2, wherein the row crossbar switches, the first column crossbar switch, and the second column crossbar switch consist of address lines and data lines.
4. The coarse-grained reconfigurable computing system as claimed in claim 3, wherein the general-purpose processor core is connected to the coarse-grained reconfigurable array, the main storage, the shared memory, and the configuration information memory through a Wishbone bus.
5. The coarse-grained reconfigurable computing system as claimed in claim 3, wherein the shared memory and the configuration information memory each exchange data with the main storage by DMA.
6. The coarse-grained reconfigurable computing system as claimed in claim 3, wherein the general-purpose processor core is the open-source OR1200 processor core.
7. The coarse-grained reconfigurable computing system as claimed in any one of claims 3-6, wherein m is 8 and n is 8.
8. The coarse-grained reconfigurable computing system as claimed in claim 7, wherein the configuration information of the array gives each execution unit one configuration word, and the configuration word is 40 bits long.
9. The coarse-grained reconfigurable computing system as claimed in claim 8, wherein, in the configuration word of an execution unit:
bit 39 is a reserved bit;
bit 38, when it is 1, indicates that the configuration word is a valid configuration word;
bits 37-32 give the number of the execution unit;
bits 31-26 give the type of arithmetic or logic operation of the operation instruction of the arithmetic unit of the execution unit;
bit 25 indicates the type of the input at the first input terminal of the arithmetic unit of the execution unit, the types of the input at the first input terminal being: the input at the first input terminal comes from the execution unit, and the input at the first input terminal comes from another execution unit in the coarse-grained reconfigurable array;
bits 24-21 specify the input indicated by bit 25; when bit 25 is 1, bits 24-21 give the register file number of the execution unit;
bits 20-19 indicate the type of the input at the second input terminal of the arithmetic unit of the execution unit, the types of the input at the second input terminal being: the input at the second input terminal comes from the execution unit, the input at the second input terminal comes from another execution unit in the coarse-grained reconfigurable array, the input at the second input terminal is a two-input-instruction immediate, and the input at the second input terminal is a three-input-instruction immediate;
bits 18-10 give, for the input indicated by bits 20-19, the register file number it comes from or the three-input-instruction immediate;
bit 9 indicates the type of the input at the third input terminal when the first, second, and third input terminals of the arithmetic unit of the execution unit all have inputs, the types of the input at the third input terminal being: the input at the third input terminal comes from the execution unit, and the input at the third input terminal comes from another execution unit in the coarse-grained reconfigurable array;
bits 8-5 give the register file number that the input indicated by bit 9 comes from;
bit 4 gives the output type of the operation result of the arithmetic unit of the execution unit; when bit 4 is 1, the operation result is output to the register file of the execution unit, otherwise it is output to another execution unit in the coarse-grained reconfigurable array;
bits 3-0 give the register file number when the operation result is output to the register file of the execution unit.
10. The coarse-grained reconfigurable computing system as claimed in claim 9, wherein, when bits 20-19 indicate that the input at the second input terminal is a two-input-instruction immediate, bits 18-10, bit 9, and bits 8-5 together represent the input at the second input terminal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510779977.2A CN105468568B (en) | 2015-11-13 | 2015-11-13 | Efficient coarseness restructurable computing system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105468568A CN105468568A (en) | 2016-04-06 |
CN105468568B true CN105468568B (en) | 2018-06-05 |
Family
ID=55606287
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510779977.2A Expired - Fee Related CN105468568B (en) | 2015-11-13 | 2015-11-13 | Efficient coarseness restructurable computing system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105468568B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059050B (en) * | 2019-04-28 | 2023-07-25 | 北京美联东清科技有限公司 | AI supercomputer based on high-performance reconfigurable elastic calculation |
CN110737628A (en) * | 2019-10-17 | 2020-01-31 | 辰芯科技有限公司 | reconfigurable processor and reconfigurable processor system |
CN111581148A (en) * | 2020-04-16 | 2020-08-25 | 清华大学 | Processor based on coarse-grained reconfigurable architecture |
CN111597138B (en) * | 2020-04-27 | 2024-02-13 | 科大讯飞股份有限公司 | Multi-concurrency RAM data transmission method and structure of X-type link structure |
CN112084139A (en) * | 2020-08-25 | 2020-12-15 | 上海交通大学 | Multi-emission mixed granularity reconfigurable array processor based on data flow driving |
CN112631610B (en) * | 2020-11-30 | 2022-04-26 | 上海交通大学 | Method for eliminating memory access conflict for data reuse of coarse-grained reconfigurable structure |
CN112463719A (en) * | 2020-12-04 | 2021-03-09 | 上海交通大学 | In-memory computing method realized based on coarse-grained reconfigurable array |
CN112433773B (en) * | 2020-12-14 | 2021-11-30 | 清华大学 | Configuration information recording method and device for reconfigurable processor |
CN112540793A (en) * | 2020-12-18 | 2021-03-23 | 清华大学 | Reconfigurable processing unit array supporting multiple access modes and control method and device |
CN113407483B (en) * | 2021-06-24 | 2023-12-12 | 重庆大学 | Dynamic reconfigurable processor for data intensive application |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761075A (en) * | 2014-02-10 | 2014-04-30 | 东南大学 | Coarse granularity dynamic reconfigurable data integration and control unit structure |
CN103761072A (en) * | 2014-02-10 | 2014-04-30 | 东南大学 | Coarse granularity reconfigurable hierarchical array register file structure |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB0304628D0 (en) * | 2003-02-28 | 2003-04-02 | Imec Inter Uni Micro Electr | Method for hardware-software multitasking on a reconfigurable computing platform |
Non-Patent Citations (1)
Title |
---|
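"Design of an Automatic Task Compiler Framework for Heterogeneous Coarse-Grained Reconfigurable Processors"; Lou Jiechao et al.; Microelectronics & Computer (《微电子学与计算机》); 2015-08-30; Vol. 32, No. 8; pp. 110-114 * |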
Also Published As
Publication number | Publication date |
---|---|
CN105468568A (en) | 2016-04-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20180605; Termination date: 20201113 |