CN105468568B

CN105468568B - Efficient coarse-grained reconfigurable computing system

Info

Publication number: CN105468568B
Application number: CN201510779977.2A
Authority: CN
Inventors: 绳伟光; 蒋剑飞; 毛志刚
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2015-11-13
Filing date: 2015-11-13
Publication date: 2018-06-05
Anticipated expiration: 2035-11-13
Also published as: CN105468568A

Abstract

The invention discloses a coarse-grained reconfigurable computing system for executing the serial executable portion and the parallel executable portion of the source code of an application program, the parallel executable portion being converted into configuration information. The system includes a general-purpose processor core, a coarse-grained reconfigurable array, a main memory, a shared memory and a configuration information memory. The coarse-grained reconfigurable array executes the parallel executable portion and includes multiple execution units arranged in an array. Each execution unit includes three multiplexers, an arithmetic unit and a register file; the multiplexers receive input data, and the arithmetic unit performs the operation and outputs the operation result to outside the array, to any execution unit of the next row, and to the register file. The coarse-grained reconfigurable computing system of the invention suits a wide range of application types, has low hardware cost while ensuring good performance, and saves configuration time, improving efficiency.

Description

Efficient coarse-grained reconfigurable computing system

Technical field

The present invention relates to the field of processor architecture design, and in particular to an efficient coarse-grained reconfigurable computing system.

Background art

As microelectronic process technology has developed, the limits of semiconductor devices have been reached, Moore's Law, which guided the semiconductor industry for many years, has broken down, and the clock frequency of microprocessors is difficult to raise further. However, process development has also brought progress on another front: a rapid increase in on-chip integration. Processor architecture development has therefore shifted from pursuing higher clock frequency to making better use of increasingly abundant on-chip resources.

Reconfigurable computing is a computing architecture distinct from the traditional von Neumann architecture, also called a reconfigurable architecture or reconfigurable system. It changes circuit function by statically or dynamically changing circuit structure and interconnection, in marked contrast to the von Neumann architecture, which changes function by changing the instruction stream that is executed. Static reconfigurable computing is currently represented mainly by FPGAs and is relatively mature. This document concerns dynamic coarse-grained reconfigurable computing architectures. "Dynamic" means that the structure can change its circuit structure and function during computation, which is more flexible than a statically reconfigurable FPGA. "Coarse-grained" means that function changes at a granularity of at least one byte (8 bits), rather than in units of single bits as in the fine-grained reconfiguration of an FPGA. The advantage of coarse-grained reconfiguration is that the amount of configuration information drops substantially, which reduces the cost of reconfiguration; this is the fundamental reason why coarse-grained reconfigurable architectures are better suited than FPGAs to dynamic reconfiguration.

Early research on reconfigurable systems began in the 1960s, but the technology of the time was too limited to integrate enough resources on a chip, so development was slow. With recent progress in semiconductor process technology, abundant resources can be integrated on chip, and coarse-grained reconfigurable systems have regained attention. Coarse-grained reconfigurable architectures offer higher performance than general-purpose processors and better flexibility than application-specific integrated circuits, and have become a research hotspot. Researchers hope that exploring reconfigurable architectures will address the problems of resource utilization, communication and power consumption faced by current computing architectures.

Although research on coarse-grained reconfigurable architectures began in the 1960s, it became a hot topic only at the beginning of this century, when a group of coarse-grained reconfigurable architectures appeared, and the work has continued to the present. Garp is an early coarse-grained reconfigurable architecture proposed at the University of California, Berkeley; its system block diagram is shown in Fig. 1. It consists of a MIPS processor plus a 32-bit reconfigurable computing array and mainly targets compute-intensive applications (see: Callahan T.J., Hauser J.R., Wawrzynek J. The Garp architecture and C compiler [J]. Computer, 2000, 33(4): 62-69). As Fig. 1 shows, the reconfigurable array is connected to the main processor and the memory through a crossbar switch, so that the reconfigurable array can read data quickly from the cache. However, no memory dedicated to the reconfigurable array is provided, which may affect data transfer performance and, in turn, the performance of the whole system.

The PipeRench architecture of Carnegie Mellon University organizes multiple execution units (PEs) into a pipelined structure, with interconnection networks linking the different pipeline stages (see: Goldstein S.C., Schmit H., Budiu M., Cadambi S., Moe M., Taylor R.R. PipeRench: a reconfigurable architecture and compiler [J]. Computer, 2000, 33(4): 70-77). As shown in Figs. 2 and 3, the advantage of PipeRench is very high communication efficiency between pipeline stages. However, judging from the published data, it has no general-purpose processor core, which limits the use of the architecture for applications beyond stream processing.

MorphoSys, from the University of California, Irvine, is another coarse-grained reconfigurable architecture that has received wide attention; its system block diagram is shown in Fig. 4 (see: Singh H., Lee M.-H., Lu G., Bagherzadeh N., Kurdahi F.J., Filho E.M.C. MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications [J]. IEEE Transactions on Computers, 2000, 49(5): 465-481). MorphoSys consists of a simple RISC general-purpose processor core and an 8x8 reconfigurable array (RC Array). The reconfigurable array has a small Frame Buffer as its local memory, and communication between the Frame Buffer and external main memory is handled by DMA. Configuration information is stored in the Context Memory. The data path between the reconfigurable array and the Frame Buffer is 64 bits wide, and the data path between the array and the Context Memory is 256 bits wide, which partly overcomes the problem of insufficient internal data transfer bandwidth. In addition, the reconfigurable array uses a partial mesh interconnect, whose transmission efficiency is deficient.

As shown in Fig. 5, the PACT XPP from PACT organizes processing units into processing array clusters (PACs); four PACs are connected through supervising configuration managers, and each PAC has a data path to the outside. The problem with this structure is that PACT XPP can only serve as an accelerator for compute-intensive applications; system control code and other types of code still lack the support of a general-purpose processor core.

As shown in Fig. 6, the ADRES architecture from IMEC adopts a more refined organization: the top-level processing units are used to form a VLIW processor, and the remaining processing units form the reconfigurable array. This is a tightly coupled architecture with the advantages of a simple structure and simple data paths. Its drawbacks are that the tightly coupled VLIW architecture makes the accompanying compiler difficult to develop, and that the lack of array-local memory makes certain types of applications inefficient.

MORPHEUS, developed in Europe in recent years, is an extremely complex reconfigurable processor, but it cannot be called a coarse-grained reconfigurable processor, because it integrates on one chip a general-purpose processor core, the coarse-grained XPP unit, and medium-grained and even fine-grained FPGA units; its excessive architectural complexity limits its application (see: Thoma F., Kuhnle M., Bonnot P., Panainte E.M., Bertels K., Goller S., Schneider A., Guyetant S., Schuler E., Muller-Glaser K.D., Becker J. MORPHEUS: Heterogeneous Reconfigurable Computing. International Conference on Field Programmable Logic and Applications (FPL'07). 2007: 409-414).

EGRA is an expression-grained reconfigurable architecture (see: Ansaloni G., Bonzini P., Pozzi L. EGRA: A Coarse Grained Reconfigurable Architectural Template [J]. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2011, 19(6): 1062-1074). Its architecture block diagram is shown in Fig. 7: ALU clusters, memory access units (Mem), multipliers (Mult) and other units are organized together. It is still positioned as an accelerator and is difficult to use independently, and the layout of the various units is fixed, which may limit configuration flexibility.

Besides the architectures proposed by researchers abroad, domestic institutions have also studied coarse-grained reconfigurable architectures in depth in recent years. One example is the REmus II reconfigurable architecture proposed by Tsinghua University and others, a heterogeneous reconfigurable computing architecture that uses two 16x16 arrays and dedicated accelerator units; it achieves a considerable speedup, but its hardware complexity is too high.

It can be seen that the prior art described above has the following defects:

1. Memory wall: the system architecture is unbalanced, the bandwidth of the channel to main memory is insufficient, the reconfigurable array lacks internal local memory, and memory access is inflexible, so memory access becomes the system bottleneck; although a large number of computing units are provided, the speedup obtained is limited;

2. Low configuration efficiency: although most reconfigurable architectures support dynamic reconfiguration, a single configuration takes too many cycles to complete, so dynamic configuration cannot be performed frequently;

3. Inflexible interconnect: most designs use a mesh network or a variant of it, which makes memory access difficult for units near the center of the reconfigurable array; in addition, the unit types in the mesh may differ and may not match application requirements, which hampers the mapping of many applications onto the reconfigurable architecture.

Therefore, those skilled in the art are committed to developing a coarse-grained reconfigurable computing system that ensures efficient data access for its reconfigurable array.

Summary of the invention

To achieve the above object, the present invention provides a coarse-grained reconfigurable computing system for executing the serial executable portion and the parallel executable portion of the source code of an application program, the parallel executable portion being converted into configuration information. The system is characterized in that it includes a general-purpose processor core, a coarse-grained reconfigurable array, a main memory, a shared memory and a configuration information memory. The general-purpose processor core is connected to, and communicates with, the coarse-grained reconfigurable array, the main memory, the shared memory and the configuration information memory, and both the shared memory and the configuration information memory can exchange data with the main memory. The general-purpose processor core executes the serial executable portion and instructs the coarse-grained reconfigurable array to execute the parallel executable portion. The main memory stores the configuration information, the input data required to execute the parallel executable portion, and the output data produced by executing it. The shared memory obtains the input data from the main memory for the coarse-grained reconfigurable array to read, and receives the operation results written by the coarse-grained reconfigurable array so that they can be stored to the main memory as the output data. The configuration information memory obtains the configuration information from the main memory for the coarse-grained reconfigurable array to read;

The coarse-grained reconfigurable array includes m × n execution units arranged in m rows and n columns;

Each execution unit includes a first multiplexer, a second multiplexer, a third multiplexer, an arithmetic unit and a register file. In any execution unit of the i-th row, where 1 ≤ i ≤ m:

The first input terminals of the first, second and third multiplexers all receive the input data;

The second input terminals of the first, second and third multiplexers are connected respectively to the first, second and third output terminals of the local register file;

When 2 ≤ i ≤ m, the third input terminals of the first, second and third multiplexers are each connected through a row crossbar switch to the output terminals of the arithmetic units of the execution units in the (i-1)-th row; when i = 1, the third input terminals of the first, second and third multiplexers are all left unconnected;

The control terminals of the first, second and third multiplexers all receive selection signals in the configuration information;

The output terminal of the first multiplexer is connected to the first input terminal of the arithmetic unit, the output terminal of the second multiplexer is connected to the second input terminal of the arithmetic unit, and the output terminal of the third multiplexer is connected to the third input terminal of the arithmetic unit;

The control terminal of the arithmetic unit receives the operation instruction in the configuration information. The arithmetic unit performs the operation according to the inputs at its first, second and third input terminals and the operation instruction, and outputs the operation result from its output terminal to outside the array, to any execution unit of the (i+1)-th row, and to the register file.
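The execution-unit datapath described above can be sketched in Python. This is an illustrative model only: the class name, the operation names (`add`, `mul`, `mac`), the selector encoding 0/1/2 and the 16-entry register file are assumptions made for the example and are not taken from the patent.

```python
from dataclasses import dataclass, field

# Hypothetical operation table: the patent does not enumerate the ALU operations.
OPS = {
    "add": lambda a, b, c: a + b,
    "mul": lambda a, b, c: a * b,
    "mac": lambda a, b, c: a * b + c,   # example of a three-input operation
}

# Hypothetical selector encoding for each multiplexer.
SEL_INPUT, SEL_REGFILE, SEL_PREV_ROW = 0, 1, 2


@dataclass
class ExecutionUnit:
    row: int                                               # 1-based row index i
    regs: list = field(default_factory=lambda: [0] * 16)   # size is an assumption

    def mux(self, sel, input_data, reg_index, prev_row_value):
        """Model one three-input multiplexer."""
        if sel == SEL_INPUT:
            return input_data
        if sel == SEL_REGFILE:
            return self.regs[reg_index]
        if sel == SEL_PREV_ROW:
            if self.row == 1:
                raise ValueError("row 1 has no previous row; third input is unconnected")
            return prev_row_value
        raise ValueError(f"unknown selector {sel}")

    def step(self, op, sels, input_data, reg_indices, prev_row_values, dest_reg=None):
        """Select three operands, run the ALU, optionally write the register file."""
        a, b, c = (
            self.mux(s, d, r, p)
            for s, d, r, p in zip(sels, input_data, reg_indices, prev_row_values)
        )
        result = OPS[op](a, b, c)
        if dest_reg is not None:
            self.regs[dest_reg] = result
        return result  # would also go to the next row and to outside the array
```

In this sketch the three selectors play the role of the selection signals and `op` plays the role of the operation instruction; a row-1 unit rejects the previous-row selector because its third multiplexer input is unconnected.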

Further, the m × n execution units are connected through m+1 row crossbar switches, a first column crossbar switch and a second column crossbar switch, all of which are used to transmit data;

The execution units of each row are placed between two row crossbar switches: the first input terminals and the third input terminals of the first, second and third multiplexers of the n execution units are all connected to one of the two row crossbar switches, and the output terminals of the arithmetic units of the n execution units are all connected to the other of the two row crossbar switches;

The first column crossbar switch is connected to the first input terminals of the first, second and third multiplexers of every execution unit, and to the output terminal of every execution unit; the second column crossbar switch is connected to the control terminals of the first, second and third multiplexers of every execution unit, and to the control terminal of the arithmetic unit of every execution unit;

The first column crossbar switch is connected to the shared memory, and the second column crossbar switch is connected to the configuration information memory.

Further, the row crossbar switches, the first column crossbar switch and the second column crossbar switch are each composed of address lines and data lines.

Further, the general-purpose processor core is connected to the coarse-grained reconfigurable array, the main memory, the shared memory and the configuration information memory through a Wishbone bus.

Further, the shared memory and the configuration information memory both exchange data with the main memory by DMA transfer.

Further, the general-purpose processor core is the open-source OR1200 processor core.

Further, m is 8 and n is 8.

Further, the configuration information of the array gives each execution unit one configuration word, and the configuration word is 40 bits long.

Further, in the configuration word of an execution unit:

Bit 39 is reserved;

Bit 38, when 1, indicates that the configuration word is a valid configuration word;

Bits 37-32 represent the number of the execution unit;

Bits 31-26 represent the type of arithmetic or logic operation of the operation instruction of the arithmetic unit of the execution unit;

Bit 25 indicates the type of the input at the first input terminal of the arithmetic unit of the execution unit; the types are: the input comes from the execution unit itself, or the input comes from another execution unit in the soft-error-tolerant coarse-grained reconfigurable array;

Bits 24-21 represent the input indicated by bit 25; when bit 25 is 1, bits 24-21 represent the register number in the register file of the execution unit;

Bits 20-19 indicate the type of the input at the second input terminal of the arithmetic unit of the execution unit; the types are: the input comes from the execution unit itself, the input comes from another execution unit in the soft-error-tolerant coarse-grained reconfigurable array, the input is a two-input-instruction immediate, or the input is a three-input-instruction immediate;

Bits 18-10 represent the input indicated by bits 20-19, namely the register number in the register file or the three-input-instruction immediate;

Bit 9 indicates the type of the input at the third input terminal when the first, second and third input terminals of the arithmetic unit of the execution unit all have inputs; the types are: the input comes from the execution unit itself, or the input comes from another execution unit in the soft-error-tolerant coarse-grained reconfigurable array;

Bits 8-5 represent the input indicated by bit 9, namely the register number in the register file;

Bit 4 represents the output type of the operation result of the arithmetic unit of the execution unit; when bit 4 is 1, the operation result is output to the register file of the execution unit, and otherwise it is output to other execution units in the soft-error-tolerant coarse-grained reconfigurable array;

Bits 3-0 represent the register number in the register file when the operation result is output to the register file of the execution unit.

Further, when bits 20-19 indicate that the input at the second input terminal is a two-input-instruction immediate, bits 18-10, bit 9 and bits 8-5 together represent the input at the second input terminal.
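The 40-bit configuration word layout listed above can be illustrated with a small encoder and decoder. The bit positions follow the text; the field names (`pe_id`, `a_src` and so on) and the helper functions are hypothetical and chosen only for this sketch.

```python
# (high bit, low bit) of each field, taken from the layout in the text.
# The field names themselves are illustrative assumptions.
FIELDS = {
    "reserved": (39, 39),
    "valid":    (38, 38),
    "pe_id":    (37, 32),
    "opcode":   (31, 26),
    "a_type":   (25, 25),
    "a_src":    (24, 21),
    "b_type":   (20, 19),
    "b_src":    (18, 10),
    "c_type":   (9, 9),
    "c_src":    (8, 5),
    "out_type": (4, 4),
    "out_reg":  (3, 0),
}


def encode(**values):
    """Pack named field values into one 40-bit configuration word."""
    word = 0
    for name, value in values.items():
        high, low = FIELDS[name]
        width = high - low + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << low
    return word


def decode(word):
    """Split a 40-bit configuration word into its named fields."""
    if not 0 <= word < (1 << 40):
        raise ValueError("configuration word must fit in 40 bits")
    return {
        name: (word >> low) & ((1 << (high - low + 1)) - 1)
        for name, (high, low) in FIELDS.items()
    }


def two_input_immediate(word):
    """Bits 18-5: the 14-bit span formed by bits 18-10, bit 9 and bits 8-5."""
    return (word >> 5) & 0x3FFF
```

The field widths add up to exactly 40 bits, and the 6-bit execution unit number is enough to address the 64 units of an 8 × 8 array. The value of bits 20-19 that selects a two-input-instruction immediate is not given in the text, so the sketch leaves that encoding to the caller.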

Further, the soft-error-tolerant coarse-grained reconfigurable array is implemented in the SystemC language.

The present invention has the following advantages:

1. The structure includes both a general-purpose processor core and a coarse-grained reconfigurable array, so the range of applicable application types is wider;

2. A shared memory is provided for direct reading and writing by the reconfigurable array. The internal port width of the shared memory can reach 128 bits, wider than the 64-bit width of architectures such as MorphoSys, so it supplies more data per access, can be extended, and gives better performance;

3. Like the shared memory, the configuration memory uses an asymmetric design, which keeps the connection to the bus interface convenient while ensuring efficient data access by the reconfigurable array. The interface through which the reconfigurable array reads configuration information can be up to 320 bits wide, giving good performance at moderate complexity;

4. Internal data interfaces make extensive use of high-speed crossbar switch connections, and connections that would otherwise be overly complex use hierarchical crossbars, which ensures good performance without an excessive increase in hardware complexity;

5. With carefully designed configuration information, a differential configuration mode is supported: each reconfiguration reconfigures only the execution units (PEs) that change and leaves the remaining PEs untouched, which saves configuration time and improves efficiency.
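The differential configuration idea in advantage 5 can be sketched as follows. The function names and the dictionary representation (execution unit number mapped to configuration word) are assumptions for illustration; the patent does not describe how the changed words are identified.

```python
def differential_update(current, new):
    """Return only the (unit number -> word) entries that differ from the current ones."""
    return {pe: word for pe, word in new.items() if current.get(pe) != word}


def apply_update(current, update):
    """Write the changed words; every other execution unit keeps its configuration."""
    merged = dict(current)
    merged.update(update)
    return merged


current = {0: 0x4000000010, 1: 0x4100000020, 2: 0x4200000030}
new = {0: 0x4000000010, 1: 0x4100000021, 2: 0x4200000030}
update = differential_update(current, new)
print(len(update))  # 1
```

Here only one of three words is rewritten, which is the source of the configuration time saving the text describes.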

The concept, specific structure and technical effects of the present invention are further described below with reference to the drawings, so that the purpose, features and effects of the invention can be fully understood.

Description of the drawings

Fig. 1 is a structural diagram of the coarse-grained reconfigurable architecture of a first prior art.

Fig. 2 is a structural diagram of the coarse-grained reconfigurable array of a second prior art.

Fig. 3 is a structural diagram of an execution unit in the array shown in Fig. 2.

Fig. 4 is a structural diagram of the coarse-grained reconfigurable architecture of a third prior art.

Fig. 5 shows block diagrams of the coarse-grained reconfigurable architecture of a fourth prior art and of the reconfigurable array therein.

Fig. 6 is a structural diagram of the coarse-grained reconfigurable architecture of a fifth prior art.

Fig. 7 shows the expression-grained reconfigurable architecture of a sixth prior art.

Fig. 8 shows a structural diagram of the coarse-grained reconfigurable computing system of the present invention.

Fig. 9 shows an exemplary structure of the coarse-grained reconfigurable array in the coarse-grained reconfigurable computing system shown in Fig. 8.

Fig. 10 shows the structure of an execution unit in the coarse-grained reconfigurable array shown in Fig. 9.

Fig. 11 is a schematic diagram of two adjacent rows of execution units communicating through a row crossbar switch in the soft-error-tolerant coarse-grained reconfigurable array shown in Fig. 9.

Fig. 12 shows an exemplary structure of the configuration information memory in the coarse-grained reconfigurable computing system shown in Fig. 8.

Fig. 13 is a schematic diagram of the configuration information memory in the coarse-grained reconfigurable computing system shown in Fig. 8 communicating with each execution unit of the coarse-grained reconfigurable array through the second column crossbar switch.

Fig. 14 shows the configuration word of an execution unit.

Fig. 15 shows an exemplary structure of the shared memory in the coarse-grained reconfigurable computing system shown in Fig. 8.

Fig. 16 is a schematic diagram of the shared memory in the coarse-grained reconfigurable computing system shown in Fig. 8 communicating with each execution unit of the coarse-grained reconfigurable array through the first column crossbar switch.

Specific embodiment

As shown in Fig. 8, in a preferred embodiment of the invention, the coarse-grained reconfigurable computing system includes a general-purpose processor core 101, a coarse-grained reconfigurable array (RCA) 104 and three memories. In this embodiment, the general-purpose processor core 101 is the open-source OR1200 processor core; the coarse-grained reconfigurable array 104 is an 8 × 8 array of execution units (its specific structure is described later); and the three memories are the main memory 102, the shared memory 103 and the configuration information memory 105. The general-purpose processor core 101 is connected to, and communicates with, the coarse-grained reconfigurable array 104, the main memory 102, the shared memory 103 and the configuration information memory 105, and both the shared memory 103 and the configuration information memory 105 can exchange data with the main memory 102.

The coarse-grained reconfigurable computing system of the invention executes the serial executable portion and the parallel executable portion of the source code of an application program. The general-purpose processor core 101 executes the serial executable portion directly, while the parallel executable portion is converted into configuration information and executed by the coarse-grained reconfigurable array 104 under the direction of processor core 101. The main memory 102 stores the instructions and data needed to run the application, specifically including the configuration information, the input data required to execute the parallel executable portion, and the output data produced by executing it. The shared memory 103 is read and written directly by the coarse-grained reconfigurable array 104: it obtains from the main memory the input data required to execute the parallel executable portion for the array 104 to read, and it receives the operation results written by the array 104 so that they can be stored to the main memory 102 as the output data of the parallel executable portion. The configuration information memory 105 obtains the configuration information from the main memory 102 for the coarse-grained reconfigurable array 104 to read.

In this embodiment, the coarse-grained reconfigurable array 104 is connected to the general-purpose processor core 101 through a Wishbone bus. The main memory 102 is a static random access memory (SRAM) and is connected to the general-purpose processor core 101 through the Wishbone bus. The shared memory 103 is the local memory of the coarse-grained reconfigurable array 104, and its address space is divided into 8 banks; it is connected to the general-purpose processor core 101 through the Wishbone bus and to the soft-error-tolerant coarse-grained reconfigurable array 104 through a 128-bit crossbar switch, so that every processing unit in array 104 can access it; in addition, it exchanges data with the main memory 102 by DMA transfer. The configuration information memory 105 is connected to the general-purpose processor core 101 through the Wishbone bus and to the soft-error-tolerant coarse-grained reconfigurable array 104 through a 320-bit crossbar switch, so that every processing unit in array 104 can access it; in addition, it exchanges data with the main memory 102 by DMA transfer. The configuration information of array 104 gives each execution unit in array 104 one 40-bit configuration word, which is described in detail below.
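The widths quoted above invite a quick consistency check. The sketch below assumes, hypothetically, that each 320-bit access to the configuration information memory carries only whole 40-bit configuration words; the text states the widths but does not say how words are packed, so the derived numbers are an inference rather than a stated property of the design.

```python
CONFIG_WORD_BITS = 40     # one configuration word per execution unit (from the text)
CONFIG_PORT_BITS = 320    # crossbar width between configuration memory and array
ROWS, COLS = 8, 8         # array dimensions in this embodiment

# Assumption: every access carries only whole configuration words.
words_per_access = CONFIG_PORT_BITS // CONFIG_WORD_BITS
accesses_for_full_array = -(-(ROWS * COLS) // words_per_access)  # ceiling division

print(words_per_access)         # 8
print(accesses_for_full_array)  # 8
```

Under that assumption one access would supply as many words as there are execution units in one row, and a full reconfiguration of the 64 units would take eight accesses.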

The coarse-grained reconfigurable computing system of the invention executes an application program as follows:

1) The source code of the application program is decomposed into a serial executable portion and a parallel executable portion; the serial executable portion is executed on the OR1200, and the parallel executable portion is accelerated on array 104;

2) The parallel executable portion from step 1) is converted into configuration information, either by a dedicated software tool or by hand, and stored in a file;

3) The system powers up and starts; the OR1200 reads the configuration information from the file that stores it, loads it first into main memory 102, and then writes it from main memory 102 into configuration information memory 105;

4) The OR1200 starts running the code of the serial executable portion until the program reaches a parallel executable portion, and then: i) the required data is transferred from main memory 102 to shared memory 103; ii) a control command is written to array 104, giving the number of the configuration information to be executed; iii) array 104 loads the configuration information from configuration information memory 105 according to the number sent by the OR1200, and is configured as the retrieved configuration information requires; iv) array 104 is started and each execution unit performs its operation, with operation results stored to the register file, passed to the execution units of the next row, or written to shared memory 103;

5) When array 104 finishes running, the OR1200 issues further commands, including writing results back from shared memory 103 to main memory 102, updating the configuration of array 104 with new configuration information, and loading new operand data from main memory 102 into shared memory 103;

6) The OR1200 either starts array 104 again for further processing or leaves array 104 and continues executing the remaining serial code itself.
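The steps above can be sketched as a toy Python model of the host-driven flow. Everything here is a stand-in: the class and method names are hypothetical, plain lists and dictionaries replace the three memories, list copies replace DMA transfers, and a Python callable replaces real configuration information.

```python
class CgraSystem:
    """Toy model of the control flow; not a hardware description."""

    def __init__(self):
        self.main_memory = {"config": {}, "in": [], "out": []}
        self.config_memory = {}
        self.shared_memory = {"in": [], "out": []}

    def power_up(self, config_file):
        # Step 3: configuration goes to main memory first, then to configuration memory.
        self.main_memory["config"] = dict(config_file)
        self.config_memory = dict(self.main_memory["config"])

    def run_parallel(self, config_number, data):
        self.main_memory["in"] = list(data)
        # Step 4 i): move input data from main memory to shared memory.
        self.shared_memory["in"] = list(self.main_memory["in"])
        # Step 4 ii) and iii): the host names a configuration, the array loads it.
        kernel = self.config_memory[config_number]
        # Step 4 iv): the array computes and writes results to shared memory.
        self.shared_memory["out"] = [kernel(x) for x in self.shared_memory["in"]]
        # Step 5: results are written back from shared memory to main memory.
        self.main_memory["out"] = list(self.shared_memory["out"])
        return self.main_memory["out"]


system = CgraSystem()
system.power_up({0: lambda x: x * 2, 1: lambda x: x + 1})
doubled = system.run_parallel(0, [1, 2, 3])
print(doubled)  # [2, 4, 6]
```

Calling `run_parallel` again with a different configuration number corresponds to step 6, where the host restarts the array with new configuration information and data.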

Fig. 9 shows the structure of the soft-error-tolerant coarse-grained reconfigurable array 104 of this embodiment, in which the 8 × 8 execution units (PEs) are connected through 9 row crossbar switches, a first column crossbar switch and a second column crossbar switch. Specifically, the 8 rows of execution units alternate with the 9 row crossbar switches, so that each row of execution units lies between two row crossbar switches and is connected to both; the first column crossbar switch is arranged at one end of the 9 row crossbar switches and is connected to every execution unit, and the second column crossbar switch is arranged at the other end of the 9 row crossbar switches and is connected to every execution unit. The row crossbar switches, the first column crossbar switch and the second column crossbar switch are composed of address lines and data lines and operate under control signals such as enable signals. They carry the input data and configuration information of array 104 to each execution unit, carry the operation results of the execution units within array 104, and carry those results out of array 104.
In this way, any execution unit in rows 2 to 8 can receive data or configuration instructions from the row crossbar switch above it, the first column crossbar switch and the second column crossbar switch, including the input data and configuration information of array 104 and the operation results output by other execution units (specifically, the operation result output by any execution unit in the row above), and it outputs its own operation result to the row crossbar switch below it and to the first column crossbar switch. Any execution unit in row 1 can receive data from the row crossbar switch above it, the first column crossbar switch and the second column crossbar switch, namely the input data and configuration information of array 104, and it outputs its operation result to the row crossbar switch below it and to the first column crossbar switch. Note that the first and second column crossbar switches are connected to every execution unit, but for clarity Fig. 9 shows only schematically their connections to the nearest execution units.
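The row-to-row dataflow described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and the `(function, source_indices)` representation are hypothetical, register files and the column crossbar switches are left out, and row 1 is assumed to take its operands from the array input data.

```python
def run_array(input_data, rows):
    """Evaluate the array row by row.

    rows[i] is a list of (function, source_indices) pairs, one per execution unit.
    Units in the first row index into input_data; units in later rows index into
    the outputs of the row above, since the row crossbar switch lets a unit pick
    the result of any unit in the previous row.
    """
    previous = list(input_data)
    for row in rows:
        previous = [fn(*(previous[j] for j in sources)) for fn, sources in row]
    return previous


rows = [
    [(lambda a, b: a + b, (0, 1)), (lambda a, b: a * b, (2, 3))],  # row 1
    [(lambda a, b: a - b, (0, 1))],                                # row 2
]
result = run_array([1, 2, 3, 4], rows)
print(result)  # [-9]
```

Row 1 produces 3 and 12, and the single unit in row 2 selects both of them through the row crossbar switch and subtracts them.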

All 8 × 8 execution units of the soft-error-tolerant coarse-grained reconfigurable array 104 have the same construction. The structure of each execution unit is described below, taking as an example the execution unit in row 5, column 8 of Fig. 9 (the PE drawn in bold in Fig. 9).

As shown in Fig. 10, the execution unit includes three multiplexers, one arithmetic unit and a register file. The three multiplexers are the first multiplexer MUX A, the second multiplexer MUX B and the third multiplexer MUX C, and the arithmetic unit is the arithmetic unit ALU. MUX A, MUX B and MUX C each have three input terminals. Their first input terminals are all connected to the first column crossbar switch to receive the input data of array 104. Their second input terminals are connected respectively to the three output terminals of the register file in this execution unit, to receive the results of this execution unit's previous operations stored in the register file. Their third input terminals are each connected, through the row crossbar switch above this row, to the output terminal of the arithmetic unit of any execution unit in the row above (i.e. row 4), to receive the result of the previous operation of any execution unit in the row above (i.e. row 4). Note that what is described here is the third input terminals of the three multiplexers of the execution unit in row 5, column 8. The third input terminals of the three multiplexers of the execution units in rows 2 to 8 are connected in the same way, whereas the third input terminals of the three multiplexers of the row 1 execution units have no operation result from a row above to receive and can therefore be regarded as unconnected.

The control terminals of the first, second and third multiplexers are all connected to the second column crossbar switch, through which they receive the selection signals in the configuration information of array 104. Specifically, the control terminal of the first multiplexer receives selection signal Sel_A, the control terminal of the second multiplexer receives selection signal Sel_B, and the control terminal of the third multiplexer receives selection signal Sel_C. The output terminal of the first multiplexer is connected to the first input terminal of the arithmetic unit, the output terminal of the second multiplexer to the second input terminal of the arithmetic unit, and the output terminal of the third multiplexer to the third input terminal of the arithmetic unit. In this way, the selection signals in the configuration information of array 104 determine the outputs of the three multiplexers, that is, the three inputs Input A, Input B and Input C of the arithmetic unit. The control terminal of the arithmetic unit receives the operation instruction Op Code in the configuration information of array 104, and the arithmetic unit performs an operation according to the inputs Input A, Input B and Input C at its first, second and third input terminals and the operation instruction Op Code, obtaining an operation result. The output terminal of the arithmetic unit is connected to the row crossbar switch below this execution unit, so that the operation result is output through that row crossbar switch to the execution units of row i+1; it is also connected to the register file of this execution unit, so that the operation result is output to the register file; and it is also connected to the first column crossbar switch, so that the operation result is output outside array 104 (i.e. to shared memory 103).

The arithmetic units of any row of execution units P_{i,0}, …, P_{i,7} (i = 1, …, 8) described above release their operation results to the row crossbar switch below that row; when 2 ≤ i ≤ 8, the operation results of execution units P_{i,0}, …, P_{i,7} are to be output to the execution units P_{i+1,0}, …, P_{i+1,7} of the next row, which is implemented as shown in Fig. 11. Specifically, the arithmetic units of execution units P_{i,0}, …, P_{i,7} (i = 1, …, 8) release their operation results R₀, …, R₇ into the row crossbar switch; execution units P_{i+1,0}, …, P_{i+1,7} (i = 2, …, 8) obtain these operation results from the row crossbar switch above them, and the results are routed along specified paths inside that row crossbar switch to one or more of the three inputs Input A (A₀, …, A₇), Input B (B₀, …, B₇) and Input C (C₀, …, C₇) of their arithmetic units.

It can be seen that, in an execution unit, each of the operands Input A, Input B and Input C can come from one of three sources: i) shared memory 103; ii) the local register file, i.e. the register file of this execution unit; iii) the output of an execution unit in the row above (the execution units of the first row do not have this data source). Which data source Input A, Input B and Input C select is determined by the configuration information. Likewise, the operation result of each execution unit has three destinations: i) it is stored to shared memory 103; ii) it is stored to the local register file; iii) it is output to the input terminals of the PEs in the next row (the execution units of the last row do not have this destination).
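The operand selection and result routing summarized above can be illustrated with a minimal behavioral sketch. It is not taken from the patent: the selector values `SRC_SHARED`, `SRC_REGFILE` and `SRC_PREV_ROW`, the class `ExecutionUnit`, the 16-entry register file and the callable `op` standing in for the Op Code are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence

# Hypothetical selector values for the three operand sources.
SRC_SHARED, SRC_REGFILE, SRC_PREV_ROW = 0, 1, 2


@dataclass
class ExecutionUnit:
    row: int  # 1-based row number, as in the description
    # 16 registers is an assumption (the register fields are 4 bits wide).
    regs: List[int] = field(default_factory=lambda: [0] * 16)

    def mux(self, sel: int, shared_in: int, reg_idx: int, prev_row_out: int) -> int:
        """Select one operand according to the configured source."""
        if sel == SRC_SHARED:
            return shared_in
        if sel == SRC_REGFILE:
            return self.regs[reg_idx]
        if sel == SRC_PREV_ROW:
            if self.row == 1:
                # The third multiplexer input of row 1 is unconnected.
                raise ValueError("row 1 has no previous-row data source")
            return prev_row_out
        raise ValueError(f"unknown selector {sel}")

    def step(self,
             sels: Sequence[int],
             op: Callable[[int, int, int], int],
             shared: Sequence[int] = (0, 0, 0),
             reg_idx: Sequence[int] = (0, 0, 0),
             prev: Sequence[int] = (0, 0, 0),
             dest_reg: Optional[int] = None) -> int:
        """Run MUX A/B/C, apply the operation, optionally store the result."""
        a, b, c = (self.mux(s, sh, r, p)
                   for s, sh, r, p in zip(sels, shared, reg_idx, prev))
        result = op(a, b, c)
        if dest_reg is not None:
            self.regs[dest_reg] = result  # destination ii): local register file
        return result  # the caller forwards it to the next row or shared memory
```

In this sketch a row 2 unit could take Input A from shared memory, Input B from its register file and Input C from the row above, while a row 1 unit that selects the previous-row source is rejected.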

In the present embodiment, the structure of configuration information memory 105 is as shown in Fig. 12. Its internal units, such as unit 302, are organized at a granularity of 320 bits. It uses two sets of ports of asymmetric design. The interaction with general-purpose processor core 101 (OR1200) and main storage 102 uses a 32-bit Wishbone bus interface to maintain good compatibility; this set of ports includes ports wb_addr_i, wb_data_i, wb_data_o, …, wb_we_i and wb_ack_o. The interaction with coarse-grained reconfigurable array 104 uses port 301, the read port dev_ctx_data_o, which is 320 bits wide. Port 301 is connected to the second column crossbar switch (see Fig. 13), through which configuration information can be sent to the 64 execution units PE0-PE63, so as to ensure the speed of dynamic reconfiguration. The address of each 320-bit unit of configuration information memory 105 is called the id of the configuration information and serves as the key for retrieving configuration information; the retrieved 320-bit configuration information is output through port 301 to coarse-grained reconfigurable array 104. In addition, ports dev_ctx_id_i, dev_rd_en_i and dev_ack_o carry, respectively, the address of the (320-bit) configuration information to be accessed, the read-configuration-information flag, and the output acknowledgement flag.

Correspondingly, the format of the configuration information needs to be described at several levels:

1) Coarse-grained reconfigurable array (RCA) level: The configuration information in configuration information memory 105 is stored in units of 320 bits. The unit size is 320 bits because the configuration information of each execution unit occupies 40 bits, so 320 bits is exactly enough for 8 execution units, i.e. the amount of configuration information needed by one row of execution units. A complete reconfiguration therefore reads 8 units of configuration information from the configuration information memory, which can be regarded as reading one row at a time for a total of 8 rows;

2) Differential configuration: If every dynamic reconfiguration had to update the configuration registers of all 64 execution units, 8 cycles would be needed in total, which firstly wastes configuration memory space and secondly reduces system performance. The invention therefore supports differential configuration, in which each dynamic reconfiguration updates only the configuration registers of those execution units, among the 64, whose configuration has changed (not shown; each execution unit has one 40-bit configuration register). Each dynamic reconfiguration therefore reads configuration information for 0 to 64 execution units and consumes 0 to 8 cycles. If only a few execution units need to be reconfigured, reconfiguration becomes much faster;

3) Configuration word format: Each execution unit corresponds to one 40-bit configuration word. The configuration word is as long as 40 bits precisely in order to support differential configuration: the number of the corresponding execution unit must be placed at the head of each configuration word to identify which execution unit the configuration information belongs to;

4) Distinguishing the configuration information of different reconfigurations: Because the configuration information needed by each dynamic reconfiguration corresponds to 0 to 8 units in the configuration memory and is therefore of variable length, a mechanism is needed to distinguish the configuration information of different reconfigurations. The invention uses the following scheme: at the start of each reconfiguration, configuration information is read in 320-bit units starting from the given configuration information id, until a 320-bit unit is read in which some trailing 40-bit groups are all zero. This indicates that the configuration information needed for this reconfiguration has been read completely, much as character strings in the C language end with 0. If no all-zero tail is encountered, reading stops after at most 8 units of 320-bit configuration information, at which point enough configuration information to configure all 64 PEs has been read.
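The zero-terminated reading scheme of items 1) to 4) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the ordering of the eight 40-bit words inside a 320-bit unit (first word in the most significant bits), and the choice to detect the terminator by testing only the last 40-bit group are all hypothetical.

```python
from typing import List, Sequence

WORD_BITS = 40       # one configuration word per execution unit
WORDS_PER_UNIT = 8   # 8 x 40 bits = 320 bits, i.e. one row of execution units
MAX_UNITS = 8        # 8 units cover all 64 execution units
WORD_MASK = (1 << WORD_BITS) - 1


def pack_unit(words: Sequence[int]) -> int:
    """Pack up to 8 words into one 320-bit unit, zero-padding the tail."""
    if len(words) > WORDS_PER_UNIT:
        raise ValueError("a unit holds at most 8 configuration words")
    padded = list(words) + [0] * (WORDS_PER_UNIT - len(words))
    unit = 0
    for w in padded:
        unit = (unit << WORD_BITS) | (w & WORD_MASK)
    return unit


def split_unit(unit: int) -> List[int]:
    """Split a 320-bit unit into its 8 words, first-packed word first."""
    return [(unit >> (WORD_BITS * (WORDS_PER_UNIT - 1 - k))) & WORD_MASK
            for k in range(WORDS_PER_UNIT)]


def read_reconfiguration(memory: Sequence[int], start_id: int) -> List[int]:
    """Collect the configuration words of one reconfiguration.

    Reading starts at the unit addressed by start_id and stops after the
    first unit whose last 40-bit group is all zero, or after 8 units.
    """
    words: List[int] = []
    for offset in range(MAX_UNITS):
        unit_words = split_unit(memory[start_id + offset])
        words.extend(w for w in unit_words if w != 0)  # skip zero padding
        if unit_words[-1] == 0:  # all-zero tail: end of this reconfiguration
            break
    return words
```

With this layout, a reconfiguration that changes 10 execution units occupies two units (8 words plus 2 words and zero padding), and a full reconfiguration reads exactly 8 units.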

Accordingly, the configuration word of each execution unit in the present embodiment is 40 bits long, as shown in Fig. 14, and is laid out as follows:

Field 401: bit 39, resv, is a reserved bit;

Field 402: bit 38, valid, is the validity flag; a value of 1 indicates a valid configuration word, and any other value an invalid configuration word;

Field 403: bits 37-32, pe_id, represent the number of this execution unit;

Field 404: bits 31-26, op, represent the type of arithmetic or logic operation of the operation instruction for the arithmetic unit of this execution unit;

Field 405: bit 25, A type, indicates the type of the input Input A at the first input terminal of the arithmetic unit of this execution unit; the types of Input A include: from this execution unit, and from another execution unit in array 104 (i.e. an execution unit in the row above);

Field 406: bits 24-21, input A, represent the input Input A indicated by bit 25 (A type); when bit 25 (A type) is 1, bits 24-21 (input A) represent the number of the register file of this execution unit;

Field 407: bits 20-19, B type, indicate the type of the input Input B at the second input terminal of the arithmetic unit of this execution unit; the types of Input B include: from this execution unit, from another execution unit in array 104 (i.e. an execution unit in the row above), a two-input-instruction immediate, and a three-input-instruction immediate;

Field 408: bits 18-10, input B, represent the register file number from which the input Input B indicated by bits 20-19 (B type) comes, or the three-input-instruction immediate;

Field 409: bit 9, C type, indicates the type of the input Input C at the third input terminal when the first, second and third input terminals of the arithmetic unit of this execution unit all have inputs; the types of Input C include: from this execution unit, and from another execution unit in array 104 (i.e. an execution unit in the row above);

Field 410: bits 8-5, input C, represent the register file number from which the input Input C indicated by bit 9 (C type) comes;

Field 411: bit 4, R type, represents the output type of the operation result of the arithmetic unit of this execution unit; when it is 1, the operation result is output to the register file of this execution unit, and otherwise it is output to another execution unit in array 104 (i.e. an execution unit in the next row);

Field 412: bits 3-0, result, represent the number of the register file when the operation result is output to the register file of this execution unit.

When bits 20-19 (B type) indicate that the input Input B is a two-input-instruction immediate, bits 18-5 together represent that input.
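The bit layout of fields 401 to 412 can be checked with a short encode/decode sketch. The bit positions follow the description; the Python-friendly field names (`a_type`, `input_a` and so on) and the helper functions are assumptions for illustration. The numeric encodings of the type fields, apart from those stated in the text, are not given in the source and are therefore left uninterpreted.

```python
from typing import Dict

# (name, highest bit, lowest bit) for fields 401-412 of the 40-bit word.
FIELDS = [
    ("resv", 39, 39), ("valid", 38, 38), ("pe_id", 37, 32), ("op", 31, 26),
    ("a_type", 25, 25), ("input_a", 24, 21), ("b_type", 20, 19),
    ("input_b", 18, 10), ("c_type", 9, 9), ("input_c", 8, 5),
    ("r_type", 4, 4), ("result", 3, 0),
]


def decode(word: int) -> Dict[str, int]:
    """Split a 40-bit configuration word into its named fields."""
    if not 0 <= word < (1 << 40):
        raise ValueError("configuration word must fit in 40 bits")
    return {name: (word >> lo) & ((1 << (hi - lo + 1)) - 1)
            for name, hi, lo in FIELDS}


def encode(**values: int) -> int:
    """Build a 40-bit configuration word; unspecified fields are 0."""
    word = 0
    for name, hi, lo in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << (hi - lo + 1)):
            raise ValueError(f"{name} does not fit in bits {hi}-{lo}")
        word |= value << lo
    return word


def two_input_immediate(word: int) -> int:
    """Return bits 18-5, which jointly hold a two-input-instruction immediate."""
    return (word >> 5) & 0x3FFF
```

The field widths sum to 40 bits (1 + 1 + 6 + 6 + 1 + 4 + 2 + 9 + 1 + 4 + 1 + 4), and the 6-bit pe_id is exactly wide enough to number the 64 execution units.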

As shown in Fig. 15, to improve access speed, shared memory 103 in the present embodiment uses a multi-bank design: each 16-bit storage unit corresponds to one bank, and there are 8 banks in total. Each bank has its own access port, and the 8 access ports are merged into one 128-bit port, so that each shared-memory access can read up to 8 16-bit operands. Shared memory 103 can also be extended further.

Like configuration information memory 105, shared memory 103 uses two asymmetric sets of ports. The interface between shared memory 103 and general-purpose processor core 101 and main storage 102 is a Wishbone interface 32 bits wide; this set of ports 501 includes ports wb_addr_i, wb_data_i, wb_data_o, …, wb_we_i and wb_ack_o. The interface between shared memory 103 and coarse-grained reconfigurable array 104 connects to the array through the first column crossbar switch; this set of ports 502 includes ports bank0_addr_i, bank0_data_i, bank0_data_o, …, bank7_addr_i, bank7_data_i and bank7_data_o. That is, each bank banki (i = 0, …, 7) of the 8 banks of shared memory 103 has at least three ports: an address input port banki_addr_i, a data input port banki_data_i and a data output port banki_data_o. As shown in Fig. 16, each banki (i = 0, …, 7) thus serves one column of execution units PE_{0i}-PE_{7i} for reading and writing data. Since each column spans 8 rows and therefore contains 8 execution units, the priority with which these execution units access shared memory 103 decreases as the row number increases; that is, the smaller the row number, the higher the priority, because the earlier computations are more urgent and would delay the later computations if they could not access memory in time. When execution units in different rows of the same column access memory at the same time, access is granted to the execution unit with the lower row number. The invention therefore uses a hierarchical interconnect structure for access to the shared memory, which achieves a relatively good trade-off between complexity and performance.
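The per-bank priority rule described above (within one column, the lowest row number wins) can be sketched as a simple arbiter. The function name `arbitrate` and the representation of requests as (row, bank) pairs with 0-based indices, matching the PE_{0i}-PE_{7i} notation, are assumptions for illustration; the sketch models a single arbitration round only.

```python
from typing import Dict, Iterable, Tuple

NUM_ROWS = 8
NUM_BANKS = 8  # one bank per column of execution units


def arbitrate(requests: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Grant each requested bank to the requesting PE with the lowest row.

    requests: (row, bank) pairs, both 0-based; bank i serves column i.
    Returns a mapping from bank index to the row that is granted access.
    """
    grants: Dict[int, int] = {}
    for row, bank in requests:
        if not (0 <= row < NUM_ROWS and 0 <= bank < NUM_BANKS):
            raise ValueError(f"invalid request ({row}, {bank})")
        # A lower row number means higher priority within the same bank.
        if bank not in grants or row < grants[bank]:
            grants[bank] = row
    return grants
```

Because every bank has its own port, requests to different banks never compete; only requests from different rows of the same column need arbitration.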

The preferred embodiments of the present invention have been described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain on the basis of the prior art through logical analysis, reasoning or limited experimentation under the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims

1. A coarse-grained reconfigurable computing system for executing a serial executable portion and a parallel executable portion of the source code of an application program, the parallel executable portion being converted into configuration information, characterized by comprising a general-purpose processor core, a coarse-grained reconfigurable array, a main storage, a shared memory and a configuration information memory, wherein the general-purpose processor core is connected to the coarse-grained reconfigurable array, the main storage, the shared memory and the configuration information memory so that they can communicate with one another, and the shared memory and the configuration information memory can both exchange data with the main storage; the general-purpose processor core is used to execute the serial executable portion and to instruct the coarse-grained reconfigurable array to execute the parallel executable portion; the main storage is used to store the configuration information, the input data required to execute the parallel executable portion, and the output data produced by executing the parallel executable portion; the shared memory is used to obtain the input data from the main storage for the coarse-grained reconfigurable array to read, and for the coarse-grained reconfigurable array to write its operation results into, so that the operation results are stored to the main storage as the output data; the configuration information memory is used to obtain the configuration information from the main storage for the coarse-grained reconfigurable array to read;

the execution unit includes a first multiplexer, a second multiplexer, a third multiplexer, an arithmetic unit and a register file; in any one of the execution units of the i-th row, 1 ≤ i ≤ m,

the first input terminals of the first multiplexer, the second multiplexer and the third multiplexer are all used to receive the input data;

the second input terminals of the first multiplexer, the second multiplexer and the third multiplexer are connected respectively to the first, second and third output terminals of the local register file;

when 2 ≤ i ≤ m, the third input terminals of the first multiplexer, the second multiplexer and the third multiplexer are each connected through a row crossbar switch to the output terminal of the arithmetic unit in an execution unit of the (i-1)-th row; when i = 1, the third input terminals of the first multiplexer, the second multiplexer and the third multiplexer are all unconnected;

the control terminals of the first multiplexer, the second multiplexer and the third multiplexer are all used to receive the selection signals in the configuration information;

the output terminal of the first multiplexer is connected to the first input terminal of the arithmetic unit, the output terminal of the second multiplexer is connected to the second input terminal of the arithmetic unit, and the output terminal of the third multiplexer is connected to the third input terminal of the arithmetic unit;

the control terminal of the arithmetic unit is used to receive the operation instruction in the configuration information; the arithmetic unit performs an operation according to the inputs at its first, second and third input terminals and the operation instruction, and outputs the operation result obtained from its output terminal to outside the array, to any one of the execution units of the (i+1)-th row, and to the register file.

2. The coarse-grained reconfigurable computing system as claimed in claim 1, wherein the m × n execution units are connected by m+1 row crossbar switches, a first column crossbar switch and a second column crossbar switch, which are used to transmit data;

each row of the execution units is located between two of the row crossbar switches, wherein the third input terminals of the first multiplexer, the second multiplexer and the third multiplexer of the n execution units of that row are all connected to one of the two row crossbar switches, and the output terminals of the arithmetic units of the n execution units are all connected to the other of the two row crossbar switches;

the first column crossbar switch is connected to the first input terminals of the first, second and third multiplexers of each execution unit and to the output terminal of each execution unit; the second column crossbar switch is connected to the control terminals of the first, second and third multiplexers of each execution unit and to the control terminal of the arithmetic unit of each execution unit;

the first column crossbar switch is connected to the shared memory, and the second column crossbar switch is connected to the configuration information memory.

3. The coarse-grained reconfigurable computing system as claimed in claim 2, wherein the row crossbar switches, the first column crossbar switch and the second column crossbar switch consist of address lines and data lines.

4. The coarse-grained reconfigurable computing system as claimed in claim 3, wherein the general-purpose processor core is connected to the coarse-grained reconfigurable array, the main storage, the shared memory and the configuration information memory through a Wishbone bus.

5. The coarse-grained reconfigurable computing system as claimed in claim 3, wherein the shared memory and the configuration information memory both exchange data with the main storage by DMA transfer.

6. The coarse-grained reconfigurable computing system as claimed in claim 3, wherein the general-purpose processor core is an open-source OR1200 processor core.

7. The coarse-grained reconfigurable computing system as claimed in any one of claims 3-6, wherein m is 8 and n is 8.

8. The coarse-grained reconfigurable computing system as claimed in claim 7, wherein the configuration information of the array provides each execution unit with one configuration word, the configuration word being 40 bits long.

9. The coarse-grained reconfigurable computing system as claimed in claim 8, wherein, in the configuration word of an execution unit,

bit 39 is a reserved bit;

bits 37-32 represent the number of the execution unit;

bits 31-26 represent the type of arithmetic or logic operation of the operation instruction of the arithmetic unit of the execution unit;

bit 25 indicates the type of the input of the first input terminal of the arithmetic unit of the execution unit, the types of the input of the first input terminal including: the input of the first input terminal comes from the execution unit, and the input of the first input terminal comes from another execution unit in the coarse-grained reconfigurable array;

bits 24-21 represent the input indicated by bit 25; when bit 25 is 1, bits 24-21 represent the number of the register file of the execution unit;

bits 20-19 indicate the type of the input of the second input terminal of the arithmetic unit of the execution unit, the types of the input of the second input terminal including: the input of the second input terminal comes from the execution unit, the input of the second input terminal comes from another execution unit in the coarse-grained reconfigurable array, the input of the second input terminal is a two-input-instruction immediate, and the input of the second input terminal is a three-input-instruction immediate;

bits 18-10 represent the register file number from which the input indicated by bits 20-19 comes, or the three-input-instruction immediate;

bit 9 indicates the type of the input of the third input terminal when the first, second and third input terminals of the arithmetic unit of the execution unit all have inputs, the types of the input of the third input terminal including: the input of the third input terminal comes from the execution unit, and the input of the third input terminal comes from another execution unit in the coarse-grained reconfigurable array;

bit 4 represents the output type of the operation result of the arithmetic unit of the execution unit; when bit 4 is 1, the operation result is output to the register file of the execution unit, and otherwise it is output to another execution unit in the coarse-grained reconfigurable array;

bits 3-0 represent the number of the register file when the operation result is output to the register file of the execution unit.

10. The coarse-grained reconfigurable computing system as claimed in claim 9, wherein, when bits 20-19 indicate that the input of the second input terminal is the two-input-instruction immediate, bits 18-10, bit 9 and bits 8-5 together represent the input of the second input terminal.