CN101799750B - Data processing method and device - Google Patents


Info

Publication number
CN101799750B
CN101799750B · CN200910208432.0A
Authority
CN
China
Prior art keywords
processor core
data
configurable
code
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200910208432.0A
Other languages
Chinese (zh)
Other versions
CN101799750A (en)
Inventor
林正浩
任浩琪
王静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xinhao Bravechips Micro Electronics Co Ltd
Original Assignee
Shanghai Xinhao Bravechips Micro Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xinhao Bravechips Micro Electronics Co Ltd filed Critical Shanghai Xinhao Bravechips Micro Electronics Co Ltd
Priority to CN200910208432.0A (CN101799750B)
Priority to EP09828544A (EP2372530A4)
Priority to PCT/CN2009/001346 (WO2010060283A1)
Priority to KR1020117014902A (KR101275698B1)
Publication of CN101799750A
Priority to US13/118,360 (US20110231616A1)
Application granted
Publication of CN101799750B

Landscapes

  • Logic Circuits (AREA)
  • Stored Programmes (AREA)

Abstract

The invention relates to a data processing method and device. Program code running on a serially connected multi-core processor structure is partitioned according to specific rules, so that the structure forms a serial multi-issue, hierarchically pipelined organization in which the time needed to run each partitioned code segment on its core is made as equal as possible, thereby achieving load balancing of the workload across cores.

Description

Method and apparatus for data processing
Technical field
The present invention relates to the field of integrated circuit (IC) design.
Background technology
According to Moore's Law, the feature size of transistors has been shrinking steadily along the 65 nm, 45 nm, 32 nm ... route, and the number of transistors integrated on a single chip now exceeds several billion. However, since the release of synthesis and place-and-route tools in the 1980s, which liberated back-end design productivity, EDA tools have made no qualitative breakthrough for more than twenty years, so front-end design, and especially verification, finds it increasingly difficult to cope with the ever-growing scale of single chips. Design companies have therefore turned their attention to multi-core designs, integrating multiple relatively simple cores on one chip to reduce design and verification difficulty while improving chip functionality.
A traditional multi-core processor integrates multiple processor cores that execute programs in parallel to improve chip performance. For such a processor, only parallel programming is likely to make full use of the resources, yet operating systems have not fundamentally changed how resources are allocated and managed: they are mostly distributed evenly in a symmetric manner. Although multiple processor cores can operate concurrently, a single program thread executes serially by design, so true pipelined operation cannot be realized in a traditional multi-core structure. In addition, current software still contains a large amount of code that must execute serially and cannot be split well. Consequently, once the number of cores reaches a certain point, performance no longer improves as more cores are added. Moreover, as semiconductor processes keep advancing, the operating frequency inside a multi-core processor has become much higher than that of its external memory, and simultaneous memory accesses by multiple processor cores have become a major bottleneck limiting system performance; running serial programs on a parallel multi-core structure cannot achieve the expected performance gains.
Summary of the invention
To address the deficiencies of the prior art, the present invention proposes a method and apparatus for data processing that runs serial programs at high speed and improves throughput.
The data processing method and apparatus of the present invention comprise: partitioning the program code running on a serially connected multi-core processor structure according to specific rules, such that the time each core in the structure needs to run its corresponding post-partitioning code segment is as equal as possible, thereby achieving load balancing of the workload across cores. The serially connected multi-core processor structure comprises a plurality of processor cores, where a processor core refers to hardware that performs operations and reads and writes data by executing instructions, including but not limited to central processing units (CPU) and digital signal processors (DSP).
The serially connected multi-core processor structure of the present invention constitutes serial multi-issue (serial multi-issue): in any unit of time, any core in the structure can issue one or more instructions, and the plurality of serially connected cores together form a larger-scale multi-issue, namely serial multi-issue.
The serially connected multi-core processor structure of the present invention constitutes a pipeline hierarchy (pipeline hierarchy): the internal pipeline of any core in the structure is the first level; the macroscopic pipeline in which each core acts as one macroscopic pipeline stage is the second level; and, by analogy, still higher levels can be obtained, for example a third level formed by treating the whole serially connected structure as one stage of a higher-level pipeline.
The code segments run on the cores of the serially connected multi-core processor structure of the present invention are generated, one or more at a time, through some or all of three steps: pre-compile (pre-compile), compile (compile) and post-compile (post-compile). The program code includes but is not limited to high-level language code and assembly language code.
The compile step is compilation in the existing ordinary sense, from program source code to object code.
The pre-compile step is a pre-compilation of the program source code performed before the compile step. It includes but is not limited to expanding the "calls" (call) in a program before compilation, replacing each call statement with the code actually called, to form program code without calls; such calls include but are not limited to function calls.
The post-compile step divides the object code obtained by the compile step into one or more code segments, as required by the way each core of the serially connected multi-core processor structure is assigned and loaded. Its steps include but are not limited to:
(a) parsing the executable program code to generate a front-end code stream;
(b) either running the front-end code stream on a specific model, scanning it and analyzing information such as the required execution cycles and whether and where jumps occur, then collecting the scan results and determining the partitioning information indirectly; or, without scanning the front-end code stream, determining the partitioning information directly from preset information; the specific model includes but is not limited to a behavioral model of the cores of the serially connected multi-core processor structure;
(c) dividing the executable program instruction code according to the partitioning information, generating the code segment corresponding to each processor core in the serially connected multi-core processor structure.
The pre-compile method of the present invention may be carried out before the program source code is compiled, or during compilation as part of the compiler; it may also be carried out in real time while the serially connected multi-core processor structure is running, as an operating system component, a driver, or an application program of that structure.
The post-compile method of the present invention may be carried out after the program source code has been compiled, or during compilation as part of the compiler, including but not limited to as an operating system component, driver, or application program of the serially connected multi-core processor structure; it may also be carried out in real time while the structure is running. When the post-compile method is carried out in real time, the configuration information corresponding to the code segments may be determined manually, generated dynamically and automatically according to the usage of the serially connected multi-core processor structure, or produced only as fixed configuration information.
Through this partitioning, existing application programs can be divided into segments that execute simultaneously, which not only increases the running speed of existing programs on a multi-core/many-core device and fully exploits the device's efficiency, but also guarantees the device's compatibility with existing applications, effectively resolving the predicament that existing applications cannot fully exploit the advantages of multi-core/many-core processors.
In the post-compile method of the present invention, determining the partitioning information indirectly means determining it according to, but not limited to, the cycle count or time of instruction execution, or the number of instructions: the whole executable program code can be divided into code segments of identical or similar running time according to the instruction execution cycle counts or times obtained by scanning the front-end code stream, or into code segments of identical or similar instruction counts according to the instruction counts obtained by that scan. Determining the partitioning information directly means determining it according to, but not limited to, the number of instructions: the whole executable program code can be divided directly into code segments of identical or similar instruction counts.
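As an illustration of the indirect rule above, the following sketch splits a scanned instruction stream into contiguous segments of roughly equal total cycle count. The greedy strategy and all names are assumptions for illustration only; the patent does not specify a concrete partitioning algorithm.

```python
def split_by_cycles(cycle_counts, num_cores):
    """Greedy contiguous split of an instruction stream into num_cores
    segments of near-equal total cycle count (illustrative sketch).

    cycle_counts: per-instruction execution cycles from the code-stream scan.
    Returns (start, end) index pairs, one segment per core.
    """
    total = sum(cycle_counts)
    target = total / num_cores  # ideal cycles per core
    segments, start, acc = [], 0, 0
    for i, c in enumerate(cycle_counts):
        acc += c
        remaining_cores = num_cores - len(segments) - 1
        # close a segment once the target is reached, while leaving at
        # least one instruction for each remaining core
        if (remaining_cores > 0 and acc >= target
                and len(cycle_counts) - i - 1 >= remaining_cores):
            segments.append((start, i + 1))
            start, acc = i + 1, 0
    segments.append((start, len(cycle_counts)))
    return segments
```

Splitting by instruction count instead (the direct rule) amounts to passing `cycle_counts = [1] * n`.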
In the post-compile method of the present invention, splitting loop code is avoided as far as possible, according to specific rules, when the executable program code is partitioned. When splitting loop code cannot be avoided, the loop code is, according to specific rules, split one or more times into a plurality of smaller-scale loops. These smaller loops may each be part of the same code segment or of different code segments. Smaller-scale loop code includes but is not limited to loop code containing fewer instructions and loop code with a smaller execution cycle count.
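One way to obtain several smaller loops from one unavoidable split, sketched below, is to divide the loop's iteration space into parts (strip-mining) so each part can land in a different code segment. This is one possible interpretation; the patent leaves the specific splitting rules open, and all names here are hypothetical.

```python
def strip_mine(total_iters, parts):
    """Split one loop of total_iters iterations into `parts` smaller
    loops, returning the (start, stop) iteration range of each.
    Extra iterations go to the earliest parts."""
    base, extra = divmod(total_iters, parts)
    ranges, start = [], 0
    for p in range(parts):
        n = base + (1 if p < extra else 0)
        ranges.append((start, start + n))
        start += n
    return ranges
```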
In the post-compile method of the present invention, the code segments include but are not limited to: partitioned executable object code, with corresponding configuration information, suited to running on the serially connected multi-core processor structure with a fixed number of processor cores; and unpartitioned executable object code suited to running on that structure, together with configuration information containing multiple pieces of segment information suited to a non-fixed number of cores. The segment information includes but is not limited to numbers indicating the instruction count of each segment, special marks indicating segment boundaries, and tables indicating the start of each code segment.
For example, in a device with 1000 processor cores, a table of 1000 entries can be generated from the maximum core count of 1000, each entry storing the position of the corresponding instruction in the unpartitioned executable object code; the group of instructions between any two adjacent entries is the code segment that can run on one core. If all 1000 processor cores are used at run time, each processor core runs the code between the two object-code positions pointed to by its corresponding pair of entries, i.e. each core runs one segment from the table. If only N processor cores are used at run time (N < 1000), each processor core runs 1000/N consecutive segments from the table, the specific code being determined from the position information in the table.
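The selection of segments for a variable core count can be sketched as follows. The table is modelled here as a list of boundary positions (one more boundary than segments), which is an assumption of this sketch rather than the patent's exact layout.

```python
def segments_for_cores(boundary_table, n_cores):
    """boundary_table holds instruction positions in the unpartitioned
    object code; the instructions between adjacent entries form one
    per-core segment. With n_cores active cores (a divisor of the
    maximum core count), each core takes max_cores // n_cores
    consecutive segments, as in the 1000-core example above."""
    max_cores = len(boundary_table) - 1
    per_core = max_cores // n_cores
    return [(boundary_table[i * per_core], boundary_table[(i + 1) * per_core])
            for i in range(n_cores)]
```

With a 4-segment table, using all 4 cores gives one segment each; using 2 cores gives each core 2 consecutive segments.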
Besides its partitioned code segment, the instructions each processor core runs may include extra instructions. These extra instructions include but are not limited to code-segment head extensions and code-segment tail extensions, used to achieve a seamless hand-over of instruction execution between processor cores. For example, a tail extension can be appended at the end of each code segment to store all values of the register file to a specific location in the data memory, and a head extension can be prepended at the beginning of each code segment to read the values at that specific location in the data memory back into the register file, thereby transferring register values between processor cores and guaranteeing correct program operation. When execution reaches the end of a code segment, the next instruction is the first instruction of that code segment.
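The register hand-over performed by the tail and head extensions can be sketched as a spill to, and a reload from, a fixed data-memory region. Function names, the dictionary register-file model, and the memory layout are illustrative assumptions.

```python
def tail_extension(regs, data_mem, base):
    """Tail extension: spill every register-file value to a fixed
    region of data memory so the next core can pick them up."""
    for i, name in enumerate(sorted(regs)):
        data_mem[base + i] = (name, regs[name])

def head_extension(data_mem, base, count):
    """Head extension: reload the register file from that same region,
    rebuilding the name -> value mapping."""
    return {data_mem[base + i][0]: data_mem[base + i][1]
            for i in range(count)}
```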
The data processing method and apparatus of the present invention can construct a configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, comprising a plurality of processor cores (processor core), a plurality of configurable local memories (configurable local memory), and a configurable interconnect structure (configurable interconnect structure), wherein:
the processor cores execute instructions, perform operations, and obtain the corresponding results;
the configurable local memories store instructions, carry data transfers between processor cores, and preserve data;
the configurable interconnect structure provides the connections between the modules of the configurable multi-core/many-core device and with the outside.
The configurable multi-core/many-core device may also comprise extension modules to meet wider demands; the extension modules include but are not limited to one or more, in part or in whole, of the following:
a shared memory (shared memory), for preserving data when the configurable data memory overflows and for passing shared data between a plurality of processor cores;
a direct memory access (DMA) controller, for direct access to the configurable local memories by modules other than the processor cores;
an exception handling (exception handling) module, for handling exceptions (exception) arising in the processor cores and local memories.
In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, a processor core comprises an arithmetic unit and a program counter, and may also comprise extension modules to meet wider demands; such extension modules include but are not limited to a register file. The instructions a processor core executes include but are not limited to arithmetic operation instructions, logic operation instructions, condition judgment and jump instructions, and trap and return instructions. The arithmetic and logic operation instructions include but are not limited to multiply, add/subtract, multiply-add/subtract, accumulate, shift, extract, and swap operations, covering fixed-point and floating-point operations of any bit width less than or equal to the data width of the processor core. Each processor core completes one or more of these instructions. The number of processor cores can be extended according to practical application requirements.
In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, each processor core has a corresponding configurable local memory, comprising an instruction memory (instruction memory) for storing the partitioned code segment and a configurable data memory (configurable data memory) for storing data.
Within the same configurable local memory, the boundary between the instruction memory and the configurable data memory can change according to different configuration information. Once the size and boundary of the configurable data memory have been determined from the configuration information, the configurable data memory comprises a plurality of data sub-memories.
Within the same configurable data memory, the boundaries between the data sub-memories can change according to different configuration information. The data sub-memories are mapped, through address translation, onto the whole address space of the multi-core/many-core device; such mapping includes but is not limited to address translation by table look-up and address translation by content addressable memory (CAM) matching.
In a data sub-memory, each entry (entry) comprises data and flag information; the flag information includes but is not limited to a valid bit (valid bit) and a data address. The valid bit indicates whether the data stored in the corresponding entry is valid; the data address indicates the position that the data stored in the corresponding entry should occupy in the whole address space of the multi-core/many-core device.
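A data sub-memory with per-entry valid bits and data addresses, read by CAM-style matching, might be modelled as below. The class and field names are illustrative assumptions, not the patent's hardware design.

```python
class DataSubMemory:
    """Each entry holds (valid, addr, data); a read hits only when an
    entry is valid and its stored address matches the request."""

    def __init__(self, size):
        self.entries = [{'valid': False, 'addr': None, 'data': None}
                        for _ in range(size)]

    def write(self, index, addr, data):
        # store the datum together with its global address
        self.entries[index] = {'valid': True, 'addr': addr, 'data': data}

    def read(self, addr):
        # CAM-style match over the whole sub-memory
        for e in self.entries:
            if e['valid'] and e['addr'] == addr:
                return e['data']
        return None  # miss

    def invalidate_all(self):
        # clear every valid bit, e.g. when the buffer changes roles
        for e in self.entries:
            e['valid'] = False
```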
In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, the configurable interconnect structure is configured to provide the connections between the modules of the device and with the outside, including but not limited to: connections between processor cores and adjacent configurable local memories; between processor cores and the shared memory; between processor cores and the direct memory access controller; between configurable local memories and the shared memory; between configurable local memories and the direct memory access controller; between configurable local memories and the outside of the device; and between the shared memory and the outside of the device.
Through configuration, two processor cores and their respective local memories can be made to form a preceding-stage/following-stage relationship, including but not limited to the preceding-stage processor core transferring data to the following-stage processor core through its corresponding configurable data memory.
According to application program requirements, the configurable interconnect structure can, through configuration, form some or all of the processor cores and their local memories into one or more serially connected structures. A plurality of such serially connected structures can each be independent, or partly or wholly connected to each other, executing instructions serially, in parallel, or in a serial-parallel mixture. Such execution includes but is not limited to: different serially connected structures running different program segments and executing different instructions in parallel under the control of a synchronization mechanism, as the application requires, i.e. multi-threaded parallel operation; and different serially connected structures running the same program segment under the control of a synchronization mechanism, as the application requires, performing intensive same-instruction, different-data operations in single-instruction multiple-data (SIMD) fashion.
In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, the processor cores in a serially connected structure have specific data read policies (read policy) and write policies (write policy).
Under the data read policy, the input data sources of the first processor core in the serially connected structure include but are not limited to its own corresponding configurable data memory, the shared memory, and the outside of the configurable multi-core/many-core device; the input data sources of any other processor core include but are not limited to its own corresponding configurable data memory and the configurable data memory corresponding to the preceding-stage processor core. Correspondingly, the destinations of the output data of any processor core include but are not limited to its own corresponding configurable data memory and the shared memory; when an extended memory exists, the output data of any processor core may also go to the extended memory.
Under the data write policy, the input data sources of the configurable data memory corresponding to the first processor core in the serially connected structure include but are not limited to the processor core itself, the shared memory, and the outside of the configurable multi-core/many-core device; the input data sources of the configurable data memory corresponding to any other processor core include but are not limited to the processor core itself, the configurable data memory corresponding to the preceding-stage processor core, and the shared memory. Input data from the different sources of a processor core and its corresponding configurable data memory are multiplexed according to specific rules to determine the final input data.
The same configurable data memory can be accessed simultaneously by the two processor cores before and after it, each accessing a different data sub-memory within it. The processor cores can each access different data sub-memories in the same configurable data memory according to specific rules. These rules include but are not limited to: different data sub-memories in the same configurable data memory act as each other's ping-pong buffers (ping-pong buffer), accessed respectively by the two processor cores; once both the preceding-stage and following-stage processor cores have completed their accesses to the ping-pong buffers, the buffers are exchanged, so that the data sub-memory previously read/written by the preceding-stage core becomes the data sub-memory read by the following-stage core, while in the data sub-memory previously read by the following-stage core all valid bits are set to invalid and it becomes the data sub-memory read/written by the preceding-stage core.
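The ping-pong exchange described above can be sketched with two buffers modelled as plain dictionaries (clearing a buffer stands in for resetting all its valid bits). Names are illustrative assumptions.

```python
class PingPong:
    """Two data sub-memories acting as each other's ping-pong buffers:
    the preceding-stage core writes one while the following-stage core
    reads the other; swap() exchanges the roles."""

    def __init__(self):
        self.bufs = [{}, {}]
        self.write_idx = 0          # buffer the preceding stage writes

    @property
    def read_idx(self):             # buffer the following stage reads
        return 1 - self.write_idx

    def swap(self):
        # the following stage has consumed its buffer: invalidate it
        # (modelled as clearing) and hand it to the preceding stage
        self.bufs[self.read_idx].clear()
        self.write_idx = self.read_idx
```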
When the processor cores in the multi-core/many-core system contain register files, a specific register-value transfer rule is also needed: one or more register values in any preceding processor core in the serially connected structure can be transferred into the corresponding registers of any following processor core. The register values include but are not limited to the values of the registers in the register file of the processor core. The transfer paths of the register values include but are not limited to: transfer through the configurable interconnect structure; direct transfer through the shared memory; direct transfer through the configurable data memory corresponding to the processor core; transfer through the shared memory according to a specific instruction; and transfer through the corresponding configurable data memory according to a specific instruction.
In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, at the second level of the pipeline hierarchy a macroscopic pipeline stage can pass its information to the preceding macroscopic pipeline stage through back pressure (back pressure). The preceding stage learns from the received back-pressure information whether the macroscopic pipeline after it has stalled (stall), combines this with its own situation to decide whether it should stall, and passes new back-pressure information to the stage before it, thereby realizing control of the macroscopic pipeline.
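The stall decision propagating upstream can be sketched as a backward walk over the macroscopic stages. The rule that a stage stalls when it is busy or its downstream neighbor is stalled is an assumption of this sketch; the patent only says the decision combines back pressure with the stage's own situation.

```python
def stage_stall(downstream_stalled, local_busy):
    """One macro-pipeline stage's decision: stall if the downstream
    stage signals back pressure or this stage itself is busy."""
    return downstream_stalled or local_busy

def pipeline_stalls(local_busy_flags):
    """Walk stages from last to first, propagating back pressure
    upstream; returns every stage's stall flag."""
    stalls = [False] * len(local_busy_flags)
    downstream = False
    for i in reversed(range(len(local_busy_flags))):
        stalls[i] = stage_stall(downstream, local_busy_flags[i])
        downstream = stalls[i]
    return stalls
```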
The configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention may have an extended shared memory for storing data when a processor core's corresponding configurable data memory overflows and for passing shared data between a plurality of processor cores; it may also have an extended exception handling (exception handling) module for handling exceptions (exception) arising in the processor cores and local memories.
When the multi-core/many-core device has a shared memory and data being stored overflows the configurable data memory, an exception is raised and the data is stored into the shared memory; in this case the flag information of each entry (entry) in the data sub-memory includes but is not limited to a valid bit, a data address, and a data tag (tag). The valid bit indicates whether the data stored in the corresponding entry is valid; the data address and data tag together indicate the position that the data stored in the corresponding entry should occupy in the whole address space of the multi-core/many-core device.
The exception information produced by all processor cores is transferred to the exception handling module, which performs the corresponding handling. The exception handling module may be formed from processor cores in the multi-core/many-core device, or be an additional module. The exception information includes but is not limited to the number of the processor on which the exception occurred and the exception type. The corresponding handling of a processor core and/or local memory on which an exception occurred includes but is not limited to passing each core's pipeline-stall status along the serially connected structure by back-pressure signalling.
In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, the processor cores, configurable local memories, and configurable interconnect structure can be configured according to application program requirements. Such configuration includes but is not limited to turning processor cores on or off, configuring the size/boundary of the instruction memory and data sub-memories in a local memory and their contents, and configuring the interconnect structure and its connection relationships.
The sources of the configuration information include but are not limited to the inside and the outside of the configurable multi-core/many-core device. The configuration can be adjusted at any time during operation according to application program requirements. The configuration methods include but are not limited to: direct configuration by a processor core or central processing unit core; configuration through the direct memory access controller by a processor core or central processing unit core; and configuration through the direct memory access controller on external request.
The configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention has low-power techniques at three levels: the configuration level, the instruction level, and the application level.
At the configuration level, processor cores that are not used can, according to the configuration information, enter a low-power state; the low-power state includes but is not limited to lowering the processor clock frequency or cutting off the power supply.
At the instruction level, when a processor core executes an instruction that reads data and the data is not yet ready, the processor core enters a low-power state until the data is ready, whereupon it returns from the low-power state to normal operation. Data not being ready includes but is not limited to the preceding-stage processor core not yet having written the data needed by the current-stage core into the corresponding data sub-memory. The low-power state includes but is not limited to lowering the processor clock frequency or cutting off the power supply.
At the application level, a fully hardware implementation matches the features of the idle (idle) task to determine the utilization (utilization) of the current processor core, and decides whether to enter the low-power state or return from it according to the current utilization and a benchmark utilization. The benchmark utilization may be fixed, reconfigurable, or determined by self-learning; it may be hardwired inside the chip, written by the device at start-up, or written by software. The reference content used for matching may be cured into the chip at production, written by the device or by software at start-up, or written by self-learning; its storage medium includes but is not limited to volatile and non-volatile memory, and its write mode includes but is not limited to write-once and multiple-write. The low-power state includes but is not limited to lowering the processor clock frequency or cutting off the power supply.
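The application-level decision, comparing current utilization against the benchmark utilization, might look like the following. The simple hysteresis-free threshold rule is an assumption; the patent does not fix the comparison logic.

```python
def power_decision(utilization, benchmark, in_low_power):
    """Return the new low-power flag: enter low power when the core is
    under-utilized relative to the benchmark, return to normal
    operation when utilization recovers."""
    if not in_low_power and utilization < benchmark:
        return True   # enter low-power state
    if in_low_power and utilization >= benchmark:
        return False  # return to normal operation
    return in_low_power
```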
The configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention can have a self-test capability, performing chip self-test at power-up without relying on external equipment.
When described multi-core/many-core device possesses self-testing capability, can by odd number specific in described multi-core/many-core device or a plurality of primary element, arithmetic element or processor core are with becoming comparer, to other primary elements of plural groups corresponding in described multi-core/many-core device, arithmetic element or processor core and primary element, the combination of arithmetic element or processor core has the excitation of particular kind of relationship, and with other primary elements of the more described plural groups of described comparer, arithmetic element or processor core and primary element, arithmetic element or processor core the output of combination whether meet corresponding particular kind of relationship.Described excitation can from the particular module in described multi-core/many-core device, also can be outside from described multi-core/many-core device.Described particular kind of relationship includes but not limited to equal, contrary, reciprocal, complementary.It is outside that described test result can be sent to described multi-core/many-core device, also can be kept in the storer in described multi-core/many-core device.
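A behavioral sketch of this comparator-based self-test (not the patented circuit; the "equal" relationship and the function names are illustrative assumptions): the same stimuli are applied to duplicate units, and the comparator flags any stimulus whose outputs violate the expected relationship.

```python
def self_test(units, stimuli, relation=lambda a, b: a == b):
    """Apply each stimulus to every unit under test; return the list of
    stimuli whose outputs violate the expected pairwise relationship."""
    failures = []
    for s in stimuli:
        outs = [u(s) for u in units]
        # The first unit's output is the reference the comparator checks against.
        if not all(relation(outs[0], o) for o in outs[1:]):
            failures.append(s)
    return failures
```

The returned list corresponds to the test result that may be kept in an internal memory or sent off-chip; marking the failing unit then enables the self-repair described below.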
The self-test may be performed at wafer probing, at integrated-circuit testing after packaging, at device start-up, or during chip use; the self-test condition and period may also be set manually, so that self-test is carried out periodically at run time. The memory used by the self-test includes, but is not limited to, volatile and non-volatile memory.
When the multi-core/many-core device has self-test capability, it may also have self-repair capability. When the test result is kept in a memory inside the device, failing processor cores can be marked; when the device is configured, failing cores are bypassed according to their marks, so that the device still works normally, thereby achieving self-repair. The self-repair may be carried out after wafer probing, after integrated-circuit testing following packaging, or after testing at device start-up or during chip use; the self-test and self-repair condition and period may also be set manually, so that repair follows periodic self-test at run time.
The plurality of processor cores in the configurable multi-core/many-core device of the present invention may be homogeneous or heterogeneous.
In the configurable multi-core/many-core device of the present invention, the instruction word length in the local instruction memory may be variable.
In the configurable multi-core/many-core device of the present invention, the local instruction memory and the local data memory may each have one or more groups of read ports.
In the configurable multi-core/many-core device of the present invention, each processor core may also correspond to a plurality of local instruction memories, which may be of the same or different sizes and of the same or different structures. While one or more of these local instruction memories respond to instruction fetches from the corresponding processor core, the others may undergo instruction update. The instruction update path includes, but is not limited to, update through a direct memory access (DMA) controller.
The plurality of processor cores in the configurable multi-core/many-core device of the present invention may operate at the same clock frequency or at different clock frequencies.
The configurable multi-core/many-core device of the present invention may have a load-induced store (LIS) characteristic. The first time a processor core reads data at a given address, it reads the data from the local data memory corresponding to the adjacent preceding-stage processor core and simultaneously writes the data into the local data memory corresponding to the current-stage processor core; thereafter, all reads and writes to that address access the current-stage local data memory. This transfers data at the same address between adjacent-stage local data memories without adding extra overhead.
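The LIS behavior can be sketched as follows (purely illustrative, not the hardware implementation; the class and method names are assumptions): the first read of an address pulls the datum from the previous stage's local data memory and stores a local copy, so later accesses stay local.

```python
class StageMemory:
    """Local data memory of one pipeline stage, with LIS behavior."""

    def __init__(self, upstream=None):
        self.mem = {}              # address -> data held at this stage
        self.upstream = upstream   # previous stage's StageMemory, if any

    def read(self, addr):
        if addr not in self.mem and self.upstream is not None:
            # First read of this address: the load induces a local store.
            self.mem[addr] = self.upstream.read(addr)
        return self.mem[addr]

    def write(self, addr, value):
        self.mem[addr] = value
```

After the first read, the current stage no longer observes later changes at the previous stage, matching the description that all subsequent accesses go to the current-stage memory.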
The configurable multi-core/many-core device of the present invention may also pre-transfer data: a processor core may read, from the local data memory corresponding to the preceding-stage processor core, data that the current core itself does not read or write but that a subsequent processor core needs, and write them into the local data memory corresponding to the current-stage processor core, so that data at the same address are passed stage by stage through the local data memories.
The local data memory of the present invention may further include one or more valid flags and one or more ownership flags. A valid flag indicates whether the corresponding data are valid; an ownership flag indicates which processor core is currently using the corresponding data. Using valid and ownership flags avoids ping-pong buffering, improves memory utilization, and allows multiple processor cores to access the same data memory simultaneously, which facilitates data exchange.
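A minimal sketch of one memory entry guarded by a valid flag and an ownership flag, under the assumption (not stated in the text) that a consumer clears the valid flag after reading; the class and method names are illustrative:

```python
class FlaggedEntry:
    """One local-data-memory entry with a valid flag and an ownership flag."""

    def __init__(self):
        self.data = None
        self.valid = False    # does this entry hold valid data?
        self.owner = None     # which core may currently use the data

    def produce(self, data, consumer):
        self.data, self.valid, self.owner = data, True, consumer

    def consume(self, core):
        """Return the data only if valid and owned by this core."""
        if self.valid and self.owner == core:
            self.valid = False        # release the entry after use
            return self.data
        return None                   # not ready for (or not owned by) this core
```

Because ownership is per-entry, two cores can share one physical memory without the duplicated ping-pong buffers mentioned above.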
Transferring register values through the configurable interconnect structure of the present invention includes, but is not limited to, using a large number of hard wires to transfer all register values of a processor core to the registers of the following-stage processor core at once, or using shift registers to shift the register values of a processor core into the registers of the following-stage processor core one by one.
The transfer path of the register values may also transfer only the registers that a register read/write record table determines need transferring. The register read/write record table of the present invention records the reads and writes between the registers and the corresponding local data memory. If a register's value has already been written into the local data memory corresponding to the current-stage processor core and the register's value has not changed since, the following-stage processor core can simply read the data from the appropriate address of the current-stage local data memory, completing the transfer of that register without transmitting its value separately to the following-stage processor.
For example, when a register's value is written into the corresponding local data memory, the corresponding entry in the register read/write record table is cleared to "0"; when data are written into the register, the corresponding entry is set to "1". During register-value transfer, only the registers whose record-table entries are "1" are transmitted. Writing data into a register of the register file includes, but is not limited to, loading data from the corresponding local data memory into the register and writing back instruction results into the register.
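The record-table rule above can be sketched behaviorally (the class and method names are illustrative assumptions): a per-register bit is set on register write and cleared when the value is stored to local data memory, and only the set bits are transmitted at the stage boundary.

```python
class RegisterFile:
    """Register file with the read/write record table described above."""

    def __init__(self, n):
        self.regs = [0] * n
        self.dirty = [0] * n        # the register read/write record table

    def write_reg(self, i, value):
        """Instruction result write-back (or load from memory) sets the bit."""
        self.regs[i] = value
        self.dirty[i] = 1

    def store_to_memory(self, i, memory, addr):
        """Storing the value clears the bit: it is now recoverable from memory."""
        memory[addr] = self.regs[i]
        self.dirty[i] = 0

    def registers_to_transmit(self):
        """Only registers whose record-table entry is 1 need transferring."""
        return [i for i, d in enumerate(self.dirty) if d]
```

The following-stage core recovers the cleared registers from the local data memory instead, which is exactly the saving the record table is meant to expose.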
When the number of processor cores in the configurable multi-core/many-core device of the present invention based on serial multi-issue and pipeline structure is determined, the code-segment header extension and tail extension can also be optimized according to the determined code segments obtained after partitioning, reducing the number of registers that need to be transferred.
For example, under normal conditions the code-segment tail extension contains instructions that store all register values to specific local data memory addresses, and the code-segment header extension contains instructions that load the values from the corresponding addresses into the registers; together they achieve smooth register-value transfer. Once the code segments are determined, the number of store and/or load instructions in the header and tail extensions can be reduced according to the instructions in each code segment.
If, in the code segment corresponding to the current-stage processor core, the value of a certain register is not used before that register is written, then the instruction storing that register's value in the tail extension corresponding to the preceding-stage processor core, and the instruction in the header extension corresponding to the current-stage processor core loading data from the local data memory into that register, can both be omitted.
If, in the code segment corresponding to the preceding-stage processor core, the value of a certain register is not modified after it has been stored into the local data memory, then the instruction storing that register's value in the preceding-stage tail extension can be omitted, and a corresponding instruction is added to the header extension of the current-stage processor core to load the data into that register from the appropriate address of the local data memory.
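The first omission rule can be sketched as a small analysis (the `('read'/'write', reg)` operation encoding and the function name are illustrative assumptions, not the patent's representation): a register needs a value passed into the segment only if the segment reads it before writing it.

```python
def registers_needing_transfer(segment_ops, all_regs):
    """Return the registers whose incoming value this code segment actually
    uses; all others can be dropped from the tail/header extensions."""
    needed = set()
    seen_write = set()
    for op, reg in segment_ops:
        if op == "read" and reg not in seen_write:
            needed.add(reg)        # reads the value produced by earlier stages
        elif op == "write":
            seen_write.add(reg)    # overwritten before any read: not needed
    return sorted(needed & set(all_regs))
```

Registers absent from the returned list need neither the store in the preceding-stage tail extension nor the load in the current-stage header extension.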
In the data processing method and device of the present invention, when the code segments corresponding to a plurality of processor cores all transfer control to the same address during execution and, after that code finishes, return to their own code segments, the code at that same address may be stored redundantly in the local instruction memories corresponding to those processor cores. The code at the same address includes, but is not limited to, function calls and loops.
In the data processing method and device of the present invention, a processor core may access the local instruction memories of processor cores other than itself. When a plurality of processor cores execute identical code whose length exceeds the local instruction memory capacity of a single processor core, the code may be stored piecewise across the local instruction memories corresponding to the plurality of cores. At run time, any one of these processor cores first fetches instructions from the local instruction memory holding the first piece of the identical code; after the first piece finishes, it fetches from the local instruction memory holding the second piece, and so on, until the whole of the identical code has been executed.
In the data processing method and device of the present invention, the plurality of processor cores may execute the pieces of the identical code synchronously or asynchronously; they may execute the pieces in parallel or serially, or in a mixed serial-parallel manner.
In the data processing method and device of the present invention, the processor core may also correspond to a plurality of local instruction memories, which may be of the same or different sizes and of the same or different structures. While one or more of these local instruction memories respond to instruction fetches from the corresponding processor core, the others may undergo instruction update; the update path may be through a direct memory access controller.
In a traditional system-on-chip (SoC, System on Chip), all functional modules other than the processor are application-specific integrated circuit (ASIC) modules implemented in hard-wired logic. These modules have very high performance requirements that a traditional processor can hardly meet, so conventional processors cannot substitute for these ASIC modules.
In the data processing method and device of the present invention, one or more processor cores and their corresponding local memories can form a high-performance multi-core connection structure. By configuring the structure and placing the corresponding code segments into the corresponding local instruction memories, the structure realizes a specific function and can substitute for an ASIC module in a SoC. Such a multi-core connection structure is equivalent to a functional module in a SoC, such as an image decompression module or an encryption/decryption module. These functional modules are then connected by a system bus to realize the SoC.
In the present invention, the data transfer path between a processor core with its local memories and an adjacent processor core with its local memories is a local connection. A single processor core with its local memories, or a multi-core connection structure formed by locally connecting a plurality of processor cores and their local memories, corresponds to a functional module of a SoC.
In the present invention, the data transfer channel between the multi-core connection structure corresponding to one SoC functional module and the multi-core connection structures corresponding to other SoC functional modules is the system bus. By connecting a plurality of multi-core connection structures corresponding to SoC functional modules through this system bus, a SoC in the ordinary sense is realized.
A SoC realized with the technical scheme of the present invention has a configurability that traditional SoCs lack: by configuring the data processing device of the present invention differently, different SoCs can be obtained. The configuration can be performed in real time during operation, so the SoC's functions can be changed at run time: the processor cores and their local memories can be dynamically reconfigured and the code segments in the corresponding local instruction memories dynamically replaced, thereby changing the function of the SoC.
According to the technical scheme of the present invention, the local connections used for data transfer between processor cores (with their local memories) inside the multi-core connection structure corresponding to a SoC functional module belong to the inside of that functional module. Transferring data over these internal local connections usually occupies the operation of the processor core that issues the transfer request. The system bus of the present invention may be such a local connection, or it may be a data transfer channel between different processor cores and their local memories that completes the transfer without occupying processor core operation. The different processor cores and their local memories may be adjacent or non-adjacent.
In the data processing method and device of the present invention, one method of constructing the system bus uses connection devices at a plurality of fixed positions to set up data transfer channels. The inputs and outputs of every multi-core connection structure are connected to the nearby connection devices by one or more hard wires, and the connection devices are likewise connected to one another by one or more hard wires. The connection devices, the wires between the multi-core connection structures and the connection devices, and the wires between the connection devices together constitute the system bus.
In the data processing method and device of the present invention, another method of constructing the system bus sets up data transfer channels so that any processor core with its corresponding local data memory can transfer data to any other processor core with its corresponding local data memory. The transfer paths include, but are not limited to, transfer through shared memory, transfer through a direct memory access controller, and transfer over a dedicated bus or network.
For example, one method is to lay one or more hard wires in advance between every pair of processor cores (with their corresponding local data memories) among some of the processor cores; these hard wires may be configurable. When two of these processor cores and their local data memories belong to different multi-core connection structures, that is, to different functional modules, the hard wires between them serve as the system bus between the two multi-core connection structures.
A second method is to let all or some of the processor cores and their local data memories access other processor cores and their local data memories through a direct memory access controller. When two such processor cores and their local data memories belong to different multi-core connection structures, that is, to different functional modules, data can be transferred between them as needed in real time during execution, realizing the system bus between the two structures.
A third method is to implement a network-on-chip (Network on Chip) function on all or some of the processor cores and their local data memories: when data are transferred from one processor core and its local data memory to another, the destination of the data is determined by the configurable interconnection network, forming a data path that carries out the transfer. When two such processor cores and their local data memories belong to different multi-core connection structures, that is, to different functional modules, data can be transferred between them as needed in real time during execution, realizing the system bus between the two structures.
Of these three methods, the first realizes the system bus with a hard-wired structure, so its connections are static; the second uses direct memory access and the third uses the network-on-chip method, so their connections are dynamic.
In the data processing method and device of the present invention, the processor core may have a fast condition judgment mechanism for deciding whether a branch is taken. The fast condition judgment mechanism may be a counter for judging loop conditions, or a hardware finite state machine for judging branch and loop conditions.
For the configuration-level low power of the present invention, specific processor cores can also be put into a low-power state according to the configuration information. The specific processor cores include, but are not limited to, unused cores and cores with relatively low workload. The low-power state includes, but is not limited to, reducing the processor clock frequency or cutting off the power supply.
The data processing method and device of the present invention may also include one or more dedicated processing modules. A dedicated processing module may be called as a macro block by a processor core and its local memories, or may act as an independent processing module that receives the output of a processor core and its local memories and sends the result to that processor core and its local memories or to other processor cores and their local memories. The processor core sending output to the dedicated processing module and the processor core receiving its output may be the same core or different cores. The dedicated processing modules include, but are not limited to, fast Fourier transform (FFT) modules, entropy encoder modules, entropy decoder modules, matrix multiplication modules, convolutional encoder modules, Viterbi code decoder modules, and Turbo code decoder modules.
Take the matrix multiplication module as an example. Performing a large matrix multiplication on a single processor core takes a great number of clock cycles and limits data throughput; using multiple processor cores reduces the cycle count but increases the inter-core data transfer and occupies many processor resources. A dedicated matrix multiplication module can complete a large matrix multiplication within a few cycles. When partitioning the program, the operations before the large matrix multiplication are assigned to one group of processor cores (the front group) and the operations after it to another group (the back group). The output of the front group sends the data participating in the matrix multiplication to the dedicated matrix multiplication module, which, after processing, sends the result on to the back group; data not participating in the matrix multiplication are passed directly from the front group to the back group.
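The partitioning just described can be sketched as a three-stage dataflow (the stage functions and their signatures are illustrative assumptions; `matmul` is only a tiny stand-in for the dedicated hardware module):

```python
def run_pipeline(front_group, matmul_module, back_group, inputs):
    """Front cores prepare operands, the dedicated module multiplies them,
    and non-participating data bypass the module to the back cores."""
    a, b, passthrough = front_group(inputs)
    product = matmul_module(a, b)              # done in few cycles in hardware
    return back_group(product, passthrough)

def matmul(a, b):
    """Plain matrix multiplication, standing in for the dedicated module."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]
```

The `passthrough` value models the data that "are directly conveyed" from the front group to the back group without entering the module.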
Beneficial effects:
First, the data processing method and device of the present invention can partition a serial program into code segments suited to the individual processor cores of a serially connected multi-core structure. For different numbers of processor cores, different partitioning rules yield code segments of different sizes and numbers, so the method suits scalable multi-core/many-core devices and systems.
Second, according to the data processing method and device of the present invention, the code segments are distributed to the processor cores of the serially connected multi-core structure for execution; each core executes its specific instructions, and all the cores together realize the function of the serial program. The data shared between the code segments partitioned from the complete program code are passed over dedicated transfer paths, so there is almost no data dependency problem and true multi-issue is achieved: in the serially connected multi-core structure, the issue width equals the number of processor cores, greatly raising arithmetic-unit utilization and thereby achieving high throughput for the structure and for the device/system as a whole.
Third, local memories replace the cache usually found in processors. All the instructions and data a processor core will use are kept in its corresponding local memories, achieving a 100% access hit rate, eliminating the speed bottleneck of accessing slow external memory on a cache miss, and further improving the overall performance of the device/system.
Fourth, the multi-core/many-core device of the present invention applies low-power techniques at three levels: coarse-grained power management, such as cutting off the power supply of unused processor cores; fine-grained, data-driven power management at the instruction level; and automatic, real-time adjustment of the processor core clock frequency in hardware, which effectively reduces the dynamic operating power of the cores while guaranteeing their normal operation, lets each core adjust its clock frequency on demand, and minimizes manual intervention. Because the adjustment is implemented in hardware, it is fast and can adjust the processor clock frequency in real time.
Finally, with the technical scheme of the present invention, a system-on-chip can be realized by programming and configuration alone, shortening the development cycle from design to market. Moreover, merely by reprogramming and reconfiguring, the same hardware product can realize different functions.
Brief description of the drawings
Although this invention can be modified and replaced in various forms, some specific embodiments are set forth and described in detail in the specification. It should be understood that the inventor's intent is not to limit the invention to the specific embodiments set forth; on the contrary, the intent is to protect all improvements, equivalent transformations, and modifications falling within the spirit or scope defined by the claims.
Fig. 1 illustrates the present invention with an example flow of partitioning and distributing a high-level language program and an assembly language program.
Fig. 2 is an embodiment of handling program loops in the post-compilation method of the present invention.
Fig. 3 is a schematic diagram of the configurable multi-core/many-core device of the present invention based on serial multi-issue and pipeline structure.
Fig. 4 shows embodiments of the address mapping scheme.
Fig. 5 shows embodiments of inter-core data transfer.
Fig. 6 shows embodiments of back pressure, exception handling, and the connection between data memories and shared memory.
Fig. 7 shows the self-test and self-repair method and structure of the present invention.
Fig. 8(a) is a first embodiment of register-value transfer between adjacent processor cores.
Fig. 8(b) is a second embodiment of register-value transfer between adjacent processor cores.
Fig. 9 is a third embodiment of register-value transfer between adjacent processor cores.
Figure 10(a) is one embodiment of a processor core and its corresponding local memories according to the present invention.
Figure 10(b) is another embodiment of a processor core and its corresponding local memories according to the present invention.
Figure 10(c) is an embodiment of the valid flag bits and ownership flag bits in a processor core and its corresponding local memories according to the present invention.
Figure 11(a) is the typical structure of an existing system-on-chip.
Figure 11(b) is one embodiment of a system-on-chip realized with the technical scheme of the present invention.
Figure 11(c) is another embodiment of a system-on-chip realized with the technical scheme of the present invention.
Figure 12(a) is an embodiment of pre-compilation in the technical scheme of the present invention.
Figure 12(b) is an embodiment of post-compilation in the technical scheme of the present invention.
Figure 13(a) is another schematic diagram of the configurable multi-core/many-core device of the present invention based on serial multi-issue and pipeline structure.
Figure 13(b) shows a multi-core serial structure formed by configuring the configurable multi-core/many-core device of the present invention.
Figure 13(c) shows a mixed serial-parallel multi-core structure formed by configuring the configurable multi-core/many-core device of the present invention.
Figure 13(d) is a schematic diagram of multiple multi-core structures formed by configuring the configurable multi-core/many-core device of the present invention.
Detailed description of the embodiments
Fig. 1 illustrates the present invention with an example flow of partitioning and distributing a high-level language program and an assembly language program. Calls in the high-level language program (101) and/or assembly language program (102) are first expanded by the pre-compilation step (103) to obtain call-expanded high-level language code and/or assembly code. The call-expanded code is then compiled by the compiler (104) into assembly code ordered by execution sequence, after which post-compilation (107) is carried out; if the program contains only assembly code that is already ordered by execution sequence, the compilation (104) can be skipped and post-compilation (107) performed directly. During post-compilation (107), in this embodiment, the assembly code is run on the behavior model (108) of the processor cores and partitioned, based on the structural information of the multi-core device (106), yielding the configuration information (110) and producing the corresponding configuration boot program (109). Finally, one processor core (111) in the device configures the corresponding plurality of processor cores (113), either directly or through the DMA controller (112).
In Fig. 2, the instruction dispatcher first reads in a front-end code stream segment in step one (201), then reads in the relevant information of the front-end code stream in step two (202). Step three (203) then judges whether the code stream segment loops. If it does not loop, step nine (209) processes the code stream segment in the usual way. If it loops, step four (204) first reads in the loop iteration count M, and step five (205) reads in the number of iterations N that the program segment can hold. Step six (206) judges whether the loop iteration count M is greater than the capacity N; if so, step seven (207) splits the loop into a partial loop of N iterations, which is executed, and a partial loop of M−N iterations, and step eight (208) assigns M−N back to M and enters the loop of the next program segment, repeating until the loop iteration count is no greater than the number of iterations that can be held. This method effectively handles the case where the loop iteration count exceeds the capacity of a program segment.
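The splitting rule of steps six through eight can be sketched as follows (an illustrative model of the dispatcher's arithmetic, not the hardware itself):

```python
def split_loop(m, n):
    """Return the loop iteration counts assigned to successive program
    segments, given total iterations m and per-segment capacity n."""
    chunks = []
    while m > n:
        chunks.append(n)    # a partial loop of N iterations runs in this segment
        m -= n              # M-N is re-assigned to M for the next segment
    chunks.append(m)        # the remaining partial loop fits in one segment
    return chunks
```

Every segment receives at most `n` iterations, and the counts sum back to the original `m`, which is what allows a loop larger than one program segment to be executed piecewise.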
Fig. 3 is a schematic diagram of the configurable multi-core/many-core device of the present invention, based on serial multi-issue and pipeline-stage organization. In the present embodiment, the device is composed of a number of processor cores (301), configurable local memories (302) and a configurable interconnect structure (303). Each processor core (301) corresponds to the configurable local memory (302) below it, and together they form one stage of the macro pipeline. By configuring the configurable interconnect structure (303), multiple processor cores (301) and their corresponding configurable local memories (302) can be connected into serial structures. Multiple serial structures can each be independent, or can be partly or wholly interconnected, running programs serially, in parallel, or in a mixed serial-parallel fashion.
Fig. 4 shows embodiments of the address mapping scheme. Fig. 4(a) uses a look-up table to implement address lookup. Taking 16-bit addresses as an example, the 64K address space is divided into multiple small memories (403) of 1K address space each, written sequentially: after one memory block is full, the next block is written. After each write, the address pointer (404) within the block automatically points to the next available entry whose valid bit is 0, and the valid bit of the written entry is set to 1. When data is written into an entry, its address is simultaneously written into the look-up table (402). For example, when a value is written at address BFC0 while the address pointer (404) points to entry No. 2 of the memory (403), the data is written into entry No. 2 and the value 2 is written into the look-up table (402) at the location corresponding to address BFC0, thereby establishing the address mapping. When reading data, the corresponding entry is found through the look-up table (402) according to the address, and the stored data is read out. Fig. 4(b) uses a CAM (content-addressable memory) array to implement address lookup. Again taking 16-bit addresses as an example, the 64K address space is divided into multiple small memories (403) of 1K address space each, written sequentially: after one memory block is full, the next block is written. After each write, the address pointer (406) within the block automatically points to the next available entry whose valid bit is 0, and the valid bit of the written entry is set to 1. When data is written into an entry, its instruction address is simultaneously written into the next entry of the CAM array (405). For example, when a value is written at address BFC0 while the address pointer (406) points to entry No. 2 of the memory (403), the data is written into entry No. 2 and the instruction address BFC0 is written into the next entry of the CAM array (405), thereby establishing the address mapping. When reading data, the input instruction address is compared with all instruction addresses stored in the CAM array to find the corresponding entry, and the stored data is read out.
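The look-up-table variant of Fig. 4(a) can be sketched as follows; the class name and internal representation are illustrative assumptions, capturing only the sequential-write and address-mapping behavior described above:

```python
class LookupTableMemory:
    """Sketch of the Fig. 4(a) scheme: data is written sequentially
    into a small memory block, and a look-up table maps each full
    address to the entry index that was used."""
    def __init__(self, size=1024):
        self.data = [None] * size
        self.valid = [0] * size
        self.pointer = 0           # next available entry (valid bit 0)
        self.lut = {}              # full address -> entry index

    def write(self, address, value):
        entry = self.pointer
        self.data[entry] = value
        self.valid[entry] = 1      # valid bit of the entry set to 1
        self.lut[address] = entry  # establish the address mapping
        # pointer advances to the next entry whose valid bit is 0
        while self.pointer < len(self.valid) and self.valid[self.pointer]:
            self.pointer += 1

    def read(self, address):
        # find the entry through the look-up table, read stored data
        return self.data[self.lut[address]]
```

Writing a value at address BFC0 while the pointer sits at entry 0 records the mapping BFC0 → 0; a later read of BFC0 consults the table and returns the stored value.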
Fig. 5 shows an embodiment of data transfer between cores. All data memories lie between processor cores and are divided, logically, into an upper part and a lower part. The upper part is read and written by the processor core above the data memory, while the lower part is only read by the processor core below it. While the processor cores run their programs, data is relayed downward from one data memory to the next. The one-of-three selectors (502, 509) can select data (506) transmitted from afar to be written into the data memories (503, 504). When the processor cores (510, 511) are not executing store instructions, the lower parts of the data memories (501, 503) are written, through the one-of-three selectors (502, 509), into the upper parts of the corresponding next data memories (503, 504), and the valid bit V of each written line is set to 1. When a store instruction is executed, the register file writes its value only to the data memory below. When a load instruction needs the data at a given address, the two-way selectors (505, 507) determine, according to the valid bit V in the data memories (503, 504), whether to fetch from the corresponding data memory above (501, 503) or from the data memory below (503, 504). If the valid bit V of an entry in the data memory (503, 504) is 1, indicating that the data has been updated by a write from the data memory above (501, 503), then, when remotely transmitted data (506) is not selected, the one-of-three selectors (502, 509) select the register file outputs of the processor cores (510, 511) as input, ensuring that the stored data is the latest value processed by the processor cores (510, 511). When the upper part of the data memory (503) is written with new data, the lower part of the data memory (503) transfers data to the upper part of the data memory (504). During data transfer, a pointer marks the entry being transferred; when the pointer reaches the last entry, the transfer is flagged as nearly complete. When a program segment finishes running, the data should have completed its transfer to the next memory. When the next program segment runs, the upper part of the data memory (501) transfers data to the lower part of the data memory (503), the upper part of the data memory (503) transfers data to the lower part of the data memory (504), and the upper part of the data memory (504) transfers data downward, thus forming a ping-pong transfer structure. Every data memory also partitions off a portion for instruction storage according to the required instruction space, i.e., the data memory and the instruction memory are physically undivided.
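The ping-pong exchange between the two halves of a data memory can be sketched as follows; this is an assumed, much-simplified model in which one half is filled by the core above while the other is drained by the core below, the halves swapping roles at each program-segment boundary:

```python
class PingPongBuffer:
    """Minimal sketch of the ping-pong exchange of Fig. 5: two
    sub-buffers swap roles, so the previous stage fills one while
    the next stage drains the other."""
    def __init__(self):
        self.fill, self.drain = [], []

    def write(self, value):
        # previous-stage core stores data into the filling half
        self.fill.append(value)

    def read_all(self):
        # next-stage core reads data from the draining half
        return list(self.drain)

    def swap(self):
        # at a segment boundary the two halves exchange roles;
        # the consumed half is cleared for refilling
        self.fill, self.drain = self.drain, self.fill
        self.fill.clear()
```

Data written during one program segment becomes readable by the downstream core only after the swap, which matches the requirement that a segment's data must have completed its transfer before the next segment runs.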
Fig. 6 shows an embodiment of back pressure, exception handling, and the connection between data memories and shared memory. In the present embodiment, the DMA controller (616) writes the respective code fragments (615) into the instruction memories (601, 609, 610, 611). The processor cores (602, 604, 606, 608) run the code in their corresponding instruction memories (601, 609, 610, 611) and read and write the corresponding data memories (603, 605, 607, 612). Taking the processor core (604), the data memory (605) and the following-stage processor core (606) as an example: both the preceding-stage and following-stage processor cores (604, 606) have access to the data memory (605), and only after the preceding-stage processor core (604) has finished writing the data memory (605) and the following-stage processor core (606) has finished reading it can the data sub-memories in the data memory (605) perform a ping-pong exchange. The back pressure signal (614) is used by the following-stage processor core (606) to notify the data memory (605) whether its read operation is complete. The back pressure signal (613) is used by the data memory (605) to notify the preceding-stage processor core (604) whether it has overflowed, and also relays the back pressure signal passed on by the following-stage processor core (606). The preceding-stage processor core (604), according to its own running state and the back pressure signal relayed by the data memory (605), judges whether the macro pipeline is stalled, decides whether to perform a ping-pong exchange on the data sub-memories in the data memory (605), and generates a back pressure signal that continues to propagate to the stage before it. In this way, back pressure signals propagate backward from processor core to data memory to processor core, controlling the operation of the macro pipeline. All data memories (603, 605, 607, 612) are connected to the shared memory (618) through the connection (619). When the address a data memory needs to write or read lies outside its own range, an address exception occurs: the address is looked up in the shared memory (618), and once found, the data is written to that address or read from it. When the processor core (608) needs data in the data memory (605), an exception likewise occurs, and the data memory (605) transfers the data into the processor core (608) through the shared memory (618). Exception information generated by the processor cores and data memories is transferred to the exception handling module (617) through the dedicated channel (620). In the present embodiment, when an operation result in a processor core overflows, the exception handling module (617) directs the processor core to perform a saturation operation on the overflowing result; when a data memory overflows, the exception handling module (617) directs the data memory to access the shared memory and store the data there. During this process, the exception handling module (617) sends a signal to the processor core or data memory concerned, causing it to stall, and operation resumes after the exception handling is complete; through the signals relayed by back pressure, the other processor cores and data memories each decide whether to stall themselves.
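The backward propagation of stall decisions can be sketched as follows; the function name and the boolean encoding are illustrative assumptions, showing only how a downstream stall relays upstream through the back pressure chain:

```python
def pipeline_step(stages_ready, reads_done):
    """Sketch of back-pressure propagation: stage i may advance
    (ping-pong exchange its data memory) only if it has finished
    writing (stages_ready[i]) and its reader has finished reading
    (reads_done[i]); otherwise a stall relays backward."""
    n = len(stages_ready)
    stall = [False] * n
    for i in reversed(range(n)):
        downstream_stalled = stall[i + 1] if i + 1 < n else False
        # stall if this stage is not done, its reader is not done,
        # or the stage after it is stalled (back pressure relayed)
        if not stages_ready[i] or not reads_done[i] or downstream_stalled:
            stall[i] = True
    return stall
```

Note how a single unfinished middle stage stalls itself and every stage before it, while stages after it continue, which is the behavior the back pressure signals (613, 614) are described as producing.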
Referring to Fig. 7, the figure describes a self-test and self-repair method and structure. In this self-test/self-repair structure (701), the test vectors generated by the vector generator (702) are delivered synchronously to each processor core. The test vector distribution controller (703) controls the connections between each processor core and the vector generator (702), and the result distribution controller (709) controls the connections between each processor core and the comparators. Each processor core compares its operation results with those of other processor cores through comparators; in the present embodiment, each processor core can compare with its adjacent processor cores, e.g. the processor core (704) can compare with the processor cores (705, 706, 707) through the compare logic (708). In this embodiment, each compare logic can comprise one or more comparators: if a compare logic has one comparator, each processor core compares with its adjacent processor cores in turn; if a compare logic has multiple comparators, each processor core compares with its adjacent processor cores simultaneously. Test results are written from each compare logic directly into the test result table (710).
Referring to Fig. 8, Fig. 8 gives three embodiments of transferring register values between adjacent processor cores.
In the embodiment corresponding to Fig. 8(a), each processor core has a register file (801) comprising 31 32-bit general-purpose registers. To transfer all general-purpose register values from the preceding-stage processor core (802) to the current-stage processor core (803), the output of every bit of all general-purpose registers in the preceding-stage processor core (802) can be connected, through 992 hard wires and multiplexers, one-to-one to the input of the corresponding bit of all general-purpose registers in the current-stage processor core (803). When register values are transferred, the values of the 31 32-bit general-purpose registers in the preceding-stage processor core (802) can all be delivered to the current-stage processor core (803) in a single cycle. Fig. 8(a) specifically shows the hard-wired connection for one general-purpose register bit (804); the connections of the remaining 991 bits are identical to those of this bit (804). The output (806) of the corresponding bit (805) in the preceding-stage processor core (802) is connected by the hard wire (807), through the multiplexer (808), to the input of this bit (804) in the current-stage processor core (803). When the processor core performs arithmetic, logic or similar operations, the multiplexer (808) selects the data (809) originating from the current-stage core; when the processor core performs a load operation, it selects the data (809) from the current-stage core if the data exists in the local memory corresponding to the current-stage core, and otherwise selects the data (810) transferred from the preceding-stage core; when register values are transferred, the multiplexer (808) selects the data (810) transferred from the preceding-stage core. All 992 bits are transferred simultaneously, so the entire register file can be transferred in one cycle.
In the embodiment corresponding to Fig. 8(b), the adjacent processor cores (820, 822) each have a register file (821, 823) comprising a plurality of 32-bit general-purpose registers. To transfer register values from the preceding-stage processor core (820) to the current-stage processor core (822), 32 hard wires can connect the data output (829) of the register file (821) in the preceding-stage processor core (820) to an input of the multiplexer (827) connected to the data input (830) of the register file (823) in the current-stage processor core (822). The inputs of the multiplexer (827) are the data (824) from the current-stage core and the data (825) sent from the preceding-stage core over the hard wires (826). When the processor core performs arithmetic, logic or similar operations, the multiplexer (827) selects the data (824) originating from the current-stage core; when the processor core performs a load operation, it selects the data (824) from the current-stage core if the data exists in the local memory corresponding to the current-stage core, and otherwise selects the data (825) transferred from the preceding-stage core; when register values are transferred, the multiplexer (827) selects the data (825) transferred from the preceding-stage core. The addresses of the registers whose values are to be transferred are generated by the address generation modules (828, 832) belonging to the register files (821, 823) themselves and delivered to the address inputs (831, 833) of the register files (821, 823); over several cycles, the register values are transferred from the register file (821) into the register file (823) through the hard wires (826) and the multiplexer (827). In this way, with only a small amount of additional hard wiring, the transfer of all or part of the register file values can be completed over multiple cycles.
In the embodiment corresponding to Fig. 9, the adjacent processor cores (940, 942) each have a register file (941, 943) comprising a plurality of 32-bit general-purpose registers. To transfer register values from the preceding-stage processor core (940) to the current-stage processor core (942), the preceding-stage processor core (940) can first use data store instructions to write the register values in the register file (941) into the local data memory (954) corresponding to the preceding-stage processor core (940), after which the current-stage processor core (942) uses data load instructions to read the corresponding data from the local data memory (954) and write it into the corresponding registers of the register file (943). In the present embodiment, the data output (949) of the register file (941) in the preceding-stage processor core (940) is connected to the data input (948) of the local data memory (954) through 32 wires (946), and the data input (950) of the register file (943) in the current-stage processor core (942) is connected to the data output (952) of the local data memory (954) through the multiplexer (947) and 32 wires (953). The inputs of the multiplexer (947) are the data (944) from the current-stage core and the data (945) sent from the preceding-stage core over the 32 wires (953). When the processor core performs arithmetic, logic or similar operations, the multiplexer (947) selects the data (944) originating from the current-stage core; when the processor core performs a load operation, it selects the data (944) from the current-stage core if the data exists in the local memory corresponding to the current-stage core, and otherwise selects the data (945) transferred from the preceding-stage core; when register values are transferred, the multiplexer (947) selects the data (945) transferred from the preceding-stage core. In the embodiment corresponding to Fig. 8(c), the values of all registers in the register file (941) can first be written one by one into the local data memory (954) and afterwards written one by one into the register file (943); or the values of some of the registers in the register file (941) can first be written into the local data memory (954) and afterwards written into the register file (943); or, after the value of one register in the register file (941) is written into the local data memory (954), that value can immediately be written into the register file (943), the process being repeated until all register values to be transferred have been transferred.
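The store-then-load transfer through the local data memory, including the choice between transferring the whole register file at once or in smaller batches, can be sketched as follows; the function and parameter names are illustrative assumptions:

```python
def transfer_registers(src_regs, dst_regs, local_mem, batch=None):
    """Sketch of the Fig. 9 scheme: the preceding-stage core stores
    register values into the shared local data memory, and the
    current-stage core loads them back. `batch` controls how many
    registers are staged before loading (None = all at once;
    batch=1 = load each value immediately after storing it)."""
    indices = list(range(len(src_regs)))
    step = batch or len(indices)
    for start in range(0, len(indices), step):
        group = indices[start:start + step]
        for i in group:                  # data store instructions
            local_mem[i] = src_regs[i]
        for i in group:                  # data load instructions
            dst_regs[i] = local_mem[i]
    return dst_regs
```

The three orderings described for Fig. 8(c) correspond to `batch=None` (all registers staged first), an intermediate `batch`, and `batch=1` (each value forwarded immediately).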
Referring to Fig. 10, Fig. 10 gives two embodiments of connection structures formed from the processor cores of the present invention and their corresponding local memories. Those skilled in the art can make various possible replacements, adjustments and improvements to the components of these embodiments according to the technical scheme and conception of the present invention, and all such replacements, adjustments and improvements shall fall within the protection scope of the claims of the present invention.
The embodiment corresponding to Fig. 10(a) contains a processor core (1001) having a local instruction memory and a local data memory, together with the local data memory (1002) corresponding to its preceding-stage processor core. The processor core (1001) is composed of a local instruction memory (1003), a local data memory (1004), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009) and an output buffer (1010).
The local instruction memory (1003) stores the instructions needed by the processor core (1001) for execution. Within the processor core (1001), the operands needed by the execution unit (1005) come from the register file (1006) or from immediates in the instructions; execution results are written back to the register file (1006).
In the present embodiment, each local data memory has two sub-memories. Taking the local data memory (1004) as an example, the data read from the two sub-memories is selected by the multiplexers (1018, 1019) to produce the final output data (1020).
Data in the local data memories (1002, 1004), data in the write buffer (1009), or data (1011) from the external shared memory can be read into the register file (1006) by data load instructions. In the present embodiment, the data in the local data memories (1002, 1004), the data in the write buffer (1009) and the data (1011) from the external shared memory are selected by the multiplexers (1016, 1017) before being input into the register file (1006).
Data store instructions store data in the register file (1006) into the local data memory (1004) with a delay through the write buffer (1009), or into the external shared memory with a delay through the output buffer (1010). While data is being read from the local data memory (1002) into the register file (1006), the same data can be stored into the local data memory (1004) with a delay through the write buffer (1009), thereby accomplishing the LIS function of the present invention and realizing cost-free data transfer.
In the embodiment corresponding to Fig. 10(a), the data received by the write buffer (1009) has three sources: data from the register file (1006), data from the local data memory (1002) of the preceding-stage processor core, and data (1011) from the external shared memory. These three sources are selected by the multiplexer (1012) before being input into the write buffer (1009).
In the embodiment corresponding to Fig. 10(a), each local data memory receives data input only from the write buffer of the same processor core. For example, in the processor core (1001), the local data memory (1004) receives only the data input from the write buffer (1009).
In the embodiment corresponding to Fig. 10(a), the local instruction memory (1003) and the local data memories (1002, 1004) are each composed of two identical sub-memories, so that read and write operations can be performed simultaneously on different sub-memories within a local memory. This structure realizes the local data memories with ping-pong buffer exchange described in the technical scheme of the present invention. The address received by the local instruction memory (1003) is generated by the program counter (1008). The address received by the local data memory (1004) has three sources: the store-data address from the address storage part of the current-stage write buffer (1009), the read-data address from the current-stage data address generation module (1007), and the read-data address (1013) from the data address generation module of the following-stage processor core. These three addresses are selected by the multiplexers (1014, 1015) and then input respectively into the address receiving modules of the different sub-memories in the local data memory (1004).
Correspondingly, the address received by the local data memory (1002) also has three sources: the store-data address from the address storage part of its own stage's write buffer, the read-data address from its own stage's data address generation module, and the read-data address from the data address generation module (1007) of the following-stage processor core. These addresses are selected by multiplexers and then input respectively into the address receiving modules of the different sub-memories in the local data memory (1002).
Fig. 10(b) shows another connection structure composed of a processor core of the present invention and its corresponding local memories, containing a processor core (1021) having a local instruction memory and a local data memory, together with the local data memory (1022) corresponding to its preceding-stage processor core. The processor core (1021) is composed of a local instruction memory (1003), a local data memory (1024), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009) and an output buffer (1010).
The connection structure of the embodiment corresponding to Fig. 10(b) is roughly the same as that of the embodiment corresponding to Fig. 10(a); the only difference is that the local data memories (1022, 1024) in the present embodiment are each composed of a dual-port memory. A dual-port memory can support simultaneous read and write operations on two different addresses.
The address received by the local data memory (1024) has three sources: the store-data address from the address storage part of the current-stage write buffer (1009), the read-data address from the current-stage data address generation module (1007), and the read-data address (1025) from the data address generation module of the following-stage processor core. These three addresses are selected by the multiplexer (1026) and then input into the address receiving module of the local data memory (1024).
Correspondingly, the address received by the local data memory (1022) also has three sources: the store-data address from the address storage part of its own stage's write buffer, the read-data address from its own stage's data address generation module, and the read-data address from the data address generation module (1007) of the following-stage processor core. These addresses are selected by a multiplexer and then input into the address receiving module of the local data memory (1022).
Since the data load and data store instructions that access memory generally make up no more than 40% of a typical program, the dual-port memories in the embodiment corresponding to Fig. 10(b) can be replaced with single-port memories, with the instruction order statically adjusted at compile time, or dynamically adjusted at run time, so that memory-access instructions execute at the same time as instructions that do not access memory, thereby making the connection structure simpler and more efficient.
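The static reordering idea can be sketched as follows; this is a deliberately simplified illustration that ignores data dependencies (which a real compiler must respect), pairing each memory-access instruction with a non-memory instruction so the two can issue together on a single-port memory:

```python
def interleave_for_single_port(instructions):
    """Sketch of the static reordering described for single-port
    memories: pair each memory-access instruction with a non-memory
    instruction so both can issue in the same slot. Dependency
    checking is omitted for brevity."""
    mem = [i for i in instructions if i["mem"]]     # loads/stores
    alu = [i for i in instructions if not i["mem"]]  # other ops
    schedule = []
    while mem or alu:
        slot = []
        if mem:
            slot.append(mem.pop(0)["op"])
        if alu:
            slot.append(alu.pop(0)["op"])
        schedule.append(slot)
    return schedule
```

With memory instructions at 40% or less of the mix, most memory accesses find a non-memory partner, so the single-port memory rarely becomes the bottleneck.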
In the embodiment corresponding to Fig. 10(b), each local data memory is actually a dual-port memory that can support two simultaneous reads, two simultaneous writes, or one read and one write. To ensure that data is not mistakenly overwritten during execution, the method shown in Fig. 10(c) can be adopted: each address in the local data memory (1031) is given an additional valid flag bit (1032) and an ownership flag bit (1033).
In Fig. 10(c), the valid flag bit (1032) indicates the validity of the data (1034) at the corresponding address in the local data memory (1031); for example, "1" can indicate that the data (1034) at that address is valid and "0" that it is invalid. The ownership flag bit (1033) indicates which processor cores the data (1034) at that address belongs to; for example, "0" can indicate that the data (1034) is used only by the processor core (1035) corresponding to the local data memory (1031), while "1" indicates that the data (1034) is used both by the processor core (1035) corresponding to the local data memory (1031) and by its following-stage processor core (1036).
In a particular embodiment, the above definitions of the valid flag bit (1032) and the ownership flag bit (1033) can describe the attributes of each piece of data stored in the local data memory and ensure correct reading and writing.
In the embodiment corresponding to Fig. 10(c), if the valid flag bit (1032) for an address in the local data memory (1031) is "0", the data (1034) at that address is invalid, so a data store operation can, if needed, be performed directly on that address. If the valid flag bit (1032) is "1" and the ownership flag bit (1033) is "0", the data (1034) at that address is valid and is used only by the processor core (1035) corresponding to the local data memory (1031), so the current-stage processor core (1035) can, if needed, perform a data store operation directly on that address. If the valid flag bit (1032) is "1" and the ownership flag bit (1033) is "1", the data (1034) at that address is valid and will be used both by the processor core (1035) corresponding to the local data memory (1031) and by its following-stage processor core (1036); if the current-stage processor core (1035) needs to perform a data store operation on that address, it must wait until the ownership flag bit (1033) is "0". That is, the data (1034) at that address is first transferred to the corresponding position in the local data memory (1037) of the following-stage processor core (1036), and the ownership flag bit (1033) for that address in the local data memory (1031) of the current-stage processor core (1035) is set to "0"; only then can the current-stage processor core (1035) perform the data store operation on that address.
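The store-permission rules above can be sketched as follows; the `store` helper and dictionary representation are hypothetical, modeling only the three cases of the valid and ownership flag bits:

```python
def can_store(valid, ownership):
    """Fig. 10(c) rules (assumed encoding: valid=1 means data
    present, ownership=1 means the next-stage core still needs it).
    A store may proceed immediately unless the entry is valid AND
    still owned by the next stage."""
    if valid == 0:
        return True    # invalid data: store directly
    if ownership == 0:
        return True    # valid, current-stage only: store directly
    return False       # must first hand the data downstream

def store(entry, value, next_stage_mem, addr):
    """Store `value` into `entry`, first transferring the old data
    downstream and clearing ownership if the next-stage core still
    owns it (hypothetical helper)."""
    if not can_store(entry["valid"], entry["ownership"]):
        next_stage_mem[addr] = entry["data"]  # transfer downstream
        entry["ownership"] = 0                # release ownership
    entry["data"] = value
    entry["valid"] = 1
    return entry
```

Only the valid-and-owned case forces the transfer to the following-stage local data memory (1037) before the overwrite, matching the waiting condition described above.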
In the embodiment corresponding to Fig. 10(c), when the current-stage processor (1035) performs a data store operation on its corresponding local data memory (1031), the corresponding valid flag bit (1032) can be set, with the ownership flag bit determined according to whether the data (1034) will be used by the following-stage processor (1036): if so, the ownership flag bit (1033) is set, otherwise it is cleared. Alternatively, the corresponding valid flag bit (1032) and the corresponding ownership flag bit (1033) can both always be set; although this requires increasing the capacity of the local data memory (1031), it simplifies its concrete implementation structure.
Referring to Fig. 11(a), Fig. 11(a) gives the typical structure of an existing system on chip. The processor core (1101), the digital signal processor core (1102), the functional units (1103, 1104, 1105), the input/output interface control module (1106) and the storage control module (1108) are all connected to the system bus (1110). This system on chip can exchange data with peripherals (1107) through the input/output interface control module (1106), and with external memory (1109) through the storage control module (1108).
Referring to Fig. 11(b), Fig. 11(b) gives an embodiment of a system on chip realized based on the technical scheme of the present invention. In the present embodiment, the processor core and its local memories (1121), together with six other processor cores and their local memories, form the functional module (1124); the processor core and its local memories (1122), together with four other processor cores and their local memories, form the functional module (1125); and the processor core and its local memories (1123), together with two other processor cores and their local memories, form the functional module (1126). The functional modules (1124, 1125, 1126) can each correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104 or 1105), the input/output interface control module (1106) or the storage control module (1108) in the embodiment of Fig. 11(a).
Taking the functional module (1126) as an example, the processor cores and their local memories (1123, 1127, 1128, 1129) form a serially connected multi-core structure, and the four processor cores and their local memories (1123, 1127, 1128, 1129) jointly realize the function of the functional module (1126).
Data transfer between the processor core and local memories (1123) and the processor core and local memories (1127) is realized through the internal connection (1130). Similarly, data transfer between the processor core and local memories (1127) and the processor core and local memories (1128) is realized through the internal connection (1131), and data transfer between the processor core and local memories (1128) and the processor core and local memories (1129) is realized through the internal connection (1132).
The functional module (1126) is connected to the bus connection module (1138) by hard wires (1133, 1134), so that data can be transferred between the functional module (1126) and the bus connection module (1138). Similarly, data can be transferred between the functional module (1125) and the bus connection module (1139), and between the functional module (1124) and the bus connection modules (1140, 1141). Data can be transferred between the bus connection modules (1138) and (1139) through the hard wires (1135), between the bus connection modules (1139) and (1140) through the hard wires (1136), and between the bus connection modules (1140) and (1141) through the hard wires (1137). In this way, data can be transferred among the functional modules (1124, 1125, 1126); the bus connection modules (1138, 1139, 1140, 1141) and the hard wires (1135, 1136, 1137) realize the function of the system bus (1110) in Fig. 11(a) and, together with the functional modules (1124, 1125, 1126), constitute a typical system-on-chip structure.
Because the processor cores and their local memories in the configurable multi-core/many-core device proposed by the present invention are easy to scale in number, various types of systems-on-chip can easily be realized by the method of this embodiment. In addition, when the configurable multi-core/many-core device proposed by the present invention runs in real time, the structure of the system-on-chip can also be changed flexibly through real-time dynamic configuration.
Refer to Figure 11(c), which shows another embodiment of a system-on-chip realized based on the technical solution of the present invention. In this embodiment, processor core and its local memory (1151), together with six other processor cores and their local memories, form functional module (1163); processor core and its local memory (1152), together with four other processor cores and their local memories, form functional module (1164); and processor core and its local memory (1153), together with two other processor cores and their local memories, form functional module (1165). Each of the functional modules (1163, 1164, 1165) may correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104, or 1105), the input/output interface control module (1106), or the storage control module (1108) in the embodiment of Figure 11(a).
Taking functional module (1165) as an example, the processor cores and their local memories (1153, 1154, 1155, 1156) form a serially connected multi-core structure; these four processor cores and their local memories (1153, 1154, 1155, 1156) jointly implement the function of functional module (1165).
Data transfer between processor core and local memory (1153) and processor core and local memory (1154) is realized through internal connection (1160). Similarly, data transfer between processor core and local memory (1154) and processor core and local memory (1155) is realized through internal connection (1161), and data transfer between processor core and local memory (1155) and processor core and local memory (1156) is realized through internal connection (1162).
In this embodiment, the data transfer demand between functional module (1165) and functional module (1164) is met, for example, by data transfer between processor core and local memory (1156) and processor core and local memory (1166). According to the technical solution of the present invention, once processor core and its local memory (1156) need to exchange data with processor core and its local memory (1166) during operation, the configurable interconnect network configures itself automatically according to the data transfer demand and establishes the bidirectional data path (1158) between processor core and local memory (1156) and processor core and local memory (1166). Similarly, once processor core and its local memory (1166) need to transfer data unidirectionally to processor core and its local memory (1156), or processor core and its local memory (1156) need to transfer data unidirectionally to processor core and its local memory (1166), a unidirectional data path can be established by the same method.
In this embodiment, a bidirectional data path (1157) is also established between processor core and local memory (1151) and processor core and local memory (1152), as well as a bidirectional data path (1159) between processor core and local memory (1165) and processor core and local memory (1155). In this way, data can be transferred among the functional modules (1163, 1164, 1165); the bidirectional data paths (1157, 1158, 1159) realize the function of the system bus (1110) in Figure 11(a) and, together with the functional modules (1163, 1164, 1165), constitute a typical system-on-chip structure.
Depending on the application demands of the system-on-chip, there is not necessarily only one group of data paths between any two functional modules. Because the processor cores in the configurable multi-core/many-core device proposed by the present invention are easy to scale in number, various types of systems-on-chip can easily be realized by the method of this embodiment. In addition, when the configurable multi-core/many-core device proposed by the present invention runs in real time, the structure of the system-on-chip can also be changed flexibly through real-time dynamic configuration.
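The on-demand path establishment described above can be sketched as follows. This is an illustrative Python model, not the patented hardware: the class name, the set-of-pairs representation of data paths, and the `transfer` method are assumptions made for the sketch. A path between two cores is configured automatically the first time a transfer is requested, as described for data path (1158).

```python
# Illustrative model (assumed names): the configurable interconnect
# configures a data path automatically on the first transfer request.
class ConfigurableInterconnect:
    def __init__(self):
        self.paths = set()          # directed (src, dst) data paths

    def transfer(self, src, dst, data, bidirectional=True):
        # configure the path automatically on first use
        if (src, dst) not in self.paths:
            self.paths.add((src, dst))
            if bidirectional:
                self.paths.add((dst, src))
        return (dst, data)          # deliver the data to the destination

net = ConfigurableInterconnect()
net.transfer(1156, 1166, "payload")   # models setting up path (1158)
print(sorted(net.paths))              # → [(1156, 1166), (1166, 1156)]
```

Passing `bidirectional=False` models the unidirectional case, where only the single requested direction is configured.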
Figure 12 shows pre-compilation and post-compilation embodiments, where Figure 12(a) is a pre-compilation embodiment and Figure 12(b) is a post-compilation embodiment.
As shown in Figure 12(a), on the left is the original program code (1201, 1203, 1204), which contains two function calls, an A function call and a B function call; 1203 and 1204 are the code bodies of the A function and the B function, respectively. After pre-compilation expansion, the A function call and the B function call are each replaced by the corresponding function code, so that the expanded code, shown as 1202, contains no function calls.
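The pre-compilation expansion of Figure 12(a) can be sketched as follows. This is an illustrative Python model only: the instruction mnemonics (`m1`, `a1`, ...) and the dictionary-based inliner are assumptions for the sketch, not the patent's actual tool.

```python
# Illustrative pre-compilation sketch: replace each function call in
# the source with the body of the called function, so the expanded
# code (like 1202 in the figure) contains no function calls.
functions = {
    "call A()": ["a1", "a2"],         # body of function A (like 1203)
    "call B()": ["b1", "b2", "b3"],   # body of function B (like 1204)
}

def expand(code):
    expanded = []
    for line in code:
        # inline the call if the line is a known call, else keep it
        expanded.extend(functions.get(line, [line]))
    return expanded

original = ["m1", "call A()", "m2", "call B()", "m3"]
print(expand(original))
# → ['m1', 'a1', 'a2', 'm2', 'b1', 'b2', 'b3', 'm3']
```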
Figure 12(b) is a post-compilation embodiment. As shown in the figure, the original object code (1205) is the object code produced by ordinary compilation and is executed sequentially. After post-compilation segmentation, the code blocks (1206, 1207, 1208, 1209, 1210, 1211) shown in the figure are formed, and each code block is distributed to a corresponding processor core for execution. The A loop body is split into an independent code block (1207), while the B loop body, being relatively large, is divided into two code blocks, B loop body 1 (1209) and B loop body 2 (1210). These two code blocks are executed on two processor cores, which jointly complete the B loop body.
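The post-compilation segmentation can be sketched as follows. This is an illustrative Python model under assumed simplifications (the function name, a flat instruction list, and a known per-instruction cycle cost are all assumptions): the sequential object code is cut into blocks whose total cycle counts are as equal as possible, one block per processor core, matching the goal that each code segment have an equal or close execution cycle count.

```python
# Illustrative post-compilation sketch: split a sequential stream of
# (instruction, cycle_cost) pairs into near-equal-cycle code blocks,
# one per processor core of the macro pipeline.
def split_into_blocks(instructions, num_cores):
    total = sum(cost for _, cost in instructions)
    target = total / num_cores          # ideal cycles per code block
    blocks, current, acc = [], [], 0
    for ins, cost in instructions:
        current.append(ins)
        acc += cost
        # close the block once it reaches the per-core cycle budget,
        # keeping at least one block for the remaining instructions
        if acc >= target and len(blocks) < num_cores - 1:
            blocks.append(current)
            current, acc = [], 0
    blocks.append(current)
    return blocks

program = [("i%d" % n, 1) for n in range(12)]   # 12 one-cycle instructions
blocks = split_into_blocks(program, 4)
print([len(b) for b in blocks])                 # → [3, 3, 3, 3]
```

A large loop body (like the B loop body above) would simply span two of the resulting blocks, with its iterations or body split across two cores.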
Refer to Figure 13. Figure 13(a) is a schematic diagram of the configurable multi-core/many-core device of the present invention based on serial multi-issue and a pipelined stage structure; Figure 13(b) is a schematic diagram of a multi-core serial structure formed by configuration; Figure 13(c) is a schematic diagram of a multi-core serial-parallel mixed structure formed by configuration; and Figure 13(d) is a schematic diagram of multiple multi-core structures formed by configuration.
As shown in Figure 13(a), the device is composed of multiple processor cores with configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) and configurable interconnect structures (1302, 1304, 1306, 1308, 1310, 1312, 1314, 1316, 1318). In this embodiment, each processor core with its configurable local memory forms one stage of the macro pipeline. By configuring the configurable interconnect structures (e.g. 1302), the processor cores and configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) can be connected into serially connected structures. Multiple serially connected structures can be independent of one another, or partly or wholly connected to one another, running programs serially, in parallel, or in a serial-parallel mix.
As shown in Figure 13(b), by configuring the corresponding configurable interconnect structures, the multi-core serial structure in the figure is formed, where processor core and configurable local memory (1301) is the first stage of this multi-core serial structure and processor core and configurable local memory (1317) is its last stage.
As shown in Figure 13(c), by configuring the corresponding configurable interconnect structures, processor cores and configurable local memories (1301, 1303, 1305, 1313, 1315, 1317) form a serial structure while processor cores and configurable local memories (1307, 1309, 1311) form a parallel structure, finally forming a multi-core processor with a serial-parallel mixed structure.
As shown in Figure 13(d), by configuring the corresponding configurable interconnect structures, processor cores and configurable local memories (1301, 1307, 1313, 1315) form one serial structure while processor cores and configurable local memories (1303, 1309, 1305, 1311, 1317) form another serial structure, thus forming two completely independent serial structures.
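The macro pipeline of Figure 13(b), in which each core executes one code segment and passes its results downstream, can be sketched as follows. This is an illustrative Python model under stated assumptions: each stage is a Python function standing in for a core's code segment, and an attribute stands in for the configurable local memory through which results are forwarded; none of these names come from the patent.

```python
# Illustrative macro pipeline sketch: each stage models one processor
# core running one code segment and forwarding its result to the next
# stage through its (modeled) local memory.
class Stage:
    def __init__(self, segment):
        self.segment = segment
        self.local_mem = None       # stands in for the configurable local memory

    def run(self, data):
        self.local_mem = self.segment(data)
        return self.local_mem

def build_serial_pipeline(stages):
    # serially connect the stages, as the configurable interconnect would
    def pipeline(data):
        for stage in stages:
            data = stage.run(data)
        return data
    return pipeline

# three stages jointly compute ((x + 1) * 2) - 3
pipe = build_serial_pipeline([Stage(lambda x: x + 1),
                              Stage(lambda x: x * 2),
                              Stage(lambda x: x - 3)])
print(pipe(5))   # → 9
```

In the hardware, the stages run concurrently on a stream of inputs; this sequential model only shows the functional composition, not the pipelined timing.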

Claims (15)

1. A configurable multi-core structure for executing a program, comprising:
a plurality of processor cores;
a plurality of configurable local memories assisting the plurality of processor cores;
a plurality of configurable interconnect structures for serially connecting the plurality of processor cores;
characterized in that:
according to configuration information of the multi-core structure, the program is divided into a plurality of code segments corresponding to the plurality of processor cores, such that the numbers of execution cycles of the code segments are equal or close;
when a code segment contains a loop whose iteration count is greater than the iteration count allowed in that code segment, the loop is further divided into two or more groups of sub-loops, such that a code segment contains only one group of sub-loops;
in the configurable multi-core structure, each processor core executes a part of the program code segments in serial order, and all processor cores in the serially connected multi-core structure jointly realize the complete function of the program code in a pipelined manner;
the code segment supplied to each processor core is stored in the corresponding configurable local memory, which also serves as the source and destination of the data needed by the processor core during operation; and
each configurable local memory comprises an instruction memory and a configurable data memory; the configurable data memory is logically divided into two parts, wherein the upper part is read and written by the previous-stage processor core and the lower part is only read by the subsequent-stage processor core; and
each configurable local memory comprises a plurality of data sub-modules that can be accessed at the same time by the previous-stage processor core and the subsequent-stage processor core respectively;
when the previous-stage processor core and the subsequent-stage processor core both contain register files, the values of the registers in the register file of the previous-stage processor core are transferred during operation into the corresponding registers in the register file of the subsequent-stage processor core.
2. The multi-core structure according to claim 1, wherein:
each single processor core runs an internal pipeline in single-issue or multi-issue form; and
the plurality of processor cores run a macro pipeline, wherein each processor core is one stage of the macro pipeline, thereby achieving an issue width greater than that of a single processor core.
3. The multi-core structure according to claim 1, wherein:
according to configuration information of the multi-core structure, the program is divided into a plurality of code segments corresponding to the plurality of processor cores, such that the numbers of execution cycles of the code segments are equal or close; and the segmentation process for producing the code segments comprises:
a pre-compilation process for replacing each function call in the program with the actual code of the called function;
a compilation process for converting the program source code into object code; and
a post-compilation process for dividing the object code into code segments and adding guidance code into the code segments.
4. The multi-core structure according to claim 1, further comprising:
one or more extension modules; and
the extension modules comprise a shared memory for storing data overflowing from the configurable local memories and data transferred between processor cores, a direct memory access (DMA) controller for accessing the configurable local memories, or an exception handling module for handling exceptions of the processor cores and configurable local memories,
wherein each processor core comprises an execution unit and a program counter.
5. The multi-core structure according to claim 1, wherein:
each configurable local memory comprises an instruction memory and a configurable data memory, and the boundary between the instruction memory and the configurable data memory is configurable.
6. The multi-core structure according to claim 5, wherein:
the configurable data memory comprises a plurality of sub-memories, and the boundaries between the sub-memories are configurable.
7. The multi-core structure according to claim 4, wherein:
the configurable interconnect structure comprises connections between the processor cores and the configurable local memories, connections between the processor cores and the shared memory, connections between the processor cores and the direct memory access controller, connections between the configurable local memories and the shared memory, connections between the configurable local memories and the direct memory access controller, connections between the configurable local memories and external systems, and connections between the shared memory and external systems.
8. The multi-core structure according to claim 2, wherein:
the macro pipeline is controlled by back-pressure signals transferred between two adjacent macro pipeline stages, which determine whether the previous macro pipeline stage stalls together with the current stage.
9. The multi-core structure according to claim 1, wherein the processor cores are configured to have a plurality of power management modes, comprising:
a configuration-level power management mode, in which processor cores not in use are set to a low-power state;
an instruction-level power management mode, in which processor cores waiting for data accesses to complete are set to a low-power state; and
an application-level power management mode, in which processor cores whose current utilization is below a threshold are set to a low-power state.
10. The multi-core structure according to claim 1, further comprising:
a self-test device for generating test vectors and storing test results, so that a processor core can compare its results of running a test vector with those of an adjacent processor core running the same test vector, thereby judging whether the processor core operates normally,
wherein any abnormally operating processor core is marked as invalid, so that processor cores marked as invalid are not configured into the macro pipeline, thereby realizing a self-repair function.
11. The multi-core structure according to claim 1, capable of forming a system-on-chip comprising at least one said multi-core structure, the system-on-chip further comprising:
a plurality of processor cores connected in parallel, wherein the plurality of serially connected processor cores and the plurality of processor cores connected in parallel are interconnected to form a serial-parallel mixed multi-core system-on-chip.
12. The multi-core structure according to claim 1, capable of forming a system-on-chip comprising at least one said multi-core structure, the system-on-chip further comprising:
a second group of serially connected multi-core structures, wherein the operation of the serially connected processor cores therein is independent of the serially connected processor cores in the first group of multi-core structures.
13. The multi-core structure according to claim 1, capable of forming a system-on-chip comprising a plurality of functional modules based on the multi-core structure, the system-on-chip further comprising:
a plurality of bus link modules connecting the plurality of functional modules for exchanging data;
a system bus composed of a plurality of data paths between the plurality of bus link modules and of connections between the bus link modules and the functional modules,
wherein the system bus comprises preset connections between two processor cores in different functional modules; and
the functional modules comprise a dedicated functional module that is statically configured to realize a special data processing function and can be dynamically called by other functional modules through configuration.
14. A configurable multi-core structure for executing a program, comprising:
a first processor core, which runs as the first stage of a macro pipeline in the multi-core structure and executes a first code segment of the program;
a first configurable local memory for assisting the first processor core and storing the first code segment;
a second processor core, which runs as the second stage of the macro pipeline and executes a second code segment of the program, wherein the second code segment has an execution cycle count equal or close to that of the first code segment;
a second configurable local memory for assisting the second processor core and storing the second code segment; and
a plurality of configurable interconnect structures for serially connecting the first processor core and the second processor core;
characterized in that:
each configurable local memory comprises an instruction memory and a configurable data memory; the configurable data memory is logically divided into two parts, wherein the upper part is read and written by the previous-stage processor core and the lower part is only read by the subsequent-stage processor core;
each entry in the first configurable local memory and the second configurable local memory comprises a data portion, a valid flag indicating whether the data portion is valid, and an ownership flag indicating by which of the first processor core and the second processor core the data is used; and
when the second processor core reads data from an address for the first time, the second processor core reads the data from the first configurable local memory and stores the read data into the second configurable local memory, so that all subsequent accesses can be made from the second configurable local memory, thereby realizing a load-induced-store (LIS) function; and
the first configurable local memory comprises a plurality of data sub-modules that can be accessed at the same time by the first processor core and the second processor core respectively;
when the first processor core and the second processor core both contain register files, the values of the registers in the register file of the first processor core are transferred during operation into the corresponding registers in the register file of the second processor core.
15. The multi-core structure according to claim 14, wherein:
the first processor core is configured with a first read strategy, such that the first sources for inputting data into the first processor core comprise the first configurable local memory, the shared memory, and external devices;
the second processor core is configured with a second read strategy, such that the second sources for inputting data into the second processor core comprise the second configurable local memory, the first configurable local memory, the shared memory, and external devices;
the first processor core is configured with a first write strategy, such that the first destinations of data output from the first processor core comprise the first configurable local memory, the shared memory, and external devices; and
the second processor core is configured with a second write strategy, such that the second destinations of data output from the second processor core comprise the second configurable local memory, the shared memory, and external devices.
CN200910208432.0A 2008-11-28 2009-09-29 Data processing method and device Active CN101799750B (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN200910208432.0A CN101799750B (en) 2009-02-11 2009-09-29 Data processing method and device
EP09828544A EP2372530A4 (en) 2008-11-28 2009-11-30 Data processing method and device
PCT/CN2009/001346 WO2010060283A1 (en) 2008-11-28 2009-11-30 Data processing method and device
KR1020117014902A KR101275698B1 (en) 2008-11-28 2009-11-30 Data processing method and device
US13/118,360 US20110231616A1 (en) 2008-11-28 2011-05-27 Data processing method and system

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN200910046117 2009-02-11
CN2009100461172 2009-02-11
CN200910046117.2 2009-02-11
CN200910208432.0A CN101799750B (en) 2009-02-11 2009-09-29 Data processing method and device

Publications (2)

Publication Number Publication Date
CN101799750A CN101799750A (en) 2010-08-11
CN101799750B true CN101799750B (en) 2015-05-06

Family

ID=42595439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910208432.0A Active CN101799750B (en) 2008-11-28 2009-09-29 Data processing method and device

Country Status (1)

Country Link
CN (1) CN101799750B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937370B * 2010-08-16 2013-02-13 University of Science and Technology of China Method and device supporting system-level resource distribution and task scheduling on FCMP (Flexible-core Chip Microprocessor)
WO2012049728A1 (en) * 2010-10-12 2012-04-19 富士通株式会社 Simulation device, method, and program
US9552206B2 (en) * 2010-11-18 2017-01-24 Texas Instruments Incorporated Integrated circuit with control node circuitry and processing circuitry
CN102023846B * 2011-01-06 2014-06-04 National University of Defense Technology Shared front-end pipeline structure based on a chip multiprocessor system
CN102521201A * 2011-11-16 2012-06-27 Liu Dake Multi-core DSP (digital signal processor) system-on-chip and data transmission method
GB2516995B (en) * 2013-12-18 2015-08-19 Imagination Tech Ltd Task execution in a SIMD processing unit
CN105988774A * 2015-02-20 2016-10-05 Shanghai Xinhao Microelectronics Co., Ltd. Multi-issue processor system and method
CN104978235A * 2015-06-30 2015-10-14 Bai Sihong Operating frequency prediction based load balancing method
CN107038028B * 2017-03-24 2020-09-04 Power Grid Technology Research Center, China Southern Power Grid Co., Ltd. Multithreading real-time simulation method of RTDS custom element
CN108549583B * 2018-04-17 2021-05-07 Zhiyun Technology Co., Ltd. Big data processing method and device, server and readable storage medium
CN111142936B * 2018-11-02 2021-12-31 Shenzhen Intellifusion Technologies Co., Ltd. Data stream operation method, processor and computer storage medium
CN109901884B * 2019-01-17 2022-05-17 Jingwei Qili (Beijing) Technology Co., Ltd. Method and device for high-level synthesis and code stream generation of FPGA
CN110362530B * 2019-07-17 2023-02-03 University of Electronic Science and Technology of China Data chain blind signal processing method based on parallel pipeline architecture
CN110569066B * 2019-07-26 2023-08-01 Shenzhen Genew Technologies Co., Ltd. Control method for common code segment of multi-core system, intelligent terminal and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1484169A (en) * 2002-06-19 2004-03-24 Alcatel Canada Inc. Multiprocessor computing device having shared program memory
CN1608249A (en) * 2001-10-22 2005-04-20 Sun Microsystems, Inc. Multi-core multi-thread processor
EP1675015A1 (en) * 2004-12-22 2006-06-28 Galileo Avionica S.p.A. Reconfigurable multiprocessor system particularly for digital processing of radar images

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5732209A (en) * 1995-11-29 1998-03-24 Exponential Technology, Inc. Self-testing multi-processor die with internal compare points
US7793278B2 (en) * 2005-09-30 2010-09-07 Intel Corporation Systems and methods for affine-partitioning programs onto multiple processing units
US8104030B2 (en) * 2005-12-21 2012-01-24 International Business Machines Corporation Mechanism to restrict parallelization of loops

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1608249A (en) * 2001-10-22 2005-04-20 Sun Microsystems, Inc. Multi-core multi-thread processor
CN1484169A (en) * 2002-06-19 2004-03-24 Alcatel Canada Inc. Multiprocessor computing device having shared program memory
EP1675015A1 (en) * 2004-12-22 2006-06-28 Galileo Avionica S.p.A. Reconfigurable multiprocessor system particularly for digital processing of radar images

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
AsAP: An Asynchronous Array of Simple Processors; Zhiyi Yu et al.; IEEE Journal of Solid-State Circuits; 2008-03-01; vol. 43, no. 3, pp. 695-705 *
Cell multiprocessor communication network: built for speed; Michael Kistler et al.; IEEE Micro, IEEE Service Center, Los Alamitos, CA, US; 2006-05-01; pp. 10-23 *

Also Published As

Publication number Publication date
CN101799750A (en) 2010-08-11

Similar Documents

Publication Publication Date Title
CN101799750B (en) Data processing method and device
KR101275698B1 (en) Data processing method and device
Cong et al. A fully pipelined and dynamically composable architecture of CGRA
CN103150264B (en) Extension Cache Coherence protocol-based multi-level consistency simulation domain verification and test method
CN103714039B (en) universal computing digital signal processor
CN105512088B (en) A kind of restructural processor architecture and its reconstructing method
US11061742B2 (en) System, apparatus and method for barrier synchronization in a multi-threaded processor
US8176478B2 (en) Process for running programs on processors and corresponding processor system
CN102713846B (en) The generation method of the executable code of processor and storage area management method
CN105408859A (en) Method and system for instruction scheduling
CN104699631A (en) Storage device and fetching method for multilayered cooperation and sharing in GPDSP (General-Purpose Digital Signal Processor)
CN103562866A (en) Register file segments for supporting code block execution by using virtual cores instantiated by partitionable engines
CN103635875A (en) Memory fragments for supporting code block execution by using virtual cores instantiated by partitionable engines
CN103547993A (en) Executing instruction sequence code blocks by using virtual cores instantiated by partitionable engines
US20120110303A1 (en) Method for Process Synchronization of Embedded Applications in Multi-Core Systems
CN103744644A (en) Quad-core processor system built in quad-core structure and data switching method thereof
BR102020019649A2 (en) apparatus and method for adaptively scheduling work on heterogeneous processing resources
US20120096292A1 (en) Method, system and apparatus for multi-level processing
CN108604107A (en) Processor, the method and system of maximum clock frequency are adjusted for type based on instruction
CN101281513A (en) Stream processor IP core based on Avalon
CN101727435B (en) Very-long instruction word processor
US20160092182A1 (en) Methods and systems for optimizing execution of a program in a parallel processing environment
CN101727513A (en) Method for designing and optimizing very-long instruction word processor
Leidel et al. CHOMP: a framework and instruction set for latency tolerant, massively multithreaded processors
CN101236576B (en) Interconnecting model suitable for heterogeneous reconfigurable processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 201203 501, No. 14, Lane 328, Yuqing Road, Pudong New Area, Shanghai

Patentee after: SHANGHAI XINHAO MICROELECTRONICS Co.,Ltd.

Address before: 200092, B, block 1398, Siping Road, Shanghai, Yangpu District 1202

Patentee before: SHANGHAI XINHAO MICROELECTRONICS Co.,Ltd.

CP02 Change in the address of a patent holder