CN101799750A

CN101799750A - Data processing method and device

Info

Publication number: CN101799750A
Application number: CN200910208432A
Authority: CN
Inventors: 林正浩; 任浩琪; 王静
Original assignee: Shanghai Xinhao Bravechips Micro Electronics Co Ltd
Current assignee: Shanghai Xinhao Bravechips Micro Electronics Co Ltd
Priority date: 2009-02-11
Filing date: 2009-09-29
Publication date: 2010-08-11
Anticipated expiration: 2029-09-29
Also published as: CN101799750B

Abstract

The invention relates to a data processing method and device. Program code to be run on a serially connected multiprocessor core structure is partitioned according to specific rules, so that the structure forms a serial multi-issue and pipeline hierarchy architecture, and the time each core needs to run its code fragment obtained by partitioning is as nearly equal as possible, thereby balancing the workload among the cores.

Description

A method and apparatus for data processing

Technical field

The present invention relates to the field of integrated circuit (IC) design.

Background technology

According to Moore's Law, transistor feature sizes are shrinking steadily along the 65 nm, 45 nm, 32 nm, ... route, and the number of transistors integrated on a single chip has exceeded one billion. However, since synthesis and place-and-route tools were introduced in the 1980s and liberated back-end design productivity, EDA tools have seen no qualitative breakthrough in more than 20 years, so front-end design, and verification in particular, finds it increasingly difficult to cope with the ever-growing scale of a single chip. Design companies have therefore turned their attention to multi-core, that is, integrating multiple relatively simple cores in one chip, which reduces design and verification difficulty while improving chip functionality.

A traditional multi-core processor integrates multiple processor cores that execute programs in parallel in order to improve chip performance. Making full use of its resources requires a parallel programming mindset. However, the way operating systems allocate and manage resources has not changed in essence; most distribute resources evenly in a symmetric manner. Although multiple processor cores can operate in parallel, a single program thread is serial by nature and therefore cannot be truly pipelined on a traditional multi-core structure. Moreover, current software still contains a large number of programs that must execute serially and cannot be partitioned well. Consequently, once the number of processor cores reaches a certain point, performance no longer improves as cores are added. In addition, as semiconductor manufacturing processes keep advancing, the internal operating frequency of multi-core processors has become much higher than that of their external memory, and simultaneous memory access by multiple cores has become a major bottleneck for system performance. Running serial programs on a parallel multi-core structure therefore does not deliver the expected performance gain.

Summary of the invention

To address the deficiencies of the prior art, the present invention proposes a data processing method and apparatus for running serial programs at high speed and improving throughput.

The data processing method and apparatus of the present invention comprise: partitioning the program code that runs on a serially connected multiprocessor core structure according to specific rules, so that the time each core in the structure needs to run its code fragment after partitioning is as nearly equal as possible, thereby achieving load balancing of the inter-core workload. The serially connected multiprocessor core structure comprises a plurality of processor cores. Here a processor means hardware that performs operations and reads and writes data by executing instructions, including but not limited to a central processing unit (CPU) and a digital signal processor (DSP).

The serially connected multiprocessor core structure of the present invention forms a serial multi-issue structure: any core in the structure can issue one or more instructions per unit time, and a plurality of serially connected cores together form a larger-scale multi-issue, namely serial multi-issue.

The serially connected multiprocessor core structure of the present invention forms a pipeline hierarchy. The internal pipeline of any core in the structure is the first level; the macro pipeline in which each core serves as one macro pipeline stage is the second level; and by analogy still higher levels can be obtained, for example a third level in which an entire serially connected multiprocessor core structure serves as one higher-level pipeline stage.

The code fragments run on the cores of the serially connected multiprocessor core structure are produced, as one or more fragments, by some or all of three steps: pre-compilation (pre-compile), compilation (compile) and post-compilation (post-compile). The program code includes but is not limited to high-level language code and assembly language code.

The compilation is compilation in the ordinary existing sense, from program source code to object code.

The pre-compilation preprocesses the program source code before the compilation is carried out. It includes but is not limited to expanding the calls in the program before compilation, replacing each call statement with the code actually called, so as to form program code without calls. The calls include but are not limited to function calls.

The post-compilation divides the object code obtained by the compilation into one or more code fragments according to the work allocation and load requirements of each core in the serially connected multiprocessor core structure. Its steps include but are not limited to:

(a) parsing the executable program code to generate a front-end code stream;

(b) running on a specific model and scanning the front-end code stream to analyze, as required, information such as the execution cycles needed, whether jumps occur and the jump addresses, then collecting statistics on the scan results to determine the partition information indirectly; or, without scanning the front-end code stream, determining the partition information directly from preset information. The specific model includes but is not limited to a behavioral model of the cores in the serially connected multiprocessor core structure;

(c) partitioning the executable program instruction code according to the partition information to generate the code fragment corresponding to each processor core in the serially connected multiprocessor core structure.

The pre-compilation method of the present invention can be carried out before the program source code is compiled, can be carried out during compilation as a component of the compiler, or can be carried out in real time while the serially connected multiprocessor core structure is running, as a component of its operating system, as a driver, or as an application program.

The post-compilation method of the present invention can be carried out after compilation of the program source code is complete, can be carried out during compilation as a component of the compiler, or can be carried out in real time while the serially connected multiprocessor core structure is running, in forms including but not limited to an operating system component, a driver or an application program of that structure. When the post-compilation method is carried out in real time, the corresponding configuration information in the code fragments can be determined manually, can be generated dynamically and automatically according to how the serially connected multiprocessor core structure is being used, or can simply be fixed configuration information.

Through this partitioning, existing application programs can be split into segments that execute simultaneously. This not only increases the running speed of existing programs on a multi-core/many-core device and makes full use of the device's efficiency, but also ensures the device's compatibility with existing applications. It effectively resolves the predicament that existing applications cannot take full advantage of multi-core/many-core processors.

In the post-compilation method of the present invention, the basis for determining the partition information indirectly includes but is not limited to the number of cycles or the time needed to execute the instructions, and the number of instructions. That is, the whole executable program code can be divided into code fragments of identical or similar running time according to the execution cycle counts or times obtained by scanning the front-end code stream, or into code fragments with identical or similar instruction counts according to the instruction count obtained by that scan. The basis for determining the partition information directly includes but is not limited to the number of instructions; that is, the whole executable program code can be divided directly into code fragments with identical or similar instruction counts.
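The indirect method can be illustrated with a short Python sketch that divides a scanned instruction stream into fragments with nearly equal cycle totals. This is only a sketch under stated assumptions, not the partitioning rule of the invention: the names `split_by_cycles` and `segment_loads`, the per-instruction cycle list and the greedy boundary rule are all hypothetical, and jumps and loops are ignored.

```python
from typing import List


def split_by_cycles(cycles: List[int], num_cores: int) -> List[int]:
    """Return the start index of each fragment (hypothetical helper).

    cycles[i] is the assumed execution cycle count of instruction i, as it
    might be obtained by scanning the front-end code stream on a model.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    total = sum(cycles)
    starts = [0]
    running = 0
    for i, c in enumerate(cycles):
        running += c
        # Place the k-th boundary once the cumulative cycle count reaches
        # k/num_cores of the total (greedy rule, an assumption of this sketch).
        if (len(starts) < num_cores
                and i + 1 < len(cycles)
                and running * num_cores >= total * len(starts)):
            starts.append(i + 1)
    return starts


def segment_loads(cycles: List[int], starts: List[int]) -> List[int]:
    """Total cycles of each fragment, useful for checking the balance."""
    bounds = starts + [len(cycles)]
    return [sum(cycles[a:b]) for a, b in zip(bounds, bounds[1:])]
```

Passing a list of ones as `cycles` gives the instruction-count variant of the same split.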

In the post-compilation method of the present invention, partitioning the executable program code according to the specific rules avoids splitting loop code as far as possible. When splitting loop code cannot be avoided, the loop code is split once or several times according to specific rules to form a plurality of smaller loops. These smaller loops can be components of the same code fragment or of different code fragments. A smaller loop includes but is not limited to a loop containing less code and a loop requiring fewer execution cycles.
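One possible way to form smaller loops is to split the iteration range of a counted loop into consecutive sub-ranges, one per code fragment. The Python sketch below assumes the trip count is known and that the loop state is handed from one core to the next; `split_loop` and `run_chain` are hypothetical names, and the text does not say that the invention splits loops this way.

```python
from typing import Callable, List


def split_loop(start: int, stop: int, parts: int) -> List[range]:
    """Split the iteration range [start, stop) into consecutive sub-ranges."""
    if parts < 1 or stop < start:
        raise ValueError("need parts >= 1 and stop >= start")
    n = stop - start
    return [range(start + k * n // parts, start + (k + 1) * n // parts)
            for k in range(parts)]


def run_chain(sub_ranges: List[range],
              body: Callable[[int, int], int],
              state: int) -> int:
    """Run the smaller loops one after another, as a chain of cores would.

    The state carried between sub-ranges stands in for the values passed
    from one core to the next.
    """
    for sub in sub_ranges:
        for i in sub:
            state = body(state, i)
    return state
```

Because the sub-ranges are consecutive and the state is carried along, the chain computes the same result as the original loop.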

In the post-compilation method of the present invention, the code fragments include but are not limited to: segmented executable object code and/or corresponding configuration information suitable for running on a serially connected multiprocessor core structure with a fixed number of processor cores; and unsegmented executable object code, together with corresponding configuration information containing multiple kinds of segmentation information, suitable for running on a structure whose number of cores is not fixed. The segmentation information includes but is not limited to numbers representing the instruction count of each segment, special flags marking segment boundaries, and a table indicating where each code fragment starts.

For example, in a device with 1000 processor cores, a table with 1000 entries can be generated for the maximum processor count of 1000. Each entry stores the position of the corresponding instruction in the unsegmented executable object code, and the instructions between two adjacent entries form a code fragment that can run on a single corresponding core. If all 1000 processor cores are used at run time, each processor core runs the code between the two positions in the unsegmented executable object code pointed to by its entries in the table, that is, each core runs the one code segment corresponding to it in the table. If only N processor cores (N < 1000) are used at run time, each processor core runs 1000/N of the code segments in the table, and the specific code can be determined from the corresponding position information in the table.
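The table lookup in this example can be sketched in Python as follows. The function name `segments_for_core` is hypothetical, and the handling of an N that does not divide 1000 evenly (segment counts differing by at most one) is an assumption of this sketch; the text only gives 1000/N.

```python
def segments_for_core(core_index: int, cores_in_use: int,
                      table_size: int = 1000) -> range:
    """Indices of the table entries (segments) run by one core.

    table_size is the number of entries generated for the maximum core count.
    """
    if not 0 <= core_index < cores_in_use <= table_size:
        raise ValueError("need 0 <= core_index < cores_in_use <= table_size")
    # Integer arithmetic spreads any remainder, so per-core segment counts
    # differ by at most one (assumption; the text only states 1000/N).
    first = core_index * table_size // cores_in_use
    last = (core_index + 1) * table_size // cores_in_use
    return range(first, last)
```

With all 1000 cores in use each core gets exactly one segment; with 4 cores in use, the last core gets segments 750 through 999.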

The code fragment of instructions run on each processor core after the partitioning can also contain extra instructions. The extra instructions include but are not limited to a code fragment header extension and a code fragment tail extension, which are used to achieve a smooth transition of instruction execution between different processor cores. For example, a tail extension can be added at the end of each code fragment to store all values in the register file to specific locations in data memory, and a header extension can be added at the beginning of each code fragment to read the values at those locations in data memory back into the register file. This transfers register values between different processor cores and guarantees correct operation of the program. When execution reaches the end of a code fragment, the next instruction executed is the first instruction of that code fragment.
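The header and tail extensions can be illustrated with a toy Python model. Everything here is an assumption made for illustration: the three-field instruction tuples, the `load`/`store`/`addi` operations, the register count `NUM_REGS` and the hand-off address `SPILL_BASE` are hypothetical and do not describe the instruction set of the device.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, int, int]

SPILL_BASE = 0x100  # hypothetical data-memory location reserved for hand-off
NUM_REGS = 4        # hypothetical register-file size


def wrap_fragment(body: List[Instr]) -> List[Instr]:
    """Add a header extension (load registers) and a tail extension (store)."""
    header = [("load", r, SPILL_BASE + r) for r in range(NUM_REGS)]
    tail = [("store", r, SPILL_BASE + r) for r in range(NUM_REGS)]
    return header + list(body) + tail


def run(fragment: List[Instr], memory: Dict[int, int]) -> List[int]:
    """Execute one fragment on a fresh core that shares `memory`."""
    regs = [0] * NUM_REGS
    for op, a, b in fragment:
        if op == "load":      # regs[a] <- memory[b]
            regs[a] = memory.get(b, 0)
        elif op == "store":   # memory[b] <- regs[a]
            memory[b] = regs[a]
        elif op == "addi":    # regs[a] <- regs[a] + b
            regs[a] += b
        else:
            raise ValueError(f"unknown op {op!r}")
    return regs
```

Running two wrapped fragments in sequence against the same `memory` dictionary lets the second core start from the register values the first core ended with.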

With the data processing method and apparatus of the present invention, a configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy can be constructed. It comprises a plurality of processor cores (processor core), a plurality of configurable local memories (configurable local memory) and a configurable interconnect structure (configurable interconnect structure), wherein:

The processor cores execute instructions and perform operations to obtain the corresponding results;

The configurable local memories store instructions and are used for transferring data between the processor cores and for holding data;

The configurable interconnect structure connects the modules inside the configurable multi-core/many-core device with one another and with the outside.

The configurable multi-core/many-core device can also include expansion modules to suit a wider range of needs. The expansion modules include but are not limited to one or more of some or all of the following modules:

A shared memory (shared memory), used to hold data when a configurable data memory overflows and to transfer data shared among multiple processor cores;

A direct memory access (DMA) controller, used for direct access to the configurable local memories by modules other than the processor cores;

An exception handling (exception handling) module, used to handle exceptions (exception) arising in the processor cores and local memories.

In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, a processor core comprises an arithmetic unit and a program counter, and can also include expansion modules to suit a wider range of needs; the expansion modules include but are not limited to a register file. The instructions executed by the processor core include but are not limited to arithmetic instructions, logic instructions, conditional test and jump instructions, and exception trap and return instructions. The arithmetic and logic instructions include but are not limited to multiply, add/subtract, multiply-add/subtract, accumulate, shift, extract and swap operations, and cover fixed-point and floating-point operations of any bit width less than or equal to the data width of the processor core. Each processor core completes one or more of the instructions. The number of processor cores can be scaled according to the needs of the application.

In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, each processor core has a corresponding configurable local memory, comprising an instruction memory (instruction memory) for holding the code fragment obtained by partitioning and a configurable data memory (configurable data memory) for holding data.

Within one configurable local memory, the boundary between the instruction memory and the configurable data memory can be changed according to different configuration information. Once the size and boundary of the configurable data memory are determined from the configuration information, the configurable data memory comprises a number of data sub-memories.

Within one configurable data memory, the boundaries between the data sub-memories can be changed according to different configuration information. A data sub-memory can be mapped onto the entire address space of the multi-core/many-core device through address translation. The mapping includes but is not limited to address translation by table lookup and address translation by content-addressable memory (CAM) matching.

Each entry (entry) in a data sub-memory contains data and flag information; the flag information includes but is not limited to a valid bit (valid bit) and a data address. The valid bit indicates whether the data stored in the entry is valid. The data address indicates where in the entire address space of the multi-core/many-core device the data stored in the entry belongs.
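The entry layout described above (data plus a valid bit and a data address) can be modelled with a small Python class. This is a behavioral sketch only: the names `Entry` and `DataSubMemory` are hypothetical, CAM matching is imitated by a linear search, and the replacement behavior (use the first invalid entry, report failure when full) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    valid: bool = False  # valid bit
    address: int = 0     # position in the device-wide address space
    data: int = 0


class DataSubMemory:
    def __init__(self, size: int) -> None:
        self.entries: List[Entry] = [Entry() for _ in range(size)]

    def _find(self, address: int) -> Optional[Entry]:
        # Stand-in for CAM matching: compare the address of every valid entry.
        return next((e for e in self.entries
                     if e.valid and e.address == address), None)

    def write(self, address: int, data: int) -> bool:
        """Store data; return False when no entry is free (overflow)."""
        slot = self._find(address)
        if slot is None:
            slot = next((e for e in self.entries if not e.valid), None)
        if slot is None:
            return False
        slot.valid, slot.address, slot.data = True, address, data
        return True

    def read(self, address: int) -> Optional[int]:
        slot = self._find(address)
        return None if slot is None else slot.data

    def invalidate_all(self) -> None:
        for e in self.entries:
            e.valid = False
```

A failed `write` corresponds to the overflow case in which, when a shared memory is present, the data would go to the shared memory instead.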

In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, the configurable interconnect structure is configured to connect the modules inside the configurable multi-core/many-core device with one another and with the outside. The connections include but are not limited to: processor core to adjacent configurable local memory, processor core to shared memory, processor core to direct memory access controller, configurable local memory to shared memory, configurable local memory to direct memory access controller, configurable local memory to the outside of the device, and shared memory to the outside of the device.

Depending on the configuration, two processor cores and their local memories can form a preceding-stage/following-stage connection, including but not limited to the preceding-stage processor core passing data to the following-stage processor core through its corresponding configurable data memory.

According to the requirements of the application program, some or all of the processor cores and their local memories can be configured, through the configurable interconnect structure, into one or more serially connected structures. Multiple serially connected structures can each be independent, or can be partly or fully interconnected, executing instructions serially, in parallel, or in a serial-parallel mix. Executing instructions serially, in parallel or in a serial-parallel mix includes but is not limited to: different serially connected structures, under the control of a synchronization mechanism, running different program segments and executing different instructions in parallel as the application requires, in a multi-threaded parallel manner; and different serially connected structures, under the control of a synchronization mechanism, running the same program segment as the application requires, performing compute-intensive operations with the same instructions on different data in single-instruction multiple-data (SIMD) fashion.

In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, the processor cores in a serially connected structure follow specific data read rules (read policy) and write rules (write policy).

Under the data read rules, the input data sources of the first processor core in a serially connected structure include but are not limited to its own configurable data memory, the shared memory, and the outside of the configurable multi-core/many-core device. The input data sources of any other processor core include but are not limited to its own configurable data memory and the configurable data memory of the preceding-stage processor core. Correspondingly, the destinations of the output data of any processor core include but are not limited to its own configurable data memory and the shared memory; when an extended memory exists, the destination of the output data of any processor core can also be the extended memory.

Under the data write rules, the input data sources of the configurable data memory of the first processor core in a serially connected structure include but are not limited to the processor core itself, the shared memory, and the outside of the configurable multi-core/many-core device. The input data sources of the configurable data memory of any other processor core include but are not limited to the processor core itself, the configurable data memory of the preceding-stage processor core, and the shared memory. The input data from the different sources of a processor core and of its configurable data memory is multiplexed according to specific rules to determine the final input data.

One configurable data memory can be accessed simultaneously by the two processor cores before and after it, with each processor core accessing a different data sub-memory in that configurable data memory. The processor cores access the different data sub-memories of the same configurable data memory according to specific rules. The specific rules include but are not limited to the following: different data sub-memories in the same configurable data memory act as ping-pong buffers (ping-pong buffer) for each other and are accessed by the two processor cores respectively; after both the preceding-stage and following-stage processor cores have finished accessing the ping-pong buffers, the buffers are swapped, so that the data sub-memory previously read and written by the preceding-stage processor core becomes the one read by the following-stage processor core, while all valid bits in the data sub-memory previously read by the following-stage processor core are set to invalid and it becomes the one read and written by the preceding-stage processor core.
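The ping-pong exchange can be sketched as a small Python class. The class name `PingPongMemory`, the `finish` method and the use of dictionaries as data sub-memories are hypothetical; clearing a dictionary stands in for setting all valid bits to invalid.

```python
from typing import Dict


class PingPongMemory:
    """Two data sub-memories shared by a preceding and a following core."""

    def __init__(self) -> None:
        self.write_side: Dict[int, int] = {}  # read/written by preceding core
        self.read_side: Dict[int, int] = {}   # read by following core
        self._done = {"preceding": False, "following": False}

    def finish(self, who: str) -> bool:
        """Mark one core as finished; swap once both are. Returns True on swap."""
        if who not in self._done:
            raise ValueError("who must be 'preceding' or 'following'")
        self._done[who] = True
        if all(self._done.values()):
            self._swap()
            return True
        return False

    def _swap(self) -> None:
        # Invalidate everything the following core has already consumed ...
        self.read_side.clear()
        # ... then exchange the roles of the two sub-memories.
        self.write_side, self.read_side = self.read_side, self.write_side
        self._done = {"preceding": False, "following": False}
```

The swap happens only after both cores report completion, so neither core ever sees a buffer that the other is still using.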

When the processor cores in the multi-core/many-core system include a register file, a specific register value transfer rule is also needed. Under this rule, one or more register values in any preceding-stage processor core of a serially connected structure can be transferred to the corresponding registers of any following-stage processor core. The register values include but are not limited to the values of the registers in the register file of the processor core. The transfer paths for the register values include but are not limited to: transfer through the configurable interconnect structure, direct transfer through the shared memory, direct transfer through the configurable data memory of the processor core, transfer through the shared memory by specific instructions, and transfer through the configurable data memory of the processor core by specific instructions.

In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, a macro pipeline stage at the second level of the pipeline hierarchy can pass its information to the preceding macro pipeline stage through back pressure (backpressure). From the back-pressure information it receives, the preceding macro pipeline stage learns whether the following macro pipeline stage is stalled (stall); combining this with its own state, it determines whether it stalls itself and passes new back-pressure information on to the macro pipeline stage before it. Macro pipeline control is achieved in this way.
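The propagation of back pressure along the macro pipeline can be sketched in Python as below. The function name `propagate_backpressure` is hypothetical, and the rule that a stalled following stage always stalls the stage before it (no buffering slack between stages) is a simplifying assumption; the text only says that each stage combines the received information with its own state.

```python
from typing import List


def propagate_backpressure(local_stall: List[bool]) -> List[bool]:
    """Effective stall state of every macro pipeline stage.

    local_stall[i] is True when stage i cannot proceed for its own reasons.
    Stage 0 is the first stage; back pressure travels from last to first.
    """
    stalled = [False] * len(local_stall)
    downstream_stalled = False
    for i in reversed(range(len(local_stall))):
        # Combine the back pressure received with the stage's own state.
        stalled[i] = local_stall[i] or downstream_stalled
        # This becomes the new back pressure sent to the stage before it.
        downstream_stalled = stalled[i]
    return stalled
```

A stall in the middle of the chain therefore holds every earlier stage, while later stages keep running.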

In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, a shared memory can be added as an extension, used to store data when the configurable data memory of a processor core overflows and to transfer data shared among multiple processor cores; an exception handling (exception handling) module can also be added as an extension, used to handle exceptions (exception) arising in the processor cores and local memories.

When the multi-core/many-core device has a shared memory and an overflow occurs while storing data to a configurable data memory, an exception is raised and the data being stored is stored in the shared memory. In this case, the flag information contained in each entry (entry) of the data sub-memory includes but is not limited to a valid bit, a data address and a data tag (tag). The valid bit indicates whether the data stored in the entry is valid. The data address and the data tag (tag) together indicate where in the entire address space of the multi-core/many-core device the data stored in the entry belongs.

The exception information produced by all processor cores is passed to the exception handling module, which handles it accordingly. The exception handling module can be formed by a processor core in the multi-core/many-core device or can be an additional module. The exception information includes but is not limited to the number of the processor in which the exception occurred and the exception type. The handling applied to the processor core and/or local memory in which the exception occurred includes but is not limited to passing the information on whether to stall the pipeline to each processor core in the serially connected structure by means of the back-pressure signal.

In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, the processor cores, the configurable local memories and the configurable interconnect structure can be configured according to the requirements of the application program. The configuration includes but is not limited to turning processor cores on or off, configuring the size/boundary and contents of the instruction memory and data sub-memories in the local memories, and configuring the interconnect structure and connection relationships.

The sources of the configuration information used for the configuration include but are not limited to the inside and the outside of the configurable multi-core/many-core device. The configuration can be adjusted at any time during operation according to the requirements of the application program. The methods of configuration include but are not limited to direct configuration by a processor core or a central processing unit core, configuration by a processor core or a central processing unit core through the direct memory access controller, and configuration by an external request through the direct memory access controller.

The configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention has low-power techniques at three levels: the configuration level, the instruction level and the application level.

At the configuration level, processor cores that are not used, according to the configuration information, can enter a low-power state. The low-power state includes but is not limited to reducing the processor clock frequency or cutting off the power supply.

At the instruction level, when a processor core executes an instruction that reads data and the data is not yet ready, the processor core enters a low-power state; once the data is ready, the processor core returns from the low-power state to the normal operating state. The data not being ready includes but is not limited to the preceding-stage processor core not yet having written the data needed by the current-stage processor core into the corresponding data sub-memory. The low-power state includes but is not limited to reducing the processor clock frequency or cutting off the power supply.

At the application level, the technique is implemented entirely in hardware: the utilization (utilization) of the current processor core is determined by matching the features of the idle (idle) task, and whether to enter the low-power state or to return from it is decided from the current processor utilization and a baseline utilization. The baseline utilization can be fixed, or can be reconfigurable or determined by self-learning; it can be hard-wired in the chip, written by the device when the device starts, or written by software. The reference content used for matching can be hard-wired in the chip at production time, written by the device or by software when the device starts, or written by self-learning. Its storage medium includes but is not limited to volatile memory and non-volatile memory; its write mode includes but is not limited to write-once and rewritable. The low-power state includes but is not limited to reducing the processor clock frequency or cutting off the power supply.

The configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention can have a self-test capability: when powered on, it can test the chip itself without relying on external equipment.

When the multi-core/many-core device has a self-test capability, one or more specific basic elements, arithmetic units or processor cores in the device can be used as comparators. Stimuli having a specific relationship are applied to multiple groups of other corresponding basic elements, arithmetic units or processor cores in the device, or to combinations of them, and the comparators check whether the outputs of those groups or combinations satisfy the corresponding specific relationship. The stimuli can come from a specific module in the multi-core/many-core device or from outside the device. The specific relationship includes but is not limited to equal, opposite, reciprocal and complementary. The test results can be sent outside the multi-core/many-core device or kept in a memory inside the device.
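The comparator-based self-test can be modelled in Python as follows. This is a behavioral sketch under assumptions: units under test are represented as plain functions, they are grouped in pairs that receive identical stimuli, and the names `self_test`, `Unit` and `related` are hypothetical. Note that a failing pair only shows that the two units disagree, not which of them is faulty.

```python
from typing import Callable, Iterable, List, Tuple

Unit = Callable[[int], int]


def self_test(pairs: List[Tuple[Unit, Unit]],
              stimuli: Iterable[int],
              related: Callable[[int, int], bool] = lambda a, b: a == b
              ) -> List[int]:
    """Return the indices of the pairs whose outputs break the relationship.

    `related` plays the role of the comparator; the default checks equality,
    and other relationships (for example complementary) can be passed in.
    """
    stimuli = list(stimuli)  # allow the same stimuli to be reused per pair
    failing = []
    for index, (left, right) in enumerate(pairs):
        if not all(related(left(x), right(x)) for x in stimuli):
            failing.append(index)
    return failing
```

The returned list corresponds to the test result that can be sent outside the device or kept in an internal memory.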

The self-test may be performed at wafer test, at integrated-circuit test after packaging, or when the device starts while the chip is in use. Self-test conditions and periods may also be set manually so that self-test is carried out periodically during operation. The memory used by the self-test includes, but is not limited to, volatile memory and non-volatile memory.

When the multi-core/many-core device has self-test capability, it may also have self-repair capability. When the test results are stored in a memory inside the device, failed processor cores can be marked; when the device is configured, the failed processor cores can be bypassed according to these marks, so that the device still works normally and self-repair is achieved. Self-repair may be performed after wafer test, after integrated-circuit test following packaging, or after the test run when the device starts while the chip is in use. Self-test and self-repair conditions and periods may also be set manually so that self-repair is carried out after periodic self-test during operation.
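The bypass step can be illustrated with a short sketch. It assumes that cores are identified by index, that the stored test results reduce to a set of failed indices, and that a configuration needs a given number of serially connected stages; `configure_chain` is a hypothetical name.

```python
def configure_chain(core_count, failed_cores, stages_needed):
    """Pick working cores, in physical order, for a serial chain.

    Cores marked as failed by the self-test are skipped, so the chain is
    built only from healthy cores.
    """
    healthy = [core for core in range(core_count) if core not in failed_cores]
    if len(healthy) < stages_needed:
        raise ValueError("not enough working cores for this configuration")
    return healthy[:stages_needed]


# Hypothetical device with six cores where self-test marked core 2 as failed.
chain = configure_chain(core_count=6, failed_cores={2}, stages_needed=4)
```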

The processor cores in the configurable multi-core/many-core device of the present invention may be homogeneous or heterogeneous.

In the configurable multi-core/many-core device of the present invention, the length of the instruction words in the local instruction memory need not be fixed.

The local instruction memory and the local data memory in the configurable multi-core/many-core device of the present invention may each have one or more groups of read ports.

In the configurable multi-core/many-core device of the present invention, each processor core may also correspond to multiple local instruction memories, which may be of the same size or of different sizes, and of the same structure or of different structures. While one or more of these local instruction memories respond to instruction fetches from the corresponding processor core, the other local instruction memories may have their instructions updated. The ways of updating instructions include, but are not limited to, updating through a direct memory access (DMA) controller.

The processor cores in the configurable multi-core/many-core device of the present invention may operate at the same clock frequency or at different clock frequencies.

The configurable multi-core/many-core device of the present invention may have a load-induced store (LIS) feature. When a processor core reads the data at a given address for the first time, the data is read from the local data memory of the adjacent previous-stage processor core and, at the same time, written into the local data memory of the current-stage processor core. All subsequent reads and writes of that address access the local data memory of the current stage. In this way, data at the same address is passed between the local data memories of adjacent stages without additional overhead.
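The load-induced store behavior can be modeled in a few lines. The sketch assumes each stage's local data memory is a dictionary keyed by address; the class `StageMemory` and the counter `fetched_from_previous` are hypothetical and only illustrate the access pattern.

```python
class StageMemory:
    """Local data memory of one pipeline stage with load-induced store."""

    def __init__(self, previous=None, initial=None):
        self.previous = previous          # previous stage's memory, if any
        self.data = dict(initial or {})   # address -> value
        self.fetched_from_previous = 0

    def read(self, addr):
        if addr not in self.data:
            if self.previous is None:
                raise KeyError(addr)
            # First read: take the value from the previous stage and, in
            # the same step, store it in this stage's own memory.
            self.data[addr] = self.previous.data[addr]
            self.fetched_from_previous += 1
        return self.data[addr]

    def write(self, addr, value):
        # Writes always go to this stage's own memory.
        self.data[addr] = value


stage0 = StageMemory(initial={0x10: 5})
stage1 = StageMemory(previous=stage0)
first = stage1.read(0x10)    # fetched from stage0 and stored locally
stage0.write(0x10, 9)        # later change in the previous stage
second = stage1.read(0x10)   # served from stage1's own copy
```

After the first read, the current stage no longer touches the previous stage for that address, which is the property the description relies on.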

The configurable multi-core/many-core device of the present invention may have a data pre-forwarding feature. A processor core may read, from the local data memory of the previous-stage processor core, data that it does not itself need to read or write but that a subsequent processor core needs to read, and write that data into its own local data memory. Data at the same address is thus passed stage by stage through the local data memories of successive stages.

The local data memory of the present invention may further include one or more valid flags and one or more ownership flags. A valid flag indicates whether the corresponding data is valid. An ownership flag indicates which processor core is currently using the corresponding data. Using valid flags and ownership flags avoids the need for ping-pong buffers and improves memory utilization; in addition, multiple processor cores can access the same data memory at the same time, which facilitates data exchange.
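One possible use of the two flags is sketched below. The handshake shown (the producer writes only when the entry is not valid, the consumer reads only when the entry is valid and owned by it) is an assumption for illustration; the description states what the flags mean but not a specific protocol. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    value: Optional[int] = None
    valid: bool = False           # valid flag: does the entry hold usable data?
    owner: Optional[int] = None   # ownership flag: core currently using it


def produce(entry, consumer_core, value):
    """Write new data and hand the entry to the consumer core."""
    if entry.valid:
        return False              # previous data has not been consumed yet
    entry.value, entry.valid, entry.owner = value, True, consumer_core
    return True


def consume(entry, core):
    """Read the data if it is valid and owned by this core, then release it."""
    if not entry.valid or entry.owner != core:
        return None
    entry.valid, entry.owner = False, None
    return entry.value


slot = Entry()
accepted = produce(slot, consumer_core=1, value=42)
rejected = produce(slot, consumer_core=1, value=43)   # slot still full
wrong_core = consume(slot, core=2)                    # not the owner
received = consume(slot, core=1)
```

Because each entry carries its own state, a single memory can be shared by both cores without a second, alternating buffer.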

In the present invention, the ways of transferring register values through the configurable interconnect structure include, but are not limited to: using a large number of hardwires to transfer the values of all registers in a processor core to the registers of the next-stage processor core at once; and using shift registers to shift the register values of a processor core successively into the registers of the next-stage processor core.

The registers that need to be transferred may also be determined from a register read/write record table. The register read/write record table of the present invention records how each register has been read from and written to the corresponding local data memory. If the value of a register has been written to the local data memory of the current-stage processor core and has not changed since, the next-stage processor core can simply read the data from the corresponding address in the local data memory of the current-stage processor core. This completes the transfer of that register, and its value does not need to be transferred separately to the next-stage processor core.

For example, when the value of a register is written to the corresponding local data memory, the corresponding entry in the register read/write record table is cleared to "0"; when data is written into a register, the corresponding entry in the table is set to "1". When register values are transferred, only the values of registers whose entries in the record table are "1" are transferred. Writing data into a register of the register file includes, but is not limited to, loading data from the corresponding local data memory into the register and writing the result of an executed instruction back to the register.
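The record table behavior in this example can be sketched as follows. The class `RegisterFile`, its method names, and the use of a dictionary as local data memory are hypothetical; only the set and clear rules come from the text.

```python
class RegisterFile:
    """Register file with a read/write record table (one bit per register)."""

    def __init__(self, count):
        self.values = [0] * count
        self.record = [0] * count   # 1 = must be transferred to the next stage

    def write_register(self, index, value):
        # Covers both loads from local data memory and result write-back.
        self.values[index] = value
        self.record[index] = 1

    def store_register(self, index, memory, addr):
        # The value now lives in local data memory, so the entry is cleared.
        memory[addr] = self.values[index]
        self.record[index] = 0

    def registers_to_transfer(self):
        return {i: self.values[i] for i, bit in enumerate(self.record) if bit == 1}


local_memory = {}
regs = RegisterFile(4)
regs.write_register(0, 7)
regs.write_register(1, 8)
regs.store_register(1, local_memory, 0x20)   # r1 can be read from memory later
pending = regs.registers_to_transfer()
```

Here only register 0 is transferred; the next stage can obtain the value of register 1 from address 0x20 of the current stage's local data memory.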

When the number of processor cores in the configurable multi-core/many-core device of the present invention, based on the serial multi-issue and pipeline hierarchy, has been determined, the head extension and tail extension of each code segment can also be optimized according to the code segments obtained after partitioning, reducing the number of registers that need to be transferred.

For example, in the usual case the tail extension of a code segment contains instructions that store all register values to specific local data memory addresses, and the head extension of a code segment contains instructions that load the values at those addresses into the registers; together they transfer the register values smoothly. Once the code segments are determined, the number of store and/or load instructions in the head and tail extensions can be reduced according to the instructions in the code segments.

If, in the code segment of the current-stage processor core, a register is written before its value is ever used, then two instructions can be omitted: the instruction that stores this register's value in the tail extension of the previous-stage processor core's code segment, and the instruction that loads this register from the local data memory in the head extension of the current-stage processor core's code segment.

If, in the code segment of the previous-stage processor core, the value of a register does not change after it has been stored to the local data memory, then the instruction that stores this register's value in the tail extension of the previous-stage processor core's code segment can be omitted, and a corresponding instruction is added to the head extension of the current-stage processor core's code segment to load the data from the corresponding local data memory address into the register.
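The two optimization rules can be sketched as a small analysis over a code segment. The sketch assumes each instruction is represented as a pair of register lists (registers read, registers written); the function names and the example segment are hypothetical.

```python
def overwritten_before_use(segment):
    """Registers that the segment writes before ever reading them."""
    read_so_far, dead_on_entry = set(), set()
    for reads, writes in segment:
        read_so_far |= set(reads)
        dead_on_entry |= {r for r in writes if r not in read_so_far}
    return dead_on_entry


def plan_extensions(all_registers, current_segment, stored_and_unchanged):
    """Return (tail stores of previous stage, head loads of current stage).

    stored_and_unchanged: registers the previous segment already stored to
    local data memory and did not modify afterwards.
    """
    dead = overwritten_before_use(current_segment)
    # Rule 1: no store and no load for registers overwritten before use.
    head_loads = [r for r in all_registers if r not in dead]
    # Rule 2: no tail store for registers whose value is already in memory;
    # the head extension still loads them, from the address already used.
    tail_stores = [r for r in head_loads if r not in stored_and_unchanged]
    return tail_stores, head_loads


# Hypothetical segment: r1 = f(r0); r2 = g(r1, r2)
segment = [(["r0"], ["r1"]), (["r1", "r2"], ["r2"])]
tail, head = plan_extensions(["r0", "r1", "r2", "r3"], segment, {"r3"})
```

In this example r1 needs neither a store nor a load, and r3 needs only a load, so four store and four load instructions shrink to two stores and three loads.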

In the data processing method and device of the present invention, when the code segments of multiple processor cores all jump to the same address during execution to run a section of code, and return to their own code segments when that section finishes, the code at that address may be stored repeatedly, once in the local instruction memory of each of those processor cores. The code at the same address includes, but is not limited to, function calls and loops.

In the data processing method and device of the present invention, a processor core may access the local instruction memories of other processor cores. When multiple processor cores execute the same code and the length of that code exceeds the size of the local instruction memory of a single processor core, the code may be stored in sequence across the local instruction memories of the multiple processor cores. At run time, any one of these processor cores first fetches and executes instructions from the local instruction memory that holds the first section of the code; after the first section finishes, it fetches and executes instructions from the local instruction memory that holds the second section; and so on, until the whole code has been executed.
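A minimal sketch of this arrangement follows, assuming the shared code is a list of instructions and every local instruction memory has the same capacity; `distribute` and `fetch_sequence` are hypothetical names.

```python
def distribute(code, capacity):
    """Store the code in sequence across several local instruction memories."""
    return [code[i:i + capacity] for i in range(0, len(code), capacity)]


def fetch_sequence(memories):
    """Yield (memory index, instruction) in the order a core would fetch them."""
    for index, memory in enumerate(memories):
        for instruction in memory:
            yield index, instruction


# Hypothetical shared code of seven instructions, three per memory.
shared_code = ["i0", "i1", "i2", "i3", "i4", "i5", "i6"]
memories = distribute(shared_code, capacity=3)
trace = list(fetch_sequence(memories))
```

Each core that runs the shared code walks the same sequence of memories, so the code is stored only once across the group.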

In the data processing method and device of the present invention, the multiple processor cores may execute the sections of the same code synchronously or asynchronously. They may execute the sections in parallel, serially, or in a mixed serial-parallel manner.

In the data processing method and device of the present invention, a processor core may also correspond to multiple local instruction memories, which may be of the same size or of different sizes, and of the same structure or of different structures. While one or more of these local instruction memories respond to instruction fetches from the corresponding processor core, the other local instruction memories may have their instructions updated. Instructions may be updated through a direct memory access (DMA) controller.

In a traditional system on chip (SoC), all functional modules other than the processor are application-specific integrated circuit (ASIC) modules implemented in hardwired logic. These functional modules have very high performance requirements that a conventional processor can hardly meet, so conventional processors cannot replace these ASIC modules.

In the data processing method and device of the present invention, one or more processor cores and their local memories can form a high-performance multi-core connection structure. By configuring the multi-core connection structure and placing the corresponding code segments in the corresponding local instruction memories, the structure implements a specific function and can replace an ASIC module in the SoC. The multi-core connection structure is equivalent to a functional module of the SoC, such as an image decompression module or an encryption/decryption module. These functional modules are then connected by a system bus to realize the SoC.

In the present invention, the data transfer channel between a processor core with its local memory and an adjacent processor core with its local memory is a local interconnection. A single processor core with its local memory, or a multi-core connection structure formed by multiple processor cores and their local memories joined by local interconnections, corresponds to a functional module of the SoC.

In the present invention, the data transfer channel between one multi-core connection structure corresponding to an SoC functional module and other such multi-core connection structures is the system bus. Connecting multiple multi-core connection structures that correspond to SoC functional modules through the system bus realizes an SoC in the ordinary sense.

An SoC realized with the technical solution of the present invention has a configurability that traditional SoCs lack. Configuring the data processing device of the present invention in different ways yields different SoCs. The configuration can be performed in real time during operation, so the function of the SoC can be changed at run time: the processor cores and their local memories can be dynamically reconfigured and the code segments in the corresponding local instruction memories dynamically changed, thereby changing the function of the SoC.

According to the technical solution of the present invention, within a multi-core connection structure corresponding to an SoC functional module, the data paths between one processor core with its local memory and the other processor cores with their local memories are local interconnections internal to the functional module. Transferring data over these internal local interconnections usually occupies the processor core that issues the transfer request. The system bus of the present invention may be such a local interconnection, or it may be a data transfer channel that moves data between different processor cores and their local memories without occupying any processor core. The different processor cores and their local memories may or may not be adjacent.

In the data processing method and device of the present invention, one method of constructing the system bus is to establish data transfer channels with multiple connection devices at fixed positions. The inputs and outputs of every multi-core connection structure are linked to nearby connection devices by one or more hardwires. The connection devices are likewise linked to one another by one or more hardwires. The connection devices, the wires between the multi-core connection structures and the connection devices, and the wires between the connection devices together constitute the system bus.

In the data processing method and device of the present invention, another method of constructing the system bus is to establish data transfer channels so that any processor core and its local data memory can exchange data with any other processor core and its local data memory. The ways of transferring the data include, but are not limited to, transfer through shared memory, transfer through a direct memory access controller, and transfer through a dedicated bus or network.

For example, one method is to arrange in advance one or more hardwires between every pair of processor cores, with their local data memories, within a set of processor cores; the hardwires may be configurable. When any two of these processor cores and their local data memories lie in different multi-core connection structures, that is, in different functional modules, the hardwires between them can serve as the system bus between the two multi-core connection structures.

The second method is to let all or some of the processor cores and their local data memories access other processor cores and their local data memories through a direct memory access controller. When any two of these processor cores and their local data memories lie in different multi-core connection structures, that is, in different functional modules, data can be transferred between them as needed at run time, realizing the system bus between the two multi-core connection structures.

The third method is to implement a network-on-chip (NoC) function on all or some of the processor cores and their local data memories: when data from one processor core and its local data memory is sent to other processor cores and their local data memories, a configurable interconnection network determines the destination of the data, forming a data path and completing the transfer. When any two of these processor cores and their local data memories lie in different multi-core connection structures, that is, in different functional modules, data can be transferred between them as needed at run time, realizing the system bus between the two multi-core connection structures.

Of the three methods above, the first realizes the system bus with a hardwired structure, and its connections are static; the second uses direct memory access and the third uses a network on chip, and their connections are dynamic.

In the data processing method and device of the present invention, a processor core may have a fast condition-evaluation mechanism for determining whether a branch is taken. The mechanism may be a counter used to evaluate loop conditions, or a hardware finite-state machine used to evaluate branch and loop conditions.

The configuration-level low-power technique of the present invention can also put specific processor cores into a low-power state according to configuration information. The specific processor cores include, but are not limited to, processor cores that are not used and processor cores with a relatively low workload. The low-power state includes, but is not limited to, reducing the processor clock frequency or cutting off the power supply.

The data processing method and device of the present invention may also include one or more dedicated processing modules. A dedicated processing module may be called as a macro block by a processor core and its local memory, or may act as an independent processing module that receives the output of a processor core and its local memory and sends the result to that processor core and its local memory or to other processor cores and their local memories. The processor core and local memory that output to the dedicated processing module and the processor core and local memory that receive its output may be the same or different. Dedicated processing modules include, but are not limited to, fast Fourier transform (FFT) modules, entropy coding modules, entropy decoding modules, matrix multiplication modules, convolutional coding modules, Viterbi code decoding modules and Turbo code decoding modules.

Take the matrix multiplication module as an example. If a single processor core performs a large-scale matrix multiplication, a large number of clock cycles is needed, which limits data throughput. If multiple processor cores are used, the number of execution cycles can be reduced, but the amount of data passed between processor cores increases and a large amount of processor resources is occupied. With a dedicated matrix multiplication module, a large-scale matrix multiplication can be completed in a small number of cycles. When the program is partitioned, the operations before the large-scale matrix multiplication can be assigned to several processor cores (the front group), and the operations after it to several other processor cores (the back group). The data in the output of the front group that takes part in the matrix multiplication is sent to the dedicated matrix multiplication module, and the result is sent to the back group after processing; the data in the output of the front group that does not take part in the matrix multiplication is sent directly to the back group.

Beneficial effect:

First, the data processing method and device of the present invention can partition serial program code into code segments suited to run on each processor core of the serially connected multi-processor-core structure. Different partitioning rules yield code segments of different sizes and numbers for different numbers of processor cores, which suits scalable multi-core/many-core device applications.

Second, according to the data processing method and device of the present invention, the code segments are assigned to the processor cores of the serially connected multi-processor-core structure, each processor core executes its specific instructions, and the serially connected processor cores as a whole realize the complete function of the program. The data used between the code segments split from the complete program code is passed through dedicated transfer paths, so there is almost no data dependence problem and true multi-issue is achieved. In the serially connected multi-processor-core structure, the issue width equals the number of processor cores, which greatly improves the utilization of the arithmetic units and thus achieves high throughput for the structure and for the device as a whole.

Third, local memories replace the cache usually found in processors. The local memory of each processor core holds all the instructions and data that the core will use, giving a 100% hit rate. This removes the speed bottleneck of accessing slow external memory on a cache miss and further improves the overall performance of the device.

Fourth, the multi-core/many-core device of the present invention has low-power techniques at three levels. It can apply coarse-grained power management, for example by cutting off the power supply of processor cores that are not used; it can apply fine-grained, data-driven power management at the instruction level; and it can adjust the processor core clock frequency automatically and in real time in hardware. While ensuring that the processor cores operate normally, this effectively reduces their dynamic power consumption during operation, adjusts the clock frequency on demand, and keeps manual intervention to a minimum. Because the adjustment is implemented in hardware, it is fast and can adjust the processor clock frequency in real time more effectively.

Finally, with the technical solution of the present invention, an SoC can be realized by programming and configuration alone, which shortens the R&D cycle from design to product launch. Moreover, the same hardware product can realize different functions simply by reprogramming and reconfiguring it.

Description of drawings

Although the invention may be modified, substituted and extended in various forms, some specific embodiments are shown in the drawings and described in detail in the specification. It should be understood that the inventors do not intend to limit the invention to the specific embodiments set forth; on the contrary, the intention is to protect all improvements, equivalent transformations and modifications made within the spirit or scope defined by the claims.

Fig. 1 is a flow embodiment illustrating the present invention, taking the partitioning and allocation of a high-level language program and an assembly language program as an example.

Fig. 2 is an embodiment of loop handling in the post-compilation method of the present invention.

Fig. 3 is a schematic diagram of the configurable multi-core/many-core device of the present invention based on the serial multi-issue and pipeline hierarchy.

Fig. 4 is an embodiment of the address mapping scheme.

Fig. 5 is an embodiment of data transfer between cores.

Fig. 6 is an embodiment of backpressure, exception handling, and the connection between the data memories and the shared memory.

Fig. 7 is an embodiment of the self-test and self-repair method and structure of the present invention.

Fig. 8 (a) is one embodiment of register value transfer between adjacent processor cores.

Fig. 8 (b) is a second embodiment of register value transfer between adjacent processor cores.

Fig. 9 is a third embodiment of register value transfer between adjacent processor cores.

Figure 10 (a) is one embodiment of the composition of a processor core and its local memory according to the present invention.

Figure 10 (b) is another embodiment of the composition of a processor core and its local memory according to the present invention.

Figure 10 (c) is an embodiment of the valid flag bits and ownership flag bits in a processor core and its local memory according to the present invention.

Figure 11 (a) is a typical structure of an existing SoC.

Figure 11 (b) is one embodiment of an SoC realized with the technical solution of the present invention.

Figure 11 (c) is another embodiment of an SoC realized with the technical solution of the present invention.

Figure 12 (a) is an embodiment of pre-compilation in the technical solution of the present invention.

Figure 12 (b) is an embodiment of post-compilation in the technical solution of the present invention.

Figure 13 (a) is another schematic diagram of the configurable multi-core/many-core device of the present invention based on the serial multi-issue and pipeline hierarchy.

Figure 13 (b) is a schematic diagram of a multi-core serial structure formed by configuring the configurable multi-core/many-core device of the present invention based on the serial multi-issue and pipeline hierarchy.

Figure 13 (c) is a schematic diagram of a multi-core mixed serial-parallel structure formed by configuring the configurable multi-core/many-core device of the present invention based on the serial multi-issue and pipeline hierarchy.

Figure 13 (d) is a schematic diagram of multiple multi-core structures formed by configuring the configurable multi-core/many-core device of the present invention based on the serial multi-issue and pipeline hierarchy.

Embodiment

Fig. 1 is a flow embodiment illustrating the present invention, taking the partitioning and allocation of a high-level language program and an assembly language program as an example. First, the calls in the high-level language program (101) and/or the assembly language program (102) are expanded in a pre-compilation step (103), yielding call-expanded high-level language code and/or assembly language code. The call-expanded code is then compiled by a compiler (104) into assembly code in sequential execution order, after which post-compilation (107) is performed. If the program contains only assembly language code that is already in sequential execution order, compilation (104) can be skipped and post-compilation (107) performed directly. In post-compilation (107), in this embodiment, the assembly code is run on a behavior model (108) of the processor core and partitioned on the basis of the structural information (106) of the multi-core device, yielding configuration information (110) and, at the same time, a corresponding configuration boot program (109). Finally, the corresponding processor cores (113) are configured either directly by a processor core (111) in the device or through a DMA controller (112).

In Fig. 2, the instruction partitioner first reads in the front-end code stream segment in step 1 (201) and then reads in the information related to the front-end code stream in step 2 (202). It then enters step 3 (203) and determines whether the code stream segment is a loop. If it is not a loop, step 9 (209) processes the segment in the conventional way. If it is a loop, step 4 (204) reads in the loop cycle count M and step 5 (205) reads in the number of cycles N that this program segment can hold. Step 6 (206) determines whether the loop cycle count M is greater than the number of cycles N that can be held. If it is, step 7 (207) splits the loop into a sub-loop of N cycles and a sub-loop of M-N cycles, step 8 (208) assigns M-N to M, and the process moves on to the next program segment, repeating until the loop cycle count is no longer greater than the number of cycles that can be held. In this way the case where the loop cycle count exceeds what a program segment can hold is handled effectively.
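Steps 4 to 8 of Fig. 2 can be sketched as a short routine. It assumes, for simplicity, that every program segment can hold the same number of cycles N, whereas the flow in Fig. 2 reads N again for each segment; `split_loop` is a hypothetical name.

```python
def split_loop(m, n):
    """Split a loop of m cycles into sub-loops of at most n cycles each.

    Mirrors steps 6 to 8 of Fig. 2: while M > N, cut off a sub-loop of
    N cycles, set M to M - N, and move on to the next program segment.
    """
    if n <= 0:
        raise ValueError("a program segment must hold at least one cycle")
    parts = []
    while m > n:
        parts.append(n)   # sub-loop that fills the current segment
        m -= n            # remaining cycles go to the next segment
    parts.append(m)       # the remainder fits in one segment
    return parts


# A loop of 10 cycles with segments that hold 4 cycles each.
sub_loops = split_loop(10, 4)
```

The sub-loop lengths always add up to the original cycle count, so the partitioned program performs the same number of iterations as the original loop.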

Fig. 3 is a schematic diagram of the configurable multi-core/many-core device of the present invention, based on the serial multi-issue and pipeline hierarchy. In this embodiment, the device consists of a number of processor cores (301), configurable local memories (302) and a configurable interconnect structure (303). Each processor core (301) corresponds to the configurable local memory (302) below it, and the two together constitute one stage of the macro pipeline. By configuring the configurable interconnect structure (303), multiple processor cores (301) and their corresponding configurable local memories (302) can be connected into a serially connected structure. Multiple serially connected structures can be independent of one another, or can be partly or wholly connected to each other, so that programs run serially, in parallel, or in a mixed serial-parallel manner.

Fig. 4 shows embodiments of the address mapping scheme. Fig. 4(a) implements address lookup with a lookup table. Taking a 16-bit address as an example, the 64K address space is divided into multiple small memories (403), each with a 1K address space, and writing is sequential: after one memory block has been filled, the other blocks are written. On every write, the in-block address pointer (404) automatically moves to the next available entry whose valid bit is 0, and the valid bit of the entry is set to 1 when it is written. When data is written to an entry, its address is written into the lookup table (402) at the same time. Taking a write to address BFC0 as an example, the address pointer (404) points to entry 2 of the memory (403) at that moment; when the data is written to entry 2, the value 2 is written at the corresponding address BFC0 in the lookup table (402), which establishes the address mapping. When data is read, the corresponding entry is found from the address through the lookup table (402), and the stored data is read out.

Fig. 4(b) implements address lookup with a CAM array. Taking a 16-bit address as an example, the 64K address space is again divided into multiple small memories (403), each with a 1K address space, written sequentially: after one memory block has been filled, the other blocks are written. On every write, the in-block address pointer (406) automatically moves to the next available entry whose valid bit is 0, and the valid bit of the entry is set to 1 when it is written. When data is written to an entry, its instruction address is written into the next entry of the CAM array (405) at the same time. Taking a write to address BFC0 as an example, the address pointer (406) points to entry 2 of the memory (403) at that moment; when the data is written to entry 2, the instruction address BFC0 is written into the next entry of the CAM array (405), which establishes the address mapping. When data is read, the input instruction address is compared with all the instruction addresses stored in the CAM array to find the corresponding entry, and the stored data is read out.
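The lookup-table mapping of Fig. 4(a) can be modelled in a few lines. The sketch below is only an illustration under stated assumptions: the class name `LookupMappedMemory` is hypothetical, a Python dictionary stands in for the lookup table (402), a single 1K block is modelled, and rewriting an address that is already mapped is assumed to reuse its entry (the text does not say what happens in that case).

```python
class LookupMappedMemory:
    """Toy model of one small memory (403) with its lookup table (402)."""

    def __init__(self, num_entries=1024):
        self.data = [None] * num_entries
        self.valid = [0] * num_entries   # one valid bit per entry
        self.lookup = {}                 # address -> entry number
        self.pointer = 0                 # in-block address pointer (404)

    def _advance(self):
        # Move the pointer to the next entry whose valid bit is 0.
        while self.pointer < len(self.valid) and self.valid[self.pointer]:
            self.pointer += 1
        if self.pointer == len(self.valid):
            raise MemoryError("block full; another block would be used")

    def write(self, address, value):
        if address in self.lookup:       # assumption: reuse the mapped entry
            entry = self.lookup[address]
        else:
            self._advance()
            entry = self.pointer
            self.valid[entry] = 1        # mark the entry as used
            self.lookup[address] = entry # record the address mapping
        self.data[entry] = value
        return entry

    def read(self, address):
        # Find the entry through the lookup table, then read the stored data.
        return self.data[self.lookup[address]]


mem = LookupMappedMemory()
mem.write(0x1000, "a")                   # occupies entry 0
mem.write(0x2000, "b")                   # occupies entry 1
print(mem.write(0xBFC0, "c"))            # entry chosen for address BFC0
```

The CAM variant of Fig. 4(b) differs only in the search direction: instead of indexing a table by address, the stored addresses are all compared against the input address in parallel.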

Fig. 5 shows an embodiment of data transfer between cores. All data memories are located between processor cores, and each is logically divided into an upper part and a lower part. The upper part is read and written by the processor core above the data memory, while the lower part is only read, by the processor core below the data memory. While the processor cores run the program, data is relayed downward from the data memory above. The three-to-one selectors (502, 509) can select remotely transferred data (506) to be sent into the data memories (503, 504). When the processor cores (510, 511) are not executing a Store instruction, the lower parts of the data memories (501, 503) are written, through the three-to-one selectors (502, 509) respectively, into the upper parts of the corresponding next data memories (503, 504), and the valid bit V of the written row is set to 1 at the same time. When a Store instruction is executed, the register file writes values only to the data memory below. When a Load instruction needs to fetch the data at the corresponding address, the two-to-one selectors (505, 507) decide, according to the valid bit V of the data memories (503, 504), whether to fetch from the corresponding data memory above (501, 503) or from the data memory below (503, 504). If the valid bit V of an entry in the data memories (503, 504) is 1, indicating that the data has been written and updated from the data memory above (501, 503), then, provided the remotely transferred data (506) is not selected, the three-to-one selectors (502, 509) select the register file outputs of the processor cores (510, 511) as their inputs, which guarantees that the stored data is the latest value after processing by the processor cores (510, 511). While new data is being written into the upper part of data memory (503), the lower part of data memory (503) transfers data to the upper part of data memory (504). During a transfer, a pointer marks the entry currently being transferred; when the pointer reaches the last entry, the transfer is about to complete. By the time one program segment finishes running, the data should have been transferred to the next memory. When the next program segment runs, the upper part of data memory (501) transfers data to the lower part of data memory (503), the upper part of data memory (503) transfers data to the lower part of data memory (504), and the upper part of data memory (504) transfers data downward, which forms a ping-pong transfer structure. Every data memory sets aside a portion for instruction storage according to the size of the required instruction space; that is, the data memory and the instruction memory are not physically separate.

Fig. 6 shows an embodiment of back pressure, exception handling, and the connection between the data memories and the shared memory. In this embodiment, the DMA controller (616) writes the corresponding code segments (615) into the instruction memories (601, 609, 610, 611). The processor cores (602, 604, 606, 608) run the code in the corresponding instruction memories (601, 609, 610, 611) and read and write the corresponding data memories (603, 605, 607, 612). Take processor core (604), data memory (605) and the next-stage processor core (606) as an example. Both the previous-stage and the next-stage processor cores (604, 606) access data memory (605); the data sub-memories in data memory (605) can be ping-pong exchanged only after the previous-stage processor core (604) has finished writing data memory (605) and the next-stage processor core (606) has finished reading data memory (605). The back pressure signal (614) is used by the next-stage processor core (606) to notify data memory (605) whether it has completed its read operation. The back pressure signal (613) is used by data memory (605) to notify the previous-stage processor core (604) whether an overflow has occurred, and to pass on the back pressure signal coming from the next-stage processor core (606). The previous-stage processor core (604) uses its own running state and the back pressure signal passed on by data memory (605) to judge whether the macro pipeline is blocked, decides whether to ping-pong exchange the data sub-memories in data memory (605), and generates a back pressure signal that continues on to the stage before it. Through this reverse transfer of back pressure signals, from processor core to data memory to processor core, the operation of the macro pipeline can be controlled.

All data memories (603, 605, 607, 612) are connected to the shared memory (618) through the connection (619). When the address that a data memory needs to write or read lies outside that data memory, an address exception occurs; the address is looked up in the shared memory (618), and once it is found, the data is written to that address or read from it. When processor core (608) needs to use data in data memory (605), an exception also occurs, and data memory (605) transfers the data to processor core (608) through the shared memory (618). The exception information generated by the processor cores and the data memories is transferred to the exception handling module (617) through a dedicated channel (620). In this embodiment, taking an overflow of an operation result in a processor core as an example, the exception handling module (617) controls the processor core to perform a saturation operation on the overflowed result. Taking an overflow of a data memory as an example, the exception handling module (617) controls the data memory to access the shared memory and store the data there. During this process, the exception handling module (617) sends a signal to the processor core or data memory concerned so that it blocks, and it resumes running once the exception handling operation has completed; the other processor cores and data memories each determine whether to block according to the signals propagated to them by back pressure.
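The exchange condition and the back pressure behaviour described for Fig. 6 can be illustrated with a small software model of one data memory such as (605). This is a sketch under assumptions, not the hardware design: the class `PingPongMemory`, its method names, and the use of Python lists as sub-memories are all hypothetical.

```python
class PingPongMemory:
    """Two sub-memories: one written by the previous stage, one read by the next."""

    def __init__(self):
        self.banks = [[], []]
        self.write_bank = 0       # bank currently owned by the previous stage
        self.write_done = False   # previous stage has finished writing
        self.read_done = True     # nothing to read before the first exchange

    def write(self, value):       # called by the previous-stage core
        self.banks[self.write_bank].append(value)

    def read(self):               # called by the next-stage core
        return list(self.banks[1 - self.write_bank])

    def finish_write(self):
        self.write_done = True

    def finish_read(self):        # plays the role of back pressure signal (614)
        self.read_done = True

    def stalled(self):
        # Back pressure toward the previous stage: it is done writing, but the
        # next stage has not yet finished reading the other sub-memory.
        return self.write_done and not self.read_done

    def swap(self):
        # Exchange only when writing AND reading have both completed.
        if not (self.write_done and self.read_done):
            return False
        self.write_bank = 1 - self.write_bank
        self.banks[self.write_bank] = []   # old read bank is reused for writes
        self.write_done = False
        self.read_done = False
        return True


mem = PingPongMemory()
mem.write(1)
mem.finish_write()
print(mem.swap())     # exchange allowed: nothing was pending on the read side
mem.write(2)
mem.finish_write()
print(mem.stalled())  # previous stage must wait for the next stage to finish
```

In a chain of such memories, a stalled stage would in turn stop draining the memory above it, which is how the blocking propagates backward along the macro pipeline.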

Referring to Fig. 7, this figure shows an embodiment of the self-test and self-repair method and structure. In this self-test and self-repair structure (701), the test vectors produced by the vector generator (702) are delivered synchronously to every processor core. The test vector distribution controller (703) controls the connections between each processor core and the vector generator (702), and the operation result distribution controller (709) controls the connections between each processor core and the comparators. Each processor core compares its operation results with those of other processor cores through the comparators. In this embodiment, each processor core can be compared with the other adjacent processor cores; for example, processor core (704) can be compared with processor cores (705, 706, 707) through the comparison logic (708). In this embodiment, each comparison logic may contain one or more comparators. If a comparison logic has one comparator, each processor core is compared with its adjacent processor cores one after another; if a comparison logic has multiple comparators, each processor core is compared with its adjacent processor cores simultaneously. The test results are written directly from each comparison logic into the test result table (710).
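The neighbour comparison of Fig. 7 can be sketched in software as follows. The sketch assumes a hypothetical adjacency map and a hypothetical diagnosis rule (a core is suspected when it disagrees with every neighbour it was compared with); the text above only states that results are compared and written to the test result table (710), so the diagnosis step is an illustration, not the described method.

```python
def compare_neighbors(results, neighbors):
    """Build a test result table: (core_a, core_b) -> True if results match.

    `results` maps a core id to the value it computed for the shared test
    vector; `neighbors` maps a core id to the ids of its adjacent cores.
    """
    table = {}
    for core, adjacent in neighbors.items():
        for other in adjacent:
            pair = (min(core, other), max(core, other))  # store each pair once
            table[pair] = results[core] == results[other]
    return table


def suspect_cores(table, core_ids):
    """Assumed rule: suspect a core that mismatches all compared neighbours."""
    suspects = []
    for core in core_ids:
        outcomes = [ok for pair, ok in table.items() if core in pair]
        if outcomes and not any(outcomes):
            suspects.append(core)
    return suspects


# Four cores in a ring; core 2 produces a different result for the same vector.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
results = {0: 5, 1: 5, 2: 7, 3: 5}
table = compare_neighbors(results, ring)
print(suspect_cores(table, range(4)))
```

With a single comparator per comparison logic the inner loop corresponds to sequential comparisons; with several comparators the same pairs would be evaluated in parallel.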

Referring to Fig. 8, this figure gives three embodiments of register value transfer between adjacent processor cores.

In the embodiment of Fig. 8(a), the processor core has a register file (801) containing thirty-one 32-bit general-purpose registers. To transfer all general-purpose register values from the previous-stage processor core (802) to the current-stage processor core (803), 992 hardwired connections can be used to connect every output bit of every general-purpose register in the previous-stage processor core (802), one to one through multiplexers, to the corresponding input bit of the general-purpose registers in the current-stage processor core (803). When register values are transferred, the values of all thirty-one 32-bit general-purpose registers in the previous-stage processor core (802) can be delivered to the current-stage processor core (803) within one cycle. Fig. 8(a) shows in detail the hardwired connection of one bit (804) of a general-purpose register; the hardwired connections of the remaining 991 bits are the same as for this bit (804). The output (806) of the corresponding bit (805) in the previous-stage processor core (802) is connected, through the hardwired connection (807) and the multiplexer (808), to the input of this bit (804) in the current-stage processor core (803). When the processor core performs arithmetic, logic or similar operations, the multiplexer (808) selects the data (809) coming from the current-stage processor core. When the processor core performs a load operation, the multiplexer selects the data (809) coming from the current-stage processor core if the data exists in the local memory corresponding to the current-stage processor core, and otherwise selects the data (810) transferred from the previous-stage processor core. When register values are transferred, the multiplexer (808) selects the data (810) transferred from the previous-stage processor core. With all 992 bits transferred simultaneously, the whole register file can be transferred in one cycle.
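The selection rule of the multiplexer (808) can be restated as a small decision function. This is only a sketch: the enumeration `Op`, the function `mux_select` and the string labels are names invented for the illustration, and the three cases simply mirror the three situations described above.

```python
from enum import Enum


class Op(Enum):
    ALU = "arithmetic or logic operation"
    LOAD = "load operation"
    TRANSFER = "register value transfer"


def mux_select(op, in_local_memory=False):
    """Return which multiplexer input is selected for a register write.

    "current"  -> data from the current-stage processor core (809)
    "previous" -> data transferred from the previous-stage core (810)
    """
    if op is Op.TRANSFER:
        return "previous"
    if op is Op.LOAD:
        # A load is served locally only if the data is in the local memory
        # of the current-stage core.
        return "current" if in_local_memory else "previous"
    return "current"              # arithmetic, logic and similar operations


print(mux_select(Op.LOAD, in_local_memory=False))
```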

In the embodiment of Fig. 8(b), the adjacent processor cores (820, 822) each have a register file (821, 823) containing multiple 32-bit general-purpose registers. To transfer register values from the previous-stage processor core (820) to the current-stage processor core (822), 32 hardwired connections can be used to connect the data output (829) of the register file (821) in the previous-stage processor core (820) to an input of the multiplexer (827) that is connected to the data input (830) of the register file (823) in the current-stage processor core (822). The inputs of the multiplexer (827) are the data (824) coming from the current-stage processor core and the data (825) sent from the previous-stage processor core over the hardwired connections (826). When the processor core performs arithmetic, logic or similar operations, the multiplexer (827) selects the data (824) coming from the current-stage processor core. When the processor core performs a load operation, the multiplexer selects the data (824) coming from the current-stage processor core if the data exists in the local memory corresponding to the current-stage processor core, and otherwise selects the data (825) transferred from the previous-stage processor core. When register values are transferred, the multiplexer (827) selects the data (825) transferred from the previous-stage processor core. The register address generation modules (828, 832) corresponding to the register files (821, 823) generate the addresses of the registers whose values are to be transferred and deliver them to the address inputs (831, 833) of the register files (821, 823); the values of those registers are then transferred, over several passes, from register file (821) to register file (823) through the hardwired connections (826) and the multiplexer (827). In this way, with only a small number of additional hardwired connections, all or part of the register values in the register file can be transferred over multiple cycles.
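The trade-off between the two hardwired schemes is simple arithmetic: the number of wires is the number of registers moved per cycle times the register width, and the number of cycles is the register count divided by the registers moved per cycle, rounded up. The helper below is a sketch; the function name `transfer_cost` is hypothetical, and applying it to the narrow scheme with thirty-one registers assumes that exactly one register is moved per cycle, which the text does not state explicitly.

```python
import math


def transfer_cost(num_registers, width_bits, registers_per_cycle):
    """Return (hardwired connections, cycles) for a register file transfer."""
    wires = registers_per_cycle * width_bits
    cycles = math.ceil(num_registers / registers_per_cycle)
    return wires, cycles


# Fully parallel scheme: all thirty-one 32-bit registers in one cycle.
print(transfer_cost(31, 32, 31))
# Narrow scheme: one 32-bit register per cycle (assumed rate).
print(transfer_cost(31, 32, 1))
```

The first call reproduces the 992 connections and single cycle of the fully parallel scheme; the second shows how a 32-wire link trades wiring for transfer time.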

In the embodiment of Fig. 9, the adjacent processor cores (940, 942) each have a register file (941, 943) containing multiple 32-bit general-purpose registers. To transfer register values from the previous-stage processor core (940) to the current-stage processor core (942), the previous-stage processor core (940) can first use data store instructions to write the register values in the register file (941) into the local data memory (954) corresponding to the previous-stage processor core (940); the current-stage processor core (942) then uses data load instructions to read the corresponding data from the local data memory (954) and write it into the corresponding registers of the register file (943). In this embodiment, the data output (949) of the register file (941) in the previous-stage processor core (940) is connected to the data input (948) of the local data memory (954) through a 32-bit connection (946), and the data input (950) of the register file (943) in the current-stage processor core (942) is connected to the data output (952) of the local data memory (954) through the multiplexer (947) and a 32-bit connection (953). The inputs of the multiplexer (947) are the data (944) coming from the current-stage processor core and the data (945) sent from the previous-stage processor core over the 32-bit connection (953). When the processor core performs arithmetic, logic or similar operations, the multiplexer (947) selects the data (944) coming from the current-stage processor core. When the processor core performs a load operation, the multiplexer selects the data (944) coming from the current-stage processor core if the data exists in the local memory corresponding to the current-stage processor core, and otherwise selects the data (945) transferred from the previous-stage processor core. When register values are transferred, the multiplexer (947) selects the data (945) transferred from the previous-stage processor core. In the embodiment of Fig. 8(c), the values of all registers in the register file (941) can first be written one by one into the local data memory (954) and then written one by one into the register file (943). Alternatively, the values of only some of the registers in the register file (941) can first be written one by one into the local data memory (954) and then written one by one into the register file (943). As a further alternative, the value of one register in the register file (941) can be written into the local data memory (954) and immediately written into the register file (943), and this process repeated until all the register values that need to be transferred have been transferred.

Referring to Figure 10, this figure gives two embodiments of the connection structure formed by the processor core of the present invention and its corresponding local memory. Those skilled in the art may make various possible substitutions, adjustments and improvements to the components of these embodiments in accordance with the technical solution and concept of the present invention, and all such substitutions, adjustments and improvements shall fall within the scope of protection of the claims of the present invention.

The embodiment of Figure 10(a) comprises a processor core (1001), which contains a local instruction memory and a local data memory, and the local data memory (1002) corresponding to its previous-stage processor core. The processor core (1001) consists of a local instruction memory (1003), a local data memory (1004), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009) and an output buffer (1010).

The local instruction memory (1003) stores the instructions that the processor core (1001) needs to execute. The operands required by the execution unit (1005) in the processor core (1001) come from the register file (1006) or from immediate values in the instructions; execution results are written back to the register file (1006).

In this embodiment, each local data memory has two sub-memories. Taking the local data memory (1004) as an example, the data read from the two sub-memories is selected by the multiplexers (1018, 1019) to produce the final output data (1020).

The data in the local data memories (1002, 1004), the data in the write buffer (1009), or the data (1011) in the external shared memory can be read into the register file (1006) by data load instructions. In this embodiment, the data in the local data memories (1002, 1004), the data in the write buffer (1009) and the data (1011) in the external shared memory are selected by the multiplexers (1016, 1017) and then input to the register file (1006).

By means of data store instructions, the data in the register file (1006) can be stored into the local data memory (1004) after being delayed by the write buffer (1009), or stored into the external shared memory after being delayed by the output buffer (1010). While data is being read from the local data memory (1002) into the register file (1006), the same data can be stored into the local data memory (1004) after being delayed by the write buffer (1009), which accomplishes the LIS function of the present invention and realizes free data transfer.

In the embodiment of Figure 10(a), the data received by the write buffer (1009) has three sources: data from the register file (1006), data from the local data memory (1002) of the previous-stage processor core, and data (1011) from the external shared memory. These three data sources are selected by the multiplexer (1012) and then input to the write buffer (1009).

In the embodiment of Figure 10(a), a local data memory accepts data input only from the write buffer of the same processor core. For example, in the processor core (1001), the local data memory (1004) accepts data input only from the write buffer (1009).

In the embodiment of Figure 10(a), the local instruction memory (1003) and the local data memories (1002, 1004) each consist of two identical sub-memories, so that read and write operations can be performed on different sub-memories of a local memory at the same time. This structure makes it possible to implement the local data memory with ping-pong buffer exchange described in the technical solution of the present invention. The address received by the local instruction memory (1003) is generated by the program counter (1008). The address received by the local data memory (1004) has three sources: the address for storing data, from the address storage portion of the write buffer (1009) of the current-stage processor core; the address for reading data, from the data address generation module (1007) of the current-stage processor core; and the address (1013) for reading data, from the data address generation module of the next-stage processor core. These three addresses are selected by the multiplexers (1014, 1015) and then input to the address receiving modules of the different sub-memories in the local data memory (1004).

Correspondingly, the address received by the local data memory (1002) also has three sources: the address for storing data, from the address storage portion of the write buffer of its own processor core; the address for reading data, from the data address generation module of its own processor core; and the address for reading data, from the data address generation module (1007) of the next-stage processor core. These addresses are selected by multiplexers and then input to the address receiving modules of the different sub-memories in the local data memory (1002).

Figure 10(b) shows another connection structure formed by the processor core of the present invention and its corresponding local memory. It comprises a processor core (1021), which contains a local instruction memory and a local data memory, and the local data memory (1022) corresponding to its previous-stage processor core. The processor core (1021) consists of a local instruction memory (1003), a local data memory (1024), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009) and an output buffer (1010).

The connection structure of the embodiment of Figure 10(b) is largely the same as that of the embodiment of Figure 10(a); the only difference is that in this embodiment the local data memories (1022, 1024) each consist of a dual-port memory. A dual-port memory can support read and write operations at two different addresses at the same time.

The address received by the local data memory (1024) has three sources: the address for storing data, from the address storage portion of the write buffer (1009) of the current-stage processor core; the address for reading data, from the data address generation module (1007) of the current-stage processor core; and the address (1025) for reading data, from the data address generation module of the next-stage processor core. These three addresses are selected by the multiplexer (1026) and then input to the address receiving module of the local data memory (1024).

Correspondingly, the address received by the local data memory (1022) also has three sources: the address for storing data, from the address storage portion of the write buffer of its own processor core; the address for reading data, from the data address generation module of its own processor core; and the address for reading data, from the data address generation module (1007) of the next-stage processor core. These addresses are selected by a multiplexer and then input to the address receiving module of the local data memory (1022).

Because the data load and data store instructions that access memory generally account for no more than 40% of the instructions in a program, the dual-port memory in the embodiment of Figure 10(b) can be replaced with a single-port memory. The order of the instructions in the program is adjusted statically at compile time, or the instruction execution order is adjusted dynamically at run time, so that an instruction that needs to access memory is carried out at the same time as an instruction that does not. This makes the connection structure simpler and more efficient.

Each local data memory in the embodiment of Figure 10(b) is in effect a dual-port memory, which can simultaneously support two reads, two writes, or one read and one write. To ensure that data is not overwritten by mistake during execution, the method shown in Figure 10(c) can be used: a valid flag (1032) and an ownership flag (1033) are added for every address in the local data memory (1031).

In Figure 10(c), the valid flag (1032) indicates whether the data (1034) at the corresponding address in the local data memory (1031) is valid. For example, "1" can indicate that the data (1034) at that address is valid and "0" that it is invalid. The ownership flag (1033) indicates which processor core the data (1034) at the corresponding address in the local data memory (1031) is intended for. For example, "0" can indicate that the data (1034) at that address is used by the processor core (1035) corresponding to the local data memory (1031), and "1" that it is used both by the processor core (1035) corresponding to the local data memory (1031) and by its next-stage processor core (1036).

In a specific embodiment, the attributes of each item of data stored in the local data memory can be described using the above definitions, which ensures that the valid flag (1032) and the ownership flag (1033) are read and written correctly.

In the embodiment of Figure 10(c), if the valid flag (1032) of an address in the local data memory (1031) is "0", the data (1034) at that address is invalid; that is, a data store operation can be performed directly on that address if needed. If the valid flag (1032) is "1" and the ownership flag (1033) is "0", the data (1034) at that address is valid and is used by the processor core (1035) corresponding to the local data memory (1031), so the current-stage processor core (1035) can perform a data store operation directly on that address if needed. If the valid flag (1032) is "1" and the ownership flag (1033) is "1", the data (1034) at that address is valid and is used both by the processor core (1035) corresponding to the local data memory (1031) and by its next-stage processor core (1036). In that case, if the current-stage processor core (1035) needs to perform a data store operation on that address, it must wait until the ownership flag (1033) becomes "0". That is, the data (1034) at that address is first transferred to the corresponding location in the local data memory (1037) of the next-stage processor core (1036), and the ownership flag (1033) of that address in the local data memory (1031) of the current-stage processor core (1035) is set to "0" at the same time; only then can the current-stage processor core (1035) perform the data store operation on that address.

In the embodiment of Figure 10(c), when the current-stage processor core (1035) performs a data store operation on its corresponding local data memory (1031), the corresponding valid flag (1032) can be set, and the ownership flag can be decided according to whether the data (1034) will be used by the next-stage processor core (1036): if the data will be used by the next-stage processor core (1036), the ownership flag (1033) is set; otherwise it is reset. Alternatively, the corresponding ownership flag (1033) can always be set together with the corresponding valid flag (1032); although this requires a larger local data memory (1031), it simplifies the implementation.
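The store rules governed by the valid flag (1032) and the ownership flag (1033) can be captured in a short model. The sketch below is an illustration under assumptions: the class `FlaggedMemory` and its method names are hypothetical, the wait for the ownership flag to clear is modelled as an immediate forward-then-store, and the way the next-stage memory sets its own flags on receipt (valid set, ownership cleared) is not specified in the text and is assumed here.

```python
class FlaggedMemory:
    """Local data memory with a valid flag and an ownership flag per address."""

    def __init__(self, size, next_memory=None):
        self.data = [0] * size
        self.valid = [0] * size          # 1: the data at this address is valid
        self.owner = [0] * size          # 1: also needed by the next-stage core
        self.next_memory = next_memory   # memory of the next-stage core

    def receive(self, addr, value):
        # Assumed behaviour when data is forwarded in from the previous stage.
        self.data[addr] = value
        self.valid[addr] = 1
        self.owner[addr] = 0

    def store(self, addr, value, used_by_next=False):
        if self.valid[addr] and self.owner[addr]:
            # Valid data still owed to the next stage: forward it first and
            # clear the ownership flag before overwriting.
            if self.next_memory is None:
                raise RuntimeError("no next-stage memory to forward data to")
            self.next_memory.receive(addr, self.data[addr])
            self.owner[addr] = 0
        # Invalid data, or valid data used only by this core: store directly.
        self.data[addr] = value
        self.valid[addr] = 1
        self.owner[addr] = 1 if used_by_next else 0


nxt = FlaggedMemory(4)
cur = FlaggedMemory(4, next_memory=nxt)
cur.store(1, 10, used_by_next=True)      # valid = 1, ownership = 1
cur.store(1, 20)                         # forwards 10 to nxt, then overwrites
print(nxt.data[1], cur.data[1])
```

The simplified policy mentioned above, setting the ownership flag on every store, corresponds to always calling `store` with `used_by_next=True`.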

Referring to Figure 11(a), this figure shows the typical structure of an existing system on chip (SoC). The processor core (1101), the digital signal processor core (1102), the functional units (1103, 1104, 1105), the input/output interface control module (1106) and the memory control module (1108) are all connected to the system bus (1110). The SoC can exchange data with the peripheral device (1107) through the input/output interface control module (1106), and with the external memory (1109) through the memory control module (1108).

Referring to Figure 11(b), this figure shows an embodiment of an SoC implemented on the basis of the technical solution of the present invention. In this embodiment, the processor core and corresponding local memory (1121), together with six other processor cores and corresponding local memories, constitute the functional module (1124); the processor core and corresponding local memory (1122), together with four other processor cores and corresponding local memories, constitute the functional module (1125); and the processor core and corresponding local memory (1123), together with two other processor cores and corresponding local memories, constitute the functional module (1126). Each of the functional modules (1124, 1125, 1126) can correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104 or 1105), the input/output interface control module (1106) or the memory control module (1108) in the embodiment of Figure 11(a).

Taking functional module (1126) as an example, the processor cores and corresponding local memories (1123, 1127, 1128, 1129) form a serially connected multi-core structure, and these four processor cores and corresponding local memories (1123, 1127, 1128, 1129) jointly realize the function of functional module (1126).

Data transfer between the processor core and corresponding local memory (1123) and the processor core and corresponding local memory (1127) is realized through internal connection (1130). Similarly, data transfer between the processor core and corresponding local memory (1127) and the processor core and corresponding local memory (1128) is realized through internal connection (1131), and data transfer between the processor core and corresponding local memory (1128) and the processor core and corresponding local memory (1129) is realized through internal connection (1132).

Functional module (1126) is connected to bus connection module (1138) through hardwired connections (1133, 1134), so that data can be transferred in both directions between functional module (1126) and bus connection module (1138). Similarly, data can be transferred between functional module (1125) and bus connection module (1139), and between functional module (1124) and bus connection modules (1140, 1141). Bus connection modules (1138) and (1139) can exchange data through hardwired connection (1135). Bus connection modules (1139) and (1140) can exchange data through hardwired connection (1136). Bus connection modules (1140) and (1141) can exchange data through hardwired connection (1137). In this way, data can be transferred among functional modules (1124), (1125) and (1126); the bus connection modules (1138, 1139, 1140, 1141) and hardwired connections (1135, 1136, 1137) realize the function of the system bus (1110) in Figure 11(a) and, together with functional modules (1124, 1125, 1126), constitute a typical system-on-chip structure.

Because the number of processor cores and corresponding local memories in the configurable multi-core/many-core device proposed by the present invention is easy to scale, the method of this embodiment makes it easy to realize various types of system on chip. In addition, while a configurable multi-core/many-core device based on the present invention is running, the structure of the system on chip can be changed flexibly through real-time dynamic configuration.

Referring to Figure 11(c), Figure 11(c) shows another embodiment of a system on chip realized with the technical solution of the present invention. In this embodiment, the processor core and corresponding local memory (1151), together with six other processor cores and their corresponding local memories, constitute functional module (1163); the processor core and corresponding local memory (1152), together with four other processor cores and their corresponding local memories, constitute functional module (1164); and the processor core and corresponding local memory (1153), together with two other processor cores and their corresponding local memories, constitute functional module (1165). Each of the functional modules (1163, 1164, 1165) can correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104 or 1105), the input/output interface control module (1106) or the memory control module (1108) of the embodiment in Figure 11(a).

Taking functional module (1165) as an example, the processor cores and corresponding local memories (1153, 1154, 1155, 1156) form a serially connected multi-core structure, and these four processor cores and corresponding local memories (1153, 1154, 1155, 1156) jointly realize the function of functional module (1165).

Data transfer between the processor core and corresponding local memory (1153) and the processor core and corresponding local memory (1154) is realized through internal connection (1160). Similarly, data transfer between the processor core and corresponding local memory (1154) and the processor core and corresponding local memory (1155) is realized through internal connection (1161), and data transfer between the processor core and corresponding local memory (1155) and the processor core and corresponding local memory (1156) is realized through internal connection (1162).

In this embodiment, as one example, the data transfer requirement between functional module (1165) and functional module (1164) is met through data transfer between the processor core and corresponding local memory (1156) and the processor core and corresponding local memory (1166). According to the technical solution of the present invention, during operation, once the processor core and its corresponding local memory (1156) need to exchange data with the processor core and its corresponding local memory (1166), the configurable interconnect network automatically configures and sets up the bidirectional data path (1158) between them according to that data transfer requirement. Similarly, once the processor core and its corresponding local memory (1166) need to transfer data one way to the processor core and its corresponding local memory (1156), or the processor core and its corresponding local memory (1156) need to transfer data one way to the processor core and its corresponding local memory (1166), a unidirectional data path can be set up by the same method.
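The on-demand setup of bidirectional and unidirectional data paths can be pictured with the small model below. It is a sketch under stated assumptions: the class `ConfigurableInterconnect` and its methods are hypothetical names, the reference numerals are used only as node identifiers, and a path is modelled as a directed pair rather than as physical wiring.

```python
class ConfigurableInterconnect:
    """Toy model of an interconnect that sets up data paths on request."""

    def __init__(self):
        # Each established path is stored as a directed (source, destination) pair.
        self.paths = set()

    def request(self, src, dst, bidirectional=True):
        """Set up a path when a transfer requirement arises at run time."""
        self.paths.add((src, dst))
        if bidirectional:
            self.paths.add((dst, src))

    def can_send(self, src, dst):
        """Report whether data can currently flow from src to dst."""
        return (src, dst) in self.paths


net = ConfigurableInterconnect()
net.request(1156, 1166)                       # models bidirectional path (1158)
net.request(1151, 1152, bidirectional=False)  # a one-way path, for illustration
```

After these two requests, data can flow in both directions between nodes 1156 and 1166, but only from 1151 to 1152.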

In this embodiment, the bidirectional data path (1157) between the processor core and its corresponding local memory (1151) and the processor core and its corresponding local memory (1152) is also set up, as is the bidirectional data path (1159) between the processor core and corresponding local memory (1165) and the processor core and its corresponding local memory (1155). In this way, data can be transferred among functional modules (1163), (1164) and (1165); the bidirectional data paths (1157, 1158, 1159) realize the function of the system bus (1110) in Figure 11(a) and, together with functional modules (1163, 1164, 1165), constitute a typical system-on-chip structure.

Depending on the application requirements of the system on chip, there is not necessarily only one group of data paths between any two functional modules. Because the number of processor cores in the configurable multi-core/many-core device proposed by the present invention is easy to scale, the method of this embodiment makes it easy to realize various types of system on chip. In addition, while a configurable multi-core/many-core device based on the present invention is running, the structure of the system on chip can be changed flexibly through real-time dynamic configuration.

Figure 12 shows the pre-compilation and post-compilation embodiments, where Figure 12(a) is the pre-compilation embodiment and Figure 12(b) is the post-compilation embodiment.

As shown in Figure 12(a), the left side is the original program code (1201, 1203, 1204). The code contains two function calls, a call to function A and a call to function B; 1203 and 1204 are the code of function A and function B themselves. After pre-compilation expansion, the calls to function A and function B are each replaced by the corresponding function code, so the expanded code contains no function calls, as shown at 1202.
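The expansion step of Figure 12(a) can be illustrated with a minimal sketch. The representation is an assumption made for illustration only: code is a list of text lines, a call is written as `call NAME`, and `expand_calls` is a hypothetical helper. Recursive functions are assumed not to occur.

```python
def expand_calls(code, functions):
    """Replace every 'call NAME' line by the body of function NAME.

    Nested calls inside a function body are expanded as well; the sketch
    assumes there is no recursion, otherwise expansion would not terminate.
    """
    expanded = []
    for line in code:
        if line.startswith("call "):
            name = line.split()[1]
            expanded.extend(expand_calls(functions[name], functions))
        else:
            expanded.append(line)
    return expanded


# Stand-ins for the code of function A (1203) and function B (1204).
functions = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
# Stand-in for the original program code (1201) with its two calls.
main = ["m1", "call A", "m2", "call B", "m3"]

flat = expand_calls(main, functions)
# flat == ["m1", "a1", "a2", "m2", "b1", "b2", "m3"]
```

The result corresponds to the expanded code (1202): a single sequential stream with no call lines left, which is what later partitioning operates on.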

Figure 12(b) is the post-compilation embodiment. As shown in the figure, the original object code (1205) is the object code produced by ordinary compilation and is based on sequential execution. After partitioning by post-compilation, it forms the code blocks (1206, 1207, 1208, 1209, 1210, 1211) shown in the figure, and each code block is assigned to a corresponding processor core for execution. The A loop body is split out as a separate code block (1207), while the B loop body, being relatively large, is divided into two code blocks: B loop body 1 (1209) and B loop body 2 (1210). These two code blocks execute on two processor cores and jointly complete the B loop body.
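One simple way to picture the partitioning step is a greedy split of the sequential instruction stream against a per-core cycle budget. This is a hedged sketch, not the partitioning rule of the invention: `split_into_blocks`, the cycle counts and the budget are all hypothetical, and jumps and loop-carried data are ignored.

```python
def split_into_blocks(instructions, budget):
    """Greedily cut a sequential stream into contiguous code blocks.

    instructions: list of (name, cycles) pairs in program order.
    budget: target number of cycles per block (one block per core).
    A new block is started whenever adding the next instruction would
    exceed the budget.
    """
    blocks, current, used = [], [], 0
    for name, cycles in instructions:
        if current and used + cycles > budget:
            blocks.append(current)
            current, used = [], 0
        current.append(name)
        used += cycles
    if current:
        blocks.append(current)
    return blocks


stream = [("pre", 2), ("A1", 2), ("A2", 2),
          ("B1", 3), ("B2", 3), ("B3", 3), ("B4", 3),
          ("post", 2)]
blocks = split_into_blocks(stream, budget=6)
# blocks == [["pre", "A1", "A2"], ["B1", "B2"], ["B3", "B4"], ["post"]]
```

In this toy stream the larger "B" portion ends up in two blocks, much as the B loop body is divided into B loop body 1 and B loop body 2 so that no single core receives a disproportionate share of the work.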

Referring to Figure 13, Figure 13(a) is a schematic diagram of the configurable multi-core/many-core device of the present invention based on serial multi-issue and a pipeline hierarchy; Figure 13(b) is a schematic diagram of a multi-core serial structure formed by configuration; Figure 13(c) is a schematic diagram of a multi-core serial-parallel hybrid structure formed by configuration; and Figure 13(d) is a schematic diagram of multiple multi-core structures formed by configuration.

As shown in Figure 13(a), the device consists of a plurality of processor cores and configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) and configurable interconnect structures (1302, 1304, 1306, 1308, 1310, 1312, 1314, 1316, 1318). In this embodiment, each processor core and configurable local memory constitutes one stage of the macro pipeline. By configuring the configurable interconnect structures (such as 1302), the processor cores and configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) can be connected into serially connected structures. Multiple serially connected structures can each be independent, or can be partly or fully interconnected, running programs serially, in parallel, or in a serial-parallel hybrid manner.

As shown in Figure 13(b), configuring the corresponding configurable interconnect structures forms the multi-core serial structure in the figure, in which the processor core and configurable local memory (1301) are the first stage of the multi-core serial structure and the processor core and configurable local memory (1317) are its last stage.

As shown in Figure 13(c), by configuring the corresponding configurable interconnect structures, the processor cores and configurable local memories (1301, 1303, 1305, 1313, 1315, 1317) form a serial structure while the processor cores and configurable local memories (1307, 1309, 1311) form a parallel structure, yielding a multi-core processor with a serial-parallel hybrid structure.

As shown in Figure 13(d), by configuring the corresponding configurable interconnect structures, the processor cores and configurable local memories (1301, 1307, 1313, 1315) form one serial structure and the processor cores and configurable local memories (1303, 1309, 1305, 1311, 1317) form another serial structure, giving two fully independent serial structures.
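The configuration of Figure 13(d) can be written down as plain data, which makes the idea of "independent serial structures" easy to check. The sketch below is an assumption-laden illustration: the list-of-lists encoding and the helpers `chain_links` and `are_independent` are hypothetical, and the reference numerals serve only as identifiers of core/memory pairs.

```python
def chain_links(chain):
    """Return the (previous stage, next stage) links of one serial structure."""
    return list(zip(chain, chain[1:]))


def are_independent(chains):
    """True if no core/memory pair appears in more than one serial structure."""
    seen = set()
    for chain in chains:
        members = set(chain)
        if seen & members:
            return False
        seen |= members
    return True


# The two serial structures named for Figure 13(d).
fig13d = [
    [1301, 1307, 1313, 1315],
    [1303, 1309, 1305, 1311, 1317],
]

links = [chain_links(chain) for chain in fig13d]
# links[0] == [(1301, 1307), (1307, 1313), (1313, 1315)]
```

Each link is one stage-to-stage connection the interconnect would have to enable; changing the lists is all it takes to describe a different arrangement such as the single chain of Figure 13(b).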

Claims

1. A data processing method for executing a program on a multi-processor-core structure and obtaining a result, wherein a processor is hardware that performs operations and reads and writes data by executing instructions, characterized in that the data processing method comprises:

(1) partitioning the program code so that the time each core in the multi-processor-core structure needs to run its corresponding partitioned code segment is as equal as possible;

(2) when a program is run on a serially connected multi-core structure within the multi-processor-core structure, passing the execution result of the preceding processor core in the serially connected multi-core structure to the following processor core as input; any core in the serially connected multi-processor-core structure can issue one or more instructions per unit time, and the multiple serially connected cores together form a larger-scale multi-issue, namely serial multi-issue;

(3) when a program is run on a serially connected multi-core structure within the multi-processor-core structure, the internal pipeline of any core in the serially connected multi-core structure is the first level, the macro pipeline formed with each core of the serially connected multi-processor-core structure acting as one macro pipeline stage is the second level, and by analogy further, higher levels can be obtained.
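Why equal run times per core matter for a macro pipeline can be seen from the textbook timing model of a linear pipeline. The sketch below uses that idealized model as an assumption (every data item passes through every stage once, and there are no stalls other than waiting for the slowest stage); `pipeline_time` and the cycle counts are hypothetical.

```python
def pipeline_time(stage_cycles, n_items):
    """Idealized total cycles for n_items through a linear pipeline.

    The first item needs sum(stage_cycles) to traverse all stages; every
    further item is limited by the slowest stage.
    """
    return sum(stage_cycles) + (n_items - 1) * max(stage_cycles)


balanced = pipeline_time([10, 10, 10, 10], n_items=100)    # 1030 cycles
unbalanced = pipeline_time([4, 6, 10, 20], n_items=100)    # 2020 cycles
```

Both partitions contain the same 40 cycles of work per item, yet in this model the unbalanced one takes roughly twice as long, because throughput is set by the 20-cycle stage.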

2. The data processing method according to claim 1, characterized in that, to produce the code segments that run on the processor cores of the multi-processor-core structure, in addition to compiling (compile) the program code from program source code to object code in the existing ordinary sense, pre-compilation (pre-compile) can also be performed, namely pre-processing of the program source code before said compilation is carried out; and post-compilation (post-compile) can also be performed, namely dividing the program code into one or more code segments according to the work content and load requirements assigned to each core of the serially connected multi-processor-core structure.

3. The data processing method according to claim 2, characterized in that the post-compilation step comprises:

(a) parsing the program code to generate a front-end code stream;

(b) scanning and analyzing the front-end code stream and, based on information such as the execution cycles required by the front-end code stream, whether jumps occur and the jump addresses, compiling statistics of the scan results and indirectly determining the partition information; or, without scanning the front-end code stream, directly determining the partition information from preset information;

4. The data processing method according to claim 3, characterized in that the instructions run on each processor core can, in addition to the corresponding partitioned code segment, also include extra instructions that do not affect the function of the original program.

5. A configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, comprising a plurality of processor cores, a plurality of configurable local memories (configurable local memory) and a configurable interconnect structure (configurable interconnect structure), wherein:

the configurable interconnect structure is used for the connections among the modules within the configurable multi-core/many-core device and for the connections to the outside;

the processor cores, configurable local memories and configurable interconnect structure can be configured according to application requirements to form one or more serially connected multi-core structures; each processor core in a serially connected multi-core structure executes a part of the code segments of the program code; and all processor cores in the serially connected multi-core structure jointly realize the complete function of the program code.

6. The configurable multi-core/many-core device according to claim 5, characterized in that the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy can also comprise one or more expansion modules; an expansion module can be:

a shared memory (shared memory), used to hold data when the configurable data memory overflows and to transfer data shared among multiple processor cores; or

a direct memory access (DMA) controller, used for direct access to the configurable local memories by modules other than the processor cores; or

an exception handling (exception handling) module, used to handle exceptions (exception) occurring in the processor cores and local memories.

7. The configurable multi-core/many-core device according to claim 5, characterized in that, in the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, a processor core comprises an arithmetic unit and a program counter.

8. The configurable multi-core/many-core device according to claim 7, characterized in that, in the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, a processor core can also comprise an expansion module; the expansion module can be a register file.

9. The configurable multi-core/many-core device according to claim 5, characterized in that, in the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, each processor core has a corresponding configurable local memory, comprising an instruction memory (instruction memory) for holding the partitioned code segment and a configurable data memory (configurable data memory) for holding data.

10. The configurable multi-core/many-core device according to claim 9, characterized in that, within the same configurable local memory, the boundary between the instruction memory and the configurable data memory can be changed according to different configuration information.

11. The configurable multi-core/many-core device according to claim 9, characterized in that the same configurable data memory can comprise several data sub-memories, and the boundaries between the data sub-memories can be changed according to different configuration information.

12. The configurable multi-core/many-core device according to claim 5, characterized in that the configurable interconnect structure comprises the connections between a processor core and an adjacent configurable local memory, between a processor core and the shared memory, between a processor core and the direct memory access controller, between a configurable local memory and the shared memory, between a configurable local memory and the direct memory access controller, between a configurable local memory and the outside of the system, and between the shared memory and the outside of the system.

13. The configurable multi-core/many-core device according to claim 12, characterized in that, according to the configuration, two processor cores and their corresponding local memories can be placed in a preceding-stage/following-stage connection relationship, and some or all of the processor cores and their corresponding local memories can be formed into one or more serially connected structures through the configurable interconnect structure; multiple serially connected structures can each be independent, or can be partly or fully interconnected, executing instructions serially, in parallel, or in a serial-parallel hybrid manner.

14. The configurable multi-core/many-core device according to claim 13, characterized in that the serial, parallel or serial-parallel hybrid execution of instructions can be such that, according to application requirements and under the control of a synchronization mechanism, different serially connected structures run different program segments, executing different instructions in parallel with multiple threads running in parallel; or such that, according to application requirements and under the control of a synchronization mechanism, different serially connected structures run the same program segment, performing operations with the same instructions on different data in single-instruction multiple-data (SIMD) fashion.

15. The configurable multi-core/many-core device according to claim 5, characterized in that, in the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, the processor cores in a serially connected structure have a specific data read policy (read policy): the input data source of the first processor core in the serially connected structure can be its own corresponding configurable data memory, the shared memory, or the outside of the configurable multi-core/many-core device; the input data source of any other processor core can be its own corresponding configurable data memory or the configurable data memory corresponding to the previous-stage processor core.

16. The configurable multi-core/many-core device according to claim 5, characterized in that, in the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, the processor cores in a serially connected structure have a specific data write policy (write policy): the input data source of the configurable data memory corresponding to the first processor core in the serially connected structure can be that processor core itself, the shared memory, or the outside of the configurable multi-core/many-core device; the input data source of the configurable data memory corresponding to any other processor core can be that processor core itself, the configurable data memory corresponding to the previous-stage processor core, or the shared memory; the destination of the output data of any processor core can be its own corresponding configurable data memory or the shared memory; and when an extended memory exists, the destination of the output data of any processor core can also be the extended memory.

17. The configurable multi-core/many-core device according to claims 15 and 16, characterized in that the input data from the different sources of a processor core and of its corresponding configurable data memory are multiplexed according to specific rules to determine the final input data.

18. The configurable multi-core/many-core device according to claims 15 and 16, characterized in that the same configurable data memory can be accessed simultaneously by the two processor cores of the stages before and after it, with the different processor cores each accessing a different data sub-memory of the configurable data memory.

19. The configurable multi-core/many-core device according to claim 5, characterized in that, when the processor cores in the multi-core/many-core system comprise register files, a function for transferring register values is also required, namely one or more register values in any preceding-stage processor core of the serially connected structure can be transferred to the corresponding registers of any following-stage processor core.

20. The configurable multi-core/many-core device according to claim 5, characterized in that, in the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, the second level of the pipeline hierarchy, namely a macro pipeline stage, can pass information about itself to the previous macro pipeline stage as back pressure (back pressure) information; from the back pressure information it receives, the previous macro pipeline stage knows whether the macro pipeline after it is stalled (stall) and, taking its own state into account, determines whether it stalls itself, and passes new back pressure information on to the macro pipeline stage before it, thereby realizing control of the macro pipeline.
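The stage-by-stage propagation of back pressure can be sketched as a backward scan over the macro pipeline. The stall rule used here is an assumption chosen for illustration (a stage stalls only when its successor is stalled and it holds a result it must hand over); the claim itself does not fix the rule, and `propagate_back_pressure` is a hypothetical name.

```python
def propagate_back_pressure(has_output_pending, last_stage_blocked):
    """Compute a stall flag for each macro pipeline stage.

    has_output_pending[i]: stage i holds a result waiting to be handed on.
    last_stage_blocked: whether the consumer after the final stage is blocked.
    The scan runs from the last stage back to the first, mirroring how each
    stage forwards new back pressure information to its predecessor.
    """
    n = len(has_output_pending)
    stalled = [False] * n
    downstream_stalled = last_stage_blocked
    for i in range(n - 1, -1, -1):
        # Assumed rule: combine received back pressure with the stage's own state.
        stalled[i] = downstream_stalled and has_output_pending[i]
        downstream_stalled = stalled[i]
    return stalled


flags = propagate_back_pressure([True, False, True, True], last_stage_blocked=True)
# flags == [False, False, True, True]
```

In this example the empty second stage absorbs the stall, so the first stage keeps running even though the end of the pipeline is blocked.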

21. The configurable multi-core/many-core device according to claim 6, characterized in that the exception handling module can be formed by a processor core of the multi-core/many-core device, or can be an additional module.

22. The configurable multi-core/many-core device according to claim 5, characterized in that, in the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, the configuration information used for the configuration includes turning processor cores on or off, configuring the size/boundary and the contents of the instruction memory and data sub-memories in the local memories, and configuring the interconnect structure and connection relationships.

23. The configurable multi-core/many-core device according to claim 5, characterized in that the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy has low-power techniques at three levels:

(a) configuration level: according to the configuration information, processor cores that are not used can enter a low-power state;

(b) instruction level: when a processor core executes an instruction that reads data and the data is not yet ready, the processor core enters a low-power state until the data is ready, whereupon it returns from the low-power state to the normal operating state; the data may not be ready because the previous-stage processor core has not yet written the data needed by the current-stage processor core into the corresponding data sub-memory;

(c) application level: implemented entirely in hardware, idle (idle) task features are matched to determine the utilization (utilization) of the current processor core, and whether to enter the low-power state or return from the low-power state is decided from the current processor utilization and a reference utilization; the reference utilization can be fixed, or can be determined by reconfiguration or self-learning; it can be hardwired inside the chip, written by the system at system start-up, or written by software; the reference content used for matching can be hardwired inside the chip at chip production, written by the system or software at system start-up, or written through self-learning; it can be written once or written repeatedly;

the low-power state can be a reduced processor clock frequency or a cut-off of the power supply.
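The application-level technique in (c) is described as a hardware mechanism; the decision it makes can nevertheless be illustrated in a few lines. The sketch assumes a simple threshold policy and a task trace in which idle tasks are recognizable by a signature string; `utilization`, `next_power_state` and the state names are hypothetical.

```python
def utilization(trace, idle_signature):
    """Fraction of observed task slots that did not match the idle feature."""
    idle = sum(1 for task in trace if task == idle_signature)
    return 1 - idle / len(trace)


def next_power_state(state, util, reference):
    """Assumed policy: compare current utilization with the reference value."""
    if state == "normal" and util < reference:
        return "low_power"
    if state == "low_power" and util >= reference:
        return "normal"
    return state


trace = ["idle", "work", "idle", "idle"]
u = utilization(trace, idle_signature="idle")              # 0.25
state = next_power_state("normal", u, reference=0.5)       # "low_power"
```

The reference value 0.5 is arbitrary here; per the text above it could equally be fixed in the chip, written at start-up, or learned.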

24. The configurable multi-core/many-core device according to claim 5, characterized in that the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy can have a self-test capability, performing a self-test of the chip while powered on and operating, without relying on external equipment; when the multi-core/many-core system has the self-test capability, one or more specific basic elements, arithmetic units or processor cores of the multi-core/many-core system can be used as comparators; stimuli having a specific relationship are applied to multiple corresponding groups of other basic elements, arithmetic units or processor cores, or combinations of basic elements, arithmetic units or processor cores, in the multi-core/many-core system, and the comparators check whether the outputs of those groups of other basic elements, arithmetic units or processor cores, or combinations thereof, conform to the corresponding specific relationship.

25. The configurable multi-core/many-core device according to claim 24, characterized in that the multi-core/many-core system, when it has the self-test capability, can also have a self-repair capability; namely, when the test results are saved in a memory of the multi-core/many-core system, failed basic units, failed rows or failed arrays can be marked, and when the multi-core/many-core system is configured, the failed basic units, failed rows or failed arrays can be bypassed according to the corresponding marks, so that the multi-core/many-core system can still work normally, realizing self-repair.
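The combination of self-test and self-repair can be pictured with the following sketch. Two assumptions are made that the claims do not state: the "specific relationship" is taken to be equality of outputs for identical stimuli, and the expected output is obtained by majority vote. The names `self_test` and `configure_chain` and the toy units are hypothetical.

```python
from collections import Counter


def self_test(units, stimulus):
    """Apply one stimulus to all units and mark those that disagree.

    units: mapping from unit id to a callable modelling that unit.
    Returns the set of unit ids whose output differs from the majority.
    """
    outputs = {uid: fn(stimulus) for uid, fn in units.items()}
    expected, _ = Counter(outputs.values()).most_common(1)[0]
    return {uid for uid, out in outputs.items() if out != expected}


def configure_chain(unit_ids, failed):
    """Build a serial structure that bypasses every marked unit."""
    return [uid for uid in unit_ids if uid not in failed]


units = {
    0: lambda x: x + 1,
    1: lambda x: x + 1,
    2: lambda x: x,       # models a faulty unit
    3: lambda x: x + 1,
}
failed = self_test(units, stimulus=5)          # {2}
chain = configure_chain([0, 1, 2, 3], failed)  # [0, 1, 3]
```

The marked set plays the role of the test result kept in memory; configuration then simply skips the marked units, so the remaining units still form a working serial structure.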

26. The configurable multi-core/many-core device according to claim 5, characterized in that the plurality of processor cores can be homogeneous or heterogeneous.

27. The configurable multi-core/many-core device according to claim 5, characterized by having a load induced store (LIS, load induced store) property; namely, when a processor core reads the data at a certain address for the first time, it reads the data from the local data memory corresponding to the adjacent previous-stage processor core and at the same time writes the data read into the local data memory corresponding to the current-stage processor core, after which all reads and writes of the data at that address access the local data memory of the current stage.
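The load induced store behaviour can be modelled in a few lines. This is a software sketch under assumptions, not the claimed hardware: `Stage`, `load`, `store` and the `remote_reads` counter are hypothetical names, and each local data memory is a dictionary keyed by address.

```python
class Stage:
    """Toy model of one processor core with its local data memory."""

    def __init__(self, prev=None):
        self.local = {}        # local data memory of this stage
        self.prev = prev       # adjacent previous stage, if any
        self.remote_reads = 0  # counts reads served by the previous stage

    def load(self, addr):
        if addr not in self.local:
            # First read of this address: fetch from the previous stage's
            # local memory and write the value into the local memory.
            # (The sketch assumes a previous stage exists and holds the data.)
            self.remote_reads += 1
            self.local[addr] = self.prev.local[addr]
        # Every later access to this address is served locally.
        return self.local[addr]

    def store(self, addr, value):
        self.local[addr] = value


first = Stage()
first.store(0x10, 5)
second = Stage(prev=first)
a = second.load(0x10)   # first read: comes from `first`, copied locally
first.store(0x10, 9)    # previous stage moves on to newer data
b = second.load(0x10)   # later read: served from the local copy
```

Here `a` and `b` are both 5 and only one remote read occurs, which illustrates how the previous stage can proceed to new data without disturbing the value the current stage is working on.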

28. The configurable multi-core/many-core device according to claim 5, characterized by having a data pre-forwarding property; namely, a processor core can read, from the local data memory corresponding to the previous-stage processor core, data that this processor core itself does not need to read or write but that a subsequent processor core needs to read, and write that data into the local data memory corresponding to the current-stage processor core.

29. The configurable multi-core/many-core device according to claim 5, characterized in that the local data memory can comprise one or more valid flags and one or more ownership flags; a valid flag indicates whether the corresponding data is valid; an ownership flag indicates which processor core is currently using the corresponding data.

30. The configurable multi-core/many-core device according to claim 5, characterized in that a processor core can access the local instruction memory of a processor core other than itself; the plurality of processor cores can execute the segments of the same code in parallel, or execute the segments of different code in parallel; the plurality of processor cores can execute the segments of the same code serially, or execute the segments of different code serially; and the plurality of processor cores can also execute the segments of the same code or the segments of different code in a serial-parallel hybrid manner.

31. The configurable multi-core/many-core device according to claim 5, characterized in that one or more processor cores and their corresponding local memories can also be formed into a high-performance multi-core connection structure; the multi-core connection structure is configured and the corresponding code segments are placed into the corresponding local instruction memories, so that the multi-core connection structure realizes a specific function; the multi-core connection structure is functionally equivalent to a functional module in a system on chip (SoC, System on Chip); a plurality of such functional modules are in turn connected through data transmission channels between the functional modules to realize a system on chip; and the data transmission channels correspond to the system bus in a traditional system-on-chip structure.

32. The configurable multi-core/many-core device according to claim 31, characterized in that, through the configurable interconnect structure, multiple connections can be set up in advance or dynamically between the inputs and outputs of processor cores that have a data transfer relationship, forming data transmission channels between processor cores that are equivalent to the system bus structure in a traditional system on chip.

33. The configurable multi-core/many-core device according to claim 31, characterized in that the processor cores and their corresponding local memories can be dynamically reconfigured, and the code segments in the corresponding local instruction memories can be dynamically changed, thereby changing the function of the system on chip.

34. The configurable multi-core/many-core device according to claim 7, characterized in that the processor core can have a fast condition judgment mechanism for determining whether a branch transfer is executed; the fast condition judgment mechanism can be a counter for judging a loop condition, or a hardware finite state machine for judging branch transfer and loop conditions.

35. The configurable multi-core/many-core device according to claim 5, characterized in that the processor core can also correspond to a plurality of local instruction memories; while one or more of the plurality of local instruction memories are used to respond to instruction fetch operations of the corresponding processor core, the other local instruction memories of the plurality can undergo instruction update operations.

36. The configurable multi-core/many-core device according to claim 5, characterized in that the plurality of processor cores in the configurable multi-core/many-core device can operate at the same clock frequency, or can operate at different clock frequencies.

37. The configurable multi-core/many-core device according to claim 5, characterized in that the configurable multi-core/many-core device based on the serial multi-issue and pipeline hierarchical structure can further comprise one or more dedicated processing modules; the dedicated processing module can be called by the processor cores as a macro block, or can act as an independent processing module that receives the output of a processor core and sends the processing result to a processor core; the processor core that outputs to the dedicated processing module and the processor core that receives the output of the dedicated processing module can be the same processor core or different processor cores.
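
For illustration only, the arrangement of claim 35, in which one local instruction memory serves instruction fetches while another is being updated, can be sketched as a small software model. This is a minimal sketch under stated assumptions, not the claimed hardware: the class name `MultiBankInstructionMemory`, its methods, and the instruction strings are all hypothetical and do not appear in the specification.

```python
class MultiBankInstructionMemory:
    """Hypothetical model of one processor core with several local instruction memories."""

    def __init__(self, initial_code, num_banks=2):
        if num_banks < 2:
            raise ValueError("at least two banks are needed to update while fetching")
        # Bank 0 starts out holding the code fragment the core is executing.
        self.banks = [list(initial_code)] + [[] for _ in range(num_banks - 1)]
        self.active = 0  # index of the bank that currently answers fetches

    def fetch(self, pc):
        """Return the instruction at address pc from the active bank."""
        return self.banks[self.active][pc]

    def update(self, bank, code):
        """Load a new code fragment into a bank that is not serving fetches."""
        if bank == self.active:
            raise RuntimeError("cannot update the bank that is serving fetches")
        self.banks[bank] = list(code)

    def switch(self, bank):
        """Make another bank the fetch source, e.g. after its update has finished."""
        self.active = bank


# Usage: the core keeps fetching from bank 0 while bank 1 is loaded.
mem = MultiBankInstructionMemory(["add", "sub"])
first = mem.fetch(0)        # served from bank 0
mem.update(1, ["mul"])      # bank 1 is updated in the meantime
second = mem.fetch(1)       # still served from bank 0
mem.switch(1)
third = mem.fetch(0)        # now served from the updated bank 1
```

In this model the switch is an explicit call; how and when the hardware would switch between memories is not stated in the claim and is left as an assumption here.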