WO2010060283A1 - A data processing method and apparatus - Google Patents

A data processing method and apparatus

Info

Publication number
WO2010060283A1
WO2010060283A1 (PCT/CN2009/001346)
Authority
WO
WIPO (PCT)
Prior art keywords
core
data
processor
configurable
memory
Prior art date
Application number
PCT/CN2009/001346
Other languages
English (en)
French (fr)
Inventor
林正浩
Original Assignee
上海芯豪微电子有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN200810203777A external-priority patent/CN101751280A/zh
Priority claimed from CN200810203778A external-priority patent/CN101751373A/zh
Priority claimed from CN200910208432.0A external-priority patent/CN101799750B/zh
Application filed by 上海芯豪微电子有限公司 filed Critical 上海芯豪微电子有限公司
Priority to EP09828544A priority Critical patent/EP2372530A4/en
Priority to KR1020117014902A priority patent/KR101275698B1/ko
Publication of WO2010060283A1 publication Critical patent/WO2010060283A1/zh
Priority to US13/118,360 priority patent/US20110231616A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098Register arrangements
    • G06F9/3012Organisation of register space, e.g. banked or distributed register file
    • G06F9/30134Register stacks; shift registers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824Operand accessing
    • G06F9/3826Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G06F9/3828Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage with global bypass, e.g. between pipelines, between clusters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Definitions

  • the present invention relates to the field of integrated circuit design. Background art:
  • the feature size of transistors continues to shrink along the 65nm, 45nm, 32nm... roadmap, and the number of transistors integrated on a single chip has exceeded one billion.
  • EDA tools have not made a qualitative breakthrough for more than 20 years, making front-end design, and especially verification, increasingly unable to cope with growing single-chip scale. Design companies have therefore turned to multi-core designs, integrating multiple simple cores on one chip, which reduces design and verification difficulty while improving chip capability.
  • meanwhile, the internal operating frequency of multi-core processors is now much higher than the operating frequency of their external memory, and simultaneous memory access by multiple processor cores has become a major bottleneck restricting system performance; parallel multi-core architectures running serial programs also fail to achieve the expected performance improvements.
  • the present invention addresses the deficiencies of the prior art and proposes a data processing method and apparatus for running serial programs at high speed with improved throughput.
  • the data processing method and apparatus of the present invention include: segmenting, according to a specific rule, the program code running on a serially connected multi-processor core structure, so that the time required for each core in the structure to run its corresponding code segment is as equal as possible, thereby achieving load balancing of the inter-core workload.
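As a rough illustration of the load-balancing idea above, the sketch below (an illustrative assumption, not the patent's actual algorithm) greedily splits a list of per-instruction cycle estimates into contiguous segments whose total run times are as equal as possible, one segment per serially connected core:

```python
def split_balanced(cycles, n_cores):
    """Greedily split per-instruction cycle counts into n_cores contiguous
    segments of roughly equal total run time; returns the end index of
    each segment (half-open boundaries into the instruction stream)."""
    target = sum(cycles) / n_cores          # ideal load per core
    boundaries, acc = [], 0
    for i, c in enumerate(cycles):
        acc += c
        remaining_cores = n_cores - len(boundaries) - 1
        # close this segment once the target load is reached, keeping at
        # least one instruction for each remaining core
        if acc >= target and len(cycles) - (i + 1) >= remaining_cores:
            boundaries.append(i + 1)
            acc = 0
            if len(boundaries) == n_cores - 1:
                break
    boundaries.append(len(cycles))
    return boundaries

# e.g. six 5-cycle instructions across three cores -> [2, 4, 6]
```

Core k would then run the instructions between boundaries[k-1] and boundaries[k].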
  • the serial-connected multi-processor core structure includes a plurality of processor cores.
  • the core refers to hardware that performs operations and reads and writes data by executing instructions, including but not limited to a central processing unit (CPU) and a digital signal processor (DSP).
  • the serially connected multi-processor core structure of the present invention constitutes serial multi-issue: any core in the structure can issue one or more instructions per unit time, and multiple serially connected cores issuing simultaneously form a larger-scale multi-issue, i.e. serial multi-issue.
  • the serially connected multi-processor core structure of the present invention constitutes a pipeline hierarchy: the internal pipeline of any core in the structure is the first level; the macro pipeline formed by the cores of the structure, each acting as a macro-pipeline stage, is the second level; and so on, higher levels can be obtained, e.g. serially connected multi-processor core structures acting as stages of a still higher-level pipeline constitute a third level.
  • the code segments on the cores of the serially connected multi-processor core structure of the present invention are one or more code segments generated from program code by some or all of three steps: pre-compilation, compilation, and post-compilation; the program code includes, but is not limited to, high-level language code and assembly language code.
  • the compilation is compilation in the usual sense, from program source code to object code.
  • the pre-compilation is pre-processing of the program source code before compilation, including but not limited to expanding "calls" in the program before compilation, replacing each call statement with the actually called code.
  • the post-compilation divides the work content and load of each core in the serially connected multi-processor core structure as required, splitting the compiled object code into one or more code segments; its steps include, but are not limited to:
  • the specific model includes, but is not limited to, a behavior model of the core in the serially connected multiprocessor core structure;
  • the pre-compilation method described in the present invention may be performed before the program source code is compiled, may be implemented as a component of the compiler during source-code compilation, or may be implemented in real time while the serially connected multi-processor core structure is running, as a component of its operating system, as a driver, or as an application.
  • the post-compilation method described in the present invention may be performed after compilation of the program source code is completed, may be implemented as a component of the compiler during source-code compilation, or may be implemented in real time while the serially connected multi-processor core structure is running, as part of its operating system, as a driver, or as an application.
  • when the post-compilation method is implemented in real time, the corresponding configuration information in the code segments may be determined manually, may be generated dynamically and automatically according to the usage of the serially connected multi-processor core structure, or may be generated only as fixed configuration information.
  • in this way, existing application programs can be divided and segmented, which not only improves the running speed of existing programs on multi-core/many-core devices and fully exploits the efficiency of such devices, but also guarantees that multi-core/many-core devices remain compatible with existing applications; this effectively resolves the dilemma that existing applications cannot take full advantage of multi-core/many-core processors.
  • the basis for indirectly determining the segmentation information includes, but is not limited to, the execution cycle counts or execution times of instructions, and the number of instructions; that is, according to the instruction execution cycle counts or times obtained by scanning the front-end code stream, the entire executable program code is divided into code segments with the same or similar running times, or, according to the instruction counts obtained by scanning the front-end code stream, the entire executable program code is divided into code segments with the same or similar numbers of instructions;
  • the basis for directly determining the segmentation information includes, but is not limited to, the number of instructions; that is, the entire executable program code can be directly divided into code segments with the same or similar numbers of instructions according to the instruction count.
  • according to a specific rule, the executable program code is divided so as to avoid, as far as possible, splitting loop code.
  • alternatively, the loop code is split one or more times, according to a specific rule, into a plurality of smaller-scale loops.
  • the plurality of smaller-scale loops may be components of the same code segment or of different code segments.
  • the smaller-scale loop code includes, but is not limited to, loop code containing fewer instructions and loop code with fewer execution cycles.
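The loop-splitting rule above can be pictured with a simple strip-mining sketch (illustrative only; the patent does not prescribe this particular scheme): a loop of T iterations is divided into several smaller loops over disjoint iteration ranges, which may then be placed in the same or different code segments:

```python
def strip_mine(total_iters, parts):
    """Split one loop of `total_iters` iterations into `parts` smaller
    loops, returned as (start, stop) half-open iteration ranges."""
    base, extra = divmod(total_iters, parts)
    ranges, start = [], 0
    for p in range(parts):
        # spread any remainder over the first `extra` smaller loops
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# e.g. a 10-iteration loop split across 3 code segments
```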
  • the code segments include, but are not limited to: segmented executable object code, and/or corresponding configuration information, applicable to operation on a fixed number of processor cores in the serially connected multi-processor core structure; or unsegmented executable object code applicable to the structure, together with corresponding configuration information containing multiple sets of segmentation information applicable to a non-fixed number of cores. The segmentation information includes, but is not limited to, a number representing the instruction count of each segment, a specific flag marking segment boundaries, and an indication table of the start information of each code segment.
  • for example, for a maximum of 1000 processor cores, a table of 1000 entries can be generated, each entry storing the location information of a corresponding instruction in the unsegmented executable object code; the instructions between two entries correspond to a code segment that can run on a single core.
  • each processor core runs the code between the unsegmented object-code locations pointed to by its corresponding two entries in the table, i.e. each processor core runs its corresponding segment of the table.
  • when the actual number of cores is N ≤ 1000, each processor core runs the corresponding 1000/N segments in the table, and the specific code can be determined from the corresponding location information in the table.
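The table mechanism can be sketched as follows; for simplicity this model stores MAX_CORES + 1 boundary locations rather than the patent's exact layout, and the 10-instruction segment size is an invented example:

```python
MAX_CORES = 1000

def segments_for_core(table, n_cores, core_id):
    """Return the (start, stop) object-code locations run by one core.
    `table` holds MAX_CORES + 1 boundary locations into the unsegmented
    object code, so each of the n_cores cores is assigned
    MAX_CORES // n_cores consecutive table segments."""
    per_core = MAX_CORES // n_cores
    lo = table[core_id * per_core]
    hi = table[(core_id + 1) * per_core]
    return lo, hi

# illustrative table: 1000 segments of 10 instructions each
table = list(range(0, 10_001, 10))
```

With 4 actual cores each core covers 250 table segments; with the full 1000 cores each covers exactly one.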
  • the instructions running on each processor core may include additional instructions besides the corresponding segmented code segment.
  • the additional instructions include, but are not limited to, code-segment head extensions and code-segment tail extensions, used to achieve a smooth transition of instruction execution between different processor cores. For example, a tail extension can be added at the end of each code segment to store all register-file values to a specific location in data memory, and a head extension can be added at the beginning of each code segment to read the values at that specific data-memory location back into the register file, enabling register-value transfer between different processor cores and ensuring correct operation of the program; when execution reaches the end of a code segment, the next instruction is the first instruction of the next code segment.
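A minimal sketch of the head/tail-extension idea follows; the "LD"/"ST" mnemonics, the register count, and the SPILL_BASE address are illustrative assumptions, not taken from the patent:

```python
SPILL_BASE = 0x8000   # assumed fixed spill region in data memory
NUM_REGS = 4          # assumed register-file size

def add_extensions(segment):
    """Wrap a code segment (list of assembly strings) with a head
    extension that reloads the register file from the spill region and a
    tail extension that stores the register file back to it."""
    head = [f"LD r{i}, [{SPILL_BASE + 4*i:#x}]" for i in range(NUM_REGS)]
    tail = [f"ST r{i}, [{SPILL_BASE + 4*i:#x}]" for i in range(NUM_REGS)]
    return head + segment + tail
```

The tail extension of one core's segment and the head extension of the next core's segment together pass the register state down the serial chain.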
  • the data processing method and apparatus of the present invention can be used to construct a configurable multi-core/many-core device based on serial multi-issue and the pipeline hierarchy, comprising a plurality of processor cores, a plurality of configurable local memories, and a configurable interconnect structure, where: the processor cores execute instructions, perform operations, and obtain the corresponding results; the configurable local memories store instructions and data, and provide data transfer and data storage between the processor cores; and the configurable interconnect structure provides connections between modules within the configurable multi-core/many-core device and to the outside.
  • the configurable multi-core/many-core device may also include an expansion module to accommodate a wider range of needs;
  • the expansion module includes, but is not limited to, a single or a plurality of the following modules:
  • a shared memory, for storing data when the configurable data memories overflow and for transferring shared data between the plurality of processor cores;
  • a direct memory access (DMA) controller;
  • an exception handling module, for handling exceptions that occur in the processor cores and local memories.
  • the processor core of the configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention may also include an expansion module to accommodate a wider range of needs, including but not limited to a register file.
  • the instructions executed by the processor core include, but are not limited to, an arithmetic operation instruction, a logic operation instruction, a condition determination and a jump instruction, an abnormal trap and a return instruction;
  • the arithmetic and logic operation instructions include, but are not limited to, multiplication, addition/subtraction, multiply-add/multiply-subtract, accumulate, shift, extract, and swap operations, covering fixed-point and floating-point operations of any bit width less than or equal to the processor core's data bit width; each processor core completes one or more of the described instructions.
  • the number of processor cores can be extended according to actual application requirements.
  • each of the processor cores has a corresponding configurable local memory, including a configurable instruction memory for storing the instructions of the segmented code segments and a configurable data memory for storing data; within the same configurable local memory, the boundary between the configurable instruction memory and the configurable data memory can be changed according to different configuration information.
  • the configurable data memory includes a plurality of data sub-memories; the size and boundaries of the configurable data memory are determined according to the configuration information, and the boundaries between the plurality of data sub-memories can be changed according to different configuration information.
  • the data sub-memories can be mapped, by address translation, into the entire address space of the multi-core/many-core device.
  • the mapping includes, but is not limited to, address translation by lookup table and address translation by content-addressable memory (CAM) matching.
  • each entry in a data sub-memory contains data and flag information, the flag information including but not limited to a valid bit and a data address.
  • the valid bit is used to indicate whether the data stored in the corresponding entry is valid.
  • the data address is used to indicate where the data stored in the corresponding entry belongs in the entire address space of the multi-core/many-core device.
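The entry format and CAM-style mapping can be modelled as follows (an illustrative behavioural sketch; the field layout and the linear match loop are assumptions):

```python
class DataSubMemory:
    """Behavioural model: each entry holds a valid bit, a full-address
    tag (the data address), and the data itself."""

    def __init__(self, n_entries):
        self.valid = [False] * n_entries
        self.addr = [0] * n_entries
        self.data = [0] * n_entries

    def write(self, entry, address, value):
        self.valid[entry] = True
        self.addr[entry] = address
        self.data[entry] = value

    def read(self, address):
        """CAM match: return the data whose valid tag equals `address`,
        or None on a miss."""
        for i in range(len(self.data)):
            if self.valid[i] and self.addr[i] == address:
                return self.data[i]
        return None
```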
  • the configurable interconnect structure provides connections between modules within the configurable multi-core/many-core device and to the outside, including but not limited to: connections of processor cores to their adjacent configurable local memories, connections of processor cores to the shared memory, connections of processor cores to the direct memory access controller, connections of the configurable local memories to the shared memory, connections of the local memories to the direct memory access controller, connections of the local memories to the outside of the device, and connections of the shared memory to the outside of the device; all of these connections are configurable.
  • two processor cores and their respective local memories can be configured into a preceding-stage/succeeding-stage connection, including but not limited to the preceding-stage processor core transferring data to the succeeding-stage processor core through its corresponding configurable data memory.
  • some or all of the processor cores and their corresponding local memories can be configured, through the configurable interconnect structure, into one or more serial connection structures.
  • the plurality of serial connection structures may be independent of each other, or partially or completely interconnected, and may execute instructions serially, in parallel, or in a serial/parallel mix.
  • serial, parallel, or mixed execution includes, but is not limited to: different serial connection structures running different program segments under the control of a synchronization mechanism according to application requirements, executing different instructions in parallel and running multiple threads in parallel; or different serial connection structures running the same program segment under the control of the synchronization mechanism, performing the same instructions on different data in single-instruction multiple-data (SIMD) fashion for data-intensive operations.
  • the processor cores in the serial connection structure follow specific data read rules (read policy) and data write rules (write policy).
  • the data read rule: the input data sources of the first processor core in the serial connection structure include, but are not limited to, its own corresponding configurable data memory, the shared memory, and the outside of the configurable multi-core/many-core device; the input data sources of any other processor core include, but are not limited to, its own corresponding configurable data memory and the corresponding configurable data memory of the preceding-stage processor core.
  • the destination of the output data of any processor core includes, but is not limited to, its corresponding configurable data memory and the shared memory; the output of any processor core may also be extended to other memory.
  • the data write rule: the input data sources of the configurable data memory corresponding to the first processor core in the serial connection structure include, but are not limited to, the processor core itself, the shared memory, and the outside of the configurable multi-core/many-core device; the input data sources of the configurable data memory corresponding to any other processor core include, but are not limited to, the processor core itself, the corresponding configurable data memory of the preceding-stage processor core, and the shared memory.
  • input data from the different sources of a processor core and of its corresponding configurable data memory are multiplexed according to specific rules to determine the final input data.
  • the same configurable data memory can be accessed simultaneously by the two processor cores of its preceding and succeeding stages, each accessing a different data sub-memory within it.
  • the processor cores may each access different data sub-memories in the same configurable data memory according to a specific rule, including but not limited to: different data sub-memories in the same configurable data memory form a ping-pong buffer accessed by the two processor cores respectively; after the two stages of processor cores complete their accesses to the ping-pong buffer, a ping-pong exchange is performed, so that the data sub-memory originally read/written by the preceding-stage processor core becomes the data sub-memory read by the succeeding-stage processor core, while all valid bits in the data sub-memory originally read by the succeeding-stage processor core are invalidated and that sub-memory is then read/written by the preceding-stage processor core.
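The ping-pong exchange can be modelled as below (a behavioural sketch; dictionaries stand in for the data sub-memories, with key presence playing the role of the valid bit):

```python
class PingPong:
    """Two data sub-memories alternating between a producing
    (preceding-stage) core and a consuming (current-stage) core."""

    def __init__(self):
        self.bufs = [{}, {}]   # address -> data; presence = valid bit set
        self.write_side = 0    # side currently owned by the producer

    def producer_write(self, address, value):
        self.bufs[self.write_side][address] = value

    def consumer_read(self, address):
        return self.bufs[1 - self.write_side].get(address)

    def swap(self):
        """Exchange sides; the buffer handed back to the producer has all
        its valid bits cleared so stale data cannot be read."""
        self.write_side = 1 - self.write_side
        self.bufs[self.write_side].clear()
```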
  • when the processor cores in the multi-core/many-core system include register files, a specific register-value transfer rule is also needed: one or more register values in any preceding-stage processor core in the serial connection structure can be transferred to the corresponding registers of any succeeding-stage processor core.
  • the register values include, but are not limited to, values of registers in the register file in the processor core.
  • the transfer paths of the register values include, but are not limited to: transfer through the configurable interconnect structure, direct transfer through the shared memory, direct transfer through the corresponding configurable data memory of the processor core, transfer through the shared memory according to a specific instruction, and transfer through the corresponding configurable data memory of the processor core according to a specific instruction.
  • the second level in the pipeline hierarchy, i.e. the macro pipeline, can be controlled by back pressure: each macro-pipeline stage transmits its back-pressure information to the preceding macro-pipeline stage. From the received back-pressure information a stage can tell whether the subsequent macro pipeline is blocked (stalled); it then determines, according to its own situation, whether it is itself blocked, and transmits new back-pressure information to the preceding macro-pipeline stage, thereby achieving macro-pipeline control.
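The back-pressure mechanism can be sketched as a single backward sweep over the macro-pipeline stages (an illustrative model; real hardware would evaluate this every cycle with registered signals):

```python
def propagate_backpressure(locally_blocked):
    """`locally_blocked[i]` is True if macro-pipeline stage i cannot
    advance on its own. Returns each stage's stall state after back
    pressure from the last stage has rippled to the first."""
    n = len(locally_blocked)
    stalled = [False] * n
    downstream = False                # back pressure from the next stage
    for i in range(n - 1, -1, -1):    # walk from last stage to first
        stalled[i] = locally_blocked[i] or downstream
        downstream = stalled[i]       # becomes back pressure upstream
    return stalled
```

A stall in stage 2, for example, also stalls stages 1 and 0 behind it while stage 3 keeps running.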
  • in the configurable multi-core/many-core device based on serial multi-issue and the pipeline hierarchy of the present invention, there may be an extended shared memory for storing data when the corresponding configurable data memory of a processor core overflows and for transferring shared data between multiple processor cores; there may also be an extended exception handling module for handling exceptions that occur in the processor cores and local memories.
  • each entry in the data sub-memory contains flag information including, but not limited to, a valid bit, a data address, and a data tag.
  • the valid bit is used to indicate whether the data stored in the corresponding item is valid.
  • the data address and the data tag are used together to indicate where the data stored in the corresponding entry belongs in the entire address space of the multi-core/many-core device.
  • the exception handling module may be composed of a processor core in the multi-core/many-core device, or may be an additional module.
  • the exception information includes, but is not limited to, the number of the processor in which the exception occurred and the exception type.
  • the corresponding processing of the processor core and/or local memory in which the exception occurred includes, but is not limited to, transmitting information on whether the pipeline is blocked to the corresponding processor cores in the serial connection structure through transfer of the back-pressure signal.
  • the processor core, the configurable local memory, and the configurable interconnect structure can be configured according to an application program.
  • the configuration includes, but is not limited to, turning processor cores on or off, configuring the sizes/boundaries of the instruction memories and data sub-memories in the local memories and their contents, and configuring the interconnect structure and its connection relationships.
  • the sources of the configuration information include, but are not limited to, the inside and the outside of the configurable multi-core/many-core device.
  • the configuration can be adjusted at any time during the run according to the requirements of the application.
  • the configuration methods include, but are not limited to: direct configuration by a processor core or a central processor core, configuration by a processor core or a central processor core through the direct memory access controller, and configuration by external request through the direct memory access controller.
  • the configurable multi-core/many-core device based on the serial multi-emission and pipeline hierarchy described in the present invention has three levels of low-power technologies: configuration hierarchy, instruction hierarchy, and application hierarchy.
  • at the configuration level, processor cores that are configured as unused may enter a low-power state; the low-power state includes, but is not limited to, reducing the processor clock frequency or cutting off the power supply.
  • at the instruction level, when a processor core executes an instruction that reads data and the data is not yet ready, the processor core enters a low-power state until the data is ready, whereupon it returns from the low-power state to the normal working state.
  • the data being not ready includes, but is not limited to, the preceding-stage processor core not yet having written the data required by the current-stage processor core into the corresponding data sub-memory.
  • the low power state includes, but is not limited to, reducing the processor clock frequency or turning off the power supply.
  • at the application level, a full hardware implementation matches idle-task characteristics and determines the current processor core usage rate (utilization); whether to enter a low-power state, or to return from one, is decided according to the current processor usage rate and a reference usage rate.
  • the reference usage rate may be fixed, reconfigurable, or self-learned; it may be solidified inside the chip, written at device start-up, or written by software.
  • the reference content used for matching may be solidified into the chip during chip production, written at device start-up or by software, or self-learned; its storage medium includes, but is not limited to, volatile memory and non-volatile memory, and its writing modes include, but are not limited to, one-time writing and multiple writing.
  • the low power state includes, but is not limited to, reducing the processor clock frequency or turning off the power supply.
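One way to picture the application-level rule is a comparison of the measured utilization against a reference; the two-threshold hysteresis and the threshold values below are illustrative assumptions, not taken from the patent:

```python
def next_power_state(utilization, low_power_now,
                     enter_ref=0.20, exit_ref=0.50):
    """Return True if the core should be in the low-power state.
    Assumed hysteresis: enter only when clearly idle, leave only once
    utilization has recovered well above the entry reference."""
    if low_power_now:
        return utilization < exit_ref
    return utilization < enter_ref
```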
  • the configurable multi-core/many-core device based on serial multi-issue and the pipeline hierarchy according to the present invention can have self-test capability, performing chip self-test on power-up without relying on external devices.
  • a basic element of the multi-core/many-core device, such as an arithmetic unit or a processor core, may be used as a comparator; stimuli with a specific relationship are applied to the corresponding other basic elements of the same kind in the device (arithmetic units, processor cores, or combinations thereof), and the comparator compares whether the outputs of those elements conform to the corresponding specific relationship.
  • the excitation may come from a particular module in the multi-core/many-core device or from outside the multi-core/many-core device.
  • the specific relationships include, but are not limited to, equal, opposite, reciprocal, and complementary.
  • the test results may be sent outside the multi-core/many-core device or stored in a memory within the multi-core/many-core device.
  • the self-test may be performed during wafer test, during packaged integrated-circuit test, or at device start-up when the chip is in use; self-test conditions and cycles may also be set manually, with periodic self-tests performed during operation.
  • the memory used for the self-test includes, but is not limited to, volatile memory and non-volatile memory.
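The comparator-based self-test can be sketched as follows, using the "equal" relationship; the callable unit models are stand-ins for real arithmetic units or processor cores:

```python
def self_test(units, stimuli, relation=lambda a, b: a == b):
    """`units` are callables modelling identical basic elements; unit 0
    acts as the comparator's reference. Returns the indices of units
    whose output ever breaks the expected relation."""
    failed = set()
    for s in stimuli:
        expected = units[0](s)
        for i, u in enumerate(units[1:], start=1):
            if not relation(expected, u(s)):
                failed.add(i)
    return sorted(failed)

# illustrative stand-in units: the third one is faulty
units = [lambda x: x + 1, lambda x: x + 1, lambda x: x + 2]
```

The returned indices are the marks that a self-repairing configuration step could later use to bypass failed cores.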
  • when the multi-core/many-core device has self-test capability, it can also have self-repair capability.
  • when the test results are saved in a memory within the multi-core/many-core device, failed processor cores can be marked; when the multi-core/many-core device is configured, the failed processor cores can be bypassed according to the corresponding marks, so that the device can still work normally, achieving self-repair.
  • the self-repair may be performed after wafer test, after packaged integrated-circuit test, or at device start-up when the chip is in use; self-test/self-repair conditions and cycles may also be set manually, with periodic self-tests performed during operation.
  • the plurality of processor cores in the configurable multi-core/many-core device of the present invention may be homogeneous or heterogeneous.
  • the length of the instruction word in the local instruction memory in the configurable multi-core/many-core device of the present invention may not be fixed.
  • the local instruction memory and the local data memory in the configurable multi-core/many-core device of the present invention may each have a single read port or a plurality of read ports.
  • each processor core may further correspond to a plurality of local instruction memories, which may be the same or different in size, and the same or different in structure; while one or more of the plurality of local instruction memories supplies instructions to the corresponding processor core, the other local instruction memories may perform instruction update operations; ways to update instructions include, but are not limited to, updating instructions through a direct memory access (DMA) controller.
  • the plurality of processor cores in the configurable multi-core/many-core device of the present invention can operate at the same clock frequency or at different clock frequencies.
  • the configurable multi-core/many-core device of the present invention may have a load-induced store (LIS) characteristic: the first time a processor core reads data at a certain address, it reads the data from the local data memory corresponding to the previous-stage processor core and writes the read data to the local data memory corresponding to the current-stage processor core; subsequent reads and writes of that address access the local data memory of the current stage, so that the transfer of the same address data between adjacent local data memories is implemented without additional overhead.
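The load-induced store behavior can be sketched in a few lines. The class and method names below are illustrative assumptions, not the patented hardware:

```python
# Sketch of load-induced store (LIS): a first read of an address fetches
# the data from the previous stage's local data memory and also writes it
# into the current stage's local memory; later accesses stay local.
class LisMemory:
    def __init__(self, prev=None):
        self.data = {}    # local data memory (address -> value)
        self.prev = prev  # previous stage's local data memory

    def load(self, addr):
        if addr not in self.data and self.prev is not None:
            # first read: pull from the previous stage and keep a local copy
            self.data[addr] = self.prev.load(addr)
        return self.data[addr]

    def store(self, addr, value):
        self.data[addr] = value

stage0 = LisMemory()
stage0.store(0x100, 42)
stage1 = LisMemory(prev=stage0)
v = stage1.load(0x100)      # first read migrates the data forward
assert v == 42 and 0x100 in stage1.data
stage1.store(0x100, 43)     # later accesses hit only the local copy
assert stage0.load(0x100) == 42
```

The point of the mechanism is that the forward copy happens as a side effect of a read the core had to perform anyway, so no extra transfer instructions are needed.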
  • the configurable multi-core/many-core device of the present invention may have a data pre-transfer characteristic: when the current-stage processor core does not need to read or write certain data, but a subsequent processor core does, the current-stage core may read the data from the local data memory corresponding to the previous-stage processor core and write it to the local data memory corresponding to the current stage, thereby implementing the stage-by-stage transfer of the same address data through the front and rear local data memories.
  • the local data memory of the present invention may also include one or more valid flags and one or more attribution flags.
  • the valid flag is used to indicate whether the corresponding data is valid.
  • the attribution flag is used to indicate which processor core the corresponding data is currently used by. Using the valid flag and the attribution flag can avoid ping-pong buffering and improve memory usage efficiency, and multiple processor cores can access the same data memory at the same time, facilitating data exchange.
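A minimal sketch of how a valid flag plus an attribution (ownership) flag lets several cores share one memory entry without copying buffers back and forth. The class layout and method names are assumptions made for illustration:

```python
# Sketch of a data memory entry with a valid flag and an attribution
# (ownership) flag, so cores can share one memory without ping-pong
# buffering: ownership is handed over instead of data being copied.
class SharedEntry:
    def __init__(self):
        self.valid = False   # is the stored data valid?
        self.owner = None    # which processor core currently uses it
        self.value = None

    def write(self, core_id, value):
        self.value, self.valid, self.owner = value, True, core_id

    def read(self, core_id):
        # only the attributed core may consume the entry
        if not self.valid or self.owner != core_id:
            return None
        return self.value

    def hand_over(self, new_owner):
        # pass the entry to the next core instead of copying a buffer
        self.owner = new_owner

e = SharedEntry()
e.write(core_id=0, value=7)
assert e.read(core_id=1) is None   # core 1 does not own the data yet
e.hand_over(new_owner=1)
assert e.read(core_id=1) == 7      # ownership transfer, no data copy
```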
  • transferring register values through the configurable interconnect structure of the present invention includes, but is not limited to: directly transferring the values of the registers in a processor core to the registers of the subsequent processor core at one time over a large number of hard wires, or sequentially shifting the values of the registers to the registers of the subsequent processor core through shift registers.
  • the transfer of register values may also be limited, according to a register read/write record table, to only those registers that need to be transferred.
  • the register read/write record table of the present invention is used to record the reads and writes between the registers and the corresponding local data memory. If the value of a register has been written to the local data memory corresponding to the current-stage processor core and has not changed since, then the subsequent processor core only needs to read the data from the corresponding address in that local data memory, thereby completing the transfer of the register without separately transferring the register value to the subsequent processor core.
  • when the value of a register is written to the local data memory, the corresponding entry in the register read/write record table is cleared to "0".
  • when data is written to a register in the register file, the corresponding entry in the register read/write record table is set to "1".
  • when register value transfer is performed, only the values of the registers whose entries in the register read/write record table are "1" are transferred.
  • writing data to registers in the register file includes, but is not limited to, reading data from the corresponding local data memory into a register of the register file, and writing the result of an instruction execution back to a register of the register file.
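The record-table bookkeeping above can be sketched as follows. The class and method names are illustrative, not taken from the patent:

```python
# Sketch of the register read/write record table: entry 1 means the
# register changed since it was last saved to the local data memory and
# must be transferred; entry 0 means the memory already holds its value.
class RegFileWithRecord:
    def __init__(self, nregs):
        self.regs = [0] * nregs
        self.record = [0] * nregs   # the read/write record table

    def write_register(self, r, value):
        # any write to a register (load from local data memory, or an
        # instruction writing back its result) sets the entry to 1
        self.regs[r] = value
        self.record[r] = 1

    def store_to_memory(self, r):
        # the register value is saved to the local data memory, so the
        # subsequent core can fetch it from memory: clear the entry to 0
        self.record[r] = 0
        return self.regs[r]

    def registers_to_transfer(self):
        # only registers whose entry is 1 must be transferred explicitly
        return [r for r, bit in enumerate(self.record) if bit == 1]

rf = RegFileWithRecord(4)
rf.write_register(1, 20)
rf.write_register(2, 30)
rf.store_to_memory(2)              # r2's value now lives in memory
print(rf.registers_to_transfer())  # only r1 still needs transfer -> [1]
```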
  • according to the determined code segments obtained after segmentation, the code segment header extensions and code segment tail extensions may also be optimized to reduce the number of registers that need to be passed. The code segment tail extension contains instructions storing all register values to specific local data memory addresses, and the code segment header extension contains instructions reading the values at the corresponding addresses into the registers; the two cooperate to realize the smooth transfer of register values.
  • when the instructions in the code segment header extension corresponding to the current-stage processor core read the data from the local data memory into the registers, the code segment tail extension corresponding to the previous-stage processor core may be omitted.
  • when the code segments corresponding to a plurality of processor cores each transfer to the same address to execute a piece of code and, after executing that code, transfer back to the corresponding code segment, the code at that address may be stored repeatedly in the local instruction memories corresponding to the plurality of processor cores; the code at the same address includes, but is not limited to, function calls and loops.
  • a processor core can access the local instruction memory of a processor core other than itself; when a plurality of processor cores execute exactly the same code, and the code length exceeds the local instruction memory size of a single processor core, the code may be stored in sequence across the local instruction memories corresponding to the plurality of processor cores; at runtime, each of the plurality of processor cores first fetches and executes instructions from the local instruction memory storing the first piece of the code, then, after finishing the first piece, fetches and executes instructions from the local instruction memory storing the second piece, and so on, until all of the identical code has been executed.
  • the plurality of processor cores may synchronously execute each piece of code in the identical code, or may asynchronously execute each piece of code in the identical code.
  • the plurality of processor cores may execute the pieces of the identical code in parallel, serially, or in a mixed serial-parallel manner.
  • each processor core may further correspond to a plurality of local instruction memories, which may be the same or different in size, and the same or different in structure; while one or more of the plurality of local instruction memories supplies instructions to the corresponding processor core, the other local instruction memories may perform instruction update operations; the instructions may be updated through a direct memory access (DMA) controller.
  • in a system on chip (SoC), apart from the processor, the other functional modules are usually ASIC modules implemented in hardwired logic; the performance requirements on these functional modules are very high and difficult to meet with conventional processors, so these ASIC modules cannot be replaced by conventional processors.
  • a single processor core or multiple processor cores and their corresponding local memories can be configured to form a high-performance multi-core connection structure, and the corresponding code segments are placed in the corresponding local instruction memories, enabling the multi-core connection structure to implement a specific function and replace an ASIC module in the system on chip.
  • the multi-core connection structure is equivalent to a functional module in a system on a chip, such as an image decompression module or an encryption and decryption module. These functional modules are then connected by the system bus to implement a system on a chip.
  • the data transmission channel between a processor core and its corresponding local memory and an adjacent processor core and its corresponding local memory is called a local connection; a multi-core connection structure, composed of a plurality of processor cores and their corresponding local memories connected together by local connections, corresponds to a functional module of the system on chip.
  • the data transmission channel between two multi-core connection structures, each corresponding to a functional module in the system on chip, is called a system bus; by connecting a plurality of such multi-core connection structures with the system bus, a system on chip in the usual sense can be realized.
  • a system on chip implemented based on the technical solution of the present invention has configurability that is not provided by a conventional system on chip.
  • Different on-chip systems can be obtained by differently configuring the data processing apparatus according to the present invention.
  • the configuration can be performed in real time during operation so that the on-chip system functions can be changed in real time during operation.
  • the functionality of the on-chip system can be changed by dynamically reconfiguring the processor core and its corresponding local memory and dynamically changing the code segments in the corresponding local instruction memory.
  • within the multi-core connection structure corresponding to a functional module in the system on chip, the data transmission channels between a processor core and its corresponding local memory and the other processor cores and their corresponding local memories, used for data transmission within the functional module, are local connections.
  • the transmission of data through the local connections within a functional module typically occupies the operation of the processor core that issues the transfer request.
  • the system bus of the present invention may be the local connection, or a data transmission channel capable of completing data transmission between different processor cores and their corresponding local memories without occupying the operation of the processor cores; the different processor cores and their corresponding local memories may be adjacent or non-adjacent.
  • one method of constructing a system bus is to establish a data transmission channel by using a plurality of fixed connection devices.
  • the input and output of any multi-core connection structure are connected to a nearby connection device by one or more hard wires, and all of the connection devices are likewise connected to each other by one or more hard wires.
  • the connection device, the connection between the multi-core connection structure and the connection device, and the connection between the connection devices together constitute the system bus.
  • another method of constructing the system bus is to establish a data transmission channel such that any processor core and its corresponding local data memory can perform data transfer with any other processor core and its corresponding local data memory.
  • the means of data transfer include, but are not limited to, delivery via shared memory, transfer via direct memory access controller, delivery over a dedicated bus or network.
  • one method is to arrange one or more hard wires between pairs of processor cores and their corresponding local data memories; the hard wires may be configurable. When the two processor cores and their corresponding local data memories are in different multi-core connection structures, that is, in different functional modules, the hard wires between them can serve as the system bus between the two multi-core connection structures.
  • a second method is to enable all or part of the processor cores and their corresponding local data memories to access other processor cores and their corresponding local data memories through a direct memory access controller. When two such processor cores and their corresponding local data memories are in different multi-core connection structures, that is, in different functional modules, data transfer between them can be performed as needed at run time, thereby implementing a system bus between the two multi-core connection structures.
  • a third method is to implement a network-on-chip function over all or part of the processor cores and their corresponding local data memories: when a processor core and its corresponding local data memory transfer data, the configurable interconnection network determines the destination of the data, thereby forming a data path for the transmission. When two such processor cores and their corresponding local data memories are in different multi-core connection structures, that is, in different functional modules, data transfer between them can be performed as needed at run time, thereby implementing a system bus between the two multi-core connection structures.
  • the processor core may have a fast condition determination mechanism for determining whether a branch transfer is performed; the fast condition determination mechanism may be a counter for determining a loop condition, or a hardware finite state machine for determining branch transfers and loop conditions.
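A hardware loop counter as a fast condition mechanism can be sketched as follows — the branch decision comes from a dedicated counter rather than from compare instructions executed by the core. The class name and interface are assumptions for illustration:

```python
# Sketch of a hardware-style loop counter used as a fast condition
# determination mechanism: the counter itself decides whether the
# backward branch is taken, without a compare instruction in the core.
class LoopCounter:
    def __init__(self, trip_count):
        self.remaining = trip_count

    def take_branch(self):
        """Return True while the loop should branch back to its start."""
        if self.remaining > 1:
            self.remaining -= 1
            return True
        self.remaining = 0
        return False

body_runs = 0
counter = LoopCounter(trip_count=5)
while True:
    body_runs += 1              # the loop body executes once
    if not counter.take_branch():
        break                   # counter says: fall through, loop done
print(body_runs)  # 5
```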
  • the configurable multi-core/many-core device of the present invention has low power consumption: specific processor cores can enter a low power state according to the configuration information; the specific processor cores include, but are not limited to, processor cores that are not used and processor cores with a relatively low workload; the low power state includes, but is not limited to, reducing the processor clock frequency or shutting off the power supply.
  • the method and apparatus for data processing according to the present invention may further include one or more dedicated processing modules.
  • the dedicated processing module can be used as a macro module called by a processor core and its corresponding local memory, or as an independent processing module that receives the output of a processor core and its corresponding local memory and sends the processing result to a processor core and its corresponding local memory. The processor core and its corresponding local memory that output to the dedicated processing module, and the processor core and its corresponding local memory that receive the output of the dedicated processing module, may be the same or different.
  • the dedicated processing module includes but is not limited to a fast Fourier transform (FFT) module, an entropy encoding module, an entropy decoding module, a matrix multiplication module, a convolutional coding module, a Viterbi Code decoding module, and a Turbo Code. Decoding module.
  • taking a matrix multiplication module as an example: if a single large-scale matrix multiplication is performed using a single processor core, a large number of clock cycles is required, limiting the data throughput; if a plurality of the processor cores are used to implement the large-scale matrix multiplication, the number of execution cycles can be reduced, but the amount of data transfer between processor cores increases and considerable processor resources are consumed. With a dedicated matrix multiplication module, large-scale matrix multiplication can be done in a few cycles. When the program is divided, the operations before the large-scale matrix multiplication can be allocated to a group of processor cores, i.e., the front group, and the operations after the large-scale matrix multiplication to another group of processor cores, i.e., the rear group. The data in the output of the front group that needs to participate in the large-scale matrix multiplication is sent to the dedicated matrix multiplication module, and after processing, the result is sent to the rear group; the data in the output of the front group that does not need to participate in the large-scale matrix multiplication is sent directly to the rear group.
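The front-group / dedicated-module / rear-group dataflow can be sketched as below. The function names and the record layout are illustrative assumptions; the dedicated hardware multiplier is stood in for by a plain nested-loop multiply:

```python
# Sketch of routing work around a dedicated matrix-multiplication module:
# only the matrix operands pass through the module; other data from the
# front group of cores goes straight to the rear group.
def front_group(raw):
    # pre-processing on the front processor cores (placeholder)
    return {"matrix_a": raw["a"], "matrix_b": raw["b"], "other": raw["tag"]}

def matmul_module(a, b):
    # stands in for the dedicated hardware matrix multiplier
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def rear_group(product, other):
    # post-processing on the rear processor cores (placeholder)
    return {"product": product, "tag": other}

raw = {"a": [[1, 2], [3, 4]], "b": [[5, 6], [7, 8]], "tag": "frame-0"}
mid = front_group(raw)
out = rear_group(matmul_module(mid["matrix_a"], mid["matrix_b"]), mid["other"])
print(out["product"])  # [[19, 22], [43, 50]]
```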
  • the data processing method and apparatus of the present invention can divide serial program code into code segments suitable for execution by the respective processor cores in a serially connected multi-processor-core structure; for different numbers of processor cores, the code can be divided into code segments of different sizes and numbers according to different segmentation rules, making the method suitable for scalable multi-core/many-core device/system applications.
  • the code segments are allocated to the processor cores in the serially connected multi-processor-core structure; each processor core executes its specific instructions, and all the serially connected processor cores together realize the complete function of the program.
  • the data used between the code segments divided from the complete program code is transmitted through dedicated transmission paths, so there is almost no data dependency problem, realizing true multi-issue.
  • in the serially connected multi-processor-core structure, the number of instructions issued per cycle equals the number of processor cores, which greatly improves the utilization of the arithmetic units, thereby realizing high throughput for the serially connected multi-processor-core structure and, further, for the device/system.
  • the local memory replaces the cache usually present in a processor. Each processor core keeps all the instructions and data it uses in its corresponding local memory, achieving a 100% hit rate, solving the speed bottleneck caused by cache misses that require access to external low-speed memory, and further improving the overall performance of the device/system.
  • the multi-core/many-core device of the present invention has three levels of low-power technology: coarse-grained power management, implemented by methods such as cutting off the power of unused processor cores; fine-grained power management at the instruction level; and automatic real-time adjustment of the processor core clock frequency in hardware. Under the premise of ensuring the normal operation of the processor cores, these techniques effectively reduce the dynamic power consumption of the processor cores, enable each core to adjust its clock frequency as needed with minimal human intervention, and, being fast, realize real-time adjustment of the processor clock frequency more effectively.
  • the on-chip system can be realized only by programming and configuration, shortening the development cycle from design to product launch; moreover, simply reprogramming and reconfiguring enables the same hardware product to perform different functions.
  • Fig. 1 is a flow chart showing an embodiment of the present invention by taking the division and assignment of a high-level language program and an assembly language program as an example.
  • FIG. 2 is an embodiment of a processing routine loop in the post-compilation method of the present invention.
  • FIG. 3 is a schematic diagram of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention.
  • Figure 4 is an embodiment of an address mapping method.
  • Figure 5 is an embodiment of data transfer between cores.
  • Figure 6 is an embodiment of back pressure, exception handling, and the connection between data memory and shared memory.
  • Figure 8 (a) is an embodiment of adjacent processor core register value transfers.
  • Figure 8(b) is a second embodiment of adjacent processor core register value transfers.
  • Figure 9 is a third embodiment of adjacent processor core register value transfers.
  • Figure 10 (a) is an embodiment of a processor core and corresponding local memory composition based on the present invention.
  • Figure 10 (b) is another embodiment of a processor core and corresponding local memory composition based on the present invention.
  • Figure 10 (c) is an embodiment of a valid flag and a home flag bit in a processor core and corresponding local memory in accordance with the present invention.
  • Figure 11 (a) shows the typical structure of the current system-on-chip.
  • Figure 11 (b) is an embodiment of implementing a system on a chip based on the technical solution of the present invention.
  • Figure 11 (c) is another embodiment of implementing a system on a chip based on the technical solution of the present invention.
  • Figure 12 (a) is an embodiment of pre-compilation in the technical solution of the present invention.
  • Figure 12 (b) is an embodiment of post-compilation in the technical solution of the present invention.
  • Figure 13 (a) is another schematic diagram of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy of the present invention.
  • Figure 13 (b) is a schematic diagram of a multi-core serial structure formed by configuration of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention.
  • Figure 13 (c) is a configurable multi-core / many-core device based on serial multi-emission and pipeline hierarchy according to the present invention
  • Figure 13 (d) is a schematic diagram of a plurality of multi-core structures formed by configuration of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention.
  • Fig. 1 is a flow chart showing an embodiment of the present invention by taking the division and assignment of a high-level language program and an assembly language program as an example.
  • the pre-compilation (103) step is used to expand the calls in the high-level language program (101) and/or the assembly language program (102) to obtain the expanded high-level language code and/or assembly language code.
  • the expanded high-level language code and/or assembly language code is then compiled by the compiler (104) to obtain assembly code that conforms to the program execution order, which then undergoes post-compilation (107); if the program contains only assembly language code that already conforms to the program execution order, the compilation (104) can be omitted and the post-compilation (107) performed directly.
  • in the post-compilation (107) of the present embodiment, based on the structure information (106) of the multi-core device, the assembly code is run on the behavior model (108) of the processor cores and divided to obtain configuration information (110); a corresponding configuration bootloader (109) is also generated. Finally, the corresponding plurality of processor cores (113) is configured by one of the processor cores (111), either directly or through a DMA controller (112).
  • the instruction splitter first reads the front-end stream segment in step one (201), and reads the stream-segment-related information in step two (202). Then, in step three (203), it determines whether the stream segment is a loop. If it is not a loop, it proceeds to step nine (209) and processes the stream segment conventionally. If it is a loop, it proceeds to step four (204) to read the loop count M, and then to step five (205) to read the number of iterations N that can be accommodated in this block. In step six (206), it is judged whether the loop count M is greater than the number N of iterations that can be accommodated. If M is greater than N, the process proceeds to step seven (207), dividing the loop into a small loop of N iterations and a small loop of M−N iterations, and in step eight (208) M−N is assigned to M while moving on to the next block; this repeats until the loop count is no greater than the number of iterations that can be accommodated.
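The loop-splitting steps above can be sketched compactly. For simplicity the sketch assumes every block accommodates the same number N of iterations, whereas the flow in Figure 2 re-reads N per block:

```python
# Sketch of the loop-splitting step: a loop of M iterations is cut into
# chunks of at most N iterations (the capacity of one block), with the
# remainder rolling over into the next block until M <= N.
def split_loop(m, capacity_per_block):
    """Return the iteration count assigned to each successive block."""
    chunks = []
    while m > capacity_per_block:           # step six: M > N?
        chunks.append(capacity_per_block)   # step seven: an N-iteration loop
        m -= capacity_per_block             # step eight: M <- M - N, next block
    chunks.append(m)                        # the final loop fits in one block
    return chunks

print(split_loop(10, 4))  # [4, 4, 2]
```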
  • FIG. 3 is a schematic diagram of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention.
  • the apparatus is comprised of a number of processor cores (301), configurable local memory (302), and a configurable interconnect structure (303).
  • each processor core (301) corresponds to a configurable local memory (302) below it, which together form a level of the macro pipeline.
  • through the configurable interconnect structure (303), multiple processor cores (301) and their respective configurable local memories (302) can be connected into a serial connection structure. Multiple serial connection structures can be independent of each other, or partially or completely interconnected, running programs serially, in parallel, or in a mixed manner.
  • Figure 4 is an embodiment of an address mapping method.
  • Figure 4 (a) uses the lookup table method to implement address lookup. Taking a 16-bit address as an example, the 64K address space is divided into a plurality of small memories (403), each with a 1K address space, which are written sequentially; after one block of memory is full, the next block is written. After each write, the intra-block address pointer (404) automatically points to the next available entry whose valid bit is 0, and the valid bit of that entry is set. Each time an entry is written with data, its address is written to the lookup table (402). Taking a write to address BFC0 as an example, the address pointer (404) points to the No. 2 entry of the memory (403), and when the corresponding data is written to the No. 2 entry, the corresponding intra-block address is written to the lookup table (402) to establish an address mapping relationship.
  • Figure 4 (b) uses the CAM array method to implement address lookup. Taking a 16-bit address as an example, the 64K address space is divided into a plurality of small memories (403), each with a 1K address space, which are written sequentially; after one block of memory is full, the next block is written. After each write, the intra-block address pointer (406) automatically points to the next available entry whose valid bit is 0, and the valid bit of that entry is set.
  • each time an entry is written with data, its instruction address is written to the corresponding entry in the CAM array (405). Taking a write to address BFC0 as an example, the address pointer (406) points to the No. 2 entry of the memory (403), and when the corresponding data is written to the No. 2 entry, the instruction address BFC0 is written to the corresponding entry in the CAM array (405) to establish an address mapping relationship.
  • when reading, the input instruction address is compared with all the instruction addresses stored in the CAM array to find the corresponding entry, and the stored data is read out.
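The CAM-style associative lookup of Figure 4 (b) can be sketched as follows. The class, its field names, and the list-based layout are illustrative assumptions, not the hardware structure:

```python
# Sketch of CAM-style address mapping: data entries are written
# sequentially, each paired with its instruction address in a CAM array;
# a read compares the input address against all stored addresses.
class CamMemory:
    def __init__(self):
        self.addresses = []   # CAM array: stored instruction addresses
        self.data = []        # small-memory entries, written sequentially

    def write(self, address, value):
        # the intra-block pointer advances to the next free entry; the
        # address goes into the CAM to establish the mapping
        self.addresses.append(address)
        self.data.append(value)

    def read(self, address):
        # compare the input address against every stored address
        for i, stored in enumerate(self.addresses):
            if stored == address:
                return self.data[i]
        return None           # no mapping established for this address

cam = CamMemory()
cam.write(0xBFC0, "entry-2-data")
assert cam.read(0xBFC0) == "entry-2-data"
assert cam.read(0xBFC4) is None
```

In hardware the comparison happens in parallel across all CAM entries in one cycle; the sequential loop here is only a functional model.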
  • Figure 5 is an embodiment of data transfer between cores. All data memories are located between the processor cores and are each divided into two parts in the logical sense: the upper part is used for reading and writing by the processor core above the data memory, and the lower part is used only for supplying read data to the processor core below the data memory. While a processor core runs its program, data is transferred from the data memory above it.
  • the three-input selectors (502, 509) can select the data (506) from the far end to be sent to the data memories (503, 504). When the processor cores (510, 511) do not execute a Store instruction, the lower portions of the data memories (501, 503) are written to the corresponding next data memories (503, 504) through the three-input selectors (502, 509), respectively, and the valid bit V of the written line in the upper part is marked as 1. The register file only writes values to the lower-level data memory. The two-input selectors (505, 507) determine, according to the valid bit V of the data memories (503, 504), whether to take data from the corresponding data memories (501, 503) above or from the data memories (503, 504) below. If the valid bit V of an entry in the data memories (503, 504) is 1, marking that the data has already been written and updated from the data memories (501, 503) above, the data transmitted from the far end (506) is not selected. When the processor cores (510, 511) execute a Store instruction, the selectors (502, 509) select the register file output of the processor cores (510, 511) as input, to ensure that the stored data is the latest value processed by the processor cores (510, 511).
  • while the lower portion of the data memory (503) transfers data to the upper portion of the data memory (504), a pointer is used to indicate the data entry being transferred, and a flag marks that the transfer is about to be completed, i.e., that the data should already have been transferred to the next memory. Meanwhile, the upper portion of the data memory (501) transfers data to the lower portion of the data memory (503), the upper portion of the data memory (503) transfers data to the lower portion of the data memory (504), and the upper portion of the data memory (504) transfers its data downward, forming a ping-pong transmission structure. Every data memory can have a portion allocated for storing instructions according to the required instruction space size, i.e., the data memory and the instruction memory are not physically separated.
  • Figure 6 is an embodiment of back pressure, exception handling, and the connection between data memory and shared memory.
  • a corresponding code segment (615) is written to the instruction memory (601, 609, 610, 611) by the DMA controller (616).
  • the processor cores (602, 604, 606, 608) run the code in the respective instruction memory (601, 609, 610, 611) and read and write the corresponding data memory (603, 605, 607, 612).
  • taking the processor core (604), the data memory (605), and the subsequent processor core (606) as an example: both the front and rear processor cores (604, 606) have access to the data memory (605), but only the front processor core (604) can trigger the ping-pong swap of the data sub-memories in the data memory (605).
  • the back pressure signal (614) is used by the rear processor core (606) to inform the data memory (605) whether the read operation has been completed; the back pressure signal (613) is used by the data memory (605) to notify the front processor core (604) whether an overflow has occurred, and to pass on the back pressure signal transmitted by the rear processor core (606).
  • the pre-stage processor core (604) determines whether the macro pipeline is blocked according to its own operation and the back-pressure signal passed on by the data memory (605), decides whether to ping-pong swap the data sub-memories in the data memory (605), and generates a back-pressure signal that continues to the next stage upstream. Through this reverse back-pressure transfer from processor core to data memory to processor core, the operation of the macro pipeline can be controlled.
  • all data memories (603, 605, 607, 612) are connected to the shared memory (618) via a connection (619). When an address written or read by a data memory falls outside its own address range, an address exception occurs and the address is looked up in the shared memory (618); once found, the data is written to that address or the data at that address is read.
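The reverse back-pressure transfer can be modelled in a few lines. This is a minimal sketch under one assumed rule (not the patent's circuit): a stage stalls when it is itself busy or when any stage downstream of it is stalled, and the stall condition propagates from the last macro-pipeline stage back to the first.

```python
def propagate_back_pressure(stage_busy):
    """stage_busy[i] is True when stage i (a processor core or a data
    memory) cannot accept new data this cycle.  Returns which stages
    must stall: a stage stalls when it is busy itself or when the stage
    directly downstream of it is stalled."""
    blocked = [False] * len(stage_busy)
    downstream_blocked = False
    # walk from the last macro-pipeline stage back to the first
    for i in reversed(range(len(stage_busy))):
        blocked[i] = stage_busy[i] or downstream_blocked
        downstream_blocked = blocked[i]
    return blocked

# core -> memory -> core -> memory chain; the third stage is busy:
# everything upstream of it stalls, the stage after it keeps running
assert propagate_back_pressure([False, False, True, False]) == [True, True, True, False]
```

The key property the sketch demonstrates is that a single busy stage freezes only the upstream part of the macro pipeline; downstream stages drain normally.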
  • an exception also occurs when the processor core (608) needs to use data in the data memory (605); in that case the data memory (605) transfers the data to the processor core (608) via the shared memory (618).
  • the exception information generated by the processor core and data memory is transferred to the exception handling module (617) via the dedicated channel (620).
  • taking an arithmetic overflow as an example, the exception handling module (617) controls the processor core to perform a saturation operation on the overflowed result; taking a data memory overflow as an example, the exception handling module (617) controls the data memory to access the shared memory and stores the overflowing data there. During this process, the exception handling module (617) sends a signal to the affected processor core or data memory to block it, and lets it resume operation after exception handling completes; the other processor cores and data memories determine whether they are blocked from the signals propagated by the back-pressure mechanism.
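The saturation operation mentioned above can be illustrated with a short sketch. The 32-bit width and the function name are illustrative assumptions, not details taken from the patent.

```python
def saturate_add(a, b, bits=32):
    """Add two signed `bits`-wide integers; on overflow, clamp to the
    largest or smallest representable value instead of wrapping around
    (the saturation operation triggered by the exception handler)."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, a + b))

assert saturate_add(2**31 - 1, 1) == 2**31 - 1   # positive overflow clamps
assert saturate_add(-2**31, -1) == -2**31        # negative overflow clamps
assert saturate_add(100, 23) == 123              # in-range results unchanged
```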
  • FIG. 7 shows an embodiment of a self-test and self-repair method and structure.
  • the test vectors generated by the vector generator (702) are sent synchronously to the processor cores; the test vector distribution controller (703) controls the connection relationship between the processor cores and the vector generator (702).
  • the operation result distribution controller (709) controls the connection relationship between the processor cores and the comparators, so that each processor core's operation results can be compared with those of other processor cores.
  • each processor core can be compared with its adjacent processor cores; for example, the processor core (704) can be compared by the comparison logic (708) with the processor cores (705, 706, 707).
  • each comparison logic may include one or more comparators.
  • if a comparison logic has a single comparator, each processor core is compared with its adjacent cores in turn; if it has multiple comparators, each processor core is compared with its adjacent cores simultaneously. The test results are written directly from each comparison logic into the test result table (710).
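The comparison-based self-test can be sketched as a majority vote among adjacent cores: a core whose results disagree with most of its neighbours on the same test vector is flagged as faulty. The data layout below is an illustrative assumption, not the patent's test-result-table format.

```python
def self_test(results, neighbours):
    """results[i] is the value core i computed for the shared test
    vector; neighbours[i] lists the adjacent cores it is compared
    against.  A core is flagged faulty when it disagrees with the
    majority of its neighbours."""
    faulty = []
    for core, out in enumerate(results):
        disagreements = sum(1 for n in neighbours[core] if results[n] != out)
        if disagreements > len(neighbours[core]) // 2:
            faulty.append(core)
    return faulty

# core 2 produced a wrong result; its three neighbours agree with each other
results = [7, 7, 9, 7]
neighbours = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
assert self_test(results, neighbours) == [2]
```

A repair step could then, for example, exclude the flagged core when the interconnect structure is next configured.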
  • the processor core has a register file (801) containing 31 32-bit general-purpose registers; all general-purpose register values in the pre-stage processor core (802) are transferred to the present-stage processor core (803).
  • each general-purpose register of the pre-stage processor core (802) can be connected directly to the corresponding general-purpose register of the present-stage processor core (803) using 992 hard-wired connections (31 × 32 bits).
  • the input ends are connected one by one through multiplexers.
  • when register values are passed, the values of all 31 32-bit general-purpose registers in the pre-stage processor core (802) can be transferred to the processor core (803) in one cycle.
  • Figure 8(a) shows the hard-wired connection for one bit (804) of a general-purpose register; the remaining 991 bits are connected in the same way as this bit (804).
  • the output (806) of the corresponding bit (805) in the pre-stage processor core (802) is connected by hardwire (807), through the multiplexer (808), to the input of the bit (804) in the present-stage processor core (803).
  • when the processor core performs arithmetic, logic, and similar operations, the multiplexer (808) selects the data from the present-stage processor core (809); when the processor core performs a load operation, the data from the present-stage processor core (809) is selected if the data already exists in the local memory corresponding to the present-stage core, otherwise the data transmitted from the pre-stage processor core (810) is selected.
  • when register values are transferred, the multiplexer (808) selects the data from the pre-stage processor core (810). All 992 bits are transmitted simultaneously, so the entire register file can be transferred in one cycle.
  • adjacent processor cores (820, 822) each have a register file (821, 823) containing a plurality of 32-bit general-purpose registers. When transferring register values from the pre-stage processor core (820) to the present-stage processor core (822), the data output of the register file (821) in the pre-stage core (820) can be connected by 32 hardwires (826) to an input of a multiplexer (827), whose output drives the data input (830) of the register file (823) in the present-stage processor core (822); the inputs of the multiplexer (827) are the data from the present-stage processor core (824) and the data from the pre-stage processor core (825) transmitted over the hardwire (826).
  • when the processor core performs arithmetic, logic, and similar operations, the multiplexer (827) selects the data from the present-stage processor core (824); when the processor core performs a load operation, the data from the present-stage processor core (824) is selected if the data already exists in the local memory corresponding to the present-stage core, otherwise the data from the pre-stage processor core (825) is selected.
  • when register values are transferred, the multiplexer (827) selects the data transmitted from the pre-stage processor core (825).
  • the register address generation modules (828, 832) corresponding to the register files (821, 823) generate the addresses of the registers whose values are to be transferred and drive the address inputs (831, 833) of the register files (821, 823); the register values are passed over the hardwire from the register file (821) to the register file (823) in multiple transfers. In this way, the transfer of all or part of the register values in the register file can be completed in multiple cycles with only a small number of hardwires added.
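The multi-cycle transfer over a 32-bit connection can be sketched as follows. Register files are modelled as plain dictionaries and one word is assumed to cross the hardwire per cycle; both are illustrative simplifications of the circuit described above.

```python
def transfer_registers(src_file, dst_file, regs_to_copy):
    """One register value crosses the 32-bit hardwire per cycle; the
    address-generation modules on both sides step through the same
    register numbers in lockstep.  Returns the number of cycles used."""
    cycles = 0
    for r in regs_to_copy:          # both address generators emit r
        dst_file[r] = src_file[r]   # one 32-bit word per cycle
        cycles += 1
    return cycles

src = {i: i * 10 for i in range(31)}   # pre-stage register file
dst = {i: 0 for i in range(31)}        # present-stage register file
assert transfer_registers(src, dst, range(31)) == 31   # 31 cycles, 31 registers
assert dst[30] == 300
```

Transferring only the registers a code segment actually needs (a shorter `regs_to_copy`) shortens the transfer proportionally, which is the trade-off this embodiment makes against the 992-wire one-cycle scheme.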
  • adjacent processor cores (940, 942) each have a register file (941, 943) containing a plurality of 32-bit general purpose registers.
  • the pre-stage processor core (940) can first use a data store instruction to write the value of a register in the register file (941) into the local data memory (954) corresponding to the pre-stage core (940); the present-stage processor core (942) then uses a data load instruction to read the corresponding data from the local data memory (954) and write it into the corresponding register of the register file (943).
  • the data output (949) of the register file (941) in the pre-stage processor core (940) is connected through the 32-bit connection (946) to the data input (948) of the local data memory (954).
  • the data input (950) of the register file (943) in the present-stage processor core (942) is connected through the multiplexer (947) and the 32-bit connection (953) to the data output (952) of the local data memory (954).
  • the inputs of the multiplexer (947) are the data from the present-stage processor core (944) and the data from the pre-stage processor core (945) transmitted over the 32-bit connection (953).
  • when the processor core performs arithmetic, logic, and similar operations, the multiplexer (947) selects the data from the present-stage processor core (944); when the processor core performs a load operation, the data from the present-stage processor core (944) is selected if the data already exists in the local memory corresponding to the present-stage core, otherwise the data from the pre-stage processor core (945) is selected.
  • when register values are transferred, the multiplexer (947) selects the data transmitted from the pre-stage processor core (945).
  • the values of all the registers in the register file (941) may be sequentially written into the local data memory (954), and then these values are sequentially written into the register file (943).
  • the values of some registers in the register file (941) can be written in turn to the local data memory (954) and then written in turn into the register file (943); alternatively, as soon as the value of one register of the register file (941) has been written to the local data memory (954), it can immediately be written into the register file (943), the process repeating in sequence until all register values that need to be transferred have been passed.
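The store/load transfer through the local data memory can be sketched like this. The register names, the dictionary-based memory, and the base address are illustrative assumptions; the sketch uses the first ordering described above (all stores, then all loads).

```python
def transfer_via_local_memory(src_regs, dst_regs, reg_names, memory, base):
    """The pre-stage core stores each register into its local data
    memory; the present-stage core then loads the values back into its
    own register file (store -> load pairs over the same addresses)."""
    for i, name in enumerate(reg_names):
        memory[base + i] = src_regs[name]     # data store instruction
    for i, name in enumerate(reg_names):
        dst_regs[name] = memory[base + i]     # data load instruction

mem = {}
src = {"r1": 11, "r2": 22}   # pre-stage register file (941)
dst = {"r1": 0, "r2": 0}     # present-stage register file (943)
transfer_via_local_memory(src, dst, ["r1", "r2"], mem, base=0x100)
assert dst == {"r1": 11, "r2": 22}
```

The interleaved variant (store one register, load it immediately, repeat) touches the same addresses in a different order and needs no extra hardware either way.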
  • FIG. 10 shows two embodiments of a connection structure composed of a processor core and a corresponding local memory according to the present invention.
  • Various possible substitutions, adjustments, and improvements of the various components of the embodiments may be made in accordance with the technical solutions and concepts of the present invention, and all such replacements, adjustments, and improvements are intended to be within the scope of the present invention.
  • Figure 10(a) shows a connection structure comprising the processor core (1001), which contains a local instruction memory and a local data memory, together with the local data memory (1002) corresponding to its previous-stage processor core.
  • the processor core (1001) comprises a local instruction memory (1003), a local data memory (1004), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009), and an output buffer (1010).
  • the local instruction memory (1003) stores instructions required for execution by the processor core (1001).
  • the operands required by the execution unit (1005) in the processor core (1001) come from the register file (1006) or from immediate values in the instruction; execution results are written back to the register file (1006).
  • each local data memory has two sub-memories. Taking the local data memory (1004) as an example, the data read from the two sub-memories is selected by the multiplexers (1018, 1019) to produce the final output data (1020).
  • the data in the local data memory (1002, 1004), the data in the write buffer (1009), or the data in the external shared memory (1011) can be read into the register file (1006) by a data load (load) instruction.
  • the data in the local data memories (1002, 1004), the data in the write buffer (1009), and the data in the external shared memory (1011) are selected by the multiplexers (1016, 1017) before being input into the register file (1006).
  • by a data store instruction, the data in the register file (1006) can be written, with a delay through the write buffer (1009), into the local data memory (1004), or written, with a delay through the output buffer (1010), into the external shared memory.
  • data can be read from the local data memory (1002) into the register file (1006) while simultaneously being written, with a delay through the write buffer (1009), into the local data memory (1004), performing the LIS function of the present invention and achieving data transfer at no extra cost.
  • the data received by the write buffer (1009) has three sources: data from the register file (1006), data from the pre-stage processor core's local data memory (1002), and data from the external shared memory (1011).
  • the data from these three sources is selected by the multiplexer (1012) and then input to the write buffer (1009).
  • the local data store only receives data inputs from the write buffer in the same processor core.
  • the local data memory (1004) only receives data input from the write buffer (1009).
  • the local instruction memory (1003) and the local data memories (1002, 1004) are each composed of two identical sub-memories, and read and write operations can be performed simultaneously on different sub-memories of the same local memory. With such a structure, the ping-pong buffer exchange of the local data memories described in the technical solution of the present invention can be realized.
  • the address received by the local instruction memory (1003) is generated by the program counter (1008).
  • the address received by the local data memory (1004) has three sources: the address for storing data from the address portion of the present-stage processor core's write buffer (1009), the address for reading data from the present-stage processor core's data address generation module (1007), and the address for reading data from the subsequent processor core's data address generation module (1013).
  • these addresses are selected by the multiplexers (1014, 1015) and input to the address-receiving modules of the different sub-memories in the local data memory (1004).
  • the address received by the local data memory (1002) likewise has three sources: the address for storing data from the address portion of the write buffer of the processor core to which it corresponds, the address for reading data from that core's data address generation module, and the address for reading data from the subsequent processor core's data address generation module (1007).
  • these addresses are selected by multiplexers and input to the address-receiving modules of the different sub-memories in the local data memory (1002).
  • FIG. 10(b) shows another connection structure based on a processor core and its corresponding local memories according to the present invention, comprising the processor core (1021), which contains a local instruction memory and a local data memory, together with the local data memory (1022) corresponding to its previous-stage processor core.
  • the processor core (1021) comprises a local instruction memory (1003), a local data memory (1024), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009), and an output buffer (1010).
  • the connection structure proposed in this embodiment is substantially the same as that of the embodiment of Figure 10(a); the only difference is that the local data memories (1022, 1024) in this embodiment each have a dual-port memory structure. A dual-port memory can support read and write operations at two different addresses simultaneously.
  • the address received by the local data memory (1024) has three sources: the address for storing data from the address portion of the present-stage processor core's write buffer (1009), the address for reading data from the present-stage processor core's data address generation module (1007), and the address for reading data from the subsequent processor core's data address generation module (1025).
  • the address (1025) for reading data from the subsequent processor core's data address generation module is selected by the multiplexer (1026) and input to the address-receiving module of the local data memory (1024).
  • the address received by the local data memory (1022) likewise has three sources: the address for storing data from the address portion of the write buffer of the processor core to which it corresponds, the address for reading data from that core's data address generation module, and the address for reading data from the subsequent processor core's data address generation module (1007). These addresses are selected by a multiplexer and input to the address-receiving module of the local data memory (1022).
  • a single-port memory can also be used instead of the dual-port memory of the embodiment of FIG. 10(b).
  • in that case, the order of the instructions in the program is adjusted statically, or the execution order of the instructions is adjusted dynamically during program execution, so that memory-access instructions are executed at the same time as instructions that do not need to access memory, making the connection structure simpler and more efficient.
  • in the embodiments above, each local data memory behaves as a dual-port memory capable of supporting two reads, two writes, or one read and one write simultaneously.
  • as shown in Figure 10(c), a valid flag bit (1032) and an attribution flag bit (1033) can be added to each address in the local data memory (1031).
  • the valid flag bit (1032) represents the validity of the data (1034) at the corresponding address in the local data memory (1031): for example, "1" can represent that the data (1034) is valid, and "0" that it is invalid.
  • the attribution flag bit (1033) represents which processor cores use the data (1034) at the corresponding address in the local data memory (1031): for example, "0" can represent that the data (1034) is used only by the processor core (1035) corresponding to the local data memory (1031), and "1" that the data (1034) is used both by the corresponding processor core (1035) and by its subsequent processor core (1036).
  • with the valid flag bit (1032) and the attribution flag bit (1033) defined as above, the attributes of every datum stored in the local data memory can be described, and correct reads and writes can be guaranteed.
  • if the valid flag bit (1032) corresponding to an address in the local data memory (1031) is "0", the data there is invalid, so the present-stage processor core can directly perform a data store operation on this address if needed.
  • if the valid flag bit (1032) is "1" and the attribution flag bit (1033) is "0", the data is used only by the present-stage processor core (1035), so the present-stage processor core (1035) can directly perform a data store operation on this address if needed.
  • if the valid flag bit (1032) is "1" and the attribution flag bit (1033) is "1", the data is used both by the present-stage processor core (1035) and by its subsequent processor core (1036); if the present-stage processor core (1035) needs to perform a data store operation on the address, it must wait for the attribution flag bit (1033) to become "0" before the store can proceed.
  • that is, the data (1034) at the address is first transmitted to the corresponding position in the local data memory (1037) corresponding to the subsequent processor core (1036), and the attribution flag bit (1033) of the address in the local data memory (1031) corresponding to the present-stage processor core (1035) is set to "0"; the present-stage processor core (1035) can then perform the data store operation on the address.
  • when new data is written, the corresponding valid flag bit (1032) can be set to "1", and the attribution flag bit (1033) is determined according to whether the data (1034) will be used by the subsequent processor core (1036): if so, the attribution flag bit (1033) is set to "1", otherwise to "0". It is also possible to always set both the corresponding valid flag bit (1032) and the corresponding attribution flag bit (1033) to "1"; this increases the occupancy of the local data memory (1031) but simplifies its implementation structure.
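The flag-bit rules above can be condensed into a short sketch. The list-based entry layout `[valid, owned_by_both, data]` and the function names are illustrative assumptions, not the patent's storage format.

```python
def can_store(valid, owned_by_both):
    """A present-stage core may overwrite an entry when it is invalid,
    or valid but used only by this core.  When the entry is still
    needed by the next-stage core, it must first be forwarded."""
    if not valid:
        return True             # valid flag "0": free slot
    return not owned_by_both    # attribution flag "0": only this core uses it

def store(entry, value, next_stage_memory, addr):
    """entry = [valid, owned_by_both, data]; forwards before overwriting."""
    if entry[0] and entry[1]:
        next_stage_memory[addr] = entry[2]   # hand the old value downstream
        entry[1] = 0                         # attribution flag cleared
    entry[0], entry[2] = 1, value            # now valid, new data in place

nxt = {}                  # local data memory of the next-stage core
e = [1, 1, 99]            # valid, and still needed by the next core
store(e, 7, nxt, 0x20)
assert nxt[0x20] == 99    # old value was forwarded before the overwrite
assert e == [1, 0, 7]
```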
  • Figure 11 (a) shows the typical structure of the current system-on-chip.
  • the processor core (1101), the digital signal processor core (1102), the functional units (1103, 1104, 1105), the input/output interface control module (1106), and the storage control module (1108) are all connected to the system bus (1110).
  • the system-on-chip can exchange data with the peripheral devices (1107) through the input/output interface control module (1106), and with the external memory (1109) through the storage control module (1108).
  • FIG. 11(b) shows an embodiment of implementing a system on chip based on the technical solution of the present invention.
  • the processor core and corresponding local memory (1121), together with six other processor cores and their corresponding local memories, constitute a functional module (1124);
  • the processor core and corresponding local memory (1122), together with four other processor cores and their corresponding local memories, constitute a functional module (1125);
  • the processor core and corresponding local memory (1123), together with two other processor cores and their corresponding local memories, constitute a functional module (1126).
  • the functional modules (1124, 1125, 1126) may each correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104, or 1105), the input/output interface control module (1106), or the storage control module (1108) of the embodiment of FIG. 11(a).
  • the processor cores and corresponding local memories (1123, 1127, 1128, 1129) constitute a serially connected multi-core structure; these four processor cores and corresponding local memories (1123, 1127, 1128, 1129) together realize the functions of the functional module (1126).
  • Data transfer between the processor core and corresponding local memory (1123) and the processor core and corresponding local memory (1127) is accomplished via an internal connection (1130).
  • the data transfer between the processor core and corresponding local memory (1127) and the processor core and corresponding local memory (1128) is accomplished via an internal connection (1131), and that between the processor core and corresponding local memory (1128) and the processor core and corresponding local memory (1129) via an internal connection (1132).
  • the function module (1126) is connected to the bus connection module (1138) through hard wiring (1133, 1134), so that the function module (1126) and the bus connection module (1138) can transfer data to each other.
  • the function module (1125) and the bus connection module (1139) can transfer data to each other, and the function module (1124) and the bus connection module (1140, 1141) can transfer data to each other.
  • the bus connection module (1138) and the bus connection module (1139) can transfer data to each other through hardwired (1135).
  • the bus connection module (1139) and the bus connection module (1140) can mutually transmit data through hardwired (1136).
  • the bus connection module (1140) and the bus connection module (1141) can transmit data to each other through hard wiring (1137).
  • since the processor cores and corresponding local memories in the configurable multi-core/many-core apparatus proposed by the present invention are easily expanded in number, various types of system-on-chip can be conveniently implemented by the method of this embodiment.
  • in addition, when the apparatus operates in real time, the structure of the system-on-chip can be flexibly changed through real-time dynamic configuration.
  • FIG. 11(c) shows another embodiment of implementing a system on chip based on the technical solution of the present invention.
  • the processor core and corresponding local memory (1151), together with six other processor cores and their corresponding local memories, constitute a functional module (1163);
  • the processor core and corresponding local memory (1152), together with four other processor cores and their corresponding local memories, constitute a functional module (1164);
  • the processor core and corresponding local memory (1153), together with two other processor cores and their corresponding local memories, constitute a functional module (1165).
  • the functional modules (1163, 1164, 1165) may each correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104, or 1105), the input/output interface control module (1106), or the storage control module (1108) of the embodiment of FIG. 11(a).
  • the processor cores and corresponding local memories (1153, 1154, 1155, 1156) constitute a serially connected multi-core structure; these four processor cores and corresponding local memories (1153, 1154, 1155, 1156) together realize the functions of the functional module (1165).
  • Data transfer between the processor core and corresponding local memory (1153) and the processor core and corresponding local memory (1154) is accomplished via an internal connection (1160).
  • the data transfer between the processor core and corresponding local memory (1154) and the processor core and corresponding local memory (1155) is accomplished via an internal connection (1161), and that between the processor core and corresponding local memory (1155) and the processor core and corresponding local memory (1156) via an internal connection (1162).
  • as an example, the data transmission requirement between the functional module (1165) and the functional module (1164) is met through data transfer between the processor core and corresponding local memory (1156) and the processor core and corresponding local memory (1166).
  • the interconnect network can be configured according to the data transmission requirements.
  • a bidirectional data path (1158) between the processor core and its corresponding local memory (1156) and the processor core and its corresponding local memory (1166) is configured automatically.
  • if only the processor core and its corresponding local memory (1166) needs to transfer data to the processor core and its corresponding local memory (1156), or only the processor core and its corresponding local memory (1156) needs to transfer data to the processor core and its corresponding local memory (1166), a one-way data path can be established in the same way.
  • data paths are likewise established between the processor core and its corresponding local memory (1151) and the corresponding cores of the other functional modules.
  • the bidirectional data paths (1157, 1158, 1159) implement the function of the system bus (1110) in FIG. 11(a) and, together with the functional modules (1163, 1164, 1165), constitute a typical system-on-chip structure.
  • in such a system-on-chip, there is not necessarily only a single set of data paths between any two functional modules. Since the number of processor cores in the configurable multi-core/many-core apparatus proposed by the present invention is easily expanded, various types of system-on-chip can be conveniently implemented by the method of this embodiment. In addition, when the configurable multi-core/many-core apparatus of the present invention operates in real time, the structure of the system-on-chip can be flexibly changed through real-time dynamic configuration.
  • Fig. 12(a) is a pre-compilation embodiment
  • Fig. 12(b) is a post-compilation embodiment.
  • the left side is the original program code (1201, 1203, 1204).
  • there are two function calls in the code: an A function call and a B function call.
  • 1203 and 1204 are the code of the A function and the B function themselves.
  • after pre-compilation, the A function call and the B function call are each replaced with the corresponding function code, so the expanded code contains no function calls, as shown in 1202.
  • Figure 12 (b) shows the post-compilation embodiment.
  • the original object code (1205) is the object code after normal compilation.
  • starting from the sequentially executed object code, the code is split by post-compilation into code blocks.
  • each code block is assigned to a corresponding processor core for execution.
  • the A loop body is split into a single code block (1207), while the B loop body, because of its relatively large size, is split into two code blocks, namely B loop body 1 (1209) and B loop body 2 (1210); these two code blocks are executed on two processor cores to complete the B loop body.
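The division of sequential object code into roughly equal code blocks can be sketched with a greedy allocator. The segment names, cycle counts, and budget below are made-up illustrative figures; the only assumption carried over from the text is that a loop body is pre-cut into its own segment(s) before allocation.

```python
def assign_blocks(segments, cycle_budget):
    """Greedy split of a sequentially executed object program into code
    blocks of roughly equal run time; each block goes to one core of the
    serial (macro-pipeline) chain.  `segments` is a list of
    (name, cycles) pairs, already cut so a loop body is its own segment."""
    blocks, current, used = [], [], 0
    for name, cycles in segments:
        if current and used + cycles > cycle_budget:
            blocks.append(current)      # close the block for one core
            current, used = [], 0
        current.append(name)
        used += cycles
    if current:
        blocks.append(current)
    return blocks

program = [("init", 20), ("A-loop", 90), ("B-loop-1", 80), ("B-loop-2", 80), ("exit", 10)]
blocks = assign_blocks(program, cycle_budget=110)
# three cores: init+A-loop, B-loop-1, B-loop-2+exit
assert blocks == [["init", "A-loop"], ["B-loop-1"], ["B-loop-2", "exit"]]
```

Pre-cutting the large B loop into `B-loop-1` and `B-loop-2` is what lets the allocator spread it over two cores, mirroring the figure's treatment of the B loop body.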
  • FIG. 13(a) is a schematic diagram of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy
  • FIG. 13(b) is a schematic diagram of a multi-core serial structure formed by configuration
  • Figure 13 (c) is a schematic diagram of a multi-core serial-parallel hybrid structure formed by configuration
  • Figure 13 (d) is a schematic diagram of a plurality of multi-core structures formed by configuration.
  • the device consists of multiple processor cores and configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) and configurable interconnect structures (1302, 1304, 1306, 1308, 1310, 1312, 1314, 1316, 1318).
  • each processor core and configurable local memory form a level of the macro pipeline.
  • by configuring the configurable interconnect structures (such as 1302), multiple processor cores and configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) are connected into a serial connection structure.
  • multiple serial connection structures may be independent of each other, or partially or fully interconnected, running programs serially, in parallel, or in a serial-parallel mix.
  • in Figure 13(b), a multi-core serial structure is formed by configuring the corresponding configurable interconnect structures; the processor core and configurable local memory (1301) form the first stage of the multi-core serial structure, and the processor core and configurable local memory (1317) form the last stage.
  • in Figure 13(c), the processor cores and configurable local memories (1301, 1303, 1305, 1313, 1315, 1317) form a serial structure, while the processor cores and configurable local memories (1307, 1309, 1311) form a parallel structure, ultimately forming a multi-core processor with a serial-parallel hybrid structure.
  • in Figure 13(d), the processor cores and configurable local memories (1301, 1307, 1313, 1315) form one serial structure by configuring the corresponding configurable interconnect structures, while the processor cores and configurable local memories (1303, 1309, 1305, 1311, 1317) form another serial structure, thereby forming two completely independent serial structures.


Description

A Method and Apparatus for Data Processing

Technical Field

The present invention relates to the field of integrated circuit design.

Background Art
According to Moore's law, transistor feature sizes are shrinking along the 65nm, 45nm, 32nm... route, and the number of transistors integrated on a single chip now exceeds a billion. However, since synthesis and place-and-route tools were introduced in the 1980s, liberating back-end design productivity, EDA tools have made no qualitative breakthrough in more than 20 years, so front-end design, and especially verification, has found it increasingly difficult to cope with ever-growing single-chip scale. Design companies have therefore turned their attention to multi-core: integrating multiple relatively simple cores on one chip, improving chip capability while reducing design and verification difficulty.

Traditional multi-core processors integrate multiple processor cores that execute programs in parallel to improve chip performance. To make full use of the resources of a traditional multi-core processor, a parallel-programming mindset is required. Yet operating systems have not fundamentally changed how they allocate and manage resources, mostly distributing them evenly in a symmetric manner. Although multiple processor cores can compute in parallel, a single program thread executes serially, and this structural characteristic means true pipelined operation cannot be achieved in traditional multi-core processor structures. Moreover, today's software still contains a large amount of code that must execute serially and cannot be well partitioned. Consequently, once the number of processor cores reaches a certain level, performance no longer improves as more cores are added. In addition, as semiconductor manufacturing processes keep advancing, the internal operating frequency of a multi-core processor is now far higher than that of its external memory, and simultaneous memory accesses by multiple processor cores have become a major bottleneck constraining system performance; running serial programs on a parallel multi-core structure cannot deliver the expected performance gains.
Summary of the Invention

In view of the shortcomings of the prior art, the present invention proposes a method and apparatus for data processing that runs serial programs at high speed and improves throughput.

The data processing method and apparatus of the present invention include: partitioning, according to specific rules, the program code running on a serially connected multi-processor-core structure, so that the time each core of the structure needs to run its assigned code segment is as equal as possible, achieving load balancing of the workload among the cores. The serially connected multi-processor-core structure contains a plurality of processor cores; a processor here means hardware that performs computation and reads and writes data by executing instructions, including but not limited to central processing units (CPU) and digital signal processors (DSP). The serially connected multi-processor-core structure of the present invention constitutes serial multi-issue (in-serial multi-issue): any core in the structure can issue one or a plurality of instructions per unit time, and a plurality of serially connected cores together form a larger-scale multi-issue, i.e., serial multi-issue.
本发明所述的串行连接多处理器核结构构成流水线层次结构 (pipeline hierarchy), 串行连接多处理器核结构中任意核的内部流水线为第一个层次,串行连接多处理器核结构中每个核作为一个宏观流水线段而构成的宏观流水线为第二个层次,依此类推还可以得到更多更高层次, 如以串行连接多处理器核结构作为一个更高层次流水线段而构成的第三个层次。
本发明所述的串行连接多处理器核结构中核上的代码片段经由前编译 (pre-compile)、编译 (compile) 和后编译 (post-compile) 三步骤中的部分或全部步骤产生单数个或复数个代码片段, 所述程序代码包括但不限于高级语言代码和汇编语言代码。
所述编译即现有通常意义上从程序源代码到目标代码的编译;
所述前编译是在所述编译进行前对程序源代码的预编译,包括但不限于在进行程序编译 前将程序中的 "调用"(call ) 进行展开, 用实际调用的代码替代调用语句, 形成没有调用 的程序代码; 所述调用包括但不限于函数调用;
所述后编译是按要求分配到所述串行连接多处理器核结构中每个核的工作内容及负荷将所述编译得到的目标代码划分为单数个或复数个代码片段, 步骤包括但不限于:
(a) 对可执行的程序代码进行解析, 生成前端码流;
(b ) 在特定模型上运行、 扫描前端码流, 根据要求分析所需执行周期、 是否跳转以及 跳转地址等信息, 统计扫描结果, 间接确定分割信息; 或不扫描前端码流, 根据预设信息直 接确定分割信息; 所述特定模型包括但不限于所述串行连接多处理器核结构中核的行为模 型;
(c) 根据分割信息对可执行的程序指令代码进行分割, 生成所述串行连接多处理器核结构中每个处理器核相应的代码片段。
本发明所述的前编译方法在程序源代码编译前实施,也可以作为编译器的组成部分在程 序源代码编译过程中实施, 还可以作为所述串行连接多处理器核结构的操作系统的组成部 分、 或作为驱动、 或作为应用程序, 在所述串行连接多处理器核结构运行时实时实施。
本发明所述的后编译方法可以在程序源代码编译完成之后实施,也可以作为编译器的组 成部分在程序源代码编译过程中实施,还可以作为包括但不限于所述串行连接多处理器核结 构的操作系统的组成部分、驱动、应用程序, 在所述串行连接多处理器核结构运行时实时实 施。 当所述后编译方法实时实施时, 可以人为确定所述代码片段中的相应配置信息, 也可以 根据所述串行连接多处理器核结构的使用情况动态地自动产生所述代码片段中的相应配置 信息, 还可以只产生固定的配置信息。
通过所述分割, 可将现有的应用程序进行程序分割, 分段同时执行, 不但提高了现有程 序在多核 /众核装置上的运行速度,而且充分发挥了多核 /众核装置的效率, 同时也保证了多 核 /众核装置对现有应用程序的兼容。有效地解决了现有应用程序无法充分发挥多核 /众核处 理器优势的困境。
本发明所述的后编译方法中,间接确定分割信息的依据包括但不限于指令执行的周期数 或时间、指令的条数, 即可以根据扫描前端码流获得的指令执行周期数或时间, 将整个可执 行程序代码分割成相同或相近运行时间的代码片段,也可以根据扫描前端码流获得的指令条 数,将整个可执行程序代码分割成相同或相近指令条数的代码片段;所述直接确定分割信息 的依据包括但不限于指令的条数, 即可以根据指令的条数,直接将整个可执行程序代码分割 成相同或相近指令条数的代码片段。
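上述按指令执行周期数间接确定分割信息的做法, 可以用如下示意性的 Python 片段勾勒 (仅为基于本文描述的简化贪心模型, 函数名与参数均为假设, 并非实际后编译器实现):

```python
def split_by_cycles(cycles, num_cores):
    """cycles[i] 为扫描前端码流得到的第 i 条指令的执行周期数;
    将整个可执行程序代码贪心地分割成运行时间相近的代码片段。"""
    target = sum(cycles) / num_cores      # 每个核的目标周期数, 以实现负载平衡
    segments, current, acc = [], [], 0
    for i, c in enumerate(cycles):
        current.append(i)                 # 当前片段收入第 i 条指令
        acc += c
        if acc >= target and len(segments) < num_cores - 1:
            segments.append(current)      # 周期数达到目标, 结束当前片段
            current, acc = [], 0
    segments.append(current)              # 剩余指令归入最后一个片段
    return segments
```

若依据指令条数直接确定分割信息, 只需将 `cycles` 全部取 1, 即退化为按条数均分。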
本发明所述的后编译方法中,所述可执行程序代码分割时根据特定规则尽可能避免对循 环代码进行分割。当无法避免对循环代码分割时,根据特定规则将所述循环代码通过单数次 或复数次分割形成复数个更小规模的循环代码。所述复数个更小规模的循环代码可以分别是 相同或不同的代码片段的组成部分。所述更小规模的循环代码包括但不限于包含更少的代码 数目的循环代码和代码执行周期数更少的循环代码。
本发明所述的后编译方法中,所述代码片段包括但不限于适用于固定处理器核数目的所述串行连接多处理器核结构运行的已分段的可执行目标代码和/或相应配置信息, 适用于所述串行连接多处理器核结构运行的未分段的可执行目标代码以及包含适用于不固定核数目的多种分段信息的相应配置信息, 其中分段信息包括但不限于包含代表每段指令数目的数字, 代表分段边界的特定标志, 每个代码片段开始信息的指示表。
举例来说,在一个有 1000个处理器核的所述装置中,可以按最大处理器数目 1000生成 一张有 1000个项的表,每一项存储相应指令在所述未分段的可执行目标代码中的位置信息, 两项之间的指令组合即对应可以在相应单个核上运行的代码片段。 若在运行时用到了全部 1000 个处理器核, 则每个处理器核运行所述表中相应两项所指向的未分段的可执行目标代 码位置间的代码, 即每个处理器核运行所述表中对应的一段代码。 若在运行时只用到了 N 个处理器核(N<1000), 则每个处理器核运行所述表中对应的 1000/N段代码, 具体代码可以 根据表中相应位置信息确定。
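上述按指示表为不固定核数分配代码段的做法, 可以用如下 Python 函数示意 (为便于说明以 8 项小表代替 1000 项表; 函数名与参数均为假设的简化模型):

```python
def core_code_range(table, code_len, num_cores, core_id):
    """table[i] 为第 i 段代码在未分段可执行目标代码中的起始位置;
    返回编号为 core_id 的处理器核应运行的代码区间 [start, end)。"""
    per_core = len(table) // num_cores        # 每核运行的段数, 对应 1000/N
    start = table[core_id * per_core]
    end_index = (core_id + 1) * per_core
    # 最后一个核的区间延伸到未分段目标代码末尾
    end = table[end_index] if end_index < len(table) else code_len
    return start, end
```

例如 8 段代码分给 4 个核时, 每核运行表中相邻两项所指向的两段代码。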
在每个处理器核上运行的指令除所述分割后的相应代码片段外, 还可以包括额外的指 令。所述额外的指令包括但不限于代码片段头部扩展、代码片段尾部扩展, 用于实现不同处 理器核间指令执行的平滑过渡。举例来说,可以在每个代码片段的末尾加上代码片段尾部扩 展,将寄存器堆中所有值存储到数据存储器中的特定位置,在每个代码片段的开头加上代码 片段头部扩展,从数据存储器中的特定位置中的值读取到寄存器堆中, 以此实现不同处理器 核间的寄存器值传递, 保证程序的正确运行; 当执行到代码片段的末尾时, 下一条指令从所 述代码片段的第一条指令开始。
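上述代码片段头部扩展与尾部扩展可以用如下 Python 片段示意, 生成假想的存取指令序列 (指令助记符、寄存器数目 NUM_REGS 与基地址 SPILL_BASE 均为假设, 仅为说明思路):

```python
SPILL_BASE = 0x100   # 假设的数据存储器中用于传递寄存器值的特定位置
NUM_REGS = 4         # 假设寄存器堆中需要传递的寄存器数目

def tail_extension():
    # 代码片段尾部扩展: 将寄存器堆中所有值存储到数据存储器特定位置
    return [f"store r{i}, [0x{SPILL_BASE + 4 * i:x}]" for i in range(NUM_REGS)]

def head_extension():
    # 代码片段头部扩展: 从数据存储器特定位置读取值到寄存器堆
    return [f"load r{i}, [0x{SPILL_BASE + 4 * i:x}]" for i in range(NUM_REGS)]
```

前级核的尾部扩展与本级核的头部扩展一一对应, 即可实现核间寄存器值的平滑传递。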
本发明所述的数据处理的方法与装置,可以构建出一种基于串行多发射和流水线层次结构的可配置多核/众核装置, 包括复数个处理器核 (Processor Core)、 复数个可配置本地存储器 (configurable local memory)、 可配置互联结构 (configurable interconnect structure), 其中: 处理器核, 用于执行指令, 进行运算并得到相应结果;
可配置本地存储器, 用于存储指令以及所述处理器核间的数据传递和数据保存; 可配置互联结构, 用于所述可配置多核 /众核装置内各模块间及与外部的连接。
所述可配置多核 /众核装置还可以包括扩展模块, 以适应更广泛的需求; 所述扩展模块 包括但不限于单数个或复数个以下模块的部分或全部:
共享存储器 (shared memory), 用于在所述可配置数据存储器溢出的情况下保存数据、 传递复数个处理器核间的共享数据;
直接存储器访问 (DMA) 控制器, 用于除处理器核外其他模块对所述可配置本地存储器 的直接访问;
异常处理 (exception handl ing ) 模块, 用于处理处理器核、 本地存储器发生的异常 ( exception );
本发明所述的基于串行多发射和流水线层次结构的可配置多核 /众核装置中, 处理器核 包括运算单元和程序计数器, 还可以包括扩展模块以适应更广泛的需求,所述扩展模块包括 但不限于寄存器堆。 所述处理器核执行的指令包括但不限于算术运算指令、 逻辑运算指令、 条件判断及跳转指令、异常陷入及返回指令; 所述算术运算指令、逻辑运算指令包括但不限 于乘法、 加 /减法、 乘加 /减、 累加、 移位、 提取、 交换操作, 且包括任意位宽小于等于所述 处理器核数据位宽的定点运算和浮点运算; 每个所述处理器核完成单数条或复数条所述指 令。 所述处理器核的数目可以根据实际应用需求进行扩展。
本发明所述的基于串行多发射和流水线层次结构的可配置多核/众核装置中, 每个所述处理器核都有相应的可配置本地存储器, 包括用于存放分割后代码片段的指令存储器 (instruction memory) 和用于存放数据的可配置数据存储器 (configurable data memory)。 在同一可配置本地存储器中,所述指令存储器与可配置数据存储器之间的边界是可以根据不同配置信息改变的。当根据配置信息确定可配置数据存储器的大小与边界后,所述可配置数据存储器包括复数个数据子存储器。
在同一可配置数据存储器中,所述复数个数据子存储器之间的边界是可以根据不同配置 信息改变的。所述数据子存储器通过地址转换能映射到所述多核 /众核装置的全部地址空间。 所述映射包括但不限于通过査表进行地址转换和通过内容寻址存储器 (CAM) 匹配进行地址 转换。
所述数据子存储器中每项 (entry) 包含数据和标志信息, 所述标志信息包括但不限于有效位 (valid bit)、 数据地址。 所述有效位用于指示相应项中存储的数据是否有效。 所述数据地址用于指示相应项中存储的数据在所述多核/众核装置的全部地址空间应处于的位置。
本发明所述的基于串行多发射和流水线层次结构的可配置多核 /众核装置中, 所述可配 置互联结构通过配置用于所述可配置多核 /众核装置内各模块间及与外部的连接, 包括但不 限于处理器核与相邻可配置本地存储器的连接,处理器核与共享存储器的连接、处理器核与 直接存储器访问控制器的连接、可配置本地存储器与共享存储器的连接、可配置本地存储器 与直接存储器访问控制器的连接、可配置本地存储器与所述装置外部的连接和共享存储器与 所述装置外部的连接。
根据配置,可以使两个处理器核及其相应本地存储器构成前后级连接关系,包括但不限于前一级处理器核通过其相应的可配置数据存储器将数据传输到后一级处理器核。 根据应用程序要求,可以通过配置将部分或全部处理器核及其相应本地存储器通过可配置互联结构构成单数个或复数个串行连接结构。复数个所述串行连接结构可以各自独立,也可以部分或全部有相互联系, 串行、并行或串并混合地执行指令。所述串行、 并行或串并混合地执行指令包括但不限于根据应用程序要求不同串行连接结构在同步机制的控制下运行不同的程序段并行执行不同指令、多线程并行运行,根据应用程序要求不同串行连接结构在同步机制的控制下运行相同的程序段、 以单指令多数据流(SIMD)方式进行相同指令、 不同数据的密集运算。
本发明所述的基于串行多发射和流水线层次结构的可配置多核/众核装置中, 所述串行连接结构中处理器核具有特定的数据读规则 (read policy)、 写规则 (write policy)。
所述数据读规则,即所述串行连接结构中第一个处理器核的输入数据来源包括但不限于 本身相应的可配置数据存储器、 共享存储器、 所述可配置多核 /众核装置外部。 其他任意处 理器核的输入数据来源包括但不限于本身相应的可配置数据存储器、前一级处理器核相应的 可配置数据存储器。相应地,任意所述处理器核的输出数据的去向包括但不限于本身相应的 可配置数据存储器、共享存储器, 当扩展存储器存在时, 任意所述处理器核的输出数据的去 向还可以是扩展存储器。
所述数据写规则,即所述串行连接结构中第一个处理器核相应可配置数据存储器的输入 数据来源包括但不限于处理器核本身、 共享存储器、 所述可配置多核 /众核装置外部。 其他 任意处理器核相应可配置数据存储器的输入数据来源包括但不限于处理器核本身、前一级处 理器核相应可配置数据存储器、共享存储器。所述处理器核及其相应可配置数据存储器不同 来源的输入数据按特定规则进行多路选择以确定最终的输入数据。
同一个所述可配置数据存储器可以同时被其前后级的两个处理器核访问,不同的处理器 核各自访问所述可配置数据存储器中的不同数据子存储器。所述处理器核可以根据特定规则 对同一个可配置数据存储器中不同数据子存储器分别访问,所述特定规则包括但不限于同一 个可配置数据存储器中不同数据子存储器互为乒乓缓冲 (Ping- pong buffer ), 由两个处理 器核分别访问, 在所述前后两级处理器核均完成对乒乓缓冲的访问后, 进行乒乓缓冲交换, 使原先被前一级处理器核读 /写的数据子存储器作为被后一级处理器核读的数据子存储器, 原先被后一级处理器核读的数据子存储器中所有有效位均被置为无效,并作为被前一级处理 器核读 /写的数据子存储器。 当所述多核 /众核系统中处理器核包含寄存器堆时, 还需要具有特定的寄存器值传输规 则,所述寄存器值传输规则, 即所述串行连接结构中任意前级处理器核中的单数个或复数个 寄存器值都可以传输到任意后级处理器核的相应寄存器中。所述寄存器值包括但不限于所述 处理器核中寄存器堆中寄存器的值。所述寄存器值的传输途径包括但不限于通过可配置互联 结构传输,直接通过共享存储器传输,直接通过所述处理器核相应的可配置数据存储器传输, 根据特定指令通过共享存储器传输、根据特定指令通过所述处理器核相应的可配置数据存储 器传输。
本发明所述的基于串行多发射和流水线层次结构的可配置多核 /众核装置中, 所述流水 线层次结构中的第二层次即宏观流水线段可以通过背压 (back pressure) 将本宏观流水线 段的信息传输到前一级宏观流水线段,所述前一级宏观流水线段根据收到的背压信息可知之 后的宏观流水线是否阻塞(stall ), 结合本宏观流水线段的情况, 确定本宏观流水线段是否 阻塞, 并将新的背压信息传输到更前一级宏观流水线段, 以此实现宏观流水线的控制。
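上述背压控制可以用如下 Python 片段示意, 将每级宏观流水线段的阻塞判断简化为布尔运算 (函数名与数据结构均为假设, 仅为简化模型):

```python
def propagate_backpressure(stall_from_last, busy):
    """busy[i] 为第 i 级宏观流水线段自身是否需要阻塞;
    stall_from_last 为最后一级之后传来的背压;
    返回每级是否阻塞, 背压信息从后级向前级逐级传递。"""
    stalls = [False] * len(busy)
    stall = stall_from_last
    for i in range(len(busy) - 1, -1, -1):
        stall = stall or busy[i]   # 之后的宏观流水线阻塞或本段自身忙, 则本段阻塞
        stalls[i] = stall          # 新的背压信息继续传向更前一级
    return stalls
```

例如中间一级忙时, 它和它之前的各级都被阻塞, 而它之后的级可继续流动。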
本发明所述的基于串行多发射和流水线层次结构的可配置多核/众核装置中, 可以有扩展的共享存储器,用于在处理器核相应可配置数据存储器溢出的情况下存储数据、传递复数个处理器核间的共享数据; 还可以有扩展的异常处理 (exception handling)模块, 用于处理处理器核、 本地存储器发生的异常 (exception)。
当所述多核 /众核装置有共享存储器且向可配置数据存储器存储数据时发生溢出, 则产 生异常, 并将被存储数据存储到共享存储器中, 此时, 所述数据子存储器中每项 (entry ) 包含的标志信息包括但不限于有效位、数据地址和数据标签(tag)。所述有效位用于指示相 应项中存储的数据是否有效。 所述数据地址和数据标签 (tag) 共同用于指示相应项中存储 的数据在所述多核 /众核装置的全部地址空间应处于的位置。
所有所述处理器核产生的异常信息均传输到异常处理模块,由异常处理模块进行相应处 理。所述异常处理模块可以由所述多核 /众核装置中的处理器核构成,也可以是额外的模块。 所述异常信息包括但不限于发生异常的处理器编号、异常类型。所述对发生异常的处理器核 和 /或本地存储器的相应处理包括但不限于通过背压信号的传递将流水线是否阻塞的信息传 递到串行连接结构中的各个处理器核。
本发明所述的基于串行多发射和流水线层次结构的可配置多核/众核装置中, 可以根据应用程序要求对处理器核、可配置本地存储器和可配置互联结构进行配置。所述配置包括但不限于开启或关断处理器核、 配置本地存储器中指令存储器和数据子存储器的大小/边界及其中的内容、 配置互联结构和连接关系。
用于所述配置的配置信息的来源包括但不限于所述可配置多核 /众核装置内部和外部。 所述配置可以在运行期间任意时刻根据应用程序的要求进行调整。所述配置的配置方法包括 但不限于由处理器核或中央处理器核直接配置、由处理器核或中央处理器核通过直接存储器 访问控制器配置和外部请求通过直接存储器访问控制器配置。
本发明所述的基于串行多发射和流水线层次结构的可配置多核 /众核装置具有三个层次 的低功耗技术: 配置层次、 指令层次和应用层次。
所述配置层次, 根据配置信息, 没有被用到的处理器核可以进入低功耗状态; 所述低功 耗状态包括但不限于降低处理器时钟频率或切断电源供应。
所述指令层次, 当处理器核执行到读取数据的指令时, 如果该数据还没有准备好, 则所 述处理器核进入低功耗状态,直到所述数据准备好,所述处理器核再从低功耗状态恢复到正 常工作状态。所述数据没有准备好,包括但不限于前一级处理器核还没有将本级处理器核需 要的数据写入相应数据子存储器。所述低功耗状态包括但不限于降低处理器时钟频率或切断 电源供应。
所述应用层次, 采用全硬件实现, 匹配空闲 (idle) 任务特征, 确定当前处理器核的使用率 (utilization), 根据当前处理器使用率和基准使用率确定是否进入低功耗状态或是否从低功耗状态返回。所述的基准使用率可固定不变, 也可重新配置或自学习确定, 可以固化在芯片内部, 也可以在所述装置启动时由所述装置写入, 也可以由软件写入。用于匹配的参考内容可以在芯片生产时, 固化到芯片内部,也可以在所述装置启动时由所述装置或软件写入, 还可以自学习写入, 其存储媒介包括但不限于挥发性的存储器、 非挥发性的存储器; 其写入方式包括但不限于一次写入、可多次写入。所述低功耗状态包括但不限于降低处理器时钟频率或切断电源供应。
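上述应用层次的低功耗判断思路可以用如下 Python 片段示意 (实际为全硬件实现, 此处仅以软件模型表达; 将空闲任务特征简化为字符串匹配, 函数名与参数均为假设):

```python
IDLE_PATTERN = "idle"   # 假设的空闲任务特征, 实际为固化或写入的匹配参考内容

def power_decision(samples, baseline_util):
    """samples 为一段观察窗口内处理器核的任务采样序列;
    统计当前使用率并与基准使用率比较, 决定是否进入低功耗状态。"""
    idle_count = sum(1 for s in samples if s == IDLE_PATTERN)  # 匹配空闲任务特征
    utilization = 1 - idle_count / len(samples)                # 当前使用率
    return "low_power" if utilization < baseline_util else "normal"
```

基准使用率 `baseline_util` 对应文中可固定、可重新配置或自学习确定的基准值。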
本发明所述的基于串行多发射和流水线层次结构的可配置多核 /众核装置可以具备自测 试能力, 能够在加电工作的情况下不依赖于外部设备进行芯片的自测试。
当所述多核 /众核装置具备自测试能力时,可以将所述多核 /众核装置中特定的单数个或复数个基本元件、 运算单元或处理器核用作比较器, 对所述多核 /众核装置中相应的复数组其他基本元件、运算单元或处理器核及基本元件、运算单元或处理器核的组合给予具有特定关系的激励,并用所述比较器比较所述复数组其他基本元件、运算单元或处理器核及基本元件、运算单元或处理器核的组合的输出是否符合相应的特定关系。所述激励可以来自所述多核 /众核装置中的特定模块,也可以来自所述多核 /众核装置外部。所述特定关系包括但不限于相等、 相反、 互逆、 互补。 所述测试结果可以被送到所述多核 /众核装置外部, 也可以保存在所述多核 /众核装置中的存储器中。
所述的自测试可以是在晶圆测试,封装后集成电路测试或者芯片使用时在所述装置启动 时进行测试, 也可以人为设定自测试条件及周期, 在工作期间定期进行自测试。所述自测试 用到的存储器包括但不限于挥发性的存储器, 非挥发性的存储器。
当所述多核 /众核装置具备自测试能力时, 可以具备自修复能力。 当所述测试结果保存 在所述多核 /众核装置中的存储器中时,可以对失效处理器核作标记,在对所述多核 /众核装 置进行配置时, 可以根据相应标记绕过失效处理器核, 使所述多核 /众核装置依然能正常工 作, 实现自修复。所述自修复可以是在晶圆测试后进行, 封装后集成电路测试后进行或者芯 片使用时在所述装置启动时进行测试后进行,也可以人为设定自测试自修复条件及周期,在 工作期间定期进行自测试后进行。
本发明所述可配置多核 /众核装置中的复数个处理器核可以是同构的,也可以是异构的。 本发明所述可配置多核 /众核装置中本地指令存储器中指令字的长度可以是不固定的。 本发明所述可配置多核 /众核装置中本地指令存储器和本地数据存储器各自可以有单数 组或复数组读端口。
本发明所述可配置多核 /众核装置中,每个处理器核还可以对应复数个本地指令存储器, 所述复数个本地指令存储器可以是相同大小的, 也可以是不同大小的; 可以是相同结构的, 也可以是不同结构的。当所述复数个本地指令存储器中的一个或多个用于响应相应处理器核 取指操作时, 所述复数个本地指令存储器中的其他本地指令存储器可以进行指令更新操作。 更新指令的途径包括但不限于通过直接存储器访问控制器更新指令。
本发明所述可配置多核 /众核装置中的复数个处理器核可以工作在相同的时钟频率, 也 可以工作在不同的时钟频率。
本发明所述可配置多核 /众核装置可以具有读取导致写的特性 (LIS, load induced store): 处理器核对于某个地址数据第一次读取时, 从相邻前一级处理器核对应的本地数据存储器读取数据, 同时将读取到的数据写入本级处理器核对应的本地数据存储器,之后对该地址数据的读写都访问本级对应的本地数据存储器,从而在不增加额外开销的情况下实现相邻前后级本地数据存储器中相同地址数据的传递。
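上述 LIS 特性可以用如下 Python 片段示意 (以字典模拟前后级本地数据存储器, 均为假设的简化模型):

```python
def lis_read(addr, own_mem, prev_mem):
    """读取导致写 (load induced store): 第一次读取某地址时,
    从前级本地数据存储器 prev_mem 取数并同时写入本级 own_mem;
    之后对该地址的读写都访问本级存储器。"""
    if addr not in own_mem:
        own_mem[addr] = prev_mem[addr]   # 读取的同时写入本级, 完成数据传递
    return own_mem[addr]
```

这样相同地址的数据在前后级本地数据存储器间逐级传递, 不需要额外的搬运操作。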
本发明所述可配置多核 /众核装置可以具有数据预传递的特性; 处理器核可以从前一级 处理器核对应的本地数据存储器中读取本处理器核不需要读写、但后续处理器核需要读取的 数据,并写入本级处理器核对应的本地数据存储器,从而实现前后级本地数据存储器中相同 地址数据的逐级传递。
本发明所述的本地数据存储器还可以包含单数个或复数个有效标志和单数个或复数个 归属标志。所述有效标志用于表示对应的数据是否有效。所述归属标志用于表示对应的数据 当前被哪个处理器核使用。采用所述有效标志和归属标志能避免使用乒乓缓冲,提高存储器 的使用效率, 且多个处理器核可以同时访问同一个数据存储器, 便于数据交换。
本发明所述的通过可配置互联结构传输寄存器值,包括但不限于采用大量硬连线直接将所述处理器核中寄存器的值一次全部传输到后级处理器核的寄存器中,采用移位寄存器的方法将所述处理器核中寄存器的值依次移位传输到后级处理器核的寄存器中。
所述寄存器值的传输途径还可以是根据寄存器读写记录表决定需要传输的寄存器。本发 明所述的寄存器读写记录表用于记录寄存器对相应本地数据存储器的读写情况。如果寄存器 的值已经被写入本级处理器核对应的本地数据存储器且之后该寄存器的值没有发生改变,则 可以仅由后级处理器核从本级处理器核对应的本地数据存储器中相应地址读取数据,从而完 成所述寄存器的传递, 不需要单独传输该寄存器值到后级处理器。
举例而言, 当寄存器的值写入相应本地数据存储器时,所述寄存器读写记录表中相应的 项被清 "0", 当数据写入寄存器时, 所述寄存器读写记录表中相应的项被置 " 1 "。在进行寄 存器值传输时, 只传输寄存器读写记录表中项为 " 1 " 的相应寄存器的值。 所述数据写入所 述寄存器堆中寄存器,包括但不限于从相应本地数据存储器读取数据到所述寄存器堆中的寄 存器, 将指令执行的结果写回寄存器堆中的寄存器。
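上述寄存器读写记录表的置位、清零及传输规则可以用如下 Python 类示意 (类名与接口均为假设的简化模型):

```python
class RegisterRecord:
    """寄存器读写记录表: 记录各寄存器的值是否尚未写入本地数据存储器。"""
    def __init__(self, num_regs):
        self.table = [0] * num_regs

    def on_write_reg(self, i):
        self.table[i] = 1      # 数据写入寄存器时, 相应项被置 "1"

    def on_store_to_mem(self, i):
        self.table[i] = 0      # 寄存器值写入本地数据存储器时, 相应项被清 "0"

    def regs_to_transfer(self):
        # 进行寄存器值传输时, 只传输表中项为 "1" 的相应寄存器
        return [i for i, b in enumerate(self.table) if b == 1]
```

项为 "0" 的寄存器值可由后级核直接从本地数据存储器相应地址读取, 无需单独传输。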
当本发明所述的基于串行多发射和流水线层次结构的可配置多核 /众核装置中处理器核 数目确定时,还可以根据分割后得到的确定的代码片段对代码片段头部扩展和代码片段尾部 扩展进行优化, 减少需要传递的寄存器的数量。
举例而言,在通常情况下,代码片段尾部扩展包含了将全部寄存器值存储到特定本地数 据存储器地址的指令,代码片段头部扩展包含了将相应地址中的值读入寄存器的指令,两者 配合实现寄存器值平滑传递。 当代码片段确定时, 可以根据代码片段中的指令, 减少代码片 段头部扩展和代码片段尾部扩展中存储和 /或读取指令的条数。
如果在本级处理器核对应的代码片段中,在写入某一寄存器之前没有使用过该寄存器内 的值,则可以省去前级处理器核对应的代码片段尾部扩展中存储该寄存器值的指令和本级处 理器核对应的代码片段头部扩展中从本地数据存储器中读取数据到该寄存器的指令。
如果在前级处理器核对应的代码片段中,某一寄存器的值在存储到本地数据存储器之后 就没有改变过, 则可以省去前级处理器核对应的代码片段尾部扩展中存储该寄存器值的指 令,并在本级处理器核对应的代码片段头部扩展中添加相关指令,使能够从本地数据存储器 中相应地址读取数据到该寄存器。
本发明所述的数据处理的方法与装置中,当复数个处理器核对应的代码片段执行过程中 均会转移到同一地址执行一段代码, 并在该段代码执行完毕转移回各自对应的代码片段时, 可以将所述同一地址的代码重复存储在所述复数个处理器核对应的本地指令存储器中;所述 同一地址的代码包括但不限于函数调用、 循环。
本发明所述的数据处理的方法与装置中,所述处理器核可以访问除所述处理器核外的处 理器核的本地指令存储器; 当复数个处理器核执行完全相同的代码,且所述代码长度超过单 个处理器核对应的本地指令存储器大小时,可以将所述代码依次存储在复数个处理器核对应 的本地指令存储器中;运行时,所述复数个处理器核中的任一处理器核先从存储所述完全相 同的代码中第一段代码的本地指令存储器读取指令并执行,第一段代码执行完毕后再从存储 所述代码中第二段代码的本地指令存储器读取指令并执行,依此类推,直到全部所述完全相 同的代码执行完毕。
本发明所述的数据处理的方法与装置中,所述复数个处理器核可以同步执行所述完全相 同的代码中的各段代码,也可以异步执行所述完全相同的代码中的各段代码;所述复数个处 理器核可以并行执行所述完全相同的代码中的各段代码,也可以串行执行所述完全相同的代 码中的各段代码; 还可以串并混合地执行所述完全相同的代码中的各段代码。
本发明所述的数据处理的方法与装置中,所述处理器核还可以对应复数个本地指令存储 器, 所述复数个本地指令存储器可以是相同大小的, 也可以是不同大小的; 可以是相同结构 的,也可以是不同结构的; 当所述复数个本地指令存储器中的一个或多个用于响应相应处理 器核取指操作时,所述复数个本地指令存储器中的其他本地指令存储器可以进行指令更新操 作; 更新指令的途径可以是通过直接存储器访问控制器更新指令。
传统片上系统 (SoC, System on Chip) 中除处理器外, 其他功能模块都是用硬连线逻 辑实现的专用集成电路模块。这些功能模块的性能要求很高,采用传统的处理器难以达到性 能要求, 因此无法以传统处理器替代这些专用集成电路模块。
本发明所述的数据处理的方法与装置中,可以将单数个或复数个处理器核及其相应本地存储器构成高性能的多核连接结构,对多核连接结构进行配置、在相应本地指令存储器中放入对应的代码片段,使所述多核连接结构实现特定的功能,能替代片上系统中的专用集成电路模块。所述多核连接结构相当于片上系统中的功能模块,如图像解压缩模块或加解密模块。 这些功能模块再由系统总线连接, 以实现片上系统。
本发明所述的处理器核及其相应本地存储器与相邻处理器核及其相应本地存储器之间 的数据传输通道为本地连接(local interconnection), 单数个所述处理器核及其相应本地 存储器或通过本地连接连在一起的复数个处理器核及其相应本地存储器构成的多核连接结 构即对应片上系统的功能模块。
本发明所述的对应于片上系统中功能模块的多核连接结构与其他所述对应于片上系统 中功能模块的多核连接结构之间的数据传输通道为系统总线 (system bus)。 通过所述系统 总线将复数个对应于片上系统中功能模块的多核连接结构连接起来,就能实现通常意义上的 片上系统。
基于本发明技术方案实现的片上系统,具有传统片上系统不具备的可配置性。通过对基于本发明所述的数据处理装置进行不同配置,可以得到不同的片上系统。所述配置可以在运行过程中实时进行,从而可以在运行过程中实时改变片上系统功能。可以动态地重新配置处理器核及其相应本地存储器并动态改变相应本地指令存储器中的代码片段,从而改变所述片上系统的功能。
根据本发明技术方案,所述对应于片上系统中功能模块的多核连接结构内部处理器核及其相应本地存储器与其他处理器核及其相应本地存储器间用于数据传输的通路属于功能模块内部的本地连接。通过所述功能模块内部的本地连接传输数据,通常需要占用提出传输请求的处理器核的操作。本发明所述的系统总线, 可以是所述本地连接, 也可以是不需要占用处理器核的操作即能完成不同处理器核及其相应本地存储器间数据传输的数据传输通道。所述不同处理器核及其相应本地存储器可以是相邻的, 也可以是不相邻的。 本发明所述的数据处理的方法与装置中,构成系统总线的一个方法是采用复数个位置固定的连接装置建立数据传输通道。任意所述多核连接结构的输入和输出都与相近的连接装置通过单数根或复数根硬连线相连。 所有所述连接装置之间也通过单数根或复数根硬连线相连。所述连接装置、所述多核连接结构与所述连接装置间的连线、及所述连接装置间的连线共同构成所述系统总线。
本发明所述的数据处理的方法与装置中,构成系统总线的另一个方法是建立数据传输通 道,使任意处理器核及其相应本地数据存储器能与其他任意处理器核及其相应本地数据存储 器进行数据传递。所述数据传递的途径包括但不限于通过共享存储器传递、通过直接存储器 访问控制器传递、 通过专用总线或网络传递。
举例而言,一种方法是,可以事先在一些处理器核及其相应本地数据存储器中的两两处 理器核及其相应本地数据存储器之间布置好单数根或复数根硬连线,所述硬连线可以是可配 置的;当这些处理器核及其相应本地数据存储器中的任意两个处理器核及其相应本地数据存 储器处于不同的多核连接结构中、即处于不同的功能模块中时,所述两个处理器核及其相应 本地数据存储器之间的硬连线即可作为所述两个多核连接结构间的系统总线。
第二种方法是,可以使全部或部分所述处理器核及其相应本地数据存储器能通过直接存 储器访问控制器访问到其他的处理器核及其相应本地数据存储器。当这些处理器核及其相应 本地数据存储器中的任意两个处理器核及其相应本地数据存储器处于不同的多核连接结构 中、 即处于不同的功能模块中时, 就可以在实时运行过程中, 根据需要进行所述处理器核及 其相应本地数据存储器与另一个所述处理器核及其相应本地数据存储器间的数据传递,实现 两个多核连接结构间的系统总线。
第三种方法是,可以在全部或部分所述处理器核及其相应本地数据存储器上实现片上网 络 (Network on Chip ) 功能, 即当所述处理器核及其相应本地数据存储器的数据传输到其 他处理器核及其相应本地数据存储器时, 由可配置互联网络决定数据的去向,从而构成一条 数据通路,实现数据传输。当这些处理器核及其相应本地数据存储器中的任意两个处理器核 及其相应本地数据存储器处于不同的多核连接结构中、即处于不同的功能模块中时,就可以 在实时运行过程中,根据需耍进行所述处理器核及其相应本地数据存储器与另一个所述处理 器核及其相应本地数据存储器间的数据传递, 实现两个多核连接结构间的系统总线。 上述三种方法, 第一种方法采用硬连线结构实现的系统总线, 其连接是静态的, 第二种 采用直接存储器访问、 第三种方法采用片上网络方法, 其连接是动态的。
本发明所述的数据处理的方法与装置中,所述处理器核可以具有快速条件判断机制,用 以确定分支转移是否执行;所述快速条件判断机制可以是用于判断循环条件的计数器,也可 以是用于判断分支转移及循环条件的硬件有限状态机。
本发明所述配置层次低功耗,还可以根据配置信息,使特定的处理器核进入低功耗状态; 所述特定的处理器核包括但不限于没有被用到的处理器核, 工作负载相对较低的处理器核; 所述低功耗状态包括但不限于降低处理器时钟频率或切断电源供应。
本发明所述的数据处理的方法与装置中,还可以包括单数个或复数个专用处理模块。所 述专用处理模块能作为宏模块供所述处理器核及其相应本地存储器调用,也可以作为独立的 处理模块接收所述处理器核及其相应本地存储器的输出,并将处理结果送往所述处理器核及 其相应本地存储器或其他处理器核及其相应本地存储器。向所述专用处理模块输出的处理器 核及其相应本地存储器与接收所述专用处理模块输出的处理器核及其相应本地存储器可以 是同一处理器核及其相应本地存储器,也可以是不同处理器核及其相应本地存储器。所述专 用处理模块包括但不限于快速傅立叶变换 (FFT) 模块、 熵编码模块、 熵解码模块、 矩阵乘 法模块、 卷积编码模块、 维特比码 (Viterbi Code ) 解码模块、 涡轮码 (Turbo Code) 解码 模块。
以矩阵乘法模块为例,如果使用单个所述处理器核进行大规模的矩阵乘法,需要大量时 钟周期, 限制了数据吞吐率的提高; 如果使用多个所述处理器核实现大规模矩阵乘法, 虽然 能减少执行周期数, 但增加了处理器核间的数据传递量, 且占用大量处理器资源。采用专用 的矩阵乘法模块, 可以在少数个周期内完成大规模矩阵乘法。在对程序进行划分时, 可以将 该大规模矩阵乘法前的操作分配到若干个处理器核, 即前组处理器核中,将该大规模矩阵乘 法后的操作分配到另外的若干个处理器核, 即后组处理器核中,前组处理器核的输出中需要 参与该大规模矩阵乘法的数据被送到专用的矩阵乘法模块,经处理后再将结果送往后组处理 器核,前组处理器核的输出中不需要参与该大规模矩阵乘法的数据则被直接送往后组处理器 核。
有益效果: 首先, 本发明所述的数据处理的方法与装置,能够将串行的程序代码分割成适应于串行 连接多处理器核结构中各个处理器核运行的代码片段,针对不同数目的处理器核根据不同的 分割规则分割成不同大小和数目的代码片段, 适合可扩展 (Scalable ) 的多核 /众核装置 / 系统应用。
其次,根据本发明所述的数据处理的方法与装置,将代码片段分配给串行连接多处理器 核结构中各个处理器核运行,每个处理器核执行特定的指令,全部处理器核串行连接实现程 序的完整功能,从完整程序代码中分割出来的代码片段之间用到的数据通过专门的传输途径 传输,几乎没有数据相关性问题,实现了真正的多发射。在所述串行连接多处理器核结构中, 其多发射的发射数量即等于处理器核的数量,大大提高了运算单元的利用率,从而实现串行 连接多处理器核结构、 乃至装置 /系统的高吞吐率。
再次, 用本地存储器替代了处理器中通常会有的缓存(cache)。每个处理器核相应的本 地存储器中保存了该处理器核要用到的所有指令和数据, 做到了 100%的访问命中率 (hit rate ) , 解决了缓存缺失 (cache miss ) 造成的访问外部低速存储器的速度瓶颈问题, 进一 步提高了装置 /系统的整体性能。
再次, 本发明所述的多核 /众核装置具有三个层次的低功耗技术, 不但能够采用如切断 未被使用的处理器核的电源等方法实现粗粒度的功耗管理,还能根据数据驱动,进行针对指 令层次的细粒度功耗管理,更能用硬件的方式实施自动实时调整处理器核时钟频率,在保证 处理器核正常工作的前提下,有效降低了处理器核运行中的动态功耗,实现处理器核按需求 调整时钟频率, 且尽量减少人为的干预实施。 同时由于采用硬件的方式实现, 速度快, 能够 更有效的实现处理器时钟频率的实时调整。
最后, 采用本发明技术方案, 仅需要编程和配置就可实现片上系统, 能缩短从设计到产品上市之间的研发周期。而且, 只需要重新编程和重配置, 就能使同一个硬件产品实现不同的功能。 附图说明
虽然该发明可以以多种形式的修改和替换来扩展,说明书中也列出了一些具体的实施图 例并进行详细阐述。应当理解的是,发明者的出发点不是将该发明限于所阐述的特定实施例, 正相反, 发明者的出发点在于保护所有基于由本权利声明定义的精神或范围内进行的改进、 等效转换和修改。
图 1 是以高级语言程序和汇编语言程序的分割和分配为例对本发明进行说明的流程实 施例。
图 2 是本发明所述后编译方法中处理程序循环的实施例。
图 3是本发明所述基于串行多发射和流水线层次结构的可配置多核 /众核装置示意图。 图 4是地址映射方式的实施例。
图 5是数据在核间传输的实施例。
图 6是背压、 异常处理及数据存储器与共享存储器之间连接的实施例。
图 7是本发明所述自测试自修复方法与结构实施例。
图 8 (a) 是相邻处理器核寄存器值传输的一种实施例。
图 8 (b) 是相邻处理器核寄存器值传输的第二种实施例。
图 9是相邻处理器核寄存器值传输的第三种实施例。
图 10 (a) 是基于本发明处理器核及对应本地存储器组成的一种实施例。
图 10 (b) 是基于本发明处理器核及对应本地存储器组成的另一种实施例。
图 10 (c)是基于本发明处理器核及对应本地存储器中有效标志位和归属标志位的实施 例。
图 11 (a) 是目前现有的片上系统的典型结构。
图 11 (b) 是基于本发明技术方案实现片上系统的一种实施例。
图 11 (c) 是基于本发明技术方案实现片上系统的另一种实施例。
图 12 (a) 是本发明技术方案中前编译的实施例。
图 12 (b) 是本发明技术方案中后编译的实施例。
图 13 (a) 是本发明所述基于串行多发射和流水线层次结构的可配置多核 /众核装置另 一个示意图。
图 13 (b) 是本发明所述基于串行多发射和流水线层次结构的可配置多核 /众核装置通 过配置形成的多核串行结构示意图。
图 13 (c) 是本发明所述基于串行多发射和流水线层次结构的可配置多核 /众核装置通 过配置形成的多核串并行混合结构示意图。
图 13 (d) 是本发明所述基于串行多发射和流水线层次结构的可配置多核 /众核装置通 过配置形成的多个多核结构的示意图。
具体实施方式
图 1 是以高级语言程序和汇编语言程序的分割和分配为例对本发明进行说明的流程实 施例。 首先经前编译(103)步骤将高级语言程序(101)和 /或汇编语言程序(102) 中的调 用展开得到调用展开后的高级语言代码和 /或汇编语言代码。 然后将调用展开后的高级语言 代码和 /或汇编语言代码通过编译器编译(104)得到符合程序执行顺序的汇编代码, 再进行 后编译(107); 如果程序中只有汇编语言代码, 且已经符合程序执行顺序, 则可以省去编译 (104), 直接进行后编译 (107)。 进行后编译 (107) 时, 在本实施例中, 以多核装置的结 构信息(106)为依据, 在处理器核的行为模型(108)上运行汇编代码并分割, 得到配置信 息(110), 同时产生相应配置引导程序(109)。 最后, 由所述装置中的一个处理器核(111) 直接或通过 DMA控制器 (112) 对相应的复数个处理器核 (113) 进行配置。
在图 2中, 指令分割器首先在步骤一(201)读入前端码流片断, 再在步骤二(202)读 入前端码流相关信息。 然后进入步骤三 (203) 判断该码流片断是否循环, 如果不循环, 则 进入步骤九(209)按照常规处理码流片断进行处理, 如果循环, 则进入步骤四 (204)首先 读入循环周期数 M,再进入步骤五(205)读入本程序段可以容纳的周期数 N。在步骤六(206) 判断循环周期数 M是否大于可以容纳的周期数 N,如果循环周期数 M大于可以容纳的周期数 N, 则进入步骤七(207)将循环分割为一个执行 N周的小循环和一个 M-N周的小循环, 并在 步骤八(208)将 M-N重新赋值给 M,同时进入下一程序段循环, 直到满足循环周期数小于可 以容纳的周期数。通过该方法,可以有效的解决循环周期数大于程序段可以容纳的周期数的 情况。
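图 2 所述的循环分割流程可以用如下示意性 Python 代码概括 (仅为简化模型, 假设每次分割出的子循环恰好占满容量 N; 函数名为假设):

```python
def split_loop(m, n):
    """循环周期数 M 大于程序段可容纳周期数 N 时,
    反复将其分割为一个执行 N 周的小循环和一个 M-N 周的小循环,
    直到剩余循环周期数不大于可容纳的周期数 (对应图 2 流程)。"""
    parts = []
    while m > n:
        parts.append(n)   # 分割出执行 N 周的小循环
        m -= n            # 将 M-N 重新赋值给 M, 进入下一程序段
    parts.append(m)       # 剩余循环周期数已可被容纳
    return parts
```

分割出的各个小循环可分别成为不同代码片段的组成部分, 由不同处理器核执行。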
图 3是本发明所述基于串行多发射和流水线层次结构的可配置多核 /众核装置示意图。 在本实施例中, 该装置由若干处理器核 (301)、 可配置本地存储器 (302) 和可配置互联结 构(303)构成。在本实施例中,每个处理器核(301)对应其下方的可配置本地存储器(302), 两者一起构成所述宏观流水线的一级。通过配置可配置互联结构(303), 可以将多个处理器 核(301)及其相应可配置本地存储器(302)连接成串行连接结构。 多个串行连接结构可以 各自独立, 也可以部分或全部有相互联系, 串行、 并行或串并混合地运行程序。
图 4是地址映射方式的实施例。 图 4 (a)采用査找表的方法实现地址査找。 以 16位地址 为例, 64K地址空间分为多块单个 1K地址空间的小存储器 (403), 采用顺序写入的方式, 一块存储器写完后, 再写入其他块。 每写完一次, 块内地址指针 (404 ) 自动指向下一个有 效位为 0的可用表项, 写入时将表项的有效位置 1。每个表项写入数据同时将其地址写入査 找表 (402 )。 以写入地址 BFC0的值为例, 此时地址指针 (404) 指向存储器 (403 ) 的 2号 表项, 将对应数据写入 2号表项时, 在査找表(402)对应地址 BFC0中写入 2, 从而建立地 址映射关系。 在读取数据时, 由地址根据査找表 (402 ) 来找到对应表项, 读出所存数据。 图 4 (b)采用 CAM阵列的方法实现地址査找。 以 16位地址为例, 64K地址空间分为多块单 个 1K地址空间的小存储器 (403), 采用顺序写入的方式, 一块存储器写完后, 再写入其他 块。 每写完一次, 块内地址指针 (406) 自动指向下一个有效位为 0的可用表项, 写入时将 表项的有效位置 1。 每个表项写入数据同时将其指令地址写入 CAM阵列 (402 ) 的下一个表 项。 以写入地址 BFC0的值为例, 此时地址指针 (406) 指向存储器 (403 ) 的 2号表项, 将 对应数据写入 2号表项时, 在 CAM阵列 (405 ) 的下一个表项写入指令地址 BFC0, 从而建立 地址映射关系。在读取数据时,输入指令地址与 CAM阵列所存的所有指令地址相比较来找到 对应表项, 读出所存数据。
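图 4 (a) 的查找表地址映射可以用如下 Python 类示意 (简化为单块小存储器, 忽略 64K 地址空间分块的细节; 类名与结构均为假设):

```python
class MappedBlock:
    """查找表地址映射的简化模型: 数据顺序写入小存储器表项,
    同时在查找表中记录 地址 -> 表项号 的映射关系。"""
    def __init__(self, num_entries):
        self.lookup = {}                  # 查找表: 地址 -> 表项号
        self.valid = [0] * num_entries    # 各表项的有效位
        self.data = [None] * num_entries
        self.ptr = 0                      # 块内地址指针, 指向下一个有效位为 0 的可用表项

    def write(self, addr, value):
        entry = self.ptr
        self.data[entry] = value
        self.valid[entry] = 1             # 写入时将表项的有效位置 1
        self.lookup[addr] = entry         # 建立地址映射关系
        while self.ptr < len(self.valid) and self.valid[self.ptr] == 1:
            self.ptr += 1                 # 指针自动指向下一个可用表项

    def read(self, addr):
        # 读取时由地址根据查找表找到对应表项, 读出所存数据
        return self.data[self.lookup[addr]]
```

图 4 (b) 的 CAM 方案思路相同, 只是以内容寻址匹配替代查找表索引。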
图 5是数据在核间传输的实施例。所有数据存储器均位于处理器核之间,且分为逻辑意 义上的上下两部分。其中上部分用于数据存储器上面的处理器核的读写,下部分仅用于读取 数据供数据存储器下面的处理器核使用。处理器核运行程序的同时,数据从上面的数据存储 器向下接力传递。 三选一选择器 (502、 509 ) 可选择远处传来的数据 (506 )送入数据存储 器 (503、 504)。 '在处理器核 (510、 511 ) 不做 Store指令时, 数据存储器 (501、 503 ) 的 下部分分别通过三选一选择器(502、 509 )写入对应的下一个数据存储器(503、 504)的上 部分, 同时标志写入行的有效位 V为 1。 在做 Store指令时, 寄存器堆只向下面的数据存储 器写值。 在 Load指令需要取相应地址的数据时, 二选一选择器 (505、 507 ) 分别由数据存 储器(503、 504) 的有效位 V决定是从对应上面的数据存储器(501、 503 )或下面的数据存 储器 (503、 504) 中取数。 如果数据存储器 (503、 504 ) 中某表项的有效位 V为 1 , 即标志 数据已经从上面的数据存储器 (501、 503 ) 写入更新, 则在不选择远处传来的数据 (506 ) 的情况下, 三选一选择器(502、 509 )分别选择处理器核(510、 511 ) 的寄存器堆输出作为 输入,从而保证所存数据是经过处理器核(510、 511)处理后的最新值。在数据存储器(503) 的上部分被新数据写入时, 数据存储器(503)的下部分向数据存储器(504)的上部分传输 数据。数据传输时使用指针标志正在传输数据的表项, 当指针指向最后一个表项时, 标志传 输即将完成。一段程序运行完毕时, 数据应已完成向下一个存储器的传输。在下一段程序运 行时,数据存储器( 501 )的上部分向数据存储器( 503 )的下部分传输数据,数据存储器( 503 ) 的上部分向数据存储器(504) 的下部分传输数据, 数据存储器(504)的上部分向下传输数 据,从而构成乒乓传输结构。所有数据存储器都按所需指令空间大小划分出一部分用于指令 的存储, 即数据存储器和指令存储器在物理上是不分开的。
图 6 是背压、 异常处理及数据存储器与共享存储器之间连接的实施例。 本实施例中由DMA控制器 (616) 向指令存储器 (601、 609、 610、 611) 写入相应代码片段 (615)。 处理器核(602、 604、 606、 608)运行相应指令存储器 (601、 609、 610、 611) 中的代码, 并读写相应的数据存储器(603、 605、 607、 612)。 以处理器核(604)、 数据存储器(605)及后一级处理器核(606)为例, 前后两级处理器核(604、 606)都对数据存储器(605)有访问, 只有在前级处理器核 (604) 完成写数据存储器 (605) 且后级处理器核 (606) 完成读数据存储器(605)后, 数据存储器(605) 中的数据子存储器才能做乒乓交换。 背压信号(614)用于由后级处理器核 (606) 通知数据存储器 (605) 是否已完成读操作。 背压信号 (613)用于由数据存储器(605)通知前级处理器核(604)是否有溢出,并传递由后级处理器核(606)传输来的背压信号。前级处理器核(604)根据本身运行情况和由数据存储器(605)传输来的背压信号, 判断宏观流水线是否阻塞、 决定是否对数据存储器 (605) 中的数据子存储器做乒乓交换,并产生背压信号继续向前一级传递。通过如此处理器核到数据存储器再到处理器核的反向背压信号传递, 即可控制宏观流水线的运行。 所有数据存储器(603、 605、 607、 612) 均通过连接 (619) 与共享存储器 (618) 连接。 当某个数据存储器所需写入或读出的地址在其自身之外时, 发生地址异常, 进入共享存储器 (618) 中查找地址, 找到后将数据写入该地址或将该地址的数据读出。 当处理器核(608)需要用到数据存储器(605)中的数据时,也发生异常,数据存储器(605)通过共享存储器(618)将数据传输到处理器核(608)中。 处理器核和数据存储器产生的异常信息均通过专用通道 (620) 传输到异常处理模块(617)。 在本实施例中, 以处理器核中的运算结果溢出为例, 异常处理模块 (617) 控制处理器核对溢出的运算结果做限幅(saturation)操作; 以数据存储器溢出为例, 异常处理模块 (617) 控制数据存储器访问共享存储器, 将数据存储到共享存储器中; 在此过程中, 异常处理模块 (617) 发送信号到所述处理器核或数据存储器, 使之阻塞, 等完成异常处理操作后再恢复运行,其他处理器核及数据存储器通过背压传递而来的信号各自确定自身是否阻塞。
请参阅图 7,该图为所述自测试自修复方法与结构实施例。在该自测试自修复结构(701) 中, 向量生成器 (702) 产生的测试向量同步送到各处理器核, 测试向量分配控制器 (703) 控制各处理器核与向量生成器(702)的连接关系, 运算结果分发控制器(709)控制各处理 器核与比较器的连接关系,处理器核通过比较器和其他处理器核进行运算结果的比较,在本 实施例中, 每个处理器核可以和相邻的其他处理器核进行比较, 如处理器核 (704)可以通 过比较逻辑(708)和处理器核(705、 706、 707)进行比较。 在该实施例中, 每个比较逻辑 可以包含一个或者多个比较器,如果一个比较逻辑有一个比较器,则每个处理器核依次和相 邻的其他多个处理器核进行比较,如果一个比较逻辑有多个比较器,则每个处理器核同时和 相邻的其他多个处理器核进行比较, 测试结果直接从各比较逻辑写入测试结果表 (710)。
请参阅图 8, 图 8给出了相邻处理器核寄存器值传输的三种实施例。
在图 8(a)对应的实施例中,处理器核具有包含 31个 32位通用寄存器的寄存器堆 (801), 在传递前级处理器核 (802) 中所有通用寄存器值到本级处理器核 (803) 时, 可以用 992 根硬连线直接将前级处理器核 (802) 所有通用寄存器的每一位的输出端与本级处理器核 (803)所有通用寄存器的每一位的输入端通过多路选择器一一对应连通。传递寄存器值时, 在一个周期内即可将前级处理器核 (802) 中 31个 32位通用寄存器的值全部传递到本级处 理器核 (803)。 图 8 (a) 中具体显示了一个通用寄存器中一位 (804) 的硬连线连接方法, 其余 991位的硬连线连接方法与该位 (804) 相同。 前级处理器核 (802) 中相应位 (805) 的输出端 (806)通过硬连线(807)与本级处理器核(803) 中该位(804) 的输入端通过多 路选择器(808)连接。 当处理器核执行算术、 逻辑等运算时, 多路选择器(808)选择来源 于本级处理器核的数据(809); 当处理器核执行取数操作时, 如果该数据在本级处理器核对 应的本地存储器中已存在, 则选择来源于本级处理器核的数据(809), 否则选择来源于前级 处理器核传输而来的数据 (810); 当传递寄存器值时, 多路选择器 (808) 选择来源于前级 处理器核传输而来的数据(810)。全部 992位同时传输, 即可在一个周期内完成整个寄存器 堆值的传递。 在图 8 (b)对应的实施例中, 相邻处理器核 (820、 822)各自具有包含复数个 32位通 用寄存器的寄存器堆 (821、 823)。 在从前级处理器核 (820) 向本级处理器核 (822) 传递 寄存器值时, 可以用 32根硬连线将前级处理器核 (820) 中寄存器堆 (821) 的数据输出端
(829)与连接在本级处理器核(822) 中寄存器堆(823)数据输入端(830)上的多路选择 器 (827) 的输入连接, 多路选择器 (827) 的输入分别为本级处理器核来的数据 (824) 和 通过硬连线 (826) 传送来的从前级处理器核来的数据 (825), 当处理器核执行算术、 逻辑 等运算时, 多路选择器 (827) 选择来源于本级处理器核的数据 (824); 当处理器核执行取 数操作时,如果该数据在本级处理器核对应的本地存储器中已存在,则选择来源于本级处理 器核的数据 (824), 否则选择来源于前级处理器核传输而来的数据 (825) 当传递寄存器值 时, 多路选择器(827)选择来源于前级处理器核传输而来的数据(825)。 由寄存器堆(821、 823) 本身对应的寄存器地址产生模块 (828、 832) 产生需要传递寄存器值的寄存器地址送 到寄存器堆 (821、 823) 的地址输入端 (831、 833), 分多次将所述寄存器的值通过硬连线
(826) 和多路选择器 (827) 从寄存器堆 (821) 传递到寄存器堆 (823)。 这样, 可以在只增加少量硬连线的情况下, 在多个周期内完成寄存器堆内全部或部分寄存器值的传递。
在图 9对应的实施例中, 相邻处理器核 (940、 942) 各自具有包含复数个 32位通用寄存器的寄存器堆 (941、 943)。 在从前级处理器核 (940) 向本级处理器核 (942) 传递寄存器值时, 可以先由前级处理器核 (940) 利用数据存储 (store) 指令将寄存器堆 (941) 中一个寄存器值写入前级处理器核(940)对应的本地数据存储器(954) 中, 再由本级处理器核(942)利用数据装载(load)指令从本地数据存储器(954)中读出相应数据并写入寄存器堆 (943) 的对应寄存器中。 在本实施例中, 前级处理器核 (940) 中的寄存器堆 (941)的数据输出端 (949)通过 32位连线 (946)与本地数据存储器(954) 的数据输入端(948)相连,本级处理器核(942)中的寄存器堆(943)的数据输入端(950)通过多路选择器(947)及 32位连线(953)与本地数据存储器(954)的数据输出端(952)相连。多路选择器(947)的输入分别为本级处理器核来的数据 (944) 和通过 32位连线 (953) 传送来的从前级处理器核来的数据 (945), 当处理器核执行算术、 逻辑等运算时, 多路选择器 (947) 选择来源于本级处理器核的数据(944); 当处理器核执行取数操作时, 如果该数据在本级处理器核对应的本地存储器中已存在, 则选择来源于本级处理器核的数据(944), 否则选择来源于前级处理器核传输而来的数据(945); 当传递寄存器值时, 多路选择器(947)选择来源于前级处理器核传输而来的数据(945)。在本实施例中, 可以先依次将寄存器堆(941)中全部寄存器的值都写入本地数据存储器(954)中,之后依次将这些值写入寄存器堆(943)中; 也可以先依次将寄存器堆(941) 中部分寄存器的值写入本地数据存储器(954) 中, 之后依次将这些值写入寄存器堆(943)中; 还可以将寄存器堆(941)中一个寄存器的值写入本地数据存储器(954) 中后, 马上将该值写入寄存器堆(943) 中, 依次重复此过程, 直到需要传递的寄存器值都传递完毕。
请参阅图 10, 图 10给出了基于本发明所述处理器核及对应本地存储器组成的连接结构 的两种实施例。对于本领域普通技术人员来说,可以根据本发明的技术方案和构思对这些实 施例中各组成部分进行各种可能的替换、调整和改进, 而所有这些替换、调整和改进都应属 于本发明所附权利要求的保护范围。
图 10 (a) 对应的实施例包含了本地指令存储器和本地数据存储器的处理器核 (1001 ) 及其前一级处理器核对应的本地数据存储器 (1002 )。 处理器核 (1001 ) 由本地指令存储器 ( 1003)、 本地数据存储器(1004)、 执行单元(1005)、 寄存器堆(1006)、 数据地址产生模 块 (1007)、 程序计数器 (1008)、 写缓冲 (1009) 以及输出缓冲 (1010) 组成。
本地指令存储器 (1003)存储有处理器核 (1001 ) 执行所需的指令。 处理器核 (1001 ) 中执行单元 (1005) 所需的操作数来自寄存器堆 (1006 ), 或来自指令中的立即数; 执行结 果写回寄存器堆 (1006)。
本实施例中, 本地数据存储器有两个子存储器。 以本地数据存储器(1004)为例, 从两 个子存储器读出的数据通过多路选择器(1018、 1019)选择, 产生最终输出的数据(1020)。
通过数据装载( load )指令可以将本地数据存储器( 1002、 1004 )中的数据、写缓冲( 1009 ) 中的数据、 或外部的共享存储器中的数据(1011 )读取到寄存器堆(1006) 中。 在本实施例 中, 本地数据存储器(1002、 1004) 中的数据、 写缓冲 (1009) 中的数据和外部的共享存储 器中的数据 (1011 ) 通过多路选择器 (1016、 1017) 选择后, 输入到寄存器堆 (1006 ) 中。
通过数据存储 (store) 指令可以将寄存器堆 (1006) 中的数据通过写缓冲 ( 1009) 延 时存储到本地数据存储器(1004)中, 或将寄存器堆(1006)中的数据通过输出缓冲(1010) 延时存储到外部的共享存储器中。在从本地数据存储器(1002 )读取数据到寄存器堆(1006 ) 的同时可以将该数据通过写缓冲(1009 )延时存储到本地数据存储器(1004)中, 以完成本 发明所述的 LIS功能, 实现无代价的数据传递。 在图 10(a)对应的实施例中,写缓冲(1009)接收的数据有三个来源:从寄存器堆(1006) 来的数据、从前级处理器核本地数据存储器(1002)来的数据、 以及从外部的共享存储器来 的数据(1011)。所述从寄存器堆(1006)来的数据、从前级处理器核本地数据存储器(1002) 来的数据、 以及从外部的共享存储器来的数据(1011)通过多路选择器(1012)选择后输入 到写缓冲 (1009)。
在图 10 (a)对应的实施例中, 本地数据存储器只接收从同一处理器核中写缓冲来的数 据输入。 如在处理器核 (1001) 中, 本地数据存储器(1004)只接收从写缓冲(1009)来的 数据输入。
在图 10 (a)对应的实施例中,本地指令存储器(1003)和本地数据存储器(1002、 1004) 各自都是由两个相同的子存储器构成,可以同时对本地存储器中不同的子存储器进行读、写 操作。采用这样的结构就可以实现本发明技术方案所述的采用乒乓缓冲交换的本地数据存储 器。本地指令存储器(1003)接收的地址由程序计数器(1008)产生。本地数据存储器(1004) 接收的地址有三个来源: 从本级处理器核写缓冲(1009)中地址存储部分来的用于存储数据 的地址、 从本级处理器核数据地址产生模块(1007)来的用于读取数据的地址、 从后级处理 器核数据地址产生模块来的用于读取数据的地址(1013)。所述从本级处理器核写缓冲(1009) 中地址存储部分来的用于存储数据的地址、从本级处理器核数据地址产生模块(1007)来的 用于读取数据的地址、 从后级处理器核数据地址产生模块来的用于读取数据的地址 (1013) 通过多路选择器(1014、 1015)选择后, 分别输入到本地数据存储器(1004) 中不同子存储 器的地址接收模块。
相应地, 本地数据存储器(1002)接收的地址也有三个来源: 从本级处理器核写缓冲中 地址存储部分来的用于存储数据的地址、从本级处理器核数据地址产生模块来的用于读取数 据的地址、 从后级处理器核数据地址产生模块(1007)来的用于读取数据的地址。上述地址 通过多路选择器选择后, 分别输入到本地数据存储器(1002)中不同子存储器的地址接收模 块。
图 10 (b)是另一种基于本发明所述处理器核及对应本地存储器组成的连接结构, 其中 包含了本地指令存储器和本地数据存储器的处理器核(1021)及其前一级处理器核对应的本 地数据存储器 (1022) 组成。 处理器核 (1021) 由本地指令存储器 (1003)、 本地数据存储 器 (1024)、 执行单元(1005)、 寄存器堆(1006)、 数据地址产生模块(1007)、 程序计数器 (1008)、 写缓冲 (1009) 以及输出缓冲 (1010) 组成。
图 10 (b) 对应实施例提出的连接结构与图 10 (a) 对应实施例提出的结构大致相同, 唯一的不同点在于本实施例中的本地数据存储器 ( 1022、 1024) 各是由一个双端口 (dual-port) 存储器构成。 双端口存储器可以同时支持两个不同地址的读、 写操作。
本地数据存储器(1024)接收的地址有三个来源: 从本级处理器核写缓冲(1009)中地 址存储部分来的用于存储数据的地址、从本级处理器核数据地址产生模块(1007)来的用于 读取数据的地址、 从后级处理器核数据地址产生模块来的用于读取数据的地址 (1025)。 所 述从本级处理器核写缓冲(1009)中地址存储部分来的用于存储数据的地址、从本级处理器 核数据地址产生模块(1007)来的用于读取数据的地址、从后级处理器核数据地址产生模块 来的用于读取数据的地址(1025)通过多路选择器(1026)选择后, 输入到本地数据存储器 (1024) 的地址接收模块。
相应地, 本地数据存储器(1022)接收的地址也有三个来源: 从本级处理器核写缓冲中 地址存储部分来的用于存储数据的地址、从本级处理器核数据地址产生模块来的用于读取数 据的地址、从后级处理器核数据地址产生模块(1007)来的用于读取数据的地址。上述地址 通过多路选择器选择后, 输入到本地数据存储器 (1022) 的地址接收模块。
由于通常程序中需要访问存储器的数据装载指令和数据存储指令一般不超过 40%, 因此可以用单端口 (single-port) 存储器代替图 10 (b) 对应实施例中的双端口存储器, 在程序编译时静态调整程序中指令的顺序,或在程序执行时动态调整指令执行顺序,在执行不需访问存储器的指令时同时执行对存储器访问的指令,进而使连接结构的组成更为简洁、高效。
图 10 (b)对应实施例中每个本地数据存储器实际上是一个双端口存储器, 能同时支持 两个读、 两个写或一读一写操作。 为保证数据在执行中不被误改写, 可以采用如图 10 (c) 所示的方法, 在本地数据存储器 (1031) 中的每一地址都对应增加一个有效标志位 (1032) 和一个归属标志位 (1033)。
图 10 (c) 中, 有效标志位 (1032)代表了本地数据存储器 (1031) 中该地址对应的数 据 (1034) 的有效性, 举例而言, 可以用 "1"代表本地数据存储器 (1031) 中该地址对应 的数据(1034)是有效的,用 "0"代表本地数据存储器(1031)中该地址对应的数据(1034) 是无效的。归属标志位(1033)代表了本地数据存储器(1031)中该地址对应的数据(1034) 是归哪个处理器核使用, 举例而言, 可以用 "0"代表本地数据存储器 (1031) 中该地址对 应的数据 (1034) 归所述本地数据存储器 (1031) 对应的处理器核 (1035) 使用, 用 "1" 代表本地数据存储器 (1031) 中该地址对应的数据 (1034) 归所述本地数据存储器 (1031) 对应的处理器核 (1035) 及其后级处理器核 (1036) 使用。
在具体实施例中, 可以按上述对有效标志位(1032)和归属标志位(1033)的定义描述 存储在本地数据存储器中的每个数据的属性, 并保证正确的读写。
在图 10 (c)对应实施例中, 如果本地数据存储器 (1031) 中某地址对应的有效标志位 (1032) 为 "0", 则表示该地址对应的数据 (1034)是无效的, 即, 如果需要, 可以直接对该地址进行数据存储操作。 如果有效标志位 (1032) 为 "1"且归属标志位(1033)为 "0", 则表示该地址对应的数据(1034)是有效的, 且是给所述本地数据存储器(1031)对应的处理器核(1035)使用的, 因此本级处理器核(1035)如果需要, 可以直接对该地址进行数据存储操作。 如果有效标志位 (1032) 为 "1"且归属标志位 (1033) 为 "1", 则表示该地址对应的数据(1034)是有效的,且是要给所述本地数据存储器(1031)对应的处理器核(1035)及其后级处理器核(1036)使用的, 如果本级处理器核(1035)需要对该地址进行数据存储操作, 则必须等到所述归属标志位 (1033) 为 "0"后才可进行数据存储操作, 即先将该地址对应的数据(1034)传输到后级处理器核(1036)对应的本地数据存储器(1037) 中的相应位置, 同时将本级处理器核(1035)对应的本地数据存储器(1031) 中该地址对应的归属标志位(1033)置为 "0", 这样, 本级处理器核(1035)就可以对该地址进行数据存储操作了。
在图 10 (c)对应实施例中, 若本级处理器(1035)对其对应的本地数据存储器(1031)进行数据存储操作, 则可以将对应的有效标志位(1032)置 "1", 并根据该数据 (1034)是否会被后级处理器(1036)使用决定归属标志位, 如果会被后级处理器(1036)使用则归属标志位 (1033) 置 "1", 否则置 "0"; 也可以将对应的有效标志位 (1032)置 "1", 同时将对应的归属标志位 (1033) 也置 "1", 这样虽然需要增加本地数据存储器 (1031) 的容量, 但能简化其具体的实现结构。
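上述有效标志位与归属标志位共同决定能否存储的规则, 可以概括为如下示意性 Python 函数 (仅为假设的简化表达):

```python
def can_store(valid, owner):
    """有效位为 0: 数据无效, 可直接进行数据存储操作;
    有效且归属位为 0 (归本级处理器核使用): 可直接存储;
    有效且归属位为 1 (还要给后级处理器核使用): 须等数据
    传到后级、归属位清 0 后方可存储。"""
    return valid == 0 or owner == 0
```

本级核在 `can_store` 为假时需先将该地址数据传输到后级本地数据存储器并清归属位。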
请参阅图 11 (a), 图 11 (a) 给出了目前现有的片上系统的典型结构。 其中处理器核 (1101)、 数字信号处理器核 (1102)、 功能单元 (1103、 1104、 1105)、 输入输出接口控制 模块(1106)和存储控制模块(1108)都连接在系统总线(1110)上。 该片上系统可以通过 输入输出接口控制模块( 1106)与外围设备( 1107)传输数据,还可以通过存储控制模块( 1108) 与外部存储器 (1109) 传输数据。
请参阅图 11 (b), 图 11 (b) 给出了基于本发明技术方案实现片上系统的一种实施例。 在本实施例中, 处理器核及相应本地存储器(1121)与其他六个处理器核及相应本地存储器 共同构成功能模块 (1124), 处理器核及相应本地存储器 (1122) 与其他四个处理器核及相 应本地存储器共同构成功能模块 (1125), 处理器核及相应本地存储器 (1123) 与其他两个 处理器核及相应本地存储器共同构成功能模块(1126)。 所述功能模块(1124、 1125、 1126) 各自可以对应图 11 (a) 实施例中的处理器核 (1101)、 或数字信号处理器核 (1102)、 或功 能单元(1103或 1104或 1105)、或输入输出接口控制模块( 1106)、或存储控制模块( 1108)。
以功能模块(1126) 为例, 处理器核及相应本地存储器(1123、 1127、 1128、 1129)构 成串行连接的多核结构, 所述四个处理器核及相应本地存储器 (1123、 1127、 1128、 1129) 共同实现功能模块 (1126) 具备的功能。
处理器核及相应本地存储器(1123)与处理器核及相应本地存储器(1127)之间的数据 传输通过内部连接 (1130)实现。 同样地, 处理器核及相应本地存储器(1127)与处理器核 及相应本地存储器(1128)之间的数据传输通过内部连接(1131)实现, 处理器核及相应本 地存储器( 1128 )与处理器核及相应本地存储器( 1129 )之间的数据传输通过内部连接( 1132 ) 实现。
功能模块(1126)通过硬连线(1133、 1134)与总线连接模块(1138)连接, 使功能模 块(1126)与总线连接模块(1138)之间能相互传输数据。 同样地, 功能模块(1125) 与总 线连接模块(1139)之间能相互传输数据, 功能模块(1124)与总线连接模块(1140、 1141) 之间能相互传输数据。 总线连接模块 (1138) 与总线连接模块 (1139) 通过硬连线 (1135) 能相互传输数据。 总线连接模块(1139)与总线连接模块(1140)通过硬连线 (1136) 能相 互传输数据。 总线连接模块(1140)与总线连接模块(1141)通过硬连线(1137)能相互传 输数据。 通过这种方法, 可以实现功能模块 (1125)、 功能模块 (1126)、 功能模块 (1127) 之间的数据相互传输, 总线连接模块 (1138、 1139、 1140、 1141) 与硬连线 (1135、 1136、 1137) 实现了图 11 (a) 中系统总线 (1110) 的功能, 并与功能模块 (1125、 1126、 1127) 一起, 构成了典型的片上系统结构。
由于本发明提出的可配置多核 /众核装置中处理器核及相应本地存储器在数目上是很容 易扩展的, 因此采用本实施例的方法可以很方便地实现各种类型的片上系统。此外, 在基于 本发明提出的可配置多核 /众核装置实时运行时, 也可以通过实时动态配置的方法, 使片上 系统的结构能灵活改变。
请参阅图 11 (c), 图 11 (c) 给出了基于本发明技术方案实现片上系统的另一种实施例。在本实施例中, 处理器核及相应本地存储器(1151)与其他六个处理器核及相应本地存储器共同构成功能模块 (1163), 处理器核及相应本地存储器 (1152) 与其他四个处理器核及相应本地存储器共同构成功能模块 (1164), 处理器核及相应本地存储器 (1153) 与其他两个处理器核及相应本地存储器共同构成功能模块 (1165)。 所述功能模块 (1163、 1164、 1165)各自可以对应图 11 (a)实施例中的处理器核(1101)、或数字信号处理器核(1102)、或功能单元 (1103或 1104或 1105)、 或输入输出接口控制模块 (1106)、 或存储控制模块 (1108)。
以功能模块(1165)为例, 处理器核及相应本地存储器(1153、 1154、 1155、 1156)构 成串行连接的多核结构, 所述四个处理器核及相应本地存储器 (1153、 1154、 1155、 1156) 共同实现功能模块 (1165) 具备的功能。
处理器核及相应本地存储器(1153)与处理器核及相应本地存储器(1154)之间的数据 传输通过内部连接(1160)实现。 同样地, 处理器核及相应本地存储器(1154)与处理器核 及相应本地存储器(1155)之间的数据传输通过内部连接(1161)实现, 处理器核及相应本 地存储器(1155)与处理器核及相应本地存储器(1156)之间的数据传输通过内部连接(1162) 实现。
在本实施例中,一个例子是通过处理器核及相应本地存储器(1156)与处理器核及相应 本地存储器(1166)间的数据传输实现功能模块(1165)与功能模块(1164) 间的数据传输 需求。根据本发明技术方案, 运行过程中, 一旦处理器核及其相应本地存储器(1156)需要 与处理器核及其相应本地存储器(1166)相互传输数据, 可配置互联网络根据所述数据传输 的需求自动配置、建立处理器核及其相应本地存储器(1156)与处理器核及其相应本地存储 器 (1166) 的双向数据通路 (1158)。 同样地, 一旦处理器核及其相应本地存储器 (1166) 需要向处理器核及其相应本地存储器(1156)单向传输数据, 或处理器核及其相应本地存储 器(1156)需耍向处理器核及其相应本地存储器(1166)单向传输数据, 也可按相同方法建 立单向的数据通路。
在本实施例中, 还建立了处理器核及其相应本地存储器(1151)与处理器核及其相应本 地存储器 (1152 ) 之间的双向数据通路 (1157), 和处理器核及其相应本地存储器 (1165 ) 与处理器核及其相应本地存储器 (1155 ) 之间的双向数据通路 (1159)。 通过这种方法, 可 以实现功能模块(1163 )、 功能模块(1164)、 功能模块(1165 )之间的数据相互传输, 双向 数据通路 (1157、 1158、 1159) 实现了图 11 (a) 中系统总线 (1110) 的功能, 并与功能模 块 (1163、 1164、 1165 ) 一起, 构成了典型的片上系统结构。
根据片上系统应用需求的不同,任意两个功能模块之间不一定只有一组数据通路。由于 本发明提出的可配置多核 /众核装置中处理器核在数目上是很容易扩展的, 因此采用本实施 例的方法可以很方便地实现各种类型的片上系统。 此外, 在基于本发明提出的可配置多核 / 众核装置实时运行时, 也可以通过实时动态配置的方法, 使片上系统的结构能灵活改变。
图 12前编译和后编译实施例, 其中图 12 ( a)为前编译实施例, 图 12 (b )为后编译实 施例。
如图 12 ( a) 所示, 左边为原始的程序代码 (1201、 1203、 1204), 在代码中有两次函 数调用, 分别为 A函数调用和 B函数调用。 其中 1203、 1204分别为 A函数和 B函数代码本 身。 在进行前编译展开后, A函数调用和 B函数调用分别被替换成相应的函数代码, 展开后 的代码中没有函数调用, 如 1202所示。
图 12 (b ) 为后编译实施例, 如图所示, 原始的目标代码 (1205 ) 为经过普通编译后的 目标代码, 该目标代码是基于顺序执行的目标代码, 经过后编译分割后, 形成如图所示的代 码块 (1206、 1207、 1208、 1209、 1210、 1211 ), 每个代码块分配给相应的一个处理器核执 行。 相应的 A循环体被分割为一个单独的代码块 (1207), 而 B循环体由于本身相对较大, 被分割成两个代码块, 即 B循环体 1 ( 1209 )和 B循环体 2 ( 1210)。两个代码块在两个处理 器核上执行, 共同完成 B循环体。
请参阅图 13, 图 13 ( a)为本发明所述基于串行多发射和流水线层次结构的可配置多核 /众核装置示意图, 图 13 (b )为通过配置形成的多核串行结构示意图, 图 13 ( c )为通过配 置形成的多核串并行混合结构示意图, 图 13 ( d )为通过配置形成的多个多核结构的示意图。
如图 13 ( a) 所示, 该装置由多个处理器核及可配置本地存储器 (1301、 1303、 1305、 1307、 1309、 1311、 1313、 1315、 1317 )和可配置互联结构(1302、 1304、 1306、 1308、 1310、 1312、 1314、 1316、 1318)构成。 在本实施例中, 每个处理器核及可配置本地存储器构成所 述宏观流水线的一级。 通过配置可配置互联结构 (如 1302 ), 可以将多个处理器核及可配置 本地存储器(1301、 1303、 1305、 1307、 1309、 1311、 1313、 1315、 1317)连接成串行连接 结构。 多个串行连接结构可以各自独立, 也可以部分或全部有相互联系, 串行、 并行或串并 混合地运行程序。
如图 13 (b)所示, 通过配置相应的可配置互联结构,形成图中的多核串行结构,其中处 理器核及可配置本地存储器(1301)为该多核串行结构的第一级, 处理器核及可配置本地存 储器 (1317)为该多核串行结构的最后一级。
如图 13 (c) 所示, 通过配置相应的可配置互连结构, 处理器核及可配置本地存储器 (1301、 1303、 1305、 1313、 1315、 1317)构成串行结构, 而处理器核及可配置本地存储器 (1307、 1309、 1311) 构成并行结构, 最终形成一个串并行混合结构的多核处理器。 如图 13 (d) 所示, 通过配置相应的可配置互连结构, 处理器核及可配置本地存储器 (1301、 1307、 1313、 1315)构成串行结构, 而处理器核及可配置本地存储器(1303、 1309、 1305、 1311、 1317) 构成另外一条串行结构, 从而构成两条完全独立的串行结构。

Claims

权利要求
1、 一种数据处理的方法, 用于在多处理器核结构上执行程序并得到结果; 所述处理器 指通过执行指令进行运算和读写数据的硬件; 其特征在于所述数据处理的方法包括:
( 1 ) 对程序代码进行分割, 使所述多处理器核结构中的每个核运行相应的分割后代码 片段所需的时间尽量相等;
( 2 ) 在所述多处理器核结构中串行连接的多核结构上运行程序时, 串行连接多核结构 中前一个处理器核的执行结果作为输入送给后一个处理器核;串行连接的多处理器核结构中 任意核每单位时间内可进行单数个或复数个发射,复数个串行连接的核同时形成更大规模的 多发射, 即串行多发射;
( 3 ) 在所述多处理器核结构中串行连接的多核结构上运行程序时, 串行连接多核结构 中任意核的内部流水线为第一个层次,串行连接多处理器核结构中每个核作为一个宏观流水 线段而构成的宏观流水线为第二个层次, 依此类推还可以得到更多更高层次。
2、 根据权利要求 1所述的数据处理的方法, 其特征在于产生运行于所述多处理器核结构中处理器核上的代码片段,除需要对程序代码进行现有通常意义上从程序源代码到目标代码的编译 (compile) 外, 还可以进行前编译 (pre-compile), 即在所述编译进行前对程序源代码的预编译; 还可以进行后编译 (post-compile), 即按要求分配到所述串行连接多处理器核结构中每个核的工作内容及负荷将程序代码划分为单数个或复数个代码片段。
3、 根据权利要求 2所述的数据处理的方法, 其特征在于所述后编译步骤包括:
(a) 对程序代码进行解析, 生成前端码流;
( b) 扫描、 分析前端码流, 根据执行前端码流所需执行周期、 是否跳转以及跳转地址 信息, 统计扫描结果, 间接确定分割信息; 或不扫描前端码流, 根据预设信息直接确定分割 信息;
(c) 根据分割信息对可执行的程序指令代码进行分割, 生成所述串行连接多处理器核结构中每个处理器核相应的代码片段。
4、 根据权利要求 3所述的数据处理的方法, 其特征在于在每个所述处理器核上运行的指令除所述分割后的相应代码片段外, 还可以包括不影响原始程序功能的额外指令。
5、一种基于串行多发射和流水线层次结构的可配置多核/众核装置,包括复数个处理器核、复数个可配置本地存储器 (configurable local memory)、可配置互联结构 (configurable interconnect structure), 其中:
处理器核, 用于执行指令, 进行运算并得到相应结果;
可配置本地存储器, 用于存储指令以及所述处理器核间的数据传递和数据保存; 可配置互联结构, 用于所述可配置多核 /众核装置内各模块间及与外部的连接; 可以根据应用程序要求对处理器核、可配置本地存储器和可配置互联结构进行配置,构 成单数个或复数个串行连接的多核结构;所述串行连接的多核结构中每个处理器核各运行程 序代码中的一部分代码片段;所述串行连接的多核结构中所有处理器核共同实现程序代码的 完整功能。
6. The configurable multi-core/many-core apparatus according to claim 5, characterized in that the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure may further comprise one or a plurality of extension modules; an extension module may be:

a shared memory, used to save data when a configurable data memory overflows and to pass data shared among a plurality of processor cores; or

a direct memory access (DMA) controller, used for direct access to the configurable local memories by modules other than the processor cores; or

an exception handling module, used to handle exceptions occurring in the processor cores and local memories.
7. The configurable multi-core/many-core apparatus according to claim 5, characterized in that in the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure, each processor core comprises an arithmetic unit and a program counter.
8. The configurable multi-core/many-core apparatus according to claim 7, characterized in that in the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure, a processor core may further comprise an extension module; the extension module may be a register file.
9. The configurable multi-core/many-core apparatus according to claim 5, characterized in that in the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure, each processor core has a corresponding configurable local memory, comprising an instruction memory for storing the partitioned code segment and a configurable data memory for storing data.
10. The configurable multi-core/many-core apparatus according to claim 9, characterized in that within the same configurable local memory, the boundary between the instruction memory and the configurable data memory can be changed according to different configuration information.
11. The configurable multi-core/many-core apparatus according to claim 9, characterized in that the same configurable data memory may contain a plurality of data sub-memories, and the boundaries between the plurality of data sub-memories can be changed according to different configuration information.
12. The configurable multi-core/many-core apparatus according to claim 5, characterized in that the configurable interconnect structure comprises connections between processor cores and their adjacent configurable local memories, between processor cores and the shared memory, between processor cores and the direct memory access controller, between configurable local memories and the shared memory, between configurable local memories and the direct memory access controller, between configurable local memories and the outside of the system, and between the shared memory and the outside of the system.
13. The configurable multi-core/many-core apparatus according to claim 12, characterized in that, depending on the configuration, two processor cores and their corresponding local memories can be put into a preceding-stage/following-stage relationship, and some or all processor cores and their corresponding local memories can be connected through the configurable interconnect structure into one or a plurality of serially connected structures; a plurality of serially connected structures may be independent of each other, or partially or fully interconnected, executing instructions serially, in parallel, or in a serial-parallel mix.
14. The configurable multi-core/many-core apparatus according to claim 13, characterized in that executing instructions serially, in parallel, or in a serial-parallel mix may mean that, according to the requirements of the application program, different serially connected structures run different program segments under the control of a synchronization mechanism, executing different instructions in parallel as parallel multi-threading; that different serially connected structures run the same program segment under the control of a synchronization mechanism; or that the same instructions operate on different data in single-instruction multiple-data (SIMD) fashion.
15. The configurable multi-core/many-core apparatus according to claim 5, characterized in that in the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure, the processor cores in a serially connected structure follow a specific data read policy: the input data of the first processor core in the serially connected structure may come from its own corresponding configurable data memory, from the shared memory, or from outside the configurable multi-core/many-core apparatus; the input data of any other processor core may come from its own corresponding configurable data memory or from the configurable data memory corresponding to the preceding-stage processor core.
16. The configurable multi-core/many-core apparatus according to claim 5, characterized in that in the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure, the processor cores in a serially connected structure follow a specific data write policy: the input data of the configurable data memory corresponding to the first processor core in the serially connected structure may come from the processor core itself, from the shared memory, or from outside the configurable multi-core/many-core apparatus; the input data of the configurable data memory corresponding to any other processor core may come from the processor core itself, from the configurable data memory corresponding to the preceding-stage processor core, or from the shared memory; the output data of any processor core may go to its own corresponding configurable data memory or to the shared memory; when an extension memory is present, the output data of any processor core may also go to the extension memory.
17. The configurable multi-core/many-core apparatus according to claim 15 or 16, characterized in that the input data from the different sources of a processor core and its corresponding configurable data memory are multiplexed according to specific rules to determine the final input data.
18. The configurable multi-core/many-core apparatus according to claim 15 or 16, characterized in that the same configurable data memory can be accessed simultaneously by the two processor cores of its preceding and following stages, with the different processor cores each accessing different data sub-memories within the configurable data memory.
19. The configurable multi-core/many-core apparatus according to claim 5, characterized in that when the processor cores in the multi-core/many-core system contain register files, a capability to transfer register values is also required, i.e. the values of one or a plurality of registers in any preceding-stage processor core of the serially connected structure can be transferred to the corresponding registers of any following-stage processor core.
20. The configurable multi-core/many-core apparatus according to claim 5, characterized in that in the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure, a stage of the second level of the pipeline hierarchy, i.e. a macro pipeline stage, can transmit its information to the preceding macro pipeline stage as back pressure information; from the received back pressure information, the preceding macro pipeline stage knows whether the macro pipeline after it is stalled and, combining this with its own situation, determines whether it itself stalls and transmits new back pressure information to the stage before it, thereby implementing macro pipeline control.
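The back-pressure control of claim 20 amounts to propagating a stall signal from later macro-pipeline stages toward earlier ones. The following is an illustrative model only (the function name and boolean encoding are assumptions):

```python
# Sketch of claim 20's back-pressure control: a stage stalls if it
# cannot advance on its own OR if the stage after it is stalled, and
# the combined stall signal propagates toward earlier stages.

def propagate_backpressure(local_stall):
    """local_stall[i] is True if stage i cannot advance by itself
    (e.g. it is waiting for data). Returns each stage's effective
    stall after back pressure from all later stages is applied."""
    n = len(local_stall)
    stall = [False] * n
    blocked_behind = False          # back pressure from later stages
    for i in range(n - 1, -1, -1):  # walk from last stage to first
        stall[i] = local_stall[i] or blocked_behind
        blocked_behind = stall[i]
    return stall

# Stage 2 is stalled, so stages 0 and 1 must also hold their data;
# stage 3, after the stall, keeps draining.
print(propagate_backpressure([False, False, True, False]))
# [True, True, True, False]
```

In hardware this would be a per-stage back-pressure wire rather than a global pass, but the resulting stall pattern is the same.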
21. The configurable multi-core/many-core apparatus according to claim 6, characterized in that the exception handling module may be formed from a processor core of the multi-core/many-core apparatus or may be an additional module.
22. The configurable multi-core/many-core apparatus according to claim 5, characterized in that in the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure, the configuration information used for said configuration includes turning processor cores on or off, configuring the sizes/boundaries and the contents of the instruction memories and data sub-memories in the local memories, and configuring the interconnect structure and connection relationships.
23. The configurable multi-core/many-core apparatus according to claim 5, characterized in that the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure has low-power techniques at three levels:

(a) the configuration level: according to the configuration information, processor cores that are not used can enter a low-power state;

(b) the instruction level: when a processor core executes an instruction that reads data and that data is not yet ready, the processor core enters a low-power state until the data is ready, whereupon the processor core returns from the low-power state to the normal working state; the data may not be ready because the preceding-stage processor core has not yet written the data needed by this stage into the corresponding data sub-memory;

(c) the application level: implemented entirely in hardware, matching the characteristics of the idle task to determine the current utilization of the processor core, and deciding from the current utilization and a reference utilization whether to enter the low-power state or return from it; the reference utilization may be fixed, reconfigurable, or determined by self-learning; it may be hard-wired inside the chip, written by the system at system start-up, or written by software; the reference content used for matching may be hard-wired into the chip at production, written by the system or by software at system start-up, or written by self-learning; it may be written once or written multiple times;

the low-power state may be a reduced processor clock frequency or a cut-off of the power supply.
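The application-level decision in claim 23(c) can be sketched as a utilization comparison against a reference threshold. The two-threshold hysteresis scheme and the numeric values below are illustrative assumptions, not specified by the claim:

```python
# Sketch of claim 23(c): decide power state from current utilization
# versus reference thresholds. Threshold values and the hysteresis
# gap are assumptions for illustration.

def next_power_state(state, utilization, enter_below=0.2, exit_above=0.5):
    """Enter low power when utilization drops below enter_below;
    return to normal once it rises above exit_above. The gap between
    the two thresholds avoids rapid state toggling."""
    if state == "normal" and utilization < enter_below:
        return "low_power"
    if state == "low_power" and utilization > exit_above:
        return "normal"
    return state

print(next_power_state("normal", 0.1))     # low_power
print(next_power_state("low_power", 0.3))  # low_power (hysteresis holds)
print(next_power_state("low_power", 0.6))  # normal
```

Per the claim, the reference values themselves could equally be hard-wired, written at start-up, or learned; here they are plain defaults.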
24. The configurable multi-core/many-core apparatus according to claim 5, characterized in that the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure may have self-test capability, performing chip self-test while powered up without relying on external equipment; when the multi-core/many-core system has self-test capability, one or a plurality of specific basic elements, arithmetic units, or processor cores in the multi-core/many-core system can be used as comparators, stimuli having specific relationships can be applied to a corresponding plurality of groups of other basic elements, arithmetic units, or processor cores, or combinations thereof, in the multi-core/many-core system, and the comparators check whether the outputs of those groups conform to the corresponding specific relationships.
25. The configurable multi-core/many-core apparatus according to claim 24, characterized in that when the multi-core/many-core system has self-test capability, it may also have self-repair capability: when the test results are saved in a memory of the multi-core/many-core system, failed basic units or failed rows or failed arrays can be marked, and when the multi-core/many-core system is configured, the failed basic units or failed rows or failed arrays can be bypassed according to the marks, so that the multi-core/many-core system still works correctly, achieving self-repair.
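The self-test/self-repair flow of claims 24 and 25 can be sketched as: compare identical units against a reference on shared stimuli, mark mismatching units as failed, and skip them when configuring a chain. This is an illustrative model; the unit functions, the identity relationship used for comparison, and all names are assumptions.

```python
# Sketch of claims 24-25: comparator-based self-test followed by
# configuration that bypasses failed units (self-repair).

def self_test(units, reference, stimuli):
    """Mark a unit as failed if its output ever differs from the
    reference unit's output on the same stimulus."""
    failed = set()
    for i, unit in enumerate(units):
        for s in stimuli:
            if unit(s) != reference(s):
                failed.add(i)
                break
    return failed

def configure_chain(n_units, failed):
    """Build a serial chain only from units that passed self-test."""
    return [i for i in range(n_units) if i not in failed]

good = lambda x: x * 2          # a healthy arithmetic unit
bad = lambda x: x * 2 + 1       # a unit with a stuck fault
units = [good, bad, good, good]
failed = self_test(units, reference=good, stimuli=[0, 1, 5])
print(sorted(failed))                    # [1]
print(configure_chain(4, failed))        # [0, 2, 3]
```

The claims allow richer "specific relationships" than strict equality (e.g. complementary stimuli); equality against a reference is the simplest instance.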
26. The configurable multi-core/many-core apparatus according to claim 5, characterized in that the plurality of processor cores may be homogeneous or heterogeneous.
27. The configurable multi-core/many-core apparatus according to claim 5, characterized in that it may have the load-induced-store (LIS) property: when a processor core reads the data at an address for the first time, it reads the data from the local data memory corresponding to the adjacent preceding-stage processor core and at the same time writes the data read into the local data memory corresponding to this stage's processor core; afterwards, reads and writes of the data at that address all access the local data memory corresponding to this stage.
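The load-induced-store behavior of claim 27 resembles a read-through copy from the upstream stage's memory. The following is a minimal illustrative model (class and attribute names are assumptions):

```python
# Sketch of claim 27's "load induced store" (LIS): the first read of
# an address fetches from the previous stage's local data memory and
# copies the value locally; later reads and writes stay local.

class LocalMemory:
    def __init__(self, upstream=None):
        self.data = {}            # this stage's local data memory
        self.upstream = upstream  # previous stage's local data memory

    def load(self, addr):
        if addr not in self.data:             # first access: LIS copy
            self.data[addr] = self.upstream.data[addr]
        return self.data[addr]

    def store(self, addr, value):
        self.data[addr] = value

prev = LocalMemory()
prev.data[0x10] = 42
cur = LocalMemory(upstream=prev)

print(cur.load(0x10))    # 42  (fetched from upstream, now cached locally)
cur.store(0x10, 7)       # subsequent accesses stay local
print(cur.load(0x10))    # 7
print(prev.data[0x10])   # 42  (the upstream copy is unchanged)
```

The data pre-transfer of claim 28 is the same copy performed proactively for addresses a later stage will need, even when this stage itself never reads them.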
28. The configurable multi-core/many-core apparatus according to claim 5, characterized in that it may have the data pre-transfer property: a processor core may read, from the local data memory corresponding to the preceding-stage processor core, data that this processor core does not itself need to read or write but that following processor cores need to read, and write it into the local data memory corresponding to this stage's processor core.
29. The configurable multi-core/many-core apparatus according to claim 5, characterized in that the local data memory may contain one or a plurality of valid flags and one or a plurality of ownership flags; a valid flag indicates whether the corresponding data is valid; an ownership flag indicates which processor core is currently using the corresponding data.
30. The configurable multi-core/many-core apparatus according to claim 5, characterized in that a processor core may access the local instruction memories of processor cores other than itself; the plurality of processor cores may execute the segments of the same code in parallel, or execute the segments of different code in parallel; they may execute the segments of the same code serially, or execute the segments of different code serially; they may also execute the segments of the same code or the segments of different code in a serial-parallel mix.
31. The configurable multi-core/many-core apparatus according to claim 5, characterized in that one or a plurality of processor cores and their corresponding local memories may further be formed into a high-performance multi-core connection structure; by configuring the multi-core connection structure and placing the corresponding code segments into the corresponding local instruction memories, the multi-core connection structure implements a specific function; the multi-core connection structure is functionally equivalent to a functional module in a system on chip (SoC); a plurality of such functional modules are then connected by data transfer channels between the functional modules to realize a system on chip; the data transfer channels correspond to the system bus in a conventional system-on-chip architecture.
32. The configurable multi-core/many-core apparatus according to claim 31, characterized in that, through the configurable interconnect structure, a plurality of connections can be established in advance or dynamically between the inputs and outputs of a plurality of processor cores that have data-transfer relationships, forming data transfer channels between processor cores equivalent to the system bus structure of a conventional system on chip.
33. The configurable multi-core/many-core apparatus according to claim 31, characterized in that processor cores and their corresponding local memories can be dynamically reconfigured, and the code segments in the corresponding local instruction memories can be dynamically changed, thereby changing the functionality of the system on chip.
34. The configurable multi-core/many-core apparatus according to claim 7, characterized in that the processor cores may have a fast condition evaluation mechanism for determining whether a branch is taken; the fast condition evaluation mechanism may be a counter for evaluating loop conditions, or a hardware finite state machine for evaluating branch and loop conditions.
35. The configurable multi-core/many-core apparatus according to claim 5, characterized in that a processor core may also correspond to a plurality of local instruction memories; while one or more of the plurality of local instruction memories serve the instruction fetches of the corresponding processor core, the other local instruction memories may perform instruction update operations.
36. The configurable multi-core/many-core apparatus according to claim 5, characterized in that the plurality of processor cores in the configurable multi-core/many-core apparatus may work at the same clock frequency or at different clock frequencies.
37. The configurable multi-core/many-core apparatus according to claim 5, characterized in that the configurable multi-core/many-core apparatus based on serial multi-issue and a hierarchical pipeline structure may further comprise one or a plurality of dedicated processing modules; a dedicated processing module can be called by the processor cores as a macro module, or can act as an independent processing module that receives the output of a processor core and sends its processing result to a processor core; the processor core that outputs to the dedicated processing module and the processor core that receives the dedicated processing module's output may be the same processor core or different processor cores.
PCT/CN2009/001346 2008-11-28 2009-11-30 Data processing method and apparatus WO2010060283A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP09828544A EP2372530A4 (en) 2008-11-28 2009-11-30 METHOD AND DEVICE FOR DATA PROCESSING
KR1020117014902A KR101275698B1 (ko) 2008-11-28 2009-11-30 데이터 처리 방법 및 장치
US13/118,360 US20110231616A1 (en) 2008-11-28 2011-05-27 Data processing method and system

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN200810203778.7 2008-11-28
CN200810203777.2 2008-11-28
CN200810203777A CN101751280A (zh) Post-compilation system for program partitioning for multi-core/many-core processors
CN200810203778A CN101751373A (zh) Configurable multi-core/many-core system based on single-instruction-set microprocessor arithmetic units
CN200910046117.2 2009-02-11
CN200910046117 2009-02-11
CN200910208432.0A CN101799750B (zh) Data processing method and apparatus
CN200910208432.0 2009-09-29

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/118,360 Continuation US20110231616A1 (en) 2008-11-28 2011-05-27 Data processing method and system

Publications (1)

Publication Number Publication Date
WO2010060283A1 true WO2010060283A1 (zh) 2010-06-03

Family

ID=42225216

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2009/001346 WO2010060283A1 (zh) 2008-11-28 2009-11-30 Data processing method and apparatus

Country Status (4)

Country Link
US (1) US20110231616A1 (zh)
EP (1) EP2372530A4 (zh)
KR (1) KR101275698B1 (zh)
WO (1) WO2010060283A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102475A (zh) * 2013-04-11 2014-10-15 腾讯科技(深圳)有限公司 分布式并行任务处理的方法、装置及系统
CN107750374A (zh) * 2015-05-08 2018-03-02 鞍点有限责任两合公司 具有配置参数设置器件的危险报警中心

Families Citing this family (21)

Publication number Priority date Publication date Assignee Title
CA2751388A1 (en) * 2011-09-01 2013-03-01 Secodix Corporation Method and system for mutli-mode instruction-level streaming
CN102646059B (zh) * 2011-12-01 2017-10-20 中兴通讯股份有限公司 多核处理器系统的负载平衡处理方法及装置
US9465619B1 (en) * 2012-11-29 2016-10-11 Marvell Israel (M.I.S.L) Ltd. Systems and methods for shared pipeline architectures having minimalized delay
US9032256B2 (en) * 2013-01-11 2015-05-12 International Business Machines Corporation Multi-core processor comparison encoding
WO2015050474A1 (en) * 2013-10-03 2015-04-09 Huawei Technologies Co., Ltd Method and system for assigning a computational block of a software program to cores of a multi-processor system
US9294097B1 (en) 2013-11-15 2016-03-22 Scientific Concepts International Corporation Device array topology configuration and source code partitioning for device arrays
US9698791B2 (en) 2013-11-15 2017-07-04 Scientific Concepts International Corporation Programmable forwarding plane
US10326448B2 (en) 2013-11-15 2019-06-18 Scientific Concepts International Corporation Code partitioning for the array of devices
US9460012B2 (en) 2014-02-18 2016-10-04 National University Of Singapore Fusible and reconfigurable cache architecture
CN103955406A (zh) * 2014-04-14 2014-07-30 浙江大学 一种基于超级块的投机并行化方法
US10318356B2 (en) * 2016-03-31 2019-06-11 International Business Machines Corporation Operation of a multi-slice processor implementing a hardware level transfer of an execution thread
US10055155B2 (en) * 2016-05-27 2018-08-21 Wind River Systems, Inc. Secure system on chip
US20180259576A1 (en) * 2017-03-09 2018-09-13 International Business Machines Corporation Implementing integrated circuit yield enhancement through array fault detection and correction using combined abist, lbist, and repair techniques
WO2019089918A1 (en) * 2017-11-03 2019-05-09 Coherent Logix, Inc. Programming flow for multi-processor system
US11435947B2 (en) 2019-07-02 2022-09-06 Samsung Electronics Co., Ltd. Storage device with reduced communication overhead using hardware logic
KR102246797B1 (ko) * 2019-11-07 2021-04-30 국방과학연구소 명령 코드 생성을 위한 장치, 방법, 컴퓨터 판독 가능한 기록 매체 및 컴퓨터 프로그램
EP4085354A4 (en) * 2019-12-30 2024-03-13 Star Ally International Limited PROCESSOR FOR CONFIGURABLE PARALLEL CALCULATIONS
KR102320270B1 (ko) * 2020-02-17 2021-11-02 (주)티앤원 학습용 무선 마이크로컨트롤러 키트
US11734017B1 (en) 2020-12-07 2023-08-22 Waymo Llc Methods and systems for processing vehicle sensor data across multiple digital signal processing cores virtually arranged in segments based on a type of sensor
US11782602B2 (en) 2021-06-24 2023-10-10 Western Digital Technologies, Inc. Providing priority indicators for NVMe data communication streams
US11960730B2 (en) 2021-06-28 2024-04-16 Western Digital Technologies, Inc. Distributed exception handling in solid state drives

Citations (2)

Publication number Priority date Publication date Assignee Title
CN1567187A (zh) * 2003-06-11 2005-01-19 华为技术有限公司 数据处理系统及方法
JP2008146503A (ja) * 2006-12-12 2008-06-26 Sony Computer Entertainment Inc 分散処理方法、オペレーティングシステムおよびマルチプロセッサシステム

Family Cites Families (23)

Publication number Priority date Publication date Assignee Title
US4089059A (en) * 1975-07-21 1978-05-09 Hewlett-Packard Company Programmable calculator employing a read-write memory having a movable boundary between program and data storage sections thereof
SE448680B (sv) * 1984-05-10 1987-03-16 Duma Ab Doseringsanordning till en injektionsspruta
CA2143145C (en) * 1994-04-18 1999-12-28 Premkumar Thomas Devanbu Determining dynamic properties of programs
US5732209A (en) * 1995-11-29 1998-03-24 Exponential Technology, Inc. Self-testing multi-processor die with internal compare points
US7080238B2 (en) * 2000-11-07 2006-07-18 Alcatel Internetworking, (Pe), Inc. Non-blocking, multi-context pipelined processor
EP1205840B1 (en) * 2000-11-08 2010-07-14 Altera Corporation Stall control in a processor with multiple pipelines
US7062641B1 (en) * 2001-01-10 2006-06-13 Cisco Technology, Inc. Method and apparatus for unified exception handling with distributed exception identification
US6757761B1 (en) * 2001-05-08 2004-06-29 Tera Force Technology Corp. Multi-processor architecture for parallel signal and image processing
US20030046429A1 (en) * 2001-08-30 2003-03-06 Sonksen Bradley Stephen Static data item processing
US20050177679A1 (en) * 2004-02-06 2005-08-11 Alva Mauricio H. Semiconductor memory device
EP1619584A1 (en) * 2004-02-13 2006-01-25 Jaluna SA Memory allocation
US20070083785A1 (en) * 2004-06-10 2007-04-12 Sehat Sutardja System with high power and low power processors and thread transfer
US7536567B2 (en) * 2004-12-10 2009-05-19 Hewlett-Packard Development Company, L.P. BIOS-based systems and methods of processor power management
ATE393932T1 (de) * 2004-12-22 2008-05-15 Galileo Avionica Spa Rekonfigurierbares mehrprozessorsystem besonders zur digitalen verarbeitung von radarbildern
US7689867B2 (en) * 2005-06-09 2010-03-30 Intel Corporation Multiprocessor breakpoint
US7793278B2 (en) * 2005-09-30 2010-09-07 Intel Corporation Systems and methods for affine-partitioning programs onto multiple processing units
US8104030B2 (en) * 2005-12-21 2012-01-24 International Business Machines Corporation Mechanism to restrict parallelization of loops
US7689838B2 (en) * 2005-12-22 2010-03-30 Intel Corporation Method and apparatus for providing for detecting processor state transitions
US7784037B2 (en) * 2006-04-14 2010-08-24 International Business Machines Corporation Compiler implemented software cache method in which non-aliased explicitly fetched data are excluded
US20070250825A1 (en) * 2006-04-21 2007-10-25 Hicks Daniel R Compiling Alternative Source Code Based on a Metafunction
US7797563B1 (en) * 2006-06-09 2010-09-14 Oracle America System and method for conserving power
US8589666B2 (en) * 2006-07-10 2013-11-19 Src Computers, Inc. Elimination of stream consumer loop overshoot effects
US7665000B2 (en) * 2007-03-07 2010-02-16 Intel Corporation Meeting point thread characterization



Also Published As

Publication number Publication date
EP2372530A1 (en) 2011-10-05
KR101275698B1 (ko) 2013-06-17
KR20110112810A (ko) 2011-10-13
US20110231616A1 (en) 2011-09-22
EP2372530A4 (en) 2012-12-19

Similar Documents

Publication Publication Date Title
WO2010060283A1 (zh) Data processing method and apparatus
JP6243935B2 (ja) コンテキスト切替方法及び装置
US7284092B2 (en) Digital data processing apparatus having multi-level register file
Krashinsky et al. The vector-thread architecture
EP1137984B1 (en) A multiple-thread processor for threaded software applications
CN101799750B (zh) Data processing method and apparatus
US6988181B2 (en) VLIW computer processing architecture having a scalable number of register files
US6826674B1 (en) Program product and data processor
US7219185B2 (en) Apparatus and method for selecting instructions for execution based on bank prediction of a multi-bank cache
CN105144082B (zh) 基于平台热以及功率预算约束,对于给定工作负荷的最佳逻辑处理器计数和类型选择
US20140181477A1 (en) Compressing Execution Cycles For Divergent Execution In A Single Instruction Multiple Data (SIMD) Processor
GB2524126A (en) Combining paths
JP2015534188A (ja) ユーザレベルのスレッディングのために即時のコンテキスト切り替えを可能とする新規の命令および高度に効率的なマイクロアーキテクチャ
JP2006012163A5 (zh)
US20120110303A1 (en) Method for Process Synchronization of Embedded Applications in Multi-Core Systems
JP2011513843A (ja) 実行装置内のデータ転送のシステムおよび方法
Zhang et al. Leveraging caches to accelerate hash tables and memoization
US6594711B1 (en) Method and apparatus for operating one or more caches in conjunction with direct memory access controller
EP4143682A1 (en) Handling multiple graphs, contexts and programs in a coarse-grain reconfigurable array processor
CN114691597A (zh) 自适应远程原子操作
CN114253607A (zh) 用于由集群化解码流水线对共享微代码定序器的乱序访问的方法、系统和装置
US6119220A (en) Method of and apparatus for supplying multiple instruction strings whose addresses are discontinued by branch instructions
US7080234B2 (en) VLIW computer processing architecture having the problem counter stored in a register file register
CN108845832B (zh) 一种提高处理器主频的流水线细分装置
CN112148106A (zh) 用于处理器的混合预留站的系统、装置和方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09828544

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 4513/CHENP/2011

Country of ref document: IN

ENP Entry into the national phase

Ref document number: 20117014902

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2009828544

Country of ref document: EP