WO2010060283A1 - A method and apparatus for data processing - Google Patents
A method and apparatus for data processing
- Publication number
- WO2010060283A1 (PCT/CN2009/001346)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- core
- data
- processor
- configurable
- memory
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/3012—Organisation of register space, e.g. banked or distributed register file
- G06F9/30134—Register stacks; shift registers
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/3826—Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
- G06F9/3828—Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage with global bypass, e.g. between pipelines, between clusters
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
Definitions
- the present invention relates to the field of integrated circuit design. Background art
- the characteristic size of transistors is gradually shrinking along the route of 65nm, 45nm, 32nm..., and the number of transistors integrated on a single chip has exceeded one billion.
- EDA tools have not made a qualitative breakthrough for more than 20 years, so front-end design, especially verification, finds it more and more difficult to cope with the increasing scale of a single chip. Design companies have therefore turned their attention to multi-core, that is, integrating multiple simple cores in one chip, which reduces design and verification difficulty while improving the function of the chip.
- the internal working frequency of the multi-core processor has been much higher than the operating frequency of its external memory. Simultaneous memory access by multiple processor cores has become a major bottleneck restricting system performance. Parallel multi-core architectures running serial programs do not achieve the expected performance improvements.
- the present invention is directed to the deficiencies of the prior art, and proposes a method and apparatus for data processing for running a serial program at a high speed to improve throughput.
- the method and apparatus for data processing of the present invention include: segmenting program code running on a serially connected multi-processor core structure according to a specific rule, so that the time required for each core in the serially connected multi-processor core structure to run its corresponding code segment is as equal as possible, achieving load balancing of the inter-core workload.
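For illustration, the load-balancing segmentation described above could be approximated with a greedy split over per-instruction cycle estimates. This is a sketch under assumed inputs (the function name, the cycle list, and the greedy rule are all hypothetical), not the patent's actual algorithm:

```python
def split_balanced(cycle_counts, num_cores):
    """Greedily split a list of per-instruction cycle estimates into
    num_cores contiguous segments with roughly equal total cycles."""
    total = sum(cycle_counts)
    target = total / num_cores
    segments, current, acc = [], [], 0
    for cycles in cycle_counts:
        current.append(cycles)
        acc += cycles
        # close the segment once it reaches the per-core target,
        # as long as later cores still get at least one segment
        if acc >= target and len(segments) < num_cores - 1:
            segments.append(current)
            current, acc = [], 0
    segments.append(current)
    return segments

# hypothetical cycle estimates for 8 instructions, split over 3 cores
segs = split_balanced([3, 1, 4, 1, 5, 9, 2, 6], 3)
```

A production splitter would also respect loop boundaries, as the later bullets require; this sketch only shows the equal-time criterion.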
- the serial-connected multi-processor core structure includes a plurality of processor cores.
- the hardware refers to hardware that performs operations and reads and writes data by executing instructions, including but not limited to a central processing unit (CPU) and a digital signal processor (DSP).
- the serially connected multi-processor core structure of the present invention constitutes serial multi-issue: any core in the structure can issue one or more instructions per unit time, and multiple serially connected cores together form a larger-scale multi-issue, i.e., serial multi-issue.
- the serially connected multi-processor core structure of the present invention constitutes a pipeline hierarchy: the internal pipeline of any core in the structure is the first level; the macro pipeline, formed by using each core in the structure as a macro pipeline segment, is the second level; and so on for higher levels, e.g., a serially connected multi-processor core structure used as a higher-level pipeline segment constitutes the third level.
- the code segment on a core in the serially connected multi-processor core structure of the present invention is one of a single or a plurality of code segments generated from program code by some or all of the three steps of pre-compilation, compilation and post-compilation; the program code includes but is not limited to high-level language code and assembly language code.
- the compilation is compilation from program source code to object code in the usual sense.
- the pre-compilation is a pre-compilation of the program source code before the compilation is performed, including but not limited to, expanding a "call" in the program before compiling the program, and replacing the call statement with the actually called code.
- the post-compilation divides the work content and load of each core in the serially connected multi-processor core structure as required, splitting the compiled object code into a single or a plurality of code segments; the steps include but are not limited to:
- the specific model includes, but is not limited to, a behavior model of the core in the serially connected multiprocessor core structure;
- the pre-compilation method described in the present invention may be implemented before the program source code is compiled, may be implemented as a component of the compiler during program source code compilation, or may be implemented in real time while the serially connected multi-processor core structure is running, as a component of its operating system, as a driver, or as an application.
- the post-compilation method described in the present invention may be implemented after compilation of the program source code is completed, may be implemented as a component of the compiler during program source code compilation, or may be implemented in real time while the serially connected multi-processor core structure is running, as part of, including but not limited to, its operating system, a driver, or an application.
- when the post-compilation method is implemented in real time, the corresponding configuration information in the code segment may be determined manually, may be generated dynamically and automatically according to the usage condition of the serially connected multi-processor core structure, or may be generated only as fixed configuration information.
- existing application programs can likewise be divided and segmented, which not only improves the running speed of existing programs on multi-core/many-core devices and fully utilizes the efficiency of multi-core/many-core devices, but also guarantees that multi-core/many-core devices remain compatible with existing applications. This effectively resolves the dilemma that existing applications cannot take full advantage of multi-core/many-core processors.
- the basis for indirectly determining the split information includes, but is not limited to, the number of execution cycles or the execution time of the instructions, and the number of instructions; that is, according to the instruction execution cycle counts or times obtained by scanning the front-end code stream, the entire executable program code is divided into code segments with the same or similar running time, or, according to the instruction counts obtained by scanning the front-end code stream, the entire executable program code is divided into code segments with the same or similar number of instructions;
- the basis for determining the split information includes, but is not limited to, the number of instructions, that is, the entire executable program code can be directly divided into code segments of the same or similar instruction number according to the number of instructions.
- the executable program code is divided according to a specific rule so as to avoid, as much as possible, splitting loop code.
- the loop code may be divided one or more times according to a specific rule into a plurality of smaller-scale loops.
- the plurality of smaller-scale loops may be components of the same or of different code segments.
- the smaller scale loop code includes, but is not limited to, loop code containing a smaller number of codes and loop code having a smaller number of code execution cycles.
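The division of one large loop into several smaller-scale loops described above resembles classic strip-mining. The following is an illustrative sketch (the patent does not prescribe this particular transform or chunk size):

```python
def strip_mine(n_iterations, max_chunk):
    """Split one loop of n_iterations into several smaller loops of at
    most max_chunk iterations each; returns [start, stop) ranges that
    could run as separate smaller-scale loops."""
    return [(i, min(i + max_chunk, n_iterations))
            for i in range(0, n_iterations, max_chunk)]

# a 1000-iteration loop cut into loops of at most 300 iterations
chunks = strip_mine(1000, 300)
```

Each resulting range has both fewer execution cycles and, once emitted as code, can be placed in the same or different code segments, matching the bullets above.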
- the code segment includes, but is not limited to, segmented executable object code with corresponding configuration information applicable to a fixed number of processor cores in the serially connected multi-processor core structure, or unsegmented executable object code with corresponding configuration information containing multiple pieces of segmentation information applicable to an unfixed number of cores; the segmentation information includes but is not limited to a number representing the number of instructions per segment, a specific flag representing a segment boundary, and an indication table of the start information of each code segment.
- for example, for a maximum of 1000 processor cores, a table of 1000 entries can be generated, each entry storing the location information of a corresponding instruction in the unsegmented executable object code; the instructions between two adjacent entries correspond to the code segment that can run on a single core.
- when 1000 processor cores are used, each processor core runs the code between the unsegmented executable object code locations pointed to by its corresponding two entries in the table, i.e., each processor core runs one segment of code from the table.
- when N < 1000 processor cores are used, each processor core runs the corresponding 1000/N segments of code from the table, and the specific code can be determined from the corresponding location information in the table.
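The 1000/N allocation above might be computed along these lines (function name and shapes are illustrative; when N does not divide the table size, some handling of the remainder is needed, and an even spread is one plausible choice):

```python
def segments_per_core(table_size, num_cores):
    """Assign the table_size code segments delimited by the boundary
    table to num_cores cores, each core taking a contiguous run of
    segments; returns one [start, stop) segment range per core."""
    base, extra = divmod(table_size, num_cores)
    assignment, start = [], 0
    for core in range(num_cores):
        count = base + (1 if core < extra else 0)  # spread remainder
        assignment.append((start, start + count))
        start += count
    return assignment

# 8 cores sharing the 1000 segments of the example table
parts = segments_per_core(1000, 8)
```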
- Instructions running on each processor core may include additional instructions in addition to the segmented corresponding code segments.
- the additional instructions include, but are not limited to, a code segment header extension and a code segment tail extension, used to implement a smooth transition of instruction execution between different processor cores. For example, a code segment tail extension can be added at the end of each code segment to store all the values in the register file to a specific location in the data memory, and a code segment header extension can be added at the beginning of each code segment to read the values at that specific location in the data memory back into the register file, enabling register value transfer between different processor cores and ensuring proper operation of the program; when execution reaches the end of a code segment, the next instruction is the first instruction of the next code segment.
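The header/tail extension idea can be illustrated by generating spill/fill sequences. The `store`/`load` mnemonics, register names, and memory layout here are invented for the sketch; they stand in for whatever the target core's ISA provides:

```python
def tail_extension(num_regs, spill_base):
    """Emit store instructions that save every register to a fixed
    region of data memory at the end of a code segment."""
    return [f"store r{i} -> mem[{spill_base + i}]" for i in range(num_regs)]

def header_extension(num_regs, spill_base):
    """Emit load instructions that restore those registers at the start
    of the next segment, so register values cross core boundaries."""
    return [f"load mem[{spill_base + i}] -> r{i}" for i in range(num_regs)]

# a 2-register file spilled at (hypothetical) address 0x100
tail = tail_extension(2, 0x100)
head = header_extension(2, 0x100)
```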
- the data processing method and apparatus of the present invention can construct a configurable multi-core/many-core device based on serial multi-issue and the pipeline hierarchy, including a plurality of processor cores, a plurality of configurable local memories, and a configurable interconnect structure, where:
- the processor cores are used to execute instructions, perform operations and obtain corresponding results; the configurable local memories are used for instruction storage, data transfer and data storage between the processor cores; the configurable interconnect structure is used for inter-module connections within the configurable multi-core/many-core device and for external connections.
- the configurable multi-core/many-core device may also include an expansion module to accommodate a wider range of needs;
- the expansion module includes, but is not limited to, a single or a plurality of the following modules:
- a shared memory for storing data when a configurable data memory overflows and for transferring shared data between multiple processor cores;
- a direct memory access (DMA) controller;
- an exception handling module that handles exceptions occurring in the processor cores and local memories.
- the processor core of the configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention may also include an expansion module to accommodate a wider range of needs, including but not limited to a register file.
- the instructions executed by the processor core include, but are not limited to, arithmetic operation instructions, logic operation instructions, condition determination and jump instructions, and exception trap and return instructions;
- the arithmetic and logic operation instructions include but are not limited to multiplication, addition/subtraction, multiply-add/subtract, accumulate, shift, extract and swap operations, and include fixed-point and floating-point operations with any bit width less than or equal to the processor core data bit width; each of the processor cores executes one or more of the instructions described.
- the number of processor cores can be extended according to actual application requirements.
- each of the processor cores has a corresponding configurable local memory, including an instruction memory for storing the instructions of the segmented code segment and a configurable data memory for storing data; within the same configurable local memory, the boundary between the instruction memory and the configurable data memory can be changed according to different configuration information.
- the configurable data memory includes a plurality of data sub-memories; the size and boundary of the configurable data memory are determined based on the configuration information.
- the boundaries between the plurality of data sub-memories can be changed according to different configuration information.
- a data sub-memory can be mapped into the entire address space of the multi-core/many-core device by address translation.
- the mapping includes, but is not limited to, address translation by lookup table and address translation by content addressable memory (CAM) matching.
- each entry in the data sub-memory contains data and flag information, including but not limited to a valid bit and a data address.
- the valid bit is used to indicate whether the data stored in the corresponding entry is valid.
- the data address is used to indicate where the data stored in the corresponding entry belongs in the entire address space of the multi-core/many-core device.
- the configurable interconnect structure is used for inter-module and external connections of the configurable multi-core/many-core device, including but not limited to connections of processor cores to adjacent configurable local memories, connections of processor cores to the shared memory, connections of processor cores to the direct memory access controller, connections of configurable local memories to the shared memory, connections of configurable local memories to the direct memory access controller, connections of configurable local memories to the outside of the device, and connections of the shared memory to the outside of the device.
- the two processor cores and their respective local memories can be configured to form a front-to-back connection, including but not limited to, the previous stage processor core transmits data to the next stage processor core through its corresponding configurable data memory.
- some or all of the processor cores and their corresponding local memories can be configured to form a single or multiple serial connection structures through a configurable interconnect structure.
- the plurality of serial connection structures may be independent of each other, or may be partially or completely interconnected, and may execute instructions serially, in parallel, or in a serial-parallel mix.
- serial, parallel or mixed execution includes, but is not limited to: according to application requirements, different serial connection structures run different program segments under the control of a synchronization mechanism, executing different instructions in parallel so as to run multiple threads in parallel; or different serial connection structures run the same program segment under the control of the synchronization mechanism, performing the same instructions on different data in single instruction multiple data (SIMD) fashion for data-intensive operations.
- the processor cores in the serial connection structure have specific data read rules (read policy) and data write rules (write policy).
- the data read rule: the input data sources of the first processor core in the serial connection structure include but are not limited to its own corresponding configurable data memory, the shared memory, and the outside of the configurable multi-core/many-core device.
- the input data sources of any other processor core include, but are not limited to, its own corresponding configurable data memory and the corresponding configurable data memory of the previous-stage processor core.
- the destination of the output data of any of the processor cores includes, but is not limited to, its corresponding configurable data memory and the shared memory.
- the output destination of any of the processor cores may also be extended, including but not limited to extended memory.
- the data write rule: the input data sources of the corresponding configurable data memory of the first processor core in the serial connection structure include but are not limited to the processor core itself, the shared memory, and the outside of the configurable multi-core/many-core device.
- the input data sources of the corresponding configurable data memory of any other processor core include, but are not limited to, the processor core itself, the corresponding configurable data memory of the previous-stage processor core, and the shared memory.
- Input data from different sources of the processor core and its corresponding configurable data store are multiplexed according to specific rules to determine the final input data.
- the same configurable data memory can be accessed simultaneously by the two processor cores of its preceding and succeeding stages, each of which accesses a different data sub-memory in the configurable data memory.
- the processor cores may each access different data sub-memories in the same configurable data memory according to a specific rule, including but not limited to using different data sub-memories in the same configurable data memory as a ping-pong buffer accessed by the two processor cores respectively; after the two stages of processor cores complete their accesses to the ping-pong buffer, the buffers are exchanged, so that the data sub-memory originally read/written by the previous-stage processor core becomes the data sub-memory read by the latter-stage processor core, and all the valid bits in the data sub-memory originally read by the latter-stage processor core are invalidated so that it can be written by the previous-stage processor core.
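A minimal model of the ping-pong exchange just described, with Python dicts standing in for the two data sub-memories (illustrative only; real hardware would swap address decodes and clear valid bits):

```python
class PingPongBuffer:
    """Two data sub-memories used as a ping-pong buffer: the preceding-
    stage core writes one while the succeeding-stage core reads the
    other; swap() exchanges their roles and invalidates the buffer the
    consumer has finished with."""

    def __init__(self):
        self.write_buf = {}  # written by the previous-stage core
        self.read_buf = {}   # read by the latter-stage core

    def producer_write(self, addr, value):
        self.write_buf[addr] = value

    def consumer_read(self, addr):
        return self.read_buf.get(addr)  # None models a cleared valid bit

    def swap(self):
        # freshly written data becomes readable; the consumed buffer's
        # entries are discarded (all valid bits invalidated)
        self.read_buf, self.write_buf = self.write_buf, {}

pp = PingPongBuffer()
pp.producer_write(0, "x")
pp.swap()
```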
- when the processor cores in the multi-core/many-core system include register files, a specific register value transmission rule is also needed: single or multiple register values of any preceding-stage processor core in the serial connection structure can be transferred to the corresponding registers of any subsequent processor core.
- the register values include, but are not limited to, values of registers in the register file of the processor core.
- the transmission paths of the register values include, but are not limited to: transmission through the configurable interconnect structure; direct transmission through the shared memory; direct transmission through the corresponding configurable data memory of the processor core; transmission through the shared memory according to a specific instruction; and transmission through the corresponding configurable data memory of the processor core according to a specific instruction.
- the second level in the pipeline hierarchy, i.e., the macro pipeline, can be controlled by back pressure: each macro pipeline segment transmits its status information to the previous-stage macro pipeline segment.
- from the received back pressure information, a macro pipeline segment can know whether the subsequent macro pipeline segment is blocked (stalled), determine according to its own situation whether it is itself blocked, and transmit new back pressure information to the previous-stage macro pipeline segment, achieving macro pipeline control.
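The back-pressure propagation can be simulated as a back-to-front walk over the macro-pipeline segments. The stall rule shown (a segment stalls only when downstream is blocked and it cannot absorb more work) is one plausible reading of the text, not a specified mechanism:

```python
def propagate(blocked_next, busy):
    """A macro-pipeline segment stalls when the next segment is blocked
    and it has work in flight that it cannot buffer; the returned value
    is the new back pressure it reports to the previous-stage segment."""
    return blocked_next and busy

# a 4-segment macro pipeline walked from back to front; the sink is
# blocked, and the busy flags are listed back-to-front
blocked = True
stalls = []
for busy in [True, True, False, True]:
    blocked = propagate(blocked, busy)
    stalls.append(blocked)
# the idle third-from-back segment absorbs the back pressure,
# so segments upstream of it keep running
```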
- in the configurable multi-core/many-core device based on serial multi-issue and the pipeline hierarchy of the present invention, there may be an extended shared memory for storing data when the corresponding configurable data memory of a processor core overflows and for transferring shared data between multiple processor cores; there may also be an extended exception handling module for handling exceptions that occur in the processor cores and local memories.
- each entry in the data sub-memory contains flag information including but not limited to a valid bit, a data address, and a data tag.
- the valid bit is used to indicate whether the data stored in the corresponding entry is valid.
- the data address and data tag are used together to indicate where the data stored in the corresponding entry belongs in the entire address space of the multi-core/many-core device.
- the exception handling module may be composed of a processor core in the multi-core/many-core device, or may be an additional module.
- the exception information includes, but is not limited to, the number of the processor in which the exception occurred and the exception type.
- the corresponding handling of the processor core and/or local memory in which the exception occurred includes, but is not limited to, transmitting information on whether the pipeline is blocked to the respective processor cores in the serial connection structure through the transfer of the back pressure signal.
- the processor core, the configurable local memory, and the configurable interconnect structure can be configured according to an application program.
- the configuration includes but is not limited to turning processor cores on or off, configuring the size/boundary of the instruction memory and data sub-memories in the local memory and their contents, and configuring the interconnect structure and its connection relationships.
- the sources of configuration information for the configuration include, but are not limited to, the inside and the outside of the configurable multi-core/many-core device.
- the configuration can be adjusted at any time during operation according to the requirements of the application.
- the configuration methods include, but are not limited to: direct configuration by a processor core or a central processor core; configuration by a processor core or a central processor core through the direct memory access controller; and configuration by external request via the direct memory access controller.
- the configurable multi-core/many-core device based on serial multi-issue and the pipeline hierarchy described in the present invention has three levels of low-power technology: the configuration level, the instruction level, and the application level.
- at the configuration level, a low power state may be entered; the low power state includes, but is not limited to, reducing the processor clock frequency or cutting off the power supply.
- at the instruction level, when a processor core executes an instruction to read data and the data is not yet ready, the processor core enters a low power state until the data is ready, after which it returns from the low power state to the normal working state.
- the data being not ready includes, but is not limited to, the previous-stage processor core not yet having written the data required by this stage's processor core into the corresponding data sub-memory.
- the low power state includes, but is not limited to, reducing the processor clock frequency or turning off the power supply.
- the application level adopts a full hardware implementation that matches idle task characteristics and determines the current processor core utilization; according to the current processor utilization and a reference utilization, it decides whether to enter a low power state or to return from the low power state.
- the reference utilization may be fixed, reconfigurable or self-learned; it may be solidified inside the chip, written when the device is started, or written by software.
- the reference content for matching may be solidified into the chip during chip production, written by the device or by software when the device is started, or self-learned; its storage medium includes but is not limited to volatile memory and non-volatile memory, and its writing mode includes but is not limited to one-time write and multiple write.
- the low power state includes, but is not limited to, reducing the processor clock frequency or turning off the power supply.
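The application-level decision of entering or leaving the low power state from a utilization/reference comparison might be sketched as follows. The small hysteresis band is an added assumption to keep the state from oscillating; the text only says the decision compares current and reference utilization:

```python
def power_state(utilization, reference, current_state, band=0.05):
    """Compare measured core utilization against the reference rate and
    decide the next power state; the hysteresis band (an assumption)
    prevents rapid toggling near the threshold."""
    if current_state == "normal" and utilization < reference - band:
        return "low_power"
    if current_state == "low_power" and utilization > reference + band:
        return "normal"
    return current_state

# a core at 10% utilization against a 30% reference rate
state = power_state(0.10, 0.30, "normal")
```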
- the configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention can have self-test capability, and can perform self-test of the chip without relying on external devices in the case of power-on operation.
- an arithmetic unit or a processor core of the multi-core/many-core device may be used as a comparator for the multi-core/ The corresponding complex array of other complex elements in the many-core device, the arithmetic unit or the combination of the processor core and the basic element, the arithmetic unit or the processor core are given a specific The excitation of the relationship, and comparing, by the comparator, whether the output of the other basic elements of the complex array, the arithmetic unit or the combination of the processor core and the basic element, the arithmetic unit or the processor core conforms to the corresponding specific relationship.
- the excitation may come from a particular module in the multi-core/many-core device, or from outside the multi-core/many-core device.
- the specific relationships include, but are not limited to, equal, opposite, reciprocal, and complementary.
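A minimal Python sketch of the self-test scheme described above, in which one unit serves as the comparator and checks that another unit's output obeys a known "specific relationship". The unit functions and the 8-bit width used for the complementary relation are illustrative assumptions, not part of the patent text.

```python
# The "specific relationships" named in the text, as predicate functions.
RELATIONS = {
    "equal":         lambda a, b: a == b,
    "opposite":      lambda a, b: a == -b,
    "reciprocal":    lambda a, b: a * b == 1,
    # One's complement; an 8-bit data width is assumed for illustration.
    "complementary": lambda a, b: (a ^ b) & 0xFF == 0xFF,
}

def self_test(unit_under_test, reference_unit, stimulus, relation):
    """Drive both units with the same stimulus and compare their outputs
    according to the expected specific relationship."""
    out_ref = reference_unit(stimulus)
    out_dut = unit_under_test(stimulus)
    return RELATIONS[relation](out_dut, out_ref)

# A healthy negating unit versus an identity reference should satisfy the
# "opposite" relationship; a stuck-at-zero unit should fail the check.
healthy = lambda x: -x
faulty = lambda x: 0
identity = lambda x: x

assert self_test(healthy, identity, 7, "opposite")     # unit passes
assert not self_test(faulty, identity, 7, "opposite")  # fault detected
```

The test result (pass/fail) would then be exported or stored on-chip as the surrounding text describes.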
- the test results may be sent outside the multi-core/many-core device, or may be stored in a memory in the multi-core/many-core device.
- the self-test may be performed during the wafer test, during the packaged integrated circuit test, or at device startup when the chip is in use; the self-test conditions and cycle may also be set manually, with the self-test performed periodically during operation.
- the memory used for the self test includes, but is not limited to, a volatile memory, a non-volatile memory.
- when the multi-core/many-core device has self-test capability, it can also have self-repair capability.
- when the test result is saved in a memory in the multi-core/many-core device, failed processor cores can be marked, and when the multi-core/many-core device is configured, the failed processor cores can be bypassed according to the corresponding flags, so that the multi-core/many-core device can still work normally, realizing self-repair.
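A small Python sketch of the self-repair step just described: given per-core fail flags left in on-chip memory by the self-test, configuration builds the serial chain while bypassing marked cores. The function and flag names are hypothetical.

```python
def configure_chain(cores, fail_flags):
    """Return the serial multi-core chain with failed cores bypassed.

    cores: list of core identifiers, in physical chain order.
    fail_flags: per-core flags from the self-test (True = core failed).
    """
    return [core for core, failed in zip(cores, fail_flags) if not failed]

# core1 was marked as failed by the self-test, so configuration skips it
# and the remaining cores still form a working serial structure.
chain = configure_chain(["core0", "core1", "core2", "core3"],
                        [False, True, False, False])
assert chain == ["core0", "core2", "core3"]
```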
- the self-repair may be performed after the wafer test, after the packaged integrated circuit test, or at device startup when the chip is in use; the self-test and self-repair conditions and cycle may also be set manually, with periodic self-test and self-repair performed during operation.
- the plurality of processor cores in the configurable multi-core/many-core device of the present invention may be isomorphic or heterogeneous.
- the length of the instruction word in the local instruction memory in the configurable multi-core/many-core device of the present invention may not be fixed.
- the local instruction memory and the local data memory in the configurable multi-core/many-core device of the present invention may each have a single read port or a plurality of read ports.
- each processor core may further correspond to a plurality of local instruction memories; the plurality of local instruction memories may be of the same or different sizes, and of the same or different structures.
- when one or more of the plurality of local instruction memories respond to instruction fetches from the corresponding processor core, the other local instruction memories may perform an instruction update operation; the ways to update instructions include, but are not limited to, updating instructions through a direct memory access controller.
- the plurality of processor cores in the configurable multi-core/many-core device of the present invention can operate at the same clock frequency or at different clock frequencies.
- the configurable multi-core/many-core device of the present invention may have the characteristic of load-induced store (LIS): for the first read of data at a certain address, the processor core reads the data from the local data memory corresponding to the previous-stage processor core and writes the read data into the local data memory corresponding to the current-stage processor core; subsequent reads and writes of data at that address access the local data memory corresponding to the current stage. In this way, the transfer of same-address data between adjacent local data memories is implemented without additional overhead.
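The load-induced store behavior can be sketched in a few lines of Python; the class and method names are illustrative, not from the patent.

```python
class LocalDataMemory:
    """Sketch of load-induced store (LIS): the first read of an address
    misses in this stage's local data memory, so the data is fetched from
    the previous stage's memory and written locally; later accesses to the
    same address stay in the current stage."""

    def __init__(self, prev_stage=None):
        self.data = {}
        self.prev_stage = prev_stage

    def read(self, addr):
        if addr not in self.data:               # first read of this address
            value = self.prev_stage.read(addr)  # fetch from previous stage
            self.data[addr] = value             # the load induces a store
        return self.data[addr]

    def write(self, addr, value):
        self.data[addr] = value

prev = LocalDataMemory()
prev.write(0x100, 42)
cur = LocalDataMemory(prev_stage=prev)
assert cur.read(0x100) == 42   # pulled from the previous stage's memory
assert 0x100 in cur.data       # now resident in the current stage
```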
- the configurable multi-core/many-core device of the present invention may have the characteristic of data pre-transfer: the processor core may read, from the local data memory corresponding to the previous-stage processor core, data that the current processor core itself does not need to read or write but that a subsequent processor core does need, and write it into the local data memory corresponding to the current-stage processor core, thereby implementing the stage-by-stage transfer of same-address data through the local data memories of successive stages.
- the local data memory of the present invention may also include a single or a plurality of valid flags and a single or a plurality of attribution flags.
- the valid flag is used to indicate whether the corresponding data is valid.
- the attribution flag is used to indicate which processor core currently uses the corresponding data. The use of the valid flag and the attribution flag can avoid ping-pong buffering and improve memory usage efficiency, and multiple processor cores can access the same data memory at the same time, facilitating data exchange.
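A hypothetical Python sketch of a data memory with per-entry valid and attribution flags, showing how ownership hand-over lets multiple cores share one memory instead of ping-pong buffers. All names here are assumptions for illustration.

```python
class SharedDataMemory:
    """Each entry carries a valid flag and an attribution (owner) flag, so
    several cores can access the same memory; a core may only consume an
    entry that is valid and currently attributed to it."""

    def __init__(self):
        self.entries = {}  # addr -> (value, valid, owner_core)

    def write(self, core, addr, value):
        self.entries[addr] = (value, True, core)

    def read(self, core, addr):
        value, valid, owner = self.entries[addr]
        if not valid or owner != core:
            return None  # invalid, or currently owned by another core
        return value

    def hand_over(self, addr, new_core):
        """Pass attribution of an entry to the next core in the pipeline."""
        value, valid, _ = self.entries[addr]
        self.entries[addr] = (value, valid, new_core)

mem = SharedDataMemory()
mem.write("core0", 0x10, 99)
assert mem.read("core1", 0x10) is None  # core1 does not own the entry yet
mem.hand_over(0x10, "core1")
assert mem.read("core1", 0x10) == 99    # after hand-over, core1 may read
```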
- the ways of transferring register values through the configurable interconnect structure of the present invention include, but are not limited to, directly transferring the values of the registers in a processor core to the registers of the subsequent processor core at one time over a large number of hard wires, or sequentially shifting the values of the registers in a processor core to the registers of the subsequent processor core by means of a shift register.
- the registers whose values need to be transmitted may also be determined according to a register read/write record table.
- the register read/write record table of the present invention is used to record reads and writes between the registers and the corresponding local data memory. If the value of a register has already been written to the local data memory corresponding to the current-stage processor core and the value of that register has not changed since, then the subsequent processor core only needs to read the data from the corresponding address in the local data memory corresponding to the current-stage processor core, thereby completing the transfer of that register, and the register value does not need to be separately transferred to the subsequent processor core.
- when the value of a register is written to the local data memory, the corresponding entry in the register read/write record table is cleared to "0".
- when data is written to a register in the register file, the corresponding entry in the register read/write record table is set to "1".
- when register value transfer is performed, only the values of the registers whose entries are "1" in the register read/write record table are transferred.
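The record-table bookkeeping above can be sketched directly; class and method names are illustrative assumptions.

```python
class RegisterRecordTable:
    """Sketch of the register read/write record table: an entry of 1 means
    the register holds a value newer than the local data memory copy and
    must be transferred; 0 means the memory copy is current, so the next
    stage can simply load it from memory."""

    def __init__(self, num_regs):
        self.table = [0] * num_regs

    def on_register_write(self, reg):
        # Data written into the register file (load from local data memory,
        # or instruction-execution writeback) marks the register dirty.
        self.table[reg] = 1

    def on_store_to_memory(self, reg):
        # Register value saved to local data memory and unchanged since.
        self.table[reg] = 0

    def registers_to_transfer(self):
        return [r for r, bit in enumerate(self.table) if bit == 1]

t = RegisterRecordTable(4)
t.on_register_write(0)
t.on_register_write(2)
t.on_store_to_memory(0)          # r0's value now lives in local data memory
assert t.registers_to_transfer() == [2]   # only r2 still needs transfer
```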
- the cases of writing data to registers in the register file include, but are not limited to, reading data from the corresponding local data memory into a register in the register file, and writing the result of instruction execution back to a register in the register file.
- the code segment header extension and the code segment tail extension may also be optimized according to the determined code segments obtained after segmentation, to reduce the number of registers that need to be passed.
- the code segment tail extension contains instructions storing all register values to specific addresses in the local data memory, and the code segment header extension contains instructions reading the values at the corresponding addresses into the registers; the two cooperate to realize the smooth transfer of register values.
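A hypothetical sketch of generating matching tail/head extensions. The assembly mnemonics, register naming, and the fixed address layout are all assumptions for illustration; the point is only that the stores emitted by one stage's tail line up with the loads in the next stage's header.

```python
def tail_extension(live_regs, base_addr):
    """Store instructions saving the live register values to fixed
    addresses in the local data memory (syntax is illustrative)."""
    return [f"store r{r}, [{hex(base_addr + i * 4)}]"
            for i, r in enumerate(sorted(live_regs))]

def head_extension(live_regs, base_addr):
    """The matching loads placed at the head of the next stage's segment."""
    return [f"load  r{r}, [{hex(base_addr + i * 4)}]"
            for i, r in enumerate(sorted(live_regs))]

# Registers r1 and r3 are live across the segment boundary; the tail of the
# previous stage stores them and the header of the next stage reloads them.
assert tail_extension({1, 3}, 0x200) == ["store r1, [0x200]",
                                         "store r3, [0x204]"]
assert head_extension({1, 3}, 0x200) == ["load  r1, [0x200]",
                                         "load  r3, [0x204]"]
```

Optimizing as the text describes would shrink `live_regs` to only the registers the subsequent segment actually reads.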
- if a register value has already been written to the local data memory, the corresponding store instruction in the code segment tail extension of the previous-stage processor core can be omitted; the instruction in the code segment header extension corresponding to the current-stage processor core reads the data from the local data memory into the register.
- when the code segments corresponding to a plurality of processor cores each transfer to the same address to execute a piece of code and, after that piece of code is executed, transfer back to the corresponding code segment, the code at that same address may be stored repeatedly in the local instruction memories corresponding to the plurality of processor cores; the code at the same address includes, but is not limited to, a function call and a loop.
- a processor core can access the local instruction memory of a processor core other than itself. When a plurality of processor cores execute exactly the same code and the code length exceeds the size of the local instruction memory corresponding to a single processor core, the code may be stored piece by piece in the local instruction memories corresponding to the plurality of processor cores. At runtime, any one of the plurality of processor cores first reads and executes instructions from the local instruction memory storing the first piece of the identical code; after finishing the first piece, it reads and executes instructions from the local instruction memory storing the second piece, and so on, until all of the identical code is executed.
- the plurality of processor cores may synchronously execute each piece of code in the identical code, or may asynchronously execute each piece of code in the identical code.
- the plurality of processor cores may execute the pieces of the identical code in parallel, may execute them serially, or may execute them in a mixed serial-parallel manner.
- each processor core may further correspond to a plurality of local instruction memories; the plurality of local instruction memories may be of the same or different sizes, and of the same or different structures. When one or more of the plurality of local instruction memories respond to instruction fetches from the corresponding processor core, the other local instruction memories may perform an instruction update operation; one way to update the instructions is through a direct memory access controller.
- in a typical SoC (System on Chip), apart from the processor, the other functional modules are ASIC modules implemented in hardwired logic.
- the performance requirements on these functional modules are very high and difficult to meet with conventional processors, so these ASIC modules cannot be replaced by conventional processors.
- in the present invention, a single processor core or multiple processor cores and their corresponding local memories can be configured to form a high-performance multi-core connection structure, and corresponding code segments are placed in the corresponding local instruction memories so that the multi-core connection structure implements a specific function, which can replace an ASIC module in the system on chip.
- the multi-core connection structure is equivalent to a functional module in a system on a chip, such as an image decompression module or an encryption and decryption module. These functional modules are then connected by the system bus to implement a system on a chip.
- the data transmission channel between a processor core and its corresponding local memory and an adjacent processor core and its corresponding local memory is called a local connection; a multi-core connection structure composed of a plurality of processor cores and their corresponding local memories connected together by local connections corresponds to a functional module of the system on chip.
- the data transmission channel between one multi-core connection structure corresponding to a functional module in the system on chip and another such multi-core connection structure is called a system bus; by connecting a plurality of multi-core connection structures corresponding to the functional modules with the system bus, a system on chip in the usual sense can be realized.
- a system on chip implemented based on the technical solution of the present invention has configurability that is not provided by a conventional system on chip.
- Different on-chip systems can be obtained by differently configuring the data processing apparatus according to the present invention.
- the configuration can be performed in real time during operation so that the on-chip system functions can be changed in real time during operation.
- the functionality of the on-chip system can be changed by dynamically reconfiguring the processor core and its corresponding local memory and dynamically changing the code segments in the corresponding local instruction memory.
- inside the multi-core connection structure corresponding to a functional module in the system on chip, the data transmission channels between a processor core and its corresponding local memory and the other processor cores and their corresponding local memories are local connections, used for data transmission within the functional module.
- the transmission of data through the local connections within a functional module typically requires the participation of the processor core that issues the transfer request.
- the system bus of the present invention may be the local connection, or may be a data transmission channel capable of completing data transmission between different processor cores and their corresponding local memories without occupying processor core operation. The different processor cores and their corresponding local memories may be adjacent or non-adjacent.
- one method of constructing a system bus is to establish a data transmission channel by using a plurality of fixed connection devices.
- the input and output of any multi-core connection structure are connected to a connection device by single or multiple hard wires, and all the connection devices are also connected to one another by single or multiple hard wires.
- the connection device, the connection between the multi-core connection structure and the connection device, and the connection between the connection devices together constitute the system bus.
- another method of constructing the system bus is to establish data transmission channels such that any processor core and its corresponding local data memory can perform data transfer with any other processor core and its corresponding local data memory.
- the means of data transfer include, but are not limited to, delivery via shared memory, transfer via direct memory access controller, delivery over a dedicated bus or network.
- one method is to arrange single or multiple hard wires between pairs of processor cores and their corresponding local data memories among some of the processor cores and their corresponding local data memories; the hard wires may be configurable. When two such processor cores and their corresponding local data memories are in different multi-core connection structures, that is, in different functional modules, the hard wires between them can serve as the system bus between the two multi-core connection structures.
- the second method is to enable all or some of the processor cores and their corresponding local data memories to access other processor cores and their corresponding local data memories through a direct memory access controller. When two such processor cores and their corresponding local data memories are in different multi-core connection structures, that is, in different functional modules, data transfer between them can be performed in real time as needed during operation, implementing a system bus between the two multi-core connection structures.
- a third method is to implement a network-on-chip function on all or some of the processor cores and their corresponding local data memories: when a processor core and its corresponding local data memory transfer data, the configurable interconnect network determines the destination of the data, thereby forming a data path for the transmission. When two such processor cores and their corresponding local data memories are in different multi-core connection structures, that is, in different functional modules, data transfer between them can be performed in real time as needed during operation, implementing a system bus between the two multi-core connection structures.
- the processor core may have a fast condition determination mechanism for determining whether a branch transfer is taken; the fast condition determination mechanism may be a counter for determining loop conditions, or a hardware finite state machine for determining branch transfers and loop conditions.
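The counter variant of the fast condition mechanism can be sketched as a dedicated down-counter that supplies the branch decision without an ALU compare; the class name and interface are assumptions.

```python
class LoopCounter:
    """Sketch of the counter-based fast condition mechanism: the loop-back
    branch decision comes from a dedicated down-counter, so the processor
    core's arithmetic unit is not needed to evaluate the loop condition."""

    def __init__(self, iterations):
        self.count = iterations

    def take_branch(self):
        """Consulted once per iteration; True while the loop continues."""
        if self.count > 1:
            self.count -= 1
            return True
        return False

# A loop configured for 3 iterations runs its body exactly 3 times.
c = LoopCounter(3)
body_runs = 0
while True:
    body_runs += 1
    if not c.take_branch():
        break
assert body_runs == 3
```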
- in the configuration-hierarchy low power consumption of the present invention, specific processor cores can enter a low power consumption state according to the configuration information; the specific processor cores include, but are not limited to, processor cores that are not used and processor cores whose workload is relatively low.
- the low power state includes, but is not limited to, reducing the processor clock frequency or shutting down the power supply.
- the method and apparatus for data processing according to the present invention may further include a single or a plurality of dedicated processing modules.
- the dedicated processing module can be used as a macro module to be called by a processor core and its corresponding local memory, or can be used as an independent processing module that receives the output of a processor core and its corresponding local memory and sends the processing result to a processor core and its corresponding local memory.
- the processor core and its corresponding local memory that output to the dedicated processing module and the processor core and its corresponding local memory that receive the output of the dedicated processing module may be the same processor core and its corresponding local memory, or may be different processor cores and their corresponding local memories.
- the dedicated processing module includes, but is not limited to, a fast Fourier transform (FFT) module, an entropy encoding module, an entropy decoding module, a matrix multiplication module, a convolutional coding module, a Viterbi decoding module, and a Turbo code decoding module.
- taking the matrix multiplication module as an example: if a single large-scale matrix multiplication is performed using a single processor core, a large number of clock cycles are required, which limits the increase in data throughput; if a plurality of the processor cores are used to implement the large-scale matrix multiplication, the number of execution cycles can be reduced, but the amount of data transfer between processor cores increases and many processor resources are consumed. With a dedicated matrix multiplication module, large-scale matrix multiplication can be done in a few cycles. When the program is divided, the operations before the large-scale matrix multiplication can be allocated to a plurality of processor cores, that is, the former group of processor cores, and the operations after the large-scale matrix multiplication can be allocated to another plurality of processor cores, that is, the latter group of processor cores.
- the data in the output of the former group of processor cores that needs to participate in the large-scale matrix multiplication is sent to the dedicated matrix multiplication module, and after processing, the result is sent to the latter group of processor cores.
- the data in the output of the former group of processor cores that does not need to participate in the large-scale matrix multiplication is sent directly to the latter group of processor cores.
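A toy Python sketch of this routing: outputs of the former group either pass through the dedicated module or bypass it on the way to the latter group. The tagging scheme and the 2x2 stand-in for the hardware multiplier are assumptions for illustration.

```python
def route(former_outputs, matmul):
    """former_outputs: list of ("matmul", operands) or ("direct", value)
    items produced by the former group of processor cores; items tagged
    "matmul" go through the dedicated module, the rest bypass it."""
    latter_inputs = []
    for tag, payload in former_outputs:
        if tag == "matmul":
            latter_inputs.append(matmul(*payload))  # via dedicated module
        else:
            latter_inputs.append(payload)           # sent directly
    return latter_inputs

# A tiny 2x2 multiply stands in for the hardware matrix multiplication module.
def matmul2x2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

I = [[1, 0], [0, 1]]
M = [[2, 3], [4, 5]]
out = route([("matmul", (I, M)), ("direct", "status_ok")], matmul2x2)
assert out == [[[2, 3], [4, 5]], "status_ok"]
```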
- the data processing method and apparatus of the present invention can divide serial program code into code segments suited to being run by the respective processor cores in a serially connected multi-processor-core structure; for different numbers of processor cores, the code is divided into code segments of different sizes and numbers according to different segmentation rules, making the approach suitable for scalable multi-core/many-core device/system applications.
- the code segments are allocated to the processor cores in the serially connected multi-processor-core structure; each processor core executes specific instructions, and all the processor cores connected in series together realize the complete function of the program.
- the data used between the code segments divided from the complete program code is transmitted through dedicated transmission paths, so there is almost no data dependency problem, realizing true multi-emission.
- in the serially connected multi-processor-core structure, the number of multi-emission issues equals the number of processor cores, which greatly improves the utilization of the arithmetic units, thereby realizing high throughput of the serially connected multi-processor-core structure, and even of the device/system.
- the local memory replaces the cache that is usually present in the processor.
- each processor core keeps all the instructions and data it uses in its corresponding local memory, achieving a 100% hit rate and eliminating the speed bottleneck caused by cache misses and accesses to external low-speed memory, further improving the overall performance of the device/system.
- the multi-core/many-core device of the present invention has three levels of low-power technology: it can implement coarse-grained power management by methods such as cutting off the power of unused processor cores, perform instruction-driven fine-grained power management at the instruction level, and implement automatic real-time adjustment of the processor core clock frequency in hardware.
- under the premise of ensuring normal operation of the processor core, this effectively reduces the dynamic power consumption of processor core operation, enables the processor core to adjust its clock frequency as needed, and minimizes human intervention; the hardware implementation is fast, so real-time adjustment of the processor clock frequency can be realized more effectively.
- based on the technical solution of the present invention, an on-chip system can be realized by programming and configuration alone, shortening the development cycle from design to product launch. Moreover, only reprogramming and reconfiguration are needed to enable the same hardware product to perform different functions.
- Fig. 1 is a flow chart showing an embodiment of the present invention by taking the division and assignment of a high-level language program and an assembly language program as an example.
- FIG. 2 is an embodiment of a processing routine loop in the post-compilation method of the present invention.
- FIG. 3 is a schematic diagram of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention.
- Figure 4 is an embodiment of an address mapping method.
- Figure 5 is an embodiment of data transfer between cores.
- Figure 6 is an embodiment of back pressure, exception handling, and the connection between data memory and shared memory.
- Figure 8 (a) is an embodiment of adjacent processor core register value transfers.
- Figure 8(b) is a second embodiment of adjacent processor core register value transfers.
- Figure 9 is a third embodiment of adjacent processor core register value transfers.
- Figure 10 (a) is an embodiment of a processor core and corresponding local memory composition based on the present invention.
- Figure 10 (b) is another embodiment of a processor core and corresponding local memory composition based on the present invention.
- Figure 10 (c) is an embodiment of a valid flag and a home flag bit in a processor core and corresponding local memory in accordance with the present invention.
- Figure 11 (a) shows the typical structure of the current system-on-chip.
- Figure 11 (b) is an embodiment of implementing a system on a chip based on the technical solution of the present invention.
- Figure 11 (c) is another embodiment of implementing a system on a chip based on the technical solution of the present invention.
- Figure 12 (a) is an embodiment of pre-compilation in the technical solution of the present invention.
- Figure 12 (b) is an embodiment of post-compilation in the technical solution of the present invention.
- Figure 13 (a) is another schematic diagram of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy of the present invention.
- Figure 13 (b) is a schematic diagram of a multi-core serial structure formed by configuration of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention.
- Figure 13 (c) is a configurable multi-core / many-core device based on serial multi-emission and pipeline hierarchy according to the present invention
- Figure 13 (d) is a schematic diagram of a plurality of multi-core structures formed by configuration of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention.
- Fig. 1 is a flow chart showing an embodiment of the present invention by taking the division and assignment of a high-level language program and an assembly language program as an example.
- the pre-compilation (103) step is used to expand the calls in the high-level language program (101) and/or the assembly language program (102) to obtain the expanded high-level language code and/or assembly language code.
- the expanded high-level language code and/or assembly language code is then compiled by the compiler (104) to obtain assembly code that conforms to the program execution order, and then post-compiled (107); if the program contains only assembly language code that already conforms to the program execution order, the compilation (104) can be omitted and the post-compilation (107) performed directly.
- in the post-compilation of the present embodiment, based on the structure information (106) of the multi-core device, the assembly code is run on the behavior model (108) of the processor cores and divided to obtain configuration information (110); a corresponding configuration bootloader (109) is also generated. Finally, one of the processor cores (111) configures the corresponding plurality of processor cores (113), either directly or through a DMA controller (112).
- the instruction splitter first reads the front-end stream segment in step one (201) and reads the related information of the stream segment in step two (202). It then proceeds to step three (203) to determine whether the stream segment is a loop. If it is not a loop, it proceeds to step nine (209) to process the stream segment conventionally. If it is a loop, it proceeds to step four (204) to read the loop count M, and then to step five (205) to read the number of iterations N that can be accommodated in this block. In step six (206), it judges whether the loop count M is greater than the number N of iterations that can be accommodated.
- if the loop count M is greater than the number N of iterations that can be accommodated, the process proceeds to step seven (207) to split the loop into a small loop of N iterations and a small loop of M-N iterations, reassigns M-N to M in step eight (208), and moves to the next block, repeating until the remaining loop count is no greater than the number of iterations that can be accommodated.
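Steps four through eight above amount to repeatedly peeling off N iterations until the remainder fits in one block. A minimal Python sketch (function name and the per-block capacity list are assumptions):

```python
def split_loop(m, capacity_per_block):
    """Split a loop of m iterations across blocks, each holding at most the
    given number of iterations, until the remainder fits in one block.

    capacity_per_block: iterable giving N for each successive block.
    Returns the iteration count assigned to each block, in order.
    """
    pieces = []
    for n in capacity_per_block:
        if m <= n:            # step 6: remainder fits, stop splitting
            pieces.append(m)
            return pieces
        pieces.append(n)      # step 7: peel off a small loop of N iterations
        m -= n                # step 8: M <- M - N, move to the next block
    raise ValueError("loop does not fit in the available blocks")

# A 10-iteration loop over blocks holding 4 iterations each: 4 + 4 + 2.
assert split_loop(10, [4, 4, 4]) == [4, 4, 2]
assert split_loop(3, [4]) == [3]   # already fits, no splitting needed
```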
- FIG. 3 is a schematic diagram of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy according to the present invention.
- the apparatus is comprised of a number of processor cores (301), configurable local memory (302), and a configurable interconnect structure (303).
- each processor core (301) corresponds to a configurable local memory (302) below it, which together form a level of the macro pipeline.
- through the configurable interconnect structure (303), multiple processor cores (301) and their respective configurable local memories (302) can be connected into serial connection structures.
- multiple serial connection structures can be independent of one another, or partially or completely interconnected, running programs serially, in parallel, or in a mixed serial-parallel manner.
- Figure 4 is an embodiment of an address mapping method.
- Figure 4 (a) uses the lookup table method to achieve address lookup. Taking a 16-bit address as an example, the 64K address space is divided into a plurality of small memories (403), each with a 1K address space, which are written sequentially; after one block of memory is full, the next block is written. After each write, the intra-block address pointer (404) automatically points to the next available entry whose valid bit is 0, and the valid bit of the written entry is set. Each time an entry is written with data, its address is also written to the lookup table (402). Taking a write to address BFC0 as an example, the address pointer (404) points to entry No. 2 of the memory (403); when the corresponding data is written to entry No. 2, the address BFC0 is written to the corresponding entry of the lookup table (402), establishing the address mapping relationship.
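A small Python model of the Figure 4(a) scheme: entries are filled sequentially, and a lookup table maps each original address to the entry used. Class and method names are assumptions; a tiny memory stands in for the 1K blocks.

```python
class LookupTableMemory:
    """Sketch of lookup-table address mapping: a small memory is written
    sequentially, the intra-block pointer advances to the next entry whose
    valid bit is 0, and a lookup table records original address -> entry."""

    def __init__(self, num_entries):
        self.valid = [False] * num_entries
        self.data = [None] * num_entries
        self.lookup = {}   # original address -> entry index
        self.pointer = 0   # intra-block address pointer

    def write(self, addr, value):
        entry = self.pointer
        self.data[entry] = value
        self.valid[entry] = True
        self.lookup[addr] = entry   # e.g. 0xBFC0 -> entry used for it
        # advance the pointer to the next available entry (valid bit 0)
        while self.pointer < len(self.valid) and self.valid[self.pointer]:
            self.pointer += 1

    def read(self, addr):
        return self.data[self.lookup[addr]]

mem = LookupTableMemory(4)
mem.write(0xBFC0, 123)
mem.write(0xBFC4, 456)
assert mem.read(0xBFC0) == 123
assert mem.lookup[0xBFC4] == 1   # second write landed in entry No. 1
```

The CAM variant of Figure 4(b) replaces the lookup table with an associative search over the stored instruction addresses.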
- Figure 4 (b) uses the CAM array method to implement address lookup. Taking a 16-bit address as an example, the 64K address space is divided into a plurality of small memories (403), each with a 1K address space, which are written sequentially; after one block of memory is full, the next block is written. After each write within a block, the intra-block address pointer (406) automatically points to the next available entry whose valid bit is 0, and the valid bit of the written entry is set.
- each time an entry is written with data, its instruction address is also written to the corresponding entry in the CAM array (405).
- taking a write to address BFC0 as an example, the address pointer (406) points to entry No. 2 of the memory (403); when the corresponding data is written to entry No. 2, the instruction address BFC0 is written to the corresponding entry in the CAM array (405), establishing the address mapping relationship.
- when reading, the input instruction address is compared with all the instruction addresses stored in the CAM array to find the corresponding entry, and the stored data is read out.
- Figure 5 is an embodiment of data transfer between cores. All data memory is located between the processor cores and is divided into two parts in the logical sense. The upper part is used for reading and writing of the processor core above the data memory, and the lower part is only used for reading data for use by the processor core below the data memory. While the processor core is running the program, data is transferred from the data memory above.
- the three-way selectors (502, 509) can select the data (506) from the far end to be sent to the data memories (503, 504). When the processor cores (510, 511) do not execute a Store instruction, the lower portions of the data memories (501, 503) are written to the corresponding next data memories (503, 504) through the three-way selectors (502, 509), respectively, and the valid bit V of each written line in the upper portion is marked as 1.
- the register file only writes values to the data memory below it.
- the two-way selectors (505, 507) decide, according to the valid bit V in the data memories (503, 504), whether to take data from the data memories above (501, 503) or from the data memories below (503, 504). If the valid bit V of an entry in the data memories (503, 504) is 1, that is, the entry is flagged as already written and updated from the data memory above (501, 503), the data transmitted from the far end (506) is not selected.
- when the processor cores (510, 511) execute a Store instruction, the selectors (502, 509) select the register file output of the processor cores (510, 511) as input, ensuring that the stored data is the latest value processed by the processor cores (510, 511).
- the lower portion of the data memory (503) transfers data to the upper portion of the data memory (504).
- the pointer is used to transfer the data entry.
- the flag transfer is about to be completed.
- the data should have completed the transfer to the next memory.
- the upper portion of the data memory (501) transfers data to the lower portion of the data memory (503), the upper portion of the data memory (503) transfers data to the lower portion of the data memory (504), and the upper portion of the data memory (504) likewise transfers its data downwards, forming a ping-pong transmission structure. Each data memory also sets aside a portion of its storage for instructions according to the required instruction space size, i.e. the data memory and the instruction memory are not physically separated.
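One stage of the downward transfer can be sketched as a merge: entries forwarded from the memory above are kept unless this core's own Store (an entry whose valid bit V is already 1) has updated them, in which case the local value wins. This is a behavioral sketch with an invented interface, not the selector circuit of Figure 5.

```python
def transfer_stage(forwarded, local_stores):
    """One macro-pipeline stage transfer (behavioral sketch).

    forwarded:    {addr: value} entries arriving from the memory above
    local_stores: {addr: value} entries this core stored itself (V = 1)

    Entries with V = 1 take priority, matching the rule that the
    far-end data is not selected once an entry has been updated locally.
    """
    next_memory = dict(forwarded)   # start from the forwarded far-end data
    next_memory.update(local_stores)  # V = 1 entries override
    return next_memory
```

For example, if the memory above forwards `{"a": 1, "b": 2}` and the local core stored `{"b": 99}`, the next memory receives `a` from the far end but `b` from the local register file.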
- Figure 6 is an embodiment of back pressure, exception handling, and the connection between data memory and shared memory.
- a corresponding code segment (615) is written to the instruction memory (601, 609, 610, 611) by the DMA controller (616).
- the processor cores (602, 604, 606, 608) run the code in the respective instruction memory (601, 609, 610, 611) and read and write the corresponding data memory (603, 605, 607, 612).
- taking the processor core (604), the data memory (605), and the subsequent processor core (606) as an example, both the front and rear processor cores (604, 606) have access to the data memory (605), but only the front processor core (604) writes to it; the data sub-memories in the data memory (605) can be ping-pong swapped.
- the back pressure signal (614) is used by the downstream processor core (606) to inform the data memory (605) whether its read operations have been completed.
- the back pressure signal (613) is used by the data memory (605) to notify the preceding processor core (604) whether there is an overflow and to pass on the back-pressure signal transmitted by the subsequent processor core (606).
- the preceding processor core (604) determines whether the macro pipeline is blocked according to its own operation and the back pressure signal transmitted by the data memory (605), decides whether to ping-pong swap the data sub-memories in the data memory (605), and generates a back pressure signal that continues to pass to the next stage upstream. Through this reverse back-pressure transfer from processor core to data memory to processor core, the operation of the macro pipeline can be controlled. All data memories (603, 605, 607, 612) are connected to the shared memory (618) via a connection (619). When an address written or read by a data memory falls outside its own address range, an address exception occurs and the address is looked up in the shared memory (618); once found, the data is written to that address or the data at that address is read.
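The reverse back-pressure chain described above can be modelled as a pure function: a stage may ping-pong swap only when every stage downstream of it has finished reading. The interface below is invented for illustration; it is a behavioral sketch, not the signal-level protocol of Figure 6.

```python
def advance(stages_done_reading):
    """Behavioral sketch of the back-pressure chain.

    stages_done_reading[i] is True when the core downstream of stage i
    has finished reading stage i's data memory.

    Returns, per stage, whether that stage may ping-pong swap this cycle.
    A stage is blocked if it, or any stage downstream of it, is not done:
    the blockage propagates upstream, like the back-pressure signals
    (613, 614) passed from core to data memory to core in the text.
    """
    can_swap = []
    downstream_clear = True
    for done in reversed(stages_done_reading):
        downstream_clear = downstream_clear and done
        can_swap.append(downstream_clear)
    return list(reversed(can_swap))
```

With three stages where only the middle one is still being read (`[True, False, True]`), the last stage may swap but the middle and first stages stall, because the stall propagates upstream.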
- an exception also occurs when the processor core (608) needs to use data in the data memory (605); the data memory (605) then transfers the data to the processor core (608) via the shared memory (618).
- the exception information generated by the processor core and data memory is transferred to the exception handling module (617) via the dedicated channel (620).
- taking a processor core overflow as an example, the exception handling module (617) controls the processor core to perform a saturation operation on the overflowed result; taking a data memory overflow as an example, the exception handling module (617) controls the data memory to access the shared memory and store the data there. During this process, the exception handling module (617) sends a signal to the processor core or data memory to block it, and operation resumes after the exception handling is completed; the other processor cores and data memories each determine whether they are blocked according to the signals transmitted through the back pressure.
- FIG. 7 is a self-test self-repair method and structural embodiment.
- the test vectors generated by the vector generator (702) are sent synchronously to the processor cores; the test vector distribution controller (703) controls the connection relationship between the processor cores and the vector generator (702), and the operation result distribution controller (709) controls the connection relationship between the processor cores and the comparators, so that each processor core's operation results can be compared with those of the other processor cores.
- each processor core can be compared with its adjacent processor cores; for example, the processor core (704) can be compared by the comparison logic (708) with the processor cores (705, 706, 707).
- each comparison logic may include one or more comparators.
- if a comparison logic has only one comparator, each processor core is compared with its adjacent processor cores in sequence; if a comparison logic has multiple comparators, each processor core is compared with its adjacent processor cores simultaneously. The test results are written directly from each comparison logic into the test result table (710).
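The comparison step above amounts to filling a table with pairwise match results. The sketch below is a toy model (function and parameter names are invented): every core runs the same test vector, and each comparator records whether a core's result matches a neighbour's.

```python
def build_test_result_table(results, neighbours):
    """Behavioral sketch of the self-test comparison.

    results[i]:    output of core i for the shared test vector
    neighbours[i]: indices of the adjacent cores it is compared against

    Returns {(i, j): bool} — one entry per comparator, True on a match.
    A disagreeing pair flags at least one of the two cores as suspect;
    repair (remapping around a bad core) is outside this sketch.
    """
    table = {}
    for i, adjacent in neighbours.items():
        for j in adjacent:
            table[(i, j)] = (results[i] == results[j])
    return table
```

For example, with core 0 compared against cores 1 and 2, a mismatch with core 2 alone suggests core 2 (or core 0) computed the vector incorrectly; comparing each core against several neighbours, as in Figure 7, lets the faulty one be isolated by majority.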
- the processor core has a register file (801) containing 31 32-bit general-purpose registers, and all the general-purpose register values in the preceding processor core (802) are transferred to the processor core of the present stage.
- each general-purpose register of the preceding processor core (802) can be connected directly, bit by bit, to the corresponding general-purpose register of the processor core (803) with 992 hardwired connections, the inputs being connected one by one through multiplexers. When the register values are passed, the values of all 31 32-bit general-purpose registers in the preceding processor core (802) can be passed to the processor core (803) in one cycle.
- Figure 8 (a) specifically shows a hard-wired connection method for one bit (804) in a general-purpose register.
- the remaining 991 bits are hardwired in the same way as this bit (804).
- the output (806) of the corresponding bit (805) in the preceding processor core (802) is connected by a hardwire (807), through the multiplexer (808), to the input of the bit (804) in the processor core (803) of the present stage.
- when the processor core performs arithmetic, logic, and similar operations, the multiplexer (808) selects the data from the processor core of the current stage (809); when the processor core performs a load operation, if the data exists in the local memory corresponding to the processor core of the current stage, the data from the processor core of the current stage is selected (809), otherwise the data transmitted from the preceding processor core is selected (810); when register values are transferred, the multiplexer (808) selects the data from the preceding processor core (810). All 992 bits are transmitted simultaneously, so the entire register file can be transferred in one cycle.
- adjacent processor cores (820, 822) each have a register file (821, 823) containing a plurality of 32-bit general-purpose registers. When transferring register values from the preceding processor core (820) to the processor core (822) of the present stage, the data output of the register file (821) in the preceding processor core (820) can be connected with 32 hardwires (826) to one input of a multiplexer (827) whose output is connected to the data input (830) of the register file (823) in the processor core (822) of the present stage; the inputs of the multiplexer (827) are the data from the processor core of the present stage (824) and the data from the preceding processor core (825) transmitted through the hardwires (826).
- when the processor core performs arithmetic, logic, and similar operations, the multiplexer (827) selects the data from the processor core of the current stage (824); when the processor core performs a load operation, if the data already exists in the local memory corresponding to the processor core of the current stage, the data from the processor core of the current stage is selected (824), otherwise the data from the preceding processor core is selected (825).
- when register values are transferred, the multiplexer (827) selects the data transmitted from the preceding processor core. The register address generation modules (828, 832) corresponding to the register files (821, 823) each generate the register addresses to which register values need to be transferred and send them to the address inputs (831, 833) of the register files (821, 823); the register values are then passed over the hardwires from the register file (821) to the register file (823) in multiple batches. In this way, the transfer of all or part of the register values in the register file can be completed in multiple cycles with only a small number of hardwires added.
- adjacent processor cores (940, 942) each have a register file (941, 943) containing a plurality of 32-bit general purpose registers.
- the preceding processor core (940) can first use a data store instruction to write the value of a register in the register file (941) into the local data memory (954) corresponding to the preceding processor core (940); the processor core (942) of the present stage then uses a data load instruction to read the corresponding data from the local data memory (954) and write it into the corresponding register of the register file (943).
- the data output (949) of the register file (941) in the preceding processor core (940) is connected through the 32-bit connection (946) to the data input (948) of the local data memory (954).
- the data input (950) of the register file (943) in the processor core (942) of the current stage is connected, through the multiplexer (947) and the 32-bit connection (953), to the data output (952) of the local data memory (954).
- the inputs of the multiplexer (947) are the data from the processor core of the present stage (944) and the data from the preceding processor core (945) transmitted through the 32-bit connection (953).
- when the processor core performs arithmetic, logic, and similar operations, the multiplexer (947) selects the data from the processor core of the current stage (944); when the processor core performs a load operation, if the data already exists in the local memory corresponding to the processor core of the current stage, the data from the processor core of the current stage is selected (944), otherwise the data from the preceding processor core is selected (945).
- when register values are transferred, the multiplexer (947) selects the data transmitted from the preceding processor core (945).
- the values of all the registers in the register file (941) may be written into the local data memory (954) in sequence and then written into the register file (943) in sequence; the values of some registers in the register file (941) may likewise be written into the local data memory (954) in sequence and then into the register file (943) in sequence; alternatively, each time the value of one register in the register file (941) is written into the local data memory (954), that value is immediately written into the register file (943), and the process repeats until all register values that need to be transferred have been passed.
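The store-then-load transfer scheme above can be sketched in a few lines. This is an illustrative model only (names and the dict-based memory are invented): the preceding core stores selected register values to its local data memory, and the present-stage core loads them into its own register file.

```python
def transfer_registers(src_regs, reg_indices):
    """Behavioral sketch of register transfer via the local data memory.

    src_regs:    {index: value} register file of the preceding core (941)
    reg_indices: which registers need to be transferred

    Returns the present-stage register file (943) entries after the
    loads. The all-at-once, partial, and one-at-a-time variants in the
    text differ only in ordering; the end state is the same.
    """
    local_data_memory = {}
    # preceding core: data store instructions
    for idx in reg_indices:
        local_data_memory[idx] = src_regs[idx]
    # present-stage core: data load instructions
    dst_regs = {}
    for idx in reg_indices:
        dst_regs[idx] = local_data_memory[idx]
    return dst_regs
```

Transferring registers 0 and 2 from a file holding `{0: 11, 1: 22, 2: 33}` yields `{0: 11, 2: 33}` in the destination file, leaving register 1 untouched.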
- FIG. 10 shows two embodiments of a connection structure composed of a processor core and a corresponding local memory according to the present invention.
- Various possible substitutions, adjustments, and improvements of the various components of the embodiments may be made in accordance with the technical solutions and concepts of the present invention, and all such replacements, adjustments, and improvements are intended to be within the scope of the present invention.
- the corresponding embodiment includes a processor core (1001) containing a local instruction memory and a local data memory, together with the local data memory (1002) corresponding to its preceding processor core.
- the processor core (1001) is composed of a local instruction memory (1003), a local data memory (1004), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009), and an output buffer (1010).
- the local instruction memory (1003) stores instructions required for execution by the processor core (1001).
- the operands required by the execution unit (1005) in the processor core (1001) come from the register file (1006) or from the immediate value in the instruction; the execution result is written back to the register file (1006).
- each local data memory has two sub-memories. Taking the local data memory (1004) as an example, the data read from the two sub-memories is selected by the multiplexers (1018, 1019) to produce the final output data (1020).
- the data in the local data memory (1002, 1004), the data in the write buffer (1009), or the data in the external shared memory (1011) can be read into the register file (1006) by a data load (load) instruction.
- the data in the local data memory (1002, 1004), the data in the write buffer (1009), and the data in the external shared memory (1011) are selected by the multiplexers (1016, 1017) and input into the register file (1006).
- by a data store instruction, the data in the register file (1006) can be written, after a delay through the write buffer (1009), into the local data memory (1004), or written, after a delay through the output buffer (1010), into the external shared memory.
- data can be read from the local data memory (1002) into the register file (1006) while simultaneously being written, after a delay through the write buffer (1009), into the local data memory (1004), performing the LIS function of the present invention and achieving data transfer at no cost.
- the data received by the write buffer (1009) has three sources: data from the register file (1006), data from the local data memory (1002) of the preceding processor core, and data from the external shared memory (1011).
- the data from the register file (1006), the data from the local data memory (1002) of the preceding processor core, and the data from the external shared memory (1011) are selected by the multiplexer (1012) and then input to the write buffer (1009).
- the local data store only receives data inputs from the write buffer in the same processor core.
- the local data memory (1004) only receives data input from the write buffer (1009).
- the local instruction memory (1003) and the local data memories (1002, 1004) are each composed of two identical sub-memories, and read and write operations can be performed simultaneously on different sub-memories within the same local memory. With such a structure, the ping-pong buffer exchange of the local data memories described in the technical solution of the present invention can be realized.
- the address received by the local instruction memory (1003) is generated by the program counter (1008).
- the address received by the local data memory (1004) has three sources: the address for storing data from the address storage portion of the write buffer (1009) of the processor core of the present stage, the address for reading data from the data address generation module (1007) of the processor core of the present stage, and the address for reading data from the data address generation module (1013) of the subsequent processor core. These addresses are selected by the multiplexers (1014, 1015) and input to the address receiving modules of the different sub-memories in the local data memory (1004).
- the address received by the local data memory (1002) likewise has three sources: the address for storing data from the address storage portion of the write buffer of the preceding processor core, the address for reading data from the data address generation module of the preceding processor core, and the address for reading data from the data address generation module (1007) of the processor core of the present stage (the subsequent core relative to that memory). These addresses are selected by multiplexers and input to the address receiving modules of the different sub-memories in the local data memory (1002).
- FIG. 10(b) shows another connection structure based on the processor core and corresponding local memory according to the present invention, comprising a processor core (1021) containing a local instruction memory and a local data memory, together with the local data memory (1022) corresponding to its preceding processor core.
- the processor core (1021) is composed of a local instruction memory (1003), a local data memory (1024), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009), and an output buffer (1010).
- the connection structure proposed in this embodiment is substantially the same as the structure of the embodiment of Figure 10(a), the only difference being that the local data memories (1022, 1024) in this embodiment each use a dual-port memory structure. A dual-port memory can support read and write operations at two different addresses simultaneously.
- the address received by the local data memory (1024) has three sources: the address for storing data from the address storage portion of the write buffer (1009) of the processor core of the present stage, the address for reading data from the data address generation module (1007) of the processor core of the present stage, and the address for reading data from the data address generation module (1025) of the subsequent processor core.
- the address for reading data from the data address generation module (1025) of the subsequent processor core is selected by the multiplexer (1026) and input to the address receiving module of the local data memory (1024).
- the address received by the local data memory (1022) likewise has three sources: the address for storing data from the address storage portion of the write buffer of the preceding processor core, the address for reading data from the data address generation module of the preceding processor core, and the address for reading data from the data address generation module (1007) of the processor core of the present stage. These addresses are selected by a multiplexer and input to the address receiving module of the local data memory (1022).
- a single-port memory can be used instead of the dual-port memory in the corresponding embodiment of FIG. 10(b).
- the order of the instructions in the program can be statically adjusted, or the execution order of the instructions dynamically adjusted during program execution, so that instructions that access the memory are executed at the same time as instructions that do not need to access the memory, thereby making the connection structure simpler and more efficient.
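The static reordering idea above can be illustrated with a toy scheduler that pairs each memory-access instruction with a non-memory instruction so the two issue in the same slot. This is only a sketch of the pairing principle (a real compiler must also respect data dependences, which this toy ignores); the interface is invented.

```python
def pair_issue(instructions):
    """Toy static scheduler for the reordering described above.

    instructions: list of (name, is_memory_op) tuples.

    Returns issue slots, each holding at most one memory-access
    instruction plus at most one non-memory (ALU) instruction, so a
    single-port memory is never asked for two accesses in one slot.
    """
    mem = [i for i in instructions if i[1]]       # memory-access ops
    alu = [i for i in instructions if not i[1]]   # non-memory ops
    slots = []
    while mem or alu:
        slot = []
        if mem:
            slot.append(mem.pop(0))  # one memory op per slot
        if alu:
            slot.append(alu.pop(0))  # overlapped with one ALU op
        slots.append(slot)
    return slots
```

Four instructions alternating load/add/store/sub collapse into two dual-issue slots, halving the cycle count while the memory still sees only one access per slot.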
- each local data memory is effectively a dual-port memory capable of supporting two reads, two writes, or one read and one write simultaneously.
- as shown in Figure 10(c), a valid flag bit (1032) and an attribution flag bit (1033) can be added to each address in the local data memory (1031).
- the valid flag bit (1032) represents the validity of the data (1034) corresponding to the address in the local data memory (1031); for example, "1" can represent that the data (1034) corresponding to the address in the local data memory (1031) is valid, and "0" that it is invalid.
- the attribution flag bit (1033) represents which processor cores use the data (1034) corresponding to the address in the local data memory (1031); for example, "0" can represent that the data (1034) is used only by the processor core (1035) corresponding to the local data memory (1031), and "1" that the data (1034) is used by both the corresponding processor core (1035) and its subsequent processor core (1036).
- with the valid flag bit (1032) and the attribution flag bit (1033) defined as above, the attributes of each piece of data stored in the local data memory can be described, and correct reads and writes guaranteed.
- if the valid flag bit (1032) corresponding to an address in the local data memory (1031) is "0", the data at that address is invalid, i.e. the address can directly be the target of a data store operation if needed.
- if the valid flag bit (1032) is "1" and the attribution flag bit (1033) is "0", the data is valid and used only by the processor core (1035) of the present stage, so the processor core (1035) of the present stage can directly perform a data store operation on the address if needed.
- if the valid flag bit (1032) is "1" and the attribution flag bit (1033) is "1", the data is used by both the processor core (1035) of the present stage and its subsequent processor core (1036); if the processor core (1035) needs to perform a data store operation on the address, it must wait until the attribution flag bit (1033) is cleared before the store can proceed. That is, the data (1034) corresponding to the address is first transmitted to the corresponding position in the local data memory (1037) of the subsequent processor core (1036), and the attribution flag bit (1033) corresponding to the address in the local data memory (1031) of the present-stage processor core (1035) is set to "0"; the processor core (1035) of the present stage can then perform the data store operation on the address.
- when data is written, the corresponding valid flag bit (1032) can be set to "1", and the attribution flag bit (1033) set according to whether the data (1034) is used by the subsequent processor core (1036): if it is, the attribution flag bit (1033) is set to "1", otherwise to "0". Alternatively, the corresponding valid flag bit (1032) and attribution flag bit (1033) can both always be set to "1"; this increases the required capacity of the local data memory (1031), but simplifies its concrete implementation structure.
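The store policy implied by the two flag bits can be sketched as follows. This is a behavioral model with an invented interface; in particular, the "wait until the attribution bit clears" step is collapsed here into an immediate hand-off of the old data to the next core's memory.

```python
def store(entry, new_data, next_memory, addr):
    """Sketch of a store under the V (valid) and A (attribution) bits.

    entry:       dict {"V": 0|1, "A": 0|1, "data": ...} for one address
    next_memory: {addr: value} of the subsequent core's local memory

    V=0, or V=1 with A=0: the store may proceed immediately.
    V=1 with A=1: the data is still owned by the subsequent core, so it
    is first handed to next_memory and A is cleared, then the store runs
    (the real hardware waits for this hand-off instead).
    """
    if entry["V"] == 1 and entry["A"] == 1:
        next_memory[addr] = entry["data"]  # pass old data downstream first
        entry["A"] = 0                     # ownership released
    entry["data"] = new_data
    entry["V"] = 1
    return entry
```

Storing over a (V=1, A=1) entry first deposits the old value into the next memory; storing over an invalid or locally-owned entry touches nothing downstream.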
- Figure 11 (a) shows the typical structure of the current system-on-chip.
- the processor core (1101), the digital signal processor core (1102), the functional units (1103, 1104, 1105), the input/output interface control module (1106), and the storage control module (1108) are all connected to the system bus (1110). The system-on-chip can transfer data with peripheral devices (1107) through the input/output interface control module (1106), and with the external memory (1109) through the storage control module (1108).
- FIG. 11(b) shows an embodiment of implementing a system on chip based on the technical solution of the present invention.
- the processor core and corresponding local memory (1121) together with six other processor cores and corresponding local memories constitute a functional module (1124).
- the processor core and corresponding local memory (1122) together with four other processor cores and corresponding local memories constitute a functional module (1125).
- the processor core and corresponding local memory (1123) together with two other processor cores and corresponding local memories constitute a functional module (1126).
- the functional modules (1124, 1125, 1126) may each correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104, or 1105), the input/output interface control module (1106), or the storage control module (1108) in the embodiment of FIG. 11(a).
- the processor cores and corresponding local memories (1123, 1127, 1128, 1129) constitute a serially connected multi-core structure; together, these four processor cores and corresponding local memories (1123, 1127, 1128, 1129) realize the functions of the function module (1126).
- Data transfer between the processor core and corresponding local memory (1123) and the processor core and corresponding local memory (1127) is accomplished via an internal connection (1130).
- the data transfer between the processor core and corresponding local memory (1127) and the processor core and corresponding local memory (1128) is implemented via an internal connection (1131), and the data transfer between the processor core and corresponding local memory (1128) and the processor core and corresponding local memory (1129) via an internal connection (1132).
- the function module (1126) is connected to the bus connection module (1138) through hard wiring (1133, 1134), so that the function module (1126) and the bus connection module (1138) can transfer data to each other.
- the function module (1125) and the bus connection module (1139) can transfer data to each other, and the function module (1124) and the bus connection module (1140, 1141) can transfer data to each other.
- the bus connection module (1138) and the bus connection module (1139) can transfer data to each other through hardwires (1135).
- the bus connection module (1139) and the bus connection module (1140) can transfer data to each other through hardwires (1136).
- the bus connection module (1140) and the bus connection module (1141) can transfer data to each other through hardwires (1137).
- since the number of processor cores and corresponding local memories in the configurable multi-core/many-core device proposed by the present invention is easily expanded, various types of system-on-chip can be conveniently implemented by the method of this embodiment.
- while the configurable multi-core/many-core device operates, the structure of the system-on-chip can also be flexibly changed through a real-time dynamic configuration method.
- FIG. 11(c) shows another embodiment of implementing a system on chip based on the technical solution of the present invention.
- the processor core and corresponding local memory (1151) together with six other processor cores and corresponding local memories constitute a functional module (1163).
- the processor core and corresponding local memory (1152) together with four other processor cores and corresponding local memories constitute a functional module (1164).
- the processor core and corresponding local memory (1153) together with two other processor cores and corresponding local memories constitute a functional module (1165).
- the functional modules (1163, 1164, 1165) may each correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104, or 1105), the input/output interface control module (1106), or the storage control module (1108) in the embodiment of FIG. 11(a).
- the processor cores and corresponding local memories (1153, 1154, 1155, 1156) constitute a serially connected multi-core structure; together, these four processor cores and corresponding local memories (1153, 1154, 1155, 1156) realize the functions of the function module (1165).
- Data transfer between the processor core and corresponding local memory (1153) and the processor core and corresponding local memory (1154) is accomplished via an internal connection (1160).
- the data transfer between the processor core and corresponding local memory (1154) and the processor core and corresponding local memory (1155) is implemented via an internal connection (1161), and the data transfer between the processor core and corresponding local memory (1155) and the processor core and corresponding local memory (1156) via an internal connection (1162).
- as an example, the data transmission requirement between the function module (1165) and the function module (1164) is implemented through data transfer between the processor core and corresponding local memory (1156) and the processor core and corresponding local memory (1166).
- the interconnect network can be configured according to the data transmission requirements.
- a bidirectional data path (1158) between the processor core and its corresponding local memory (1156) and the processor core and its corresponding local memory (1166) is automatically configured. When the processor core and its corresponding local memory (1166) only needs to transfer data to the processor core and its corresponding local memory (1156), or when the processor core and its corresponding local memory (1156) only needs to transmit data in one direction to the processor core and its corresponding local memory (1166), a one-way data path can be established in the same way.
- data paths are likewise established for the processor core and its corresponding local memory (1151) and its counterpart. The bidirectional data paths (1157, 1158, 1159) implement the function of the system bus (1110) in FIG. 11(a), and together with the function modules (1163, 1164, 1165) constitute a typical system-on-chip architecture.
- in such a system-on-chip there is not necessarily only a single set of data paths between any two functional modules. Since the number of processor cores in the configurable multi-core/many-core apparatus proposed by the present invention is easily expanded, various types of system-on-chip can be conveniently implemented by the method of this embodiment. In addition, when the configurable multi-core/many-core device based on the present invention operates in real time, the structure of the system-on-chip can be flexibly changed through a real-time dynamic configuration method.
- Fig. 12(a) is a pre-compilation embodiment
- Fig. 12(b) is a post-compilation embodiment.
- the left side is the original program code (1201, 1203, 1204).
- there are two function calls in the code: an A function call and a B function call.
- 1203 and 1204 are the code of the A function and the B function themselves.
- after pre-compilation, the A function call and the B function call are each replaced with the corresponding function code, so that there is no function call left in the expanded code, as shown in 1202.
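The pre-compilation expansion above is ordinary call inlining. The toy below illustrates it on a list-of-strings program (representation and names are invented; the function bodies are assumed call-free, matching the simple case of blocks 1203/1204).

```python
def inline_calls(program, functions):
    """Toy model of the pre-compilation step of Figure 12(a).

    program:   list of instruction strings; "call X" marks a call site
    functions: {name: body}, each body a call-free list of instructions

    Every call site is replaced by the callee's body, so the result
    contains no function calls, like the expanded code 1202.
    """
    expanded = []
    for instr in program:
        if instr.startswith("call "):
            expanded.extend(functions[instr.split()[1]])  # splice body in
        else:
            expanded.append(instr)
    return expanded
```

For recursive or nested calls a real implementation would expand repeatedly (or refuse recursion); this sketch handles only one level, which is enough to show the call sites disappearing.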
- Figure 12 (b) shows the post-compilation embodiment.
- the original object code (1205) is the object code after normal compilation.
- based on this sequentially executed object code, post-compilation splitting forms multiple code blocks.
- each code block is assigned to a corresponding one of the processor cores for execution.
- the A loop body is split into a single code block (1207), while the B loop body, being relatively large, is split into two code blocks, namely B loop body 1 (1209) and B loop body 2 (1210); these two code blocks are executed on two processor cores to complete the B loop body.
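The splitting step can be sketched as cutting the sequential object code into contiguous, roughly balanced blocks, one per processor core, so that an oversized region (like loop body B above) spans two cores. This toy ignores the real splitter's concerns (loop boundaries, inter-block data flow); the interface is invented.

```python
def split_into_blocks(code, num_cores):
    """Toy model of post-compilation splitting (Figure 12(b)).

    code:      list of instructions in sequential execution order
    num_cores: number of macro-pipeline stages to fill

    Returns num_cores contiguous blocks of near-equal size; each block
    would be assigned to one processor core for execution.
    """
    size = -(-len(code) // num_cores)  # ceiling division
    return [code[i:i + size] for i in range(0, len(code), size)]
```

Splitting six instructions across three cores yields three blocks of two instructions each; an uneven count leaves the last block shorter, mirroring how a large loop body gets more than one core while small ones share.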
- FIG. 13(a) is a schematic diagram of a configurable multi-core/many-core device based on serial multi-emission and pipeline hierarchy
- FIG. 13(b) is a schematic diagram of a multi-core serial structure formed by configuration
- Figure 13 (c) is a schematic diagram of a multi-core serial-parallel hybrid structure formed by configuration
- Figure 13 (d) is a schematic diagram of a plurality of multi-core structures formed by configuration.
- the device consists of multiple processor cores and configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) and configurable interconnect structures (1302, 1304, 1306, 1308, 1310, 1312, 1314, 1316, 1318).
- each processor core with its configurable local memory forms one stage of the macro pipeline.
- by configuring the configurable interconnect structures (such as 1302), multiple processor cores and configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) are connected into a serial connection structure.
- multiple serial connection structures may be independent of each other, or may be partially or fully interconnected, running programs serially, in parallel, or in a serial-parallel combination.
- a multi-core serial structure in the figure is formed by configuring the corresponding configurable interconnect structures, in which the processor core and configurable local memory (1301) are the first stage of the multi-core serial structure and the processor core and configurable local memory (1317) are the last stage.
- the processor cores and configurable local memories (1301, 1303, 1305, 1313, 1315, 1317) form a serial structure, while the processor cores and configurable local memories (1307, 1309, 1311) form a parallel structure, ultimately yielding a multi-core processor with a serial-parallel hybrid structure.
- the processor cores and configurable local memories (1301, 1307, 1313, 1315) form a serial structure by configuring the corresponding configurable interconnect structures, while the processor cores and configurable local memories (1303, 1309, 1305, 1311, 1317) form another serial structure, thereby forming two completely independent serial structures.
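A serial-chain configuration of the interconnect can be represented simply as a successor map: each interconnect entry names the next stage a core's output feeds, and independent chains (as in the two-chain configuration just described) are disjoint paths through that map. The representation below is an illustration, not the patent's configuration format.

```python
def chain_from(config, start):
    """Sketch of reading one serial chain out of an interconnect config.

    config: {core_id: successor_core_id or None}, where None marks the
            last stage of a chain (configuration format invented here)
    start:  core id of the chain's first stage

    Returns the ordered list of stages in the serial structure.
    """
    chain, core = [], start
    while core is not None:
        chain.append(core)
        core = config.get(core)
    return chain

# One of the two independent chains described above, using the figure's ids:
config = {1301: 1307, 1307: 1313, 1313: 1315, 1315: None}
```

Reconfiguring at run time is then just rewriting successor entries, which is what the real-time dynamic configuration method amounts to in this model.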
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Logic Circuits (AREA)
- Advance Control (AREA)
- Microcomputers (AREA)
- Devices For Executing Special Programs (AREA)
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP09828544A EP2372530A4 (en) | 2008-11-28 | 2009-11-30 | METHOD AND DEVICE FOR DATA PROCESSING |
KR1020117014902A KR101275698B1 (ko) | 2008-11-28 | 2009-11-30 | Data processing method and apparatus |
US13/118,360 US20110231616A1 (en) | 2008-11-28 | 2011-05-27 | Data processing method and system |
Applications Claiming Priority (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200810203778.7 | 2008-11-28 | ||
CN200810203777.2 | 2008-11-28 | ||
CN200810203777A CN101751280A (zh) | 2008-11-28 | 2008-11-28 | Post-compilation system for program partitioning for multi-core/many-core processors |
CN200810203778A CN101751373A (zh) | 2008-11-28 | 2008-11-28 | Configurable multi-core/many-core system based on single-instruction-set microprocessor arithmetic units |
CN200910046117.2 | 2009-02-11 | ||
CN200910046117 | 2009-02-11 | ||
CN200910208432.0A CN101799750B (zh) | 2009-02-11 | 2009-09-29 | Data processing method and apparatus |
CN200910208432.0 | 2009-09-29 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/118,360 Continuation US20110231616A1 (en) | 2008-11-28 | 2011-05-27 | Data processing method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010060283A1 true WO2010060283A1 (zh) | 2010-06-03 |
Family
ID=42225216
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2009/001346 WO2010060283A1 (zh) | 2008-11-28 | 2009-11-30 | Data processing method and apparatus |
Country Status (4)
Country | Link |
---|---|
US (1) | US20110231616A1 (zh) |
EP (1) | EP2372530A4 (zh) |
KR (1) | KR101275698B1 (zh) |
WO (1) | WO2010060283A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104102475A (zh) * | 2013-04-11 | 2014-10-15 | 腾讯科技(深圳)有限公司 | Method, apparatus and system for distributed parallel task processing |
CN107750374A (zh) * | 2015-05-08 | 2018-03-02 | 鞍点有限责任两合公司 | Hazard alarm center with means for setting configuration parameters |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2751388A1 (en) * | 2011-09-01 | 2013-03-01 | Secodix Corporation | Method and system for multi-mode instruction-level streaming |
CN102646059B (zh) * | 2011-12-01 | 2017-10-20 | 中兴通讯股份有限公司 | Load-balancing processing method and apparatus for a multi-core processor system |
US9465619B1 (en) * | 2012-11-29 | 2016-10-11 | Marvell Israel (M.I.S.L) Ltd. | Systems and methods for shared pipeline architectures having minimalized delay |
US9032256B2 (en) * | 2013-01-11 | 2015-05-12 | International Business Machines Corporation | Multi-core processor comparison encoding |
WO2015050474A1 (en) * | 2013-10-03 | 2015-04-09 | Huawei Technologies Co., Ltd | Method and system for assigning a computational block of a software program to cores of a multi-processor system |
US9294097B1 (en) | 2013-11-15 | 2016-03-22 | Scientific Concepts International Corporation | Device array topology configuration and source code partitioning for device arrays |
US9698791B2 (en) | 2013-11-15 | 2017-07-04 | Scientific Concepts International Corporation | Programmable forwarding plane |
US10326448B2 (en) | 2013-11-15 | 2019-06-18 | Scientific Concepts International Corporation | Code partitioning for the array of devices |
US9460012B2 (en) | 2014-02-18 | 2016-10-04 | National University Of Singapore | Fusible and reconfigurable cache architecture |
CN103955406A (zh) * | 2014-04-14 | 2014-07-30 | 浙江大学 | Superblock-based speculative parallelization method |
US10318356B2 (en) * | 2016-03-31 | 2019-06-11 | International Business Machines Corporation | Operation of a multi-slice processor implementing a hardware level transfer of an execution thread |
US10055155B2 (en) * | 2016-05-27 | 2018-08-21 | Wind River Systems, Inc. | Secure system on chip |
US20180259576A1 (en) * | 2017-03-09 | 2018-09-13 | International Business Machines Corporation | Implementing integrated circuit yield enhancement through array fault detection and correction using combined abist, lbist, and repair techniques |
WO2019089918A1 (en) * | 2017-11-03 | 2019-05-09 | Coherent Logix, Inc. | Programming flow for multi-processor system |
US11435947B2 (en) | 2019-07-02 | 2022-09-06 | Samsung Electronics Co., Ltd. | Storage device with reduced communication overhead using hardware logic |
KR102246797B1 (ko) * | 2019-11-07 | 2021-04-30 | 국방과학연구소 | Apparatus, method, computer-readable recording medium and computer program for instruction code generation |
EP4085354A4 (en) * | 2019-12-30 | 2024-03-13 | Star Ally International Limited | PROCESSOR FOR CONFIGURABLE PARALLEL CALCULATIONS |
KR102320270B1 (ko) * | 2020-02-17 | 2021-11-02 | (주)티앤원 | Wireless microcontroller kit for education |
US11734017B1 (en) | 2020-12-07 | 2023-08-22 | Waymo Llc | Methods and systems for processing vehicle sensor data across multiple digital signal processing cores virtually arranged in segments based on a type of sensor |
US11782602B2 (en) | 2021-06-24 | 2023-10-10 | Western Digital Technologies, Inc. | Providing priority indicators for NVMe data communication streams |
US11960730B2 (en) | 2021-06-28 | 2024-04-16 | Western Digital Technologies, Inc. | Distributed exception handling in solid state drives |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1567187A (zh) * | 2003-06-11 | 2005-01-19 | 华为技术有限公司 | Data processing system and method |
JP2008146503A (ja) * | 2006-12-12 | 2008-06-26 | Sony Computer Entertainment Inc | Distributed processing method, operating system and multiprocessor system |
Family Cites Families (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4089059A (en) * | 1975-07-21 | 1978-05-09 | Hewlett-Packard Company | Programmable calculator employing a read-write memory having a movable boundary between program and data storage sections thereof |
SE448680B (sv) * | 1984-05-10 | 1987-03-16 | Duma Ab | Doseringsanordning till en injektionsspruta |
CA2143145C (en) * | 1994-04-18 | 1999-12-28 | Premkumar Thomas Devanbu | Determining dynamic properties of programs |
US5732209A (en) * | 1995-11-29 | 1998-03-24 | Exponential Technology, Inc. | Self-testing multi-processor die with internal compare points |
US7080238B2 (en) * | 2000-11-07 | 2006-07-18 | Alcatel Internetworking, (Pe), Inc. | Non-blocking, multi-context pipelined processor |
EP1205840B1 (en) * | 2000-11-08 | 2010-07-14 | Altera Corporation | Stall control in a processor with multiple pipelines |
US7062641B1 (en) * | 2001-01-10 | 2006-06-13 | Cisco Technology, Inc. | Method and apparatus for unified exception handling with distributed exception identification |
US6757761B1 (en) * | 2001-05-08 | 2004-06-29 | Tera Force Technology Corp. | Multi-processor architecture for parallel signal and image processing |
US20030046429A1 (en) * | 2001-08-30 | 2003-03-06 | Sonksen Bradley Stephen | Static data item processing |
US20050177679A1 (en) * | 2004-02-06 | 2005-08-11 | Alva Mauricio H. | Semiconductor memory device |
EP1619584A1 (en) * | 2004-02-13 | 2006-01-25 | Jaluna SA | Memory allocation |
US20070083785A1 (en) * | 2004-06-10 | 2007-04-12 | Sehat Sutardja | System with high power and low power processors and thread transfer |
US7536567B2 (en) * | 2004-12-10 | 2009-05-19 | Hewlett-Packard Development Company, L.P. | BIOS-based systems and methods of processor power management |
ATE393932T1 (de) * | 2004-12-22 | 2008-05-15 | Galileo Avionica Spa | Rekonfigurierbares mehrprozessorsystem besonders zur digitalen verarbeitung von radarbildern |
US7689867B2 (en) * | 2005-06-09 | 2010-03-30 | Intel Corporation | Multiprocessor breakpoint |
US7793278B2 (en) * | 2005-09-30 | 2010-09-07 | Intel Corporation | Systems and methods for affine-partitioning programs onto multiple processing units |
US8104030B2 (en) * | 2005-12-21 | 2012-01-24 | International Business Machines Corporation | Mechanism to restrict parallelization of loops |
US7689838B2 (en) * | 2005-12-22 | 2010-03-30 | Intel Corporation | Method and apparatus for providing for detecting processor state transitions |
US7784037B2 (en) * | 2006-04-14 | 2010-08-24 | International Business Machines Corporation | Compiler implemented software cache method in which non-aliased explicitly fetched data are excluded |
US20070250825A1 (en) * | 2006-04-21 | 2007-10-25 | Hicks Daniel R | Compiling Alternative Source Code Based on a Metafunction |
US7797563B1 (en) * | 2006-06-09 | 2010-09-14 | Oracle America | System and method for conserving power |
US8589666B2 (en) * | 2006-07-10 | 2013-11-19 | Src Computers, Inc. | Elimination of stream consumer loop overshoot effects |
US7665000B2 (en) * | 2007-03-07 | 2010-02-16 | Intel Corporation | Meeting point thread characterization |
- 2009
  - 2009-11-30 WO PCT/CN2009/001346 patent/WO2010060283A1/zh active Application Filing
  - 2009-11-30 EP EP09828544A patent/EP2372530A4/en not_active Withdrawn
  - 2009-11-30 KR KR1020117014902A patent/KR101275698B1/ko active IP Right Grant
- 2011
  - 2011-05-27 US US13/118,360 patent/US20110231616A1/en not_active Abandoned
Also Published As
Publication number | Publication date |
---|---|
EP2372530A1 (en) | 2011-10-05 |
KR101275698B1 (ko) | 2013-06-17 |
KR20110112810A (ko) | 2011-10-13 |
US20110231616A1 (en) | 2011-09-22 |
EP2372530A4 (en) | 2012-12-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2010060283A1 (zh) | 2010-06-03 | Data processing method and apparatus | |
JP6243935B2 (ja) | 2017-12-06 | Context switching method and apparatus | |
US7284092B2 (en) | Digital data processing apparatus having multi-level register file | |
Krashinsky et al. | The vector-thread architecture | |
EP1137984B1 (en) | A multiple-thread processor for threaded software applications | |
CN101799750B (zh) | 2009-02-11 | Data processing method and apparatus | |
US6988181B2 (en) | VLIW computer processing architecture having a scalable number of register files | |
US6826674B1 (en) | Program product and data processor | |
US7219185B2 (en) | Apparatus and method for selecting instructions for execution based on bank prediction of a multi-bank cache | |
CN105144082B (zh) | 2015-12-09 | Optimal logical processor count and type selection for a given workload based on platform thermal and power budget constraints | |
US20140181477A1 (en) | Compressing Execution Cycles For Divergent Execution In A Single Instruction Multiple Data (SIMD) Processor | |
GB2524126A (en) | Combining paths | |
JP2015534188 (ja) | 2015-11-26 | Novel instructions and highly efficient microarchitecture enabling immediate context switching for user-level threading | |
JP2006012163A5 (zh) | ||
US20120110303A1 (en) | Method for Process Synchronization of Embedded Applications in Multi-Core Systems | |
JP2011513843A (ja) | 実行装置内のデータ転送のシステムおよび方法 | |
Zhang et al. | Leveraging caches to accelerate hash tables and memoization | |
US6594711B1 (en) | Method and apparatus for operating one or more caches in conjunction with direct memory access controller | |
EP4143682A1 (en) | Handling multiple graphs, contexts and programs in a coarse-grain reconfigurable array processor | |
CN114691597 (zh) | 2022-07-01 | Adaptive remote atomic operations | |
CN114253607 (zh) | 2022-03-29 | Method, system and apparatus for out-of-order access to a shared microcode sequencer by a clustered decode pipeline | |
US6119220A (en) | Method of and apparatus for supplying multiple instruction strings whose addresses are discontinued by branch instructions | |
US7080234B2 (en) | VLIW computer processing architecture having the problem counter stored in a register file register | |
CN108845832B (zh) | 2018-11-20 | Pipeline subdivision apparatus for increasing processor clock frequency | |
CN112148106 (zh) | 2020-12-29 | System, apparatus and method for a hybrid reservation station for a processor | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09828544; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| WWE | Wipo information: entry into national phase | Ref document number: 4513/CHENP/2011; Country of ref document: IN |
| ENP | Entry into the national phase | Ref document number: 20117014902; Country of ref document: KR; Kind code of ref document: A |
| WWE | Wipo information: entry into national phase | Ref document number: 2009828544; Country of ref document: EP |