CN101799750B - Data processing method and device - Google Patents

Info

Publication number: CN101799750B
Authority: CN
Grant status: Grant
Prior art keywords: data, processor core, core, memory, processor
Application number: CN 200910208432
Other languages: Chinese (zh)
Other versions: CN101799750A (en)
Inventors: 林正浩, 任浩琪, 王静
Original assignee: 上海芯豪微电子有限公司

Abstract

A data processing method and device that partition program code running on a serially connected multi-processor-core structure according to specific rules, so that the structure forms a serial multi-issue and pipeline hierarchy, and so that the time each core needs to run its corresponding code segment is as equal as possible, achieving load balancing of the workload among the cores.

Description

A data processing method and device

Technical Field

[0001] The present invention relates to the field of integrated circuit design.

Background

[0002] According to Moore's Law, transistor feature sizes are steadily shrinking along the path of 65 nm, 45 nm, 32 nm and so on, and the number of transistors integrated on a single chip now exceeds well over one billion. However, since synthesis and place-and-route tools were introduced in the 1980s and freed up back-end design productivity, EDA tools have seen no qualitative breakthrough for more than 20 years. As a result, front-end design, and verification in particular, is increasingly unable to cope with the growing scale of single chips. Design companies have therefore turned to multi-core, that is, integrating multiple relatively simple cores on one chip, which increases chip functionality while reducing the difficulty of design and verification.

[0003] Conventional multi-core processors integrate multiple processor cores that execute programs in parallel to improve chip performance. With a conventional multi-core processor, resources can be fully used only if the software is written with parallel programming in mind. However, the way operating systems allocate and manage resources has not fundamentally changed; resources are mostly distributed evenly in a symmetric fashion. Although multiple processor cores can compute in parallel, a single program thread executes serially, so true pipelined operation cannot be achieved in a conventional multi-core architecture. In addition, current software still contains a large number of programs that must execute serially and cannot be partitioned well. Therefore, once the number of processor cores reaches a certain point, performance no longer improves as more cores are added. Furthermore, as semiconductor manufacturing processes continue to advance, the internal operating frequency of multi-core processors has become much higher than that of their external memory, and simultaneous memory access by multiple processor cores has become a major bottleneck for system performance. Running serial programs on a parallel multi-core structure cannot deliver the expected performance gains.

Summary of the Invention

[0004] To address the deficiencies of the prior art, the present invention proposes a data processing method and device for running serial programs at high speed and improving throughput.

[0005] The data processing method and device of the present invention include: partitioning program code running on a serially connected multi-processor-core structure according to specific rules, so that the time each core in the structure needs to run its corresponding partitioned code segment is as equal as possible, achieving load balancing of the workload among the cores. The serially connected multi-processor-core structure contains a plurality of processor cores. A processor here means hardware that performs operations and reads and writes data by executing instructions, including but not limited to a central processing unit (CPU) and a digital signal processor (DSP).

[0006] The serially connected multi-processor-core structure of the present invention forms serial multi-issue (in-serial multi-issue): any core in the structure can issue one or more instructions per unit time, and a plurality of serially connected cores together form a larger-scale multi-issue, that is, serial multi-issue.

[0007] The serially connected multi-processor-core structure of the present invention forms a pipeline hierarchy. The internal pipeline of any core in the structure is the first level. The macro pipeline formed by treating each core in the structure as one macro pipeline stage is the second level. Further, higher levels can be obtained in the same way, for example a third level formed by treating a whole serially connected multi-processor-core structure as one stage of a still higher-level pipeline.

[0008] The code segments on the cores of the serially connected multi-processor-core structure are produced, as one or more code segments, through some or all of three steps: pre-compile, compile, and post-compile. The program code includes but is not limited to high-level language code and assembly language code.

[0009] Compile refers to compilation in the conventional sense, from program source code to object code;

[0010] Pre-compile is preprocessing applied to the program source code before compile, including but not limited to expanding the calls in the program before compilation, replacing each call statement with the code actually called, so as to form program code without calls; such calls include but are not limited to function calls;

[0011] Post-compile divides the object code obtained from compile into one or more code segments according to the work content and load to be assigned to each core in the serially connected multi-processor-core structure. Its steps include but are not limited to:

[0012] (a) parsing the executable program code to generate a front-end code stream;

[0013] (b) running and scanning the front-end code stream on a specific model, analyzing as required information such as the execution cycles needed, whether a jump occurs, and the jump address, collecting statistics on the scan results, and determining the partition information indirectly; or, without scanning the front-end code stream, determining the partition information directly from preset information; the specific model includes but is not limited to a behavioral model of the cores in the serially connected multi-processor-core structure;

[0014] (c) partitioning the executable program instruction code according to the partition information, generating the corresponding code segment for each processor core in the serially connected multi-processor-core structure.

[0015] The pre-compile method of the present invention may be carried out before the program source code is compiled, or as part of the compiler during compilation of the source code, or in real time while the serially connected multi-processor-core structure is running, as part of that structure's operating system, as a driver, or as an application.
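The call expansion performed by pre-compile in [0010] can be illustrated with a minimal sketch. The program representation here (a dict mapping function names to lists of `(kind, arg)` tuples) and the name `inline_calls` are hypothetical and chosen only for illustration; the patent does not specify a representation. The sketch also assumes the call graph contains no recursion.

```python
from typing import Dict, List, Tuple

Op = Tuple[str, str]  # (kind, arg); kind "call" means arg names a callee

def inline_calls(functions: Dict[str, List[Op]], entry: str = "main") -> List[Op]:
    """Return the body of `entry` with every call replaced by the callee's code.

    Hypothetical helper: assumes no recursive calls, so expansion terminates.
    """
    flat: List[Op] = []
    for kind, arg in functions[entry]:
        if kind == "call":
            # Replace the call statement with the (recursively expanded) callee body.
            flat.extend(inline_calls(functions, arg))
        else:
            flat.append((kind, arg))
    return flat

program = {
    "main": [("op", "load"), ("call", "scale"), ("op", "store")],
    "scale": [("op", "mul"), ("call", "clamp")],
    "clamp": [("op", "min"), ("op", "max")],
}
flat_program = inline_calls(program)
```

After expansion `flat_program` contains only `op` entries, which gives the later partitioning step a single linear instruction stream to cut.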

[0016] The post-compile method of the present invention may be carried out after compilation of the program source code has completed, or as part of the compiler during compilation of the source code, or in real time while the serially connected multi-processor-core structure is running, in forms including but not limited to part of that structure's operating system, a driver, or an application. When post-compile is carried out in real time, the corresponding configuration information in the code segments may be determined manually, may be generated automatically and dynamically according to how the structure is being used, or may be limited to fixed configuration information.

[0017] Through this partitioning, an existing application can be split into segments that execute simultaneously. This not only increases the speed of existing programs on multi-core/many-core devices but also makes full use of the efficiency of such devices, while preserving their compatibility with existing applications. It effectively resolves the difficulty that existing applications cannot take full advantage of multi-core/many-core processors.

[0018] In the post-compile method of the present invention, the basis for indirectly determining the partition information includes but is not limited to the number of cycles or the time of instruction execution, and the number of instructions. That is, the whole executable program code may be partitioned into code segments of equal or similar running time, based on the execution cycle count or time obtained by scanning the front-end code stream, or into code segments with equal or similar instruction counts, based on the number of instructions obtained by scanning the front-end code stream. The basis for directly determining the partition information includes but is not limited to the number of instructions; that is, the whole executable program code may be partitioned directly into code segments with equal or similar instruction counts.
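A minimal sketch of the cycle-based partitioning described in [0018], assuming the scan has already produced one cycle count per instruction. The function names and the prefix-sum cut rule are assumptions made for illustration; the patent does not prescribe a particular algorithm.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List

def partition_by_cycles(cycles: List[int], num_cores: int) -> List[int]:
    """Return the start index of each contiguous code segment.

    Segment k ends at the first instruction whose cumulative cycle count
    reaches k/num_cores of the total (an assumed, simple balancing rule).
    """
    prefix = list(accumulate(cycles))  # prefix[i] = cycles of instructions 0..i
    total = prefix[-1]
    starts = [0]
    for k in range(1, num_cores):
        target = total * k / num_cores
        cut = bisect_left(prefix, target) + 1  # first index after the target is reached
        starts.append(max(cut, starts[-1]))    # keep boundaries non-decreasing
    return starts

def segment_cycles(cycles: List[int], starts: List[int]) -> List[int]:
    """Total cycle count of each segment, for checking the balance."""
    ends = starts[1:] + [len(cycles)]
    return [sum(cycles[s:e]) for s, e in zip(starts, ends)]

cycles = [1, 1, 2, 4, 1, 3, 2, 2]  # hypothetical per-instruction cycle counts
starts = partition_by_cycles(cycles, 4)
loads = segment_cycles(cycles, starts)
```

Partitioning by instruction count instead is the special case in which every entry of `cycles` is 1.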

[0019] In the post-compile method of the present invention, partitioning of the executable program code avoids splitting loop code as far as possible, according to specific rules. When splitting loop code cannot be avoided, the loop code is split one or more times, according to specific rules, into a plurality of smaller loops. These smaller loops may each be part of the same code segment or of different code segments. A smaller loop includes but is not limited to a loop containing fewer instructions and a loop requiring fewer execution cycles.

[0020] In the post-compile method of the present invention, the code segments include but are not limited to: segmented executable object code and/or corresponding configuration information suitable for running on a serially connected multi-processor-core structure with a fixed number of processor cores; and unsegmented executable object code suitable for running on the structure, together with corresponding configuration information containing multiple sets of segmentation information suitable for a non-fixed number of cores. The segmentation information includes but is not limited to numbers giving the instruction count of each segment, specific markers representing segment boundaries, and a table indicating where each code segment starts.

[0021] For example, in such a device with 1000 processor cores, a table with 1000 entries can be generated for the maximum processor count of 1000. Each entry stores the position of the corresponding instruction in the unsegmented executable object code, and the instructions between two adjacent entries form the code segment that can run on the corresponding single core. If all 1000 processor cores are used at run time, each core runs the code between the positions in the unsegmented executable object code pointed to by its two corresponding table entries, that is, each core runs its one corresponding segment from the table. If only N processor cores are used at run time (N < 1000), each core runs 1000/N corresponding segments from the table; the specific code can be determined from the position information in the table.
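The table lookup in [0021] can be sketched as follows, using an 8-entry table instead of 1000 for brevity. The function name `code_range_for_core` is hypothetical, and the sketch assumes that N divides the table size evenly; the patent does not say how a non-divisible N is handled.

```python
from typing import List, Tuple

def code_range_for_core(table: List[int], code_len: int,
                        core: int, num_cores: int) -> Tuple[int, int]:
    """Return the [start, end) instruction range that `core` runs.

    table[i] is the position in the unsegmented object code where segment i starts.
    With fewer cores than entries, each core runs len(table) / num_cores
    consecutive segments (assumption: the division is exact).
    """
    entries = len(table)
    if entries % num_cores != 0:
        raise ValueError("this sketch assumes num_cores divides the table size")
    per_core = entries // num_cores
    first = core * per_core
    last = first + per_core
    start = table[first]
    # The last core runs to the end of the program.
    end = table[last] if last < entries else code_len
    return start, end

table = [0, 10, 25, 40, 50, 65, 80, 90]  # hypothetical segment start positions
full = code_range_for_core(table, 100, core=3, num_cores=8)
half = code_range_for_core(table, 100, core=1, num_cores=4)
```

Because the table is built once for the maximum core count, the same unsegmented object code can be reused for any smaller core count without rerunning post-compile.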

[0022] Besides its partitioned code segment, the instructions run on each processor core may include additional instructions. These include but are not limited to a code segment head extension and a code segment tail extension, which are used to achieve a smooth transition of instruction execution between different processor cores. For example, a tail extension can be appended to each code segment to store all values in the register file to specific locations in data memory, and a head extension can be prepended to each code segment to load the values at those specific locations in data memory into the register file. This passes register values between processor cores and ensures that the program runs correctly. When execution reaches the end of a code segment, the next instruction executed is the first instruction of that same code segment.
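A minimal simulation of the head and tail extensions in [0022], assuming a four-register file per core and a dict standing in for the data memory that links two adjacent cores. `REG_SAVE_BASE`, the function names, and the default value of 0 for locations never written are all assumptions for illustration.

```python
from typing import Callable, Dict, List

REG_SAVE_BASE = 0x100  # hypothetical fixed location for saved register values

def head_extension(regs: List[int], data_mem: Dict[int, int]) -> None:
    """Load the register file from the fixed data-memory locations."""
    for i in range(len(regs)):
        regs[i] = data_mem.get(REG_SAVE_BASE + i, 0)  # assume 0 if never written

def tail_extension(regs: List[int], data_mem: Dict[int, int]) -> None:
    """Store every register value to the fixed data-memory locations."""
    for i, value in enumerate(regs):
        data_mem[REG_SAVE_BASE + i] = value

def run_segment(regs: List[int], data_mem: Dict[int, int],
                body: Callable[[List[int]], None]) -> None:
    """Run one code segment wrapped in its head and tail extensions."""
    head_extension(regs, data_mem)
    body(regs)
    tail_extension(regs, data_mem)

def segment0(regs: List[int]) -> None:
    regs[0] = 5            # first half of the original serial program

def segment1(regs: List[int]) -> None:
    regs[1] = regs[0] + 1  # second half depends on a register set by segment0

data_mem: Dict[int, int] = {}
core0_regs = [0, 0, 0, 0]
core1_regs = [0, 0, 0, 0]
run_segment(core0_regs, data_mem, segment0)
run_segment(core1_regs, data_mem, segment1)
```

The second core sees `regs[0] == 5` only because the first core's tail extension saved it and the second core's head extension restored it.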

[0023] The data processing method and device of the present invention can be used to build a configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy, comprising a plurality of processor cores, a plurality of configurable local memories, and a configurable interconnect structure, where:

[0024] the processor core executes instructions, performs operations, and produces the corresponding results;

[0025] the configurable local memory stores instructions, and transfers and holds data between the processor cores;

[0026] the configurable interconnect structure connects the modules within the configurable multi-core/many-core device to one another and to the outside.

[0027] The configurable multi-core/many-core device may further include extension modules to meet a wider range of needs. The extension modules include but are not limited to one or more instances of some or all of the following modules:

[0028] shared memory, for holding data when the configurable data memory overflows and for passing shared data among a plurality of processor cores;

[0029] a direct memory access (DMA) controller, for direct access to the configurable local memory by modules other than the processor cores;

[0030] an exception handling module, for handling exceptions that occur in the processor cores and local memories.

[0031] In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, a processor core comprises an arithmetic unit and a program counter, and may further include extension modules to meet a wider range of needs; these extension modules include but are not limited to a register file. The instructions executed by the processor core include but are not limited to arithmetic instructions, logic instructions, conditional test and jump instructions, and exception trap and return instructions. The arithmetic and logic instructions include but are not limited to multiply, add/subtract, multiply-add/multiply-subtract, accumulate, shift, extract, and swap operations, and cover fixed-point and floating-point operations of any bit width less than or equal to the data width of the processor core. Each processor core completes one or more such instructions. The number of processor cores can be scaled according to the needs of the application.

[0032] In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, each processor core has a corresponding configurable local memory, which includes an instruction memory for holding the partitioned code segment and a configurable data memory for holding data.

[0033] Within one configurable local memory, the boundary between the instruction memory and the configurable data memory can be changed by configuration information. Once the size and boundary of the configurable data memory have been determined from the configuration information, the configurable data memory comprises a plurality of data sub-memories.

[0034] Within one configurable data memory, the boundaries between the data sub-memories can be changed by configuration information. Through address translation, a data sub-memory can be mapped to the entire address space of the multi-core/many-core device. The mapping includes but is not limited to address translation by table lookup and address translation by content-addressable memory (CAM) matching.

[0035] Each entry in a data sub-memory contains data and flag information; the flag information includes but is not limited to a valid bit and a data address. The valid bit indicates whether the data stored in the entry is valid. The data address indicates where in the entire address space of the multi-core/many-core device the data stored in the entry belongs.
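The entry layout in [0035], combined with CAM-style address matching from [0034], can be modeled with a short sketch. The class and method names are hypothetical, and the linear search merely stands in for what hardware would do as a parallel match across all entries.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Entry:
    valid: bool = False  # valid bit: is the stored data meaningful?
    address: int = 0     # position of the data in the device-wide address space
    data: int = 0

class DataSubMemory:
    """Toy model of one data sub-memory with CAM-like lookup by data address."""

    def __init__(self, num_entries: int) -> None:
        self.entries: List[Entry] = [Entry() for _ in range(num_entries)]

    def read(self, address: int) -> Optional[int]:
        """Return the data whose address tag matches, or None on a miss."""
        for entry in self.entries:
            if entry.valid and entry.address == address:
                return entry.data
        return None

    def write(self, address: int, data: int) -> bool:
        """Update a matching entry or claim a free one; False signals overflow."""
        for entry in self.entries:
            if entry.valid and entry.address == address:
                entry.data = data
                return True
        for entry in self.entries:
            if not entry.valid:
                entry.valid, entry.address, entry.data = True, address, data
                return True
        return False

    def invalidate_all(self) -> None:
        """Clear every valid bit, leaving the sub-memory logically empty."""
        for entry in self.entries:
            entry.valid = False

mem = DataSubMemory(num_entries=2)
stored = mem.write(0x8000, 7)
```

In this model a `False` return from `write` corresponds to the overflow case, in which the device would fall back to shared memory if one is present.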

[0036] In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, the configurable interconnect structure is configured to connect the modules within the device to one another and to the outside. These connections include but are not limited to those between a processor core and an adjacent configurable local memory, between a processor core and the shared memory, between a processor core and the DMA controller, between a configurable local memory and the shared memory, between a configurable local memory and the DMA controller, between a configurable local memory and the outside of the device, and between the shared memory and the outside of the device.

[0037] By configuration, two processor cores and their corresponding local memories can be arranged as a preceding stage and a following stage, including but not limited to the preceding-stage core passing data to the following-stage core through its corresponding configurable data memory.

[0038] According to application requirements, some or all of the processor cores and their corresponding local memories can be configured, through the configurable interconnect structure, into one or more serial connection structures. Multiple serial connection structures may be independent of one another, or some or all of them may be interrelated, executing instructions serially, in parallel, or in a serial-parallel mix. Such execution includes but is not limited to: different serial connection structures, under the control of a synchronization mechanism, running different program segments to execute different instructions in parallel or to run multiple threads in parallel, as the application requires; and different serial connection structures, under the control of a synchronization mechanism, running the same program segment to perform compute-intensive operations with the same instructions on different data in single instruction, multiple data (SIMD) fashion, as the application requires.

[0039] In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, the processor cores in a serial connection structure follow a specific data read policy and write policy.

[0040] Under the data read policy, the sources of input data for the first processor core in a serial connection structure include but are not limited to its own configurable data memory, the shared memory, and the outside of the configurable multi-core/many-core device. The sources of input data for any other processor core include but are not limited to its own configurable data memory and the configurable data memory of the preceding-stage processor core. Correspondingly, the destinations of output data from any processor core include but are not limited to its own configurable data memory and the shared memory; when extended memory is present, the destination may also be the extended memory.

[0041] Under the data write policy, the sources of input data for the configurable data memory of the first processor core in a serial connection structure include but are not limited to the processor core itself, the shared memory, and the outside of the configurable multi-core/many-core device. The sources of input data for the configurable data memory of any other processor core include but are not limited to the processor core itself, the configurable data memory of the preceding-stage processor core, and the shared memory. Input data from the different sources to a processor core and to its configurable data memory are multiplexed according to specific rules to determine the final input data.

[0042] The same configurable data memory can be accessed simultaneously by the two processor cores of the stages before and after it, with each core accessing a different data sub-memory within it. The cores may access the different data sub-memories of one configurable data memory according to specific rules, including but not limited to the following: different data sub-memories of the same configurable data memory serve as ping-pong buffers for each other and are accessed by the two processor cores respectively. After both the preceding-stage and the following-stage core have finished accessing the ping-pong buffers, the buffers are swapped. The data sub-memory previously read and written by the preceding-stage core becomes the data sub-memory read by the following-stage core. All valid bits in the data sub-memory previously read by the following-stage core are cleared, and it becomes the data sub-memory read and written by the preceding-stage core.
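The ping-pong swap in [0042] can be sketched with two dicts standing in for the two data sub-memories, where clearing a dict models clearing all valid bits. The class and method names are hypothetical, and the sketch assumes the caller invokes `swap` only after both cores have finished with their buffers.

```python
from typing import Dict, List, Optional

class PingPongDataMemory:
    """Toy model of one configurable data memory shared by two adjacent cores."""

    def __init__(self) -> None:
        self.buffers: List[Dict[int, int]] = [{}, {}]
        self.producer = 0  # index of the sub-memory owned by the preceding-stage core

    def producer_write(self, address: int, value: int) -> None:
        """Preceding-stage core writes into its own sub-memory."""
        self.buffers[self.producer][address] = value

    def consumer_read(self, address: int) -> Optional[int]:
        """Following-stage core reads from the other sub-memory."""
        return self.buffers[1 - self.producer].get(address)

    def swap(self) -> None:
        """Exchange roles once both cores are done (assumed to be checked by the caller)."""
        consumer = 1 - self.producer
        self.buffers[consumer].clear()  # models clearing every valid bit
        self.producer = consumer        # the old producer buffer is now the consumer's

pp = PingPongDataMemory()
pp.producer_write(0x10, 1)
before_swap = pp.consumer_read(0x10)
pp.swap()
after_swap = pp.consumer_read(0x10)
```

Because each core touches a different sub-memory between swaps, the two stages can run concurrently without arbitration on individual accesses.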

[0043] When the processor cores in the multi-core/many-core system include register files, a specific register value transfer rule is also needed: one or more register values in any preceding-stage processor core of a serial connection structure can be transferred to the corresponding registers of any following-stage processor core. The register values include but are not limited to the values of registers in the register file of the processor core. The transfer paths for register values include but are not limited to transfer through the configurable interconnect structure, direct transfer through the shared memory, direct transfer through the core's corresponding configurable data memory, transfer through the shared memory by specific instructions, and transfer through the core's corresponding configurable data memory by specific instructions.

[0044] In the configurable multi-core/many-core device based on serial multi-issue and a pipeline hierarchy of the present invention, a stage of the second level of the pipeline hierarchy, that is, a macro pipeline stage, can pass information about itself to the preceding macro pipeline stage by backpressure. From the backpressure information received, the preceding macro pipeline stage knows whether the macro pipeline after it is stalled. Combining this with its own condition, it determines whether it is itself stalled and passes new backpressure information to the macro pipeline stage before it, thereby controlling the macro pipeline.
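A minimal sketch of the backpressure chain in [0044]. It assumes the simplest possible combining rule, namely that a stage stalls if it has a local stall reason or if the stage after it is stalled; the patent leaves the exact rule open, and the function name is hypothetical.

```python
from typing import List

def propagate_backpressure(local_stall: List[bool]) -> List[bool]:
    """Return the stall state of every macro pipeline stage.

    local_stall[i] is stage i's own stall condition; stages are listed from
    first to last. Backpressure travels from the last stage toward the first.
    Assumed rule: stalled[i] = local_stall[i] or stalled[i + 1].
    """
    stalled = [False] * len(local_stall)
    downstream_stalled = False  # nothing follows the last stage
    for i in range(len(local_stall) - 1, -1, -1):
        stalled[i] = local_stall[i] or downstream_stalled
        downstream_stalled = stalled[i]  # becomes the backpressure sent upstream
    return stalled

states = propagate_backpressure([False, False, True, False])
```

In this example the stall at the third stage holds back the two stages before it, while the last stage is free to drain.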

[0045] 本发明所述的基于串行多发射和流水线层次结构的可配置多核/众核装置中,可以有扩展的共享存储器,用于在处理器核相应可配置数据存储器溢出的情况下存储数据、传递复数个处理器核间的共享数据;还可以有扩展的异常处理(except1n handling)模块,用于处理处理器核、本地存储器发生的异常(exert1n)。 Under [0045] The configurable multi / manycore apparatus based on multi-transmit and serial pipelined hierarchical structure according to the present invention, there may be extended to a shared memory, for the corresponding configurable data out of memory in the storage processor core data, transfer data between a plurality of shared processor core; may also be extended exception handling (except1n handling) means for exception handling processor core, local memory occurs (exert1n).

[0046] 当所述多核/众核装置有共享存储器且向可配置数据存储器存储数据时发生溢出,则产生异常,并将被存储数据存储到共享存储器中,此时,所述数据子存储器中每项(entry)包含的标志信息包括但不限于有效位、数据地址和数据标签(tag)。 [0046] When the multi / many-core means has a shared memory and a data memory for storing data to be arranged an overflow occurs, an exception, and is stored in the data storage to the shared memory. In this case, the sub data memory each flag information (entry) include, but are not limited comprising valid bits of data and the address data tag (tag). 所述有效位用于指示相应项中存储的数据是否有效。 Whether the valid bit for the respective data item indicates a valid stored. 所述数据地址和数据标签(tag)共同用于指示相应项中存储的数据在所述多核/众核装置的全部地址空间应处于的位置。 The data address and tag data (tag) indicating the items together for respective data stored in the entire address space of the multi / many-core device should be in position.

[0047] All exception information generated by the processor cores is transmitted to the exception handling module, which handles it accordingly. The exception handling module may be formed from a processor core of the multi-core/many-core device, or may be an additional module. The exception information includes, but is not limited to, the number of the processor in which the exception occurred and the exception type. The handling applied to the processor core and/or local memory in which the exception occurred includes, but is not limited to, propagating backpressure signals so that each processor core in the serial connection structure is informed whether the pipeline is stalled.

[0048] In the configurable multi-core/many-core device based on the serial multi-issue and pipeline hierarchy according to the present invention, the processor cores, the configurable local memories and the configurable interconnect structure can be configured according to the requirements of the application. The configuration includes, but is not limited to, switching processor cores on or off, configuring the size/boundary and the contents of the instruction memory and data sub-memory within a local memory, and configuring the interconnect structure and the connection relationships.

[0049] The sources of the configuration information include, but are not limited to, the inside and the outside of the configurable multi-core/many-core device. The configuration can be adjusted at any time during operation according to the requirements of the application. The configuration methods include, but are not limited to, direct configuration by a processor core or a central processor core, configuration by a processor core or a central processor core through a direct memory access (DMA) controller, and configuration by an external request through a DMA controller.

[0050] The configurable multi-core/many-core device based on the serial multi-issue and pipeline hierarchy according to the present invention has low-power techniques at three levels: the configuration level, the instruction level and the application level.

[0051] At the configuration level, processor cores that are not used according to the configuration information can enter a low-power state. The low-power state includes, but is not limited to, reducing the processor clock frequency or cutting off the power supply.

[0052] At the instruction level, when a processor core reaches an instruction that reads data and the data is not yet ready, the processor core enters a low-power state until the data is ready, at which point it returns from the low-power state to normal operation. The data not being ready includes, but is not limited to, the case where the preceding processor core has not yet written the data needed by the current processor core into the corresponding data sub-memory. The low-power state includes, but is not limited to, reducing the processor clock frequency or cutting off the power supply.

[0053] At the application level, a fully hardware implementation matches the characteristics of the idle task to determine the current utilization of the processor core, and decides from the current utilization and a reference utilization whether to enter the low-power state or to return from it. The reference utilization may be fixed, reconfigurable, or determined by self-learning; it may be hard-wired into the chip, written by the device at start-up, or written by software. The reference content used for matching may be hard-wired into the chip at manufacture, written by the device or by software at start-up, or written by self-learning. Its storage medium includes, but is not limited to, volatile memory and non-volatile memory; its write mode includes, but is not limited to, write-once and write-many. The low-power state includes, but is not limited to, reducing the processor clock frequency or cutting off the power supply.

[0054] The configurable multi-core/many-core device based on the serial multi-issue and pipeline hierarchy according to the present invention may have a self-test capability, so that the chip can be tested while powered on without relying on external equipment.

[0055] When the multi-core/many-core device has the self-test capability, one or more specific basic elements, arithmetic units or processor cores in the device may be used as comparators. Stimuli having a specific relationship are applied to corresponding groups of other basic elements, arithmetic units or processor cores, or combinations thereof, and the comparators check whether the outputs of those groups conform to the corresponding specific relationship. The stimuli may come from a specific module within the device or from outside the device. The specific relationship includes, but is not limited to, equal, opposite, mutually inverse and complementary. The test results may be sent outside the device or stored in a memory within the device.
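As a rough software analogy of this comparator scheme, the sketch below applies paired stimuli to two units and checks that their outputs satisfy a given relation. The function name `outputs_conform`, the modelling of units as Python callables and the example fault are all assumptions made for illustration.

```python
def outputs_conform(unit_a, unit_b, stimulus_pairs, relation):
    """Check two units under test against each other (hypothetical sketch).

    stimulus_pairs holds (stimulus_for_a, stimulus_for_b) tuples whose
    members stand in a known relationship; relation(out_a, out_b) plays
    the role of the comparator and returns True when the outputs agree
    with the relationship expected for those stimuli.
    """
    return all(relation(unit_a(sa), unit_b(sb)) for sa, sb in stimulus_pairs)


def good_adder(operands):
    return operands[0] + operands[1]


def faulty_adder(operands):
    # Models an output whose lowest bit is stuck at 0.
    return (operands[0] + operands[1]) & ~1


def equal(x, y):
    return x == y


# Equal stimuli on both units, so equal outputs are expected.
stimuli = [((1, 2), (1, 2)), ((2, 2), (2, 2))]
```

With these definitions, comparing `good_adder` against itself passes, while comparing it against `faulty_adder` fails on the first stimulus. A single pairwise mismatch does not by itself identify which unit is defective.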

[0056] The self-test may be performed at wafer test, at post-packaging integrated-circuit test, or at device start-up when the chip is in use. Self-test conditions and periods may also be set manually so that the self-test runs periodically during operation. The memory used for the self-test includes, but is not limited to, volatile memory and non-volatile memory.

[0057] When the multi-core/many-core device has the self-test capability, it may also have a self-repair capability. When the test results are stored in a memory within the device, failed processor cores can be marked. When the device is configured, the failed processor cores can be bypassed according to these marks, so that the device still operates normally and self-repair is achieved. The self-repair may be performed after wafer test, after post-packaging integrated-circuit test, or after the test run at device start-up when the chip is in use. Self-test and self-repair conditions and periods may also be set manually so that self-repair is performed after periodic self-tests during operation.
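The bypass step can be pictured with the following sketch, which rebuilds the serial chain from the cores that are not marked as failed. The function name `configure_chain` and the representation of the chain as a list of (source, destination) links are illustrative assumptions.

```python
def configure_chain(core_ids, failed):
    """Return the serial links between working cores, skipping failed ones.

    core_ids lists the cores in their physical serial order; failed is the
    set of cores marked as defective by the self-test. This is a sketch of
    the idea only, not the configuration format of the device.
    """
    active = [core for core in core_ids if core not in failed]
    # Link each working core to the next working core in serial order.
    return list(zip(active, active[1:]))
```

For four cores with core 2 marked as failed, the resulting links are 0 to 1 and 1 to 3.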

[0058] The plurality of processor cores in the configurable multi-core/many-core device of the present invention may be homogeneous or heterogeneous.

[0059] The length of the instruction words in the local instruction memories of the configurable multi-core/many-core device of the present invention need not be fixed.

[0060] In the configurable multi-core/many-core device of the present invention, the local instruction memory and the local data memory may each have one or more sets of read ports.

[0061] In the configurable multi-core/many-core device of the present invention, each processor core may also correspond to a plurality of local instruction memories, which may be of the same or different sizes and of the same or different structures. While one or more of these local instruction memories serve the instruction fetches of the corresponding processor core, the others may undergo instruction updates. The means of updating instructions include, but are not limited to, updating through a direct memory access controller.

[0062] The plurality of processor cores in the configurable multi-core/many-core device of the present invention may operate at the same clock frequency or at different clock frequencies.

[0063] The configurable multi-core/many-core device of the present invention may have a load-induced store (LIS) property. The first time a processor core reads the data at a given address, it reads from the local data memory of the adjacent preceding processor core and at the same time writes the data it has read into its own local data memory. All subsequent reads and writes to that address access its own local data memory. Data at the same address is thereby passed between adjacent local data memories without additional overhead.
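The load-induced store behaviour can be modelled in a few lines. In the sketch below, the class name `LisCore` and the use of dictionaries as local data memories are assumptions. The copy on first read is performed as an ordinary assignment, whereas the text describes it as happening in hardware alongside the read.

```python
class LisCore:
    """Software model of load-induced store between adjacent cores (sketch)."""

    def __init__(self, previous_memory):
        self.previous_memory = previous_memory  # local data memory of the preceding core
        self.local_memory = {}                  # local data memory of this core

    def load(self, address):
        # First read of this address: fetch from the preceding core's memory
        # and keep a copy locally, so later accesses stay local.
        if address not in self.local_memory:
            self.local_memory[address] = self.previous_memory[address]
        return self.local_memory[address]

    def store(self, address, value):
        # Writes always go to this core's own local data memory.
        self.local_memory[address] = value
```

After the first `load`, later changes made by the preceding core at the same address are no longer observed by this core, which matches the rule that subsequent accesses use the local copy.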

[0064] The configurable multi-core/many-core device of the present invention may have a data pre-forwarding property. A processor core may read, from the local data memory of the preceding processor core, data that it does not itself need to read or write but that a subsequent processor core needs to read, and write it into its own local data memory. Data at the same address is thereby passed stage by stage through the local data memories.

[0065] The local data memory of the present invention may further contain one or more valid flags and one or more ownership flags. A valid flag indicates whether the corresponding data is valid. An ownership flag indicates which processor core is currently using the corresponding data. Using valid flags and ownership flags avoids ping-pong buffering and improves memory utilization. In addition, multiple processor cores can access the same data memory at the same time, which facilitates data exchange.

[0066] Transferring register values through the configurable interconnect structure according to the present invention includes, but is not limited to, using a large number of hard wires to transfer all register values of a processor core to the registers of the following processor core at once, and using a shift-register approach to shift the register values of a processor core sequentially into the registers of the following processor core.

[0067] 所述寄存器值的传输途径还可以是根据寄存器读写记录表决定需要传输的寄存器。 The [0067] transmission routes register values ​​may also be determined according to the register read record table register needs to be transmitted. 本发明所述的寄存器读写记录表用于记录寄存器对相应本地数据存储器的读写情况。 Record table register read register for reading and writing the recording data on the respective local memory of the present invention. 如果寄存器的值已经被写入本级处理器核对应的本地数据存储器且之后该寄存器的值没有发生改变,则可以仅由后级处理器核从本级处理器核对应的本地数据存储器中相应地址读取数据,从而完成所述寄存器的传递,不需要单独传输该寄存器值到后级处理器。 If the value of the register of the present stage have been written processor core corresponding to the local data store and thereafter the value of the register is not changed, it is possible to check from this stage only by the AV processor core corresponding to a respective local data store address reading data, thereby completing the transfer of the register, the register value need not individually transmitted to the rear-stage processor.

[0068] For example, when the value of a register is written to the corresponding local data memory, the corresponding entry in the register read/write record table is cleared to "0"; when data is written to the register, the corresponding entry is set to "1". When register values are transferred, only the values of registers whose entries are "1" are transferred. Writing data to a register in the register file includes, but is not limited to, loading data from the corresponding local data memory into the register and writing the result of an instruction back to the register.
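The record table in this example behaves like a per-register dirty bit. The sketch below models it under that reading. The class `RegisterRecordTable` and its method names are hypothetical, and the local data memory is again modelled as a dictionary.

```python
class RegisterRecordTable:
    """Tracks which registers differ from their copy in local data memory."""

    def __init__(self, register_count):
        self.registers = [0] * register_count
        self.record = [0] * register_count   # 1 = must be transferred, 0 = already in memory

    def write_register(self, index, value):
        # Covers both loads into the register and instruction results written back.
        self.registers[index] = value
        self.record[index] = 1

    def store_register(self, index, memory, address):
        # The register value now also lives in local data memory.
        memory[address] = self.registers[index]
        self.record[index] = 0

    def registers_to_transfer(self):
        return [i for i, flag in enumerate(self.record) if flag == 1]
```

If registers 0, 1 and 2 are written and register 1 is then stored to memory, only registers 0 and 2 remain marked for transfer.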

[0069] When the number of processor cores in the configurable multi-core/many-core device based on the serial multi-issue and pipeline hierarchy according to the present invention is fixed, the head extension and the tail extension of each code segment can further be optimized according to the code segments obtained by partitioning, reducing the number of registers that must be passed.

[0070] For example, in the general case the tail extension of a code segment contains instructions that store all register values to specific local data memory addresses, and the head extension of a code segment contains instructions that load the values at those addresses into registers. Together they pass register values smoothly from one segment to the next. Once the code segments are determined, the number of store and/or load instructions in the head and tail extensions can be reduced according to the instructions in the code segments.

[0071] If, in the code segment of the current processor core, a register is written before its value is ever used, then the instruction that stores that register in the tail extension of the preceding processor core's code segment, and the instruction that loads that register from the local data memory in the head extension of the current processor core's code segment, can both be omitted.

[0072] If, in the code segment of the preceding processor core, the value of a register has not changed since it was stored to the local data memory, then the instruction that stores that register in the tail extension of the preceding processor core's code segment can be omitted, and an instruction is added to the head extension of the current processor core's code segment to load the data from the corresponding address in the local data memory into that register.
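The first of these two optimizations, omitting a register whose first access in the current segment is a write, can be sketched as a single pass over the segment. The instruction representation used here, a `(destination, sources)` tuple per instruction, and the function names are assumptions made for illustration; the second optimization is not modelled.

```python
def registers_written_before_use(segment):
    """Return registers whose first access in the segment is a write.

    segment is a list of (destination, sources) tuples, where destination
    is a register name or None and sources is a list of register names.
    """
    seen = set()
    skippable = set()
    for destination, sources in segment:
        # Sources are read before the destination is written.
        seen.update(sources)
        if destination is not None:
            if destination not in seen:
                skippable.add(destination)
            seen.add(destination)
    return skippable


def registers_to_pass(all_registers, segment):
    # Registers that still need a store/load pair between the two segments.
    return sorted(set(all_registers) - registers_written_before_use(segment))
```

For a segment that computes `r1` from `r2` and `r3`, then overwrites `r2`, then sets `r4` from a constant, registers `r1` and `r4` need no store/load pair.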

[0073] In the data processing method and device of the present invention, when the code segments of a plurality of processor cores all branch to the same address during execution to run a piece of code, and branch back to their own code segments once that code has finished, the code at that address may be stored redundantly in the local instruction memory of each of those processor cores. The code at the same address includes, but is not limited to, function calls and loops.

[0074] In the data processing method and device of the present invention, a processor core may access the local instruction memories of other processor cores. When a plurality of processor cores execute exactly the same code and the code is longer than the local instruction memory of a single processor core, the code may be stored section by section across the local instruction memories of the plurality of processor cores. At run time, any one of these processor cores first fetches and executes instructions from the local instruction memory holding the first section of the code; when the first section has finished, it fetches and executes instructions from the local instruction memory holding the second section, and so on until all of the code has been executed.
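A simple way to picture this arrangement is to split the code into sections of one memory's capacity and translate a program counter into a memory index and an offset. The sketch below assumes that all local instruction memories have the same capacity and that instructions have a fixed length, neither of which the text requires; the names `split_code` and `fetch` are hypothetical.

```python
def split_code(code, memory_size):
    """Place consecutive sections of the code into successive instruction memories."""
    return [code[i:i + memory_size] for i in range(0, len(code), memory_size)]


def fetch(memories, program_counter, memory_size):
    """Fetch one instruction, selecting the memory that holds its section."""
    memory_index, offset = divmod(program_counter, memory_size)
    return memories[memory_index][offset]
```

Fetching every program counter value in order reproduces the original instruction sequence, with the fetch moving from one core's instruction memory to the next at each section boundary.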

[0075] In the data processing method and device of the present invention, the plurality of processor cores may execute the sections of the identical code synchronously or asynchronously, in parallel or serially, or in a mixed serial-parallel manner.

[0076] In the data processing method and device of the present invention, a processor core may also correspond to a plurality of local instruction memories, which may be of the same or different sizes and of the same or different structures. While one or more of these local instruction memories serve the instruction fetches of the corresponding processor core, the others may undergo instruction updates. Instructions may be updated through a direct memory access controller.

[0077] 传统片上系统(SoC, System on Chip)中除处理器外,其他功能模块都是用硬连线逻辑实现的专用集成电路模块。 In addition to the processor, other functional modules are implemented by hardwired logic [0077] system (SoC, System on Chip) ASIC conventionally module sheet. 这些功能模块的性能要求很高,采用传统的处理器难以达到性能要求,因此无法以传统处理器替代这些专用集成电路模块。 Performance of these functional modules demanding, difficult to achieve with traditional processor performance requirements, and therefore not be able to replace these conventional processor module application specific integrated circuit.

[0078] In the data processing method and device of the present invention, one or more processor cores and their corresponding local memories can form a high-performance multi-core connection structure. By configuring the multi-core connection structure and loading the corresponding code segments into the local instruction memories, the structure implements a specific function and can replace an ASIC module in a system on chip. The multi-core connection structure is equivalent to a functional module in a system on chip, such as an image decompression module or an encryption/decryption module. These functional modules are then connected by a system bus to realize the system on chip.

[0079] The data transfer channel between a processor core with its corresponding local memory and an adjacent processor core with its corresponding local memory is a local interconnection. A single processor core with its local memory, or a plurality of processor cores with their local memories joined by local interconnections, forms a multi-core connection structure that corresponds to a functional module of a system on chip.

[0080] The data transfer channel between one multi-core connection structure corresponding to a functional module of a system on chip and another such multi-core connection structure is a system bus. Connecting a plurality of such multi-core connection structures through the system bus realizes a system on chip in the usual sense.

[0081] A system on chip implemented according to the technical solution of the present invention has a configurability that conventional systems on chip lack. Configuring the data processing device of the present invention in different ways yields different systems on chip. The configuration can be performed in real time during operation, so the function of the system on chip can be changed while it is running. The processor cores and their corresponding local memories can be dynamically reconfigured, and the code segments in the corresponding local instruction memories dynamically changed, thereby changing the function of the system on chip.

[0082] According to the technical solution of the present invention, the paths used for data transfer between the processor cores and local memories inside a multi-core connection structure that corresponds to a functional module of a system on chip are local interconnections internal to that functional module. Transferring data over such a local interconnection usually occupies operations of the processor core that issues the transfer request. The system bus of the present invention may be such a local interconnection, or may be a data transfer channel that completes data transfers between different processor cores and their local memories without occupying processor core operations. The different processor cores and their local memories may or may not be adjacent.

[0083] In the data processing method and device of the present invention, one way to build the system bus is to establish data transfer channels using a plurality of connection devices at fixed positions. The inputs and outputs of every multi-core connection structure are connected to nearby connection devices by one or more hard wires. The connection devices are likewise connected to one another by one or more hard wires. The connection devices, the wires between the multi-core connection structures and the connection devices, and the wires among the connection devices together form the system bus.

[0084] In the data processing method and device of the present invention, another way to build the system bus is to establish data transfer channels such that any processor core with its local data memory can exchange data with any other processor core with its local data memory. The means of data transfer include, but are not limited to, transfer through a shared memory, transfer through a direct memory access controller, and transfer through a dedicated bus or network.

[0085] For example, in a first method, one or more hard wires, which may be configurable, are laid out in advance between pairs among some of the processor cores and their local data memories. When any two of these processor cores and their local data memories are in different multi-core connection structures, i.e. in different functional modules, the hard wires between them can serve as the system bus between the two multi-core connection structures.

[0086] In a second method, all or some of the processor cores and their local data memories can access other processor cores and their local data memories through a direct memory access controller. When any two of these processor cores and their local data memories are in different multi-core connection structures, i.e. in different functional modules, data can be transferred between them as needed at run time, implementing the system bus between the two multi-core connection structures.

[0087] In a third method, a network-on-chip (NoC) function is implemented on all or some of the processor cores and their local data memories: when data from one processor core and its local data memory is sent to another, a configurable interconnection network determines where the data goes, thereby forming a data path and carrying out the transfer. When any two of these processor cores and their local data memories are in different multi-core connection structures, i.e. in different functional modules, data can be transferred between them as needed at run time, implementing the system bus between the two multi-core connection structures.

[0088] Of these three methods, the first implements the system bus with hard wires, and its connections are static; the second uses direct memory access and the third uses a network on chip, and their connections are dynamic.

[0089] In the data processing method and device of the present invention, a processor core may have a fast condition evaluation mechanism for determining whether a branch is taken. The mechanism may be a counter for evaluating loop conditions, or a hardware finite state machine for evaluating branch and loop conditions.

[0090] For the configuration-level low-power technique of the present invention, specific processor cores may also be put into a low-power state according to the configuration information. The specific processor cores include, but are not limited to, processor cores that are not used and processor cores with relatively low workloads. The low-power state includes, but is not limited to, reducing the processor clock frequency or cutting off the power supply.

[0091] The data processing method and device of the present invention may further include one or more dedicated processing modules. A dedicated processing module can be called as a macro block by a processor core and its local memory, or can act as an independent processing module that receives the output of a processor core and its local memory and sends the processing result to that processor core and its local memory or to other processor cores and their local memories. The processor core that outputs to the dedicated processing module and the processor core that receives the module's output may be the same or different. The dedicated processing modules include, but are not limited to, fast Fourier transform (FFT) modules, entropy encoding modules, entropy decoding modules, matrix multiplication modules, convolutional encoding modules, Viterbi code decoding modules and Turbo code decoding modules.

[0092] Take the matrix multiplication module as an example. Performing a large-scale matrix multiplication on a single processor core takes a large number of clock cycles, which limits data throughput. Implementing it on multiple processor cores reduces the number of execution cycles, but increases the volume of data passed between cores and occupies a large amount of processor resources. A dedicated matrix multiplication module can complete a large-scale matrix multiplication within a few cycles. When the program is partitioned, the operations before the large-scale matrix multiplication can be assigned to several processor cores (the front group), and the operations after it to several other processor cores (the rear group). Data in the front group's output that takes part in the matrix multiplication is sent to the dedicated matrix multiplication module, and the result is then sent to the rear group. Data in the front group's output that does not take part in the matrix multiplication is sent directly to the rear group.

[0093] Beneficial effects:

[0094] First, the data processing method and device of the present invention can split serial program code into code segments suited to the individual processor cores of a serially connected multi-processor-core structure. For different numbers of processor cores, different splitting rules yield code segments of different sizes and counts, which suits scalable multi-core/many-core devices and systems.

[0095] Second, according to the data processing method and device of the present invention, the code segments are assigned to the individual processor cores of the serially connected multi-processor-core structure. Each processor core executes specific instructions, and all the processor cores connected in series together implement the full function of the program. Data shared between code segments split from the complete program code is passed over dedicated transfer paths, so there are almost no data dependency problems and true multi-issue is achieved. In the serially connected multi-processor-core structure, the issue width equals the number of processor cores, which greatly improves the utilization of the arithmetic units and thereby gives the serially connected structure, and the device/system as a whole, high throughput.

[0096] Third, local memory replaces the cache normally found in a processor. The local memory of each processor core holds all the instructions and data that the core will use, giving a 100% hit rate. This removes the speed bottleneck of accessing slow external memory on a cache miss and further improves the overall performance of the device/system.

[0097] Fourth, the multi-core/many-core device of the present invention has three levels of low-power techniques. It can perform coarse-grained power management, for example by cutting power to unused processor cores. It can perform fine-grained, instruction-level power management driven by data. It can also adjust processor core clock frequencies automatically and in real time in hardware. Provided that the processor cores keep working normally, this effectively reduces their dynamic power consumption, lets each core adjust its clock frequency on demand, and minimizes manual intervention. Because the adjustment is implemented in hardware, it is fast and makes real-time adjustment of the processor clock frequency more effective.

[0098] Finally, with the technical solution of the present invention, a system on chip can be realized through programming and configuration alone, which shortens the development cycle from design to market. Moreover, the same hardware product can be given different functions simply by reprogramming and reconfiguring it.

Brief Description of the Drawings

[0099] Although the invention can be extended through modifications and substitutions in many forms, the specification presents some specific embodiments with figures and describes them in detail. It should be understood that the inventors do not intend to limit the invention to the specific embodiments described. On the contrary, the inventors intend to protect all improvements, equivalent transformations, and modifications made within the spirit or scope defined by the claims.

[0100] FIG. 1 is a flow embodiment that illustrates the present invention, taking the splitting and allocation of high-level language programs and assembly language programs as an example.

[0101] FIG. 2 is an embodiment of the handling of program loops in the post-compilation method of the present invention.

[0102] FIG. 3 is a schematic diagram of the configurable multi-core/many-core device of the present invention, based on serial multi-issue and a pipeline hierarchy.

[0103] FIG. 4 is an embodiment of the address mapping scheme.

[0104] FIG. 5 is an embodiment of data transfer between cores.

[0105] FIG. 6 is an embodiment of back pressure, exception handling, and the connection between the data memories and the shared memory.

[0106] FIG. 7 is an embodiment of the self-test and self-repair method and structure of the present invention.

[0107] FIG. 8(a) is one embodiment of register value transfer between adjacent processor cores.

[0108] FIG. 8(b) is a second embodiment of register value transfer between adjacent processor cores.

[0109] FIG. 9 is a third embodiment of register value transfer between adjacent processor cores.

[0110] FIG. 10(a) is one embodiment of the composition of a processor core and its corresponding local memory according to the present invention.

[0111] FIG. 10(b) is another embodiment of the composition of a processor core and its corresponding local memory according to the present invention.

[0112] FIG. 10(c) is an embodiment of the valid flag bits and ownership flag bits in a processor core and its corresponding local memory according to the present invention.

[0113] FIG. 11(a) is a typical structure of an existing system on chip.

[0114] FIG. 11(b) is one embodiment of a system on chip implemented with the technical solution of the present invention.

[0115] FIG. 11(c) is another embodiment of a system on chip implemented with the technical solution of the present invention.

[0116] FIG. 12(a) is an embodiment of pre-compilation in the technical solution of the present invention.

[0117] FIG. 12(b) is an embodiment of post-compilation in the technical solution of the present invention.

[0118] FIG. 13(a) is another schematic diagram of the configurable multi-core/many-core device of the present invention, based on serial multi-issue and a pipeline hierarchy.

[0119] FIG. 13(b) is a schematic diagram of a multi-core serial structure formed by configuring the configurable multi-core/many-core device of the present invention.

[0120] FIG. 13(c) is a schematic diagram of a multi-core serial-parallel hybrid structure formed by configuring the configurable multi-core/many-core device of the present invention.

[0121] FIG. 13(d) is a schematic diagram of multiple multi-core structures formed by configuring the configurable multi-core/many-core device of the present invention.

Detailed Description

[0122] FIG. 1 is a flow embodiment that illustrates the present invention, taking the splitting and allocation of high-level language programs and assembly language programs as an example. First, the pre-compilation step (103) expands the calls in the high-level language program (101) and/or the assembly language program (102), yielding call-expanded high-level language code and/or assembly language code. The call-expanded code is then compiled by the compiler (104) into assembly code that follows the program execution order, after which post-compilation (107) is performed. If the program contains only assembly language code that already follows the program execution order, compilation (104) can be skipped and post-compilation (107) performed directly. In this embodiment, post-compilation (107) takes the structure information of the multi-core device (106) as its basis, runs the assembly code on a behavioral model of the processor core (108), and splits it, producing the configuration information (110) and, at the same time, the corresponding configuration boot program (109). Finally, one processor core (111) in the device configures the corresponding plurality of processor cores (113), either directly or through the DMA controller (112).

[0123] In FIG. 2, the instruction splitter first reads in a front-end code stream segment in step 1 (201), then reads in the information associated with the front-end code stream in step 2 (202). Step 3 (203) determines whether the code stream segment is a loop. If it is not, step 9 (209) processes the segment in the regular way. If it is, step 4 (204) reads in the loop count M, and step 5 (205) reads in the number of iterations N that the current program segment can hold. Step 6 (206) determines whether M is greater than N. If it is, step 7 (207) splits the loop into one small loop of N iterations and one small loop of M-N iterations, step 8 (208) reassigns M-N to M, and the process moves on to the next program segment. This repeats until the remaining loop count no longer exceeds the number of iterations the segment can hold. The method effectively handles the case where the loop count is greater than what a program segment can hold.
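The loop handling of steps 4 to 8 can be sketched as follows. This is only an illustrative model, not the splitter of the embodiment: the function name `split_loop` and the idea of passing the per-segment capacities N as a list are assumptions made for the example.

```python
def split_loop(loop_count, segment_capacities):
    """Split a loop of `loop_count` iterations (M) across program segments.

    `segment_capacities` lists, in order, how many iterations (N) each
    successive program segment can hold. Both names are hypothetical.
    Returns the number of iterations placed in each segment used.
    """
    chunks = []
    remaining = loop_count                    # M
    for capacity in segment_capacities:       # N for the current segment
        if capacity <= 0:
            raise ValueError("segment capacity must be positive")
        if remaining <= capacity:             # step 6: M is not greater than N
            chunks.append(remaining)
            return chunks
        chunks.append(capacity)               # step 7: small loop of N iterations
        remaining -= capacity                 # step 8: M <- M - N
    raise ValueError("not enough program segments for this loop")


print(split_loop(10, [4, 4, 4]))
```

With a loop count of 10 and three segments that each hold 4 iterations, the loop ends up as small loops of 4, 4, and 2 iterations.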

[0124] FIG. 3 is a schematic diagram of the configurable multi-core/many-core device of the present invention, based on serial multi-issue and a pipeline hierarchy. In this embodiment, the device consists of a number of processor cores (301), configurable local memories (302), and a configurable interconnect structure (303). Each processor core (301) corresponds to the configurable local memory (302) below it, and the two together form one stage of the macro pipeline. By configuring the configurable interconnect structure (303), multiple processor cores (301) and their configurable local memories (302) can be connected into a serial structure. Multiple serial structures can be independent of one another, or partly or fully interconnected, and can run programs serially, in parallel, or in a serial-parallel mix.

[0125] FIG. 4 is an embodiment of the address mapping scheme. FIG. 4(a) implements address lookup with a lookup table. Taking a 16-bit address as an example, the 64K address space is divided into multiple small memories (403), each covering 1K of address space. They are written sequentially: once one memory is full, writing moves on to another block. After each write, the in-block address pointer (404) automatically advances to the next available entry whose valid bit is 0, and the valid bit of an entry is set to 1 when the entry is written. Whenever data is written into an entry, the entry's address is also written into the lookup table (402). Take writing the value for address BFC0 as an example. The address pointer (404) points to entry 2 of the memory (403); when the data is written into entry 2, the value 2 is written at address BFC0 of the lookup table (402), which establishes the address mapping. To read data, the address is used to find the corresponding entry through the lookup table (402), and the stored data is read out. FIG. 4(b) implements address lookup with a CAM array. Again taking a 16-bit address as an example, the 64K address space is divided into multiple small memories (403), each covering 1K of address space, which are written sequentially: once one memory is full, writing moves on to another block. After each write, the in-block address pointer (406) automatically advances to the next available entry whose valid bit is 0, and the valid bit of an entry is set to 1 when the entry is written. Whenever data is written into an entry, its instruction address is also written into the next entry of the CAM array (405). Take writing the value for address BFC0 as an example. The address pointer (406) points to entry 2 of the memory (403); when the data is written into entry 2, the instruction address BFC0 is written into the next entry of the CAM array (405), which establishes the address mapping. To read data, the input instruction address is compared against all the instruction addresses stored in the CAM array to find the corresponding entry, and the stored data is read out.
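A minimal software model of the lookup-table variant of FIG. 4(a) may help to follow the BFC0 example. It is a sketch under stated assumptions: the class name `MappedMemory`, the use of a Python dict for the lookup table (402), and the behavior on rewriting an already mapped address are all illustrative choices, not details given in the embodiment.

```python
class MappedMemory:
    """Toy model of one small memory (403) plus its lookup table (402)."""

    def __init__(self, entries=1024):
        self.data = [None] * entries
        self.valid = [0] * entries     # valid bit per entry
        self.pointer = 0               # in-block address pointer (404)
        self.lookup = {}               # lookup table (402): address -> entry

    def write(self, address, value):
        # Assumption: a second write to a mapped address reuses its entry.
        if address in self.lookup:
            self.data[self.lookup[address]] = value
            return self.lookup[address]
        if self.pointer >= len(self.data):
            raise MemoryError("block full; a real device moves to another block")
        entry = self.pointer
        self.data[entry] = value
        self.valid[entry] = 1              # mark the entry as used
        self.lookup[address] = entry       # record the address mapping
        # Advance the pointer to the next entry whose valid bit is 0.
        while self.pointer < len(self.data) and self.valid[self.pointer]:
            self.pointer += 1
        return entry

    def read(self, address):
        return self.data[self.lookup[address]]


mem = MappedMemory(entries=4)
mem.write(0x0010, "a")
mem.write(0x0020, "b")
entry = mem.write(0xBFC0, "c")   # pointer is at entry 2, as in the example
print(entry, mem.read(0xBFC0))
```

The CAM variant of FIG. 4(b) differs only in how the entry is found on a read: instead of indexing a table by address, the stored addresses are all compared against the input address.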

[0126] FIG. 5 is an embodiment of data transfer between cores. All data memories sit between processor cores, and each is logically divided into an upper part and a lower part. The upper part is read and written by the processor core above the data memory; the lower part is only read, supplying data to the processor core below the data memory. While the processor cores run their programs, data is relayed downward from the data memory above. The three-to-one selectors (502, 509) can select data arriving from a distant source (506) and feed it into the data memories (503, 504). When the processor cores (510, 511) are not executing a store instruction, the lower parts of the data memories (501, 503) are written through the three-to-one selectors (502, 509) into the upper parts of the corresponding next data memories (503, 504), and the valid bit V of each written row is set to 1. When a store instruction is executed, the register file writes its value only to the data memory below. When a load instruction needs to fetch the data at a given address, the two-to-one selectors (505, 507) use the valid bit V of the data memories (503, 504) to decide whether to fetch from the corresponding upper data memory (501, 503) or from the lower data memory (503, 504). If the valid bit V of an entry in the data memories (503, 504) is 1, meaning that the data has already been written in from the upper data memory (501, 503), then, provided the distant data (506) is not selected, the three-to-one selectors (502, 509) select the register file output of the processor cores (510, 511) as their input, which ensures that the stored data is the latest value after processing by the processor cores (510, 511). While the upper part of the data memory (503) is being written with new data, the lower part of the data memory (503) transfers data to the upper part of the data memory (504). During a transfer, a pointer marks the entry currently being transferred; when the pointer reaches the last entry, the transfer is about to complete. By the time a program segment finishes running, its data should have been transferred to the next memory. While the next program segment runs, the upper part of the data memory (501) transfers data to the lower part of the data memory (503), the upper part of the data memory (503) transfers data to the lower part of the data memory (504), and the upper part of the data memory (504) transfers data downward, forming a ping-pong transfer structure. Every data memory sets aside a portion, sized to the instruction space required, for storing instructions; in other words, the data memory and the instruction memory are not physically separate.

[0127] FIG. 6 is an embodiment of back pressure, exception handling, and the connection between the data memories and the shared memory. In this embodiment, the DMA controller (616) writes the corresponding code segments (615) into the instruction memories (601, 609, 610, 611). The processor cores (602, 604, 606, 608) run the code in their instruction memories (601, 609, 610, 611) and read and write their data memories (603, 605, 607, 612). Take the processor core (604), the data memory (605), and the next-stage processor core (606) as an example. Both the previous-stage and the next-stage processor cores (604, 606) access the data memory (605). The data sub-memories in the data memory (605) can be ping-pong swapped only after the previous-stage processor core (604) has finished writing the data memory (605) and the next-stage processor core (606) has finished reading it. The back-pressure signal (614) is used by the next-stage processor core (606) to tell the data memory (605) whether the read has completed. The back-pressure signal (613) is used by the data memory (605) to tell the previous-stage processor core (604) whether there is an overflow, and to pass on the back-pressure signal received from the next-stage processor core (606). Based on its own running state and the back-pressure signal received from the data memory (605), the previous-stage processor core (604) determines whether the macro pipeline is blocked, decides whether to ping-pong swap the data sub-memories in the data memory (605), and generates a back-pressure signal that is passed on to the stage before it. Passing back-pressure signals in this reverse direction, from processor core to data memory to processor core, controls the operation of the macro pipeline. All data memories (603, 605, 607, 612) are connected to the shared memory (618) through the connection (619). When the address that a data memory needs to write or read lies outside that memory, an address exception occurs and the address is looked up in the shared memory (618); once found, the data is written to that address or read from it. When the processor core (608) needs data held in the data memory (605), an exception also occurs, and the data memory (605) transfers the data to the processor core (608) through the shared memory (618). Exception information generated by the processor cores and data memories is sent to the exception handling module (617) over a dedicated channel (620). In this embodiment, when an arithmetic result overflows in a processor core, the exception handling module (617) directs the processor core to saturate the overflowed result. When a data memory overflows, the exception handling module (617) directs the data memory to access the shared memory and store the data there. During this process, the exception handling module (617) sends a signal to the processor core or data memory concerned so that it stalls, and it resumes once exception handling is complete. The other processor cores and data memories each determine whether to stall from the signals passed to them through back pressure.
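The swap condition for the data sub-memories can be modeled in a few lines. This sketch covers only the rule that a ping-pong swap requires both the upstream write and the downstream read to be finished; the class `PingPongMemory` and its flag names are hypothetical, and overflow, shared memory, and exception handling are deliberately left out.

```python
class PingPongMemory:
    """Toy model of a data memory (605) with two data sub-memories."""

    def __init__(self):
        self.write_half = []     # filled by the previous-stage core (604)
        self.read_half = []      # consumed by the next-stage core (606)
        self.write_done = False  # previous stage has finished writing
        self.read_done = True    # models signal (614): next stage finished reading

    def can_swap(self):
        # Swap only when the write AND the read have both completed.
        return self.write_done and self.read_done

    def swap(self):
        """Swap the halves if allowed; return False to model a stall."""
        if not self.can_swap():
            return False
        self.read_half, self.write_half = self.write_half, []
        self.write_done = False
        self.read_done = False
        return True


mem = PingPongMemory()
mem.write_half.append(42)
print(mem.swap())        # upstream has not finished writing: stall
mem.write_done = True
print(mem.swap())        # both conditions hold: halves are exchanged
```

A `False` return here plays the role of the back-pressure signal (613) that the previous-stage core would observe before deciding whether the macro pipeline is blocked.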

[0128] Refer to FIG. 7, which shows an embodiment of the self-test and self-repair method and structure. In this self-test and self-repair structure (701), the test vectors produced by the vector generator (702) are sent to all processor cores synchronously. The test vector distribution controller (703) controls how the processor cores are connected to the vector generator (702), and the result distribution controller (709) controls how the processor cores are connected to the comparators. Each processor core compares its results with those of other processor cores through the comparators. In this embodiment, each processor core can be compared with its neighboring processor cores; for example, the processor core (704) can be compared with the processor cores (705, 706, 707) through the comparison logic (708). Each comparison logic block may contain one or more comparators. If a comparison logic block has one comparator, each processor core is compared with its neighboring processor cores one after another. If it has several comparators, each processor core is compared with its neighboring processor cores at the same time. The test results are written directly from each comparison logic block into the test result table (710).
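The neighbor comparison can be illustrated with a small behavioral model. The sketch below assumes that each core can be represented as a Python function applied to the shared test vectors and that the result table simply records, for each compared pair, whether the results agree; the function `self_test` and these representations are assumptions for illustration, and the paragraph above does not describe how a faulty core is identified from the table.

```python
def self_test(cores, neighbours, vectors):
    """Compare each core's results with those of its neighbours.

    cores:      dict mapping a core id to a function modelling that core
    neighbours: dict mapping a core id to the ids it is compared against
    vectors:    test vectors sent to every core synchronously
    Returns a result table {(core, neighbour): True if results match}.
    """
    # Every core receives the same vectors from the vector generator.
    results = {cid: [core(v) for v in vectors] for cid, core in cores.items()}
    table = {}
    for cid, adjacent in neighbours.items():
        for other in adjacent:
            table[(cid, other)] = results[cid] == results[other]
    return table


def healthy(v):
    return v * v + 1

def faulty(v):
    return v * v          # models a core with a defect


cores = {704: healthy, 705: healthy, 706: faulty, 707: healthy}
table = self_test(cores, {704: [705, 706, 707]}, vectors=[1, 2, 3])
print(table)
```

In this run, core 704 agrees with cores 705 and 707 and disagrees with core 706.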

[0129] Refer to FIG. 8, which shows three embodiments of register value transfer between adjacent processor cores.

[0130] In the embodiment corresponding to FIG. 8(a), each processor core has a register file (801) containing 31 32-bit general-purpose registers. To transfer all general-purpose register values from the previous-stage processor core (802) to the current-stage processor core (803), 992 hard wires can connect the output of every bit of every general-purpose register in the previous-stage processor core (802), one to one and through multiplexers, to the input of the matching bit in the current-stage processor core (803). The values of all 31 32-bit general-purpose registers in the previous-stage processor core (802) can then be transferred to the current-stage processor core (803) within one cycle. FIG. 8(a) shows the hard-wired connection for one bit (804) of one general-purpose register; the remaining 991 bits are connected in the same way. The output (806) of the corresponding bit (805) in the previous-stage processor core (802) is connected by a hard wire (807), through the multiplexer (808), to the input of that bit (804) in the current-stage processor core (803). When the processor core performs arithmetic, logic, or similar operations, the multiplexer (808) selects the data from the current-stage processor core (809). When the processor core performs a load, the multiplexer selects the data from the current-stage processor core (809) if the data already exists in the local memory of the current-stage processor core; otherwise it selects the data transferred from the previous-stage processor core (810). When register values are being transferred, the multiplexer (808) selects the data transferred from the previous-stage processor core (810). With all 992 bits transferred at once, the whole register file is passed within one cycle.

[0131] In the embodiment corresponding to FIG. 8(b), the adjacent processor cores (820, 822) each have a register file (821, 823) containing a plurality of 32-bit general-purpose registers. To transfer register values from the previous-stage processor core (820) to the current-stage processor core (822), 32 hard wires can connect the data output (829) of the register file (821) in the previous-stage processor core (820) to an input of the multiplexer (827) attached to the data input (830) of the register file (823) in the current-stage processor core (822). The inputs of the multiplexer (827) are the data from the current-stage processor core (824) and the data from the previous-stage processor core (825) delivered over the hard wires (826). When the processor core performs arithmetic, logic, or similar operations, the multiplexer (827) selects the data from the current-stage processor core (824). When the processor core performs a load, the multiplexer selects the data from the current-stage processor core (824) if the data already exists in the local memory of the current-stage processor core; otherwise it selects the data transferred from the previous-stage processor core (825). When register values are being transferred, the multiplexer (827) selects the data transferred from the previous-stage processor core (825). The register address generation modules (828, 832) associated with the register files (821, 823) generate the addresses of the registers whose values are to be transferred and send them to the address inputs (831, 833) of the register files (821, 823). The register values are then passed in several transfers from the register file (821) to the register file (823) through the hard wires (826) and the multiplexer (827). In this way, all or some of the register values in the register file can be transferred over multiple cycles at the cost of only a small number of additional hard wires.
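The trade-off of this embodiment, fewer wires in exchange for more cycles, can be seen in a short model. The sketch assumes one register is moved per cycle over the 32-bit path, which the text does not state explicitly; the function name `transfer_registers` and the list-based register files are likewise illustrative.

```python
def transfer_registers(source, destination, addresses):
    """Copy selected registers from one register file to another.

    source, destination: lists of 32-bit register values
    addresses: register addresses produced by the address generation module
    Returns the number of cycles used, assuming one register per cycle.
    """
    cycles = 0
    for address in addresses:
        bus = source[address] & 0xFFFFFFFF   # value on the 32 hard wires
        destination[address] = bus           # multiplexer selects the previous stage
        cycles += 1
    return cycles


previous_stage = list(range(100, 131))       # 31 registers, as in FIG. 8(a)
current_stage = [0] * 31
print(transfer_registers(previous_stage, current_stage, range(31)))
```

Under this assumption, moving 31 registers takes 31 cycles over 32 wires, whereas the fully wired scheme of FIG. 8(a) uses 31 x 32 = 992 wires to do the same in one cycle.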

[0132] 在图9对应的实施例中,相邻处理器核(940、942)各自具有包含复数个32位通用寄存器的寄存器堆(941、943)。 [0132] In the embodiment corresponding to FIG. 9, the adjacent processor core (940, 942) having each comprising a plurality of 32-bit general register register file (941,943). 在从前级处理器核(940)向本级处理器核(942)传递寄存器值时,可以先由前级处理器核(940)利用数据存储(store)指令将寄存器堆(941)中一个寄存器值写入前级处理器核(940)对应的本地数据存储器(954)中,再由本级处理器核 When the preceding stage processor core (940) is transmitted to the register value according to the present level processor core (942), may first be pre-processor core (940) using the data store (Store) instruction register file (941) in a register value is written to the AV core (940) corresponding to the local data memory (954), and then by the said processor core

(942) uses a data load instruction to read the corresponding data from the local data memory (954) and write it into the corresponding registers of the register file (943). In this embodiment, the data output (949) of the register file (941) in the preceding-stage processor core (940) is connected through a 32-bit connection (946) to the data input (948) of the local data memory (954), and the data input (950) of the register file (943) in the current-stage processor core (942) is connected through the multiplexer (947) and a 32-bit connection (953) to the data output (952) of the local data memory (954). The inputs of the multiplexer (947) are the data (944) from the current-stage processor core and the data (945) from the preceding-stage processor core, delivered over the 32-bit connection (953). When the processor core performs arithmetic, logic, or similar operations, the multiplexer (947) selects the data (944) from the current-stage processor core. When the processor core performs a load operation, the multiplexer selects the data (944) from the current-stage processor core if the data already exists in the local memory corresponding to the current-stage processor core, and otherwise selects the data (945) transferred from the preceding-stage processor core. When register values are being transferred, the multiplexer (947) selects the data (945) transferred from the preceding-stage processor core. In the embodiment corresponding to FIG. 8(c), the values of all registers in the register file (941) may first be written in sequence into the local data memory (954) and then written in sequence into the register file (943); alternatively, the values of only some registers in the register file (941) may first be written in sequence into the local data memory (954) and then written in sequence into the register file (943); or the value of one register in the register file (941) may be written into the local data memory (954) and then immediately written into the register file (943), with this process repeated until every register value that needs to be transferred has been transferred.
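The selection rule of the multiplexer (947) described above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the disclosed circuit: the `Op` enumeration and the function name `select_mux_947` are hypothetical names introduced here.

```python
from enum import Enum


class Op(Enum):
    """Hypothetical operation classes that drive the multiplexer choice."""
    ALU = "alu"                # arithmetic, logic, or similar operation
    LOAD = "load"              # data load operation
    REG_TRANSFER = "transfer"  # register values passed from the preceding core


def select_mux_947(op, in_local_memory, own_data, prev_data):
    """Return the value the multiplexer (947) would forward to register file (943).

    own_data  models data (944) from the current-stage core.
    prev_data models data (945) from the preceding-stage core.
    """
    if op is Op.ALU:
        return own_data
    if op is Op.LOAD:
        # Use the current-stage data only if it is already in local memory.
        return own_data if in_local_memory else prev_data
    if op is Op.REG_TRANSFER:
        return prev_data
    raise ValueError(f"unknown operation: {op}")
```

For example, a load whose data is not yet in the current-stage local memory takes the value coming from the preceding stage.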

[0133] Please refer to FIG. 10, which shows two embodiments of connection structures composed of the processor cores and corresponding local memories of the invention. Those of ordinary skill in the art may make various substitutions, adjustments, and improvements to the components of these embodiments according to the technical solution and concept of the invention, and all such substitutions, adjustments, and improvements shall fall within the protection scope of the appended claims.

[0134] The embodiment corresponding to FIG. 10(a) comprises a processor core (1001) that contains a local instruction memory and a local data memory, together with the local data memory (1002) corresponding to its preceding-stage processor core. The processor core (1001) consists of a local instruction memory (1003), a local data memory (1004), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009), and an output buffer (1010).

[0135] The local instruction memory (1003) stores the instructions that the processor core (1001) needs to execute. The operands required by the execution unit (1005) in the processor core (1001) come from the register file (1006) or from immediates in the instruction; execution results are written back to the register file (1006).

[0136] In this embodiment, each local data memory has two sub-memories. Taking the local data memory (1004) as an example, the data read from the two sub-memories is selected by multiplexers (1018, 1019) to produce the final output data (1020).

[0137] A data load instruction can read data from the local data memories (1002, 1004), data from the write buffer (1009), or data (1011) from the external shared memory into the register file (1006). In this embodiment, the data from the local data memories (1002, 1004), the data from the write buffer (1009), and the data (1011) from the external shared memory are selected by multiplexers (1016, 1017) and then input to the register file (1006).

[0138] A data store instruction can store data from the register file (1006) into the local data memory (1004) with a delay, by way of the write buffer (1009), or into the external shared memory with a delay, by way of the output buffer (1010). While data is being read from the local data memory (1002) into the register file (1006), the same data can also be stored with a delay into the local data memory (1004) by way of the write buffer (1009). This completes the LIS function described in the invention and achieves data transfer at no extra cost.
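The LIS behaviour described above (a load from the preceding stage's memory that also queues a delayed store into the current stage's memory) can be modelled with a small sketch. The class `Core` and its method names are hypothetical, and the write buffer is assumed here to be a simple FIFO queue; the source does not specify its internal organisation.

```python
from collections import deque


class Core:
    """Toy model of processor core (1001) for the LIS data path only."""

    def __init__(self, prev_memory):
        self.prev_memory = prev_memory  # models local data memory (1002)
        self.local_memory = {}          # models local data memory (1004)
        self.registers = {}             # models register file (1006)
        self.write_buffer = deque()     # models write buffer (1009), assumed FIFO

    def load_with_lis(self, reg, addr):
        """Load from the preceding stage's memory and queue a delayed store."""
        value = self.prev_memory[addr]
        self.registers[reg] = value
        self.write_buffer.append((addr, value))

    def drain_write_buffer(self):
        """Later, commit the queued stores into the local data memory."""
        while self.write_buffer:
            addr, value = self.write_buffer.popleft()
            self.local_memory[addr] = value
```

In this model the register receives the value immediately, while the copy into the local data memory appears only once the write buffer is drained.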

[0139] In the embodiment corresponding to FIG. 10(a), the write buffer (1009) receives data from three sources: data from the register file (1006), data from the local data memory (1002) of the preceding-stage processor core, and data (1011) from the external shared memory. These three data sources are selected by a multiplexer (1012) and then input to the write buffer (1009).

[0140] In the embodiment corresponding to FIG. 10(a), a local data memory receives data input only from the write buffer in the same processor core. For example, in the processor core (1001), the local data memory (1004) receives data input only from the write buffer (1009).

[0141] In the embodiment corresponding to FIG. 10(a), the local instruction memory (1003) and the local data memories (1002, 1004) each consist of two identical sub-memories, so different sub-memories of a local memory can be read and written at the same time. This structure makes it possible to implement the local data memory with ping-pong buffer exchange described in the technical solution of the invention. The address received by the local instruction memory (1003) is generated by the program counter (1008). The addresses received by the local data memory (1004) come from three sources: the address for storing data, from the address storage portion of the write buffer (1009) of the current-stage processor core; the address for reading data, from the data address generation module (1007) of the current-stage processor core; and the address (1013) for reading data, from the data address generation module of the succeeding-stage processor core. These three addresses are selected by multiplexers (1014, 1015) and then input respectively to the address receiving modules of the different sub-memories in the local data memory (1004).
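A ping-pong arrangement of two identical sub-memories can be sketched as below. This is a minimal software analogy under assumptions: the class `PingPongMemory`, the explicit `swap` call, and the zero-initialised banks are hypothetical, and the source does not state when or how the exchange is triggered.

```python
class PingPongMemory:
    """Two identical banks: one is written while the other is read."""

    def __init__(self, size):
        self.banks = [[0] * size, [0] * size]
        self.write_bank = 0  # index of the bank currently being written

    def write(self, addr, value):
        """Store into the bank currently assigned to the writer."""
        self.banks[self.write_bank][addr] = value

    def read(self, addr):
        """Read from the other bank, so reads and writes never collide."""
        return self.banks[1 - self.write_bank][addr]

    def swap(self):
        """Exchange the roles of the two banks."""
        self.write_bank = 1 - self.write_bank
```

A value written before a swap becomes visible to the reader only after the swap, and later writes go to the other bank without disturbing it.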

[0142] Correspondingly, the addresses received by the local data memory (1002) also come from three sources: the address for storing data, from the address storage portion of the write buffer of its own-stage processor core; the address for reading data, from the data address generation module of its own-stage processor core; and the address for reading data, from the data address generation module (1007) of the succeeding-stage processor core. These addresses are selected by multiplexers and then input respectively to the address receiving modules of the different sub-memories in the local data memory (1002).

[0143] FIG. 10(b) shows another connection structure composed of the processor cores and corresponding local memories of the invention. It comprises a processor core (1021) that contains a local instruction memory and a local data memory, together with the local data memory (1022) corresponding to its preceding-stage processor core. The processor core (1021) consists of a local instruction memory (1003), a local data memory (1024), an execution unit (1005), a register file (1006), a data address generation module (1007), a program counter (1008), a write buffer (1009), and an output buffer (1010).

[0144] The connection structure of the embodiment corresponding to FIG. 10(b) is largely the same as that of the embodiment corresponding to FIG. 10(a). The only difference is that each of the local data memories (1022, 1024) in this embodiment consists of one dual-port memory. A dual-port memory can support read and write operations on two different addresses at the same time.

[0145] The addresses received by the local data memory (1024) come from three sources: the address for storing data, from the address storage portion of the write buffer (1009) of the current-stage processor core; the address for reading data, from the data address generation module (1007) of the current-stage processor core; and the address (1025) for reading data, from the data address generation module of the succeeding-stage processor core. These three addresses are selected by a multiplexer (1026) and then input to the address receiving module of the local data memory (1024).

[0146] Correspondingly, the addresses received by the local data memory (1022) also come from three sources: the address for storing data, from the address storage portion of the write buffer of its own-stage processor core; the address for reading data, from the data address generation module of its own-stage processor core; and the address for reading data, from the data address generation module (1007) of the succeeding-stage processor core. These addresses are selected by a multiplexer and then input to the address receiving module of the local data memory (1022).

[0147] Because data load and data store instructions, which need to access memory, generally make up no more than 40% of a typical program, a single-port memory can be used in place of the dual-port memory in the embodiment corresponding to FIG. 10(b). The order of instructions is adjusted statically when the program is compiled, or the order of instruction execution is adjusted dynamically when the program runs, so that memory accesses are carried out while instructions that do not access memory are executing. This makes the connection structure simpler and more efficient.

[0148] Each local data memory in the embodiment corresponding to FIG. 10(b) is in fact a dual-port memory that can support two reads, two writes, or one read and one write at the same time. To ensure that data is not overwritten by mistake during execution, the method shown in FIG. 10(c) may be used: each address in the local data memory (1031) is given a corresponding valid flag bit (1032) and an ownership flag bit (1033).

[0149] In FIG. 10(c), the valid flag bit (1032) indicates whether the data (1034) at the corresponding address in the local data memory (1031) is valid. For example, "1" may indicate that the data (1034) at that address is valid and "0" that it is invalid. The ownership flag bit (1033) indicates which processor core uses the data (1034) at the corresponding address in the local data memory (1031). For example, "0" may indicate that the data (1034) at that address is used by the processor core (1035) corresponding to the local data memory (1031), and "1" that it is used by both the processor core (1035) corresponding to the local data memory (1031) and its succeeding-stage processor core (1036).

[0150] In a specific embodiment, the attributes of each data item stored in the local data memory can be described according to the above definitions of the valid flag bit (1032) and the ownership flag bit (1033), so that correct reads and writes are ensured.

[0151] In the embodiment corresponding to FIG. 10(c), if the valid flag bit (1032) for an address in the local data memory (1031) is "0", the data (1034) at that address is invalid, so a data store operation may be performed on that address directly if needed. If the valid flag bit (1032) is "1" and the ownership flag bit (1033) is "0", the data (1034) at that address is valid and is used by the processor core (1035) corresponding to the local data memory (1031), so the current-stage processor core (1035) may perform a data store operation on that address directly if needed. If the valid flag bit (1032) is "1" and the ownership flag bit (1033) is "1", the data (1034) at that address is valid and is to be used by both the processor core (1035) corresponding to the local data memory (1031) and its succeeding-stage processor core (1036). In that case, if the current-stage processor core (1035) needs to perform a data store operation on that address, it must wait until the ownership flag bit (1033) becomes "0". That is, the data (1034) at that address is first transferred to the corresponding location in the local data memory (1037) of the succeeding-stage processor core (1036), and at the same time the ownership flag bit (1033) for that address in the local data memory (1031) of the current-stage processor core (1035) is set to "0"; the current-stage processor core (1035) can then perform the data store operation on that address.

[0152] In the embodiment corresponding to FIG. 10(c), when the current-stage processor core (1035) performs a data store operation on its corresponding local data memory (1031), the corresponding valid flag bit (1032) may be set to "1" and the ownership flag bit (1033) decided by whether the data (1034) will be used by the succeeding-stage processor core (1036): the ownership flag bit (1033) is set to "1" if it will be, and to "0" otherwise. Alternatively, the corresponding valid flag bit (1032) may be set to "1" and the corresponding ownership flag bit (1033) also always set to "1". Although this requires a larger local data memory (1031), it simplifies the implementation.
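The valid and ownership flag rules of FIG. 10(c) can be sketched as a small model. The class `FlaggedMemory` and its methods are hypothetical names, and the flag values written into the succeeding stage's memory during `forward` are an assumption made for the sketch; the source only specifies that the data is copied and that the local ownership flag is cleared.

```python
class FlaggedMemory:
    """Toy model of local data memory (1031) with per-address flags."""

    def __init__(self, size):
        self.data = [None] * size
        self.valid = [0] * size  # models valid flag bit (1032)
        self.owner = [0] * size  # models ownership flag bit (1033)

    def can_store(self, addr):
        """Store is allowed if the entry is invalid or not owed to the next core."""
        return self.valid[addr] == 0 or self.owner[addr] == 0

    def store(self, addr, value, used_by_next):
        if not self.can_store(addr):
            raise RuntimeError("entry must be forwarded to the next core first")
        self.data[addr] = value
        self.valid[addr] = 1
        self.owner[addr] = 1 if used_by_next else 0

    def forward(self, addr, next_memory):
        """Copy the entry to the succeeding stage's memory and release it here."""
        next_memory.data[addr] = self.data[addr]
        next_memory.valid[addr] = 1  # assumption: the copy is marked valid
        self.owner[addr] = 0
```

The simpler alternative described in paragraph [0152] corresponds to always calling `store` with `used_by_next=True`.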

[0153] Please refer to FIG. 11(a), which shows the typical structure of an existing system on chip. The processor core (1101), the digital signal processor core (1102), the functional units (1103, 1104, 1105), the input/output interface control module (1106), and the memory control module (1108) are all connected to the system bus (1110). The system on chip can exchange data with peripheral devices (1107) through the input/output interface control module (1106), and with the external memory (1109) through the memory control module (1108).

[0154] Please refer to FIG. 11(b), which shows one embodiment of a system on chip implemented with the technical solution of the invention. In this embodiment, the processor core and corresponding local memory (1121) together with six other processor cores and corresponding local memories form the function module (1124); the processor core and corresponding local memory (1122) together with four other processor cores and corresponding local memories form the function module (1125); and the processor core and corresponding local memory (1123) together with two other processor cores and corresponding local memories form the function module (1126). Each of the function modules (1124, 1125, 1126) may correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104, or 1105), the input/output interface control module (1106), or the memory control module (1108) of the embodiment in FIG. 11(a).

[0155] Taking the function module (1126) as an example, the processor cores and corresponding local memories (1123, 1127, 1128, 1129) form a serially connected multi-core structure, and these four processor cores and corresponding local memories (1123, 1127, 1128, 1129) together implement the function of the function module (1126).

[0156] Data transfer between the processor core and corresponding local memory (1123) and the processor core and corresponding local memory (1127) takes place over the internal connection (1130). Likewise, data transfer between (1127) and (1128) takes place over the internal connection (1131), and data transfer between (1128) and (1129) takes place over the internal connection (1132).

[0157] The function module (1126) is connected to the bus connection module (1138) by hard-wired connections (1133, 1134), so that the function module (1126) and the bus connection module (1138) can transfer data to each other. Likewise, the function module (1125) and the bus connection module (1139) can transfer data to each other, and the function module (1124) and the bus connection modules (1140, 1141) can transfer data to each other. The bus connection modules (1138) and (1139) can transfer data to each other over the hard-wired connection (1135), the bus connection modules (1139) and (1140) over the hard-wired connection (1136), and the bus connection modules (1140) and (1141) over the hard-wired connection (1137). In this way, data can be transferred among the function module (1125), the function module (1126), and the function module (1127). The bus connection modules (1138, 1139, 1140, 1141) and the hard-wired connections (1135, 1136, 1137) perform the function of the system bus (1110) in FIG. 11(a) and, together with the function modules (1125, 1126, 1127), form a typical system-on-chip structure.

[0158] Because the number of processor cores and corresponding local memories in the configurable multi-core/many-core device proposed by the invention is easy to scale, the method of this embodiment makes it convenient to implement various types of systems on chip. In addition, while the configurable multi-core/many-core device proposed by the invention is running in real time, the structure of the system on chip can also be changed flexibly through real-time dynamic configuration.

[0159] Please refer to FIG. 11(c), which shows another embodiment of a system on chip implemented with the technical solution of the invention. In this embodiment, the processor core and corresponding local memory (1151) together with six other processor cores and corresponding local memories form the function module (1163); the processor core and corresponding local memory (1152) together with four other processor cores and corresponding local memories form the function module (1164); and the processor core and corresponding local memory (1153) together with two other processor cores and corresponding local memories form the function module (1165). Each of the function modules (1163, 1164, 1165) may correspond to the processor core (1101), the digital signal processor core (1102), a functional unit (1103, 1104, or 1105), the input/output interface control module (1106), or the memory control module (1108) of the embodiment in FIG. 11(a).

[0160] Taking the function module (1165) as an example, the processor cores and corresponding local memories (1153, 1154, 1155, 1156) form a serially connected multi-core structure, and these four processor cores and corresponding local memories (1153, 1154, 1155, 1156) together implement the function of the function module (1165).

[0161] Data transfer between the processor core and corresponding local memory (1153) and the processor core and corresponding local memory (1154) takes place over the internal connection (1160). Likewise, data transfer between (1154) and (1155) takes place over the internal connection (1161), and data transfer between (1155) and (1156) takes place over the internal connection (1162).

[0162] In this embodiment, as one example, the data transfer needed between the function module (1165) and the function module (1164) is carried out by data transfer between the processor core and corresponding local memory (1156) and the processor core and corresponding local memory (1166). According to the technical solution of the invention, once the processor core and corresponding local memory (1156) needs to exchange data with the processor core and corresponding local memory (1166) during operation, the configurable interconnection network automatically configures and establishes, according to that data transfer requirement, a bidirectional data path (1158) between the two. Similarly, if the processor core and corresponding local memory (1166) needs to send data one way to the processor core and corresponding local memory (1156), or (1156) needs to send data one way to (1166), a unidirectional data path can be established in the same manner.
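The on-demand creation of bidirectional or unidirectional data paths can be sketched as follows. The class `ConfigurableInterconnect` and its methods are hypothetical; the sketch only records which directed paths exist and does not model how the hardware network is actually configured.

```python
class ConfigurableInterconnect:
    """Toy model: data paths are created only when a transfer is requested."""

    def __init__(self):
        self.paths = set()  # directed (source, destination) pairs

    def request(self, src, dst, bidirectional=False):
        """Establish a path from src to dst, and optionally the reverse path."""
        self.paths.add((src, dst))
        if bidirectional:
            self.paths.add((dst, src))

    def can_send(self, src, dst):
        """Report whether a path from src to dst has been established."""
        return (src, dst) in self.paths


# Models data path (1158) between cores (1156) and (1166).
net = ConfigurableInterconnect()
net.request(1156, 1166, bidirectional=True)
```

A one-way request, by contrast, leaves the reverse direction unavailable until it is requested separately.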

[0163] In this embodiment, a bidirectional data path (1157) is also established between the processor core and corresponding local memory (1151) and the processor core and corresponding local memory (1152), and a bidirectional data path (1159) between the processor core and corresponding local memory (1165) and the processor core and corresponding local memory (1155). In this way, data can be transferred among the function module (1163), the function module (1164), and the function module (1165). The bidirectional data paths (1157, 1158, 1159) perform the function of the system bus (1110) in FIG. 11(a) and, together with the function modules (1163, 1164, 1165), form a typical system-on-chip structure.

[0164] Depending on the application requirements of the system on chip, there need not be only one set of data paths between any two function modules. Because the number of processor cores in the configurable multi-core/many-core device proposed by the invention is easy to scale, the method of this embodiment makes it convenient to implement various types of systems on chip. In addition, while the configurable multi-core/many-core device proposed by the invention is running in real time, the structure of the system on chip can also be changed flexibly through real-time dynamic configuration.

[0165] FIG. 12 shows pre-compilation and post-compilation embodiments: FIG. 12(a) is a pre-compilation embodiment and FIG. 12(b) is a post-compilation embodiment.

[0166] As shown in FIG. 12(a), the original program code (1201, 1203, 1204) is on the left. The code contains two function calls, a call to function A and a call to function B; 1203 and 1204 are the code of function A and function B themselves. After pre-compilation expansion, the calls to function A and function B are each replaced by the corresponding function code, so the expanded code contains no function calls, as shown at 1202.
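The pre-compilation expansion can be illustrated with a toy model in which a program is a list of strings and a line of the form `call NAME` stands for a function call. This representation and the function name `expand_calls` are assumptions made for the sketch, which also assumes the functions are not recursive.

```python
def expand_calls(code, functions):
    """Replace every 'call NAME' line with the body of NAME.

    Nested calls are expanded as well. Assumes no recursion, since a
    recursive function could not be expanded into finite straight-line code.
    """
    expanded = []
    for line in code:
        if line.startswith("call "):
            name = line[len("call "):]
            expanded.extend(expand_calls(functions[name], functions))
        else:
            expanded.append(line)
    return expanded


# Hypothetical stand-ins for function A (1203) and function B (1204).
functions = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
original = ["x", "call A", "y", "call B", "z"]  # models code 1201
flat = expand_calls(original, functions)        # models code 1202
```

After expansion the code contains no `call` lines, which matches the property stated for 1202.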

[0167] FIG. 12(b) is a post-compilation embodiment. As shown, the original object code (1205) is object code produced by ordinary compilation and is intended for sequential execution. Post-compilation segmentation divides it into the code blocks shown in the figure (1206, 1207, 1208, 1209, 1210, 1211), and each code block is assigned to a corresponding processor core for execution. Loop body A is split off as a single code block (1207), whereas loop body B, being relatively large, is split into two code blocks: loop body B 1 (1209) and loop body B 2 (1210). These two code blocks execute on two processor cores and together complete loop body B.
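One simple way to divide sequential code into contiguous blocks with roughly equal cycle counts is a greedy pass, sketched below. This is not the segmentation rule of the patent, only an illustrative heuristic; the function name `split_code` and the representation of code as a list of per-instruction cycle counts are assumptions.

```python
def split_code(cycles, num_cores):
    """Split per-instruction cycle counts into contiguous blocks, one per core.

    Greedy heuristic: close a block once it reaches the average load, or when
    the remaining instructions are just enough to give each remaining core one.
    """
    target = sum(cycles) / num_cores
    blocks, current, acc = [], [], 0
    for i, c in enumerate(cycles):
        current.append(c)
        acc += c
        remaining_items = len(cycles) - i - 1
        remaining_blocks = num_cores - len(blocks) - 1
        if remaining_blocks > 0 and (
            acc >= target or remaining_items == remaining_blocks
        ):
            blocks.append(current)
            current, acc = [], 0
    if current:
        blocks.append(current)
    return blocks
```

Because the blocks are contiguous, concatenating them reproduces the original instruction order, which is what allows the cores to run them as successive pipeline stages.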

[0168] Please refer to FIG. 13. FIG. 13(a) is a schematic diagram of the configurable multi-core/many-core device of the invention based on a serial multi-issue and pipeline hierarchy; FIG. 13(b) is a schematic diagram of a multi-core serial structure formed by configuration; FIG. 13(c) is a schematic diagram of a multi-core mixed serial-parallel structure formed by configuration; and FIG. 13(d) is a schematic diagram of multiple multi-core structures formed by configuration.

[0169] As shown in FIG. 13(a), the device consists of multiple processor cores with configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) and configurable interconnect structures (1302, 1304, 1306, 1308, 1310, 1312, 1314, 1316, 1318). In this embodiment, each processor core with its configurable local memory forms one stage of the macro pipeline. By configuring the configurable interconnect structures (such as 1302), the processor cores with configurable local memories (1301, 1303, 1305, 1307, 1309, 1311, 1313, 1315, 1317) can be connected into serial connection structures. Multiple serial connection structures may each be independent, or may be partly or fully interrelated, and may run programs serially, in parallel, or in a mixed serial-parallel manner.

[0170] As shown in Figure 13(b), configuring the corresponding configurable interconnect structures forms the multi-core serial structure in the figure, in which the processor core and configurable local memory (1301) is the first stage of the multi-core serial structure and the processor core and configurable local memory (1317) is its last stage.

[0171] As shown in Figure 13(c), by configuring the corresponding configurable interconnect structures, the processor cores and configurable local memories (1301, 1303, 1305, 1313, 1315, 1317) form a serial structure, while the processor cores and configurable local memories (1307, 1309, 1311) form a parallel structure, ultimately forming a multi-core processor with a serial-parallel hybrid structure.

[0172] As shown in Figure 13(d), by configuring the corresponding configurable interconnect structures, the processor cores and configurable local memories (1301, 1307, 1313, 1315) form one serial structure, while the processor cores and configurable local memories (1303, 1309, 1305, 1311, 1317) form another serial structure, giving two completely independent serial structures.
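The effect of configuring the interconnect can be illustrated with a short sketch that treats a configuration as a set of directed links between cores and recovers the serial structures it implies. The representation, the function names, and the assumption that the cores are chained in the order they are listed in paragraph [0172] are all hypothetical; the sketch models serial links only, so the parallel portion of Figure 13(c) is out of scope.

```python
def links_from_order(order):
    """Turn an ordered list of cores into (source, destination) links."""
    return list(zip(order, order[1:]))


def serial_chains(links):
    """Return every maximal serial chain implied by the directed links.

    Assumes each core has at most one successor and one predecessor.
    """
    successors = dict(links)
    has_predecessor = set(successors.values())
    # A chain starts at a core that feeds another core but is fed by none.
    heads = [core for core in successors if core not in has_predecessor]
    chains = []
    for head in sorted(heads):
        chain = [head]
        while chain[-1] in successors:
            chain.append(successors[chain[-1]])
        chains.append(chain)
    return chains


# Configuration resembling Figure 13(d): two independent serial structures.
fig_13d_links = (
    links_from_order([1301, 1307, 1313, 1315])
    + links_from_order([1303, 1309, 1305, 1311, 1317])
)
chains = serial_chains(fig_13d_links)
```

Here `chains` holds two chains that share no core, matching the two completely independent serial structures described for Figure 13(d). Changing only the link list, not the cores, would yield the single nine-stage chain of Figure 13(b).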

Claims (15)

  1. A configurable multi-core structure for executing a program, comprising: a plurality of processor cores; a plurality of configurable local memories assisting the plurality of processor cores; and a plurality of configurable interconnect structures for serially connecting the plurality of processor cores; characterized in that: according to configuration information of the multi-core structure, the program is partitioned into a plurality of code segments corresponding to the plurality of processor cores, such that the numbers of execution cycles of the code segments are equal or similar; when a code segment contains a loop whose iteration count is greater than the iteration count allowed in that code segment, the loop is further divided into two or more groups of sub-loops, such that one code segment contains only one group of sub-loops; wherein, in the configurable multi-core structure: each processor core runs, in serial order, one code segment of the program code; all processor cores in the serially connected multi-core structure jointly realize the complete function of the program code in a pipelined manner; each configurable local memory stores the code segment provided to its corresponding processor core and serves as the source and destination of the data required for operation; each configurable local memory comprises an instruction memory and a configurable data memory; the configurable data memory is logically divided into two parts, of which the upper part is used by the preceding-stage processor core to read and write data and the lower part is used only by the following-stage processor core to read data; the configurable local memory comprises a plurality of data sub-modules that can be accessed separately at the same time by the preceding-stage processor core and the following-stage processor core; and when the preceding-stage processor core and the following-stage processor core both contain a register file, the values of registers in the register file of the preceding-stage processor core are transferred during operation to the corresponding registers in the register file of the following-stage processor core.
  2. The multi-core structure according to claim 1, wherein: a single processor core runs an internal pipeline in single-issue or multi-issue form; and the plurality of processor cores run a macro pipeline in which each processor core is one stage of the macro pipeline, thereby achieving a larger issue count than that of a single processor core.
  3. The multi-core structure according to claim 1, wherein: according to configuration information of the multi-core structure, the program is partitioned into a plurality of code segments corresponding to the plurality of processor cores, such that the numbers of execution cycles of the code segments are equal or similar; and the partitioning process for generating the code segments comprises: a pre-compilation process for replacing function calls in the program with the actual code of the called functions; a compilation process for converting the program source code into object code; and a post-compilation process for partitioning the object code into code segments and adding guiding code to the code segments.
  4. The multi-core structure according to claim 1, further comprising: one or more extension modules; wherein the modules comprise a shared memory for storing data overflowing from the configurable local memories and data transferred between processor cores, a direct memory access (DMA) controller for accessing the configurable local memories, or an exception handling module for handling exceptions of the processor cores and configurable local memories, and wherein each processor core comprises an execution unit and a program counter.
  5. The multi-core structure according to claim 1, wherein: each configurable local memory comprises an instruction memory and a configurable data memory, and the boundary between the instruction memory and the configurable data memory is configurable.
  6. The multi-core structure according to claim 5, wherein: the configurable data memory comprises a plurality of sub-memories, and the boundaries between the sub-memories are configurable.
  7. The multi-core structure according to claim 4, wherein: the configurable interconnect structure comprises connections between the processor cores and the configurable local memories, connections between the processor cores and the shared memory, connections between the processor cores and the direct memory access controller, connections between the configurable local memories and the shared memory, connections between the configurable local memories and the direct memory access controller, connections between the configurable local memories and external systems, and connections between the shared memory and external systems.
  8. The multi-core structure according to claim 2, wherein: the macro pipeline is controlled by back-pressure signals passed between adjacent macro pipeline stages, which determine whether the preceding macro pipeline stage is stalled and whether the current macro pipeline stage is stalled.
  9. The multi-core structure according to claim 1, wherein the processor cores are configured to have a plurality of power management modes, including: a configuration-level power management mode, which places processor cores that are not in use into a low-power state; an instruction-level power management mode, which places a processor core that is waiting for a data access to complete into a low-power state; and an application-level power management mode, which places a processor core whose current utilization is below a threshold into a low-power state.
  10. The multi-core structure according to claim 1, further comprising: a self-test device for generating test vectors and storing test results, such that a processor core can compare its result with the result of an adjacent processor core run on the same test vectors, thereby determining whether the processor core operates normally, wherein any processor core that does not operate normally is marked invalid, and processor cores marked invalid are not configured into the macro pipeline, thereby realizing a self-repair function.
  11. The multi-core structure according to claim 1, capable of forming a system-on-chip containing at least one said multi-core structure, the system-on-chip further comprising: a plurality of processor cores connected in parallel, wherein the plurality of serially connected processor cores and the plurality of parallel-connected processor cores are interconnected to form a serial-parallel hybrid multi-core system-on-chip.
  12. The multi-core structure according to claim 1, capable of forming a system-on-chip containing at least one said multi-core structure, the system-on-chip further comprising: a second group of serially connected multi-core structures, wherein the operation of its serially connected processor cores is independent of the serially connected processor cores in the first group of multi-core structures.
  13. The multi-core structure according to claim 1, capable of forming a system-on-chip containing a plurality of functional modules based on said multi-core structure, the system-on-chip further comprising: a plurality of bus connection modules that connect the plurality of functional modules and are used to exchange data; and a system bus constituted by multiple data paths between the bus connection modules, the plurality of bus connection modules, and the connections between the bus connection modules and the functional modules, wherein the system bus includes a preset connection between two processor cores in different functional modules; and wherein the functional modules include a dedicated functional module that is statically configured to implement a dedicated data processing function and that can be dynamically invoked by other functional modules through configuration.
  14. A configurable multi-core structure for executing a program, comprising: a first processor core, serving as the first stage of a macro pipeline run in the multi-core structure and executing a first code segment of the program; a first configurable local memory, assisting the first processor core and storing the first code segment; a second processor core, serving as the second stage of the macro pipeline and executing a second code segment of the program, wherein the second code segment has a number of execution cycles equal or similar to that of the first code segment; a second configurable local memory, assisting the second processor core and storing the second code segment; and a plurality of configurable interconnect structures for serially connecting the first processor core and the second processor core; characterized in that: each configurable local memory comprises an instruction memory and a configurable data memory; the configurable data memory is logically divided into two parts, of which the upper part is used by the preceding-stage processor core to read and write data and the lower part is used only by the following-stage processor core to read data; an entry common to the first configurable local memory and the second configurable local memory contains a data portion, a valid flag indicating whether the data portion is valid, and an ownership flag indicating which of the first processor core and the second processor core uses the data; when the second processor core reads data from an address for the first time, the second processor core reads from the first configurable local memory and stores the data read into the second configurable local memory, so that all subsequent accesses can be served from the second configurable local memory, thereby realizing a load-induced-store (LIS) function; the first configurable local memory comprises a plurality of data sub-modules that can be accessed separately at the same time by the first processor core and the second processor core; and when the first processor core and the second processor core both contain a register file, the values of registers in the register file of the first processor core are transferred during operation to the corresponding registers in the register file of the second processor core.
  15. The multi-core structure according to claim 14, wherein: the first processor core is configured with a first read policy, such that the first sources for data input to the first processor core include the first configurable local memory, a shared memory, and external devices; the second processor core is configured with a second read policy, such that the second sources for data input to the second processor core include the second configurable local memory, the first configurable local memory, the shared memory, and external devices; the first processor core is configured with a first write policy, such that the first destinations for data output from the first processor core include the first configurable local memory, the shared memory, and external devices; and the second processor core is configured with a second write policy, such that the second destinations for data output from the first processor core include the second configurable local memory, the shared memory, and external devices.
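The load-induced-store (LIS) behaviour recited in claim 14 can be illustrated with a minimal sketch. The class names, the dictionary-based memory model, and the `upstream_reads` counter are assumptions made for illustration; the ownership flag and the hardware timing of the transfer are not modelled.

```python
class LocalMemory:
    """Data memory whose entries carry a valid flag (hypothetical model)."""

    def __init__(self):
        self._entries = {}  # address -> {"data": ..., "valid": bool}

    def write(self, addr, data):
        self._entries[addr] = {"data": data, "valid": True}

    def lookup(self, addr):
        """Return (hit, data); hit is False when no valid entry exists."""
        entry = self._entries.get(addr)
        if entry is not None and entry["valid"]:
            return True, entry["data"]
        return False, None


class SecondStageCore:
    """Second-stage core that reads upstream only on its first access."""

    def __init__(self, own_memory, upstream_memory):
        self.own = own_memory
        self.upstream = upstream_memory
        self.upstream_reads = 0  # counts reads served by the first-stage memory

    def load(self, addr):
        hit, data = self.own.lookup(addr)
        if hit:
            return data
        hit, data = self.upstream.lookup(addr)
        if not hit:
            raise KeyError(f"no valid data at address {addr:#x}")
        self.upstream_reads += 1
        self.own.write(addr, data)  # the load induces a store into own memory
        return data


first_mem, second_mem = LocalMemory(), LocalMemory()
first_mem.write(0x10, 42)  # produced by the first-stage core
core2 = SecondStageCore(second_mem, first_mem)
first_read = core2.load(0x10)   # served from the first memory, then copied
second_read = core2.load(0x10)  # served from the second memory
```

After the two loads, `core2.upstream_reads` is 1: only the first access to the address reaches the first configurable local memory, and later accesses stay local to the second stage.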
CN 200910208432 2009-02-11 2009-09-29 Data processing method and device CN101799750B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN200910046117 2009-02-11
CN 200910208432 CN101799750B (en) 2009-02-11 2009-09-29 Data processing method and device

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
CN 200910208432 CN101799750B (en) 2009-02-11 2009-09-29 Data processing method and device
PCT/CN2009/001346 WO2010060283A1 (en) 2008-11-28 2009-11-30 Data processing method and device
EP20090828544 EP2372530A4 (en) 2008-11-28 2009-11-30 Data processing method and device
KR20117014902A KR101275698B1 (en) 2008-11-28 2009-11-30 Data processing method and device
US13118360 US20110231616A1 (en) 2008-11-28 2011-05-27 Data processing method and system

Publications (2)

Publication Number Publication Date
CN101799750A (en) 2010-08-11
CN101799750B (en) 2015-05-06

Family

ID=42595439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910208432 CN101799750B (en) 2009-02-11 2009-09-29 Data processing method and device

Country Status (1)

Country Link
CN (1) CN101799750B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937370B (en) * 2010-08-16 2013-02-13 中国科学技术大学 Method and device supporting system-level resource distribution and task scheduling on FCMP (Flexible-core Chip Microprocessor)
WO2012049728A1 (en) * 2010-10-12 2012-04-19 富士通株式会社 Simulation device, method, and program
US9552206B2 (en) * 2010-11-18 2017-01-24 Texas Instruments Incorporated Integrated circuit with control node circuitry and processing circuitry
CN102023846B (en) * 2011-01-06 2014-06-04 中国人民解放军国防科学技术大学 Shared front-end assembly line structure based on monolithic multiprocessor system
CN102521201A (en) * 2011-11-16 2012-06-27 刘大可 Multi-core DSP (digital signal processor) system-on-chip and data transmission method
GB2516995B (en) * 2013-12-18 2015-08-19 Imagination Tech Ltd Task execution in a SIMD processing unit
CN104978235A (en) * 2015-06-30 2015-10-14 柏斯红 Operating frequency prediction based load balancing method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1484169A (en) * 2002-06-19 2004-03-24 阿尔卡塔尔加拿大公司 Multiprocessor computing device having shared program memory
CN1608249A (en) * 2001-10-22 2005-04-20 太阳微系统有限公司 Multi-core multi-threaded processors
EP1675015A1 (en) * 2004-12-22 2006-06-28 Galileo Avionica S.p.A. Reconfigurable multiprocessor system particularly for digital processing of radar images

Also Published As

Publication number Publication date Type
CN101799750A (en) 2010-08-11 application

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model