CN103714039A - Universal computing digital signal processor - Google Patents

Publication number
CN103714039A
Authority
CN
China
Legal status
Granted
Application number
CN201310725118.6A
Other languages
Chinese (zh)
Other versions
CN103714039B (en)
Inventor
陈书明
杨学军
万江华
刘仲
陈海燕
郭阳
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201310725118.6A
Publication of CN103714039A
Application granted
Publication of CN103714039B
Legal status: Active
Anticipated expiration

Landscapes

  • Microcomputers (AREA)

Abstract

The invention discloses a general-purpose computing digital signal processor comprising a CPU core unit, a DSP core unit, a multi-level interconnection structure, an on-chip shared storage array, an off-chip memory interface, a first high-speed input/output interface, a second high-speed input/output interface, an inter-chip direct-connection interface and an inter-core synchronization device. The CPU core unit comprises a plurality of CPU cores, and the DSP core unit comprises a plurality of DSP cores. The CPU cores and the DSP cores are each connected with the on-chip shared storage array through the multi-level interconnection structure. The CPU cores are connected with the first high-speed input/output interface. The DSP cores are connected with the second high-speed input/output interface and the inter-chip direct-connection interface. The application program of the processor is produced by a unified parallel programming method, in which the CPU-side object code and the DSP-side object code obtained by compilation are compiled and linked together. The processor retains the basic characteristics of an embedded DSP and its advantages of high performance and low power consumption, while efficiently supporting general-purpose scientific computing.

Description

General-purpose computing digital signal processor
Technical field
The present invention relates to the field of microprocessor architecture, and specifically to a multi-core general-purpose computing digital signal processor (General-Purpose Digital Signal Processor, GPDSP for short) that is suited to 64-bit general-purpose scientific computing while retaining the basic characteristics of an embedded digital signal processor.
Background
Thanks to their low power consumption and hard real-time behavior, digital signal processors (DSPs) are widely used in embedded systems as a typical kind of embedded microprocessor. Although current DSP architectures share many features with central processing units (CPUs) in computation and control, the following are generally regarded as the basic characteristics that distinguish a DSP from a CPU: 1) strong computing capability, with the emphasis on real-time computation rather than on control and transaction processing; 2) dedicated hardware support for typical signal processing, such as multiply-add operations and linear addressing; 3) the common features of embedded microprocessors: address and instruction paths of no more than 32 bits, and most data paths of no more than 32 bits; imprecise interrupts; and a working mode of short-term offline debugging followed by long-term online resident operation (rather than the debug-while-running approach of general-purpose CPUs); 4) integrated peripheral interfaces that are mainly fast external interfaces, particularly suited to sending and receiving high-speed AD/DA data online, with support for high-speed direct connection between DSPs.
In fact, general-purpose scientific computing also fundamentally requires a high performance-to-power ratio. How to apply the low power consumption and high computing performance of the DSP to general-purpose scientific computing has therefore become a pressing technical problem. A conventional DSP has the following problems when used for general-purpose scientific computing: (1) the bit width is small, so computational precision and addressing space are insufficient, whereas general-purpose scientific computing applications need at least 64-bit precision; (2) hardware and software support for task management, file control, process scheduling, interrupt management and the like is lacking, that is, there is no hardware environment for an operating system (OS), which makes it inconvenient to manage general, multi-program computing tasks; (3) there is no support for a unified high-level-language programming model, and multi-core, vector and data parallelism is exploited mainly through assembly programming, which hinders general-purpose programming; (4) debugging on a local host is not supported, and only cross debugging and emulation from another machine is available. These problems severely limit the use of DSPs in general-purpose scientific computing. In summary, a key technical problem to be solved is how to realize a new type of microprocessor that both exploits the high performance-to-power ratio of the DSP and suits general-purpose scientific computing, so that it can clearly improve computer performance and also provide a suitable new DSP for high-precision, high-performance embedded fields that need operating system support, such as radar signal processing and underwater acoustic processing.
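Problem (1) above, insufficient precision and addressing space at 32 bits, can be made concrete with a small numerical sketch. It is an illustration only and not part of the patent: the helper names `to_f32`, `accumulate` and `addressable_bytes` are hypothetical, 32-bit arithmetic is emulated by rounding each intermediate result to IEEE 754 binary32, and a flat byte-addressed memory is assumed.

```python
import struct


def to_f32(x: float) -> float:
    """Round a binary64 Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def accumulate(step: float, count: int, single_precision: bool) -> float:
    """Sum `step` `count` times; when `single_precision` is set, every
    intermediate result is rounded to binary32 to emulate a 32-bit data path."""
    if single_precision:
        step = to_f32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_f32(total)
    return total


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable in a flat, byte-addressed space (assumed addressing model)."""
    return 1 << address_bits


if __name__ == "__main__":
    n = 100_000
    exact = 10_000.0  # mathematically, 0.1 * 100000
    print("binary32 error:", abs(accumulate(0.1, n, True) - exact))
    print("binary64 error:", abs(accumulate(0.1, n, False) - exact))
    print("32-bit address space (GiB):", addressable_bytes(32) // 2**30)
    print("40-bit address space (GiB):", addressable_bytes(40) // 2**30)
```

The accumulated rounding error of the emulated 32-bit sum is several orders of magnitude larger than that of the 64-bit sum, and a 32-bit address reaches only 4 GiB, which is the kind of limitation the background section attributes to conventional DSPs.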
Summary of the invention
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide a general-purpose computing digital signal processor that retains the basic characteristics of an embedded DSP and its advantages of high performance and low power consumption, while efficiently supporting general-purpose scientific computing.
To solve the above technical problem, the present invention adopts the following technical solution:
A general-purpose computing digital signal processor, comprising:
A CPU core unit, comprising at least one CPU core, the CPU core being responsible for general transaction management, including storage management, file control, process scheduling and interrupt management tasks, and providing complete support for a general-purpose operating system;
A DSP core unit, comprising at least one DSP core that provides support for 64-bit operations and for an operating system microkernel;
A multi-level interconnection structure for connecting the CPU core and the DSP core, providing fast fine-grained communication between the CPU core and the DSP core for task scheduling, start/stop and synchronization operations, and realizing coarse-grained data communication by high-speed DMA;
An on-chip shared storage array for providing the CPU core and the DSP core with high-bandwidth data supply, and for providing atomic operations directly supported by hardware, including data invalidation and update, to support cache coherence operations;
An off-chip memory interface for providing off-chip memory expansion for the CPU core and the DSP core;
A first high-speed I/O interface for external data exchange of the CPU core;
A second high-speed I/O interface for external data exchange of the DSP core;
An inter-chip direct-connection interface for supporting direct chip-to-chip connection between general-purpose computing digital signal processors;
An inter-core synchronization device for providing a hardware synchronization mechanism among DSP cores to support synchronization operations between multiple cores, wherein a given DSP core waits, through the inter-core synchronization device, for the other DSP cores that need its new data; the DSP core produces the new data and writes it back to the on-chip shared storage array by a data write-back mechanism; the DSP core then lets the other DSP cores that need its new data continue running through the inter-core synchronization device; and those other DSP cores use a data invalidation mechanism to ensure that they hold no stale data, and read the new data from the on-chip shared storage array;
The CPU core is connected with the DSP core through the multi-level interconnection structure; the CPU core and the DSP core are each connected with the on-chip shared storage array; the on-chip shared storage array is connected with the off-chip memory interface; the CPU core is connected with the first high-speed I/O interface; the DSP core is connected with the second high-speed I/O interface and the inter-chip direct-connection interface respectively; and the inter-core synchronization device is connected with each DSP core.
As further improvements of the above technical solution:
The DSP core comprises:
A 64-bit scalar processing unit for supporting the operating system microkernel; as the main control unit it is responsible for executing scalar programs, for communicating with the CPU core, for controlling the execution of the 64-bit vector processing array, and for the operations shared across the 64-bit vector processing array;
A 64-bit vector processing array for carrying out the compute-intensive tasks of an application;
An instruction dispatch unit for dispatching instructions to the 64-bit scalar processing unit and the 64-bit vector processing array;
The instruction dispatch unit is connected with the 64-bit scalar processing unit and the 64-bit vector processing array respectively, and the 64-bit scalar processing unit and the 64-bit vector processing array are interconnected.
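The division of labor among the three parts of the DSP core can be sketched as a routing step. This is a minimal behavioral model, not the patented hardware: the `Instruction` record, its `is_vector` tag and the opcode names are hypothetical, since the text does not define an instruction encoding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instruction:
    opcode: str
    is_vector: bool  # hypothetical tag; the source gives no encoding


@dataclass
class DispatchResult:
    scalar: List[Instruction] = field(default_factory=list)  # to the scalar unit
    vector: List[Instruction] = field(default_factory=list)  # to the vector array


def dispatch(packet: List[Instruction]) -> DispatchResult:
    """Route each instruction of one issue packet to the scalar processing
    unit or to the vector processing array, preserving program order."""
    result = DispatchResult()
    for ins in packet:
        if ins.is_vector:
            result.vector.append(ins)
        else:
            result.scalar.append(ins)
    return result


if __name__ == "__main__":
    packet = [
        Instruction("sadd", False),   # scalar add (made-up mnemonic)
        Instruction("vfmul", True),   # vector multiply (made-up mnemonic)
        Instruction("sbr", False),    # scalar branch (made-up mnemonic)
    ]
    out = dispatch(packet)
    print([i.opcode for i in out.scalar], [i.opcode for i in out.vector])
```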
The present invention further comprises a JTAG debug interface and a PCIE interface. The DSP core further comprises an emulation debug unit that can access all programmer-visible memories and registers inside the DSP core, and the emulation debug unit is connected through an internal bus with the JTAG debug interface, the PCIE interface and the CPU core unit respectively.
The general-purpose computing digital signal processor of the present invention has the following technical effects:
1. The present invention comprises a CPU core unit, a DSP core unit, a multi-level interconnection structure, an on-chip shared storage array, an off-chip memory interface, a first high-speed I/O interface, a second high-speed I/O interface, an inter-chip direct-connection interface and an inter-core synchronization device. The CPU cores and DSP cores are connected through the multi-level interconnection structure to form a tightly coupled heterogeneous multi-core structure and organization, and the multi-level interconnection mechanism realizes tightly coupled cooperation between program control and data processing. On the one hand, a fast control path and a register-level data path are provided between the CPU core and the DSP core, efficiently supporting fine-grained, hard real-time control and data interaction (such as task scheduling, start/stop, coordinated jumps between the CPU core and the DSP core, and fast synchronization). On the other hand, coarse-grained data communication and sharing between the CPU core and the DSP core are realized through direct memory access (DMA) channels and shared storage. The present invention can therefore achieve close cooperation between CPU cores and DSP cores at different levels and combine embedded real-time signal processing with general-purpose scientific computing, retaining the highly real-time computation and low power consumption of DSP embedded real-time signal processing while also providing the precision and generality needed for general-purpose scientific computing.
2. The DSP core unit of the present invention comprises at least one DSP core that supports 64-bit operations and an operating system microkernel, giving a wide-bit-width DSP core structure and organization suited to the precision and address space of scientific computing. With instruction and data widths of 64 bits or more and an address bus of 40 bits or more, the DSP core can support 64-bit double-precision floating-point and 64-bit fixed-point arithmetic and logic operations, in particular 64-bit double-precision floating-point and fixed-point multiply-add operations. A data path comprising general-purpose register files and data buses at least 64 bits wide supports efficient data supply. Through its support for 64-bit operations, the invention substantially improves computational precision and addressing space.
3. The present invention merges the basic characteristics of a DSP with the control requirements of general-purpose computing. The CPU core is responsible for general transaction management, including storage management, file control, process scheduling and interrupt management tasks, and provides complete support for a general-purpose operating system. The DSP core provides support for 64-bit operations and for an operating system microkernel that has only basic functions such as task scheduling and storage management. The DSP core provides appropriate support for complex flow-control structures, including support for branch instructions that affect the efficiency of vector array processing. A multi-level, tailorable scheme of OS support is provided: the on-chip storage array provides appropriate support for cache coherence mechanisms such as data invalidation and update, and cache coherence is maintained by software and hardware in cooperation. The present invention thus realizes multi-level support for the operating system, markedly improves hardware and software support for task management, file control, process scheduling and interrupt management, and facilitates general task scheduling. The multi-level support scheme also brings flexibility to the hardware implementation.
4. The present invention supports a unified parallel programming method for the GPDSP structure. Compiler directives describe thread-level parallelism among CPU cores, task-level parallelism among multiple DSP cores, and thread-level parallelism and synchronization between CPU cores and DSP cores, and identify the code of the CPU cores and of the DSP cores separately. Within a unified compiler framework, the CPU compiler and the DSP compiler are invoked to compile the code of the respective cores, and the results are linked into a single-chip executable. This realizes unified parallel programming with automatic multithreaded parallelization and vectorization on the GPDSP, improves the efficiency with which high-level-language programs exploit parallel resources such as multiple cores and vector arithmetic arrays, and offers good generality, ease of use and a wide range of applications.
5. The present invention further comprises a JTAG debug interface and a PCIE interface, and the DSP core further comprises an emulation debug unit that can access all programmer-visible memories and registers inside the DSP core; the emulation debug unit is connected through an internal bus with the JTAG debug interface, the PCIE interface and the CPU core unit respectively. Debugging from the local CPU host is provided while the design and construction of cross emulation debugging from another machine is retained. An accurate, high-speed access mechanism between an external host and the address resources of the DSP core (such as registers and memories) can therefore be realized, the running state of the DSP can be obtained accurately, and advanced debugging such as viewing and modifying operating system microkernel resources is made convenient, so that debugging is convenient, fast and efficient.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall framework of an embodiment of the present invention.
Fig. 2 is a schematic diagram of the framework of the DSP core in the embodiment of the present invention.
Fig. 3 is a schematic diagram of the debug interfaces of the embodiment of the present invention.
Fig. 4 is a schematic diagram of part of the detailed framework of the embodiment of the present invention.
Fig. 5 is a schematic diagram of the unified programming model with automatic multithreaded parallelization and vectorization in the embodiment of the present invention.
Legend: 1, CPU core unit; 101, JTAG debug interface; 102, PCIE interface; 11, CPU core; 2, DSP core unit; 21, DSP core; 211, 64-bit scalar processing unit; 2111, instruction control unit; 2112, level-1 cache; 2113, scalar arithmetic unit; 2114, scalar memory access unit; 2115, scalar register file; 212, 64-bit vector processing array; 2121, DMA unit; 2122, shared array memory bank; 2123, processing unit; 2124, vector memory access unit; 2125, vector register file; 2126, vector arithmetic unit; 213, instruction dispatch unit; 214, emulation debug unit; 3, multi-level interconnection structure; 4, on-chip shared storage array; 5, off-chip memory interface; 6, first high-speed I/O interface; 7, second high-speed I/O interface; 8, inter-chip direct-connection interface; 9, inter-core synchronization device.
Detailed description of the embodiments
As shown in Fig. 1, the general-purpose computing digital signal processor of the present embodiment comprises:
A CPU core unit 1, comprising a plurality of CPU cores 11. The CPU cores 11 are responsible for general transaction management, including storage management, file control, process scheduling and interrupt management tasks, and provide complete support for a general-purpose operating system. The number of CPU cores 11 in the CPU core unit 1 can be adjusted as required; any number of one or more is workable.
A DSP core unit 2, comprising a plurality of DSP cores 21 that provide support for 64-bit operations and for an operating system microkernel. The number of DSP cores 21 in the DSP core unit 2 can be adjusted as required; any number of one or more is workable.
A multi-level interconnection structure 3 for connecting the CPU cores 11 and the DSP cores 21. Through a "fine-grained control and data path" it provides fast fine-grained communication between the CPU cores 11 and the DSP cores 21 for task scheduling, start/stop and synchronization operations, and through "fast DMA" it realizes coarse-grained data communication between them.
An on-chip shared storage array 4 for providing the CPU cores 11 and the DSP cores 21 with high-bandwidth data supply, which further improves the efficiency of coarse-grained data communication. It also provides appropriate support for cache coherence, in the form of atomic operations directly supported by hardware such as data invalidation and update.
An off-chip memory interface 5 for providing off-chip memory expansion for the CPU cores 11 and the DSP cores 21, making memory expansion more flexible.
A first high-speed I/O interface 6 for external data exchange of the CPU cores 11.
A second high-speed I/O interface 7 for external data exchange of the DSP cores 21.
An inter-chip direct-connection interface 8 for supporting direct chip-to-chip connection between general-purpose computing digital signal processors.
An inter-core synchronization device 9 for providing a hardware synchronization mechanism among the DSP cores to support synchronization operations between multiple cores. A given DSP core 21 waits, through the inter-core synchronization device 9, for the other DSP cores 21 that need its new data; the DSP core 21 produces the new data and writes it back to the on-chip shared storage array 4 by a data write-back mechanism; the DSP core 21 then lets the other DSP cores 21 that need its new data continue running through the inter-core synchronization device 9; and those other DSP cores 21 use a data invalidation mechanism to ensure that they hold no stale data, and read the new data from the on-chip shared storage array 4.
The CPU cores 11 are connected with the DSP cores 21 through the multi-level interconnection structure 3; the CPU cores 11 and the DSP cores 21 are each connected with the on-chip shared storage array 4; the on-chip shared storage array 4 is connected with the off-chip memory interface 5; the CPU cores 11 are connected with the first high-speed I/O interface 6; the DSP cores 21 are connected with the second high-speed I/O interface 7 and the inter-chip direct-connection interface 8 respectively; and the inter-core synchronization device 9 is connected with each DSP core 21.
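The connections listed for Fig. 1 can be written down as a small graph, which makes it easy to check which path a given access must take. This is only a reading aid: the node names reuse the reference numerals of the legend, treating the multi-level interconnection structure 3 as a node is a modeling choice, and the helper names `build_graph` and `shortest_path` are hypothetical.

```python
from collections import defaultdict, deque

# One undirected edge per connection stated for Fig. 1.
EDGES = [
    ("CPU core 11", "multi-level interconnect 3"),
    ("DSP core 21", "multi-level interconnect 3"),
    ("CPU core 11", "on-chip shared storage array 4"),
    ("DSP core 21", "on-chip shared storage array 4"),
    ("on-chip shared storage array 4", "off-chip memory interface 5"),
    ("CPU core 11", "first high-speed I/O interface 6"),
    ("DSP core 21", "second high-speed I/O interface 7"),
    ("DSP core 21", "inter-chip direct-connection interface 8"),
    ("DSP core 21", "inter-core synchronization device 9"),
]


def build_graph(edges):
    """Build an undirected adjacency map from the edge list."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def shortest_path(graph, src, dst):
    """Breadth-first search; returns the first shortest path found, or None."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in sorted(graph[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    g = build_graph(EDGES)
    print(shortest_path(g, "DSP core 21", "off-chip memory interface 5"))
```

In this model a DSP core reaches off-chip memory only by way of the on-chip shared storage array, and the inter-core synchronization device has no link to the CPU cores, matching the connection list above.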
As shown in Fig. 2, the DSP core 21 comprises:
A 64-bit scalar processing unit 211 for supporting the operating system microkernel; as the main control unit it is responsible for executing scalar programs, for efficient communication with the CPU cores 11, for controlling the execution of the 64-bit vector processing array 212, and for the operations shared across the 64-bit vector processing array 212 (including the configuration and modification of common variables and the unified flow-control operations of the processing array).
A 64-bit vector processing array 212 for carrying out the compute-intensive tasks of an application.
An instruction dispatch unit 213 for dispatching instructions to the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212.
The instruction dispatch unit 213 is connected with the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212 respectively, and the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212 are interconnected.
Through the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212, the DSP core 21 of the present embodiment supports 64-bit operations and in particular provides efficient support for double-precision floating-point and fixed-point operations. The 64-bit scalar processing unit 211 supports an operating system microkernel that offers only basic functions such as process scheduling and storage management, and the 64-bit vector processing array 212 carries out the compute-intensive tasks of an application. During computation, the instruction dispatch unit 213 dispatches scalar instructions to the 64-bit scalar processing unit 211 and vector instructions to the 64-bit vector processing array 212. The two cooperate to provide appropriate support for complex flow-control operations that affect the execution efficiency of the vector array; these flow-control operations include branch structures inside loop bodies and while-type loops whose iteration count is not known in advance.
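The two flow-control cases named above, a branch inside a loop body and a while-type loop with an unknown trip count, can be illustrated with a lane-by-lane software model. It is a sketch under stated assumptions rather than the embodiment's instruction set: lanes are modeled as list elements, a predicate list stands in for the condition register, and the "keep looping while any lane is still active" policy is an assumption, since the text only says that the jump depends on register values across the array.

```python
from typing import List


def predicated_abs(values: List[int]) -> List[int]:
    """If-converted form of `y = -x if x < 0 else x`, evaluated on every lane.

    The branch is replaced by a predicate; the guarded negate takes effect
    only on lanes whose predicate is true and acts as a no-op elsewhere.
    """
    pred = [x < 0 for x in values]            # one predicate bit per lane
    result = list(values)                     # unguarded move: y = x
    for lane, p in enumerate(pred):
        result[lane] = -values[lane] if p else result[lane]
    return result


def halving_steps(values: List[int]) -> List[int]:
    """Lock-step version of `while v > 1: v //= 2; steps += 1` per lane."""
    v = list(values)
    steps = [0] * len(v)
    active = [x > 1 for x in v]
    while any(active):                        # array-level branch (assumed policy)
        for lane in range(len(v)):
            if active[lane]:                  # inactive lanes are masked off
                v[lane] //= 2
                steps[lane] += 1
        active = [x > 1 for x in v]
    return steps


if __name__ == "__main__":
    print(predicated_abs([3, -4, 0, -1]))
    print(halving_steps([1, 2, 8, 5]))
```

All lanes run the same instruction stream; lanes that finish early simply stop updating their state, which is why a branch per element is not needed.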
As shown in Fig. 3, the present embodiment further comprises a JTAG debug interface 101 and a PCIE interface 102. The DSP core 21 further comprises an emulation debug unit 214 that can access all programmer-visible memories and registers inside the DSP core 21, and the emulation debug unit 214 is connected through an internal bus with the JTAG debug interface 101, the PCIE interface 102 and the CPU core unit 1 respectively. The emulation debug unit 214 provides accurate access to the running state of the DSP core unit 2 as well as advanced debug functions such as viewing and modifying operating system microkernel resources; these memories and registers are also accessed by the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212. Because the emulation debug unit 214 is connected through the internal bus with the JTAG debug interface 101, the PCIE interface 102 and the CPU core unit 1 respectively, the present embodiment supports three debug modes: JTAG debug mode, PCIE debug mode and CPU debug mode. The emulation debug unit 214 accepts debug requests from the JTAG debug interface 101 and accesses memories or registers in the present embodiment according to those requests. The JTAG debug interface 101 is the emulation interface that conventional DSPs generally support; it lets an external host access the DSP's internal memories and registers. Because the JTAG interface protocol is a serial protocol whose clock frequency is usually below 100 MHz, its access speed is low. The present embodiment therefore also supports debug requests issued over the internal bus by the local CPU core unit 1, or by an external host through the PCIE interface 102. Because the internal bus uses a parallel transmission protocol and runs at a frequency far higher than the interface frequency of the JTAG debug interface 101, it supports high-speed debug access. This high-speed debug access path serves as the fast control path and register-level data path between the CPU cores 11 and the DSP cores 21, providing efficient support for fine-grained, hard real-time interactions such as task scheduling, start/stop, coordinated jumps between the CPU cores 11 and the DSP cores 21, and fast synchronization.
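The serial-versus-parallel argument can be put into rough numbers. In the sketch below only the 100 MHz JTAG clock ceiling comes from the description; the internal bus width (64 bits) and clock (500 MHz) are hypothetical placeholders chosen purely for illustration, and all protocol overhead is ignored, so the results are lower bounds rather than measurements.

```python
def transfer_seconds(num_bytes: int, clock_hz: float, bits_per_cycle: int) -> float:
    """Lower bound on transfer time: payload bits divided by raw link rate."""
    return (num_bytes * 8) / (clock_hz * bits_per_cycle)


JTAG_CLOCK_HZ = 100e6    # upper bound quoted in the description
JTAG_BITS_PER_CYCLE = 1  # serial protocol: one bit per clock
BUS_CLOCK_HZ = 500e6     # hypothetical internal bus clock (assumption)
BUS_WIDTH_BITS = 64      # hypothetical internal bus width (assumption)

if __name__ == "__main__":
    dump = 2**20  # read back 1 MiB of DSP memory
    t_jtag = transfer_seconds(dump, JTAG_CLOCK_HZ, JTAG_BITS_PER_CYCLE)
    t_bus = transfer_seconds(dump, BUS_CLOCK_HZ, BUS_WIDTH_BITS)
    print(f"JTAG: {t_jtag * 1e3:.2f} ms, internal bus: {t_bus * 1e3:.3f} ms")
    print(f"speed-up under these assumptions: {t_jtag / t_bus:.0f}x")
```

Whatever the real bus parameters are, the ratio scales with both the bus width and the clock ratio, which is the reason the description gives for the faster CPU and PCIE debug modes.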
The present embodiment merges the basic characteristics of a DSP with the control requirements of general-purpose computing. The CPU cores 11 provide complete support for the OS. The DSP core 21 adopts a very long instruction word (VLIW) structure and comprises the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212; the 64-bit scalar processing unit 211 supports an OS microkernel that has only basic functions such as task scheduling and storage management.
As shown in Fig. 4, the 64-bit scalar processing unit 211 comprises an instruction control unit 2111, a level-1 cache 2112, a scalar arithmetic unit 2113, a scalar memory access unit 2114 and a scalar register file 2115. The instruction control unit 2111 is connected with the instruction dispatch unit 213, the scalar register file 2115 and the level-1 cache 2112 respectively; the level-1 cache 2112 is connected with the on-chip shared storage array 4; the scalar memory access unit 2114 is connected with the level-1 cache 2112 and the scalar register file 2115 respectively; and the scalar arithmetic unit 2113 is connected with the scalar register file 2115. The instruction control unit 2111 fetches instructions from the level-1 cache 2112 and sends them to the instruction dispatch unit 213, which in turn sends each instruction to the 64-bit scalar processing unit 211 or the 64-bit vector processing array 212 for processing.
The 64-bit vector processing array 212 comprises a DMA unit 2121, a shared array memory bank 2122 and a plurality of homogeneous or heterogeneous processing units 2123. The processing units 2123 process data cooperatively and communicate with one another through the shared array memory bank 2122, which is connected with the on-chip shared storage array 4 through the DMA unit 2121. The processing units 2123 carry out the compute-intensive tasks of an application, and each processing unit 2123 comprises a vector memory access unit 2124, a vector register file 2125 and a vector arithmetic unit 2126 connected in sequence. It should be noted that the present embodiment may also adopt a 64-bit scalar processing unit 211 and a 64-bit vector processing array 212 of other structures as required, which are not described further here.
In the present embodiment, the DSP core 21 provides appropriate support for complex flow-control structures and reduces branch overhead by two mechanisms. The first mechanism is conditional execution. A conditionally executed instruction decides whether to execute according to the state of a register in a register file (the scalar register file 2115 or a vector register file 2125); if it is not to execute, the instruction becomes a no-op. The value of the branch condition is placed in a register, and every instruction in the basic block can then be issued as a conditional instruction, so that no branch instruction is needed and the overhead of the branch is avoided. In addition, instruction scheduling moves instructions from the basic block preceding a branch into the branch delay slots, and the hardware guarantees that the instructions in the branch delay slots are always executed after the branch instruction is executed. This also reduces the overhead of branch instructions. The second mechanism is a branch instruction based on the state of the vector processing array: the instruction decides whether to jump according to the values of the same-numbered register in each vector register file 2125 of the 64-bit vector processing array 212.
Data consistency among the DSP cores 21 is maintained by software and hardware in cooperation. The hardware provides the data write-back mechanism of the level-1 cache 2112 and the inter-core synchronization device 9, and the software performs inter-core data consistency operations by operating the level-1 cache 2112 and the inter-core synchronization device 9. The level-1 cache 2112 of the DSP core 21 also provides a data invalidation mechanism that can invalidate all or part of the data in the level-1 cache 2112, and the inter-core synchronization device 9 synchronizes multiple DSP cores 21. The detailed steps for maintaining inter-core data consistency in the present embodiment are as follows:
(1) A given DSP core 21 waits, through the inter-core synchronization device 9, for the other DSP cores 21 that need its new data;
(2) The DSP core 21 produces the new data and, using the data write-back mechanism provided by the level-1 cache 2112, writes the new data back to the on-chip shared storage array 4;
(3) The DSP core 21 uses the inter-core synchronization device 9 to let the other DSP cores 21 that need its new data continue running;
(4) The other cores that need the new data of the DSP core 21 use the data invalidation mechanism of the level-1 cache 2112 to ensure that they hold no stale data, and read the new data from the on-chip shared storage array 4.
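The four-step hand-off can be exercised with a toy software model. It is a sketch under stated assumptions, not the hardware design: `L1Cache` is a hypothetical write-back cache without hardware coherence, a `threading.Event` stands in for the inter-core synchronization device 9, a dictionary stands in for the on-chip shared storage array 4, and step (1) is read as "the consumer is held at the synchronization point until the producer releases it".

```python
import threading
from typing import Dict, List


class L1Cache:
    """Toy private write-back cache of one DSP core (no hardware coherence)."""

    def __init__(self, shared: Dict[int, int]) -> None:
        self.shared = shared
        self.lines: Dict[int, int] = {}
        self.dirty = set()

    def read(self, addr: int) -> int:
        if addr not in self.lines:            # miss: fill from shared storage
            self.lines[addr] = self.shared.get(addr, 0)
        return self.lines[addr]               # hit: may return stale data

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value              # stays local until written back
        self.dirty.add(addr)

    def write_back(self) -> None:
        for addr in self.dirty:
            self.shared[addr] = self.lines[addr]
        self.dirty.clear()

    def invalidate(self) -> None:
        self.lines.clear()                    # the consumer holds no dirty lines here
        self.dirty.clear()


def hand_off(value: int, invalidate_first: bool = True) -> int:
    """Run one producer/consumer exchange and return what the consumer reads."""
    addr = 0x100
    shared: Dict[int, int] = {addr: 0}
    producer, consumer = L1Cache(shared), L1Cache(shared)
    consumer.read(addr)                       # consumer caches the old value
    release = threading.Event()               # stand-in for sync device 9
    seen: List[int] = []

    def consume() -> None:
        release.wait()                        # step (1): held at the sync point
        if invalidate_first:
            consumer.invalidate()             # step (4): drop stale lines
        seen.append(consumer.read(addr))      # step (4): read from shared storage

    worker = threading.Thread(target=consume)
    worker.start()
    producer.write(addr, value)               # step (2): produce new data
    producer.write_back()                     # step (2): write back to array 4
    release.set()                             # step (3): let the consumer continue
    worker.join()
    return seen[0]


if __name__ == "__main__":
    print("with invalidation:", hand_off(42))
    print("without invalidation:", hand_off(42, invalidate_first=False))
```

Skipping the invalidation makes the consumer hit its stale cached line, which is exactly the failure that step (4) is there to prevent.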
The hardware structure of this embodiment supports a unified parallel programming method for the GPDSP architecture. The method builds on the standard OpenMP parallel programming model and extends it with a number of DSP compiler directives, achieving unified parallel programming with automatic multithreaded parallelization and vectorization on the CPU+DSP heterogeneous multi-core processor. The compiler directives describe the thread-level parallelism among the CPU cores 11, the task-level parallelism among the DSP cores 21, and the thread-level parallelism and synchronization between the CPU cores 11 and the DSP cores 21, and they identify which code belongs to the CPU cores 11 and which to the DSP cores 21. Within a unified compiler framework, the CPU compiler and the DSP compiler are invoked separately to compile the code for the respective computing cores, and the results are linked into a single executable for the chip. As shown in Figure 5, the detailed steps of the unified parallel programming method supported by the general-purpose computing digital signal processor of this embodiment are as follows:
1) The programmer writes the application in a standard programming language and syntax (such as standard C/C++), inserting OpenMP compiler directives before the statement blocks that the CPU core unit is to execute in parallel with multiple threads, and inserting DSP compiler directives before the statement blocks that the DSP core unit is to compute;
2) When the application is compiled, the OpenMP compiler directives guide the CPU compiler to perform automatic multithreaded parallelization, and the DSP compiler directives guide the DSP compiler to generate vector code for the DSP cores;
3) The CPU-side compiler tool compiles and links the CPU-side object code and the DSP-side object code together, and finally outputs executable code that can run on the general-purpose computing digital signal processor.
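Steps 1) to 3) can be sketched as a toy build driver. Everything here is an illustrative assumption: `#pragma omp parallel for` is a standard OpenMP directive, but the patent text does not give the syntax of its DSP directives, so `#pragma dsp vector` is invented for this sketch, and `partition`, `compile_for` and `build` are hypothetical stand-ins for the real compilers and linker.

```python
# Annotated C fragment held as data. "#pragma omp" is standard OpenMP;
# "#pragma dsp vector" is a HYPOTHETICAL DSP directive used only in this sketch.
SOURCE = """\
#pragma omp parallel for
for (i = 0; i < n; i++) a[i] = b[i] + c[i];
#pragma dsp vector
for (i = 0; i < n; i++) y[i] = k * x[i];
"""


def partition(source):
    """Route each annotated statement to the CPU or the DSP compiler."""
    regions = {"cpu": [], "dsp": []}
    target = None
    for line in source.splitlines():
        text = line.strip()
        if text.startswith("#pragma omp"):
            target = "cpu"
        elif text.startswith("#pragma dsp"):
            target = "dsp"
        elif text and target is not None:
            regions[target].append(text)
            target = None  # in this toy, a directive covers one statement only
    return regions


def compile_for(target, statements):
    """Stand-in for a real compiler: returns a fake object-code record."""
    return {"target": target, "statements": list(statements)}


def build(source):
    """Compile both sides separately, then 'link' them into one executable."""
    regions = partition(source)
    objects = [compile_for(t, regions[t]) for t in ("cpu", "dsp")]
    return {"format": "single-chip executable", "objects": objects}
```

Calling `build(SOURCE)` yields one record holding a CPU object and a DSP object, mirroring the idea that two compilers run inside one framework and their outputs are linked into a single image. Statements without a directive are simply ignored by this toy, which a real tool chain would not do.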
Therefore, by carrying out the unified parallel programming of steps 1) to 3) above according to the needs of different general-purpose scientific computations, the corresponding executable code can be obtained and run directly on the general-purpose computing digital signal processor. The CPU core unit 1 and the DSP core unit 2 of the processor of this embodiment can thus support general-purpose scientific computing efficiently while retaining the basic characteristics of an embedded DSP and its advantages of high performance and low power consumption. The unified parallel programming method supported by this embodiment uses compiler directives to describe the thread-level parallelism among the CPU cores, the task-level parallelism among the DSP cores, and the thread-level parallelism and synchronization between the CPU cores and the DSP cores, and to identify the code of the CPU cores and of the DSP cores. Within a unified compiler framework, the CPU and DSP compilers are invoked separately to compile the code for the different computing cores, and the results are linked into a single executable for the chip. This achieves unified parallel programming with automatic multithreaded parallelization and vectorization in the GPDSP, improves the efficiency of developing high-level-language programs for parallel resources such as multiple cores and vector operation arrays, and offers good generality, ease of use and a wide range of applications.
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions that fall within the concept of the present invention belong to its scope of protection. It should be pointed out that, for those of ordinary skill in the art, improvements and modifications made without departing from the principle of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (3)

1. A general-purpose computing digital signal processor, characterized by comprising:
a CPU core unit (1), comprising at least one CPU core (11), the CPU core (11) being responsible for general transaction management, including storage management, file control, process scheduling and interrupt management tasks, and providing full support for a general-purpose operating system;
a DSP core unit (2), comprising at least one DSP core (21) that supports 64-bit operations and provides support for an operating system microkernel;
a multi-level interconnection structure (3), for connecting the CPU core (11) and the DSP core (21), providing fast fine-grained communication between the CPU core (11) and the DSP core (21) for task scheduling, start/stop and synchronization operations, and providing coarse-grained data communication by high-speed DMA;
an on-chip shared storage array (4), for providing high-bandwidth data supply to the CPU core (11) and the DSP core (21), and providing atomic operations directly supported by hardware, including data invalidation and update, to support cache coherence operations;
an off-chip memory interface (5), for providing off-chip memory expansion for the CPU core (11) and the DSP core (21);
a first high-speed input/output interface (6), for external data exchange of the CPU core (11);
a second high-speed input/output interface (7), for external data exchange of the DSP core (21);
an inter-chip direct connection interface (8), for supporting direct chip-to-chip connection between general-purpose computing digital signal processors;
an inter-core synchronization device (9), for providing a hardware synchronization mechanism among DSP cores to support synchronization operations among multiple cores; wherein a DSP core (21) waits, through the inter-core synchronization device (9), for the other DSP cores (21) that need this core's new data; this DSP core (21) produces the new data and writes it back to the on-chip shared storage array (4) through a data write-back mechanism; this DSP core (21), through the inter-core synchronization device (9), allows the other DSP cores (21) that need this core's new data to continue running; and the other DSP cores (21) that need this core's new data use a data invalidation mechanism to ensure that they hold no stale data and read the new data from the on-chip shared storage array (4);
the CPU core (11) and the DSP core (21) are connected through the multi-level interconnection structure (3); the CPU core (11) and the DSP core (21) are each connected to the on-chip shared storage array (4); the on-chip shared storage array (4) is connected to the off-chip memory interface (5); the CPU core (11) is connected to the first high-speed input/output interface (6); the DSP core (21) is connected to the second high-speed input/output interface (7) and to the inter-chip direct connection interface (8) respectively; and the inter-core synchronization device (9) is connected to each DSP core (21).
2. The general-purpose computing digital signal processor according to claim 1, characterized in that the DSP core (21) comprises:
a 64-bit scalar processing unit (211), for supporting the operating system microkernel and, as the main control unit, responsible for executing scalar programs, for communicating with the CPU core (11), for controlling the execution of the 64-bit vector processing array (212), and for the identical operations shared within the 64-bit vector processing array (212);
a 64-bit vector processing array (212), for supporting the computation of the compute-intensive tasks of applications;
an instruction dispatch unit (213), for dispatching instructions to the 64-bit scalar processing unit (211) and the 64-bit vector processing array (212);
the instruction dispatch unit (213) is connected to the 64-bit scalar processing unit (211) and the 64-bit vector processing array (212) respectively, and the 64-bit scalar processing unit (211) and the 64-bit vector processing array (212) are connected to each other.
3. The general-purpose computing digital signal processor according to claim 1 or 2, characterized in that it further comprises a JTAG debug interface (101) and a PCIE interface (102); the DSP core (21) further comprises an emulation and debug unit (214) that can access all programmer-visible memories and registers inside the DSP core (21); and the emulation and debug unit (214) is connected through an internal bus to the JTAG debug interface (101), the PCIE interface (102) and the CPU core unit (1) respectively.
CN201310725118.6A 2013-12-25 2013-12-25 universal computing digital signal processor Active CN103714039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310725118.6A CN103714039B (en) 2013-12-25 2013-12-25 universal computing digital signal processor

Publications (2)

Publication Number Publication Date
CN103714039A true CN103714039A (en) 2014-04-09
CN103714039B CN103714039B (en) 2017-01-11

Family

ID=50407032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310725118.6A Active CN103714039B (en) 2013-12-25 2013-12-25 universal computing digital signal processor

Country Status (1)

Country Link
CN (1) CN103714039B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120110280A1 (en) * 2010-11-01 2012-05-03 Bryant Christopher D Out-of-order load/store queue structure
US20120284680A1 (en) * 2011-05-05 2012-11-08 Advanced Micro Devices, Inc. Method and apparatus for designing an integrated circuit
CN103279445A (en) * 2012-09-26 2013-09-04 上海中科高等研究院 Computing method and super-computing system for computing task

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
刘轶等: "一种面向多核处理器并行系统的启发式任务分配算法", 《计算机研究与发展》 *
汪东: "异构多核DSP数据流前瞻关键技术研究", 《中国博士学位论文全文数据库信息科技辑(月刊)》 *
陈书明等: "一种面向多核DSP 的小容量紧耦合快速共享数据池", 《计算机学报》 *
陈海燕等: "面向SDR应用的向量存储器的设计与优化", 《国防科技大学学报》 *

Also Published As

Publication number Publication date
CN103714039B (en) 2017-01-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant