CN103714039B - universal computing digital signal processor - Google Patents


Info

Publication number
CN103714039B
Authority
CN
China
Prior art keywords
dsp core
core
dsp
cpu
interface
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310725118.6A
Other languages
Chinese (zh)
Other versions
CN103714039A (en
Inventor
陈书明
杨学军
万江华
刘仲
陈海燕
郭阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201310725118.6A
Publication of CN103714039A
Application granted
Publication of CN103714039B
Legal status: Active
Anticipated expiration

Landscapes

  • Microcomputers (AREA)

Abstract

The invention discloses a general-purpose computing digital signal processor comprising a CPU core unit, a DSP core unit, a multi-level interconnect structure, an on-chip shared storage array, an off-chip memory interface, a first high-speed input/output interface, a second high-speed input/output interface, an inter-chip direct connection interface and an inter-core synchronization device. The CPU core unit comprises a plurality of CPU cores, and the DSP core unit comprises a plurality of DSP cores. The CPU cores and the DSP cores are each connected to the on-chip shared storage array through the multi-level interconnect structure. The CPU cores are connected to the first high-speed input/output interface. The DSP cores are connected to the second high-speed input/output interface and to the inter-chip direct connection interface. The application program of the general-purpose computing digital signal processor is obtained by uniformly compiling and linking the CPU-side object code and the DSP-side object code produced by compilation under a unified parallel programming method. The general-purpose computing digital signal processor retains the basic features of an embedded DSP and its advantages of high performance and low power consumption, and can efficiently support general scientific computing.

Description

General-purpose computing digital signal processor
Technical field
The present invention relates to the field of microprocessor architecture design, and in particular to a general-purpose computing digital signal processor (General-Purpose Digital Signal Processor, GPDSP for short) with a multi-core structure that is suited to 64-bit general scientific computing while retaining the basic features of an embedded digital signal processor.
Background art
Owing to its low power consumption and hard real-time behaviour, the digital signal processor (DSP) is widely used in embedded systems as a typical embedded microprocessor. Although current DSP architectures share many computation and control features with the central processing unit (CPU), the following are generally regarded as the basic features that distinguish a DSP from a CPU: 1) strong computing capability, with the emphasis on real-time computation rather than on control and transaction processing; 2) dedicated hardware support for typical signal processing, such as multiply-add operations and linear addressing; 3) the common features of embedded microprocessors: address and instruction paths no wider than 32 bits, and most data paths no wider than 32 bits; imprecise interrupts; and a working mode of short-term offline debugging followed by long-term online resident operation (rather than the general-purpose CPU practice of running a program as soon as it has been debugged); 4) integrated peripheral interfaces that mainly serve external devices, which is especially convenient for sending and receiving high-speed AD/DA data online, and which also support high-speed direct connection between DSPs.
General scientific computing today also fundamentally requires a high performance-to-power ratio. How to exploit the low power consumption and high computing performance of the DSP for general scientific computing has therefore become a technical problem to be solved. A conventional DSP used for general scientific computing has the following problems: (1) its bit width is small, so computational precision and addressing space are insufficient, whereas general scientific computing applications need at least 64-bit precision; (2) it lacks software and hardware support for task management, file control, process scheduling, interrupt management and the like, in other words it lacks a hardware environment for an operating system (OS), which makes general-purpose, multiprogrammed task management inconvenient; (3) it lacks support for a unified high-level-language programming model, so that multi-core, vector and data-parallel features are programmed largely in assembly language, which hinders general-purpose programming; (4) it does not support program debugging on a local host and relies solely on cross debugging and emulation from another machine. These problems severely limit the application of DSPs in general scientific computing. In summary, how to realize a new type of microprocessor that both exploits the DSP's high performance-to-power ratio and suits general scientific computing, so that it markedly improves computer performance and also provides a suitable new DSP for embedded fields that need operating-system support, high precision and high performance, such as radar signal processing and underwater acoustic processing, has become a key technical problem to be solved.
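Problem (1) above can be illustrated numerically. The following is a minimal Python sketch, not part of the patent: the helper names `to_float32` and `addressable_bytes` are hypothetical, and a byte-addressed memory is assumed. It shows that a 32-bit floating-point value cannot hold an integer that a 64-bit value holds exactly, and compares the address space reachable with 32 and 40 address bits.

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to IEEE 754 single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with the given address width (byte addressing assumed)."""
    return 1 << address_bits


# 2**24 + 1 is exact in 64-bit floating point but not in 32-bit floating point.
exact = float(2**24 + 1)
rounded = to_float32(exact)

print(exact - rounded)              # error introduced by 32-bit precision
print(addressable_bytes(32))        # 4 GiB with a 32-bit address
print(addressable_bytes(40))        # 1 TiB with a 40-bit address
```

The 32-bit format carries a 24-bit significand, so accumulated sums in long-running scientific kernels lose low-order digits much sooner than in the 64-bit format.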
Summary of the invention
In view of the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide a general-purpose computing digital signal processor that retains the basic features of an embedded DSP and its advantages of high performance and low power consumption while efficiently supporting general scientific computing.
To solve the above technical problem, the present invention adopts the following technical solution:
A general-purpose computing digital signal processor, comprising:
A CPU core unit comprising at least one CPU core, the CPU core being responsible for general transaction management, including storage management, file control, process scheduling and interrupt management tasks, and providing complete support for a general-purpose operating system;
A DSP core unit comprising at least one DSP core that provides support for 64-bit operations and an operating system micro-kernel;
A multi-level interconnect structure for connecting the CPU core and the DSP core, providing fast fine-grained communication between the CPU core and the DSP core for task scheduling, start/stop and synchronization operations, and implementing coarse-grained data communication by high-speed DMA;
An on-chip shared storage array for providing high-bandwidth data supply to the CPU core and the DSP core, and for providing atomic operations directly supported by hardware, including data invalidation and update, to support cache coherence operations;
An off-chip memory interface for extending off-chip memory for the CPU core and the DSP core;
A first high-speed input/output interface for external data exchange of the CPU core;
A second high-speed input/output interface for external data exchange of the DSP core;
An inter-chip direct connection interface for supporting direct chip-to-chip connection between general-purpose computing digital signal processors;
An inter-core synchronization device for providing a hardware synchronization mechanism among the DSP cores to support multi-core synchronization operations, wherein a DSP core uses the inter-core synchronization device to make the other DSP cores that need its new data wait; that DSP core produces the new data and writes it back to the on-chip shared storage array through a data write-back mechanism; that DSP core then uses the inter-core synchronization device to let the other DSP cores that need its new data continue running; and those other DSP cores use a data invalidation mechanism to ensure that they hold no stale data and read the new data from the on-chip shared storage array;
The CPU core is connected to the DSP core through the multi-level interconnect structure; the CPU core and the DSP core are each connected to the on-chip shared storage array; the on-chip shared storage array is connected to the off-chip memory interface; the CPU core is connected to the first high-speed input/output interface; the DSP core is connected to the second high-speed input/output interface and to the inter-chip direct connection interface; and the inter-core synchronization device is connected to each DSP core.
As a further improvement of the above technical solution of the present invention:
The DSP core comprises:
A 64-bit scalar processing unit for supporting the operating system micro-kernel and, as the main control unit, for executing the scalar program, for communication with the CPU core and execution control of the 64-bit vector processing array, and for the operations common to the 64-bit vector processing array;
A 64-bit vector processing array for handling the compute-intensive tasks of an application;
An instruction dispatch unit for dispatching instructions to the 64-bit scalar processing unit and the 64-bit vector processing array;
The instruction dispatch unit is connected to the 64-bit scalar processing unit and to the 64-bit vector processing array, and the 64-bit scalar processing unit and the 64-bit vector processing array are connected to each other;
The address bus of the DSP core is 40 bits wide or more.
The present invention further comprises a JTAG debug interface and a PCIE interface, and the DSP core further comprises an emulation debug unit able to access all memories and registers inside the DSP core that are visible to the programmer; the emulation debug unit is connected through an internal bus to the JTAG debug interface, the PCIE interface and the CPU core unit.
The general-purpose computing digital signal processor of the present invention has the following technical effects:
1. The present invention comprises a CPU core unit, a DSP core unit, a multi-level interconnect structure, an on-chip shared storage array, an off-chip memory interface, a first high-speed input/output interface, a second high-speed input/output interface, an inter-chip direct connection interface and an inter-core synchronization device. The CPU cores and DSP cores are connected through the multi-level interconnect structure to form a tightly coupled heterogeneous multi-core structure and organization, and the multi-level interconnect mechanism gives the CPU cores and DSP cores a tightly coupled mode of cooperation in program control and data processing. On the one hand, a fast control path and a register-level data path are provided between the CPU cores and the DSP cores, efficiently supporting fine-grained, hard real-time control and data interaction (such as task scheduling, start and stop, coroutine jumps between a CPU core and a DSP core, and fast synchronization). On the other hand, coarse-grained data communication and sharing between the CPU cores and the DSP cores is achieved through direct memory access (DMA) channels and shared storage. The present invention therefore achieves close cooperation between the CPU cores and the DSP cores at different levels and combines embedded real-time signal processing with general scientific computing, retaining the DSP's advantages of high real-time computing capability and low power consumption in embedded real-time signal processing while also supporting the precision and generality that general scientific computing requires.
2. The DSP core unit of the present invention comprises at least one DSP core that provides support for 64-bit operations and an operating system micro-kernel, giving a wide-word DSP core structure and organization suited to the precision and address space of scientific computing. With the instruction and data widths of the DSP core at 64 bits or more and the address bus at 40 bits or more, the DSP core can support 64-bit double-precision floating-point and 64-bit fixed-point arithmetic and logic computation, in particular 64-bit double-precision floating-point and fixed-point multiply-add operations. A data path using general-purpose register files and a data bus at least 64 bits wide supports efficient data supply. Support for 64-bit operations substantially improves computational precision and addressing space.
3. The present invention merges the basic features of a DSP with the control requirements of general-purpose computing. The CPU cores are responsible for general transaction management, including storage management, file control, process scheduling and interrupt management tasks, and provide complete support for a general-purpose operating system. The DSP cores provide support for 64-bit operations and for an operating system micro-kernel having only basic functions such as task scheduling and storage management. The DSP cores give moderate support to complex flow-control structures, including support for branch instructions that affect the processing efficiency of the vector array. A multi-level, tailorable scheme of OS support is provided: the on-chip shared storage array gives moderate support to cache coherence mechanisms such as data invalidation and update, and a cache coherence scheme in which software and hardware cooperate is adopted. The present invention can thus support the operating system at multiple levels, markedly improves software and hardware support for task management, file control, process scheduling and interrupt management, and facilitates general task scheduling; the multi-level support scheme also brings flexibility to the hardware implementation.
4. The present invention supports a unified parallel programming method for the GPDSP structure. Compiler directives describe the thread-level parallelism among CPU cores, the task-level parallelism among multiple DSP cores, and the thread-level parallelism and synchronization between CPU cores and DSP cores, and mark the code for the CPU cores and the code for the DSP cores respectively. Within a unified compiler framework, the CPU compiler and the DSP compiler are invoked to compile their respective computation code, and the results are uniformly linked into executable code for the single chip. This achieves unified parallel programming of automatic multithread parallelization and vectorization on the GPDSP, improves the efficiency with which high-level-language programs can exploit parallel resources such as multiple cores and vector operation arrays, and offers good generality, ease of use and a wide range of application.
5. The present invention further comprises a JTAG debug interface and a PCIE interface, and the DSP core further comprises an emulation debug unit able to access all memories and registers inside the DSP core that are visible to the programmer. The emulation debug unit is connected through an internal bus to the JTAG debug interface, the PCIE interface and the CPU core unit. This provides debugging from the local CPU host while retaining the debug structure and design approach of cross emulation debugging from another machine. The host therefore has an accurate, high-speed mechanism for accessing the address resources of the DSP core (such as registers and memories), the running state of the DSP can be captured accurately, and enhanced debugging such as viewing and modifying operating system micro-kernel resources is easy to implement, so that debugging is convenient, fast and efficient.
Brief description of the drawings
Fig. 1 is a schematic diagram of the overall framework structure of an embodiment of the present invention.
Fig. 2 is a schematic diagram of the framework structure of the DSP core in the embodiment of the present invention.
Fig. 3 is a schematic diagram of the debug interfaces of the embodiment of the present invention.
Fig. 4 is a schematic diagram of part of the detailed framework structure of the embodiment of the present invention.
Fig. 5 is a schematic diagram of the unified programming model for automatic multithread parallelization and vectorization in the embodiment of the present invention.
Legend: 1, CPU core unit; 101, JTAG debug interface; 102, PCIE interface; 11, CPU core; 2, DSP core unit; 21, DSP core; 211, 64-bit scalar processing unit; 2111, instruction control unit; 2112, first-level cache; 2113, scalar operation unit; 2114, scalar SCU; 2115, scalar register file; 212, 64-bit vector processing array; 2121, DMA unit; 2122, shared array memory bank; 2123, processing unit; 2124, vector SCU; 2125, vector register file; 2126, vector operation unit; 213, instruction dispatch unit; 214, emulation debug unit; 3, multi-level interconnect structure; 4, on-chip shared storage array; 5, off-chip memory interface; 6, first high-speed input/output interface; 7, second high-speed input/output interface; 8, inter-chip direct connection interface; 9, inter-core synchronization device.
Detailed description of the invention
As shown in Fig. 1, the general-purpose computing digital signal processor of this embodiment comprises:
A CPU core unit 1 comprising a plurality of CPU cores 11, the CPU cores 11 being responsible for general transaction management, including storage management, file control, process scheduling and interrupt management tasks, and providing complete support for a general-purpose operating system. The number of CPU cores 11 in the CPU core unit 1 can be adjusted as required; any number of one or more CPU cores 11 can be implemented.
A DSP core unit 2 comprising a plurality of DSP cores 21 that provide support for 64-bit operations and an operating system micro-kernel. The number of DSP cores 21 in the DSP core unit 2 can be adjusted as required; any number of one or more DSP cores 21 can be implemented.
A multi-level interconnect structure 3 for connecting the CPU cores 11 and the DSP cores 21. Through a "fine-grained control and data path" it provides fast fine-grained communication between the CPU cores 11 and the DSP cores 21 for task scheduling, start/stop and synchronization operations, and through "fast DMA" it implements coarse-grained data communication.
An on-chip shared storage array 4 for providing high-bandwidth data supply to the CPU cores 11 and the DSP cores 21, which further improves the efficiency of coarse-grained data communication. It also gives moderate support to cache coherence by providing atomic operations directly supported by hardware, such as data invalidation and update.
An off-chip memory interface 5 for extending off-chip memory for the CPU cores 11 and the DSP cores 21, making storage expansion more flexible and convenient.
A first high-speed input/output interface 6 for external data exchange of the CPU cores 11.
A second high-speed input/output interface 7 for external data exchange of the DSP cores 21.
An inter-chip direct connection interface 8 for supporting direct chip-to-chip connection between general-purpose computing digital signal processors.
An inter-core synchronization device 9 for providing a hardware synchronization mechanism among the DSP cores to support multi-core synchronization operations. A DSP core 21 uses the inter-core synchronization device 9 to make the other DSP cores 21 that need its new data wait; that DSP core 21 produces the new data and writes it back to the on-chip shared storage array 4 through the data write-back mechanism; that DSP core 21 then uses the inter-core synchronization device 9 to let the other DSP cores 21 that need its new data continue running; and those other DSP cores 21 use the data invalidation mechanism to ensure that they hold no stale data and read the new data from the on-chip shared storage array 4.
The CPU cores 11 are connected to the DSP cores 21 through the multi-level interconnect structure 3; the CPU cores 11 and the DSP cores 21 are each connected to the on-chip shared storage array 4; the on-chip shared storage array 4 is connected to the off-chip memory interface 5; the CPU cores 11 are connected to the first high-speed input/output interface 6; the DSP cores 21 are connected to the second high-speed input/output interface 7 and to the inter-chip direct connection interface 8; and the inter-core synchronization device 9 is connected to each DSP core 21.
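The wait, write-back, release and invalidate sequence described above for the inter-core synchronization device 9 can be modelled in software. The following Python sketch is an illustration under stated assumptions, not the patented hardware: the class names `SharedStorageArray` and `DspCore` are hypothetical, each core's private cache is a dictionary, and a `threading.Event` stands in for the synchronization device.

```python
import threading


class SharedStorageArray:
    """Stands in for the on-chip shared storage array 4."""

    def __init__(self, contents):
        self.mem = dict(contents)


class DspCore:
    """A DSP core 21 with a private write-back cache (hypothetical model)."""

    def __init__(self, shared):
        self.shared = shared
        self.cache = {}

    def load(self, addr):
        if addr not in self.cache:           # miss: fetch from shared storage
            self.cache[addr] = self.shared.mem[addr]
        return self.cache[addr]

    def store(self, addr, value):
        self.cache[addr] = value             # only the private copy changes

    def write_back(self, addr):
        self.shared.mem[addr] = self.cache[addr]

    def invalidate(self, addr):
        self.cache.pop(addr, None)


def run_demo(invalidate_first=True):
    shared = SharedStorageArray({0: "old"})
    producer, consumer = DspCore(shared), DspCore(shared)
    sync = threading.Event()                 # unset event: consumers must wait
    consumer.load(0)                         # consumer now caches the old value
    seen = []

    def consumer_task():
        sync.wait()                          # held until the producer releases it
        if invalidate_first:
            consumer.invalidate(0)           # drop the stale cached copy
        seen.append(consumer.load(0))        # re-read from shared storage

    worker = threading.Thread(target=consumer_task)
    worker.start()
    producer.store(0, "new")                 # produce the new data
    producer.write_back(0)                   # write it back to shared storage
    sync.set()                               # let the waiting core continue
    worker.join()
    return seen[0]


print(run_demo(True))    # consumer invalidates first and sees the new data
print(run_demo(False))   # without invalidation the stale cached copy is read
```

Skipping the invalidation step leaves the consumer reading its old cached copy, which is why the sequence pairs the release with a data invalidation on the consuming cores.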
As shown in Fig. 2, the DSP core 21 comprises:
A 64-bit scalar processing unit 211 for supporting the operating system micro-kernel and, as the main control unit, for executing the scalar program, for efficient communication with the CPU cores 11 and execution control of the 64-bit vector processing array 212, and for the operations common to the 64-bit vector processing array 212 (including configuring and modifying common variables and the unified flow-control operations of the processing array).
A 64-bit vector processing array 212 for handling the compute-intensive tasks of an application.
An instruction dispatch unit 213 for dispatching instructions to the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212.
The instruction dispatch unit 213 is connected to the 64-bit scalar processing unit 211 and to the 64-bit vector processing array 212, and the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212 are connected to each other.
Through the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212, the DSP core 21 of this embodiment supports 64-bit operations, and in particular gives efficient support to double-precision floating-point and fixed-point operations. Its 64-bit scalar processing unit 211 supports an operating system micro-kernel that provides only basic functions such as process scheduling and storage management, and its 64-bit vector processing array 212 handles the compute-intensive tasks of an application. During computation, the instruction dispatch unit 213 dispatches scalar instructions to the 64-bit scalar processing unit 211 and vector instructions to the 64-bit vector processing array 212. The 64-bit scalar processing unit 211 and the 64-bit vector processing array 212 cooperate to give moderate support to the complex flow-control operations that affect the execution efficiency of the vector array; these flow-control operations include branch structures inside loop bodies and while-style loops whose iteration count is not known in advance.
As shown in Fig. 3, this embodiment further comprises a JTAG debug interface 101 and a PCIE interface 102, and the DSP core 21 further comprises an emulation debug unit 214 able to access all memories and registers inside the DSP core 21 that are visible to the programmer. The emulation debug unit 214 is connected through an internal bus to the JTAG debug interface 101, the PCIE interface 102 and the CPU core unit 1. The emulation debug unit 214 can capture the running state of the DSP core unit 2 accurately and provides enhanced debug functions such as viewing and modifying operating system micro-kernel resources; the same memories and registers are also accessed by the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212. Because the emulation debug unit 214 is connected through the internal bus to the JTAG debug interface 101, the PCIE interface 102 and the CPU core unit 1, this embodiment supports three debug modes: JTAG debug mode, PCIE debug mode and CPU debug mode. The emulation debug unit 214 accepts debug requests from the JTAG debug interface 101 and accesses the memories or registers of this embodiment accordingly. The JTAG debug interface 101 is the emulation interface commonly supported by conventional DSPs and lets an external host access the internal memories and registers of the DSP. Because the JTAG protocol used by the JTAG debug interface 101 is serial and its clock frequency is typically below 100 MHz, access through it is slow. This embodiment also supports, over the internal bus, debug requests from the local CPU core unit 1 or from an external host through the PCIE interface 102. Because the internal bus uses a parallel transfer protocol and runs at a frequency far higher than the interface frequency of the JTAG debug interface 101, it supports high-speed debug access. Through this high-speed debug access, a fast control path and a register-level data path exist between the CPU cores 11 and the DSP cores 21, which gives efficient support to fine-grained, hard real-time interactions such as task scheduling, start/stop, coroutine jumps between a CPU core 11 and a DSP core 21, and fast synchronization.
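The speed gap between the serial JTAG path and the parallel internal bus can be put in rough numbers. The sketch below is back-of-the-envelope arithmetic only: the 100 MHz JTAG clock is the upper bound stated above, a serial link is taken to move at most one bit per clock, protocol overhead is ignored, and the 64-bit, 1 GHz internal bus is an assumed figure that does not come from the source.

```python
def peak_bytes_per_second(width_bits: int, clock_hz: float) -> float:
    """Upper bound on throughput: width_bits transferred per clock cycle."""
    return width_bits * clock_hz / 8


def transfer_seconds(n_bytes: int, width_bits: int, clock_hz: float) -> float:
    """Lower bound on the time needed to move n_bytes over the link."""
    return n_bytes / peak_bytes_per_second(width_bits, clock_hz)


# Serial JTAG: at most one bit per clock, clock at most 100 MHz (from the text).
JTAG = {"width_bits": 1, "clock_hz": 100e6}
# Internal parallel bus: 64 bits at 1 GHz is an ASSUMED example, not a source value.
BUS = {"width_bits": 64, "clock_hz": 1e9}

one_mib = 1 << 20
print(transfer_seconds(one_mib, **JTAG))   # seconds to dump 1 MiB over JTAG
print(transfer_seconds(one_mib, **BUS))    # seconds over the assumed bus
print(peak_bytes_per_second(**BUS) / peak_bytes_per_second(**JTAG))
```

Under these assumptions the parallel path is hundreds of times faster, which is consistent with the qualitative claim that the internal bus enables high-speed debug access.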
This embodiment merges the basic features of a DSP with the control requirements of general-purpose computing. The CPU cores 11 provide complete support for the OS. The DSP core 21 uses a very long instruction word structure and comprises the 64-bit scalar processing unit 211 and the 64-bit vector processing array 212; the 64-bit scalar processing unit 211 supports an OS micro-kernel having only basic functions such as task scheduling and storage management. As shown in Fig. 4, the 64-bit scalar processing unit 211 internally comprises an instruction control unit 2111, a first-level cache 2112, a scalar operation unit 2113, a scalar SCU 2114 and a scalar register file 2115. The instruction control unit 2111 is connected to the instruction dispatch unit 213, the scalar register file 2115 and the first-level cache 2112; the first-level cache 2112 is connected to the on-chip shared storage array 4; the scalar SCU 2114 is connected to the first-level cache 2112 and the scalar register file 2115; and the scalar operation unit 2113 is connected to the scalar register file 2115. The instruction control unit 2111 fetches instructions from the first-level cache 2112 and sends them to the instruction dispatch unit 213, which in turn sends each instruction to the 64-bit scalar processing unit 211 or the 64-bit vector processing array 212 for processing. The 64-bit vector processing array 212 comprises a DMA unit 2121, a shared array memory bank 2122 and a plurality of homogeneous or heterogeneous processing units 2123. The processing units 2123 process data cooperatively and communicate with one another through the shared array memory bank 2122, which is connected to the on-chip shared storage array 4 through the DMA unit 2121. Each processing unit 2123 handles the compute-intensive tasks of an application and comprises a vector SCU 2124, a vector register file 2125 and a vector operation unit 2126 connected in sequence. It should be noted that this embodiment may also use a 64-bit scalar processing unit 211 and a 64-bit vector processing array 212 of other structures as required, which are not described further here.
In this embodiment the DSP core 21 gives moderate support to complex flow-control structures, reducing branch overhead through two mechanisms. The first mechanism is conditional execution. A conditionally executed instruction decides whether to execute according to the state of a register in a register file (the scalar register file 2115 or a vector register file 2125); if it does not execute, the instruction becomes a no-op. The value of the branch condition is placed in a register, and all instructions in the basic block can then be conditional instructions, so no branch instruction is needed and the overhead of branching is avoided. In addition, branch delay slots are used: instructions of the basic block preceding a branch are scheduled into the branch delay slots, and the hardware guarantees that the instructions in the delay slots are executed after the branch instruction is executed. This reduces the overhead of branch instructions. The second mechanism is a branch instruction based on the state of the vector processing array: this instruction decides whether to jump according to the value of the same-numbered register in each vector register file 2125 of the 64-bit vector processing array 212.
Data coherence among the DSP cores 21 is maintained cooperatively by software and hardware. Hardware provides the data write-back mechanism of the first-level cache 2112 and the inter-core synchronization device 9; software implements inter-core data coherence by operating the first-level cache 2112 and the inter-core synchronization device 9. The first-level cache 2112 of the DSP core 21 also provides a data invalidation mechanism that can invalidate all or part of the data in the first-level cache 2112, and the inter-core synchronization device 9 synchronizes the multiple DSP cores 21. The detailed steps for maintaining inter-core data coherence in this embodiment are as follows: (1) a DSP core 21 uses the inter-core synchronization device 9 to make the other DSP cores 21 that need its new data wait; (2) that DSP core 21 produces the new data and writes it back to the on-chip shared storage array 4 through the data write-back mechanism provided by the first-level cache 2112; (3) that DSP core 21 uses the inter-core synchronization device 9 to let the other DSP cores 21 that need its new data continue running; (4) the other cores that need the new data of that DSP core 21 use the data invalidation mechanism of the first-level cache 2112 to ensure that they hold no stale data, and read the new data from the on-chip shared storage array 4.
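The two branch-overhead mechanisms described above for the DSP core 21 can be sketched in a few lines of Python. This is a behavioural illustration with hypothetical names (`predicated`, `vector_branch_taken`); registers are modelled as dictionary entries. The source does not say how the same-numbered registers of the processing units are combined, so the rule used here (branch taken if the register is non-zero in any processing unit) is an assumption.

```python
def predicated(regs, pred, dest, op, *srcs):
    """Conditional execution: run op only if predicate register pred is true,
    otherwise behave as a no-op and leave dest unchanged."""
    if regs[pred]:
        regs[dest] = op(*(regs[s] for s in srcs))


def abs_diff(a, b):
    """|a - b| with both arms of the if/else predicated, so no branch is needed."""
    regs = {"a": a, "b": b, "x": 0, "p": a > b, "np": not (a > b)}
    predicated(regs, "p", "x", lambda u, v: u - v, "a", "b")    # then-arm
    predicated(regs, "np", "x", lambda u, v: v - u, "a", "b")   # else-arm
    return regs["x"]


def vector_branch_taken(vector_register_files, reg_index):
    """Branch on vector array state. ASSUMED rule: taken if the same-numbered
    register is non-zero in at least one processing unit."""
    return any(vrf[reg_index] != 0 for vrf in vector_register_files)


def halve_until_one(values):
    """A while-style loop with an unknown trip count, one lane per processing unit.
    Register 0 holds the data, register 1 holds the lane's 'still active' flag."""
    vrfs = [{0: v, 1: int(v > 1)} for v in values]
    steps = 0
    while vector_branch_taken(vrfs, 1):      # loop-back branch on array state
        for vrf in vrfs:
            if vrf[1]:                       # finished lanes execute a no-op
                vrf[0] //= 2
                vrf[1] = int(vrf[0] > 1)
        steps += 1
    return [vrf[0] for vrf in vrfs], steps


print(abs_diff(7, 3), abs_diff(3, 7))
print(halve_until_one([8, 3, 1]))
```

In the loop example the lanes finish after different numbers of iterations; predication keeps finished lanes idle while the array-state branch keeps the loop running until every lane is done.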
The hardware structure of this embodiment supports a unified parallel programming method for the GPDSP architecture. The unified parallel programming method of this embodiment builds on the standard OpenMP parallel programming method and extends it with a number of DSP compiler directives, so that automatic multithreaded parallelization and vectorization are programmed in a unified way on the CPU+DSP heterogeneous multi-core processor. Compiler directives describe the thread-level parallelism among the CPU cores 11, the task-level parallelism among the multiple DSP cores 21, and the thread-level parallelism and synchronization between the CPU cores 11 and the DSP cores 21, and they mark the code intended for the CPU cores 11 and for the DSP cores 21 respectively. Within a unified compiler framework, the CPU compiler and the DSP compiler are invoked separately to compile their respective compute code, and the results are linked together into a single-chip executable, realizing unified parallel programming of automatic multithreaded parallelization and vectorization. As shown in Fig. 5, the detailed steps of the unified parallel programming method supported by the general-purpose computing digital signal processor of this embodiment are as follows:
1) The programmer writes the application in a standard programming language and syntax (for example, standard C/C++), inserting OpenMP compiler directives before the statement blocks that are to be executed in parallel by multiple threads on the CPU core unit, and inserting DSP compiler directives before the statement blocks whose computation is to be performed by the DSP core unit;
2) When the application is compiled, the OpenMP compiler directives guide the CPU compiler to perform automatic multithreaded parallelization, and the DSP compiler directives guide the DSP compiler to generate vector code for the DSP cores;
3) The CPU-side compiler tools uniformly compile and link the CPU-side object code and the DSP-side object code, and finally output executable code that can be run on the general-purpose computing digital signal processor.
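Steps 1) to 3) can be pictured with a toy model of how directives route statements to the two compilers. This is a hedged sketch only: the description does not give the syntax of the DSP compiler directives, so the `#pragma dsp vector` spelling, the sample source text, and the helper names `partition` and `build` are assumptions made for illustration, and each directive is assumed to cover only the single statement that follows it.

```python
# Hypothetical annotated C source; "#pragma dsp vector" is an invented spelling.
SOURCE = """\
#pragma omp parallel for
for (i = 0; i < n; i++) a[i] = load(i);
#pragma dsp vector
for (i = 0; i < n; i++) c[i] = a[i] * b[i];
return 0;
"""


def partition(source):
    """Tag each statement with the target its preceding directive selects."""
    regions = []
    target = "cpu-serial"                  # default when no directive precedes
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#pragma omp"):
            target = "cpu-openmp"          # CPU compiler parallelizes across threads
        elif stripped.startswith("#pragma dsp"):
            target = "dsp-vector"          # DSP compiler emits vector code
        elif stripped:
            regions.append((target, stripped))
            target = "cpu-serial"          # a directive covers one statement here
    return regions


def build(source):
    """Model step 3: collect per-side 'object code' and link it into one image."""
    objects = {"cpu": [], "dsp": []}
    for target, statement in partition(source):
        side = "dsp" if target.startswith("dsp") else "cpu"
        objects[side].append(statement)
    # The single returned dict stands in for the single-chip executable.
    return {"cpu_object": objects["cpu"], "dsp_object": objects["dsp"]}
```

Calling `build(SOURCE)` places the OpenMP-annotated loop and the trailing statement on the CPU side and the loop marked with the hypothetical DSP directive on the DSP side; a real toolchain would compile each side with its own compiler before linking.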
Therefore, for any given general scientific computing task, carrying out unified parallel programming according to steps 1) to 3) above yields the corresponding executable code, which can be run directly on the general-purpose computing digital signal processor. The CPU core unit 1 and the DSP core unit 2 of the general-purpose computing digital signal processor of this embodiment can thus be used to support general scientific computing efficiently while retaining the basic characteristics of an embedded DSP and its advantages of high performance and low power consumption. The unified parallel programming method supported by this embodiment uses compiler directives to describe the thread-level parallelism among the CPU cores, the task-level parallelism among the multiple DSP cores, and the thread-level parallelism and synchronization between the CPU cores and the DSP cores, and to mark the code for the CPU cores and for the DSP cores respectively. Within a unified compiler framework, the CPU compiler and the DSP compiler are invoked separately to compile their respective compute code, and the results are linked together into a single-chip executable. This realizes unified parallel programming of automatic multithreaded parallelization and vectorization on the GPDSP, helps improve the efficiency of developing in high-level languages for parallel resources such as multiple cores and vector operation arrays in particular, and offers good generality, ease of use and a wide range of applications.
The above is only a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiment; all technical solutions falling under the concept of the present invention belong to its scope of protection. It should be noted that, for those of ordinary skill in the art, improvements and modifications made without departing from the principles of the present invention shall also be regarded as falling within the scope of protection of the present invention.

Claims (3)

1. A general-purpose computing digital signal processor, characterized by comprising:
a CPU core unit (1), comprising at least one CPU core (11), the CPU core (11) being responsible for general transaction management, including storage management, file control, process scheduling and interrupt management tasks, and providing complete support for a general-purpose operating system;
a DSP core unit (2), comprising at least one DSP core (21) that provides support for 64-bit operations and for an operating system microkernel;
a multi-level interconnect structure (3), for connecting the CPU core (11) and the DSP core (21), providing fast fine-grained communication between the CPU core (11) and the DSP core (21) for task scheduling, start/stop and synchronization operations, and realizing coarse-grained data communication by high-speed DMA;
an on-chip shared storage array (4), for providing high-bandwidth data supply to the CPU core (11) and the DSP core (21), and providing atomic operations directly supported by hardware, including data invalidation and update, to support cache coherence operations;
an off-chip memory interface (5), for extending off-chip storage for the CPU core (11) and the DSP core (21);
a first high-speed input/output interface (6), for external data exchange of the CPU core (11);
a second high-speed input/output interface (7), for external data exchange of the DSP core (21);
an inter-chip direct-connect interface (8), for supporting direct chip-to-chip connection between general-purpose computing digital signal processors;
an inter-core synchronization device (9), for providing a hardware synchronization mechanism among the DSP cores to support multi-core synchronization operations; wherein a DSP core (21) uses the inter-core synchronization device (9) to make the other DSP cores (21) that need its new data wait; this DSP core (21) produces the new data and writes it back to the on-chip shared storage array (4) through the data write-back mechanism; this DSP core (21) uses the inter-core synchronization device (9) to let the other DSP cores (21) that need its new data continue running; and the other DSP cores (21) that need the new data use the data invalidation mechanism to ensure that they hold no stale data and read the new data from the on-chip shared storage array (4);
the CPU core (11) is connected to the DSP core (21) through the multi-level interconnect structure (3); the CPU core (11) and the DSP core (21) are each connected to the on-chip shared storage array (4); the on-chip shared storage array (4) is connected to the off-chip memory interface (5); the CPU core (11) is connected to the first high-speed input/output interface (6); the DSP core (21) is connected to the second high-speed input/output interface (7) and to the inter-chip direct-connect interface (8) respectively; and the inter-core synchronization device (9) is connected to each DSP core (21) respectively.
2. The general-purpose computing digital signal processor according to claim 1, characterized in that the DSP core (21) comprises:
a 64-bit scalar processing unit (211), for supporting the operating system microkernel and, as the main control unit, being responsible for executing scalar programs, for communication with the CPU core (11), for controlling the execution of the 64-bit vector processing array (212), and for the identical operations shared across the 64-bit vector processing array (212);
a 64-bit vector processing array (212), for supporting the computation of compute-intensive tasks in applications;
an instruction dispatch unit (213), for dispatching instructions to the 64-bit scalar processing unit (211) and the 64-bit vector processing array (212);
the instruction dispatch unit (213) is connected to the 64-bit scalar processing unit (211) and to the 64-bit vector processing array (212) respectively, and the 64-bit scalar processing unit (211) and the 64-bit vector processing array (212) are connected to each other;
the address bus width of the DSP core (21) is greater than 40 bits.
3. The general-purpose computing digital signal processor according to claim 1 or 2, characterized by further comprising a JTAG debug interface (101) and a PCIE interface (102), wherein the DSP core (21) further comprises an emulation debugging unit (214) capable of accessing all programmer-visible memories and registers inside the DSP core (21), and the emulation debugging unit (214) is connected to the JTAG debug interface (101), the PCIE interface (102) and the CPU core unit (1) respectively through an internal bus.
CN201310725118.6A 2013-12-25 2013-12-25 universal computing digital signal processor Active CN103714039B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310725118.6A CN103714039B (en) 2013-12-25 2013-12-25 universal computing digital signal processor

Publications (2)

Publication Number Publication Date
CN103714039A CN103714039A (en) 2014-04-09
CN103714039B true CN103714039B (en) 2017-01-11

Family

ID=50407032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310725118.6A Active CN103714039B (en) 2013-12-25 2013-12-25 universal computing digital signal processor

Country Status (1)

Country Link
CN (1) CN103714039B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant