CN117608532A - OpenMP implementation method based on domestic multi-core DSP - Google Patents

Publication number
CN117608532A
Authority
CN
China
Prior art keywords
core
openmp
implementation method
shared memory
semaphore
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311587658.2A
Other languages
Chinese (zh)
Inventor
侯旋
曾令将
万勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Shipbuilding Lingjiu Electronics Wuhan Co ltd
Original Assignee
China Shipbuilding Lingjiu Electronics Wuhan Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Shipbuilding Lingjiu Electronics Wuhan Co ltd
Priority to CN202311587658.2A
Publication of CN117608532A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/20 Software design
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/545 Interprogram communication where tasks reside in different layers, e.g. user- and kernel-space

Abstract

The invention discloses an OpenMP implementation method based on a domestic multi-core DSP, built on a layered architecture comprising a hardware layer, a system layer, a program layer and a user layer. The system layer comprises a Bootloader, a real-time operating system and an OpenMP adaptation layer; the OpenMP adaptation layer provides thread creation, thread synchronization and mutual exclusion, shared memory management and multi-core task distribution. Thread creation, shared memory management and multi-core task distribution are implemented as wrappers around the POSIX standard interfaces provided by the operating system; thread synchronization and mutual exclusion are implemented with hardware semaphores and shared memory. The OpenMP adaptation layer supplies the system-layer functionality on which the OpenMP interface depends. The invention achieves synchronization and mutual exclusion between operating system threads in AMP mode, and allows multiple cores to run the same OS image in AMP mode.

Description

OpenMP implementation method based on domestic multi-core DSP
Technical Field
The invention relates to an OpenMP implementation method, in particular to an OpenMP implementation method based on a domestic multi-core DSP, and belongs to the technical field of digital signal processing.
Background
In the field of parallel computing, OpenMP is one of the most popular programming models. It provides a portable and scalable model for developers of shared-memory parallel applications, and offers a simple way to write multithreaded programs without requiring the programmer to handle thread creation, synchronization, load balancing and destruction explicitly.
On general-purpose CPUs, OpenMP is usually implemented by calling the open-source libgomp library provided by GNU. Its basic principle relies on multithreaded scheduling by an operating system in SMP mode (all cores run the same OS image, share resources, and have their tasks scheduled uniformly), with the created threads distributed to the cores through an affinity-setting interface. However, because of limitations of its Cache, a DSP processor usually runs in AMP mode (each core runs an independent OS image, the cores are isolated from one another, and tasks are scheduled independently), so SMP scheduling by the operating system is not available. The OpenMP implementation used on general-purpose CPUs therefore cannot be applied to a DSP.
In addition, in the prior art, TI provides an OpenMP implementation for multi-core DSPs based on the SYS/BIOS operating system. That method is tightly coupled to SYS/BIOS, and its inter-core communication is built on the Queue Manager Subsystem (QMSS) provided by the DSP, a hardware module responsible for accelerated management of packet queues. Because the domestic multi-core DSP processor (Feiteng) has no QMSS module, the OpenMP scheme provided by TI cannot be implemented on it.
In summary, no OpenMP implementation method currently exists for domestic multi-core DSP processors that is general, portable, and compatible with different operating systems.
Disclosure of Invention
The invention aims to provide an OpenMP implementation method based on a domestic multi-core DSP. It addresses the problem that, because of the Cache limitation and the lack of a QMSS module, a domestic multi-core DSP processor can support neither the OpenMP implementation used on mainstream CPUs nor the existing one for DSPs, and therefore cannot meet the needs of developing applications based on the OpenMP programming model in the field of parallel computing.
The invention achieves the above purpose through the following technical scheme: an OpenMP implementation method based on a domestic multi-core DSP, implemented using a layered architecture. With the layered design, the user application program is decoupled from the underlying system software through an OpenMP adaptation layer, which reduces the dependence of the application program on the hardware platform and the operating system and gives the method better portability and extensibility. The layered architecture comprises:
a hardware layer, comprising a domestic multi-core DSP processor whose cores all share external storage;
a system layer, comprising a Bootloader, a real-time operating system and an OpenMP adaptation layer;
a program layer, comprising precompiled instructions, a compiler, an OpenMP interface and environment variables;
a user layer, consisting of application programs developed with the OpenMP parallel programming model;
the OpenMP adaptation layer comprises thread creation, thread synchronization and mutual exclusion, shared memory management and multi-core task distribution; thread creation, shared memory management and multi-core task distribution are implemented as wrappers around the POSIX standard interfaces provided by the operating system; thread synchronization and mutual exclusion are implemented with hardware semaphores and shared memory;
and the OpenMP adaptation layer supplies the system-layer functionality on which the OpenMP interface depends.
As a further technical scheme of the invention: the implementation method of the OpenMP adaptation layer comprises the following steps:
1) Synchronization and mutual exclusion between operating system threads in AMP mode are implemented with a hardware semaphore and shared memory. The hardware semaphore provides mutually exclusive access to the multi-core shared memory, and a variable defined in the shared memory holds the value of the semaphore: the value is incremented by 1 when the semaphore is released and decremented by 1 when it is acquired. If the value is 0 on acquisition, the current thread calls the sched_yield function to give up the CPU, and repeats this until the value is no longer 0;
2) Multi-core task distribution for the operating system in AMP mode is implemented with a master-slave mode combined with an inter-core communication module, and the multiple cores run the same OS image in AMP mode.
As a further technical scheme of the invention: the method for realizing the multi-core task distribution comprises the following steps:
one core of the domestic multi-core DSP processor is set as the master core and the other cores are set as slave cores; after the OS image has been loaded, every core starts the OS kernel and completes initialization of the inter-core communication module.
As a further technical scheme of the invention: the inter-core communication module is implemented with shared memory and inter-core IPC interrupts, and the implementation method comprises the following steps:
1) A region of the shared memory is allocated to each core and organized as a circular message queue, and the message queues are maintained through a mapping table. The sender uses the target ID of the send operation to look up the receiver's message queue, copies the message into it, and at the same time sends an IPC interrupt to notify the receiver;
2) The receiver blocks, waiting for the IPC interrupt to release a semaphore. After obtaining the semaphore, it queries the message queue, verifies that the queue has been updated, and takes out the message, completing one inter-core communication.
As a further technical scheme of the invention: the method for realizing the master-slave mode comprises the following steps:
1) The master core creates a user program thread, which performs critical-section protection configuration initialization, OpenMP environment variable initialization and workgroup initialization, and calls a multi-core concurrency control function to package and send the multi-core concurrent task information, thereby completing multi-core task distribution;
2) Each slave core creates a task receiving thread, which loops, blocking to receive messages sent by the master core. After a message is received, a task execution thread is created; it first parses the parallel block function entry and the shared work area information from the message, then jumps to the parallel block function entry to execute the parallel task, and performs multi-core synchronization after the task is finished, completing execution of the concurrent task.
As a further technical scheme of the invention:
the implementation method for running the same OS image on multiple cores in AMP mode comprises the following steps:
1) Selecting a reserved address interval from the 32-bit logical address space of the domestic multi-core DSP processor;
2) Mapping that address interval to a different physical address for each core in the Bootloader.
The beneficial effects of the invention are as follows:
1) The method adopts a layered architecture and, on the domestic Feiteng multi-core DSP processor, implements an OpenMP parallel task distribution mechanism in which the multiple cores run the same OS image, using techniques such as hardware semaphores, shared memory and multi-core memory address mapping;
2) It provides an OpenMP implementation method that is general, highly portable and compatible with various operating systems for the development of parallel programming applications on the domestic Feiteng multi-core DSP processor.
Drawings
FIG. 1 is a diagram of an OpenMP architecture of the present invention;
FIG. 2 is a flow chart of the AMP-mode semaphore implementation of the invention;
FIG. 3 is a flow chart of the multi-core task distribution of the present invention;
FIG. 4 is a flow chart of inter-core communication according to the present invention;
FIG. 5 is a flow chart of a user program thread of the present invention;
FIG. 6 is a flow chart of a multi-core concurrency control function of the present invention;
FIG. 7 is a flow chart of a task execution thread process according to the present invention.
FIG. 8 is a diagram of the core operation logical address remapping according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment 1: FIG. 1 is the OpenMP architecture diagram for the domestic multi-core DSP. To improve the generality and portability of OpenMP, dependence on specific functional modules of the domestic multi-core DSP processor or on specific components of the operating system should be reduced as far as possible, so the OpenMP implementation needs to be divided into layers.
OpenMP adopts a layered architecture of four layers: in order, a user layer, a program layer, a system layer and a hardware layer. The user layer, also called the application layer, consists of application programs developed with the OpenMP parallel programming model. The program layer comprises directives (precompiled instructions), a compiler, the OpenMP interface and environment variables, and is realized by the compilation toolchain and the open-source libgomp library. The system layer comprises a Bootloader, a real-time operating system and the OpenMP adaptation layer. The hardware layer is the domestic multi-core DSP processor, all of whose cores share external storage.
Embodiment 2: as shown in FIGS. 1 to 3, the purpose of the OpenMP adaptation layer is to supply the functionality of the underlying layer (the system layer) on which the OpenMP interface depends. The OpenMP adaptation layer comprises thread creation, thread synchronization and mutual exclusion, shared memory management and multi-core task distribution. Thread creation, shared memory management and multi-core task distribution are implemented directly as wrappers around the POSIX standard interfaces provided by the operating system. For thread synchronization and mutual exclusion, because the operating system runs in AMP mode, its semaphores and mutexes act only on the threads of the local core and cannot synchronize or exclude threads on different cores; synchronization and mutual exclusion between threads on different cores are therefore implemented with hardware semaphores and shared memory.
Synchronization and mutual exclusion between operating system threads in AMP mode are implemented with hardware semaphores and shared memory. A hardware semaphore of the domestic multi-core DSP processor can be owned by only one core at a time; the other cores determine whether it is free by querying its state, and if it is occupied they keep waiting until it becomes free. In this way the hardware semaphore protects critical sections (shared memory and the like).
The interface functions comprise semaphore initialization, semaphore acquisition and semaphore release. Specifically, a variable defined in the shared memory holds the value of the semaphore; it is incremented by 1 when the semaphore is released and decremented by 1 when the semaphore is acquired. If the value is 0 on acquisition, the current thread calls the sched_yield function to give up the CPU, and repeats this until the value is no longer 0. The specific implementation is shown in FIG. 2.
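The three interface functions described above can be sketched in C as follows. This is a minimal illustration, not the patented implementation: the names amp_sem_t, amp_sem_init, amp_sem_wait, amp_sem_trywait and amp_sem_post are hypothetical, the non-blocking amp_sem_trywait is an addition for demonstration, and the hardware semaphore is replaced by a C11 atomic_flag stand-in because the register interface of the DSP is not given in the text.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for one hardware semaphore. On the real DSP this would be a
 * hardware resource that only one core can own at a time (assumption). */
static atomic_flag hw_sem = ATOMIC_FLAG_INIT;

/* Poll until the hardware semaphore is free, then own it. */
static void hw_sem_lock(void)   { while (atomic_flag_test_and_set(&hw_sem)) { } }
static void hw_sem_unlock(void) { atomic_flag_clear(&hw_sem); }

/* Semaphore value; in the described scheme it lives in multi-core shared memory. */
typedef struct { volatile int32_t value; } amp_sem_t;

/* Semaphore initialization. */
void amp_sem_init(amp_sem_t *s, int32_t initial) {
    hw_sem_lock();
    s->value = initial;
    hw_sem_unlock();
}

/* Semaphore release: add 1 under protection of the hardware semaphore. */
void amp_sem_post(amp_sem_t *s) {
    hw_sem_lock();
    s->value += 1;
    hw_sem_unlock();
}

/* Single acquisition attempt: returns 1 and subtracts 1 if the value is
 * positive, otherwise returns 0 and leaves the value unchanged. */
int amp_sem_trywait(amp_sem_t *s) {
    int taken = 0;
    hw_sem_lock();
    if (s->value > 0) {
        s->value -= 1;
        taken = 1;
    }
    hw_sem_unlock();
    return taken;
}

/* Semaphore acquisition: while the value is 0, give up the CPU with
 * sched_yield() and retry until the value is no longer 0. */
void amp_sem_wait(amp_sem_t *s) {
    while (!amp_sem_trywait(s)) {
        sched_yield();
    }
}
```

The hardware semaphore is held only for the few instructions that touch the counter, so a thread that finds the value at 0 yields without holding it and other cores remain free to release the semaphore.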
For multi-core task distribution, a master-slave mode is combined with an inter-core communication module to distribute tasks across cores for the operating system in AMP mode, and the multiple cores run the same OS image in AMP mode.
The implementation method of multi-core task distribution is shown in FIG. 3: 1) one core of the multi-core DSP processor is set as the master core and the other cores are set as slave cores; 2) after the OS image has been loaded, the master core and the slave cores (collectively, all cores) start the OS kernel and complete initialization of the inter-core communication module.
As shown in FIG. 4, the inter-core communication module is implemented with "shared memory + inter-core IPC interrupt". The implementation method is as follows: 1) a region of the shared memory is allocated to each core and organized as a circular message queue, and the message queues are maintained through a mapping table; the sender uses the target ID of the send operation to look up the receiver's message queue, copies the message into it, and at the same time sends an IPC interrupt to notify the receiver; 2) the receiver blocks, waiting for the IPC interrupt to release a semaphore; after obtaining the semaphore, it queries the message queue, verifies that the queue has been updated, and takes out the message, completing one inter-core communication.
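A per-core circular message queue with a mapping table might look like the following C sketch. All names and sizes (NUM_CORES, QUEUE_SLOTS, MSG_BYTES, ipc_send, ipc_recv) are assumptions made for illustration. The IPC interrupt and the semaphore released by its handler are modelled by a simple pending counter, and ipc_recv returns -1 where the real receiver would block.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_CORES   4   /* assumed core count */
#define QUEUE_SLOTS 8   /* assumed queue depth (one slot stays empty) */
#define MSG_BYTES   32  /* assumed fixed message size */

/* One circular message queue, placed in the region reserved for a core. */
typedef struct {
    volatile uint32_t head;                 /* next slot the receiver reads */
    volatile uint32_t tail;                 /* next slot a sender writes    */
    uint8_t slots[QUEUE_SLOTS][MSG_BYTES];
} core_queue_t;

static core_queue_t shared_region[NUM_CORES];    /* stand-in for shared memory      */
static core_queue_t *queue_map[NUM_CORES];       /* mapping table: core ID -> queue */
static volatile uint32_t ipc_pending[NUM_CORES]; /* stand-in for IPC interrupts     */

void ipc_init(void) {
    for (unsigned i = 0; i < NUM_CORES; i++) {
        shared_region[i].head = 0;
        shared_region[i].tail = 0;
        queue_map[i] = &shared_region[i];
        ipc_pending[i] = 0;
    }
}

/* Sender: find the receiver's queue by ID, copy the message, notify. */
int ipc_send(unsigned dest_core, const void *msg, size_t len) {
    if (dest_core >= NUM_CORES || len > MSG_BYTES) return -1;
    core_queue_t *q = queue_map[dest_core];
    uint32_t next = (q->tail + 1) % QUEUE_SLOTS;
    if (next == q->head) return -1;        /* queue full */
    memcpy(q->slots[q->tail], msg, len);
    q->tail = next;
    ipc_pending[dest_core]++;              /* real system: raise an IPC interrupt */
    return 0;
}

/* Receiver: a real receiver would block until the interrupt handler
 * releases a semaphore; this sketch returns -1 instead of blocking. */
int ipc_recv(unsigned self_core, void *out, size_t len) {
    if (self_core >= NUM_CORES || len > MSG_BYTES) return -1;
    if (ipc_pending[self_core] == 0) return -1;   /* nothing signalled yet */
    core_queue_t *q = queue_map[self_core];
    if (q->head == q->tail) return -1;            /* queue was not updated */
    memcpy(out, q->slots[q->head], len);
    q->head = (q->head + 1) % QUEUE_SLOTS;
    ipc_pending[self_core]--;
    return 0;
}
```

If several cores can send to the same queue, the update of tail would additionally need protection, for example by the hardware semaphore described earlier; the sketch omits this.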
Embodiment 3, as shown in FIGS. 5 to 7, includes, in addition to all the technical features of Embodiment 1 and Embodiment 2:
after the inter-core communication module has been configured, all cores (the master core and the slave cores) create a task receiving thread, which receives the tasks distributed by the master core. Once a slave core has entered its task receiving thread, it loops, blocking to receive messages sent by the master core, and creates task execution threads according to the message content. The master core then goes on to create the user program thread.
FIG. 5 is the execution flow chart of the user program thread. In the user program thread it creates, the master core performs, in order, critical-section protection configuration initialization (such as initializing a mutex), OpenMP environment variable initialization and workgroup initialization, and jumps to the user program entry once this is complete.
The user program thread is executed by the master core. When it reaches a concurrency point, which corresponds to a #pragma omp parallel directive in the code, the compiler has already split the code according to the directive; data such as the parallel function block entry and private variables are passed as parameters to the multi-core concurrency control function, which is called to package and send the multi-core concurrent task information, thereby completing multi-core task distribution.
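For orientation, a concurrency point in a user program could look like the short C example below. The function sum_squares is a hypothetical user routine, not taken from the patent. With an OpenMP-enabled compiler the loop body is outlined into a separate parallel block function; without OpenMP support the directive is ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* Hypothetical user-layer routine containing one concurrency point. */
long sum_squares(int n) {
    long total = 0;
    /* Concurrency point: the compiler splits the code here and hands the
     * outlined loop body to the OpenMP runtime for distribution. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 1; i <= n; i++) {
        total += (long)i * i;
    }
    return total;
}
```

The user code contains no reference to cores, messages or semaphores; those concerns sit behind the OpenMP interface, in the adaptation layer.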
FIG. 6 shows the flow of the multi-core concurrency control function. The master core executes the multi-core concurrency control function in the system background: it first creates a workgroup and initializes the shared work area, then packages information such as the parallel block function entry, the thread attributes and the shared work area into a message, and finally sends the message in turn to each core participating in parallel processing. Note that the cores to which messages are sent in turn include both the slave cores and the master core, since the master core also participates in parallel processing.
After the cores participating in parallel processing receive the messages sent in turn, each creates a task execution thread. FIG. 7 shows the processing flow of the task execution thread: it first parses information such as the parallel block function entry and the shared work area from the message, then jumps to the parallel block function entry to execute the parallel task, performs multi-core synchronization after the task is finished, and waits for all cores to finish before continuing with the subsequent flow.
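The distribution and execution flow can be approximated on a host machine with POSIX threads standing in for cores. In this sketch the names task_msg_t, mc_parallel, sum_body and demo_parallel_sum are hypothetical, the message is passed directly as the thread argument instead of through an inter-core queue, and pthread_join stands in for the multi-core synchronization step.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_CORES 4   /* assumed number of participating cores */

typedef void (*parallel_entry_t)(void *shared, unsigned core_id);

/* Task message: parallel block function entry plus shared work area. */
typedef struct {
    parallel_entry_t entry;
    void *shared;
    unsigned core_id;
} task_msg_t;

/* Task execution thread: parse the message, then jump to the entry. */
static void *task_exec_thread(void *arg) {
    const task_msg_t *msg = arg;
    msg->entry(msg->shared, msg->core_id);
    return NULL;
}

/* Hypothetical multi-core concurrency control function: package one message
 * per participating core (master included), dispatch, then synchronize. */
int mc_parallel(parallel_entry_t entry, void *shared) {
    pthread_t tid[NUM_CORES];
    task_msg_t msg[NUM_CORES];
    unsigned started = 0;
    int rc = 0;
    for (unsigned i = 0; i < NUM_CORES; i++) {
        msg[i].entry = entry;
        msg[i].shared = shared;
        msg[i].core_id = i;
        if (pthread_create(&tid[i], NULL, task_exec_thread, &msg[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (unsigned i = 0; i < started; i++) {
        pthread_join(tid[i], NULL);   /* wait for every core to finish */
    }
    return rc;
}

/* Example shared work area and outlined parallel block. */
typedef struct { const int *data; int n; long partial[NUM_CORES]; } sum_area_t;

static void sum_body(void *shared, unsigned core_id) {
    sum_area_t *a = shared;
    int chunk = (a->n + NUM_CORES - 1) / NUM_CORES;
    int lo = (int)core_id * chunk;
    int hi = (lo + chunk > a->n) ? a->n : lo + chunk;
    long s = 0;
    for (int i = lo; i < hi; i++) s += a->data[i];
    a->partial[core_id] = s;   /* each core writes only its own slot */
}

long demo_parallel_sum(void) {
    static const int data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    sum_area_t area = { data, 10, {0} };
    if (mc_parallel(sum_body, &area) != 0) return -1;
    long total = 0;
    for (unsigned i = 0; i < NUM_CORES; i++) total += area.partial[i];
    return total;
}
```

Sending a function entry address in a message is only meaningful if that address refers to the same code on every core, which is presumably one reason the scheme has all cores run the same OS image at the same logical addresses.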
To allow multiple cores to run the same OS image in AMP mode, the Bootloader needs to be modified. The Bootloader carries each core's OS image to the corresponding run address space; if all cores ran the same OS image without further measures, their run addresses would conflict. The logical address is therefore remapped before the Bootloader carries the OS image, as shown in FIG. 8. The specific implementation process is as follows:
1) By default, the mapping between logical and physical addresses of the DDR on the domestic multi-core DSP processor is that logical addresses 0x80000000 to 0xFFFFFFFF correspond to physical addresses 0x800000000 to 0x87FFFFFFF;
2) A reserved address interval is selected from the 32-bit logical address space, for example 0x50000000 to 0x50C00000, as the run interval of the operating system kernel;
3) In the Bootloader, 0x50000000 to 0x50C00000 is mapped to a different physical address for each core.
Thus each core accesses the same logical addresses while running the same OS image, but these correspond to different physical addresses, so no conflict occurs.
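The effect of the remapping can be illustrated with a small address calculation in C. The logical window 0x50000000 to 0x50C00000 comes from the example above; placing the per-core windows back to back from physical address 0x800000000 and treating the upper bound as exclusive are assumptions made only for this sketch, as are the names kernel_phys_base and translate.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_LOGICAL_BASE 0x50000000u     /* from the example in the text      */
#define KERNEL_WINDOW_SIZE  0x00C00000u     /* 0x50C00000 - 0x50000000           */
#define DDR_PHYS_BASE       0x800000000ull  /* assumed start of the per-core area */

/* Physical base of the kernel window for one core (assumed layout:
 * consecutive windows, one per core). */
uint64_t kernel_phys_base(unsigned core_id) {
    return DDR_PHYS_BASE + (uint64_t)core_id * KERNEL_WINDOW_SIZE;
}

/* Translate a logical address inside the kernel window to the physical
 * address seen by the given core. Returns 0 on success, -1 if the
 * address lies outside the window. */
int translate(unsigned core_id, uint32_t logical, uint64_t *phys) {
    if (logical < KERNEL_LOGICAL_BASE ||
        logical - KERNEL_LOGICAL_BASE >= KERNEL_WINDOW_SIZE) {
        return -1;
    }
    *phys = kernel_phys_base(core_id) + (logical - KERNEL_LOGICAL_BASE);
    return 0;
}
```

Under these assumptions, logical address 0x50000000 resolves to 0x800000000 on core 0 and to 0x800C00000 on core 1, so the same image linked at one logical address occupies separate physical memory on each core.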
Working principle: first, in the boot stage, the same logical address is mapped to different physical addresses by memory mapping, which solves the problem of all cores running the same operating system image in AMP mode. The OpenMP functionality is implemented on the basis of the open-source libgomp library. In the core start-up stage, the start-up flow is controlled according to the core number. After all cores have completed initialization of the OpenMP runtime environment, the master core configures the number of concurrent cores, creates a task distribution thread, and enters the application program, where tasks are distributed according to the compiler directives. The slave cores enter the task receiving thread and block, waiting for the master core to distribute tasks; after receiving a message, a slave core parses it and creates a processing task. Inter-core communication is implemented with shared memory and inter-core interrupts, inter-core mutual exclusion is implemented with hardware semaphores, and interfaces such as thread creation are implemented on the POSIX standard interfaces provided by the operating system.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims (6)

1. An OpenMP implementation method based on a domestic multi-core DSP, the OpenMP implementation method being implemented using a layered architecture, the layered architecture comprising:
a hardware layer, comprising a domestic multi-core DSP processor whose cores all share external storage;
a system layer, comprising a Bootloader, a real-time operating system and an OpenMP adaptation layer;
a program layer, comprising precompiled instructions, a compiler, an OpenMP interface and environment variables;
a user layer, consisting of application programs developed with the OpenMP parallel programming model;
the OpenMP adaptation layer comprises thread creation, thread synchronization and mutual exclusion, shared memory management and multi-core task distribution; thread creation, shared memory management and multi-core task distribution are implemented as wrappers around the POSIX standard interfaces provided by the operating system; thread synchronization and mutual exclusion are implemented with hardware semaphores and shared memory;
and the OpenMP adaptation layer supplies the system-layer functionality on which the OpenMP interface depends.
2. The OpenMP implementation method according to claim 1, wherein the OpenMP adaptation layer implementation method includes:
1) Synchronization and mutual exclusion between operating system threads in AMP mode are implemented with a hardware semaphore and shared memory. The hardware semaphore provides mutually exclusive access to the multi-core shared memory, and a variable defined in the shared memory holds the value of the semaphore: the value is incremented by 1 when the semaphore is released and decremented by 1 when it is acquired. If the value is 0 on acquisition, the current thread calls the sched_yield function to give up the CPU, and repeats this until the value is no longer 0;
2) Multi-core task distribution for the operating system in AMP mode is implemented with a master-slave mode combined with an inter-core communication module, and the multiple cores run the same OS image in AMP mode.
3. The OpenMP implementation method according to claim 2, wherein the implementation method for multi-core task distribution includes:
one core of the domestic multi-core DSP processor is set as the master core and the other cores are set as slave cores; after the OS image has been loaded, every core starts the OS kernel and completes initialization of the inter-core communication module.
4. The OpenMP implementation method of claim 3, wherein the inter-core communication module is implemented with shared memory and inter-core IPC interrupts, and the implementation method comprises the following steps:
1) A region of the shared memory is allocated to each core and organized as a circular message queue, and the message queues are maintained through a mapping table. The sender uses the target ID of the send operation to look up the receiver's message queue, copies the message into it, and at the same time sends an IPC interrupt to notify the receiver;
2) The receiver blocks, waiting for the IPC interrupt to release a semaphore. After obtaining the semaphore, it queries the message queue, verifies that the queue has been updated, and takes out the message, completing one inter-core communication.
5. The OpenMP implementation method of claim 3, wherein the method for realizing the master-slave mode comprises the following steps:
1) The master core creates a user program thread, which performs critical-section protection configuration initialization, OpenMP environment variable initialization and workgroup initialization, and calls a multi-core concurrency control function to package and send the multi-core concurrent task information, thereby completing multi-core task distribution;
2) Each slave core creates a task receiving thread, which loops, blocking to receive messages sent by the master core. After a message is received, a task execution thread is created; it first parses the parallel block function entry and the shared work area information from the message, then jumps to the parallel block function entry to execute the parallel task, and performs multi-core synchronization after the task is finished, completing execution of the concurrent task.
6. The OpenMP implementation method according to claim 3, wherein the implementation method for running the same OS image on multiple cores in the AMP mode includes:
1) Selecting a reserved address interval from the 32-bit logical address space of the domestic multi-core DSP processor;
2) Mapping that address interval to a different physical address for each core in the Bootloader.
CN202311587658.2A 2023-11-23 2023-11-23 OpenMP implementation method based on domestic multi-core DSP Pending CN117608532A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311587658.2A CN117608532A (en) 2023-11-23 2023-11-23 OpenMP implementation method based on domestic multi-core DSP

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311587658.2A CN117608532A (en) 2023-11-23 2023-11-23 OpenMP implementation method based on domestic multi-core DSP

Publications (1)

Publication Number Publication Date
CN117608532A true CN117608532A (en) 2024-02-27

Family

ID=89955635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311587658.2A Pending CN117608532A (en) 2023-11-23 2023-11-23 OpenMP implementation method based on domestic multi-core DSP

Country Status (1)

Country Link
CN (1) CN117608532A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination