CN116302592A - Message transmission system between master core and slave core based on local memory


Info

Publication number: CN116302592A
Application number: CN202310075604.1A
Authority: CN (China)
Prior art keywords: core, message, slave, master, queue
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 陈虎, 周鹏灵
Assignees: Guangdong Science & Technology Infrastructure Center; South China University of Technology SCUT
Application filed by Guangdong Science & Technology Infrastructure Center and South China University of Technology SCUT

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/54 - Interprogram communication
    • G06F 9/546 - Message passing systems or structures, e.g. queues
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/40 - Transformation of program code
    • G06F 8/41 - Compilation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/71 - Version control; Configuration management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/76 - Adapting program code to run in a different environment; Porting
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a message passing system between a master core and slave cores based on local memory. The system provides a universal message-passing programming interface on different platforms such as x86 microprocessors, the SW26010 processor, and the heterogeneous fusion accelerator for E-level (exascale) computing. Compared with traditional programming against the unique interfaces of each domestic high-performance many-core processor, the method has the following advantages: the programming model is simple and easy to learn, which reduces programming difficulty; application software can be quickly migrated between different types of domestic high-performance microprocessors by modifying only the compiler configuration; and high-performance computing software can be developed and debugged with this model on an x86 platform and then ported to a domestic high-performance many-core processor, which effectively reduces development difficulty. These features will effectively improve the efficiency of developing and porting domestic high-performance computing software.

Description

Message transmission system between master core and slave core based on local memory
Technical Field
The present invention relates to the field of many-core processors, and in particular to a local memory-based message passing system between a master core and a slave core.
Background
(1) Domestic many-core processor architecture
As shown in FIG. 1, the SW26010 microprocessor (Haohuan FU, Junfeng LIAO. The Sunway TaihuLight supercomputer: system and applications [J]. Science China Information Sciences, 2016, 59(7): 1-16) contains 4 heterogeneous groups. Each heterogeneous group comprises one master core and a slave-core cluster of 64 slave cores running at 1.5 GHz, as shown in FIG. 2. The memory hierarchy of each heterogeneous group is the same and consists of the heterogeneous group memory (8 GB) and the slave cores' local storage. The master core has a 32 KB L1 data Cache and a 256 KB L2 Cache (data and instructions). Each slave core has 64 KB of local memory and 16 KB of instruction storage, and supports a 256-bit SIMD instruction set. A slave core may access master memory either by direct access or by DMA.
The accelerator chip for E-level (exascale) high-performance computing (Liu Sheng, Lu Kai, Guo Yang, Liu Zhong, Chen Haiyan, Lei Yuanwu, Sun Haiyan, Yang Qianning, Chen Xiaowen, Chen Shenggang, Liu Biwei, Lu Jianzhuang. A Self-Designed Heterogeneous Accelerator for Exascale High Performance Computing [J]. Journal of Computer Research and Development, 2021, 58(6): 1234-1237) adopts a heterogeneous fusion CPU+GPDSP architecture consisting of a multi-core CPU and 4 GPDSP clusters, as shown in FIG. 3. The multi-core CPU comprises 16 FT-C662 CPU cores. Each GPDSP cluster contains 6 DSP nodes (each DSP node contains 4 DSP cores). The multi-core CPU maintains Cache coherence in hardware and includes a 16 MB L2 Cache. Each GPDSP cluster uses a three-level storage structure of 80 MB private storage, 24 MB global shared storage and 32 GB HBM storage. Each DSP core includes a 64 KB private scalar memory (SM) and a 768 KB private vector memory (AM). The vector unit consists of 16 homogeneous VPE arrays and supports SIMD operations up to 1024 bits wide.
(2) Architecture abstraction of domestic many-core processor
Taking SW26010 and the accelerator chip for E-level high-performance computing as examples, domestic many-core high-performance microprocessors have the following characteristics:
1. They adopt an asymmetric structure with a small number of complex master cores and many simpler compute cores; the master processor handles complex logic-control tasks, while the coprocessor handles large-scale data-parallel tasks with high computational density and simple logic branches.
2. Each compute core has an independent local memory space, and these memory spaces are not Cache-coherent; the programmer must explicitly control data exchange between system main memory and each compute core's local memory.
3. There are two methods of data exchange between the master core and a slave core: 1) the slave core directly accesses the master core's memory space, which has long latency and is only suitable for transferring control information; 2) a DMA transfer initiated by the slave core, which can move larger amounts of data.
4. The slave cores support SIMD instructions, with different processors differing in SIMD width.
5. There is no operating-system support for multiple processes (threads) on a slave core; only one thread runs on each slave core. Different processors have different slave-core thread programming interfaces.
These two different types of many-core processors may be described by the abstract structure depicted in FIG. 4. One master core and N slave cores form a complete processor cluster. The master core accesses main memory through the on-chip Cache. Each slave core has a local memory without a Cache coherence protocol, and DMA completes data exchange between master-core memory and slave-core local memory. Each slave core provides a SIMD instruction set whose data width varies between processors. Table 1 gives the main architectural parameters of SW26010 and the heterogeneous fusion accelerator for E-level computing.
(3) Existing multi-core processor programming model
OpenMP (DE SUPINSKI B R, SCOGLAND T R W, DURAN A, et al. The ongoing evolution of OpenMP [J]. Proceedings of the IEEE, 2018, 106(11): 2004-2019.) is a common multi-threaded programming interface on current symmetric multiprocessor systems and is widely supported. Applications developed on this standard have good portability.
Cilk (Leiserson, Charles E.; Plaat, Aske (1998). "Programming parallel applications in Cilk". SIAM News. 31.) is a task-based multithreaded parallel programming extension. On this basis, Cilk++ (Leiserson C E. The Cilk++ concurrency platform [J]. The Journal of Supercomputing, 2010, 51(3): 244-257.) extends C/C++ with three keywords: _Cilk_for, _Cilk_spawn and _Cilk_sync. The runtime schedules tasks among worker threads in a divide-and-conquer fashion to keep the threads load-balanced.
Intel proposed the open-source Threading Building Blocks (TBB) library (Anonymous. "Intel Threading Building Blocks: Outfitting C++ for multi-core processor parallelism," SciTech Book News, vol. 32, (3), 2008; REINDERS J. Intel Threading Building Blocks: Outfitting C++ for multi-core processor parallelism. 1st edition). TBB uses tasks as its scheduling unit and is portable across POSIX and Windows thread libraries. In 2018, Intel published the oneAPI software programming framework. oneAPI aims to provide a unified programming model and application programming interface for CPUs, GPUs, FPGAs, neural-network processors and other hardware accelerators. The core of oneAPI is the Data Parallel C++ programming language (James Reinders et al. Data Parallel C++ [M]. Apress, Berkeley, CA, 2021; Gerhard R. Joubert, Hugh Leather, Mark Parsons, Frans Peters, Mark Sawyer, Ruyman Reyes, Victor Lomüller. SYCL: Single-source C++ accelerator programming [J]. Advances in Parallel Computing, 2016, 27.), which is essentially an extension of C++ that adds support for the SYCL programming model; it supports data-parallel and heterogeneous programming across CPUs and accelerators to simplify programming and improve code reusability across different hardware, while still allowing tuning for a specific accelerator.
Taking SW26010 and the heterogeneous fusion accelerator for E-level computing as examples: the SW26010 many-core processor provides the Athread function library, which creates and manages threads, with one slave core bound to each thread. In Athread, the master-core interface is responsible for operations such as thread creation and reclamation, thread scheduling control, interrupt and exception management, and asynchronous mask support. The slave-core interface is responsible for initiating data transfers, executing core computation, thread identification, sending interrupts and other operations.
The hthread multithreaded programming interface is used on the heterogeneous fusion accelerator for E-level computing. It comprises a master-core (host-side) programming interface and a slave-core (device-side) programming interface. The host-side interface mainly covers device management, image management, thread management, device-side storage management and device-side shared-resource management; the device-side interface mainly covers parallel management, DSP on-chip storage management, synchronization management, interrupt/exception handling and vectorization functions.
In summary, improving the portability of application software across hardware platforms has become a major direction of international high-performance programming models. However, the architectures and operating systems of domestic high-performance many-core microprocessors have their own characteristics: the existing programming models are difficult to use directly, the native interfaces are not interchangeable, and this seriously hinders the development of domestic high-performance software. The current SIMD programming models also have problems: OpenMP and Cilk++ require compiler support, MAL only supports macros for part of the ISA and cannot be used on domestic many-core processors, and the vc library and gSIMD encapsulate SIMD instructions so that the user does not operate vector instructions directly, the library fixes the vector width, and the supported instruction set is very limited.
Disclosure of Invention
The SW26010 and the heterogeneous fusion accelerator for E-level computing, both independently developed in China, are high-performance many-core processors. They adopt a small number of master cores and many slave cores, and the slave cores use local memories without Cache coherence, which differs greatly from the traditional SMP (symmetric multiprocessor) and CC-UMA (cache-coherent uniform memory access) structures. Meanwhile, interfaces such as slave-core thread usage and local-memory data transfer are unique to each processor and differ greatly from common international standards. This directly leads to two problems: 1) developing software for domestic high-performance many-core processors requires using the lowest-level interfaces directly, and since debugging is generally only possible over a remote connection to the supercomputing center, such software is difficult to develop; 2) domestic software cannot be reused across different domestic high-performance many-core processors, so the already weak domestic software development effort becomes even more fragmented and much work is duplicated.
The invention provides a message transmission system between the master core and the slave cores based on local memory. It offers a general method for message passing between the master core and the slave cores and provides a universal message-passing programming interface on different platforms such as x86 microprocessors, the SW26010 processor, and the heterogeneous fusion accelerator for E-level computing. Compared with traditional programming against the unique interfaces of each domestic high-performance many-core processor, the method has the following advantages: 1) the programming model is simple and easy to learn, which reduces programming difficulty; 2) application software can be quickly migrated between different types of domestic high-performance microprocessors by modifying only the compiler configuration; 3) high-performance computing software can be developed and debugged with this model on an x86 platform and then ported to a domestic high-performance many-core processor, which effectively reduces development difficulty. These features will effectively improve the efficiency of developing and porting domestic high-performance computing software.
The object of the invention is achieved by at least one of the following technical solutions.
A message transmission system between a master core and slave cores based on local memory includes a master core set M, whose elements are denoted m_1, …, m_|M|, where |M| represents the number of master cores in M. Each master core m_i corresponds to one or more slave core sets S_a, satisfying |S_a| = |S_b|, where |S_a| represents the number of slave cores in slave core set S_a and 1 ≤ a, b ≤ |M|;
through the slave-core thread management interface, a master core m_i can manage the slave core set S_i, 1 ≤ i ≤ |M|;
when creating the k-th message queue q_{i,j,k} from the i-th master core m_i to the j-th slave core s_{i,j} in the i-th slave core set S_i, where s_{i,j} ∈ S_i, 1 ≤ j ≤ |S_i| and 1 ≤ k, the calling interface can be used to create the corresponding message queue q_{i,j,k} in the memories of master core m_i and slave core s_{i,j}; all message queues q_{i,j,k} from master core m_i to slave core s_{i,j} form the set Q, q_{i,j,k} ∈ Q, which establishes the connection between master core m_i and slave core s_{i,j};
master core m_i or slave core s_{i,j} sends a series of messages r_x through the message sending mechanism into message queue q_{i,j,k}, yielding a message sequence set R; the messages in R are sent in order, where 1 ≤ x and r_x ∈ R;
slave core s_{i,j} or master core m_i selects the corresponding message r_x from the message sequence R according to message queue q_{i,j,k}, where r_x ∈ R and 1 ≤ x ≤ |R|; after the user obtains message r_x and completes the user-defined processing of r_x, message queue q_{i,j,k} releases the memory used by message r_x;
after slave core s_{i,j} has processed its data, the cache of slave core s_{i,j} is logged off; the thread running on master core m_i reclaims slave core s_{i,j} and continues processing the tasks of master core m_i; if there is no task left, master core m_i logs off its cache and the multi-threaded parallel program ends.
Further, creating a message queue on the primary core requires specifying the following parameters:
the message queue name qName (a string), the number slaveID of the connected slave core, the message size msgSize, the number of messages mSize held in the master core portion of the queue, the number of messages sSize held in the slave core portion, the starting address mQaddr of the master-core part of the queue, the memory type sType occupied by the queue on the slave core, and the direction of the queue; after a successful call, a handle number handle is returned;
wherein the master core identifies a queue entity with (slave core number, handle number) or (slave core number, queue name); the slave core takes the handle number or the queue name as the unique identification number of the queue to determine a unique queue entity; the handles of the same queue on the master core are the same as the handles on the slave cores;
the message queue is only used for communication between the master core and a slave core, and the user can specify the slave core slaveID where the queue is located; between a given pair of master core and slave core, a plurality of different message queues may be created;
the size of each message in the message queue is not greater than msgSize bytes;
a message queue is distributed in a main core memory and a local memory of a slave core, and the number of the messages held by the main core memory and the local memory is mSize and sSize respectively;
the initial address of the message queue on the main core memory is a continuous memory space designated by an application program, and the initial address is mQaddr;
if the local memories on the slave cores are of different types, the type of local memory occupied by the message queue may be specified by the slave core memory type sType;
each message queue has a single direction, either master-core-write/slave-core-read or slave-core-write/master-core-read, specified by the direction parameter;
the master core can create a plurality of message queues between the master core and one slave core, and the message queues between the master core and all the slave cores form a message queue set;
the master core controls slave-core threads through the slave-core thread management interface, whose main functions are: creating and starting a slave-core thread group, waiting for the thread group to terminate, closing the thread group, and loading an image file from the master core to the device.
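By way of illustration, the following C sketch shows what creating such a queue from the master core might look like. The exact prototype of mCreateQueue() (interface M12 below) is not given in this description, so the parameter order, the return convention and the direction/memory-type constants used here are assumptions.

```c
#include <stdint.h>

/* Assumed constants; the description only names the concepts. */
#define QDIR_MASTER_TO_SLAVE 0   /* master core writes, slave core reads */
#define QDIR_SLAVE_TO_MASTER 1   /* slave core writes, master core reads */
#define SMEM_DEFAULT         0   /* default slave-core local memory type (sType) */

/* Assumed prototype: returns a handle number on success, negative on failure. */
int mCreateQueue(const char *qName, int slaveID, int msgSize,
                 int mSize, int sSize, void *mQaddr, int sType, int direction);

static char queueArea[16 * 256];  /* contiguous master-memory space, used as mQaddr */

int create_example_queue(void)
{
    /* One master-to-slave queue to slave core 0: 256-byte messages,
       16 message blocks on the master side, 4 blocks in slave local memory. */
    int handle = mCreateQueue("cmdQ", /*slaveID=*/0, /*msgSize=*/256,
                              /*mSize=*/16, /*sSize=*/4,
                              queueArea, SMEM_DEFAULT, QDIR_MASTER_TO_SLAVE);
    return handle;  /* (slaveID, handle) identifies this queue on the master core */
}
```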
Further, a message queue has a continuous storage space for message contents in both the master core part and the slave core part; the numbers of messages they can hold are mSize and sSize respectively, occupying mSize × msgSize bytes and sSize × msgSize bytes of memory; the capacity of the slave-core part of the queue is limited by the capacity of the local memory;
the control information layout of each message queue is divided into two parts: a status list and a location index;
the position indices are: IMTran, IMReady, IMLocked and IMIdle, associated with master-core block positions, and ISTran, ISReady, ISLocked and ISIdle, associated with slave-core block positions; their placement differs with the queue direction. In the control-information layout of a master-to-slave message queue, IMLocked and IMIdle are stored in the master-core address area, while IMTran, IMReady and the four IS* indices are located in the slave-core local memory. In the control-information layout of a slave-to-master message queue, IMReady, IMLocked and IMIdle are stored in the master-core address area, while IMTran and the four IS* indices are located in the slave-core local memory;
IMTran indicates that the first message block state in the main core space is the message position index in transmission; IMReady represents the message location index in the main core space where the first message block state is ready for a message; IMLocked represents the first message block state in the main core space as the message position index in the message lock; IMIdle indicates that the first message block state in the main core space is the message position index in the message idle;
ISTran indicates that the first message block state in kernel space is the message location index in transmission; ISReady represents the message location index that is ready for a message from the first message block state in the kernel space; ISLocked denotes the message location index from the first message block state in core space in message lock; ISIdle indicates that the first message block state in the kernel space is the message position index in the message idle;
each state in the state list corresponds to each message block in the annular message block data area one by one; the message block state list of the master core part and the message block state list of the slave core part are respectively marked as MState and SState and are respectively positioned in a master core address area and a slave core local memory;
a message queue is divided into a master core part and a slave core part;
when the message queue is created, the number of messages that the master core portion and the slave core portion can accommodate is already determined;
the position indices in the message queue control-information layout are placed differently for different queue directions: variables the master core does not need are stored in the slave-core local memory, which reduces accesses by slave-core code to master-core variables and improves the performance of the model.
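To make the layout above concrete, the sketch below groups the position indices and state lists of a master-to-slave queue into two C structures, one resident in the master core address area and one in the slave-core local memory. The field names follow the description; the structure names, integer types and the fixed array sizes (which stand in for mSize and sSize) are assumptions.

```c
#include <stdint.h>

typedef uint8_t MsgState;        /* one state value per message block */

/* Control information kept in the master core address area
   (master-to-slave queue): only the indices the master core needs. */
typedef struct {
    volatile uint32_t IMLocked;   /* first master-part block in the locked state  */
    volatile uint32_t IMIdle;     /* first master-part block in the idle state    */
    MsgState          MState[16]; /* state list of the master-part blocks (mSize) */
} MasterCtrlM2S;

/* Control information kept in the slave-core local memory. */
typedef struct {
    volatile uint32_t IMTran;     /* first master-part block in transfer          */
    volatile uint32_t IMReady;    /* first master-part block with a ready message */
    volatile uint32_t ISTran, ISReady, ISLocked, ISIdle;  /* slave-part indices   */
    MsgState          SState[4];  /* state list of the slave-part blocks (sSize)  */
} SlaveCtrlM2S;
```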
Further, in a master-to-slave message queue, the states of a message block in the master core portion include: MasterIdle, MasterLocked, MasterReady and MTransferring; the states of a message block in the slave core portion include: SlaveIdle, STransferring, SlaveReady and SlaveLocked; the state information of each message block is stored in the respective memory;
After the message queue is created, all message blocks of the master core part are in a MasterIdle state, and all message blocks of the slave core part are in a SlaveIdle state;
MasterIdle indicates that the message block in the main core is in an idle and allocable state, masterLocked indicates that the message block in the main core is in a locking state, masterReady indicates that the message block in the main core is in a ready and available state, and MTransferring indicates that the message block in the main core is in a transmission state;
SlaveIdle indicates that the message block in the slave core is in an idle and allocable state, STransferring indicates that the message block in the slave core is in a transmission state, slaveReady indicates that the message block in the slave core is in a ready and usable state, and SlaveLocked indicates that the message block in the slave core is in a locking state;
the interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
M1, mAllocateMsg(): obtain the address of a message block in the master core part of a message queue;
M2, mSendMsg(): start the master core transferring a message to the slave core;
M3, mRecvMsg(): receive a message sent by the slave core;
M4, mReleaseMsg(): release a message block of the master core part;
the interfaces the message queue system provides for the slave core application program include:
S1, sRecvMsg(): receive a message sent by the master core;
S2, sReleaseMsg(): release a message block of the slave core part;
S3, sAllocateMsg(): obtain the address of a message block in the slave core part of a message queue;
S4, sSendMsg(): start the slave core transferring a message to the master core;
among these interfaces, M1, M2, S1 and S2 are used for the master core to pass messages to the slave core, and M3, M4, S3 and S4 are used for the slave core to pass messages to the master core.
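The description gives only the interface names; the hedged C sketch below shows plausible prototypes so that the operation sequences that follow are easier to read. The parameter lists, the use of a (slaveID, handle) pair on the master side and of a bare handle on the slave side, and the return of NULL when no message is available are all assumptions.

```c
/* Master-side interfaces (M1-M4), assumed prototypes. */
void *mAllocateMsg(int slaveID, int handle);   /* M1: get a free master-part block  */
int   mSendMsg(int slaveID, int handle);       /* M2: mark the block ready to send  */
void *mRecvMsg(int slaveID, int handle);       /* M3: receive a slave-sent message  */
int   mReleaseMsg(int slaveID, int handle);    /* M4: release a master-part block   */

/* Slave-side interfaces (S1-S4), assumed prototypes. */
void *sRecvMsg(int handle);                    /* S1: receive a master-sent message */
int   sReleaseMsg(int handle);                 /* S2: release a slave-part block    */
void *sAllocateMsg(int handle);                /* S3: get a free slave-part block   */
int   sSendMsg(int handle);                    /* S4: mark the block ready to send  */
```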
Further, the sequence of operations for the master core to send a message to the slave core includes:
a1, the master core application calls mAllocateMsg(); the local-memory-based message passing system allocates the idle message block pointed to by position index IMIdle in the master core part of the queue, sets the block to the MasterLocked state, cyclically advances IMIdle, and returns the block address MasterMsg to the master core application;
a2, the master core application writes the message to be sent into the idle message block pointed to by MasterMsg;
a3, the master core application calls mSendMsg(); the system obtains the first message block MasterMsg pointed to by position index IMLocked, sets this block to the MasterReady state, and cyclically advances IMLocked;
a4, at an appropriate time, the message passing system allocates, for the message block to be transferred, the idle slave-part message block SlaveMsg pointed to by position index ISIdle, and cyclically advances ISIdle; it obtains the first message block MasterMsg pointed to by position index IMReady, sets MasterMsg to the MTransferring state and SlaveMsg to the STransferring state, and starts a DMA transfer of the message block from MasterMsg to SlaveMsg; after the DMA transfer finishes, the system sets the message block SlaveMsg pointed to by slave-core position index ISTran to the SlaveReady state and the message block MasterMsg pointed to by master-core position index IMTran to the MasterIdle state;
a5, the slave core application calls sRecvMsg(); the message queue returns the message block SlaveMsg pointed to by the slave-part position index ISReady to the slave core application and sets it to SlaveLocked;
a6, the slave core application reads the content of SlaveMsg;
a7, the slave core application calls sReleaseMsg(); the message queue sets slave-core message block SlaveMsg to the SlaveIdle state;
The sequence of operations for the slave core to send a message to the master core includes:
b1, the slave core application calls sAllocateMsg(); the local-memory-based message passing system allocates the idle message block pointed to by position index ISIdle in the slave core part of the queue, sets the block to the SlaveLocked state, cyclically advances ISIdle, and returns the block address SlaveMsg to the slave core application;
b2, the slave core application writes the message to be sent into the idle message block pointed to by SlaveMsg;
b3, the slave core application calls sSendMsg(); the system obtains the first message block SlaveMsg pointed to by position index ISLocked, sets this block to the SlaveReady state, and cyclically advances ISLocked;
b4, at an appropriate time, the message passing system allocates, for the message block to be transferred, the idle master-part message block MasterMsg pointed to by position index IMIdle; it sets MasterMsg to the MTransferring state and SlaveMsg to the STransferring state, and starts a DMA transfer of the message block from SlaveMsg to MasterMsg; after the DMA transfer finishes, the system sets the message block MasterMsg pointed to by master-core position index IMTran to the MasterReady state and the message block SlaveMsg pointed to by slave-core position index ISTran to the SlaveIdle state;
b5, the master core application calls mRecvMsg(); the message queue returns the message block address MasterMsg pointed to by the master-part position index IMReady to the master core application;
b6, the master core application reads the content of MasterMsg;
b7, the master core application calls mReleaseMsg(); the message queue sets master-core message block MasterMsg to the MasterIdle state;
the master core or slave core application directly reads and writes the content of a message block in the memory area managed by the message queue; the message content does not need to be moved to any other memory space. This reduces the overhead of moving message content and effectively reduces the use of slave-core local memory;
the master core or slave core application merely initiates sending or receiving of a message and does not need to care about how the message is actually transferred between the master core and the slave core; the transfer is carried out by the local-memory-based message passing system. This simplifies application design and gives the application better portability.
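Putting steps a1-a7 together, a master-to-slave transfer might look like the sketch below. It reuses the assumed prototypes from the earlier sketch; the message payload structure and the queue handle are purely illustrative.

```c
/* Assumed prototypes (see the earlier sketch). */
void *mAllocateMsg(int slaveID, int handle);
int   mSendMsg(int slaveID, int handle);
void *sRecvMsg(int handle);
int   sReleaseMsg(int handle);

typedef struct { int taskId; double data[24]; } WorkMsg;   /* illustrative payload */

/* Master core side: steps a1-a3. */
void master_send(int slaveID, int handle, int taskId)
{
    WorkMsg *m = (WorkMsg *)mAllocateMsg(slaveID, handle);  /* a1: block -> MasterLocked     */
    m->taskId = taskId;                                     /* a2: fill the message in place */
    mSendMsg(slaveID, handle);                              /* a3: block -> MasterReady      */
    /* a4 (the DMA into the slave's local memory) is done by the messaging system. */
}

/* Slave core side: steps a5-a7. */
int slave_receive(int handle)
{
    WorkMsg *m = (WorkMsg *)sRecvMsg(handle);   /* a5: block -> SlaveLocked       */
    int id = m->taskId;                         /* a6: read the content in place  */
    sReleaseMsg(handle);                        /* a7: block -> SlaveIdle         */
    return id;
}
```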
Further, the blocking type message transmission process between the master core and the slave core is specifically as follows:
Message queues will maintain a set of DMA requests DMAReqs in each message queue; the set is initialized to an empty set;
the slave core application calls the interface sRecvMsg(); inside sRecvMsg(), the following steps are performed:
A1. judge whether the DMA request set DMAReqs of the message queue is non-empty; if it is non-empty, execute step A2, otherwise execute step A3;
A2. check each request req in DMAReqs in turn to see whether its DMA has completed; if not, ignore it; if it has completed, set the state of req.SMsg to SlaveReady, set the state of req.MMsg to MasterIdle, and remove req from DMAReqs;
A3. judge whether a master-part message block MMsg in the MasterReady state and a slave-part message block SMsg in the SlaveIdle state can be acquired; if so, execute step A4, otherwise execute step A5;
A4. set the message block MMsg to the MTransferring state and the message block SMsg to the STransferring state, start an asynchronous DMA request of msgSize bytes from MMsg to SMsg, add the request req to DMAReqs, and execute step A3 again;
A5. if a message in the slave core part is in the SlaveReady state, set the earliest SlaveReady message Msg to the SlaveLocked state and return Msg to the application, ending the call; otherwise execute step A1;
where the DMA request set DMAReqs is initialized to empty;
the slave core application calls the interface sSendMsg(); inside sSendMsg(), the following steps are performed:
B1. judge whether the DMA request set DMAReqs of the message queue is non-empty; if it is non-empty, execute step B2, otherwise execute step B3;
B2. check each request req in DMAReqs in turn to see whether its DMA has completed; if not, ignore it; if it has completed, set the state of req.SMsg to SlaveIdle, set the state of req.MMsg to MasterReady, and remove req from DMAReqs;
B3. judge whether a slave-part message block SMsg in the SlaveReady state and a master-part message block MMsg in the MasterIdle state can be acquired; if so, execute step B4, otherwise execute step B5;
B4. set the message block MMsg to the MTransferring state and the message block SMsg to the STransferring state, start an asynchronous DMA request of msgSize bytes from SMsg to MMsg, add the request req to DMAReqs, and execute step B3 again;
B5. if the message block being sent by the slave core part is in the SlaveLocked state, set this SlaveLocked message Msg to the SlaveReady state and return to the application, ending the call; otherwise execute step B1;
where the DMA request set DMAReqs is initialized to empty.
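The following is a minimal C sketch of the blocking receive loop A1-A5 inside sRecvMsg() on the slave core. Every helper below stands in for book-keeping that the description leaves to the messaging system; none of them (nor the dma_* calls) is an interface defined by this description.

```c
typedef enum { MasterIdle, MasterLocked, MasterReady, MTransferring,
               SlaveIdle,  SlaveLocked,  SlaveReady,  STransferring } BlockState;

typedef struct { void *MMsg; void *SMsg; int dmaTag; } DmaReq;

extern int    dmaReqCount;                   /* current size of DMAReqs            */
extern DmaReq dmaReqs[];                     /* the per-queue DMA request set      */
extern int    dma_start(void *dst, const void *src, int bytes); /* async DMA, tag */
extern int    dma_done(int tag);             /* has this DMA finished?             */
extern void   set_state(void *block, BlockState s);
extern void   remove_req(int i);
extern void   add_req(void *MMsg, void *SMsg, int tag);
extern int    take_master_ready(int handle, void **MMsg);  /* MasterReady block?   */
extern int    take_slave_idle(int handle, void **SMsg);    /* SlaveIdle block?     */
extern void  *take_earliest_slave_ready(int handle);       /* earliest one or NULL */

void *sRecvMsg_blocking(int handle, int msgSize)
{
    for (;;) {
        /* A1/A2: retire any finished DMA requests in DMAReqs. */
        for (int i = 0; i < dmaReqCount; ++i)
            if (dma_done(dmaReqs[i].dmaTag)) {
                set_state(dmaReqs[i].SMsg, SlaveReady);
                set_state(dmaReqs[i].MMsg, MasterIdle);
                remove_req(i--);             /* re-check the slot that shifted in */
            }
        /* A3/A4: pair MasterReady blocks with SlaveIdle blocks and start DMAs. */
        void *MMsg, *SMsg;
        while (take_master_ready(handle, &MMsg) && take_slave_idle(handle, &SMsg)) {
            set_state(MMsg, MTransferring);
            set_state(SMsg, STransferring);
            add_req(MMsg, SMsg, dma_start(SMsg, MMsg, msgSize));
        }
        /* A5: deliver the earliest SlaveReady message, or loop back to A1. */
        void *msg = take_earliest_slave_ready(handle);
        if (msg) { set_state(msg, SlaveLocked); return msg; }
    }
}
```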
Further, a slave core can access the master core's memory space in two different ways: direct access and asynchronous DMA transfer. Direct access has low efficiency and is suitable only for small amounts of data. An asynchronous DMA transfer consists of two steps, starting the DMA transfer and querying the DMA result; after starting a DMA transfer, the software can carry out other work without waiting for the DMA to finish, and learns whether the DMA has completed by querying the DMA result;
in the blocking send/receive process, the call returns only after the slave core has received the master core's message; otherwise it keeps waiting for a message sent by the master core;
when the slave core receives a message, it starts the DMA transfer of the master-part message blocks that are in the MasterReady state; when the slave core has two or more message blocks and the master core produces messages faster than the slave core consumes them, the slave core application's reading of one message and the DMA transfer of the next can proceed in parallel.
Further, a message queue is created by the master core, and a new queue handle is generated on both the master core and the slave core. On the master core side, handles are allocated per slave core number; on the slave core side, a unique queue entity can be determined by the handle number handle or the queue name qName. The handle of a queue on the master core is the same as its handle on the slave core; that is, the queue identified on the master core by (slaveID, handle) and the queue identified on slave core slaveID by handle are the same queue entity. The state of a specific message queue can be queried through its identification number handle, mainly including whether the queue exists, its direction, the message size, and the number of messages currently in the queue;
the interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
m5, mQueryQueue (), inquiring whether a message queue exists;
m6, mQueueDirection (), obtaining the queue direction of the message queue;
m7, mQueueMsgNumInMaster (), obtaining the number of messages which can be accommodated by a control core part of a message queue;
m8, mQueueMsgNumInSlave (), obtaining the number of messages which can be accommodated by a computing core part of a message queue;
M9, mQueueMsgSize (), obtaining the maximum byte number of each message in the message queue;
m10, mQueueMsgSlaveMemType (), obtaining the memory type of the slave core part of a message queue;
m11, mQueueMsgNumStatus (), obtaining the dynamic information of the message queue;
m12, mCreateQueue (), creating a message queue;
the interface provided by the local memory-based messaging system between the master and slave cores for the slave core application includes:
s5, sQueryQueue (), inquiring whether a message queue exists;
s6, sQueueDirection (), obtaining the queue direction of a message queue;
s7, sQueueMsgNumInMaster (), obtaining the number of messages which can be accommodated by a control core part of a message queue;
s8, sQueueMsgNumInSlave (), acquiring the number of messages which can be accommodated by a computing core part of a message queue;
s9, obtaining the maximum byte number of each message in a message queue;
s10, sQueueMsgSlaveMemType (), obtaining the memory type of the slave core part of a message queue;
s11, sQueueMsgNumStatus (), obtaining dynamic information of a message queue;
the interfaces M5-M12 are used for inquiring the related message queue information on the master core, and the interfaces S5-S11 are used for inquiring the related message queue information on the slave core.
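For illustration, a master-side snippet that checks a queue before using it is sketched below; the prototypes and the integer status output are assumptions, since the description only names the query interfaces.

```c
#include <stdio.h>

/* Assumed prototypes for query interfaces M5, M6 and M11. */
int mQueryQueue(int slaveID, const char *qName);             /* does the queue exist?     */
int mQueueDirection(int slaveID, int handle);                 /* direction of the queue    */
int mQueueMsgNumStatus(int slaveID, int handle, int *nMsgs);  /* messages currently queued */

void report_queue(int slaveID, int handle, const char *qName)
{
    if (!mQueryQueue(slaveID, qName)) {          /* M5: existence check */
        printf("queue %s on slave core %d does not exist\n", qName, slaveID);
        return;
    }
    int pending = 0;
    mQueueMsgNumStatus(slaveID, handle, &pending);   /* M11: dynamic information */
    printf("queue %s: direction=%d, pending=%d\n",
           qName, mQueueDirection(slaveID, handle), pending);
}
```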
Further, when a user creates a message queue, a dedicated message queue handle is generated, and the unique message queue can be obtained through the handle number or the queue name;
the corresponding state information of the message queue can be obtained on both the master core side and the slave core side; because the master core communicates with multiple slave cores, it determines a unique queue entity by (slave core number, handle number) or (slave core number, queue name); on the slave core, the handle number handle or queue name qName determines the unique queue entity.
Further, the following interfaces are provided on the different high-performance many-core processors. They cover the steps required for communication between the master core and the slave cores, including the slave-core management mechanism on the master core, so that code using them performs the corresponding functions and can be quickly ported to various high-performance many-core processors. When code is ported to a new platform, it only needs to be recompiled with the compilation options of that platform;
the interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
M13, mHaltDevice(): exit the running environment;
M14, mHMessQueueInit(): initialization method;
M15, mHMessQueueQuit(): logoff method for the control core part;
M16, mLoadDatFile(): load an image file to the device (only needed on MT3);
M17, mUnloadDatFile(): unload the image file from the device (only needed on MT3);
M18, mGetSlaveCoreNum(): obtain the number of computing cores;
M19, mGetMSize(): obtain the memory sizes of the control core and the computing cores, in bytes;
M20, mGetSlaveSIMDLanes(): obtain the number of lanes processed in parallel by the computing cores' SIMD instructions;
M21, mInitDevice(): load the running environment of the acceleration device;
M22, mTinitThreadID(): obtain the initialized thread data structure;
M23, mStartSlaveThread(): create, start and bind a thread group of computing cores;
M24, mWaitSlaveThreads(): wait for the thread group to terminate;
M25, mDestroySlaveThreads(): close the thread group;
M26, mSlaveThreadActive(): query whether a computing-core thread is active;
The interface provided by the local memory-based messaging system between the master and slave cores for the slave core application includes:
S12, sHMessQueueInit(): initialize the message queue on the slave core;
S13, logoff method for the computing core part cache;
S14, sGetSlaveNum(): obtain the number of computing cores;
S15, sGetSlaveID(): obtain the number of the current computing core;
S16, obtain the maximum number of bytes of each message in a message queue;
S17, sSIMDLanes(): obtain the number of lanes processed in parallel by the computing core's SIMD instructions;
M13-M20 are used to query related information on the master core, M21-M26 are used by the master core to manage slave-core threads, and S12-S17 are used to query related information on the slave core;
these interfaces cover the functions required by the current different high-performance many-core processors; the provided interface set I is the union of the low-level interface sets L_i of the different high-performance many-core processors, 1 ≤ i, i.e. I = L_1 ∪ L_2 ∪ …; if a processor's low-level library L_i has no function corresponding to an interface in I, that interface is implemented as an empty function, so calling it has no negative impact on the code;
by programming with macro definitions, different processors correspond to different predefined macros that are set at build time; calling the same interface then has the same effect on the different high-performance many-core processors, and the differences between the underlying libraries of the different many-core processors are encapsulated.
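A sketch of this macro-based encapsulation is given below: the same interface mStartSlaveThread() maps to different low-level libraries depending on a compile-time macro chosen by the build options. The macro names (USE_SW26010, USE_MT3), the mStartSlaveThread() signature and the hthread call are placeholders; only the Athread calls are real SW26010 library functions, used here in their commonly documented form.

```c
#if defined(USE_SW26010)                 /* build option for the SW26010 platform */
  #include <athread.h>
  int mStartSlaveThread(void (*fn)(void *), void *arg)
  {
      athread_init();                    /* Athread: set up the slave-core runtime */
      return athread_spawn(fn, arg);     /* Athread: start the slave thread group  */
  }
#elif defined(USE_MT3)                   /* build option for the E-level accelerator */
  extern int start_hthread_group(void (*fn)(void *), void *arg);  /* placeholder */
  int mStartSlaveThread(void (*fn)(void *), void *arg)
  {
      return start_hthread_group(fn, arg);   /* would call the hthread interfaces */
  }
#else                                    /* plain x86 build for development/debugging */
  #include <pthread.h>
  static pthread_t slaveThreads[64];
  int mStartSlaveThread(void (*fn)(void *), void *arg)
  {
      for (int i = 0; i < 64; ++i)       /* emulate the slave cores with pthreads */
          pthread_create(&slaveThreads[i], NULL, (void *(*)(void *))fn, arg);
      return 0;
  }
#endif
```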
Compared with the prior art, the invention has the advantages that:
Aiming at the problem that the thread programming libraries of domestic high-performance many-core processors are not uniform, the invention provides a slave-core thread management mechanism that controls threads on the various platforms. Aiming at the problem that each slave core has an independent memory space without Cache coherence, requiring the program to explicitly control data exchange between system main memory and each compute core's memory, the model provides message queues.
With the support of this programming model, high-performance computing software can be developed and debugged on an x86 platform and then ported to a domestic high-performance many-core processor. This not only effectively reduces development difficulty, but also allows the same software to be quickly migrated between the two different types of domestic high-performance microprocessors, effectively improving the efficiency of developing and porting domestic high-performance computing software.
Drawings
Fig. 1 is a diagram of a single heterogeneous group in the SW26010 processor.
FIG. 2 is a schematic diagram of an accelerator chip architecture for E-level high performance computing.
FIG. 3 is an abstract schematic diagram of the architecture of a domestic high-performance heterogeneous processor.
Fig. 4 is a schematic diagram of a memory structure of a message queue and a status of a message in an embodiment of the present invention.
FIG. 5 is a diagram illustrating a control information layout of a master-to-slave direction message queue in an embodiment of the present invention.
FIG. 6 is a flow chart of an implementation of a local memory-based message passing system between a master core and a slave core in an embodiment of the invention.
FIG. 7 is a graph of the performance of a password-guessing program in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, a detailed description of the specific implementation of the present invention will be given below with reference to the accompanying drawings and examples.
Examples:
A message transmission system between a master core and slave cores based on local memory includes a master core set M, whose elements are denoted m_1, …, m_|M|, where |M| represents the number of master cores in M. Each master core m_i corresponds to one or more slave core sets S_a, satisfying |S_a| = |S_b|, where |S_a| represents the number of slave cores in slave core set S_a and 1 ≤ a, b ≤ |M|;
through the slave-core thread management interface, a master core m_i can manage the slave core set S_i, 1 ≤ i ≤ |M|;
when creating the k-th message queue q_{i,j,k} from the i-th master core m_i to the j-th slave core s_{i,j} in the i-th slave core set S_i, where s_{i,j} ∈ S_i, 1 ≤ j ≤ |S_i| and 1 ≤ k, the calling interface can be used to create the corresponding message queue q_{i,j,k} in the memories of master core m_i and slave core s_{i,j}; all message queues q_{i,j,k} from master core m_i to slave core s_{i,j} form the set Q, q_{i,j,k} ∈ Q, which establishes the connection between master core m_i and slave core s_{i,j};
master core m_i or slave core s_{i,j} sends a series of messages r_x through the message sending mechanism into message queue q_{i,j,k}, yielding a message sequence set R; the messages in R are sent in order, where 1 ≤ x and r_x ∈ R;
slave core s_{i,j} or master core m_i selects the corresponding message r_x from the message sequence R according to message queue q_{i,j,k}, where r_x ∈ R and 1 ≤ x ≤ |R|; after the user obtains message r_x and completes the user-defined processing of r_x, message queue q_{i,j,k} releases the memory used by message r_x;
after slave core s_{i,j} has processed its data, the cache of slave core s_{i,j} is logged off; the thread running on master core m_i reclaims slave core s_{i,j} and continues processing the tasks of master core m_i; if there is no task left, master core m_i logs off its cache and the multi-threaded parallel program ends.
Further, creating a message queue on the primary core requires specifying the following parameters:
The message queue name qName (a string), the number slaveID of the connected slave core, the message size msgSize, the number of messages mSize held in the master core portion of the queue, the number of messages sSize held in the slave core portion, the starting address mQaddr of the master-core part of the queue, the memory type sType occupied by the queue on the slave core, and the direction of the queue; after a successful call, a handle number handle is returned;
wherein the master core identifies a queue entity with (slave core number, handle number) or (slave core number, queue name); the slave core takes the handle number or the queue name as the unique identification number of the queue to determine a unique queue entity; the handles of the same queue on the master core are the same as the handles on the slave cores;
the message queue is only used for communication between the master core and a slave core, and the user can specify the slave core slaveID where the queue is located; between a given pair of master core and slave core, a plurality of different message queues may be created;
the size of each message in the message queue is not greater than msgSize bytes;
a message queue is distributed in a main core memory and a local memory of a slave core, and the number of the messages held by the main core memory and the local memory is mSize and sSize respectively;
The initial address of the message queue on the main core memory is a continuous memory space designated by an application program, and the initial address is mQaddr;
if the local memories on the slave cores are of different types, the type of local memory occupied by the message queue may be specified by the slave core memory type sType;
each message queue has a single direction, either master-core-write/slave-core-read or slave-core-write/master-core-read, specified by the direction parameter;
the master core can create a plurality of message queues between the master core and one slave core, and the message queues between the master core and all the slave cores form a message queue set;
the master core controls slave-core threads through the slave-core thread management interface, whose main functions are: creating and starting a slave-core thread group, waiting for the thread group to terminate, closing the thread group, and loading an image file from the master core to the device.
Further, a message queue has a continuous storage space for message contents in both the master core part and the slave core part; the numbers of messages they can hold are mSize and sSize respectively, occupying mSize × msgSize bytes and sSize × msgSize bytes of memory, as shown in fig. 4; the capacity of the slave-core part of the queue is limited by the capacity of the local memory;
The control information layout of each message queue is divided into two parts: a status list and a location index;
the position indices are: IMTran, IMReady, IMLocked and IMIdle, associated with master-core block positions, and ISTran, ISReady, ISLocked and ISIdle, associated with slave-core block positions; their placement differs with the queue direction. In the control-information layout of a master-to-slave message queue, IMLocked and IMIdle are stored in the master-core address area, while IMTran, IMReady and the four IS* indices are located in the slave-core local memory. In the control-information layout of a slave-to-master message queue, IMReady, IMLocked and IMIdle are stored in the master-core address area, while IMTran and the four IS* indices are located in the slave-core local memory;
IMTran indicates that the first message block state in the main core space is the message position index in transmission; IMReady represents the message location index in the main core space where the first message block state is ready for a message; IMLocked represents the first message block state in the main core space as the message position index in the message lock; IMIdle indicates that the first message block state in the main core space is the message position index in the message idle;
ISTran indicates that the first message block state in kernel space is the message location index in transmission; ISReady represents the message location index that is ready for a message from the first message block state in the kernel space; ISLocked denotes the message location index from the first message block state in core space in message lock; ISIdle indicates that the first message block state in the kernel space is the message position index in the message idle;
each state in the state list corresponds to each message block in the annular message block data area one by one; the message block state list of the master core part and the message block state list of the slave core part are respectively marked as MState and SState and are respectively positioned in a master core address area and a slave core local memory;
a message queue is divided into a master core part and a slave core part;
when the message queue is created, the number of messages that the master core portion and the slave core portion can accommodate is already determined;
the position indices in the message queue control-information layout are placed differently for different queue directions: variables the master core does not need are stored in the slave-core local memory, which reduces accesses by slave-core code to master-core variables and improves the performance of the model.
Further, in a master-to-slave message queue, the states of a message block in the master core portion include: MasterIdle, MasterLocked, MasterReady and MTransferring; the states of a message block in the slave core portion include: SlaveIdle, STransferring, SlaveReady and SlaveLocked; the state information of each message block is stored in the respective memory;
After the message queue is created, all message blocks of the master core part are in a MasterIdle state, and all message blocks of the slave core part are in a SlaveIdle state;
MasterIdle indicates that the message block in the main core is in an idle and allocable state, masterLocked indicates that the message block in the main core is in a locking state, masterReady indicates that the message block in the main core is in a ready and available state, and MTransferring indicates that the message block in the main core is in a transmission state;
SlaveIdle indicates that the message block in the slave core is in an idle and allocable state, STransferring indicates that the message block in the slave core is in a transmission state, slaveReady indicates that the message block in the slave core is in a ready and usable state, and SlaveLocked indicates that the message block in the slave core is in a locking state;
the interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
M1, mAllocateMsg(): obtain the address of a message block in the master core part of a message queue;
M2, mSendMsg(): start the master core transferring a message to the slave core;
M3, mRecvMsg(): receive a message sent by the slave core;
M4, mReleaseMsg(): release a message block of the master core part;
the interfaces the message queue system provides for the slave core application program include:
S1, sRecvMsg(): receive a message sent by the master core;
S2, sReleaseMsg(): release a message block of the slave core part;
S3, sAllocateMsg(): obtain the address of a message block in the slave core part of a message queue;
S4, sSendMsg(): start the slave core transferring a message to the master core;
among these interfaces, M1, M2, S1 and S2 are used for the master core to pass messages to the slave core, and M3, M4, S3 and S4 are used for the slave core to pass messages to the master core.
Further, the sequence of operations for sending a message from the master core to the slave core includes the following steps (a brief usage sketch is given after the sequence):
a1, the master core application program calls mAllocateMsg(); the local memory-based message passing system between the master core and the slave core allocates the idle message block pointed to by the position index IMIdle in the master core part of the message queue, sets the block to the MasterLocked state, cyclically advances IMIdle, and returns the block address MasterMsg to the master core application program;
a2, the master core application program writes the message to be sent into the idle message block pointed to by MasterMsg;
a3, the master core application program calls mSendMsg(); the local memory-based message passing system between the master core and the slave core obtains the first message block MasterMsg pointed to by the position index IMLocked, sets the message block MasterMsg to the MasterReady state, and cyclically advances IMLocked;
a4, the local memory-based message passing system between the master core and the slave core, at an appropriate time, allocates for the message block to be transmitted the idle message block storage space SlaveMsg pointed to by the position index ISIdle in the slave core part, and cyclically advances ISIdle; it acquires the first message block MasterMsg pointed to by the position index IMReady, sets the message block MasterMsg to the MTransferring state, sets SlaveMsg to the STransferring state, and starts DMA to transfer the message block in MasterMsg to SlaveMsg; after the DMA transfer finishes, the local memory-based message passing system between the master core and the slave core sets the message block SlaveMsg pointed to by the slave core message block position index ISTran to the SlaveReady state, and sets the message block MasterMsg pointed to by the master core message block position index IMTran to the MasterIdle state;
a5, the slave core application program calls sRecvMsg(); the message queue returns the message block SlaveMsg pointed to by the position index ISReady of the slave core part to the slave core application program, and sets the message block SlaveMsg to the SlaveLocked state;
a6, the slave core application program reads the content in SlaveMsg;
a7, the slave core application program calls sReleaseMsg(); the message queue sets the slave core message block SlaveMsg to the SlaveIdle state;
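The following minimal C sketch walks through the a1-a7 sequence, assuming the prototypes sketched above; the payload and the use_payload() helper are illustrative only.

#include <string.h>

extern void use_payload(const char *msg);      /* application-specific; hypothetical */

/* Master core side: steps a1-a3. */
void master_send_example(QueueHandle q)
{
    char *MasterMsg = (char *)mAllocateMsg(q);  /* a1: block becomes MasterLocked            */
    memcpy(MasterMsg, "hello slave", 12);       /* a2: write the message into the block      */
    mSendMsg(q);                                /* a3: block becomes MasterReady; the system
                                                   later moves it to the slave part by DMA   */
}

/* Slave core side: steps a5-a7. */
void slave_recv_example(QueueHandle q)
{
    char *SlaveMsg = (char *)sRecvMsg(q);       /* a5: block becomes SlaveLocked             */
    use_payload(SlaveMsg);                      /* a6: read the content in place, no copy    */
    sReleaseMsg(q, SlaveMsg);                   /* a7: block returns to SlaveIdle            */
}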
The sequence of operations for the slave core to send a message to the master core includes the following steps (a brief sketch is likewise given after the sequence):
b1, the slave core application program calls sAllocateMsg(); the local memory-based message passing system between the master core and the slave core allocates the idle message block pointed to by the position index ISIdle in the slave core part of the message queue, sets the block to the SlaveLocked state, cyclically advances ISIdle, and returns the block address SlaveMsg to the slave core application program;
b2, the slave core application program writes the message to be sent into the idle message block pointed to by SlaveMsg;
b3, the slave core application program calls sSendMsg(); the local memory-based message passing system between the master core and the slave core obtains the first message block SlaveMsg pointed to by the position index ISLocked, sets the message block SlaveMsg to the SlaveReady state, and cyclically advances ISLocked;
b4, the local memory-based message passing system between the master core and the slave core, at an appropriate time, allocates for the message block to be transmitted from the slave core part the idle message block storage space MasterMsg pointed to by the position index IMIdle; it sets the message block MasterMsg to the MTransferring state, sets SlaveMsg to the STransferring state, and starts DMA to transfer the message block in SlaveMsg to MasterMsg; after the DMA transfer finishes, the local memory-based message passing system between the master core and the slave core sets the MasterMsg pointed to by the master core message block position index IMTran to the MasterReady state, and sets the SlaveMsg pointed to by the slave core message block position index ISTran to the SlaveIdle state;
b5, the master core application program calls mRecvMsg(); the message queue returns the message block address MasterMsg pointed to by the position index IMReady of the master core part to the master core application program;
b6, the master core application program reads the content in MasterMsg;
b7, the master core application program calls mReleaseMsg(); the message queue sets the master core message block MasterMsg to the MasterIdle state;
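Likewise, a minimal sketch of the b1-b7 sequence in the opposite direction, under the same assumed prototypes and helper.

/* Slave core side: steps b1-b3. */
void slave_send_example(QueueHandle q)
{
    char *SlaveMsg = (char *)sAllocateMsg(q);   /* b1: block becomes SlaveLocked        */
    memcpy(SlaveMsg, "result", 7);              /* b2: write the message into the block */
    sSendMsg(q);                                /* b3: block becomes SlaveReady         */
}

/* Master core side: steps b5-b7. */
void master_recv_example(QueueHandle q)
{
    char *MasterMsg = (char *)mRecvMsg(q);      /* b5: block pointed to by IMReady      */
    use_payload(MasterMsg);                     /* b6: read the content in place        */
    mReleaseMsg(q, MasterMsg);                  /* b7: block returns to MasterIdle      */
}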
the application program of the master core or the slave core directly reads and writes the content of the message block in the memory area managed by the message queue, and the message content is not required to be moved to other memory spaces; therefore, the data moving expense of the message content can be reduced, and the use amount of the secondary core local memory can be effectively reduced;
the master core or slave core application program merely initiates the sending of a message or receives a message, without being concerned with the specific implementation of message transfer between the master core and the slave core; the implementation of message transfer is completed by the local memory-based message passing system between the master core and the slave core; on the one hand this simplifies application program design, and at the same time it gives the application program better portability.
Further, the blocking message transmission process between the master core and the slave core is specifically as follows (a sketch of the receive-side loop is given after the steps):
each message queue maintains a set of DMA requests DMAReqs; the set is initialized to an empty set;
the slave core application program calls the interface sRecvMsg(); within sRecvMsg(), the following steps are performed:
A1. Judge whether the DMA request set DMAReqs of the message queue is empty; if it is not empty, execute step A2, otherwise execute step A3;
A2. Check each request req in DMAReqs in turn and check whether the request req has completed its DMA; if not, ignore it; if it has completed, set the req.SMsg state to SlaveReady, set the req.MMsg state to MasterIdle, and remove req from DMAReqs;
A3. Judge whether a message block SMsg in the SlaveIdle state can be obtained from the slave core part; if so, execute step A4, otherwise directly execute step A5;
A4. Set the state of the message block corresponding to MMsg to the MTransferring state, set the state of the message block corresponding to SMsg to the STransferring state, start an asynchronous DMA request of MsgSize bytes from MMsg to SMsg, add req to DMAReqs, and execute step A3 again;
A5. If a message in the slave core part is in the SlaveReady state, set the earliest SlaveReady message Msg to the SlaveLocked state, return Msg to the application program and end; otherwise, execute step A1;
wherein the DMA request set DMAReqs is initialized to null;
the slave core application program calls the interface sSendMsg(); within sSendMsg(), the following steps are performed:
B1. Judge whether the DMA request set DMAReqs of the message queue is empty; if it is not empty, execute step B2, otherwise execute step B3;
B2. Check each request req in DMAReqs in turn and check whether the request req has completed its DMA; if not, ignore it; if it has completed, set the req.SMsg state to SlaveIdle, set the req.MMsg state to MasterReady, and remove req from DMAReqs;
B3. Judge whether a message block SMsg in the SlaveReady state can be obtained from the slave core part; if so, execute step B4, otherwise directly execute step B5;
B4. Set the state of the message block corresponding to MMsg to the MTransferring state, set the state of the message block corresponding to SMsg to the STransferring state, start an asynchronous DMA request of MsgSize bytes from SMsg to MMsg, add req to DMAReqs, and execute step B3 again;
B5. If the message to be sent in the slave core part is in the SlaveLocked state, set that SlaveLocked message Msg to the SlaveReady state, return Msg to the application program and end; otherwise, execute step B1;
wherein the DMA request set DMAreqs is initialized to null.
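The receive-side loop A1-A5 can be sketched as follows. The data structures, the helper functions (find_*, set_*_state, add_req) and the DMA primitives (dma_start_async, dma_done) are placeholders assumed for this sketch, and the state enumerators are those of the earlier sketch; none of these names is defined by the text. The pairing of a SlaveIdle block with a MasterReady block in step A3/A4 follows the description of the blocking receive given below.

#include <stddef.h>

#define MAX_REQS 8                      /* illustrative capacity of the DMA request set */

typedef struct { void *MMsg, *SMsg; int tag, in_use; } DmaReq;

typedef struct {
    DmaReq DMAReqs[MAX_REQS];           /* per-queue DMA request set, initially empty   */
    size_t MsgSize;                     /* maximum message size in bytes                */
    /* ... MState/SState lists and the position indexes are omitted here ...            */
} MsgQueue;

/* Placeholder helpers standing in for the queue internals and the DMA engine. */
extern void *find_master_block(MsgQueue *q, int state);
extern void *find_slave_block(MsgQueue *q, int state);
extern void  set_master_state(MsgQueue *q, void *blk, int state);
extern void  set_slave_state(MsgQueue *q, void *blk, int state);
extern int   dma_start_async(void *src, void *dst, size_t len);  /* returns a poll tag */
extern int   dma_done(int tag);
extern void  add_req(MsgQueue *q, void *MMsg, void *SMsg, int tag);

void *sRecvMsg_sketch(MsgQueue *q)
{
    for (;;) {
        /* A1/A2: retire every finished DMA request in DMAReqs. */
        for (int i = 0; i < MAX_REQS; i++) {
            DmaReq *req = &q->DMAReqs[i];
            if (req->in_use && dma_done(req->tag)) {
                set_slave_state(q, req->SMsg, SlaveReady);
                set_master_state(q, req->MMsg, MasterIdle);
                req->in_use = 0;                       /* remove req from DMAReqs */
            }
        }
        /* A3/A4: while a SlaveIdle block and a MasterReady block exist, start a DMA. */
        for (;;) {
            void *SMsg = find_slave_block(q, SlaveIdle);
            void *MMsg = find_master_block(q, MasterReady);
            if (SMsg == NULL || MMsg == NULL)
                break;
            set_master_state(q, MMsg, MTransferring);
            set_slave_state(q, SMsg, STransferring);
            add_req(q, MMsg, SMsg, dma_start_async(MMsg, SMsg, q->MsgSize));
        }
        /* A5: hand the earliest SlaveReady block to the application, else loop to A1. */
        void *Msg = find_slave_block(q, SlaveReady);
        if (Msg != NULL) {
            set_slave_state(q, Msg, SlaveLocked);
            return Msg;
        }
    }
}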
Further, the slave core accesses the memory space of the master core in two different ways: direct access and asynchronous DMA transfer; the direct access mode is less efficient and is suitable for small amounts of data; the asynchronous DMA transfer mode comprises two steps, starting the DMA transfer process and querying the DMA result; after starting a DMA transfer, the software system can perform other work without waiting for the DMA to end, and learns whether the DMA has finished by querying the DMA result;
in the blocking send/receive process between the master core and the slave core, the call returns only after the slave core has received the message of the master core; otherwise it keeps waiting for the master core to send a message;
when the slave core receives messages, it starts the DMA transfer process for the messages in the MasterReady state in the master core part; when the slave core has two or more message blocks and the speed at which the master core sends messages is higher than the speed at which the slave core consumes them, the reading of messages by the slave core application program and the DMA transfer process can proceed in parallel.
Further, the message queue is created by the master core, and a new queue handle is generated in both the master core and the slave core; on the master core side, handles are kept partitioned by slave core number; on the slave core side, a unique queue entity can be determined by the handle number handle or the queue name qName; the handle of the same queue on the master core is the same as the handle on the slave core, that is, the queue identified on the master core by (slaveID, handle) and the queue identified on slave core slaveID by handle are the same queue entity; the state of a specific message queue can be queried through its identification number handle, mainly including whether the queue exists, its direction, the message size, and the number of messages currently in the queue;
the interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
m5, mQueryQueue (), inquiring whether a message queue exists;
m6, mQueueDirection (), obtaining the queue direction of the message queue;
m7, mQueueMsgNumInMaster (), obtaining the number of messages which can be accommodated by a control core part of a message queue;
m8, mQueueMsgNumInSlave (), obtaining the number of messages which can be accommodated by a computing core part of a message queue;
m9, mQueueMsgSize(), obtaining the maximum number of bytes of each message in a message queue;
m10, mQueueMsgSlaveMemType(), acquiring the memory type of the slave core part in a message queue;
m11, mQueueMsgNumStatus (), obtaining the dynamic information of the message queue;
m12, mCreateQueue (), creating a message queue;
the interface provided by the local memory-based messaging system between the master and slave cores for the slave core application includes:
s5, sQueryQueue (), inquiring whether a message queue exists;
s6, sQueueDirection () is carried out to obtain the queue direction of the message queue;
s7, sQueueMsgNumInMaster (), obtaining the number of messages which can be accommodated by a control core part of a message queue;
s8, sQueueMsgNumInSlave (), acquiring the number of messages which can be accommodated by a computing core part of a message queue;
s9, sQueueMsgSize(), obtaining the maximum number of bytes of each message in a message queue;
s10, sQueueMsgSlaveMemType(), acquiring the memory type of the slave core part in a message queue;
s11, sQueueMsgNumStatus (), obtaining dynamic information of a message queue;
the interfaces M5-M12 are used for querying related message queue information on the master core, and the interfaces S5-S11 are used for querying related message queue information on the slave core; a usage sketch of the master-side query interfaces follows.
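Since the text gives only the call names, the parameter lists below (slave core number, handle, queue name) are assumptions chosen to match the queue identification described in this document.

#include <stdio.h>

/* Assumed signatures for the query interfaces. */
extern int mQueryQueue(int slaveID, const char *qName);       /* M5  */
extern int mQueueDirection(int slaveID, int handle);          /* M6  */
extern int mQueueMsgNumInMaster(int slaveID, int handle);     /* M7  */
extern int mQueueMsgNumInSlave(int slaveID, int handle);      /* M8  */
extern int mQueueMsgSize(int slaveID, int handle);            /* M9  */
extern int mQueueMsgNumStatus(int slaveID, int handle);       /* M11 */

void report_queue(int slaveID, int handle, const char *qName)
{
    if (!mQueryQueue(slaveID, qName))                          /* M5: does the queue exist?    */
        return;
    printf("dir=%d masterCap=%d slaveCap=%d msgSize=%dB pending=%d\n",
           mQueueDirection(slaveID, handle),                   /* M6: queue direction          */
           mQueueMsgNumInMaster(slaveID, handle),              /* M7: master part capacity     */
           mQueueMsgNumInSlave(slaveID, handle),               /* M8: slave part capacity      */
           mQueueMsgSize(slaveID, handle),                     /* M9: max bytes per message    */
           mQueueMsgNumStatus(slaveID, handle));               /* M11: messages currently held */
}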
Further, when a user creates a message queue, a dedicated message queue handle is generated, and the unique message queue can be obtained through a handle number or a queue name;
the corresponding state information of a message queue can be obtained on both the master core side and the slave core side; because the master core communicates with multiple slave cores, it determines a unique queue entity by (slave core number, handle number) or (slave core number, queue name); on the slave core, the handle number handle or the queue name qName determines the unique queue entity.
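A small C sketch of the two identification schemes described above; the structure layouts are assumptions used only to illustrate the keys.

/* Master core side: a queue entity is identified by (slave core number, handle number)
 * or (slave core number, queue name). */
typedef struct {
    int  slaveID;                 /* slave core the queue connects to            */
    int  handle;                  /* handle number returned at creation          */
    char qName[32];               /* queue name, also usable as part of the key  */
    /* ... queue control information ... */
} MasterQueueEntry;

/* Slave core side: the handle number handle (or the queue name qName) alone
 * identifies the unique queue entity. */
typedef struct {
    int  handle;
    char qName[32];
    /* ... queue control information ... */
} SlaveQueueEntry;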
Further, the following interfaces are provided on different high-performance many-core processors; they cover the steps required for communication between the master core and the slave core, including the slave core management mechanism on the master core, and allow code to be quickly ported to a variety of high-performance many-core processors while performing the corresponding functions; when the code is ported to a new platform, it only needs to be recompiled, specifying the compilation options of the corresponding platform;
the interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
m13, mHaltDevice(), exiting the running environment;
m14, mHMessQueueInit(), initialization method on the control core;
m15, mHMessQueueQuit(), deregistration method on the control core;
m16, mLoadDatFile(), loading an image file to the device, needed only on MT3;
m17, mUnloadDatFile(), unloading the image file from the device, needed only on MT3;
m18, mGetSlaveCoreNum(), obtaining the number of computing cores;
m19, mGetMSize(), obtaining the memory sizes of the control core and the computing core, in bytes;
m20, mGetSlaveSIMDLanes(), obtaining the number of lanes processed in parallel by the SIMD instructions of a computing core;
m21, mInitDevice(), loading the running environment of the acceleration device;
m22, mTinitThreadID(), obtaining the initialized thread data structure;
m23, mStartSlaveThread(), creating, starting and binding a thread group of computing cores;
m24, mWaitSlaveThreads(), waiting for the thread group to terminate;
m25, mDestroySlaveThreads(), closing the thread group;
m26, mSlaveThreadActive(), obtaining whether a computing core thread is active;
The interface provided by the local memory-based messaging system between the master and slave cores for the slave core application includes:
s12, sHMessQueueInit(), initializing the message queue mechanism on the slave core;
s13, deregistration method of the computing core part cache;
s14, sGetSlaveNum(), obtaining the number of computing cores;
s15, sGetSlaveID(), obtaining the number of the current computing core;
s16, obtaining the maximum number of bytes of each message in a message queue;
s17, sSIMDLanes(), obtaining the number of lanes processed in parallel by the SIMD instructions of a computing core;
M13-M20 are used for querying related information on the master core, M21-M26 are used for managing slave core threads on the master core, and S12-S20 are used for querying related information on the slave core;
the interfaces cover the functions required by the current different high-performance many-core processors; the provided interface set I is the union of the underlying interface sets L_i of the different high-performance many-core processors, that is, I = L_1 ∪ L_2 ∪ …, 1 ≤ i; if the underlying library L_i of some processor has no function corresponding to an interface in I, the upper-layer call of that interface is an empty function, so the code is not adversely affected;
Through programming with macro definitions, different processors correspond to different predefined macros, and the appropriate predefined macro is set for each processor; calling the same interface on different high-performance many-core processors then achieves the same effect, and the differences between the underlying libraries of different many-core processors are encapsulated. A minimal sketch of this technique follows.
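The sketch below assumes two hypothetical platform macros (PLATFORM_A, PLATFORM_B) and placeholder underlying calls; none of the platform or library names come from the text.

/* Exactly one platform macro is predefined by the build, e.g.
 *   cc -DPLATFORM_A ...    or    cc -DPLATFORM_B ...                        */
#if defined(PLATFORM_A)
  #include "platform_a_runtime.h"            /* placeholder header            */
  static inline void mLoadDatFile(const char *path) {
      platform_a_load_image(path);           /* placeholder underlying call   */
  }
#elif defined(PLATFORM_B)
  #include "platform_b_runtime.h"            /* placeholder header            */
  static inline void mLoadDatFile(const char *path) {
      platform_b_upload(path);               /* placeholder underlying call   */
  }
#else
  /* The platform's underlying library has no corresponding function:
   * the upper-layer call degenerates to an empty function, so callers
   * are not affected. */
  static inline void mLoadDatFile(const char *path) { (void)path; }
#endif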
In a specific embodiment, the implementation flow of the local memory-based message passing system between the master core and the slave core, shown in fig. 6, comprises the following steps (a master-side sketch is given after the steps):
Step 1, determine the platform t_1 on which the program runs;
Step 2, the programming model initializes the master core m_i and each slave core s_i,j in the corresponding slave core set S_i through the initialization mechanism, and starts the slave core thread management mechanism, where s_i,j ∈ S_i; a master core m_i may manage the slave core set S_i;
Step 3, for the platform t_1 on which the code runs, create the corresponding message queue set Q in the memory of the master core m_i according to the message passing interface of the invention, and use the queue interface to establish the connection between the master core and the slave core;
Step 4, the master core/slave core uses the message mechanism of the invention to send a message r to the message queue q_i, where q_i ∈ Q, obtaining a message sequence R so that messages r are sent in order;
Step 5, the slave core/master core selects the corresponding message r_i from R according to the related information of the queue, where r_i ∈ R and 1 ≤ i ≤ |R|, and the memory used by the message is released from the message queue; the data sent by the slave core is received and processed accordingly;
Step 6, after the slave core s_i,j has processed its data, it deregisters the slave core cache; the master core m_i reclaims the thread of s_i,j and checks whether tasks remain; if not, the master core deregisters the cache, and the multithreaded program ends in parallel.
Step 7, to port the program to another platform t_2, recompile it and specify the compilation options of the corresponding platform; the code can then run.
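A master-side sketch of steps 1-7 follows, written against the interface names listed earlier; the parameter lists and concrete values are assumptions, since the text names the creation parameters (qName, slaveID, msgSize, mSize, sSize, mQaddr, sType, direction) but not their order or types.

/* Assumed signatures; the text lists only the interface names. */
extern int  mInitDevice(void);                               /* m21 */
extern int  mHMessQueueInit(void);                           /* m14 */
extern int  mCreateQueue(int slaveID, const char *qName, int msgSize,
                         int mSize, int sSize, void *mQaddr,
                         int sType, int direction);          /* m12 */
extern int  mStartSlaveThread(void);                         /* m23 */
extern int  mWaitSlaveThreads(void);                         /* m24 */
extern int  mDestroySlaveThreads(void);                      /* m25 */
extern void mHMessQueueQuit(void);                           /* m15 */
extern void mHaltDevice(void);                               /* m13 */

int main(void)
{
    mInitDevice();                   /* step 2: load the running environment        */
    mHMessQueueInit();               /* step 2: initialise the queue mechanism      */

    static char qMem[64 * 1024];     /* step 3: master-side memory for one queue    */
    int handle = mCreateQueue(0, "q0", 256, 8, 4, qMem, 0, 0);  /* illustrative values */

    mStartSlaveThread();             /* start the slave core thread group           */
    /* steps 4-5: exchange messages with mSendMsg()/mRecvMsg() as sketched earlier. */
    mWaitSlaveThreads();             /* step 6: wait for the thread group           */
    mDestroySlaveThreads();          /* step 6: close the thread group              */
    mHMessQueueQuit();               /* step 6: deregister the queue mechanism      */
    mHaltDevice();                   /* exit the running environment                */
    (void)handle;
    return 0;
}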
In a specific embodiment, on a variety of high-performance many-core processors, that is, architectures with a small number of master cores and many slave cores in which the slave cores use local memory without cache coherence, master-slave core communication programming is needed. The invention can effectively improve the portability of application software and the research and development capability of high-performance computing software.
In a specific embodiment, the password guessing procedure is performed by using a high-performance many-core processor, specifically as follows:
In this embodiment, an MD5 password guessing program needs to run on multiple many-core processors. The ciphertext used in the experiment is 25d55ad283aa400af464c76d713c07ad, and the corresponding password is 12345678. The many-core processors adopt different organization structures; without the invention, the code would have to be restructured twice, once for each processor. Based on the invention, two queues, 'play' and 'result', are established between the main processor and slave cores 1-N of the many-core processors. The multithreading model is then used for communication, and the program can be directly ported to and run on multiple many-core processors. The handles formed are shown in Table 1 (a queue-creation sketch follows the table):
TABLE 1 (the table of handles is shown as an image in the original publication and is not reproduced here)
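A sketch of how the two queues of this embodiment could be created for slave cores 1-N, using the creation parameters described earlier; the number of slave cores, message sizes, capacities and direction encodings are illustrative assumptions.

#define N_SLAVES 4                        /* illustrative number of slave cores   */
#define MSG_SIZE 64                       /* illustrative message size in bytes   */

extern int mCreateQueue(int slaveID, const char *qName, int msgSize,
                        int mSize, int sSize, void *mQaddr,
                        int sType, int direction);      /* assumed signature */

static char playMem[N_SLAVES][8 * MSG_SIZE];            /* master-side queue memory */
static char resultMem[N_SLAVES][8 * MSG_SIZE];

void create_guessing_queues(int playHandle[], int resultHandle[])
{
    for (int i = 0; i < N_SLAVES; i++) {
        /* master -> slave: candidate passwords to test ("play" queue)   */
        playHandle[i]   = mCreateQueue(i, "play",   MSG_SIZE, 8, 4,
                                       playMem[i],   0, 0 /* to slave  */);
        /* slave -> master: cracking results ("result" queue)            */
        resultHandle[i] = mCreateQueue(i, "result", MSG_SIZE, 8, 4,
                                       resultMem[i], 0, 1 /* to master */);
    }
}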
As shown in fig. 6, the trends of the three test methods based on the present invention are basically consistent with the trend of the password guessing algorithm that does not use the invention, and the performance is level with that of the code that does not use the invention. The invention therefore increases portability without affecting program performance.

Claims (10)

1. A local memory-based message passing system between a master core and slave cores, characterized by comprising a master core set M, whose members are denoted m_1, …, m_|M|, where |M| represents the number of master cores in the master core set M; a master core m_i corresponds to one or more slave core sets S_a, satisfying |S_a| = |S_b|, where |S_a| represents the number of slave cores in the slave core set S_a, 1 ≤ a, b ≤ |M|;
a master core m_i may manage the slave core set S_i via the slave core thread management interface, 1 ≤ i ≤ |M|;
wherein, when creating the k-th message queue q_i,j,k from the i-th master core m_i to the j-th slave core s_i,j in the i-th slave core set S_i, where s_i,j ∈ S_i, 1 ≤ j ≤ |S_i|, 1 ≤ k, the calling interface can be used to create the corresponding message queue q_i,j,k in the memory of the master core m_i and the slave core s_i,j; the message queues q_i,j,k of all master cores m_i to slave cores s_i,j constitute the set Q, q_i,j,k ∈ Q, completing the connection between the master core m_i and the slave core s_i,j;
the master core m_i or the slave core s_i,j sends a series of messages r_x through the message sending mechanism to the message queue q_i,j,k, obtaining a message sequence set R whose messages are sent in order, where 1 ≤ x and r_x ∈ R;
the slave core s_i,j or the master core m_i selects the corresponding message r_x from the message sequence R according to the related information of the message queue q_i,j,k, where r_x ∈ R, 1 ≤ x ≤ |R|; after the user obtains the message r_x and completes the user-defined processing of the message r_x, the message queue q_i,j,k releases the memory used by the message r_x;
after the slave core s_i,j has processed its data, the cache of the slave core s_i,j is deregistered; the master core m_i reclaims the thread of the slave core s_i,j and continues processing the tasks of the master core m_i; if there are no tasks, the master core m_i deregisters the cache, and the multithreaded program ends in parallel.
2. The local memory-based inter-master and slave messaging system according to claim 1, wherein the creation of a message queue on the master requires specifying the following parameters:
the message queue name qName of the character string type, the slave core number slave ID connected, the message size msgSize, the message quantity mSize contained in the master core part of the message queue, the message quantity sSize contained in the slave core part, the starting address mQaddr of the master core message queue, the memory type sType occupied by the message queue in the slave core and the direction of the message queue; after the call is successful, a handle number handle is returned;
Wherein the master core identifies a queue entity with (slave core number, handle number) or (slave core number, queue name); the slave core takes the handle number or the queue name as the unique identification number of the queue to determine a unique queue entity; the handles of the same queue on the master core are the same as the handles on the slave cores;
the message queue is only used for communication between the master core and the slave core, and the user can specify the slave core slaveID where the queue is located; between a pair of master core and slave core, a plurality of different message queues may exist;
the size of each message in the message queue is not greater than msgSize bytes;
a message queue is distributed in a main core memory and a local memory of a slave core, and the number of the messages held by the main core memory and the local memory is mSize and sSize respectively;
the initial address of the message queue on the main core memory is a continuous memory space designated by an application program, and the initial address is mQaddr;
if the local memories on the slave cores are of different types, the type of local memory occupied by the message queue may be specified by the slave core memory type sType;
the message queue adopts one direction, and is divided into a main core writing/reading direction and a slave core writing/reading direction, and the directions are specified by the direction parameters;
The master core can create a plurality of message queues between the master core and one slave core, and the message queues between the master core and all the slave cores form a message queue set;
the master core completes the control of the slave core thread according to the slave core thread management interface, mainly creates and starts a slave core thread group for the interface, waits for the thread group to terminate, closes the thread group and loads an image file to the device by the master core.
3. The local memory-based message passing system between a master core and a slave core according to claim 2, wherein a message queue has a continuous block of memory space for storing message contents in both the master core portion and the slave core portion; the numbers of messages that the two portions can accommodate are mSize and sSize respectively, and the memory capacities occupied are mSize x msgSize bytes and sSize x msgSize bytes respectively; the capacity of the slave core portion of the message queue is limited by the capacity of the local memory;
the control information layout of each message queue is divided into two parts: a status list and a location index;
the position index is divided into: IMTran, IMReady, IMLocked and IMIdle associated with the master core location, ISTran, ISReady, ISLocked and ISIdle associated with the slave core location; according to different message queue directions, different designs are also provided, and in the message queue control information layout of the master core to the slave core, IMLocked and IMIdle are stored in a master core address area; IMTran, IMReady and the remaining 4 location indices are located in the slave core local memory; while IMReady, IMLocked and IMIdle are stored in the master core address area in the message queue control information layout sent from the slave core to the master core; the IMTran and the other 4 position indexes are both located in the slave core local memory;
IMTran indicates that the first message block state in the main core space is the message position index in transmission; IMReady represents the message location index in the main core space where the first message block state is ready for a message; IMLocked represents the first message block state in the main core space as the message position index in the message lock; IMIdle indicates that the first message block state in the main core space is the message position index in the message idle;
ISTran indicates that the first message block state in kernel space is the message location index in transmission; ISReady represents the message location index that is ready for a message from the first message block state in the kernel space; ISLocked denotes the message location index from the first message block state in core space in message lock; ISIdle indicates that the first message block state in the kernel space is the message position index in the message idle;
each state in the state list corresponds to each message block in the annular message block data area one by one; the message block state list of the master core part and the message block state list of the slave core part are respectively marked as MState and SState and are respectively positioned in a master core address area and a slave core local memory;
a message queue is divided into a master core part and a slave core part;
At the time of message queue creation, the number of messages that the master core portion and the slave core portion can accommodate has been determined.
4. The local memory-based message passing system between a master core and a slave core according to claim 3, wherein in the message queue from the master core to the slave core, the state of a message block in the master core portion includes: MasterIdle, MasterLocked, MasterReady, MTransferring; the state of a message block in the slave core portion includes: SlaveIdle, STransferring, SlaveReady, SlaveLocked; the state information of each message block is stored in the memory of its own side;
after the message queue is created, all message blocks of the master core part are in a MasterIdle state, and all message blocks of the slave core part are in a SlaveIdle state;
MasterIdle indicates that the message block in the main core is in an idle and allocable state, masterLocked indicates that the message block in the main core is in a locking state, masterReady indicates that the message block in the main core is in a ready and available state, and MTransferring indicates that the message block in the main core is in a transmission state;
SlaveIdle indicates that the message block in the slave core is in an idle and allocable state, STransferring indicates that the message block in the slave core is in a transmission state, slaveReady indicates that the message block in the slave core is in a ready and usable state, and SlaveLocked indicates that the message block in the slave core is in a locking state;
The interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
m1, mAllocateMsg(), obtaining the address of a message block in the master core part of a message queue;
m2, mSendMsg(), starting the master core to transfer a message to the slave core;
m3, mRecvMsg(), receiving a message sent by the slave core;
m4, mReleaseMsg(), releasing a message of the master core part;
the interfaces the message queue system provides for the slave core application program include:
s1, sRecvMsg(), receiving a message sent by the master core;
s2, sReleaseMsg(), releasing a message of the slave core part;
s3, sAllocateMsg(), obtaining the address of a message block in the slave core part of a message queue;
s4, sSendMsg(), starting the slave core to transmit a message to the master core;
among the above interfaces, M1, M2, S1 and S2 are used for the master core to transfer messages to the slave core, and M3, M4, S3 and S4 are used for the slave core to transfer messages to the master core.
5. The local memory-based inter-master and slave messaging system according to claim 4, wherein the sequence of operations for transmitting a message from the master to the slave comprises:
a1, the master core application program calls mAllocateMsg(); the local memory-based message passing system between the master core and the slave core allocates the idle message block pointed to by the position index IMIdle in the master core part of the message queue, sets the block to the MasterLocked state, cyclically advances IMIdle, and returns the block address MasterMsg to the master core application program;
a2, the master core application program writes the message to be sent into the idle message block pointed to by MasterMsg;
a3, the master core application program calls mSendMsg(); the local memory-based message passing system between the master core and the slave core obtains the first message block MasterMsg pointed to by the position index IMLocked, sets the message block MasterMsg to the MasterReady state, and cyclically advances IMLocked;
a4, the local memory-based message passing system between the master core and the slave core, at an appropriate time, allocates for the message block to be transmitted the idle message block storage space SlaveMsg pointed to by the position index ISIdle in the slave core part, and cyclically advances ISIdle; it acquires the first message block MasterMsg pointed to by the position index IMReady, sets the message block MasterMsg to the MTransferring state, sets SlaveMsg to the STransferring state, and starts DMA to transfer the message block in MasterMsg to SlaveMsg; after the DMA transfer finishes, the local memory-based message passing system between the master core and the slave core sets the message block SlaveMsg pointed to by the slave core message block position index ISTran to the SlaveReady state, and sets the message block MasterMsg pointed to by the master core message block position index IMTran to the MasterIdle state;
a5, the slave core application program calls sRecvMsg(); the message queue returns the message block SlaveMsg pointed to by the position index ISReady of the slave core part to the slave core application program, and sets the message block SlaveMsg to the SlaveLocked state;
a6, the slave core application program reads the content in SlaveMsg;
a7, the slave core application program calls sReleaseMsg(); the message queue sets the slave core message block SlaveMsg to the SlaveIdle state;
the sequence of operations for the slave core to send a message to the master core includes:
b1, the slave core application program calls sAllocateMsg(); the local memory-based message passing system between the master core and the slave core allocates the idle message block pointed to by the position index ISIdle in the slave core part of the message queue, sets the block to the SlaveLocked state, cyclically advances ISIdle, and returns the block address SlaveMsg to the slave core application program;
b2, the slave core application program writes the message to be sent into the idle message block pointed to by SlaveMsg;
b3, the slave core application program calls sSendMsg(); the local memory-based message passing system between the master core and the slave core obtains the first message block SlaveMsg pointed to by the position index ISLocked, sets the message block SlaveMsg to the SlaveReady state, and cyclically advances ISLocked;
b4, the local memory-based message passing system between the master core and the slave core, at an appropriate time, allocates for the message block to be transmitted from the slave core part the idle message block storage space MasterMsg pointed to by the position index IMIdle; it sets the message block MasterMsg to the MTransferring state, sets SlaveMsg to the STransferring state, and starts DMA to transfer the message block in SlaveMsg to MasterMsg; after the DMA transfer finishes, the local memory-based message passing system between the master core and the slave core sets the MasterMsg pointed to by the master core message block position index IMTran to the MasterReady state, and sets the SlaveMsg pointed to by the slave core message block position index ISTran to the SlaveIdle state;
b5, the master core application program calls mRecvMsg(); the message queue returns the message block address MasterMsg pointed to by the position index IMReady of the master core part to the master core application program;
b6, the master core application program reads the content in MasterMsg;
b7, the master core application program calls mReleaseMsg(); the message queue sets the master core message block MasterMsg to the MasterIdle state.
6. The local memory-based message passing system between a master core and a slave core according to claim 4, wherein the blocking message transmission procedure between the master core and the slave core is specifically as follows:
Message queues will maintain a set of DMA requests DMAReqs in each message queue; the set is initialized to an empty set;
the slave core application program calls the interface sRecvMsg(); within sRecvMsg(), the following steps are performed:
A1. Judge whether the DMA request set DMAReqs of the message queue is empty; if it is not empty, execute step A2, otherwise execute step A3;
A2. Check each request req in DMAReqs in turn and check whether the request req has completed its DMA; if not, ignore it; if it has completed, set the req.SMsg state to SlaveReady, set the req.MMsg state to MasterIdle, and remove req from DMAReqs;
A3. Judge whether a message block SMsg in the SlaveIdle state can be obtained from the slave core part; if so, execute step A4, otherwise directly execute step A5;
A4. Set the state of the message block corresponding to MMsg to the MTransferring state, set the state of the message block corresponding to SMsg to the STransferring state, start an asynchronous DMA request of MsgSize bytes from MMsg to SMsg, add req to DMAReqs, and execute step A3 again;
A5. If a message in the slave core part is in the SlaveReady state, set the earliest SlaveReady message Msg to the SlaveLocked state, return Msg to the application program and end; otherwise, execute step A1;
wherein the DMA request set DMAReqs is initialized to null;
the slave core application program calls the interface sSendMsg(); within sSendMsg(), the following steps are performed:
B1. Judge whether the DMA request set DMAReqs of the message queue is empty; if it is not empty, execute step B2, otherwise execute step B3;
B2. Check each request req in DMAReqs in turn and check whether the request req has completed its DMA; if not, ignore it; if it has completed, set the req.SMsg state to SlaveIdle, set the req.MMsg state to MasterReady, and remove req from DMAReqs;
B3. Judge whether a message block SMsg in the SlaveReady state can be obtained from the slave core part; if so, execute step B4, otherwise directly execute step B5;
B4. Set the state of the message block corresponding to MMsg to the MTransferring state, set the state of the message block corresponding to SMsg to the STransferring state, start an asynchronous DMA request of MsgSize bytes from SMsg to MMsg, add req to DMAReqs, and execute step B3 again;
B5. If the message to be sent in the slave core part is in the SlaveLocked state, set that SlaveLocked message Msg to the SlaveReady state, return Msg to the application program and end; otherwise, execute step B1;
wherein the DMA request set DMAreqs is initialized to null.
7. The local memory-based inter-master and slave core messaging system according to claim 6, wherein the slave core accesses the master core's memory space in two different ways, direct access and asynchronous DMA transfer; the asynchronous DMA transmission mode comprises two steps of starting a DMA transmission process and inquiring a DMA result; after starting DMA transmission, the software system finishes other works without waiting for the end of DMA, and knows whether the DMA is finished or not by inquiring the DMA result;
in the blocking send/receive process between the master core and the slave core, the call returns only after the slave core has received the message of the master core; otherwise it keeps waiting for the master core to send a message;
when the slave core receives messages, it starts the DMA transfer process for the messages in the MasterReady state in the master core part; when the slave core has two or more message blocks and the speed at which the master core sends messages is higher than the speed at which the slave core consumes them, the reading of messages by the slave core application program and the DMA transfer process are completed in parallel.
8. The local memory-based message passing system between a master core and a slave core according to claim 1, wherein the message queue is created by the master core, and a new queue handle is generated in both the master core and the slave core; on the master core side, handles are kept partitioned by slave core number; on the slave core side, a unique queue entity can be determined by the handle number handle or the queue name qName; the handle of the same queue on the master core is the same as the handle on the slave core, that is, the queue identified on the master core by (slaveID, handle) and the queue identified on slave core slaveID by handle are the same queue entity; the state of a specific message queue can be queried through its identification number handle, mainly including whether the queue exists, its direction, the message size, and the number of messages currently in the queue;
the interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
m5, mQueryQueue (), inquiring whether a message queue exists;
m6, mQueueDirection (), obtaining the queue direction of the message queue;
m7, mQueueMsgNumInMaster (), obtaining the number of messages which can be accommodated by a control core part of a message queue;
m8, mQueueMsgNumInSlave(), obtaining the number of messages which can be accommodated by the computing core part of a message queue;
m9, mQueueMsgSize(), obtaining the maximum number of bytes of each message in a message queue;
m10, mQueueMsgSlaveMemType(), acquiring the memory type of the slave core part in a message queue;
m11, mQueueMsgNumStatus (), obtaining the dynamic information of the message queue;
m12, mCreateQueue (), creating a message queue;
the interface provided by the local memory-based messaging system between the master and slave cores for the slave core application includes:
s5, sQueryQueue (), inquiring whether a message queue exists;
s6, sQueueDirection () is carried out to obtain the queue direction of the message queue;
s7, sQueueMsgNumInMaster (), obtaining the number of messages which can be accommodated by a control core part of a message queue;
s8, sQueueMsgNumInSlave (), acquiring the number of messages which can be accommodated by a computing core part of a message queue;
s9, sQueueMsgSize(), obtaining the maximum number of bytes of each message in a message queue;
s10, sQueueMsgSlaveMemType(), acquiring the memory type of the slave core part in a message queue;
s11, sQueueMsgNumStatus (), obtaining dynamic information of a message queue;
The interfaces M5-M12 are used for inquiring the related message queue information on the master core, and the interfaces S5-S11 are used for inquiring the related message queue information on the slave core.
9. The local memory-based message passing system between a master core and a slave core of claim 1 wherein a unique message queue handle is generated when a user creates a message queue, the unique message queue being obtained by a handle number or a queue name;
the corresponding state information of a message queue can be obtained on both the master core side and the slave core side; because the master core communicates with multiple slave cores, it determines a unique queue entity by (slave core number, handle number) or (slave core number, queue name); on the slave core, the handle number handle or the queue name qName determines the unique queue entity.
10. The local memory-based message passing system between a master core and a slave core according to claim 1, wherein interfaces are provided for different high-performance many-core processors, covering the steps required for communication between the master core and the slave core, including a slave core management mechanism on the master core; with these interfaces, code can be quickly ported to a plurality of high-performance many-core processors while performing the corresponding functions; when the code is ported to a new platform, it only needs to be recompiled, with the compilation options of the corresponding platform specified during compilation;
The interface provided by the local memory-based message passing system between the master core and the slave core for the master core application program comprises:
m13, mHaltDevice(), exiting the running environment;
m14, mHMessQueueInit(), initialization method on the control core;
m15, mHMessQueueQuit(), deregistration method on the control core;
m16, mLoadDatFile(), loading an image file to the device, needed only on MT3;
m17, mUnloadDatFile(), unloading the image file from the device, needed only on MT3;
m18, mGetSlaveCoreNum(), obtaining the number of computing cores;
m19, mGetMSize(), obtaining the memory sizes of the control core and the computing core, in bytes;
m20, mGetSlaveSIMDLanes(), obtaining the number of lanes processed in parallel by the SIMD instructions of a computing core;
m21, mInitDevice(), loading the running environment of the acceleration device;
m22, mTinitThreadID(), obtaining the initialized thread data structure;
m23, mStartSlaveThread(), creating, starting and binding a thread group of computing cores;
m24, mWaitSlaveThreads(), waiting for the thread group to terminate;
m25, mDestroySlaveThreads(), closing the thread group;
m26, mSlaveThreadActive(), obtaining whether a computing core thread is active;
The interface provided by the local memory-based messaging system between the master and slave cores for the slave core application includes:
s12, sHMessQueueInit(), initializing the message queue mechanism on the slave core;
s13, deregistration method of the computing core part cache;
s14, sGetSlaveNum(), obtaining the number of computing cores;
s15, sGetSlaveID(), obtaining the number of the current computing core;
s16, obtaining the maximum number of bytes of each message in a message queue;
s17, sSIMDLanes(), obtaining the number of lanes processed in parallel by the SIMD instructions of a computing core;
M13-M20 are used for querying related information on the master core, M21-M26 are used for managing slave core threads on the master core, and S12-S20 are used for querying related information on the slave core.
CN202310075604.1A 2023-02-01 2023-02-01 Message transmission system between master core and slave core based on local memory Pending CN116302592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310075604.1A CN116302592A (en) 2023-02-01 2023-02-01 Message transmission system between master core and slave core based on local memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310075604.1A CN116302592A (en) 2023-02-01 2023-02-01 Message transmission system between master core and slave core based on local memory

Publications (1)

Publication Number Publication Date
CN116302592A true CN116302592A (en) 2023-06-23

Family

ID=86836873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310075604.1A Pending CN116302592A (en) 2023-02-01 2023-02-01 Message transmission system between master core and slave core based on local memory

Country Status (1)

Country Link
CN (1) CN116302592A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination