CN112463711B - Slave-core software cache sharing method for many-core processor - Google Patents

Slave-core software cache sharing method for many-core processor

Info

Publication number: CN112463711B (application number CN202011439357.1A)
Authority: CN (China)
Prior art keywords: core, computing, page, management, data
Legal status: Active (the listed status is an assumption by Google, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN112463711A
Inventors: 杨海龙, 陈邦铎, 敦明
Current and original assignee: Beihang University
Events: application filed by Beihang University; priority to CN202011439357.1A; publication of CN112463711A; application granted; publication of CN112463711B; anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163: Interprocessor communication
    • G06F 15/167: Interprocessor communication using a common memory, e.g. mailbox
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/52: Program synchronisation; Mutual exclusion, e.g. by means of semaphores
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a slave-core software cache sharing method for a many-core processor, which comprises the following steps: 1) reasonably dividing the slave cores into computing cores, routing cores, and management cores, with each type of core bearing a different task; 2) managing data in pages, and supporting conversion between global main-memory addresses and local software-cache addresses; 3) at runtime, swapping data in and out of the slave-core software cache automatically, through detection, without manual control by the user; 4) providing a cache coherence protocol suited to the slave cores of the many-core processor Shenwei SW26010, solving the problem of data consistency among the slave-core software caches; 5) implementing reduction, lock, and barrier operation primitives based on inter-core register communication on the many-core processor Shenwei SW26010. The invention realizes sharing of the slave-core software cache, avoids the complexity of manually managing the software cache, and thereby improves the efficiency of software development on the many-core processor Shenwei SW26010.

Description

Slave-core software cache sharing method for many-core processor
Technical Field
The invention relates to the fields of parallel software development for the many-core processor Shenwei SW26010, slave-core software cache management, and slave-core software cache coherence, and in particular to a data sharing method for the slave-core software cache.
Background
The Shenwei SW26010 processor is the first independently developed many-core processor in China and adopts an on-chip fused heterogeneous many-core architecture. The processor contains four core groups (CGs); each core group consists of one master core (MPE) and an 8 x 8 array of slave cores (CPEs). Each core group has its own memory controller (MC) and an independent memory space; the connection between a core group and its memory goes through the memory controller allocated to that core group.
In addition, each master core (MPE) has a 32KB L1 instruction cache, a 32KB L1 data cache, and a 256KB mixed instruction/data L2 cache. Each compute core (CPE) has its own 16KB L1 instruction cache and a 64KB local data memory (LDM) that serves as a software cache: the LDM is allocated under user control, and cache coherence is not provided by hardware but must be guaranteed by the user. In the compute-core array, every two adjacent rows of compute cores share a DMA controller responsible for transferring data between the software cache and main memory. A slave core can access main memory directly, but since it has no hardware data cache, such accesses are inefficient; because the software cache is on-chip with low access latency, in practice data in main memory is first loaded into the LDM and then used directly from the software cache, improving program execution performance. However, since all DMA operations must be written explicitly in the source program, and there is no consistency guarantee between software caches, the programmability of such programs suffers greatly.
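The explicit pattern the background describes, DMA a block into the LDM, compute on the fast local copy, DMA it back, can be emulated in plain C. In this sketch `memcpy` stands in for the DMA engine, and all names (`ldm_get`, `ldm_put`, `scale_block`) and sizes are illustrative, not the real athread API:

```c
/* Emulation of the manual LDM management pattern the invention replaces.
 * memcpy stands in for the DMA get/put a real SW26010 slave core would
 * issue; names and sizes are illustrative only. */
#include <stddef.h>
#include <string.h>

#define LDM_BYTES 1024           /* stand-in for the 64KB LDM */

typedef struct {
    unsigned char buf[LDM_BYTES];
} ldm_t;

/* Explicit "DMA get": load a block of main memory into the LDM buffer. */
static void ldm_get(ldm_t *ldm, const unsigned char *main_mem,
                    size_t off, size_t len) {
    memcpy(ldm->buf, main_mem + off, len);
}

/* Explicit "DMA put": write the LDM buffer back to main memory. */
static void ldm_put(const ldm_t *ldm, unsigned char *main_mem,
                    size_t off, size_t len) {
    memcpy(main_mem + off, ldm->buf, len);
}

/* Typical manual sequence: get, compute on the local copy, put. */
static void scale_block(unsigned char *main_mem, size_t off, size_t len,
                        int factor) {
    ldm_t ldm;
    ldm_get(&ldm, main_mem, off, len);
    for (size_t i = 0; i < len; i++)
        ldm.buf[i] = (unsigned char)(ldm.buf[i] * factor);
    ldm_put(&ldm, main_mem, off, len);
}
```

The method of the invention aims to make exactly these explicit get/put calls implicit, driven by the page protocol of the Disclosure section.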
Disclosure of Invention
In order to solve the above technical problem, the invention provides a slave-core software cache sharing method for a many-core processor, which comprises the following steps:
Step 1: the application is started by the master core, and the user specifies the functions to be run on the slave cores;
Step 2: after the slave cores start running, each slave core obtains its core division type according to its own core number, and calls the initialization function of that core type to complete the corresponding initialization work;
Step 3: the different types of slave cores enter their corresponding execution parts: the routing cores are responsible for data forwarding between the computing cores and the management core; the management core stores the management information of all data, processes the consistency transactions of all data, and serves as the lock manager; the computing cores are responsible for the actual computing tasks;
Step 4: a computing core accesses data through its global address, which is translated to a local address; if the data corresponding to the global address is in the slave-core software cache and the corresponding page type matches the access type, the local address is returned directly; if the data to be accessed is not in the slave-core software cache, or the page type does not match the access type, different processing is performed according to the access type, so that the data page to be accessed is swapped into the software cache and, depending on the circumstances, a data page is selected and swapped out to main memory;
Step 5: mutual exclusion among different computing cores is realized with a lock mechanism;
Step 6: synchronization among different computing cores is realized with a fence (barrier) mechanism;
Step 7: reduction operations are performed with the reduction primitive, which returns the reduction result;
Step 8: steps 4 to 7 are repeated until the computing core completes its computing task;
Step 9: after a computing core completes its computing task, it first swaps the data in its software cache out to main memory, then sends an exit request to the management core, performs its cleanup work, and exits; the management core counts the exit requests, and when the count equals the number of computing cores it sends an exit command to the routing cores; each routing core performs its cleanup work and exits upon receiving the command, after which the management core also performs its cleanup work and exits.
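The global-to-local translation at the heart of step 4 can be sketched as a lookup over per-slot page metadata in the LDM. The page size, slot count, type encoding, and matching rule below are illustrative assumptions, not values taken from the patent:

```c
/* Sketch of the global-to-local address translation of step 4.
 * All sizes and the page-type encoding are assumptions for illustration. */
#include <stdint.h>

#define PAGE_SHIFT 8                 /* 256-byte pages (assumed) */
#define N_SLOTS    8                 /* software-cache slots in the LDM (assumed) */

typedef enum {                       /* page types used by the protocol */
    PG_INVALID, PG_READ_ONLY, PG_WRITE_ONCE, PG_READ, PG_WRITE
} page_type_t;

typedef struct {
    uint32_t    gpage;               /* global page number cached in this slot */
    page_type_t type;
} slot_t;

typedef struct {
    slot_t slots[N_SLOTS];
} cache_map_t;

/* Assumed matching rule: any valid page satisfies a read; a write needs
 * a writable page type (a read page must first be promoted, case (4)). */
static int type_matches(page_type_t t, int is_write) {
    if (t == PG_INVALID) return 0;
    return is_write ? (t == PG_WRITE_ONCE || t == PG_WRITE) : 1;
}

/* Return the LDM slot holding gaddr with a compatible type, or -1 on a
 * miss; on a miss the caller must run the swap-in protocol of step 4. */
static int translate(const cache_map_t *m, uint32_t gaddr, int is_write) {
    uint32_t gpage = gaddr >> PAGE_SHIFT;
    for (int i = 0; i < N_SLOTS; i++)
        if (m->slots[i].gpage == gpage &&
            type_matches(m->slots[i].type, is_write))
            return i;
    return -1;
}
```

A hit returns the slot from which the local address is formed; a miss or a type mismatch triggers the swap-in cases described next.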
Further, in step 4, swapping the page corresponding to the data to be accessed into the software cache is implemented as follows:
(1) If the computing core performs a read-only access and the data is not in the software cache, it first checks whether the software cache has free space; if so, the data to be accessed is loaded directly into the software cache; otherwise, a data page already in the software cache is selected and evicted according to its page type, and the needed data page is then loaded. The data page is marked as a read-only page;
(2) If the computing core performs a single-write access and the data is not in the software cache, it first checks whether the software cache has free space; if so, the data to be accessed is loaded directly into the software cache; otherwise, a data page already in the software cache is selected and evicted according to its page type, and the needed data page is then loaded. The data page is marked as a single-write page;
(3) If the computing core performs a read access and the data is not in the software cache, it first checks whether the software cache has free space; if not, a data page already in the software cache is selected and evicted according to its page type. A read-page request for the page to be accessed is then sent to the management core. On receiving the request, the management core checks its stored page management information to determine whether another slave core is writing the page back; if so, it waits until that slave core finishes the write-back. The management core then sends a read-page-permitted response to the computing core, marks in the page management information that the page is being read by that core, and records that the requesting slave core holds a read copy of the page. On receiving the response, the computing core reads the page into its software cache; after the read completes, it sends a read-page completion request to the management core, sets the address translation information, and marks the page as a read page. On receiving the completion request, the management core removes the mark that the core is reading the page from the page management information;
(4) If the computing core performs a write access and the data is in the software cache but the page type is a read page, write permission must be added, so the computing core sends a permission promotion request to the management core. On receiving the request, the management core first checks whether another core is writing the page back to main memory and, if so, waits for the write-back to complete. It then checks whether the page has already been written back: 1) if not, the management core directly sends a promotion-permitted response and sets a write mark for the computing core in the page management information of the page; on receiving the response, the computing core changes the page type to a write page. 2) If the page has been written back, the computing core must read the page again, so the management core marks the page as being read by the computing core, creates a backup copy of the page in memory, sends a re-fetch-page response, and records in the page management information that the requesting slave core holds a write copy of the page; on receiving the response, the computing core reads the page from main memory again, sends a write-page completion request to the management core once the read completes, sets the corresponding translation information, and sets the page type to a write page; on receiving the completion request, the management core removes the mark that the core is write-fetching the page from the page management information;
(5) If the computing core performs a write access and the data is not in the software cache, the computing core sends a write-page request to the management core. On receiving the request, the management core first checks whether another slave core is writing the page back and, if so, waits for the write-back to complete; it then marks in the page management information that the page is being write-fetched by the computing core, sends a write-page-permitted response, and records in the page management information that the requesting core holds a write copy of the page. On receiving the response, the computing core reads the page, sends a write-page completion request to the management core once the read completes, and sets the page type to a write page; on receiving the completion request, the management core removes the mark that the core is write-fetching the page from the page management information.
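The five swap-in cases above amount to a dispatch on the access type and the page's current state in this core's software cache. A hedged sketch of that dispatch (enum names are invented; the returned action stands for the DMA transfers and message exchanges the text describes):

```c
/* Sketch of the swap-in dispatch of cases (1)-(5) of step 4.
 * Encodings are assumptions for illustration. */
typedef enum { ACC_READ_ONLY, ACC_WRITE_ONCE, ACC_READ, ACC_WRITE } access_t;
typedef enum { ST_ABSENT, ST_READ_PAGE, ST_WRITE_PAGE } state_t;
typedef enum {
    ACT_LOAD_LOCALLY,        /* cases (1)/(2): load the page, no messages  */
    ACT_SEND_READ_REQ,       /* case (3): read-page request to the manager */
    ACT_SEND_PROMOTE_REQ,    /* case (4): permission promotion request     */
    ACT_SEND_WRITE_REQ,      /* case (5): write-page request to the manager*/
    ACT_NONE                 /* page already usable as-is                  */
} action_t;

static action_t swap_in_action(access_t acc, state_t st) {
    switch (acc) {
    case ACC_READ_ONLY:
    case ACC_WRITE_ONCE:
        return st == ST_ABSENT ? ACT_LOAD_LOCALLY : ACT_NONE;
    case ACC_READ:
        return st == ST_ABSENT ? ACT_SEND_READ_REQ : ACT_NONE;
    case ACC_WRITE:
        if (st == ST_ABSENT)    return ACT_SEND_WRITE_REQ;
        if (st == ST_READ_PAGE) return ACT_SEND_PROMOTE_REQ;
        return ACT_NONE;
    }
    return ACT_NONE;
}
```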
Further, in step 4, swapping a data page out of the software cache to main memory is implemented as follows:
(1) If the computing core needs to swap out a read-only page or a single-write page, no report to the management core is needed: the read-only page is simply marked invalid, and the single-write page is written directly back to main memory and then marked invalid.
(2) If the computing core needs to swap out a read page in the software cache, it sends the corresponding read-page swap-out information to the management core and then directly marks the data in the software cache as invalid. On receiving the information, the management core deletes the slave core's information about the page from the page management information;
(3) If the computing core needs to swap out a write page in the software cache, it first sends a write-back request for the page to the management core. On receiving the request, the management core first checks whether the page is being written back to main memory by another slave core and, if so, waits for that slave core to finish the write-back. It then checks whether only the requesting slave core's software cache holds a copy of the page and no other slave core has previously written the page back to main memory:
(3-1) If so, the management core sets the write-back bit of the page in the management information and sends a direct write-back response to the computing core. On receiving the response, the computing core writes the data directly back to main memory, sends a write-back completion request to the management core once the write-back completes, and then clears the corresponding address translation information. On receiving the completion request, the management core clears the page's information about that computing core.
(3-2) Otherwise, the management core sets the write-back bit of the page in the management information and then checks whether the requesting core is the first, among the slave cores holding a write copy of the page, to write the page back to main memory. If so, it creates a backup copy of the page for each of the other slave cores holding a write copy and sends a direct write-back response to the core; on receiving the response, the computing core writes the data directly back to main memory, sends a write-back completion request once the write-back completes, and then clears the corresponding address translation information; on receiving the completion request, the management core clears the page's information about that computing core. If not, the management core sends an indirect write-back response to the computing core; on receiving the response, the computing core loads its backup copy of the page from main memory, compares the copy in its software cache with the backup to find the modified parts, generates a new merged page, and writes it back; once the write-back completes, it sends an indirect write-back completion request to the management core and then clears the corresponding address translation information. On receiving the completion request, the management core deletes the backup copy of the page for that computing core and clears all of the page's information about that computing core from the page management information.
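The indirect write-back of case (3-2) merges only the bytes this core actually modified, so an earlier write-back of the same page by another core is not clobbered. A minimal sketch, assuming byte-granularity comparison and an illustrative page size:

```c
/* Sketch of the indirect write-back of case (3-2): diff the software-
 * cache copy against this core's pristine backup and apply only the
 * modified bytes to the (possibly newer) page in main memory.
 * The page size is an assumption for illustration. */
#include <stddef.h>

#define PG 64   /* illustrative page size in bytes */

static void indirect_write_back(const unsigned char backup[PG],
                                const unsigned char local[PG],
                                unsigned char main_page[PG]) {
    for (size_t i = 0; i < PG; i++)
        if (local[i] != backup[i])   /* byte modified by this core */
            main_page[i] = local[i]; /* merge it into the current page */
}
```

Bytes another core already wrote back (and this core never touched) survive the merge, which is the point of taking the indirect path instead of a whole-page store.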
Further, in step 5, the lock mechanism has two parts: acquiring the lock and releasing the lock;
1) Acquiring the lock: a computing core that needs to acquire a lock first swaps all data pages in its software cache out to main memory, then sends a lock request to the management core. On receiving the request, the management core first checks whether the requested lock is held by another core and, if so, waits for that core to release it; it then sends a lock-granted response to the computing core and records the holder. On receiving the response, the computing core has acquired the lock and continues execution;
2) Releasing the lock: a computing core that needs to release a lock first swaps the pages it has modified out to main memory, then sends a lock release request to the management core. On receiving the request, the management core deletes the record that the computing core holds the lock and sends a release-permitted response; on receiving the response, the computing core has released the lock and continues execution.
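The management core's side of the lock protocol reduces to holder bookkeeping. A sketch under two stated assumptions: lock ids are small integers, and a busy lock is reported to the caller immediately (the patent's manager instead waits and replies only once the holder releases):

```c
/* Sketch of the management core's lock bookkeeping for step 5.
 * The table size and the "report busy instead of waiting" behaviour
 * are assumptions for illustration. */
#define N_LOCKS 16

typedef struct {
    int holder[N_LOCKS];   /* core id holding each lock, -1 if free */
} lock_table_t;

static void lock_table_init(lock_table_t *t) {
    for (int i = 0; i < N_LOCKS; i++)
        t->holder[i] = -1;
}

/* Record the holder and return 1 if the lock was free; return 0 if it
 * is held (the real manager would queue the request and reply later). */
static int lock_acquire(lock_table_t *t, int lock_id, int core_id) {
    if (t->holder[lock_id] != -1) return 0;
    t->holder[lock_id] = core_id;
    return 1;
}

/* Only the recorded holder may release its lock. */
static int lock_release(lock_table_t *t, int lock_id, int core_id) {
    if (t->holder[lock_id] != core_id) return 0;
    t->holder[lock_id] = -1;
    return 1;
}
```

Note that the coherence work happens around this table: the requester flushes its cached pages before acquiring and flushes its modified pages before releasing, so data protected by the lock is always re-fetched fresh from main memory inside the critical section.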
Further, in step 6, a fence (barrier) mechanism is used to synchronize the different computing cores, as follows:
Each computing core first swaps the data pages in its software cache out to main memory, then sends a fence-arrival message to the management core. On receiving the message, the management core increments the fence-arrival count and checks whether all computing cores have arrived at the fence; if not, it continues to wait until they have. The management core then sends fence-arrival responses to all computing cores and clears the fence count; each computing core continues execution after receiving its response.
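The management core's fence count can be sketched as a plain counting barrier; the page swap-outs before arrival and the actual response messages are omitted here, and the struct layout is an assumption:

```c
/* Sketch of the management core's fence count of step 6. */
typedef struct {
    int count;       /* computing cores that have arrived so far */
    int n_compute;   /* total computing cores expected at the fence */
} fence_t;

/* Called by the manager on each "arrived at fence" message. Returns 1
 * when this arrival completes the barrier (the manager then sends a
 * fence-arrival response to every computing core), else 0. */
static int fence_arrive(fence_t *f) {
    f->count++;
    if (f->count == f->n_compute) {
        f->count = 0;        /* clear the count for the next fence */
        return 1;
    }
    return 0;
}
```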
Further, the reduction primitive based on register communication in step 7 proceeds as follows:
After all computing cores call the reduction primitive, each computing core sends its data to the computing core at the head of its row; each row-head core performs a partial reduction and sends its partial result to computing core No. 0, which performs the final reduction operation and writes back the final result.
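The two-level reduction above can be simulated with ordinary arrays standing in for the inter-core register channels. An 8 x 8 grid and summation as the operator are assumptions for illustration (in the real method some slave cores serve as routing and management cores rather than computing cores):

```c
/* Sketch of the two-level reduction of step 7: every core "sends" its
 * value to its row head, each row head forms a partial result, and
 * core 0 combines the partials. Arrays stand in for register channels. */
#define ROWS 8
#define COLS 8

static long reduce_sum(const long v[ROWS][COLS]) {
    long partial[ROWS];
    for (int r = 0; r < ROWS; r++) {   /* each row head reduces its row */
        partial[r] = 0;
        for (int c = 0; c < COLS; c++)
            partial[r] += v[r][c];
    }
    long total = 0;                    /* core 0 performs the final step */
    for (int r = 0; r < ROWS; r++)
        total += partial[r];
    return total;
}
```

Because the partials travel over on-chip register channels instead of main memory, the latency advantage claimed in the "Advantageous effects" section follows from the two-level structure.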
Advantageous effects:
Compared with the prior art, the invention has the following advantages: it realizes automatic exchange of data between the slave-core software cache and main memory, avoids the complexity of manually managing the slave-core software cache, and guarantees data consistency among the slave-core software caches. The lock mechanism provided facilitates mutually exclusive data access between slave-core threads; the fence mechanism provided facilitates synchronization between threads; and the register-communication-based reduction primitive greatly reduces the latency of reduction operations compared with reductions that access main memory directly.
Drawings
FIG. 1 is a diagram of a hardware architecture for implementing the proposed method of the present invention;
FIG. 2 is a schematic diagram of the slave core task partitioning and communication process proposed by the present invention;
FIG. 3 is a schematic diagram of a read-only page fetching process according to the present invention;
FIG. 4 is a schematic diagram of a page fetch process for a single-write page according to the present invention;
FIG. 5 is a schematic diagram of a page reading process according to the present invention;
FIG. 6 is a schematic diagram of a page writing and fetching process according to the present invention;
FIG. 7 is a schematic diagram of a read-only page swap-out process according to the present invention;
FIG. 8 is a schematic diagram of a single-write page swap-out process according to the present invention;
FIG. 9 is a schematic diagram of a read-page swap-out process according to the present invention;
FIG. 10 is a schematic diagram of a page-writing swap-out process according to the present invention;
FIG. 11 is a schematic diagram of a lock application flow proposed by the present invention;
FIG. 12 is a schematic diagram of a lock release process proposed by the present invention;
FIG. 13 is a schematic flow chart of the fence operation proposed by the present invention;
fig. 14 is a schematic diagram of the protocol operation communication process proposed by the present invention.
The communication process between the computation core and the management core is partially represented by dotted lines in fig. 3-13.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific examples described herein are intended to be illustrative only and are not intended to be limiting. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The present invention requires the addition of the APIs shown in Table 1.
The hardware architecture of the present invention is shown in fig. 1, which is illustrated here as a slave core array of a core group, and the task division and communication process is shown in fig. 2.
The method comprises the following specific implementation steps:
step 1: the application is started by a master core (MPE), and functions required to be operated on the slave core are specified by a user;
table 1 is the required new API for the proposed process of the present invention;
[Table 1 is reproduced as images in the original publication (Figure BDA0002829973580000061 and Figure BDA0002829973580000071); the API list is not recoverable from this text.]
Step 2: after the slave cores start running, each slave core obtains its core division type according to its own core number, and calls the initialization function of that core type to complete the corresponding initialization work; the division of types is shown in FIG. 2;
Step 3: the different types of slave cores enter their corresponding execution parts: the routing cores are responsible for data forwarding between the computing cores and the management core; the management core stores the management information of all data, processes the consistency transactions of all data, and serves as the lock manager. The computing cores are responsible for the actual computing tasks.
Step 4: the computing core accesses data through its global address, which is translated to a local address; if the data corresponding to the global address is in the slave-core software cache and the corresponding page type matches the access type, the local address is returned directly; if the data to be accessed is not in the slave-core software cache, or the page type does not match the access type, different processing is performed according to the access type, so that the data page to be accessed is swapped into the software cache and, depending on the circumstances, some data pages are swapped out to main memory.
Swapping the page corresponding to the data to be accessed into the software cache is implemented as follows:
(1) If the computing core performs a read-only access and the data is not in the software cache, it first checks whether the software cache has free space; if so, the data to be accessed is loaded directly into the software cache; otherwise, a data page already in the software cache is selected and evicted according to its page type, and the needed data is then loaded. The data page is marked as read-only. As shown in fig. 3.
(2) If the computing core performs a single-write access and the data is not in the software cache, it first checks whether the software cache has free space; if so, the data to be accessed is loaded directly into the software cache; otherwise, a data page already in the software cache is selected and evicted according to its page type, and the needed data is then loaded. The data page is marked as single-write. As shown in fig. 4.
(3) If the computing core performs a read access and the data is not in the software cache, it first checks whether the software cache has free space; if not, a data page already in the software cache is selected and evicted according to its page type. A read-page request for the page to be accessed is then sent to the management core. On receiving the request, the management core checks its stored page management information to determine whether another slave core is writing the page back; if so, it waits until that slave core finishes the write-back. The management core then sends a read-page-permitted response to the computing core, marks in the page management information that the page is being read by that core, and records in the page management information that the slave core holds a read copy of the page. On receiving the response, the computing core reads the page into its software cache. After the read completes, the computing core sends a read-page completion request to the management core, then sets the address translation information and marks the page as a read page. On receiving the request, the management core removes the mark that the core is reading the page from the page management information. As shown in fig. 5.
(4) If the computing core performs a write access and the data is in the software cache but the page type is a read page, write permission must be added, so the computing core sends a permission promotion request to the management core. On receiving the request, the management core first checks whether another core is writing the page back to main memory and, if so, waits for the write-back to complete. It then checks whether the page has already been written back: 1) if not, the management core directly sends a promotion-permitted response and sets a write mark for the computing core in the page management information of the corresponding page; on receiving the response, the computing core changes the page type to a write page. 2) If the page has been written back, the computing core must read the page again, so the management core marks the page as being read by the computing core, creates a backup copy of the page in memory, then sends a re-fetch-page response and sets a write mark for the computing core in the page management information of the corresponding page; on receiving the response, the computing core reads the page from main memory again, sends a write-page completion request to the management core once the read completes, then sets the corresponding translation information and sets the page type to a write page. On receiving the completion request, the management core removes the mark that the core is writing the page from the page management information. As shown in fig. 6.
(5) If the computing core performs a write access and the data is not in the software cache, the computing core sends a write-page request to the management core. On receiving the request, the management core first checks whether another slave core is writing the page back and, if so, waits for the write-back to complete; it then marks in the page management information that the page is being write-fetched by the computing core and sends a write-page-permitted response. On receiving the response, the computing core reads the page, sends a write-page completion request to the management core once the read completes, and sets the page type to a write page. On receiving the completion request, the management core removes the mark that the core is write-fetching the page from the page management information.
The step of swapping out the page corresponding to the data to be accessed to the main memory is realized as follows:
(1) If the computing core needs to write the read-only page back to the main memory or needs to write the single-write page back to the main memory, the corresponding page can be directly written back without reporting to the management core. As shown in fig. 7 and 8, respectively.
(2) If the computing core needs to swap a read page out of the software cache, the computing slave core sends the corresponding read-page swap-out message to the management core and then directly marks the data in the software cache as invalid. After receiving the message, the management core deletes the slave core's information about the page from the page management information. As shown in fig. 9.
(3) If the computing core needs to swap a write page out of the software cache, it first sends a write-back request for the page to the management core. After receiving the request, the management core first checks whether the page is being written back to main memory by another slave core and, if so, waits for that slave core to complete the write-back. It then checks whether only the requesting slave core's software cache holds a copy of the page and the page has not previously been written back to main memory by other slave cores:
(3-1) If so, the management core sets the write-back bit of the page in the management information and sends a direct write-back response to the computing core. After receiving the response, the computing core writes the data back to main memory directly, sends a write-back completion request to the management core once the write-back is done, and then clears the corresponding address translation information. Upon receipt, the management core clears the page's information about the computing core.
(3-2) Otherwise, the management core sets the write-back bit of the page in the management information and then checks whether this computing core is the first of the slave cores holding a write copy of the page to write it back to main memory. If so, the management core creates a backup copy of the page for each of the other slave cores holding a write copy and sends a direct write-back response to the core; after receiving the response, the computing core writes the data back to main memory directly, sends a write-back completion request once the write-back is done, and then clears the corresponding address translation information, and the management core clears the page's information about the computing core upon receipt. If not, the management core sends an indirect write-back response to the computing slave core; after receiving the response, the computing core loads its backup of the page from main memory, compares the copy in its software cache with the backup to find the modified part, regenerates a new page, and writes it back. Once the write-back is complete, it sends an indirect write-back completion request to the management core and then clears the corresponding address translation information. Upon receipt, the management core deletes the backup of the page corresponding to the computing core and clears all of the computing core's information about the page from the page management information. As shown in fig. 10.
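The indirect write-back in case (3-2) merges only the words the core actually modified, found by comparing the cached copy against the backup created by the management core, so cores that wrote disjoint parts of the same page do not overwrite each other's updates. A minimal sketch, with pages modeled as word lists and the function name an assumption:

```python
def indirect_write_back(cached_copy, backup, main_memory_page):
    """Regenerate the page: take main memory as the base and apply only
    the words this core changed relative to its backup."""
    merged = list(main_memory_page)              # start from the current main-memory page
    for i, (new, old) in enumerate(zip(cached_copy, backup)):
        if new != old:                           # this core modified word i
            merged[i] = new                      # carry only our modification into the page
    return merged                                # this regenerated page is written back
```

For example, if another core already wrote words 0 and 3 back, and this core only changed word 1, the merged page keeps all three updates.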
And 5: the method also provides a locking mechanism for realizing mutual exclusion among different computational cores. The lock mechanism is divided into two parts, one part is to acquire the lock and the other part is to release the lock. 1) The process of acquiring the lock is as follows: the computing core to acquire the lock swaps out all data pages in the software cache of the computing core to the main memory, and then the computing core sends a lock application request to the management core. After receiving the request, the management core firstly judges whether the lock applied by the computing core is acquired by other cores, and if so, waits for the corresponding core to release the lock. And then sending a lock acquisition permission response to the computing core and recording, wherein after receiving the response, the computing core explains that the applied lock is acquired, and then the computing core can continue to execute. As shown in fig. 11, 2) the process of releasing the lock is as follows: the computing core that wants to release the lock swaps out the pages that it modifies to main memory and then sends a lock release request to the management core. After receiving the request, the management core deletes the record of the lock held by the computing core, and sends a lock release permission response to the computing core, and after receiving the response, the computing core indicates that the corresponding lock is released, and then the computing core can continue to execute. As shown in fig. 12.
Step 6: the method also provides a fence mechanism for achieving synchronization between different compute cores. The process is as follows: each compute core first swaps out the data pages in its own software cache into main memory. And then sending arrival fence information to the management core, wherein the management core receives the request, then counts the arrival fence by +1, then judges whether all the computing cores arrive at the fence, and if not, continues to wait until all the computing cores arrive at the fence. The management core sends fence arrival responses to all compute cores and clears the fence count. The computational core can continue execution after receiving the response. As shown in fig. 13.
And 7: the method also provides a reduction primitive based on register communication, and supports reduction operations such as continuous addition and continuous multiplication. The process is as follows: all the computing cores call corresponding specification primitives and specify operations and variables to be specified, then, each computing core (such as cores which are not in the first column, such as C1, C2, C8 and C9) sends data which need specification to the computing cores which are positioned at the heads of all the rows (such as cores which are in the first column, such as C0, C7 and C43) through register communication (Reg), after the computing cores at the heads of the rows carry out partial specification, partial results obtained by the partial specification operation are sent to the first (C0) computing core through register communication (Reg), and the No. 0 computing core carries out the final specification operation and writes back the final specification result. As shown in fig. 14.
And 8: and repeating the steps 4 to 7 until the computing core completes the corresponding computing task.
Although illustrative embodiments of the present invention have been described above to facilitate understanding by those skilled in the art, it should be understood that the present invention is not limited to the scope of those embodiments. Various changes will be apparent to those skilled in the art, and all inventions utilizing the inventive concepts set forth herein are intended to be protected, provided they fall within the spirit and scope of the invention as defined and limited by the appended claims.

Claims (4)

1. A slave core software cache sharing method for a many-core processor is characterized by comprising the following steps:
step 1: the application is started by the master core, and the function required to run on the slave core is specified by a user;
step 2: after the slave cores start running, each slave core obtains its core-partition type according to its slave-core number and calls the initialization function of the corresponding core type to complete the corresponding initialization work;
step 3: the different types of slave cores enter their corresponding execution parts; the routing core is responsible for data forwarding between the computing cores and the management core; the management core stores the management information of all data, processes the coherence transactions of all data, and serves as the lock manager; the computing cores are responsible for the actual computing tasks;
step 4: the computing core uses the global address of the data for memory access and performs translation from the global address to a local address; if the data corresponding to the global address is in the slave core's software cache and the corresponding page type matches the access type, the local address is returned directly; if the data to be accessed by the computing core is not in the slave core's software cache or the page type does not match the access type, different processing is performed according to the access type of the slave-core data, so that the data page to be accessed is swapped into the software cache and, depending on the specific situation, a data page is selected to be swapped out to main memory;
step 5: realizing mutual exclusion among different computing cores by using a lock mechanism;
step 6: synchronization among different computing cores is realized by utilizing a fence mechanism;
and 7: carrying out specification operation by using a specification primitive and returning a specification result;
and 8: repeating the steps 4 to 7 until the computing core completes the corresponding computing task;
step 9: after the computing core completes its computing task, the data in its software cache is first swapped out to main memory, then an exit request is sent to the management core, and the computing core performs its cleanup work and exits; the management core counts the exit requests, and when the count equals the number of computing cores, it sends an exit command to the routing core; the routing core performs its cleanup work and exits after receiving the exit command, and then the management core also performs the corresponding cleanup work and exits.
2. The method according to claim 1, wherein in step 5, the lock mechanism is divided into two parts, one part is to acquire the lock and the other part is to release the lock;
1) The process of acquiring the lock is as follows: the computing core that needs to acquire the lock swaps all data pages in its software cache out to main memory and then sends a lock application request to the management core; after receiving the request, the management core first checks whether the lock applied for by the computing core is held by another core and, if so, waits for that core to release the lock; it then sends a lock-acquisition-granted response to the computing core and records the grant; the response indicates to the computing core that the applied-for lock has been acquired, and the computing core then continues to execute;
2) The process of releasing the lock is as follows: the computing core that needs to release the lock swaps the pages it has modified out to main memory and then sends a lock release request to the management core; after receiving the request, the management core deletes the record of the lock held by the computing core and sends a lock-release-granted response to the computing core; the response indicates that the corresponding lock has been released, and the computing core then continues to execute.
3. The method for sharing the slave-core software cache of the many-core processor according to claim 1, wherein in the step 6, the synchronization of different cores is realized by using a fence mechanism, which comprises the following steps:
each computing core first swaps the data pages in its own software cache out to main memory and then sends a fence-arrival message to the management core; after receiving the message, the management core increments the fence-arrival count by 1 and checks whether all computing cores have reached the fence, continuing to wait until they all have; the management core then sends fence-arrival responses to all the computing cores and clears the fence count; a computing core can continue execution after receiving the response.
4. The method for sharing the slave-core software cache of the many-core processor according to claim 1, wherein the reduction primitive based on register communication in step 7 is processed as follows:
after all the computing cores call the corresponding reduction primitive, each computing core sends its own data to the computing core at the head of its row; the row-head computing cores perform a partial reduction and then send the partial reduction results to the first computing core, core No. 0, which performs the final reduction operation and writes back the final reduction result.
CN202011439357.1A 2020-12-11 2020-12-11 Slave-core software cache sharing method for many-core processor Active CN112463711B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011439357.1A CN112463711B (en) 2020-12-11 2020-12-11 Slave-core software cache sharing method for many-core processor

Publications (2)

Publication Number Publication Date
CN112463711A CN112463711A (en) 2021-03-09
CN112463711B (en) 2023-03-31

Family

ID=74800574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011439357.1A Active CN112463711B (en) 2020-12-11 2020-12-11 Slave-core software cache sharing method for many-core processor

Country Status (1)

Country Link
CN (1) CN112463711B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701263B (en) * 2023-08-01 2023-12-19 山东大学 DMA operation method and system for supercomputer

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929834A (en) * 2012-11-06 2013-02-13 无锡江南计算技术研究所 Many-core processor and inter-core communication method thereof and main core and auxiliary core
CN102982001A (en) * 2012-11-06 2013-03-20 无锡江南计算技术研究所 Many-core processor and methods of visiting space of many-core processor and main core of the many-core processor
CN105740164A (en) * 2014-12-10 2016-07-06 阿里巴巴集团控股有限公司 Multi-core processor supporting cache consistency, reading and writing methods and apparatuses as well as device
CN106095583A (en) * 2016-06-20 2016-11-09 国家海洋局第海洋研究所 Master-slave core cooperative computing and programming framework based on the new Sunway (Shenwei) processor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255185B2 (en) * 2017-03-04 2019-04-09 International Business Machines Corporation Cache management in a stream computing environment that uses a set of many-core hardware processors


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant