CN113900968B

CN113900968B - Method and device for realizing synchronous operation of multi-copy non-atomic write storage sequence

Info

Publication number: CN113900968B
Application number: CN202111497698.9A
Authority: CN
Inventors: 夏军; 晏小波; 蔡学武; 霍泊帆; 陈锞; 陈杨阳
Original assignee: Nanhu Laboratory
Current assignee: Nanhu Laboratory
Priority date: 2021-12-09
Filing date: 2021-12-09
Publication date: 2022-03-11
Anticipated expiration: 2041-12-09
Also published as: CN113900968A

Abstract

The invention discloses a method and a device for realizing the synchronous operation of a multi-copy non-atomic write storage sequence, which comprises a microprocessor framework adopting a non-atomic write realization mode, wherein the microprocessor framework comprises a plurality of Cache agents, a plurality of directory agents and a plurality of IO agents, the Cache agents are connected to the directory agents and the IO agents through an on-chip interconnection network, the on-chip interconnection network is provided with at least three message channels for a Cache consistency protocol to use, the Cache agents are connected to the on-chip interconnection network through a plurality of synchronous processing modules for realizing strong synchronization, and each Cache agent corresponds to one synchronous processing module. The invention can ensure that the read-write operation sent out in front is really executed when the execution of the strong synchronization instruction of the hardware thread is finished, thereby solving the problem of realizing the strong synchronization semantics of the Non-MCAW, realizing the strong synchronization on the basis of the Non-MCAW, keeping the advantages of the microprocessing architecture realized by the Non-MCAW and effectively simplifying the hardware design.

Description

Method and device for realizing synchronous operation of multi-copy non-atomic write storage sequence

Technical Field

The invention belongs to the technical field of non-atomic write storage order synchronous operation, and particularly relates to a method and a device for realizing multi-copy non-atomic write storage order synchronous operation.

Background

Modern microprocessors generally adopt a multi-core, multi-thread shared storage architecture, and storage consistency is the basis for the correct operation of parallel programs on such an architecture. Storage consistency defines the ordering rules for storage access operations to different addresses. Depending on which orderings are allowed, several storage consistency models exist, such as the sequential consistency model, the TSO (Total Store Order) model and the relaxed consistency model. Because different storage consistency models define different legal storage orders, a software programmer can write a correct parallel program on a multi-core multi-thread shared storage architecture only by knowing the storage consistency model that the architecture adopts.

In order to improve access performance, modern multi-core multi-thread microprocessors generally implement multi-level Caches. The introduction of Caches can cause inconsistency among multiple data copies of the same address. To solve this problem, multi-core multi-thread microprocessors generally adopt a Cache consistency protocol to ensure the consistency of the data copies in different Caches. In most modern multi-core multi-thread microprocessors, the Cache consistency protocol is implemented by Cache agents (CA), directory agents (HA) and IO agents (IA). The Cache in a CA can cache copies of data; a CA can issue access requests and receive data responses and snoop requests (used to invalidate a copy of data or to obtain the most recent data). The HA records the state of the data copies; it can receive access requests and issue data responses and snoop requests. The IA cannot cache a copy of the data; it can issue access requests and data responses, and receive access requests, data responses and snoop requests (used to obtain the most recent data). It follows that, in order to avoid protocol deadlock, an implementation of the consistency protocol requires at least three message channels, namely a request channel, a response channel and a snoop channel, which carry access requests, data responses and snoop requests respectively. The CA, HA and IA are connected by an on-chip interconnection network, which is typically a dimension-order network (i.e., messages sent from the same source agent arrive in order at the same destination agent).

A multi-core multi-thread microprocessor may have several Caches between a processor core and memory, and the latest write data issued by a hardware thread of the processor core must pass through these Caches before finally reaching memory. As a result, some Caches hold the latest data copy while others hold an old copy, which introduces the atomicity problem of write operations. In addition to defining the legal storage order rules, storage consistency therefore also needs to define the atomic properties of write operations. Depending on the time at which the new value of a write is observed by different hardware threads, the atomic nature of write operations falls into two categories: multi-copy atomic write (MCAW) and non-multi-copy atomic write (Non-MCAW). MCAW ensures that when a hardware thread other than the initiator of a write observes that write, all other threads can observe it as well. Under MCAW, the write data of a hardware thread is either observed by all other hardware threads at the same time (i.e., all of them can read the new value) or observed by none of them (i.e., all of them can read only the old value), which means that the write operation is atomic. Non-MCAW cannot guarantee the atomic nature of a write operation: when a certain hardware thread observes a write operation, other threads do not necessarily observe it, that is, some hardware threads can read the new value of the write while others can read only the old value.

Microprocessor architectures adopting the MCAW implementation include X86, ARMV8 and the like. These architectures typically employ write-back Caches, and in their Cache consistency protocols a new value can be written to the Cache only after write permission has been obtained (i.e., all other cached copies of the data have been invalidated). Because MCAW presents software personnel with a view in which write operations are atomic, parallel programming is simplified, but these architectures pay a significant hardware cost to support the atomic nature of writes. Microprocessor architectures adopting the Non-MCAW implementation are mainly the POWER family of processors. These architectures generally employ write-through Caches, and in their Cache consistency protocols a new value can be written to the Cache directly, without first acquiring write permission, that is, without waiting for all other cached copies of the data to be invalidated. Non-MCAW can simplify hardware design but increases the difficulty of parallel programming, because software personnel now see a view in which write operations are non-atomic.

In parallel programming, in addition to relying on the storage access order prescribed by the storage consistency model, software personnel can adjust the storage order through storage order synchronization instructions to obtain the expected order of storage access operations. Among these, a strong synchronization instruction ensures that no storage access operation following the instruction is performed until the storage access operations preceding the instruction are actually completed (i.e., the data responses of all read operations have been returned and the data of all write operations are globally visible). The strong synchronization instruction of ARMV8 is DMB, while that of POWER is SYNC (also known as HWSYNC or SYNC 0). Assume that hardware thread 0 executes two write access instructions in sequence, x1=1 and x2=1, that hardware thread 1 executes two read access instructions in sequence, r1=x2 and r2=x1, and that the initial values of x1 and x2 are both 0. According to the storage consistency model of ARMV8 or POWER, the write to x2 may complete before the write to x1, so r1=1, r2=0 is a legal output. If a strong synchronization instruction is added between the two storage access instructions of each hardware thread, i.e., hardware thread 0 executes x1=1; dmb/sync; x2=1 and hardware thread 1 executes r1=x2; dmb/sync; r2=x1, then r1=1, r2=0 becomes an illegal output: the strong synchronization instruction causes the write to x2 on hardware thread 0 to be executed only after the write to x1 is globally visible (i.e., all hardware threads can read the new value 1 of x1), and the reads of x2 and x1 on hardware thread 1 cannot be reordered.
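The two-thread example above can be explored with a small enumeration. The following Python sketch is illustrative only and rests on a strong simplifying assumption: memory is modelled as a single shared dictionary, so every write is instantly visible to all threads, and the only relaxation modelled is reordering of the two operations inside each thread. It therefore does not model Non-MCAW behaviour; it only shows why preserving program order on both threads rules out r1=1, r2=0. The names `OPS`, `respects_program_order` and `outcomes` are hypothetical.

```python
from itertools import permutations

# Each operation: (thread, position in program order, kind, variable, register)
OPS = [
    (0, 0, "write", "x1", None),   # thread 0: x1 = 1
    (0, 1, "write", "x2", None),   # thread 0: x2 = 1
    (1, 0, "read", "x2", "r1"),    # thread 1: r1 = x2
    (1, 1, "read", "x1", "r2"),    # thread 1: r2 = x1
]


def respects_program_order(order):
    """True if each thread's operations appear in their program order."""
    for thread in (0, 1):
        positions = [op[1] for op in order if op[0] == thread]
        if positions != sorted(positions):
            return False
    return True


def outcomes(fenced):
    """Return the set of reachable (r1, r2) pairs.

    fenced=True  models a strong synchronization instruction between the
                 two operations of each thread (program order is kept).
    fenced=False allows the two operations of a thread to be reordered.
    """
    results = set()
    for order in permutations(OPS):
        if fenced and not respects_program_order(order):
            continue
        memory = {"x1": 0, "x2": 0}
        registers = {}
        for _, _, kind, var, reg in order:
            if kind == "write":
                memory[var] = 1
            else:
                registers[reg] = memory[var]
        results.add((registers["r1"], registers["r2"]))
    return results


if __name__ == "__main__":
    print("without strong sync:", sorted(outcomes(fenced=False)))
    print("with strong sync:   ", sorted(outcomes(fenced=True)))
```

Under these assumptions the pair (1, 0) is reachable only when reordering is allowed, which matches the legal and illegal outputs described in the text.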

A microprocessor architecture adopting the MCAW implementation can easily realize strong synchronization instruction semantics. When a hardware thread executes the instruction, it only needs to wait until the Cache of the processor core on which the thread runs has processed all storage access requests: at that point all read access responses have been returned and all write operations have invalidated the remotely cached data copies, so all hardware threads can read the new values written by that hardware thread (i.e., the data of the write operations are globally visible). For a microprocessor architecture adopting the Non-MCAW implementation, however, the strong synchronization instruction cannot be implemented in the same way. When the Cache of the processor core finishes processing all storage access requests, it is only guaranteed that all read access responses have been returned; it is not guaranteed that the write operations are really completed, that is, that the data of all write operations are globally visible. Because a write operation completes locally without acquiring write permission, the write may not yet have reached the HA/IA, or a data copy cached in the Cache of another processor core may not yet have been invalidated. Therefore, a microprocessor architecture adopting Non-MCAW needs a different method to implement the semantics of the strong synchronization instruction, but no implementation method of the strong synchronization instruction for the Non-MCAW case has been disclosed so far.

Disclosure of Invention

The invention aims to solve the problems and provides a method and a device for realizing the storage order synchronous operation of multi-copy non-atomic writing.

In order to achieve the purpose, the invention adopts the following technical scheme:

A multi-copy non-atomic write storage sequence synchronous operation implementation device comprises a microprocessor framework adopting a non-atomic write implementation mode, wherein the microprocessor framework comprises a plurality of Cache agents, a plurality of directory agents and a plurality of IO agents, the Cache agents are connected to the directory agents and the IO agents through an on-chip interconnection network, the on-chip interconnection network is provided with at least three message channels for a Cache consistency protocol to use, and the device is characterized in that the Cache agents are connected to the on-chip interconnection network through a plurality of synchronous processing modules for realizing strong synchronization, and each Cache agent corresponds to one synchronous processing module.

In the above device for implementing synchronous operation of a multi-copy non-atomic write storage sequence, the on-chip interconnection network has at least three message channels, namely a request channel, a response channel and a monitoring channel; the synchronous processing module comprises a request channel filter, a response channel filter, a monitoring channel filter and a group of synchronous counters, and each hardware thread in the Cache agent connected with the synchronous processing module corresponds to one synchronous counter in the group, that is, the number of synchronous counters in the synchronous processing module equals the number of hardware threads in the corresponding Cache agent.

In the above device for implementing synchronous operation of a multi-copy non-atomic write storage sequence, the request channel filter is configured to determine whether an access request sent from a corresponding Cache agent to an on-chip interconnection network is a strong synchronization request, and perform corresponding processing;

the response channel filter is used for judging whether the response message sent from the on-chip interconnection network to the corresponding Cache agent is a synchronous response message or not and carrying out corresponding processing;

the monitoring channel filter is used for judging whether the monitoring message sent from the on-chip interconnection network to the corresponding Cache agent is a synchronous monitoring message or not and carrying out corresponding processing;

each synchronization counter is used to record the number of synchronization response messages that have not been received by the corresponding hardware thread executing the strong synchronization instruction.

In the above apparatus for implementing synchronous operation of storage sequences of multi-copy non-atomic writes, the implementation processes of the request channel filter, the response channel filter and the snoop channel filter of each synchronous processing module are respectively as follows:

the implementation process of the request channel filter comprises the following steps:

if the access request is judged not to be a strong synchronization request, bypassing the access request and directly sending the access request to a request channel of the on-chip interconnection network; if the access request is judged to be a strong synchronization request, initializing a synchronization counter of a corresponding hardware thread, generating a synchronization request message and sending the synchronization request message to a request channel of the on-chip interconnection network;

the implementation process of the response channel filter comprises the following steps:

if the response message is judged not to be the synchronous response message, bypassing the response message and directly sending the response message to the corresponding Cache agent; if the response message is a synchronous response message, subtracting one from the synchronous counter of the corresponding hardware thread;

the implementation process of the listening channel filter comprises the following steps:

if the monitoring message is judged not to be the synchronous monitoring message, bypassing the monitoring message and directly sending the monitoring message to a corresponding Cache agent; if the monitoring information is synchronous monitoring information and the source access request comes from a Cache agent connected with the synchronous processing module, the synchronous counter of the corresponding hardware thread is reduced by one, and if the source access request does not come from the Cache agent connected with the synchronous processing module, a synchronous response information is generated and sent to a response channel of the on-chip interconnection network through a response channel filter.

In the above apparatus for implementing synchronous operation of multi-copy non-atomic write storage sequences, the message formats of the synchronization response message, the synchronization snoop message, and the synchronization request message all include a destination node number bit field, a source node number bit field, a message type bit field, and a hardware thread number bit field;

the destination node number of the synchronous response message is the Cache agent connected with the source synchronous processing module, the source node number is the IO agent that sends the synchronous response message or the Cache agent connected with the synchronous processing module that sends it, the message type is SYNC_ACK, and the hardware thread number is the hardware thread executing the strong synchronous instruction in the Cache agent connected with the source synchronous processing module;

the destination node numbers of the synchronous monitoring message are all Cache agents connected with a synchronous processing module (the scheme preferably connects every Cache agent to a synchronous processing module, so the destination node numbers are all Cache agents), the source node number is the Cache agent connected with the source synchronous processing module, the message type is SYNC_SNP, and the hardware thread number is the hardware thread executing the strong synchronous instruction in the Cache agent connected with the source synchronous processing module;

the destination node numbers of the synchronous request message are all directory agents and IO agents, the source node number is the Cache agent connected with the synchronous processing module that sends the synchronous request message, the message type is SYNC_REQ, and the hardware thread number is the hardware thread executing the strong synchronous instruction in the Cache agent connected with that synchronous processing module.
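The four bit fields shared by the three message types can be pictured as a small record. The following Python sketch is a minimal illustration, not the encoding used by the invention: node numbers are represented as strings such as "CA0", the destination field is a tuple so that it can hold several nodes, and the class and helper names (`SyncMessage`, `make_sync_req`, `make_sync_snp`, `make_sync_ack`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

SYNC_REQ, SYNC_SNP, SYNC_ACK = "SYNC_REQ", "SYNC_SNP", "SYNC_ACK"


@dataclass(frozen=True)
class SyncMessage:
    dst: Tuple[str, ...]  # destination node number(s)
    src: str              # source node number
    msg_type: str         # SYNC_REQ, SYNC_SNP or SYNC_ACK
    thread: int           # hardware thread executing the strong sync instruction


def make_sync_req(source_ca, thread, has, ias):
    """Synchronous request: from the source CA to all HAs and IAs."""
    return SyncMessage(tuple(has) + tuple(ias), source_ca, SYNC_REQ, thread)


def make_sync_snp(req, cas):
    """Synchronous monitoring message: addressed to all CAs.

    The source field still names the CA that started the synchronization,
    so every receiver can tell whether it is the source module.
    """
    return SyncMessage(tuple(cas), req.src, SYNC_SNP, req.thread)


def make_sync_ack(sender, trigger):
    """Synchronous response: from an IA or a remote CA back to the source CA."""
    return SyncMessage((trigger.src,), sender, SYNC_ACK, trigger.thread)


if __name__ == "__main__":
    req = make_sync_req("CA0", 1, ["HA0", "HA1"], ["IA0"])
    snp = make_sync_snp(req, ["CA0", "CA1"])
    print(req, snp, make_sync_ack("CA1", snp), sep="\n")
```

Keeping the originating Cache agent in the source field of the monitoring message is what lets the monitoring channel filter distinguish the source module from the others.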

A method for realizing multi-copy non-atomic write storage order synchronous operation comprises the following steps:

S1, when receiving a strong synchronization request, a synchronization processing module initializes a corresponding synchronization counter and broadcasts a synchronization request message to all directory agents and IO agents;

S2, when the directory agent receives the synchronization request message, broadcasting a synchronization monitoring message to all the synchronization processing modules;

when receiving the synchronization request message, the IO agent returns a synchronization response message to the source synchronization processing module (i.e., the synchronization processing module that initiated the synchronization request message);

S3, when other synchronous processing modules receive the synchronous monitoring message, a synchronous response message is returned to the source synchronous processing module through the on-chip interconnection network;

when the source synchronous processing module receives the synchronous monitoring message, directly updating a corresponding synchronous counter;

S4, the source synchronous processing module receives synchronous response messages from the on-chip interconnection network, and when the source synchronous processing module has collected all synchronous response messages, a synchronous completion response is returned to the corresponding Cache agent.

In the above method for implementing a multi-copy non-atomic write storage order synchronization operation, step S1 specifically includes:

S11, when the synchronous processing module receives an access request sent by a certain hardware thread of the Cache agent connected with the synchronous processing module, judging whether the request is a strong synchronous request;

S12, if the request is not a strong synchronization request, sending the request to a request channel of the on-chip interconnection network;

if the request is a strong synchronization request, initializing a synchronization counter corresponding to the hardware thread, and setting the value of the synchronization counter to be m × n + k, wherein m is the number of Cache agents, n is the number of directory agents, and k is the number of IO agents;

S13, generating a synchronization request message and sending the synchronization request message to the on-chip interconnection network.
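The initial value m × n + k can be checked by counting the events that decrement the counter according to steps S2 to S4: the source module itself receives one synchronous monitoring message from each of the n directory agents, each of the other m - 1 modules returns one synchronous response per directory agent, and each of the k IO agents returns one synchronous response. The short Python sketch below only restates this arithmetic; the function names are hypothetical.

```python
def sync_counter_breakdown(m, n, k):
    """Events that decrement the source module's counter.

    m: number of Cache agents, n: number of directory agents,
    k: number of IO agents (symbols as defined in step S12).
    """
    return {
        "monitoring_messages_at_source": n,       # one per directory agent
        "responses_from_other_modules": (m - 1) * n,
        "responses_from_io_agents": k,
    }


def initial_sync_count(m, n, k):
    """Initial counter value from step S12."""
    return m * n + k


if __name__ == "__main__":
    m, n, k = 4, 2, 1
    parts = sync_counter_breakdown(m, n, k)
    print(parts, "total =", sum(parts.values()))
    print("initial counter =", initial_sync_count(m, n, k))
```

For example, with m = 4, n = 2 and k = 1 the three terms are 2, 6 and 1, which sum to the initial value 9.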

In the above method for implementing a multi-copy non-atomic write storage order synchronization operation, step S3 specifically includes:

S31, when the synchronous processing module receives a monitoring message from the on-chip interconnection network, judging whether the monitoring message is a synchronous monitoring message;

S32, if the monitoring message is not a synchronous monitoring message, forwarding the monitoring message to the Cache agent connected with the synchronous processing module;

if the monitoring message is a synchronous monitoring message, detecting whether the source node number of the synchronous monitoring message is the Cache agent connected with this synchronous processing module; if so, this synchronous processing module is the source synchronous processing module, the synchronous counter corresponding to the hardware thread number in the synchronous monitoring message is decremented by one, and if the synchronous counter reaches zero after the decrement, a synchronous completion response is sent to the corresponding Cache agent;

if the source node number is not the Cache agent connected with this synchronous processing module, generating a synchronous response message and sending the generated synchronous response message to a response channel of the on-chip interconnection network.

In the above method for implementing a storage order synchronization operation of multi-copy non-atomic write, step S4 specifically includes:

S41, when the source synchronous processing module receives a response message from the on-chip interconnection network, judging whether the response message is a synchronous response message;

S42, if the response message is not a synchronous response message, forwarding the response message to the Cache agent connected with the synchronous processing module;

if the response message is a synchronous response message, subtracting one from a synchronous counter corresponding to the hardware thread number in the synchronous response message, and if the synchronous counter returns to zero after subtracting one, sending a synchronous completion response to the corresponding Cache agent to indicate that the execution of the strong synchronous instruction of the corresponding hardware thread is completed.
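The behaviour of one synchronous processing module in steps S1, S3 and S4 can be sketched as a small class with one handler per channel. This is a minimal illustration under stated assumptions, not the hardware design: messages are plain dictionaries, the request type "STRONG_SYNC" and the completion response "SYNC_DONE" are hypothetical names, and the lists `to_network` and `to_ca` stand in for the channels towards the on-chip interconnection network and the local Cache agent.

```python
from dataclasses import dataclass, field


@dataclass
class SyncModule:
    ca: str                    # Cache agent connected with this module
    m: int                     # number of Cache agents
    n: int                     # number of directory agents
    k: int                     # number of IO agents
    counters: dict = field(default_factory=dict)   # thread -> outstanding count
    to_network: list = field(default_factory=list)
    to_ca: list = field(default_factory=list)

    def on_request(self, req):
        """Request channel filter (steps S11 to S13)."""
        if req["type"] != "STRONG_SYNC":
            self.to_network.append(("request", req))   # bypass
            return
        self.counters[req["thread"]] = self.m * self.n + self.k
        sync_req = {"type": "SYNC_REQ", "src": self.ca, "thread": req["thread"]}
        self.to_network.append(("request", sync_req))

    def on_snoop(self, msg):
        """Monitoring channel filter (steps S31 and S32)."""
        if msg["type"] != "SYNC_SNP":
            self.to_ca.append(msg)                     # bypass
        elif msg["src"] == self.ca:
            self._decrement(msg["thread"])             # this is the source module
        else:
            ack = {"type": "SYNC_ACK", "dst": msg["src"],
                   "src": self.ca, "thread": msg["thread"]}
            self.to_network.append(("response", ack))

    def on_response(self, msg):
        """Response channel filter (steps S41 and S42)."""
        if msg["type"] != "SYNC_ACK":
            self.to_ca.append(msg)                     # bypass
        else:
            self._decrement(msg["thread"])

    def _decrement(self, thread):
        self.counters[thread] -= 1
        if self.counters[thread] == 0:
            self.to_ca.append({"type": "SYNC_DONE", "thread": thread})


if __name__ == "__main__":
    mod = SyncModule(ca="CA0", m=2, n=1, k=1)
    mod.on_request({"type": "STRONG_SYNC", "thread": 0})
    mod.on_snoop({"type": "SYNC_SNP", "src": "CA0", "thread": 0})
    mod.on_response({"type": "SYNC_ACK", "src": "CA1", "thread": 0})
    mod.on_response({"type": "SYNC_ACK", "src": "IA0", "thread": 0})
    print(mod.to_ca)
```

In this sketch the completion response is produced from either the monitoring path or the response path, whichever delivers the last outstanding decrement, which mirrors the zero checks in steps S32 and S42.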

In the above method for implementing a storage order synchronization operation of multi-copy non-atomic write, in step S2, the process of the directory agent on the request message includes:

when the directory agent receives a request message from an on-chip interconnection network, judging whether the request message is a synchronous request message;

if the request message is not a synchronous request message, the directory agent processes the request message according to the original Cache consistency protocol; if the request message is a synchronous request message, generating a synchronous monitoring message, and sending the generated synchronous monitoring message to a monitoring channel of the on-chip interconnection network so as to broadcast the synchronous monitoring message to all synchronous processing modules;

in step S2, the processing procedure of the request message by the IO agent includes:

when an IO agent receives a request message from an on-chip interconnection network, judging whether the request message is a synchronous request message;

if the request message is not a synchronous request message, the IO agent processes the request message according to the original Cache consistency protocol; and if the request message is a synchronous request message, generating a synchronous response message, and sending the generated synchronous response message to a response channel of the on-chip interconnection network so as to return the synchronous response message to the source synchronous processing module.
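The two branches of step S2 can be sketched as a pair of handler functions. The Python below is a simplified illustration: messages are dictionaries, ordinary requests are tagged "protocol" to stand for processing by the original Cache consistency protocol (which is not modelled), and the function names `ha_handle` and `ia_handle` are hypothetical.

```python
def ha_handle(msg, all_cas):
    """Directory agent: turn a synchronous request into a broadcast of
    synchronous monitoring messages, one per Cache agent."""
    if msg["type"] != "SYNC_REQ":
        return [("protocol", msg)]   # handled by the original protocol
    return [
        ("snoop", {"type": "SYNC_SNP", "dst": ca,
                   "src": msg["src"], "thread": msg["thread"]})
        for ca in all_cas
    ]


def ia_handle(ia, msg):
    """IO agent: answer a synchronous request with one synchronous response
    addressed to the source Cache agent."""
    if msg["type"] != "SYNC_REQ":
        return [("protocol", msg)]   # handled by the original protocol
    return [("response", {"type": "SYNC_ACK", "dst": msg["src"],
                          "src": ia, "thread": msg["thread"]})]


if __name__ == "__main__":
    request = {"type": "SYNC_REQ", "src": "CA0", "thread": 3}
    print(ha_handle(request, ["CA0", "CA1", "CA2"]))
    print(ia_handle("IA0", request))
```

Note that the directory agent copies the source node number of the request into the monitoring messages rather than inserting its own node number, consistent with the message format described earlier.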

The invention has the advantages that: the method and the device for realizing the multi-copy Non-atomic write storage order synchronous operation can ensure that the read-write operation sent out in front is really executed when the execution of the strong synchronous instruction of the hardware thread is finished, thereby solving the problem of realizing the strong synchronous semantics of the Non-MCAW, realizing the strong synchronization on the basis of the Non-MCAW, keeping the advantages of a microprocessing architecture realized by the Non-MCAW and effectively simplifying the hardware design; by means of the existing Cache consistency protocol and message channels, the problem of globally visible data of write operation is solved by adding message types, two-stage broadcast protocol flows and synchronous counters related to strong synchronous processing, and hardware implementation cost is remarkably reduced.

Drawings

FIG. 1 is a schematic diagram of a multi-core multi-threaded microprocessor architecture without a synchronization processing module;

FIG. 2 is a diagram of a multi-core multi-threaded microprocessor architecture with synchronous processing modules according to the present invention;

FIG. 3 illustrates the type of synchronization message and the format of the synchronization message;

FIG. 4 is a flow chart of a synchronization request process of the synchronization processing module according to the present invention;

FIG. 5 is a flow chart of the synchronous snoop message processing of the synchronous processing module in the present invention;

FIG. 6 is a flow chart of the synchronization response message processing of the synchronization processing module according to the present invention;

FIG. 7 is a flow chart of the HA synchronization request message processing in the present invention;

FIG. 8 is a flow chart of the process of the synchronization request message in IA of the present invention;

FIG. 9 is a diagram illustrating an implementation of a synchronization processing module according to the present invention;

FIG. 10 is an example of a synchronization processing protocol message flow diagram in the present invention.

Cache agent-CA; directory agent-HA; IO agent-IA.

Detailed Description

The present solution is further described below with reference to the accompanying drawings:

as shown in FIGS. 1-10, the present invention provides a method and an apparatus for implementing a multi-copy Non-atomic write storage order synchronization operation, which aims at the problem of implementing strong synchronization semantics of Non-MCAW.

FIG. 1 is a diagram of a multi-core and multi-threaded microprocessor architecture to which the present invention is applicable. The multi-core multi-threaded microprocessor architecture consists of m CAs, n HAs, and k IAs, which are connected by an on-chip interconnection network. The CA, the HA, the IA and the on-chip interconnection network jointly implement a Cache consistency protocol, which ensures the consistency of the data copies cached in the Caches. The CA is a Cache agent and generally comprises a processor core, a first-level instruction Cache, a first-level data Cache (write-through Cache) and the like. The HA is a directory agent and generally comprises a secondary Cache, a directory controller, a storage controller, and the like. The IA is an IO agent and generally comprises an IO controller, an IO device, and the like. To avoid protocol deadlock, the on-chip interconnection network needs to provide at least three message channels for the Cache consistency protocol, namely a request channel, a response channel and a monitoring channel, which carry request messages, response messages and monitoring messages respectively. In order to improve the performance of the protocol, the on-chip interconnection network can also provide more message channels for the Cache consistency protocol, but for realizing strong synchronization semantics the invention only needs the request channel, the response channel and the monitoring channel. The on-chip interconnection network is a dimension-order network, in which messages sent from the same source node (CA, HA or IA) reach the same destination node (CA, HA or IA) in order.

When executing the strong synchronization instruction, the processor core in the CA generally waits for the read-write access request to be completed before issuing the strong synchronization request, and at this time, the read response of the previous read access request is completely returned, that is, the read operation is actually completed. However, in the Non-MCAW case, for the previous write access requests, it can only be guaranteed that all the write access requests have been sent to the on-chip interconnection network, and it cannot be guaranteed that the write access requests have reached the corresponding HA or IA, or that all snoop requests (for invalidating the data copy cached in the CA) resulting from the write access requests have reached the corresponding CA. Because the data of the write operation is not globally visible, the write operation is not really executed at this time, and the semantic requirement of the strong synchronous operation cannot be met.

Fig. 2 is a schematic diagram of a multi-core multithreaded processor architecture with a synchronous processing module according to the present invention. In order to realize strong synchronization semantics with smaller hardware overhead and reduce modification of CA, a synchronization processing module is added between CA and the on-chip interconnection network in the scheme and is used for realizing a strong synchronization function. The key to the implementation of strong synchronization semantics is how to make all the data of the write operations issued before the strong synchronization instruction globally visible, i.e., all the write access requests reach the corresponding HA or IA, and all the snoop requests resulting from these write access requests reach the corresponding CA. Since the CA can generally complete processing of snoop requests very quickly (typically 1 to 3 clock cycles), invalidating the cached data copies in the Cache, the invalidation operation can be considered to have been completed as long as the snoop request reaches the CA.

The synchronous processing module can realize the global visibility of the write operation data through two-stage broadcasting and synchronous counting by means of an original Cache consistency protocol. The synchronous processing module realizes a group of synchronous counters, each hardware thread in the CA connected with the synchronous processing module corresponds to one of the synchronous counters in the group of synchronous counters, and the synchronous counters are used for recording the number of synchronous response messages which are not received by the hardware thread executing the strong synchronous instruction. When receiving a strong synchronization request, the synchronization processing module initializes a corresponding synchronization counter and broadcasts a synchronization request message to all the HAs and IAs. The first level of broadcasting can ensure that when the synchronization request message reaches the corresponding HA or IA, the previous write access request also reaches the corresponding HA or IA, because the on-chip interconnection network is a dimensional sequential network. When the HA receives the synchronization request message, the HA broadcasts a synchronization monitoring message to all the synchronization processing modules connected with the CA. The second level broadcast ensures that when a synchronous snoop message reaches the corresponding synchronous processing module, a snoop request generated by a write access request also reaches the corresponding synchronous processing module before the synchronous snoop message reaches the corresponding synchronous processing module. The arrival of the synchronization processing module indicates the arrival of the corresponding CA, i.e. means that the data copy cached in the CA is invalidated. 
Because the IA does not cache data, the HA does not generate a synchronization snoop message to the IA; and because the data in the IA is not cached by the CA, the IA does not generate a synchronization snoop message to the CA (or to the synchronization processing module connected to it). When the IA receives the synchronization request message, it returns a synchronization response message to the source synchronization processing module (i.e., the synchronization processing module that initiated the synchronization request message). When any other synchronization processing module receives a synchronization snoop message, it returns a synchronization response message to the source synchronization processing module. When the source synchronization processing module itself receives a synchronization snoop message, it directly updates the corresponding synchronization counter. When the source synchronization processing module has collected all synchronization response messages (i.e., the corresponding synchronization counter is zero), it returns a synchronization completion response to the local CA, indicating that all write access requests have reached the corresponding HA or IA and that the data copies cached in the CAs have been invalidated, i.e., all write data is now globally visible.

The scheme involves three messages: the synchronization response message, the synchronization snoop message, and the synchronization request message. As shown in fig. 3, their formats are as follows:

The synchronization request message is used by the synchronization processing module to broadcast a synchronization request to all HAs and IAs. Its format consists of a destination node number bit field, a source node number bit field, a message type bit field, and a hardware thread number bit field. The destination is all directory agents and IO agents; how the request is broadcast, and the specific form of the destination node number, depend on whether the on-chip interconnection network supports broadcast. If it does, the synchronization processing module only needs to send one synchronization request message to the on-chip interconnection network, and the destination node number is a broadcast vector defined by the on-chip interconnection network (covering all HAs and IAs). If it does not, the synchronization processing module sends a separate synchronization request message to each HA and IA through the on-chip interconnection network, and the destination node number of each message is the corresponding HA or IA. The source node number of the synchronization request message is the CA connected to the synchronization processing module, the message type is SYNC_REQ, and the hardware thread number identifies the hardware thread executing the strong synchronization instruction in that CA.

The synchronization response message is used by an IA or a synchronization processing module to return a synchronization response to the source synchronization processing module (i.e., the synchronization processing module that initiated the synchronization request message). Its format consists of a destination node number bit field, a source node number bit field, a message type bit field, and a hardware thread number bit field. The destination node number of the synchronization response message is the CA connected to the source synchronization processing module, the source node number is the IA that sends the synchronization response message or the CA connected to the synchronization processing module that sends it, the message type is SYNC_ACK, and the hardware thread number identifies the hardware thread executing the strong synchronization instruction in the CA connected to the source synchronization processing module.

The synchronization snoop message is used by the HA to broadcast a synchronization snoop to all synchronization processing modules. Its format consists of a destination node number bit field, a source node number bit field, a message type bit field, and a hardware thread number bit field. The destination is all Cache agents connected to synchronization processing modules. As with the synchronization request message, how the snoop is broadcast, and the specific form of the destination node number, depend on whether the on-chip interconnection network supports broadcast. If it does, the HA only needs to send one synchronization snoop message to the on-chip interconnection network, and the destination node number is a broadcast vector defined by the on-chip interconnection network that covers all CAs connected to synchronization processing modules (in this embodiment every CA has a synchronization processing module connected to it, so the vector covers all CAs). If it does not, the HA sends a separate synchronization snoop message to each synchronization processing module through the on-chip interconnection network, and the destination node number of each message is the corresponding CA connected to that synchronization processing module.
The source node number of the synchronization snoop message is the CA connected to the source synchronization processing module (i.e., the source node number of the synchronization request message received by the HA, which is the CA that issued the synchronization request), the message type is SYNC_SNP, and the hardware thread number identifies the hardware thread executing the strong synchronization instruction in that CA (i.e., the hardware thread number of the synchronization request message received by the HA). Note that, unlike the synchronization request message and the synchronization response message, the source node number of the synchronization snoop message is not the sender of the message. This lets a synchronization processing module that receives a synchronization snoop message identify the CA connected to the source synchronization processing module, so that it can generate the destination node number of the synchronization response message.

The new synchronization messages add a hardware thread number bit field to the original message format. Because the hardware thread number is generally no wider than the additional bit field already available in a message, the length of a synchronization message does not exceed the original message length.
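Since the three messages share the same four bit fields, they can be modeled as one record type. The following Python sketch is only an illustration for the case where the network does not support broadcast; the names `SyncMessage`, `MsgType`, and the `make_*` helpers, and the use of strings for node numbers, are assumptions and do not come from the patent.

```python
from dataclasses import dataclass
from enum import Enum


class MsgType(Enum):
    SYNC_REQ = "SYNC_REQ"  # synchronization request: module -> HA or IA
    SYNC_SNP = "SYNC_SNP"  # synchronization snoop: HA -> modules
    SYNC_ACK = "SYNC_ACK"  # synchronization response: IA or module -> source module


@dataclass(frozen=True)
class SyncMessage:
    dest: str       # destination node number
    src: str        # source node number
    mtype: MsgType  # message type
    thread: int     # hardware thread number


def make_request(local_ca: str, target: str, thread: int) -> SyncMessage:
    """Request sent by a module on behalf of its local CA to one HA or IA."""
    return SyncMessage(target, local_ca, MsgType.SYNC_REQ, thread)


def make_snoop(req: SyncMessage, target_ca: str) -> SyncMessage:
    """Snoop sent by an HA; src is copied from the request, not set to the HA."""
    return SyncMessage(target_ca, req.src, MsgType.SYNC_SNP, req.thread)


def make_ack(received: SyncMessage, sender: str) -> SyncMessage:
    """Response sent by an IA (for a request) or a non-source module (for a snoop)."""
    return SyncMessage(received.src, sender, MsgType.SYNC_ACK, received.thread)
```

Because the snoop keeps the requester's CA in its source field, `make_ack` can address the response correctly whether it is answering a request or a snoop.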

As shown in fig. 4, the synchronization request processing flow of the synchronization processing module is as follows:

When the synchronization processing module receives an access request issued by a hardware thread of the local CA (i.e., the CA connected to the synchronization processing module), it checks whether the request is a strong synchronization request.

If the request is not a strong synchronization request, the request is sent onto a request message channel of the on-chip interconnect network.

If the request is a strong synchronization request, the synchronization counter corresponding to the hardware thread is initialized to m × n + k, where m is the number of CAs, n is the number of HAs, and k is the number of IAs. All HAs together generate m × n synchronization snoop messages. Each synchronization snoop message received by a synchronization processing module other than the source produces one synchronization response message, and each synchronization response message decrements the corresponding synchronization counter by one. When the source synchronization processing module itself receives a synchronization snoop message, it does not generate a synchronization response message or send one to the on-chip interconnection network; instead, it directly decrements the corresponding synchronization counter by one. Each IA generates one synchronization response message upon receiving the synchronization request message, so all IAs together generate k synchronization response messages, each of which decrements the corresponding synchronization counter by one. Therefore, the initial value of the synchronization counter is m × n + k. The module then generates a synchronization request message in the format shown in fig. 3 and sends it to the request message channel of the on-chip interconnection network.
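The counting argument above can be checked by adding up the three kinds of events that decrement the counter. The function below is a minimal sketch; the name `initial_sync_count` and the variable names are illustrative assumptions.

```python
def initial_sync_count(m: int, n: int, k: int) -> int:
    """Initial counter value for m CAs, n HAs, and k IAs."""
    # One synchronization snoop from each HA reaches the source module itself.
    snoops_at_source = n
    # Each of the other (m - 1) modules answers every HA's snoop with a response.
    acks_from_other_modules = (m - 1) * n
    # Each IA answers the synchronization request with one response.
    acks_from_ias = k
    total = snoops_at_source + acks_from_other_modules + acks_from_ias
    assert total == m * n + k  # matches the closed form given in the text
    return total
```

For the configuration with two CAs, two HAs, and one IA, `initial_sync_count(2, 2, 1)` evaluates to 5.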

As shown in fig. 5, the synchronous snoop message processing flow of the synchronous processing module is as follows:

When the synchronization processing module receives a snoop message from the on-chip interconnection network, it checks whether the message is a synchronization snoop message.

If the snoop message is not a synchronous snoop message, a snoop request is issued to the local CA to which it is connected.

If the snoop message is a synchronization snoop message, the module checks whether the source node number of the message is the local CA connected to the module.

If the source node number is a local CA, it indicates that the synchronization processing module is a source synchronization processing module, and therefore, the synchronization counter corresponding to the hardware thread number in the synchronization snooping message needs to be decremented by one. If the synchronous counter returns to zero after the subtraction of one, a synchronous completion response needs to be sent to the local CA, which indicates that the execution of the strong synchronous instruction of the corresponding hardware thread is completed.

If the source node number is not a local CA, a synchronization response message is generated as shown in FIG. 3 and sent to the response message channel of the on-chip interconnect network.
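The snoop-side decision flow of fig. 5 can be sketched as a single function. This is an illustration under assumptions: messages are plain dictionaries, the action labels (`forward_to_ca`, `counted`, `sync_complete`, `send_response`) are invented for the sketch, and `counters` maps a hardware thread number to its synchronization counter.

```python
def handle_snoop(msg: dict, local_ca: str, counters: dict):
    """Return (action, payload) for one message arriving on the snoop channel."""
    if msg["type"] != "SYNC_SNP":
        # Ordinary snoop request: pass it through to the local CA unchanged.
        return ("forward_to_ca", msg)
    thread = msg["thread"]
    if msg["src"] == local_ca:
        # This module is the source module: count the snoop directly.
        counters[thread] -= 1
        if counters[thread] == 0:
            return ("sync_complete", thread)
        return ("counted", thread)
    # Not the source module: answer with a synchronization response addressed
    # to the CA named in the snoop's source node field.
    ack = {"type": "SYNC_ACK", "dest": msg["src"], "src": local_ca, "thread": thread}
    return ("send_response", ack)
```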

As shown in fig. 6, the synchronization response message processing flow of the synchronization processing module is as follows:

When the synchronization processing module receives a response message from the on-chip interconnection network, it detects whether the response message is a synchronization response message.

If the response message is not a synchronization response message, a data response is sent to the local CA to which it is connected.

If the response message is a synchronization response message, the synchronization counter corresponding to the hardware thread number in the synchronization response message needs to be decreased by one. If the synchronous counter returns to zero after the subtraction of one, a synchronous completion response needs to be sent to the local CA, which indicates that the execution of the strong synchronous instruction of the corresponding hardware thread is completed.

As shown in fig. 7, the process flow of the synchronization request message of the HA is as follows:

When the HA receives a request message from the on-chip interconnection network, it detects whether the request message is a synchronization request message.

If the request message is not a synchronous request message, the HA processes the request message according to the original Cache consistency protocol.

If the request message is a synchronization request message, a synchronization snoop message is generated as shown in fig. 3, and the generated synchronization snoop message is transmitted to a snoop message channel of the on-chip interconnection network.
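The HA-side flow can be sketched as follows, covering both the broadcast and the non-broadcast case. Messages are modeled as dictionaries; the function name `ha_handle_request`, the `broadcast_supported` flag, and the placeholder destination `"BROADCAST_ALL_CA"` are assumptions made for illustration.

```python
def ha_handle_request(msg: dict, all_cas: list, broadcast_supported: bool = False):
    """Return the synchronization snoop messages an HA emits for one request."""
    if msg["type"] != "SYNC_REQ":
        # Left to the original Cache coherence protocol (not modeled here).
        return None
    # The source node and hardware thread are copied from the request, so the
    # receiving module can later address its response to the requesting CA.
    base = {"type": "SYNC_SNP", "src": msg["src"], "thread": msg["thread"]}
    if broadcast_supported:
        # One message; the network's broadcast vector fans it out.
        return [dict(base, dest="BROADCAST_ALL_CA")]
    # Otherwise one message per CA that has a synchronization processing module.
    return [dict(base, dest=ca) for ca in all_cas]
```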

As shown in fig. 8, the process flow of the synchronization request message in IA is as follows:

When the IA receives a request message from the on-chip interconnect network, it detects whether the request message is a synchronization request message.

If the request message is not a synchronization request message, the IA will process the request message according to the original Cache coherency protocol.

If the request message is a synchronization request message, a synchronization response message is generated as shown in fig. 3 and the generated synchronization response message is transmitted to a response message channel of the on-chip interconnection network.
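The IA-side flow is simpler than the HA-side flow because no snoop is generated. A minimal sketch, with dictionary messages and the hypothetical name `ia_handle_request`:

```python
def ia_handle_request(msg: dict, ia_id: str):
    """Return the synchronization response an IA emits, or None otherwise."""
    if msg["type"] != "SYNC_REQ":
        # Left to the original Cache coherence protocol (not modeled here).
        return None
    # The response goes straight back to the CA named in the request.
    return {"type": "SYNC_ACK", "dest": msg["src"], "src": ia_id,
            "thread": msg["thread"]}
```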

As shown in fig. 9, the synchronization processing module consists of a request channel filter, a response channel filter, a snoop channel filter, and a set of synchronization counters.

The synchronization counter is used for recording the number of synchronization response messages which have not been received by the hardware thread executing the strong synchronization instruction, and each hardware thread in the local CA corresponds to one synchronization counter. When the hardware thread executes the strong synchronization instruction and sends a strong synchronization request to the synchronization processing module, the synchronization counter corresponding to the hardware thread is initialized to the number of synchronization response messages which should be received by the hardware thread. Each time a synchronization processing module receives a synchronization response message, the synchronization counter of the corresponding hardware thread is decremented by one. In particular, if the synchronization processing module receives a synchronization snoop message with a source node number of the message being the local CA, the synchronization counter of the corresponding hardware thread is also decremented by one. If the sync counter returns to zero after decrementing by one, a sync complete response is generated and sent to the local CA.

The request channel filter is used for identifying a strong synchronization request among the access requests sent by the local CA to the on-chip interconnection network and processing it accordingly. All access requests issued by the local CA must be pre-processed by the request channel filter. If the access request is not a strong synchronization request, the request channel filter bypasses the access request and sends it directly onto the request message channel of the on-chip interconnect network. If the access request is a strong synchronization request, the synchronization counter of the corresponding hardware thread is initialized, and a synchronization request message is generated and sent to the request message channel of the on-chip interconnection network.

The response channel filter is used for identifying synchronization response messages among the response messages sent by the on-chip interconnection network to the local CA and processing them accordingly. All response messages from the on-chip interconnect network must be pre-processed by the response channel filter. If the response message is not a synchronization response message, the response channel filter bypasses the response message and sends it directly to the local CA. If the response message is a synchronization response message, the synchronization counter of the corresponding hardware thread is decremented by one. The response channel filter also forwards the synchronization response messages generated by the snoop channel filter onto the response message channel of the on-chip interconnection network.

The snoop channel filter is used for identifying synchronization snoop messages among the snoop messages sent from the on-chip interconnection network to the local CA and processing them accordingly. All snoop messages from the on-chip interconnect network must be pre-processed by the snoop channel filter. If the snoop message is not a synchronization snoop message, the snoop channel filter bypasses the snoop message and sends it directly to the local CA. If the snoop message is a synchronization snoop message and its source node number is the local CA, the synchronization counter of the corresponding hardware thread is decremented by one. If the snoop message is a synchronization snoop message and its source node number is not the local CA, a synchronization response message is generated and sent to the response message channel of the on-chip interconnect network through the response channel filter.
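The three filters and the per-thread counters can be put together in one small software model. This is a sketch under assumptions, not the hardware design: the class name `SyncModule`, the `"SYNC"` and `"SYNC_DONE"` type labels, the dictionary messages, and the two output lists standing in for the network and the local CA are all illustrative, and the network is assumed not to support broadcast.

```python
class SyncModule:
    """Software model of one synchronization processing module."""

    def __init__(self, local_ca, num_threads, targets, expected):
        self.local_ca = local_ca
        self.counters = [0] * num_threads  # one counter per hardware thread
        self.targets = targets             # node numbers of all HAs and IAs
        self.expected = expected           # m * n + k
        self.to_network = []               # (channel, message) pairs sent out
        self.to_ca = []                    # messages delivered to the local CA

    def request_filter(self, req):
        if req["type"] != "SYNC":
            self.to_network.append(("request", req))  # bypass ordinary requests
            return
        t = req["thread"]
        self.counters[t] = self.expected
        for target in self.targets:
            msg = {"type": "SYNC_REQ", "dest": target,
                   "src": self.local_ca, "thread": t}
            self.to_network.append(("request", msg))

    def response_filter(self, msg):
        if msg["type"] != "SYNC_ACK":
            self.to_ca.append(msg)  # bypass ordinary responses
            return
        self._decrement(msg["thread"])

    def snoop_filter(self, msg):
        if msg["type"] != "SYNC_SNP":
            self.to_ca.append(msg)  # bypass ordinary snoops
        elif msg["src"] == self.local_ca:
            self._decrement(msg["thread"])  # this module is the source
        else:
            ack = {"type": "SYNC_ACK", "dest": msg["src"],
                   "src": self.local_ca, "thread": msg["thread"]}
            self.to_network.append(("response", ack))

    def _decrement(self, t):
        self.counters[t] -= 1
        if self.counters[t] == 0:
            # Synchronization completion response to the local CA.
            self.to_ca.append({"type": "SYNC_DONE", "thread": t})
```

Keeping one counter per hardware thread lets several threads of the same CA have strong synchronization instructions outstanding at once, since every message carries the hardware thread number.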

To help the reader better understand the present solution, an example of a synchronization processing protocol message flow is given below, as shown in fig. 10:

Assume that a multi-core multithreaded microprocessor is composed of two CAs (CA0, CA1), two HAs (HA0, HA1), one IA, and an on-chip interconnection network, and that the on-chip interconnection network does not support broadcast operation. Each CA is connected to the on-chip interconnect network through a corresponding synchronization processing module, wherein CA0 is connected to synchronization processing module 0 and CA1 is connected to synchronization processing module 1.

Hardware thread T0 in CA0 executes a strong synchronization instruction and issues a strong synchronization request SYNC to synchronization processing module 0.

After receiving the SYNC, synchronization processing module 0 sets the synchronization counter CNT corresponding to hardware thread T0 to 5, and sends synchronization request messages SYNC_REQ to HA0, HA1, and IA, respectively, where the destination node numbers of the three synchronization request messages are HA0, HA1, and IA, the source node numbers are CA0, and the hardware thread numbers are T0.

After receiving the SYNC_REQ, HA0 sends synchronization snoop messages SYNC_SNP to synchronization processing module 0 and synchronization processing module 1, where the destination node numbers of the two synchronization snoop messages are CA0 and CA1, the source node numbers are CA0, and the hardware thread numbers are T0.

After receiving the SYNC_REQ, HA1 sends synchronization snoop messages SYNC_SNP to synchronization processing module 0 and synchronization processing module 1, where the destination node numbers of the two synchronization snoop messages are CA0 and CA1, the source node numbers are CA0, and the hardware thread numbers are T0.

After receiving the SYNC_REQ, the IA sends a synchronization response message SYNC_ACK to synchronization processing module 0, where the destination node number of the message is CA0, the source node number is IA, and the hardware thread number is T0.

After receiving a SYNC_SNP, synchronization processing module 1 sends a synchronization response message SYNC_ACK to synchronization processing module 0. Since synchronization processing module 1 receives two SYNC_SNPs and the source node numbers of both messages are CA0, it sends two SYNC_ACKs to synchronization processing module 0, where the destination node numbers are CA0, the source node numbers are CA1, and the hardware thread numbers are T0.

Synchronization processing module 0 receives two synchronization snoop messages SYNC_SNP whose source node numbers are CA0, so each SYNC_SNP results in the synchronization counter CNT of T0 being decremented by one.

Synchronization processing module 0 receives three synchronization response messages SYNC_ACK, each of which results in the synchronization counter CNT of T0 being decremented by one.

As can be seen from fig. 10, synchronization processing module 0 receives SYNC_SNP from HA0, SYNC_ACK from synchronization processing module 1, SYNC_SNP from HA1, SYNC_ACK from synchronization processing module 1, and SYNC_ACK from IA in sequence, and the arrival of each message causes the synchronization counter CNT of T0 to be decremented by one. After CNT reaches zero, synchronization processing module 0 sends a synchronization completion response ACK to CA0.
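The message flow of this example can be replayed with a small simulation. The sketch below is illustrative only: it delivers messages through a single FIFO queue, so the arrival order at the source module may differ from the order drawn in fig. 10, which depends on network timing. The function name `simulate_strong_sync` is an assumption, and each queue entry records the physical sender rather than the source node number field, purely to make the trace readable.

```python
from collections import deque


def simulate_strong_sync(cas, has, ias, source_ca):
    """Replay one strong synchronization; return (final counter, trace)."""
    counter = len(cas) * len(has) + len(ias)  # m * n + k
    trace = []
    # Each entry is (message type, physical sender, destination).
    queue = deque(("SYNC_REQ", source_ca, dest) for dest in has + ias)
    while queue:
        mtype, sender, dest = queue.popleft()
        if mtype == "SYNC_REQ" and dest in has:
            # An HA fans the request out as snoops to every module.
            queue.extend(("SYNC_SNP", dest, ca) for ca in cas)
        elif mtype == "SYNC_REQ":
            # An IA answers the request directly.
            queue.append(("SYNC_ACK", dest, source_ca))
        elif mtype == "SYNC_SNP" and dest != source_ca:
            # A non-source module turns the snoop into a response.
            queue.append(("SYNC_ACK", dest, source_ca))
        else:
            # A snoop or response arriving at the source module is counted.
            counter -= 1
            trace.append((mtype, sender, counter))
    return counter, trace
```

Calling `simulate_strong_sync(["CA0", "CA1"], ["HA0", "HA1"], ["IA"], "CA0")` counts two SYNC_SNP and three SYNC_ACK arrivals at the source module and leaves the counter at zero, in line with the five decrements described in the example.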

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Although terms such as synchronization processing module, request channel, response channel, snoop channel, request channel filter, response channel filter, snoop channel filter, synchronization counter, Cache agent, directory agent, and IO agent are used frequently herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the present invention more conveniently; construing them as imposing any additional limitation would be contrary to the spirit of the present invention.

Claims

1. A multi-copy non-atomic write storage sequence synchronous operation implementation device comprises a microprocessor framework adopting a non-atomic write implementation mode, wherein the microprocessor framework comprises a plurality of Cache agents, a plurality of directory agents and a plurality of IO agents, and the Cache agents are connected to the directory agents and the IO agents through an on-chip interconnection network;

the on-chip interconnection network at least comprises a request channel, a response channel and a monitoring channel; the synchronous processing module comprises a request channel filter, a response channel filter, a monitoring channel filter and a group of synchronous counters, and each hardware thread in the Cache agent connected with the corresponding synchronous processing module corresponds to one synchronous counter in the group of synchronous counters.

2. The device for realizing the storage order synchronous operation of the multi-copy non-atomic write according to claim 1, wherein the request channel filter is configured to determine whether an access request sent from a corresponding Cache agent to the on-chip interconnection network is a strong synchronous request, and perform corresponding processing;

3. The apparatus for implementing storage order synchronous operation of multi-copy non-atomic write according to claim 2, wherein the request channel filter, the response channel filter and the snoop channel filter of each synchronous processing module are implemented as follows:

4. The apparatus according to claim 3, wherein the message formats of the synchronous response message, the synchronous snoop message, and the synchronous request message each include a destination node number bit field, a source node number bit field, a message type bit field, and a hardware thread number bit field;

the destination node number of the synchronous response message is a Cache agent connected with a source synchronous processing module, the source node number is an IO agent sending the synchronous response message or a Cache agent connected with the synchronous processing module, the message type is SYNC_ACK, and the hardware thread number is a hardware thread executing a strong synchronous instruction in the Cache agent connected with the source synchronous processing module;

the destination node numbers of the synchronous monitoring messages are all Cache agents connected with a synchronous processing module, the source node numbers are Cache agents connected with the source synchronous processing module, the message types are SYNC_SNP, and the hardware thread numbers are hardware threads for executing strong synchronous instructions in the Cache agents connected with the source synchronous processing module;

the destination node number of the synchronous request message is all directory agents and IO agents, the source node number is a Cache agent connected with the synchronous processing module, the message type is SYNC_REQ, and the hardware thread number is a hardware thread executing a strong synchronous instruction in the Cache agent connected with the synchronous processing module.

5. A method for realizing multi-copy non-atomic write storage order synchronous operation is characterized by comprising the following steps:

when the IO agent receives the synchronization request message, a synchronization response message is returned to the source synchronization processing module;

6. The method for implementing storage order synchronization operation of multi-copy non-atomic write according to claim 5, wherein step S1 specifically includes:

7. The method for implementing storage order synchronization operation of multi-copy non-atomic write according to claim 5, wherein step S3 specifically includes:

8. The method for implementing storage order synchronization operation of multi-copy non-atomic write according to claim 5, wherein step S4 specifically includes:

9. The method for implementing storage order synchronization operation of multi-copy non-atomic write according to claim 5, wherein in step S2, the processing procedure of the request message by the directory agent includes: