CN112699061A - Systems, methods, and media for implementing cache coherency for PCIe devices - Google Patents


Info

Publication number
CN112699061A
Authority
CN
China
Prior art keywords
cache
address
data
state
pcie
Prior art date
Legal status
Granted
Application number
CN202011429683.4A
Other languages
Chinese (zh)
Other versions
CN112699061B (en)
Inventor
缪露鹏
Current Assignee
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd
Priority to CN202011429683.4A
Publication of CN112699061A
Application granted
Publication of CN112699061B
Status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0871Allocation or management of cache space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38Information transfer, e.g. on bus
    • G06F13/42Bus transfer protocol, e.g. handshake; Synchronisation
    • G06F13/4282Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus

Abstract

Systems, methods, and media for implementing cache coherency for PCIe devices are provided. The system comprises: a CPU cache master controller configured to send a first read-write request of the CPU to a first address to a memory cache slave controller via a cache command on an internal bus; the memory cache slave controller, configured to update the state of the memory cache line of the first address according to the state of the memory cache line addressed by the first read-write request, and to send the first address and a command to update the state of the PCIe device cache line to a first state to an input/output bridge controller via a cache command on the internal bus; and the input/output bridge controller, configured to: send the first address and the command to update the state of the PCIe device cache line to the first state to the PCIe device in a first PCIe bus message, receive first data from the first address of the PCIe device and a response that the state of the PCIe device cache line has been updated to the first state in a second PCIe bus message, and forward them to the memory cache slave controller via a cache command on the internal bus.

Description

Systems, methods, and media for implementing cache coherency for PCIe devices
Technical Field
The present application relates to the field of integrated circuits, and more particularly, to systems, methods, and media for implementing cache coherency for PCIe devices.
Background
Peripheral Component Interconnect Express (PCIe) is an evolution of the Peripheral Component Interconnect (PCI) computer bus standard: it retains the existing PCI programming model and communication standard but is based on a much faster serial communication system. The PCIe bus is used only for internal interconnects. Because a PCIe system is built on the existing PCI model, an existing PCI system can be converted to PCIe by modifying only the physical layer, with no software changes. The higher data rate of the PCIe bus has allowed it to replace almost all existing internal buses, including the Accelerated Graphics Port (AGP) bus and the PCI bus.
A PCIe bus link uses end-to-end data transfer, as shown in fig. 1, which shows how PCIe bus links are connected in the conventional technology. In fig. 1, a root complex 102 is connected to, and manages, the CPU subsystem 101 and the system memory 103; the root complex 102 also interacts with a PCIe Endpoint (EP) device 105 through a PCIe switch device (Switch) 104 to exchange and execute commands from the CPU subsystem 101 and from the PCIe endpoint device. Each end of a PCIe bus link can connect to only one sending or receiving device, such as PCIe endpoint device 105; therefore, a PCIe switch device 104 must be used to extend the PCIe bus link before multiple devices can be connected.
The conventional PCIe protocol does not allow a PCIe Endpoint (EP) device to maintain data coherency with memory and the Central Processing Unit (CPU). When the EP device and the CPU each have their own cache, both caches may hold copies of the same data; if each cache independently modifies its local copy, the copies become inconsistent across caches.
Accordingly, there is a need for techniques to achieve cache coherency for PCIe devices.
Disclosure of Invention
According to one aspect of the present invention, there is provided a system for implementing cache coherency for a PCIe device, comprising: a CPU cache master controller, coupled with the CPU through an internal bus, configured to send a first read-write request of the CPU to a first address to a memory cache slave controller via a cache command on the internal bus; the memory cache slave controller, coupled with the memory through the internal bus, configured to update the state of the memory cache line of the first address according to the state of the memory cache line addressed by the first read-write request, and to send the first address and a command to update the state of the PCIe device cache line to a first state to an input/output bridge controller via a cache command on the internal bus; and the input/output bridge controller, coupled with the PCIe device through a PCIe bus, configured to: send the first address and the command to update the state of the PCIe device cache line to the first state to the PCIe device in a first PCIe bus message, receive, in a second PCIe bus message, first data from the first address of the PCIe device and a response indicating that the state of the PCIe device cache line has been updated to the first state, and forward them to the memory cache slave controller via a cache command on the internal bus; wherein the memory cache slave controller is configured to update the data and state of the memory cache line of the first address with the first data and the first state in the second PCIe bus message, and to send the first data and a response with the updated state to the CPU cache master controller via a cache command on the internal bus; and wherein the CPU cache master controller is configured to update the data and state of the CPU cache line with the first data and the response with the updated state.
According to another aspect of the present invention, there is provided a method for implementing cache coherency for a PCIe device, comprising: sending, by a CPU cache master controller coupled with a central processing unit (CPU) through an internal bus, a first read-write request of the CPU to a first address to a memory cache slave controller via a cache command on the internal bus; updating, by the memory cache slave controller coupled with the memory through the internal bus, the state of the memory cache line of the first address according to the state of the memory cache line addressed by the first read-write request, and sending the first address and a command to update the state of the PCIe device cache line to a first state to an input/output bridge controller via a cache command on the internal bus; and, by the input/output bridge controller coupled with the PCIe device through a PCIe bus: sending the first address and the command to update the state of the PCIe device cache line to the first state to the PCIe device in a first PCIe bus message, receiving, in a second PCIe bus message, first data from the first address of the PCIe device and a response indicating that the state of the PCIe device cache line has been updated to the first state, and forwarding them to the memory cache slave controller via a cache command on the internal bus; wherein the memory cache slave controller updates the data and state of the memory cache line of the first address with the first data and the first state in the second PCIe bus message and sends the first data and a response with the updated state to the CPU cache master controller via a cache command on the internal bus, and the CPU cache master controller updates the data and state of the CPU cache line with the first data and the response with the updated state.
According to another aspect of the invention, there is provided a computer storage medium storing computer program instructions which, when executed by a processor, perform the method of the various embodiments of the invention.
According to another aspect of the invention, there is provided a processing system comprising a processor and a computer storage medium, wherein the computer storage medium stores computer program instructions which, when executed by the processor, perform the method of the various embodiments of the invention.
Drawings
Fig. 1 shows a connection manner of PCIe bus links in the conventional art.
FIG. 2 illustrates the hierarchical structure for devices of the PCIe bus Specification.
Fig. 3 is a diagram illustrating an EP device reading or writing data in a memory through a DMA engine.
FIG. 4 illustrates a system implementing cache coherency for a PCIe device according to an embodiment of the invention.
FIG. 5 illustrates the data flow between the memory, the input/output bridge controller, and the PCIe device, according to an embodiment of the invention.
Fig. 6A shows an example of the format of a message header in the 4DW format.
Fig. 6B shows an example of a format of a Vendor-Defined (Vendor-Defined) message header.
FIG. 7A shows header and body fields of a Vendor Defined Message (VDM).
Fig. 7B and 7C illustrate a VDM0 packet format and a VDM1 packet format, respectively, according to an embodiment of the present invention.
Fig. 8 shows an example of an operation in a case where a CPU requests a read/write of an address in a memory to monopolize data of the address according to an embodiment of the present invention.
FIG. 9 illustrates an example of operation where a PCIe device requests a read or write to an address in memory to monopolize the data for the address in accordance with an embodiment of the invention.
FIG. 10 shows a flow diagram of a method of implementing cache coherency for a PCIe device in accordance with an embodiment of the invention.
FIG. 11 illustrates a block diagram of an exemplary computer system/server suitable for use in implementing embodiments of the present invention.
Detailed Description
Reference will now be made in detail to the present embodiments of the invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the specific embodiments, it will be understood that they are not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. It should be noted that the method steps described herein may be implemented by any functional block or functional arrangement, and that any functional block or functional arrangement may be implemented as a physical entity or a logical entity, or a combination of both.
The PCIe bus specification adopts a layered structure for the design of devices, and includes a transaction layer, a data link layer, and a physical layer, each of which is divided into two function blocks, i.e., a transmitting function block and a receiving function block.
FIG. 2 illustrates the layered device structure defined by the PCIe bus specification. As shown in fig. 2, at the transmitter, the application (device A) forms a Transaction Layer Packet (TLP) at the transaction layer, stores it in the transmit buffer, and waits for it to be pushed to the lower layers. At the data link layer, additional information is appended to the TLP; this information is used for error checking when the packet is received by the other party, forming a Data Link Layer Packet (DLLP). At the physical layer, the DLLP is encoded, occupies an available lane in the PCIe bus link, and is sent out from the transmitter side (TX).
The receiver end performs the inverse of the transmitter end: while the transmitter side keeps packing, the receiver side (RX) keeps unpacking, finally extracting the useful data for the application program of device B.
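The pack-then-unpack symmetry above can be sketched in a few lines. This is a simplified model, not the actual PCIe framing: the data link layer's additional information is reduced to a sequence number plus a link CRC (modeled with CRC-32 for illustration), and the physical-layer encoding is omitted.

```python
import struct
import zlib

def dll_wrap(tlp: bytes, seq: int) -> bytes:
    """Data link layer (transmit side): prepend a sequence number and append a
    32-bit link CRC used by the receiver for error checking."""
    framed = struct.pack(">H", seq & 0x0FFF) + tlp
    return framed + struct.pack(">I", zlib.crc32(framed))

def dll_unwrap(dllp: bytes) -> tuple:
    """Receive side ("inverse" process): verify the CRC, strip the framing,
    and hand the recovered TLP back up to the transaction layer."""
    framed, lcrc = dllp[:-4], struct.unpack(">I", dllp[-4:])[0]
    if zlib.crc32(framed) != lcrc:
        raise ValueError("link CRC mismatch: link-level error")
    seq = struct.unpack(">H", framed[:2])[0]
    return seq, framed[2:]

tlp = b"\x60\x00\x00\x01DATA"  # stand-in transaction layer packet
seq, recovered = dll_unwrap(dll_wrap(tlp, seq=5))
```

The receiver recovers exactly the TLP the transmitter's transaction layer produced, which is the point of the layered design.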
The conventional PCIe protocol does not allow EP devices to maintain data coherency with memory and the CPU. A common workaround is for the EP device to read or write data in memory through a Direct Memory Access (DMA) engine.
One simple solution is to reserve a segment of non-cached (non-cache) memory dedicated to EP device access. But the disadvantage is obvious: because the region is uncached, data read and write performance suffers.
However, the performance improvement brought by a cache requires software to maintain data consistency. Fig. 3 shows a schematic diagram of an EP device 301 reading or writing data in a memory 303 through a DMA engine mechanism. As shown in fig. 3, if data is moved from the EP device 301 to the memory 303 (a DMA read), the DMA controller 302 must invalidate the cache lines in that address range in the cache 304 before the transfer, so that after the transfer finishes the software will not read stale data on a cache hit. If data is moved from the memory 303 to the EP device 301 (a DMA write), the CPU 305 must clean (write back) the cache lines in that address range in its cache 306 before the transfer, ensuring the data in the cache 306 reaches the memory 303 so that the EP device 301 does not move stale memory data. The interaction between the CPU, the memory, and the EP device is implemented via a system bus. In both directions of operation, software must not access the EP device's address space until the EP device transfer is complete.
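The two software maintenance operations described above (invalidate before a DMA read, clean before a DMA write) can be modeled with a toy cache, where a cache maps an address to a (value, dirty) pair backed by a memory dictionary. The names and values here are illustrative, not part of the patent.

```python
# Toy model of software-maintained DMA coherency as in Fig. 3.
memory = {0x100: "old"}
cpu_cache = {0x100: ("new", True)}  # dirty copy sitting in the CPU cache

def clean(cache, mem, addr):
    """Before a DMA write (memory -> EP device): write dirty data back,
    so the device transfers current data rather than stale memory contents."""
    if addr in cache:
        value, dirty = cache[addr]
        if dirty:
            mem[addr] = value
            cache[addr] = (value, False)

def invalidate(cache, addr):
    """Before a DMA read (EP device -> memory): drop the cached copy,
    so a later CPU access misses and fetches the fresh data from memory."""
    cache.pop(addr, None)

clean(cpu_cache, memory, 0x100)      # DMA-write preparation
dma_to_device = memory[0x100]        # the device now sees "new", not "old"

invalidate(cpu_cache, 0x100)         # DMA-read preparation
memory[0x100] = "from_device"        # the device writes memory
cpu_read = cpu_cache.get(0x100, memory[0x100])  # cache miss -> fresh data
```

Skipping either step reproduces exactly the stale-data hazards the paragraph describes.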
Since multiple copies of the same memory data may exist in different caches at the same time, data inconsistency arises if the CPU or the EP device modifies its local copy. To solve this inconsistency problem, a coherency protocol, such as the Modified/Exclusive/Shared/Invalid (MESI) protocol, is usually required.
The MESI protocol is an invalidation-based cache coherency protocol and one of the most common protocols supporting write-back caches. It defines four states for a cache line: Modified (M), Exclusive (E), Shared (S), and Invalid (I).
Modified (M): the cache line exists only in the current cache and is dirty; its value differs from main memory. The cache must write the data back to main memory at some future time before any other reader is permitted to read the (no longer valid) main memory copy. The write-back changes the line to the Shared (S) state. When a block is marked M (modified), copies of that block in other caches are marked I (invalid).
Exclusive (E): the cache line exists only in the current cache but is clean; it matches main memory. It may change to the Shared state at any time in response to a read request, or to the Modified state when written.
Shared (S): the cache line may also be stored in other caches of the machine and is clean; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid (I): the cache line is invalid (unused).
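The four state descriptions above amount to a small transition table. The sketch below encodes the common MESI transitions; event names ("pr_rd"/"pr_wr" for local processor accesses, "bus_rd"/"bus_wr" for snooped accesses from another cache) are conventional textbook labels, not terms from the patent.

```python
# Minimal MESI transition table for a single cache line.
MESI = {
    ("I", "pr_rd_shared"): "S",  # read; another cache already holds a copy
    ("I", "pr_rd_alone"):  "E",  # read; no other copy exists
    ("I", "pr_wr"):        "M",
    ("E", "pr_wr"):        "M",  # silent upgrade: line was exclusive and clean
    ("E", "bus_rd"):       "S",
    ("E", "bus_wr"):       "I",
    ("S", "pr_wr"):        "M",  # invalidates other caches' shared copies
    ("S", "bus_wr"):       "I",
    ("M", "bus_rd"):       "S",  # write back dirty data, then share
    ("M", "bus_wr"):       "I",  # write back dirty data, then invalidate
}

def step(state: str, event: str) -> str:
    # Pairs not listed (e.g. a read hit in M) leave the state unchanged.
    return MESI.get((state, event), state)

line = "I"
line = step(line, "pr_rd_alone")  # -> E: only copy, clean
line = step(line, "pr_wr")        # -> M: modified locally
line = step(line, "bus_rd")       # -> S: another cache read the line
```

The M → I transition on a snooped write is exactly what happens to the PCIe device's line in the exclusive-read flow described later in this document.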
The EP device may be a compute-intensive device (such as a PCIe accelerator card) used primarily to improve platform performance for, e.g., critical data computation or content encryption processing. Processors typically execute instructions sequentially; the most significant difference of a hardware accelerator is that it can process in parallel, making hardware processing far more efficient than software processing. Software code requiring large amounts of computation is therefore offloaded to the hardware accelerator to improve the overall performance of the system. Compute-intensive PCIe devices have high-performance, low-latency requirements.
The disclosed technology can: improve the performance of compute-intensive devices by using a cache, so that when the EP device wants to read data it need not fetch from memory every time; if the address hits in the cache, the data can be taken directly from the cache, avoiding a main-memory access and effectively improving data processing speed. It can also maintain cache coherency at the hardware level, with the whole process transparent to software, thereby reducing the complexity of software programming.
In order that those skilled in the art will better understand the present invention, the following detailed description of the invention is provided in conjunction with the accompanying drawings and the detailed description of the invention.
Note that the example described next is only a specific example and is not intended to limit the embodiments of the present invention; the specific shapes, hardware, connections, steps, numerical values, conditions, data, orders, and the like shown and described are merely illustrative. Those skilled in the art can, upon reading this specification, use the concepts of the present invention to construct more embodiments than are specifically described herein.
FIG. 4 illustrates a system 400 for implementing cache coherency for PCIe devices in accordance with an embodiment of the present invention.
A system (or Root Complex, RC) 400 for implementing cache coherency for PCIe devices includes a CPU cache master 401, coupled to the central processing unit CPU (subsystem) 404 via an internal bus and configured to send a first read/write request of the CPU 404 to a first address to a memory cache slave 402 via a cache command on the internal bus. The memory cache slave 402, coupled to a memory 405 via the internal bus, is configured to update the state of the memory cache line of the first address according to the state of the memory cache line addressed by the first read/write request, and to send the first address and a command to update the state of the PCIe device cache line to a first state to an input/output bridge controller 403 via a cache command on the internal bus. The input/output bridge controller 403, coupled to the PCIe device 407 via the PCIe bus, is configured to: send the first address and the command to update the state of the PCIe device cache line to the first state to the PCIe device 407 in a first PCIe bus message, receive the first data from the first address of the PCIe device 407 and the response that the state of the PCIe device cache line has been updated to the first state in a second PCIe bus message, and forward them to the memory cache slave 402 via a cache command on the internal bus. The memory cache slave 402 is configured to update the data and state of the memory cache line of the first address with the first data and the first state in the second PCIe bus message, and to send the first data and the response with the updated state to the CPU cache master 401 via a cache command on the internal bus.
In one embodiment, the CPU cache master 401 is configured to update the data and state of the CPU cache line with the first data and the response of the updated state.
In one embodiment, the CPU cache master 401 is configured to receive the first read/write request from the CPU 404, determine from the first read/write request and the state of the CPU cache line whether the first address hits in the CPU cache line, and send the first read/write request to the memory cache slave 402 if the first address does not hit (misses) in the CPU cache line.
In one embodiment, the first read/write request is a request to modify, monopolize (hold exclusively), or share the data at the first address; the state of the memory cache line of the first address is that it has been monopolized by the PCIe device; and the first state is the invalid state. The memory cache slave 402 is configured to update the data of the memory cache line of the first address with the first data in the second PCIe bus message and to update its state to modified, monopolized, or shared by the CPU, and the CPU cache master 401 is configured to update the data of the CPU cache line with the first data and to update its state to modified, monopolized, or shared by the CPU.
Thus, by providing three pieces of hardware, the CPU cache master 401, the memory cache slave 402, and the input/output bridge controller 403, cache coherency among the CPU 404, the memory 405, and the PCIe device 407 is maintained through the interaction of these messages when the CPU issues a read/write request to an address in memory to modify, monopolize, or share the data at that address.
In a system connecting multiple PCIe devices as shown in fig. 4, the Root Complex (RC) 400 connects the CPU 404 and the system memory (SYSTEM MEMORY) 405 through an internal bus, and connects various external PCIe devices 407 through PCIe bus links extended by a PCIe switch device (Switch) 406. The input/output bridge controller 403 may send the first PCIe bus message to the PCIe device 407 through the PCIe switch device 406. The PCIe device 407 may include a PCIe device cache controller 409 to process messages arriving from the PCIe switch device 406 over the PCIe bus link and to send the second PCIe bus message back to the input/output bridge controller 403 through the PCIe switch device 406. The internal bus in the root complex interconnects the cache masters 401, the cache slaves 402, and the input/output bridge controller 403.
The CPU subsystem 404 includes its own cache memory (not shown), which may be built from Static Random-Access Memory (SRAM); each cache line includes a flag field indicating the coherency state of the data, and a data field. The CPU cache master (Coherent Master) 401 of the CPU subsystem 404 is configured to receive read/write requests from the CPU 404, snoop the snoop requests (snoops) on the internal bus of the RC 400, and perform the corresponding processing on its cache memory.
The memory cache slave (Coherent Slave) 402 of the system memory 405 is configured to manage the data and states of the cache lines in all caches, receive read/write requests to the memory 405 from the CPU cache master 401 of each CPU, and send snoop requests to the relevant CPU cache masters 401 according to the read/write request address, so as to maintain cache coherency.
The primary role of the input/output bridge controller (IO Bridge Controller) 403 is to convert requests from PCIe bus links into requests on the internal bus, and requests on the internal bus that access PCIe peripherals into PCIe requests. It includes an input/output bridge cache master (Coherent Master) 408 that receives cacheable read and write requests from PCIe devices 407 and snoop requests from the memory cache slave (Coherent Slave) 402 inside the root complex 400.
The accelerator card system of the PCIe device 407 includes the accelerator card's own functional logic and a PCIe device cache controller (Cache Controller) 409; the PCIe device cache controller 409 is configured to receive read/write requests from the accelerator card and snoop requests from the PCIe bus link.
FIG. 5 illustrates the flow of data between the memory 405, the input/output bridge controller 403, and the PCIe device 407 according to an embodiment of the invention. As shown in the data flow of fig. 5, if a read/write request address from the PCIe device 407 does not hit in its internal cache, the PCIe device 407 generates a corresponding cacheable read/write command (cache request), which is routed over the PCIe bus via the PCIe switch device (SWITCH) to the input/output bridge cache master 408 on the input/output bridge controller 403 in the form of a Vendor Defined Message 0 (VDM0). The input/output bridge cache master 408 parses it, converts it into the corresponding cache request, and transmits it to the memory cache slave 402; the memory cache slave 402 then sends the cache response back to the input/output bridge cache master 408, which converts it into a cache response in the form of a Vendor Defined Message 1 (VDM1) and sends it to the PCIe device cache controller 409. Similarly, a snoop request received from the memory cache slave 402 is converted by the input/output bridge cache master 408 into a VDM0 message and routed over the PCIe bus through the PCIe switch device to the PCIe device cache controller 409 of the PCIe device 407, while all returned responses (snoop responses) travel from the PCIe device cache controller 409 back to the input/output bridge cache master 408 as VDM1 messages; the input/output bridge cache master 408 then sends each snoop response back to the memory cache slave 402.
The VDM0 and VDM1 messages use the VDM message format, a custom message format. A message is a type of TLP. In the PCIe protocol, PCIe bus messages replace the out-of-band signals of conventional PCI for interrupts, errors, and power management; all message headers (Message Headers) are in a 4 Double Word (DW) format, as shown in fig. 6A, which shows an example of a 4DW-format message header. The message code field determines the type of the message.
This scheme mainly uses Vendor-Defined messages among the PCIe bus messages. This message type is the PCIe protocol's custom function extension for designers; its format is shown in fig. 6B, which shows an example of a vendor-defined message header. Its message code field is 0111111x.
The message includes two types: vendor Defined Message 0 (VDM 0) and Vendor Defined Message 1 (VDM 1), both types of messages being used by the present application.
The two types differ in the receiver's behavior when it does not support the message:
if it is VDM1, the message is silently discarded;
if it is VDM0, the device records it as an Unsupported Request.
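The "0111111x" pattern and the two receiver behaviors can be made concrete. Per the PCIe specification, the x bit distinguishes the two vendor-defined message codes: Type 0 uses 0b0111_1110 (0x7E) and Type 1 uses 0b0111_1111 (0x7F). The sketch below models only the unsupported-message handling described above; the `receive` function and its return strings are illustrative.

```python
# Vendor-defined message codes: only the lowest bit (the "x") differs.
VDM0_CODE = 0b0111_1110  # Type 0, x = 0
VDM1_CODE = 0b0111_1111  # Type 1, x = 1

def receive(message_code: int, supported: bool) -> str:
    """Model of a receiver's handling of a vendor-defined message."""
    if supported:
        return "processed"
    if message_code == VDM1_CODE:
        return "silently discarded"          # VDM1: drop without error
    if message_code == VDM0_CODE:
        return "logged as Unsupported Request"  # VDM0: record the error
    return "not a vendor-defined message"

# Both codes share the 0111111 prefix; only bit 0 differs.
assert (VDM0_CODE >> 1) == (VDM1_CODE >> 1) == 0b0111_111
```

This difference is why the scheme carries requests in VDM0 (a lost request must be noticed) and responses in VDM1.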
Since the bytes of the VDM from byte 12 onward are a designer-customized field, the cache-related commands and responses can be encapsulated into the VDM body field and then transmitted over the PCIe bus, as shown in fig. 7A, which shows the VDM header and body fields.
The implementation of the VDM packet format is flexible, and the following method is adopted in the scheme to encapsulate the cache-related commands and responses into the text field of the VDM:
for the cache/snoop request, the VDM0 packet format as shown in fig. 7B is employed. For the cache/snoop response, the VDM1 packet format as shown in fig. 7C is used. Fig. 7B and 7C illustrate a VDM0 packet format and a VDM1 packet format, respectively, according to an embodiment of the present invention. Of course, the present application is not limited to these two packet format messages, and other format messages may also be utilized, which is not illustrated here.
Fig. 8 shows an example of an operation in a case where a CPU requests a read/write of an address in a memory to monopolize data of the address according to an embodiment of the present invention.
In step 801, when the CPU wants exclusive ownership of the data at address A, whether address A hits in the CPU cache is determined from the exclusive read/write request for address A and the state of the CPU cache line for address A. If address A does not hit in the CPU cache line, the CPU cache master 401 of the CPU issues a request for exclusive ownership of the data at address A, such as the internal-bus cache command RdBlkE, to the memory cache slave 402 via the internal bus.
In step 802, after receiving the RdBlkE command, the memory cache slave 402 looks up the state of the cache line corresponding to address A and finds that the line has been exclusively owned by the PCIe device. The memory cache slave 402 updates the state of the cache line for address A to invalid, and sends address A, together with a command to update the state of the PCIe device cache line to CPU-exclusive, to the input/output bridge cache master 408 of the input/output bridge controller 403 via an internal-bus cache command such as the snoop command SnpBlkE, so as to notify the PCIe device that the state of its cache line must subsequently be switched: first to invalid, then to CPU-exclusive.
In step 803, upon receiving the command SnpBlkE, the input/output bridge cache master 408 extracts address A and the command information (e.g., updating the PCIe device cache line state to CPU-exclusive, among other information), encapsulates them into a VDM0 message (a snoop message), and sends the message onto the PCIe bus to be snooped by the PCIe device cache controller 409 of the PCIe device.
In step 804, after snooping the VDM0 message, the PCIe device cache controller 409 checks the state of the corresponding cache line for address A in its internal cache and finds that the data of the cache line has been modified (state M). It therefore first changes the state of the cache line to invalid (state I), encapsulates the modified data and the state information (switching succeeded) into a VDM1 message, and places the message on the PCIe bus. If the data of the cache line has not been modified, no data is transmitted; only the state information (switching succeeded) is transmitted. Of course, the VDM1 message also includes a tag number confirming that it is a response to the SnpBlkE command received in step 803.
In step 805, the input/output bridge cache master 408 receives the VDM1 message, parses its fields, determines from the tag number that it is a response to the snoop command SnpBlkE received in step 803, and sends the internal bus cache command SnpRspStatus, carrying the modified data and the state information (switching succeeded), to the memory cache slave 402 via the internal bus to be snooped by the memory cache slave 402.
In step 806, after the memory cache slave 402 snoops the SnpRspStatus command, it determines from the state information that the state switching of the cache line has been completed, updates the carried data into the memory, changes the state of the cache line to CPU-exclusive, and returns the data and the state to the CPU cache master 401 using the internal bus cache command RspStatus.
In step 807, the CPU cache master 401 changes the state of its internal cache line to the CPU-exclusive state (E) according to the returned data and state, and updates the cache line with the returned data.
Of course, the above example only illustrates the CPU initiating an exclusive read-write request; read-write requests such as modify and share proceed analogously.
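The message flow of steps 801 through 807 can be modeled in software. The sketch below is a minimal simulation under assumed data structures (a per-address owner directory and simple state/data maps); it is not the hardware implementation, but it reproduces the ordering described above: snoop, invalidation at the device, write-back of modified data, and the final switch to the CPU-exclusive state E.

```python
# Minimal software model of steps 801-807. Component and command names
# follow the text; the data structures and method signatures are assumptions.

class PCIeDeviceCache:
    """Stands in for the PCIe device cache controller 409."""
    def __init__(self):
        self.lines = {}  # address -> (state, data); states 'M','E','S','I'

    def snoop_invalidate(self, addr):
        """Step 804: on a SnpBlkE-style snoop, switch to invalid and
        return the dirty data only if the line was modified (state M)."""
        state, data = self.lines.get(addr, ('I', None))
        self.lines[addr] = ('I', None)
        return data if state == 'M' else None

class MemoryCacheSlave:
    """Stands in for the memory cache slave 402 and its directory."""
    def __init__(self, memory, device_cache):
        self.memory = memory              # address -> data
        self.device_cache = device_cache
        self.directory = {}               # address -> 'CPU' | 'PCIe' | None

    def rdblke(self, addr):
        """Steps 802-806: grant the CPU exclusive ownership of addr."""
        if self.directory.get(addr) == 'PCIe':
            dirty = self.device_cache.snoop_invalidate(addr)  # 803-805
            if dirty is not None:
                self.memory[addr] = dirty  # step 806: write back dirty data
        self.directory[addr] = 'CPU'       # line is now CPU-exclusive
        return self.memory.get(addr)

class CPUCacheMaster:
    """Stands in for the CPU cache master 401."""
    def __init__(self, slave):
        self.slave = slave
        self.lines = {}

    def read_exclusive(self, addr):
        """Steps 801 and 807."""
        state, _ = self.lines.get(addr, ('I', None))
        if state == 'I':                   # miss: go to the memory cache slave
            data = self.slave.rdblke(addr)
            self.lines[addr] = ('E', data)  # step 807: CPU-exclusive state E
        return self.lines[addr]
```

Running the model with the device holding a modified copy of address A shows the CPU ending in state E with the modified data, the device line invalidated, and the write-back landing in memory, matching the figure's sequence.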
Thus, by providing three hardware components, namely the CPU cache master 401, the memory cache slave 402, and the input/output bridge controller 403 including the input/output bridge cache master 408, cache consistency among the CPU, the memory, and the PCIe device is maintained through the interaction of the various messages when the CPU modifies, monopolizes, or shares the data at an address in the memory via a read-write request for that address.
An embodiment of a PCIe device initiating a read and write request is described below.
In one embodiment, the input/output bridge controller 403 is configured to: receive, via a first PCIe bus message, a second read-write request for a second address from the PCIe device, and send a cache command over the internal bus to the memory cache slave 402 via the internal bus.
The memory cache slave 402 is configured to send second data of a second address and a second status to the input/output bridge cache master 408 of the input/output bridge controller 403 via the internal bus by a cache command of the internal bus according to the second status of the CPU cache line of the second address.
The input/output bridge cache master 408 is configured to send the second data and the second state to the PCIe device via the PCIe bus through a second PCIe bus message, such that the PCIe device updates the data and state of the PCIe device cache line at the second address with the second data and the second state.
In one embodiment, the PCIe device is configured to, in response to requiring the data at the second address, determine whether the second address hits in a PCIe device cache line; if not, it sends the second read-write request to the input/output bridge cache master 408 and records the state of the PCIe device cache line for the second address as the invalid state.
In one embodiment, the second read-write request is a request to modify, monopolize, or share the data at the second address, and the second state is the invalid state, wherein the PCIe device updates the data of the PCIe device cache line at the second address with the second data and the second state, and updates the state to modified, monopolized, or shared by the PCIe device.
Thus, by providing three hardware components, namely the CPU cache master 401, the memory cache slave 402, and the input/output bridge controller 403 including the input/output bridge cache master 408, cache consistency among the CPU, the memory, and the PCIe device is maintained through the interaction of the various messages when the PCIe device modifies, monopolizes, or shares the data at an address in the memory via a read-write request for that address.
FIG. 9 illustrates an example of operation where a PCIe device requests a read or write to an address in memory to monopolize the data for the address in accordance with an embodiment of the invention.
In step 901, the PCIe device needs the data corresponding to an address B. It first looks up whether there is a matching address in its own cache, that is, whether the address hits. After the PCIe device cache controller 409 finds that address B misses in the lookup table (TLB) of the cache line, it initiates, via a VDM0 message over the PCIe bus, an exclusive read of the data at address B of the memory to the input/output bridge cache master 408 of the input/output bridge controller 403, expecting to monopolize the data at address B. At this point, the state of the cache line recorded in the PCIe device cache controller 409 is I, i.e., the cache line is invalid.
In step 902, the input/output bridge cache master 408 receives and parses the VDM0 message, converts it, based on address B and the requested command, into a cache command of the root complex internal bus, such as the RdBlkE command, and sends it to the memory cache slave 402, carrying address B and other information.
In step 903, the memory cache slave 402 checks the address and finds that the state of the cache line for address B in a lookup table (TLB) maintained for the memory is empty, i.e., I, indicating that the cache line may be monopolized by another device. It therefore generates an internal bus cache command, such as RspStatus, carrying the corresponding data and state information, and returns it to the input/output bridge cache master 408. If instead the lookup finds that the state of the cache line for address B is already exclusive, indicating that the cache line cannot be monopolized by another device, the state information indicates failure. The RspStatus also carries a tag number indicating which RdBlkE command it responds to.
At step 904, i/o bridge cache master 408 determines from the tag number carried by the RspStatus that it is a response message corresponding to the previous RdBlkE command, encapsulates the corresponding data and status information into a VDM1 message, and sends it onto the PCIe bus for PCIe device snooping. The VDM1 message also contains ID information indicating that this is the successful completion of the read request issued at step 901.
In step 905, when the PCIe device snoops the VDM1 message, it compares the ID information therein, which indicates that this is the successful completion of the read request issued in step 901. The PCIe device then updates the data into the corresponding cache line and changes the state of the cache line from I (invalid) to exclusive (E).
Thus, by providing three hardware components, namely the CPU cache master 401, the memory cache slave 402, and the input/output bridge controller 403 including the input/output bridge cache master 408, cache consistency among the CPU, the memory, and the PCIe device is maintained through the interaction of the various messages when the PCIe device modifies, monopolizes, or shares the data at an address in the memory via a read-write request for that address.
Of course, the above example only illustrates the PCIe device initiating an exclusive read-write request; read-write requests such as modify and share proceed analogously.
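The device-initiated flow of steps 901 through 905 can likewise be modeled. The sketch below is illustrative only, under the same assumptions as before (a per-address owner directory standing in for the memory-side lookup table); it captures the key decision in step 903 that exclusivity is granted only when no other agent already owns the line.

```python
# Sketch of steps 901-905: a PCIe device obtains exclusive ownership of
# address B. Names mirror the text; the data structures are assumptions.

class MemoryDirectory:
    """Stands in for the memory cache slave 402's per-line ownership state."""
    def __init__(self, memory):
        self.memory = memory   # address -> data
        self.owner = {}        # address -> 'CPU' | 'PCIe' | None

    def rdblke_from_device(self, addr):
        """Step 903: grant exclusivity only if the line state is empty (I)."""
        if self.owner.get(addr) is None:   # free: may be monopolized
            self.owner[addr] = 'PCIe'
            return True, self.memory.get(addr)
        return False, None                 # already exclusive elsewhere: fail

class PCIeDeviceCacheController:
    """Stands in for the PCIe device cache controller 409."""
    def __init__(self, directory):
        self.directory = directory
        self.lines = {}        # address -> (state, data)

    def read_exclusive(self, addr):
        """Steps 901, 902 (via the bridge), 904, and 905."""
        state, _ = self.lines.get(addr, ('I', None))
        if state != 'I':
            return self.lines[addr]        # hit: no bus traffic needed
        ok, data = self.directory.rdblke_from_device(addr)
        if ok:
            self.lines[addr] = ('E', data)  # step 905: I -> E
        return self.lines.get(addr, ('I', None))
```

When the line is free, the device ends in state E with the memory's data; when the CPU already owns it, the request fails and the device line stays invalid, matching the failure branch of step 903.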
FIG. 10 shows a flow diagram of a method of implementing cache coherency for a PCIe device in accordance with an embodiment of the invention.
The method 1000 for implementing cache coherency for a PCIe device comprises the following steps. In step 1001, a CPU cache master 401, coupled to a central processing unit CPU through an internal bus, sends a first read-write request of the CPU for a first address to a memory cache slave 402 via a cache command of the internal bus. In step 1002, the memory cache slave 402, coupled to the memory through the internal bus, updates the state of the memory cache line of the first address according to the state of the memory cache line of the first address for the first read-write request, and sends the first address and a command to update the state of the PCIe device cache line to a first state to the input/output bridge controller 403 via a cache command of the internal bus. In step 1003, the input/output bridge controller 403, coupled to the PCIe device through a PCIe bus: sends the first address and the command to update the state of the PCIe device cache line to the first state to the PCIe device through a first PCIe bus message; receives, through a second PCIe bus message, first data from the first address of the PCIe device and a response indicating that the state of the PCIe device cache line has been updated to the first state; and sends the first data and the update-state response to the memory cache slave 402 via a cache command of the internal bus. The memory cache slave 402 uses the first data and the first state in the second PCIe bus message to update the data and state of the memory cache line of the first address, and sends the first data and the update-state response to the CPU cache master 401 via a cache command of the internal bus, whereupon the CPU cache master 401 updates the data and state of the CPU cache line with the first data and the update-state response.
In one embodiment, the CPU cache master 401 receives the first read-write request from the CPU and determines, from the first read-write request and the state of the CPU cache line, whether the first address hits in the CPU cache line; if not, it sends the first read-write request to the memory cache slave 402.
In one embodiment, the first read-write request is a request to modify, monopolize, or share the data at the first address, the state of the memory cache line of the first address for the first read-write request includes having been monopolized by the PCIe device, and the first state is the invalid state. The memory cache slave 402 uses the first data and the first state in the second PCIe bus message to update the data of the memory cache line of the first address and to update its state to modified, monopolized, or shared by the CPU, and the CPU cache master 401 uses the first data and the first state to update the data of the CPU cache line and to update its state to modified, monopolized, or shared by the CPU.
In one embodiment, a second read-write request for a second address from the PCIe device is received by the input/output bridge controller 403 via a first PCIe bus message and forwarded as a cache command over the internal bus to the memory cache slave 402; second data of the second address and the second state are sent by the memory cache slave 402 to the input/output bridge controller 403 via a cache command of the internal bus, according to the second state of the CPU cache line of the second address; and the second data and the second state are sent by the input/output bridge controller 403 to the PCIe device via a second PCIe bus message, such that the PCIe device updates the data and state of the PCIe device cache line at the second address with the second data and the second state.
In one embodiment, in response to requiring the data at the second address, the PCIe device determines whether the second address hits in the PCIe device cache line; if not, it sends the second read-write request to the input/output bridge controller 403 and records the state of the PCIe device cache line at the second address as the invalid state.
In one embodiment, the second read-write request is a request to modify, monopolize, or share the data at the second address, and the second state is the invalid state, wherein the PCIe device uses the second data and the second state to update the data of the PCIe device cache line at the second address and to update its state to modified, monopolized, or shared by the PCIe device.
Thus, by providing three hardware components, namely the CPU cache master 401, the memory cache slave 402, and the input/output bridge controller 403, cache consistency among the CPU, the memory, and the PCIe device is maintained through the interaction of the various messages when the CPU or the PCIe device requests a read or write of an address in the memory to modify, monopolize, or share the data at that address.
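The modify, monopolize, and share request types used throughout map naturally onto MESI-style target states. The helper below sketches one such mapping together with the snoop-side action; this exact table is hypothetical, since the embodiments name the states (M, E, S, I) and the request types but do not publish this encoding.

```python
# Hypothetical mapping of request types to MESI target states, matching the
# modify / monopolize / share vocabulary used in the embodiments above.
REQUEST_TARGET_STATE = {
    'modify':     'M',  # requester intends to write the line
    'monopolize': 'E',  # requester wants exclusive, unmodified ownership
    'share':      'S',  # requester tolerates other readers
}

def snoop_action(current_state: str, request: str) -> tuple:
    """Return (must_writeback, new_state) for an agent whose line is snooped.

    Only a modified (M) line forces a write-back of data, as in steps 804
    and 806; a share request downgrades the snooped copy to S, while modify
    and monopolize requests invalidate it.
    """
    must_writeback = current_state == 'M'
    if request == 'share' and current_state in ('M', 'E', 'S'):
        new_state = 'S'
    else:
        new_state = 'I'
    return must_writeback, new_state
```

For example, snooping a modified line for a monopolize request yields a write-back plus invalidation, which is exactly the transition the CPU-initiated flow of fig. 8 walks through.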
FIG. 11 illustrates a block diagram of an exemplary computer system suitable for use to implement embodiments of the present invention.
The computer system may include a processor (H1); a memory (H2) coupled to the processor (H1) and having stored therein computer-executable instructions for performing, when executed by the processor, the steps of the respective methods of embodiments of the present invention.
The processor (H1) may include, but is not limited to, for example, one or more processors or microprocessors or the like.
The memory (H2) may include, but is not limited to, for example, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, hard disk, floppy disk, solid state disk, removable disk, CD-ROM, DVD-ROM, Blu-ray disk, and the like.
In addition, the computer system may include a data bus (H3), an input/output (I/O) bus (H4), a display (H5), and an input/output device (H6) (e.g., a keyboard, a mouse, a speaker, etc.), among others.
The processor (H1) may communicate with external devices (H5, H6, etc.) via a wired or wireless network (not shown) over an I/O bus (H4).
The memory (H2) may also store at least one computer-executable instruction for performing, when executed by the processor (H1), the functions and/or steps of the methods in the embodiments described in the present technology.
Of course, the above-mentioned embodiments are merely examples and not limitations, and those skilled in the art can combine and combine some steps and apparatuses from the above-mentioned separately described embodiments to achieve the effects of the present invention according to the concepts of the present invention, and such combined and combined embodiments are also included in the present invention, and such combined and combined embodiments are not necessarily described herein.
It is noted that advantages, effects, and the like, which are mentioned in the present disclosure, are only examples and not limitations, and they are not to be considered essential to various embodiments of the present invention. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the invention is not limited to the specific details described above.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," and "having" are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or," unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to."
The flowchart of steps in the present disclosure and the above description of methods are merely illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the order presented. As will be appreciated by those skilled in the art, the order of the steps in the above embodiments may be performed in any order. Words such as "thereafter," "then," "next," etc. are not intended to limit the order of the steps; these words are only used to guide the reader through the description of these methods. Furthermore, any reference to an element in the singular, for example, using the articles "a," "an," or "the" is not to be construed as limiting the element to the singular.
In addition, the steps and devices in the embodiments are not limited to be implemented in a certain embodiment, and in fact, some steps and devices in the embodiments may be combined according to the concept of the present invention to conceive new embodiments, and these new embodiments are also included in the scope of the present invention.
The individual operations of the methods described above may be performed by any suitable means capable of performing the corresponding functions. The means may include various hardware and/or software components and/or modules including, but not limited to, a hardware circuit, an Application Specific Integrated Circuit (ASIC), or a processor.
The various illustrative logical blocks, modules, and circuits described may be implemented or described with a general purpose processor, a Digital Signal Processor (DSP), an ASIC, a field programmable gate array signal (FPGA) or other Programmable Logic Device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any commercially available processor, controller, microcontroller or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, a microprocessor in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in any form of tangible storage medium. Some examples of storage media that may be used include Random Access Memory (RAM), Read Only Memory (ROM), flash memory, EPROM memory, EEPROM memory, registers, hard disk, removable disk, CD-ROM, and the like. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. A software module may be a single instruction, or many instructions, and may be distributed over several different code segments, among different programs, and across multiple storage media.
The methods disclosed herein comprise acts for implementing the described methods. The methods and/or acts may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of actions is specified, the order and/or use of specific actions may be modified without departing from the scope of the claims.
The above-described functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as instructions on a tangible computer-readable medium. A storage medium may be any available tangible medium that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. As used herein, disk and disc include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Accordingly, a computer program product may perform the operations presented herein. For example, such a computer program product may be a computer-readable tangible medium having instructions stored (and/or encoded) thereon that are executable by a processor to perform the operations described herein. The computer program product may include packaged material.
Software or instructions may also be transmitted over a transmission medium. For example, the software may be transmitted from a website, server, or other remote source using a transmission medium such as coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, or microwave.
Further, modules and/or other suitable means for carrying out the methods and techniques described herein may be downloaded and/or otherwise obtained by a user terminal and/or base station as appropriate. For example, such a device may be coupled to a server to facilitate the transfer of means for performing the methods described herein. Alternatively, the various methods described herein can be provided via storage means (e.g., RAM, ROM, a physical storage medium such as a CD or floppy disk) so that the user terminal and/or base station can obtain the various methods when coupled to or providing storage means to the device. Moreover, any other suitable technique for providing the methods and techniques described herein to a device may be utilized.
Other examples and implementations are within the scope and spirit of the disclosure and the following claims. For example, due to the nature of software, the functions described above may be implemented using software executed by a processor, hardware, firmware, hard-wired, or any combination of these. Features implementing functions may also be physically located at various locations, including being distributed such that portions of functions are implemented at different physical locations. Also, as used herein, including in the claims, "or" as used in a list of items beginning with "at least one" indicates a separate list, such that a list of "A, B or at least one of C" means a or B or C, or AB or AC or BC, or ABC (i.e., a and B and C). Furthermore, the word "exemplary" does not mean that the described example is preferred or better than other examples.
Various changes, substitutions and alterations to the techniques described herein may be made without departing from the techniques of the teachings as defined by the appended claims. Moreover, the scope of the claims of the present disclosure is not limited to the particular aspects of the process, machine, manufacture, composition of matter, means, methods and acts described above. Processes, machines, manufacture, compositions of matter, means, methods, or acts, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein may be utilized. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or acts.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the invention. Thus, the present invention is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the invention to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (14)

1. A system for implementing peripheral component interconnect extended PCIe bus device cache coherency, comprising:
the CPU cache master controller is coupled with the CPU through an internal bus and is configured to send a first read-write request of the CPU to a first address to the memory cache slave controller through a cache command of the internal bus through the internal bus;
the memory cache slave controller is coupled with the memory through the internal bus, is configured to update the state of the memory cache line of the first address according to the state of the memory cache line of the first address of the first read-write request, and sends the first address and a command for updating the state of the PCIe device cache line into the first state to the input/output bridge controller through a cache command of the internal bus through the internal bus;
an input/output bridge controller coupled with the PCIe device through a PCIe bus configured to:
sending the first address and a command to update a status of a PCIe device cache line to a first status to the PCIe device via a PCIe bus in a first PCIe bus message,
receiving first data from the first address of the PCIe device and a response of updating the state of a PCIe device cache line into a first state through a second PCIe bus message, and sending a cache command through an internal bus to the memory cache slave through the internal bus,
wherein the memory cache slave is configured to update the data and status of the memory cache line of the first address with the first data and the first status in the second PCIe bus message and send a response of the first data and the updated status to the CPU cache master via a cache command of an internal bus,
wherein the CPU cache master is configured to update the data and state of the CPU cache line with the first data and a response of the updated state.
2. The system of claim 1, wherein,
the CPU cache master controller is configured to receive the first read-write request from a CPU, determine, according to the first read-write request and the state of the CPU cache line, whether the first address hits in the CPU cache line, and send the first read-write request to the memory cache slave controller if the first address does not hit in the CPU cache line.
3. The system of claim 2, wherein the first read write request includes data to modify or monopolize or share the first address, the status of the memory cache line of the first address of the first read write request includes having been monopolized by a PCIe device, the first status is an invalid state, wherein the memory cache slave is configured to update the data of the memory cache line of the first address and the update status as modified, monopolized or shared by a CPU with the first data and the first status in the second PCIe bus message, the CPU cache master is configured to update the data of the CPU cache line and the update status as modified, monopolized or shared by a CPU with the first data and the first status.
4. The system of claim 1, wherein,
the input/output bridge controller is configured to: in response to receiving a second read write request for a second address by a PCIe device over the first PCIe bus message and sending a cache command over an internal bus to the memory cache slave via the internal bus,
the memory cache slave is configured to send second data of the second address and the second state to the input/output bridge controller via the internal bus by a cache command of the internal bus according to the second state of the CPU cache line of the second address,
the input/output bridge controller is configured to send the second data and the second state to the PCIe device via a PCIe bus through a second PCIe bus message, such that the PCIe device updates the data and state of the PCIe device cache line at the second address with the second data and the second state.
5. The system of claim 4, wherein,
the PCIe device is configured to respond to data needing a second address, judge whether the second address hits in a PCIe device cache line, if not, send the second read-write request to the input/output bridge controller, and record the state of the PCIe device cache line of the second address as an invalid state.
6. The system of claim 5, wherein the second read write request is data that modifies, monopolizes, or shares the second address, wherein the second state is invalid, and wherein the PCIe device updates data of a PCIe device cache line of the second address with the second data and the second state and updates the state as modified, monopolized, or shared by the PCIe device.
7. A method for realizing the cache consistency of a peripheral component interconnect extended PCIe bus device comprises the following steps:
a CPU cache master controller coupled with a CPU of a central processing unit through an internal bus sends a first read-write request of the CPU to a first address to a memory cache slave controller coupled with a memory through the internal bus through a cache command of the internal bus;
updating the state of the memory cache line of the first address by the memory cache slave according to the state of the memory cache line of the first address of the first read-write request, and sending the first address and a command for updating the state of the PCIe device cache line into the first state to the input/output bridge controller through a cache command of an internal bus via the internal bus;
by an input/output bridge controller coupled with a PCIe device through a PCIe bus:
sending the first address and a command to update a status of a PCIe device cache line to a first status to the PCIe device via a PCIe bus in a first PCIe bus message,
receiving first data from the first address of the PCIe device and a response of updating the state of a PCIe device cache line into a first state through a second PCIe bus message, and sending a cache command through an internal bus to the memory cache slave through the internal bus,
wherein the memory cache slave updates the data and status of the memory cache line of the first address with the first data and the first status in the second PCIe bus message, and sends a response of the first data and the updated status to the CPU cache master through a cache command of an internal bus,
wherein the CPU cache master updates the data and state of the CPU cache line with the first data and the updated state response.
8. The method of claim 7, wherein,
and the CPU cache master controller receives the first read-write request from a CPU, judges whether the first address hits in the CPU cache line according to the first read-write request and the state of the CPU cache line, and sends the first read-write request to the memory cache slave controller if the first address does not hit in the CPU cache line.
9. The method of claim 8, wherein the first read-write request is to modify, exclusively hold, or share the data of the first address; the state of the memory cache line of the first address indicates that the line is exclusively held by the PCIe device; and the first state is an invalid state; wherein the memory cache slave controller updates the data of the memory cache line of the first address with the first data in the second PCIe bus message and updates its state to modified, exclusive, or shared, and the CPU cache master controller updates the data of the CPU cache line with the first data and updates its state to modified, exclusive, or shared.
10. The method of claim 7, wherein,
receiving, by the input/output bridge controller, a second read-write request for a second address issued by the PCIe device in response to the first PCIe bus message, and sending it, via a cache command of the internal bus, to the memory cache slave controller,
sending, by the memory cache slave controller, second data of the second address and a second state, via a cache command of the internal bus, to the input/output bridge controller, according to the second state of the CPU cache line of the second address,
sending, by the input/output bridge controller, the second data and the second state to the PCIe device in a second PCIe bus message over the PCIe bus, so that the PCIe device updates the data and state of the PCIe device cache line of the second address with the second data and the second state.
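The reverse-direction path of claim 10 (a device-initiated read answered through the bridge according to the CPU cache line's state) can be sketched as follows; class names and the state encoding are assumptions for illustration only:

```python
# Hypothetical sketch of the claim-10 path: a device read travels
# I/O bridge -> memory cache slave and back as (second data, second state).
class MemoryCacheSlave:
    def __init__(self):
        self.memory = {}           # addr -> backing data
        self.cpu_line_state = {}   # addr -> assumed MESI state of the CPU line

    def serve_device_read(self, addr):
        # The second state reported depends on the CPU cache line at addr.
        second_state = self.cpu_line_state.get(addr, "I")
        return self.memory.get(addr, 0), second_state

class IoBridgeController:
    def __init__(self, slave):
        self.slave = slave

    def device_read(self, addr):
        # Forward over the internal bus; the returned tuple models the
        # second PCIe bus message carrying the second data and second state.
        return self.slave.serve_device_read(addr)
```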
11. The method of claim 10, wherein,
the PCIe device, in response to needing data of the second address, determines whether the second address hits in the PCIe device cache line, and, if the second address misses, sends the second read-write request to the input/output bridge controller and records the state of the PCIe device cache line of the second address as an invalid state.
12. The method of claim 11, wherein the second read-write request is to modify, exclusively hold, or share the data of the second address, the second state is an invalid state, and the PCIe device updates the data of the PCIe device cache line of the second address with the second data and updates its state to modified, exclusive, or shared.
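The device-side behavior of claims 11 and 12 (record the line invalid on a miss, fetch, then install the data in the state implied by the request kind) can be sketched in a few lines; all names are hypothetical and the MESI-style letters are an assumed encoding:

```python
# Hypothetical device-side handling for claims 11-12.
NEW_STATE = {"modify": "M", "exclusive": "E", "share": "S"}  # assumed mapping

class PcieDeviceCache:
    def __init__(self, fetch):
        self.fetch = fetch   # callable standing in for the I/O bridge path
        self.lines = {}      # addr -> (data, state)

    def need(self, addr, kind):
        data, state = self.lines.get(addr, (None, "I"))
        if state != "I":
            return data                    # hit in the PCIe device cache line
        self.lines[addr] = (None, "I")     # claim 11: record invalid on miss
        data, _ = self.fetch(addr)         # issue the second read-write request
        self.lines[addr] = (data, NEW_STATE[kind])  # claim 12: install data
        return data
```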
13. A computer storage medium storing computer program instructions, wherein the computer program instructions, when executed by a processor, perform the method of any of claims 7-12.
14. A processing system comprising a processor and a computer storage medium, wherein the computer storage medium stores computer program instructions, wherein the computer program instructions, when executed by the processor, perform the method of any of claims 7-12.
CN202011429683.4A 2020-12-07 2020-12-07 Systems, methods, and media for implementing cache coherency for PCIe devices Active CN112699061B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011429683.4A CN112699061B (en) 2020-12-07 2020-12-07 Systems, methods, and media for implementing cache coherency for PCIe devices

Publications (2)

Publication Number Publication Date
CN112699061A true CN112699061A (en) 2021-04-23
CN112699061B CN112699061B (en) 2022-08-26

Family

ID=75505626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011429683.4A Active CN112699061B (en) 2020-12-07 2020-12-07 Systems, methods, and media for implementing cache coherency for PCIe devices

Country Status (1)

Country Link
CN (1) CN112699061B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5119485A (en) * 1989-05-15 1992-06-02 Motorola, Inc. Method for data bus snooping in a data processing system by selective concurrent read and invalidate cache operation
US5551005A (en) * 1994-02-25 1996-08-27 Intel Corporation Apparatus and method of handling race conditions in mesi-based multiprocessor system with private caches
US5555398A (en) * 1994-04-15 1996-09-10 Intel Corporation Write back cache coherency module for systems with a write through cache supporting bus
US5557769A (en) * 1994-06-17 1996-09-17 Advanced Micro Devices Mechanism and protocol for maintaining cache coherency within an integrated processor
US5715428A (en) * 1994-02-28 1998-02-03 Intel Corporation Apparatus for maintaining multilevel cache hierarchy coherency in a multiprocessor computer system
WO2018082695A1 (en) * 2016-11-07 2018-05-11 华为技术有限公司 Cache replacement method and device
US20200012604A1 (en) * 2019-09-19 2020-01-09 Intel Corporation System, Apparatus And Method For Processing Remote Direct Memory Access Operations With A Device-Attached Memory
WO2020083189A1 (en) * 2018-10-24 2020-04-30 北京金山云网络技术有限公司 Request processing method and device, api gateway, and readable storage medium
CN112035170A (en) * 2020-08-20 2020-12-04 海光信息技术股份有限公司 Method and system for branch predictor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENG, CHEN et al.: "A Survey of Operating System Architecture and Key Technologies for Data Center Computing", High Technology Letters (《高技术通讯》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113419977A (en) * 2021-05-28 2021-09-21 济南浪潮数据技术有限公司 PCIE equipment management system in server and server
CN113934674A (en) * 2021-12-17 2022-01-14 飞腾信息技术有限公司 PCIE (peripheral component interface express) bus-based command transmission method and system on chip
CN113934674B (en) * 2021-12-17 2022-03-01 飞腾信息技术有限公司 PCIE (peripheral component interface express) bus-based command transmission method and system on chip
CN113961481A (en) * 2021-12-23 2022-01-21 苏州浪潮智能科技有限公司 CPU interconnection bus architecture and electronic equipment
CN117076374A (en) * 2023-10-17 2023-11-17 苏州元脑智能科技有限公司 PCIe stream bus conversion method, device, equipment and medium
CN117076374B (en) * 2023-10-17 2024-02-09 苏州元脑智能科技有限公司 PCIe stream bus conversion method, device, equipment and medium

Also Published As

Publication number Publication date
CN112699061B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
CN112699061B (en) Systems, methods, and media for implementing cache coherency for PCIe devices
US10389839B2 (en) Method and apparatus for generating data prefetches specifying various sizes to prefetch data from a remote computing node
US7814279B2 (en) Low-cost cache coherency for accelerators
US9792210B2 (en) Region probe filter for distributed memory system
TWI430099B (en) Pci express enhancements and extensions
TWI391821B (en) Processor unit, data processing system and method for issuing a request on an interconnect fabric without reference to a lower level cache based upon a tagged cache state
CN102279817B (en) For the cache coherency agreement of long-time memory
TWI385514B (en) Method for storing data and a coherency record corresponding to the data, apparatus and system having a snoop filter, and a non-transitory machine-accessible medium having stored thereon instructions
CN112256604B (en) Direct memory access system and method
US20140181394A1 (en) Directory cache supporting non-atomic input/output operations
US7788452B2 (en) Method and apparatus for tracking cached addresses for maintaining cache coherency in a computer system having multiple caches
TW200406676A (en) Computer system with integrated directory and processor cache
JPH07281955A (en) Snoop circuit of multiprocessor system
US6922755B1 (en) Directory tree multinode computer system
US8782349B2 (en) System and method for maintaining cache coherency across a serial interface bus using a snoop request and complete message
WO2010117528A2 (en) Opportunistic improvement of mmio request handling based on target reporting of space requirements
JP2022528630A (en) Systems, methods, and devices for accessing shared memory
KR102581572B1 (en) Hub device and operating method thereof
US6807608B2 (en) Multiprocessor environment supporting variable-sized coherency transactions
CN115858420B (en) System cache architecture and chip for supporting multiprocessor architecture
US20180189181A1 (en) Data read method and apparatus
JP3732397B2 (en) Cash system
US6678800B1 (en) Cache apparatus and control method having writable modified state
US20230236994A1 (en) Systems, methods, and devices for queue management with a coherent interface
EP4124963A1 (en) System, apparatus and methods for handling consistent memory transactions according to a cxl protocol

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant