EP3588310B1 - Technologies for demoting cache lines to shared cache - Google Patents
- Publication number: EP3588310B1 (application EP19177464.5A)
- Authority: EP (European Patent Office)
- Prior art keywords: cache, core, cache line, data, compute device
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F 12/084: Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
- G06F 12/1072: Decentralised address translation, e.g. in distributed shared memory systems
- G06F 12/128: Replacement control using replacement algorithms adapted to multidimensional cache systems, e.g. set-associative, multicache, multiset or multilevel
- G06F 13/28: Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access (DMA), cycle steal
- H04L 49/10: Packet switching elements characterised by the switching fabric construction
- G06F 2212/1024: Indexing scheme providing a specific technical effect: latency reduction
- G06F 2212/154: Indexing scheme for use in a networked computing environment
- G06F 2212/622: State-only directory, i.e. not recording identity of sharing or owning nodes
Definitions
- the data networks typically include one or more network computing devices (e.g., compute servers, storage servers, etc.) to route communications (e.g., via switches, routers, etc.) that enter/exit a network (e.g., north-south network traffic) and between network computing devices in the network (e.g., east-west network traffic).
- Upon receipt of a network packet, the computing device typically performs one or more processing operations (e.g., security, network address translation (NAT), load-balancing, deep packet inspection (DPI), transmission control protocol (TCP) optimization, caching, Internet Protocol (IP) management, etc.) to determine what the computing device is to do with the network packet (e.g., drop the network packet, process/store at least a portion of the network packet, forward the network packet, etc.). To do so, such packet processing is often performed in a packet processing pipeline (e.g., a service function chain) where at least a portion of the data of the network packet is passed from one processor core to another as it is processed. However, during such packet processing, stalls can occur due to cross-core snoops, and cache pollution with stale data can be a problem.
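The cross-core handoff described above can be made concrete with a short sketch. It is a minimal illustration, not the patent's claimed method: _mm_cldemote is the real x86 CLDEMOTE intrinsic (immintrin.h; compile with -mcldemote; it is a hint, treated as a no-op on CPUs without the feature), while the packet structure and the queue stub are hypothetical stand-ins.

```c
#include <immintrin.h>   /* _mm_cldemote(): x86 CLDEMOTE cache line demote hint */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

struct pkt { uint8_t data[1500]; size_t len; };

/* Hypothetical single-producer/single-consumer handoff queue (omitted). */
static void spsc_enqueue(struct pkt *p) { (void)p; }

void stage1_process_and_handoff(struct pkt *p)
{
    /* ... stage 1 work on p->data (e.g., a NAT rewrite) happens here ... */

    /* Demote each cache line the packet occupies toward the shared cache,
     * so the next core in the pipeline reads it without a cross-core snoop. */
    for (size_t off = 0; off < p->len; off += CACHE_LINE)
        _mm_cldemote(p->data + off);

    spsc_enqueue(p);   /* hand the packet descriptor to the next core */
}
```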
- US 2018/0095880 A1 relates to processors and methods for managing cache tiering with gather-scatter vector semantics and discloses a compute device according to the preamble of claim 1.
- the disclosed embodiments may be implemented, in some cases, in hardware, firmware, software, or any combination thereof.
- the disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors.
- a machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).
- Referring now to FIG. 1, in an illustrative embodiment, a system 100 for demoting cache lines to shared cache includes a source compute device 102 communicatively coupled to a network compute device 106 via a network 104. While illustratively shown as having a single source compute device 102 and a single network compute device 106, the system 100 may include multiple source compute devices 102 and multiple network compute devices 106, in other embodiments. It should be appreciated that the source compute device 102 and network compute device 106 have been illustratively designated herein as being one of a "source" and a "destination" for the purposes of providing clarity to the description and that the source compute device 102 and/or the network compute device 106 may be capable of performing any of the functions described herein.
- the source compute device 102 and the network compute device 106 may reside in the same data center or high-performance computing (HPC) environment. In other words, the source compute device 102 and network compute device 106 may reside in the same network 104 connected via one or more interconnects.
- In use, the source compute device 102 and the network compute device 106 transmit and receive network traffic (e.g., network packets, frames, etc.) to/from each other.
- For example, the network compute device 106 may receive a network packet from the source compute device 102.
- Upon receipt of a network packet, the network compute device 106, or more particularly a host fabric interface (HFI) 126 of the network compute device 106, identifies one or more processing operations to be performed on at least a portion of the network packet and performs some level of processing thereon.
- To do so, a processor core 110 requests access to data which may have been previously stored or moved into shared cache memory, typically on-processor or near-processor cache.
- The network compute device 106 is configured to move the requested data to a core-local cache (e.g., the core-local cache 114) for quicker access to the requested data by the requesting processor core 110.
- Oftentimes, more than one processing operation (e.g., security, network address translation (NAT), load-balancing, deep packet inspection (DPI), transmission control protocol (TCP) optimization, caching, Internet Protocol (IP) management, etc.) is performed by the network compute device, with each operation typically performed by a different processor core in a packet processing pipeline, such as a service function chain. Accordingly, the data accessed by one processor core needs to be released (e.g., demoted to the shared cache 116) upon processing completion in order for the next processor core to perform its designated processing operation.
- To do so, as will be described in further detail below, the network compute device 106 is configured to either transmit instructions to a cache manager to demote cache line(s) from the core-local cache 114 to the shared cache 116 or transmit a command to an offload device (see, e.g., the cache line demote device 130) to trigger a cache line demotion operation to be performed by the offload device to demote cache line(s) from the core-local cache 114 to the shared cache 116, based on a size of the network packet.
- each processor core demotes the applicable packet cache lines to the shared cache 116 once processing has been completed, which allows better cache reuse on a first processing core and saves cross-core snoops on a second processing core in the packet processing pipeline (e.g., modifying data) or input/output (I/O) pipeline. Accordingly, unlike present technologies, stalls due to cross-core snoops and cache pollution can be effectively avoided. Additionally, also unlike present technologies, the cost attributable to an ownership request when the requested data is not in the shared cache or otherwise inaccessible by the requesting processor core can be avoided.
- the network compute device 106 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a computer, a server (e.g., stand-alone, rack-mounted, blade, etc.), a sled (e.g., a compute sled, an accelerator sled, a storage sled, a memory sled, etc.), an enhanced or smart network interface controller (NIC)/HFI, a network appliance (e.g., physical or virtual), a router, switch (e.g., a disaggregated switch, a rack-mounted switch, a standalone switch, a fully managed switch, a partially managed switch, a full-duplex switch, and/or a half-duplex communication mode enabled switch), a web appliance, a distributed computing system, a processor-based system, and/or a multiprocessor system.
- As shown in FIG. 1, the illustrative network compute device 106 includes one or more processors 108, memory 118, an I/O subsystem 120, one or more data storage devices 122, communication circuitry 124, a cache line demote device 130, and, in some embodiments, one or more peripheral devices 128. It should be appreciated that the network compute device 106 may include other or additional components, such as those commonly found in a typical computing device (e.g., various input/output devices and/or other components), in other embodiments. Additionally, in some embodiments, one or more of the illustrative components may be incorporated in, or otherwise form a portion of, another component.
- the processor(s) 108 may be embodied as any type of device or collection of devices capable of performing the various compute functions as described herein.
- the processor(s) 108 may be embodied as one or more multi-core processors, digital signal processors (DSPs), microcontrollers, or other processor(s) or processing/controlling circuit(s).
- In some embodiments, the processor(s) 108 may be embodied as, include, or otherwise be coupled to an integrated circuit, an embedded system, a field-programmable gate array (FPGA) (e.g., reconfigurable circuitry), a system-on-a-chip (SOC), an application specific integrated circuit (ASIC), reconfigurable hardware or hardware circuitry, or other specialized hardware to facilitate performance of the functions described herein.
- the illustrative processor(s) 108 includes multiple processor cores 110 (e.g., two processor cores, four processor cores, eight processor cores, sixteen processor cores, etc.) and a cache memory 112.
- Each of the processor cores 110 may be embodied as an independent logical execution unit capable of executing programmed instructions.
- It should be appreciated that, in some embodiments, the network compute device 106 (e.g., in supercomputer embodiments) may include thousands of processor cores.
- Each of the processor(s) 108 may be connected to a physical connector, or socket, on a motherboard (not shown) of the network compute device 106 that is configured to accept a single physical processor package (i.e., a multi-core physical integrated circuit).
- each of the processor cores 110 is communicatively coupled to at least a portion of the cache memory 112 and functional units usable to independently execute programs, operations, threads, etc. It should be appreciated that the processor(s) 108 as described herein are not limited to being on the same die, or socket.
- The cache memory 112 may be embodied as any type of cache that the processor 108 can access more quickly than the memory 118 (i.e., main memory), such as an on-die cache or on-processor cache. In other embodiments, the cache memory 112 may be an off-die cache, but reside on the same system-on-a-chip (SoC) as the processor 108.
- the illustrative cache memory 112 includes a multi-level cache architecture embodied as a core-local cache 114 and a shared cache 116.
- the core-local cache 114 may be embodied as a cache memory dedicated to a particular one of the processor cores 110. Accordingly, while illustratively shown as a single core-local cache 114, it should be appreciated that there may be at least one core-local cache 114 for each processor core 110, in some embodiments.
- the shared cache 116 may be embodied as a cache memory, typically larger than the core-local cache 114 and shared by all of the processor cores 110 of a processor 108.
- For example, in an illustrative embodiment, the core-local cache 114 may be embodied as a level 1 (L1) cache and a level 2 (L2) cache, while the shared cache 116 may be embodied as a level 3 (L3) cache.
- In such embodiments, it should be appreciated that the L1 cache may be embodied as any memory type local to a processor core 110, commonly referred to as a "primary cache" that is the fastest memory closest to the processor 108.
- It should be further appreciated that, in such embodiments, the L2 cache may be embodied as any type of memory local to a processor core 110, commonly referred to as a "mid-level cache" that is capable of feeding the L1 cache, having larger, slower memory than the L1 cache, but typically smaller, faster memory than the L3/shared cache 116 (i.e., last-level cache (LLC)).
- In other embodiments, the multi-level cache architecture may include additional and/or alternative levels of cache memory.
- While not illustratively shown in FIG. 1, it should be further appreciated that the cache memory 112 includes a memory controller (see, e.g., the cache manager 214 of FIG. 2), which may be embodied as a controller circuit or other logic that serves as an interface between the processor 108 and the memory 118.
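The core-local versus shared split described above can be observed from software. The sketch below walks the standard Linux sysfs cache attributes for CPU 0; on a typical part, levels 1 and 2 report only one core's hardware threads in shared_cpu_list while level 3 reports all cores (the paths are standard sysfs, though not every platform exposes them).

```c
#include <stdio.h>

int main(void)
{
    char path[128], level[8], shared[256];

    for (int i = 0; i < 16; i++) {   /* cache indices: L1d, L1i, L2, L3, ... */
        FILE *f;

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/level", i);
        if ((f = fopen(path, "r")) == NULL)
            break;                   /* no more cache indices */
        if (fscanf(f, "%7s", level) != 1)
            level[0] = '\0';
        fclose(f);

        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu0/cache/index%d/shared_cpu_list", i);
        if ((f = fopen(path, "r")) == NULL)
            break;
        if (fscanf(f, "%255s", shared) != 1)
            shared[0] = '\0';
        fclose(f);

        printf("L%s shared by CPUs %s\n", level, shared);
    }
    return 0;
}
```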
- the memory 118 may be embodied as any type of volatile or non-volatile memory or data storage capable of performing the functions described herein.
- the memory 118 may store various data and software used during operation of the network compute device 106, such as operating systems, applications, programs, libraries, and drivers. It should be appreciated that the memory 118 may be referred to as main memory (i.e., a primary memory).
- Volatile memory may be a storage medium that requires power to maintain the state of data stored by the medium.
- volatile memory may include various types of random access memory (RAM), such as dynamic random access memory (DRAM) or static random access memory (SRAM).
- Each of the processor(s) 108 and the memory 118 are communicatively coupled to other components of the network compute device 106 via the I/O subsystem 120, which may be embodied as circuitry and/or components to facilitate input/output operations with the processor(s) 108, the memory 118, and other components of the network compute device 106.
- the I/O subsystem 120 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, integrated sensor hubs, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.), and/or other components and subsystems to facilitate the input/output operations.
- the I/O subsystem 120 may form a portion of a SoC and be incorporated, along with one or more of the processors 108, the memory 118, and other components of the network compute device 106, on a single integrated circuit chip.
- the one or more data storage devices 122 may be embodied as any type of storage device(s) configured for short-term or long-term storage of data, such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, or other data storage devices.
- Each data storage device 122 may include a system partition that stores data and firmware code for the data storage device 122.
- Each data storage device 122 may also include an operating system partition that stores data files and executables for an operating system.
- the communication circuitry 124 may be embodied as any communication circuit, device, or collection thereof, capable of enabling communications between the network compute device 106 and other computing devices, such as the source compute device 102, as well as any network communication enabling devices, such as an access point, switch, router, etc., to allow communication over the network 104. Accordingly, the communication circuitry 124 may be configured to use any one or more communication technologies (e.g., wireless or wired communication technologies) and associated protocols (e.g., Ethernet, Bluetooth®, Wi-Fi®, WiMAX, LTE, 5G, etc.) to effect such communication.
- the communication circuitry 124 may include specialized circuitry, hardware, or combination thereof to perform pipeline logic (e.g., hardware algorithms) for performing the functions described herein, including processing network packets (e.g., parse received network packets, determine destination computing devices for each received network packets, forward the network packets to a particular buffer queue of a respective host buffer of the network compute device 106, etc.), performing computational functions, etc.
- performance of one or more of the functions of communication circuitry 124 as described herein may be performed by specialized circuitry, hardware, or combination thereof of the communication circuitry 124, which may be embodied as a SoC or otherwise form a portion of a SoC of the network compute device 106 (e.g., incorporated on a single integrated circuit chip along with a processor 108, the memory 118, and/or other components of the network compute device 106).
- the specialized circuitry, hardware, or combination thereof may be embodied as one or more discrete processing units of the network compute device 106, each of which may be capable of performing one or more of the functions described herein.
- the illustrative communication circuitry 124 includes the HFI 126, which may be embodied as one or more add-in-boards, daughtercards, network interface cards, controller chips, chipsets, or other devices that may be used by the network compute device 106 to connect with another compute device (e.g., the source compute device 102).
- the HFI 126 may be embodied as part of a SoC that includes one or more processors, or included on a multichip package that also contains one or more processors.
- the HFI 126 may include a local processor (not shown) and/or a local memory (not shown) that are both local to the HFI 126.
- the local processor of the HFI 126 may be capable of performing one or more of the functions of a processor 108 described herein. Additionally or alternatively, in such embodiments, the local memory of the HFI 126 may be integrated into one or more components of the network compute device 106 at the board level, socket level, chip level, and/or other levels.
- the one or more peripheral devices 128 may include any type of device that is usable to input information into the network compute device 106 and/or receive information from the network compute device 106.
- the peripheral devices 128 may be embodied as any auxiliary device usable to input information into the network compute device 106, such as a keyboard, a mouse, a microphone, a barcode reader, an image scanner, etc., or output information from the network compute device 106, such as a display, a speaker, graphics circuitry, a printer, a projector, etc.
- one or more of the peripheral devices 128 may function as both an input device and an output device (e.g., a touchscreen display, a digitizer on top of a display screen, etc.).
- The types of peripheral devices 128 connected to the network compute device 106 may depend on, for example, the type and/or intended use of the network compute device 106. Additionally or alternatively, in some embodiments, the peripheral devices 128 may include one or more ports, such as a USB port, for example, for connecting external peripheral devices to the network compute device 106.
- the cache line demote device 130 may be embodied as any type of firmware, software, and/or hardware device that is usable to initiate a cache line demotion from core-local cache 114 to shared cache 116.
- the cache line demote device 130 may be embodied as, but is not limited to, a copy engine, a direct memory access (DMA) device usable to copy data, an offload read-capable device, etc. It should be appreciated that the cache line demote device 130 may be any type of device that is capable of reading or pretending to read data, so long as when the device interacts with the data or otherwise requests access to the data, the cache lines associated with that data will get demoted to shared cache 116 as a side effect.
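One way such a device can be driven is sketched below: a copy/DMA engine is asked to read the packet's bytes into a scratch buffer that is never consumed, purely for the demotion side effect the passage above describes. The dma_desc layout and dma_submit entry point are invented for illustration, and whether a given engine's reads actually demote lines to the shared cache is platform-dependent.

```c
#include <stdint.h>
#include <stddef.h>

struct dma_desc {
    uint64_t src;   /* address of the cache lines to "read" */
    uint64_t dst;   /* scratch destination; the copied bytes are unused */
    uint32_t len;
};

/* Stand-in for the engine's submission path (e.g., a doorbell write). */
static void dma_submit(const struct dma_desc *d) { (void)d; }

static uint8_t scratch[2048];   /* throwaway destination buffer */

/* Ask the engine to "read" the packet so its lines land in shared cache;
 * the core spends no cycles demoting and continues immediately. */
void demote_via_device(const void *pkt, uint32_t pkt_len)
{
    struct dma_desc d = {
        .src = (uint64_t)(uintptr_t)pkt,
        .dst = (uint64_t)(uintptr_t)scratch,
        .len = pkt_len,
    };
    dma_submit(&d);
}
```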
- the source compute device 102 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a smartphone, a mobile computing device, a tablet computer, a laptop computer, a notebook computer, a computer, a server (e.g., stand-alone, rack-mounted, blade, etc.), a sled (e.g., a compute sled, an accelerator sled, a storage sled, a memory sled, etc.), a network appliance (e.g., physical or virtual), a web appliance, a distributed computing system, a processor-based system, and/or a multiprocessor system.
- source compute device 102 includes similar and/or like components to those of the illustrative network compute device 106. As such, figures and descriptions of the like components are not repeated herein for clarity of the description with the understanding that the description of the corresponding components provided above in regard to the network compute device 106 applies equally to the corresponding components of the source compute device 102.
- the computing devices may include additional and/or alternative components, depending on the embodiment.
- the network 104 may be embodied as any type of wired or wireless communication network, including but not limited to a wireless local area network (WLAN), a wireless personal area network (WPAN), an edge network (e.g., a multi-access edge computing (MEC) network), a fog network, a cellular network (e.g., Global System for Mobile Communications (GSM), Long-Term Evolution (LTE), 5G, etc.), a telephony network, a digital subscriber line (DSL) network, a cable network, a local area network (LAN), a wide area network (WAN), a global network (e.g., the Internet), or any combination thereof.
- the network 104 may serve as a centralized network and, in some embodiments, may be communicatively coupled to another network (e.g., the Internet). Accordingly, the network 104 may include a variety of other virtual and/or physical network computing devices (e.g., routers, switches, network hubs, servers, storage devices, compute devices, etc.), as needed to facilitate communication between the network compute device 106 and the source compute device 102, which are not shown to preserve clarity of the description.
- the network compute device 106 establishes an environment 200 during operation.
- the illustrative environment 200 includes the processor(s) 108, the HFI 126, and the cache line demote device 130 of FIG. 1 , as well as a cache manager 214 and a demotion manager 220.
- The illustrative HFI 126 includes a network traffic ingress/egress manager 208, the illustrative cache line demote device 130 includes an interface manager 210, and the illustrative processor(s) 108 include a packet process operation manager 212.
- the various components of the environment 200 may be embodied as hardware, firmware, software, or a combination thereof.
- one or more of the components of the environment 200 may be embodied as circuitry or collection of electrical devices (e.g., network traffic ingress/egress management circuitry 208, demote device interface management circuitry 210, packet process operation management circuitry 212, cache management circuitry 214, demotion management circuitry 220, etc.).
- the network traffic ingress/egress management circuitry 208, the demote device interface management circuitry 210, the packet process operation management circuitry 212, the cache management circuitry 214, and the demotion management circuitry 220 form a portion of a particular component of the network compute device 106.
- The functions of the network traffic ingress/egress management circuitry 208 and of the demote device interface management circuitry 210 may each be performed, at least in part, by one or more other components of the network compute device 106.
- one or more of the illustrative components may form a portion of another component and/or one or more of the illustrative components may be independent of one another.
- one or more of the components of the environment 200 may be embodied as virtualized hardware components or emulated architecture, which may be established and maintained by the HFI 126, the processor(s) 108, or other components of the network compute device 106.
- the network compute device 106 may include other components, sub-components, modules, sub-modules, logic, sub-logic, and/or devices commonly found in a computing device, which are not illustrated in FIG. 2 for clarity of the description.
- the network compute device 106 additionally includes cache line address data 202, demotion data 204, and network packet data 206, each of which may be accessed by the various components and/or sub-components of the network compute device 106. Additionally, it should be appreciated that in some embodiments the data stored in, or otherwise represented by, each of the cache line address data 202, the demotion data 204, and the network packet data 206 may not be mutually exclusive relative to each other.
- data stored in the cache line address data 202 may also be stored as a portion of one or more of the demotion data 204 and/or the network packet data 206, or in another alternative arrangement.
- the various data utilized by the network compute device 106 is described herein as particular discrete data, such data may be combined, aggregated, and/or otherwise form portions of a single or multiple data sets, including duplicative copies, in other embodiments.
- the network traffic ingress/egress manager 208, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to receive inbound and route/transmit outbound network traffic. To do so, the illustrative network traffic ingress/egress manager 208 is configured to facilitate inbound network communications (e.g., network traffic, network packets, network flows, etc.) to the network compute device 106 (e.g., from the source compute device 102).
- the network traffic ingress/egress manager 208 is configured to manage (e.g., create, modify, delete, etc.) connections to physical and virtual network ports (i.e., virtual network interfaces) of the network compute device 106 (e.g., via the communication circuitry 124), as well as the ingress buffers/queues associated therewith.
- the network traffic ingress/egress manager 208 is configured to facilitate outbound network communications (e.g., network traffic, network packet streams, network flows, etc.) from the network compute device 106. To do so, the network traffic ingress/egress manager 208 is configured to manage (e.g., create, modify, delete, etc.) connections to physical and virtual network ports/interfaces of the network compute device 106 (e.g., via the communication circuitry 124), as well as the egress buffers/queues associated therewith.
- At least a portion of the network packet (e.g., at least a portion of a header of the network packet, at least a portion of a payload of the network packet, a checksum, etc.) may be stored in the network packet data 206.
- the demote device interface manager 210, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to manage the interface of the cache line demote device 130.
- the demote device interface manager 210 is configured to receive cache line demote commands from the processor(s) 108 that are usable to identify which cache line(s) are to be demoted from core-local cache 114 to shared cache 116.
- the demote device interface manager 210 is configured to perform some operation (e.g., a read request) in response to having received a cache line demote command to demote one or more cache lines from core-local cache 114 to shared cache 116.
- the cache line demote command includes an identifier of each cache line that is to be demoted from core-local cache 114 to shared cache 116 and each identifier is usable by the cache line demote device 130 to demote (e.g., copy, evict, etc.) the applicable cache line(s).
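A possible shape for such a command is sketched below. The layout is hypothetical; the text only requires that the command carry an identifier, such as a cache-line-aligned address, for each line (or a range of lines) to demote.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64
#define MAX_LINES  32

/* Hypothetical cache line demote command: one identifier per line. */
struct cldemote_cmd {
    uint32_t n_lines;
    uint64_t line_addr[MAX_LINES];   /* cache-line-aligned identifiers */
};

/* Populate a command covering every cache line of [buf, buf + len). */
static size_t build_demote_cmd(struct cldemote_cmd *cmd,
                               uintptr_t buf, size_t len)
{
    uintptr_t first = buf & ~(uintptr_t)(CACHE_LINE - 1);
    size_t n = 0;

    for (uintptr_t a = first; a < buf + len && n < MAX_LINES; a += CACHE_LINE)
        cmd->line_addr[n++] = (uint64_t)a;
    cmd->n_lines = (uint32_t)n;
    return n;
}
```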
- the packet process operation manager 212, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to identify which packet processing operations are to be performed on at least a portion of the data of a received network packet (e.g., a header field of the network packet, a portion of the payload of the network packet, etc.) and the associated processor core 110 that each packet processing operation is to be performed thereby. Additionally, in some embodiments, the packet process operation manager 212 may be configured to identify when each packet processing operation has completed and provide an indication of completion (e.g., to the demotion manager 220). It should be appreciated that, while described herein as being performed by an associated processor core 110, one or more of the packet processing operations may be performed by any type of compute device/logic (e.g., an accelerator device/logic) that may need to access the cache memory 112.
- the cache manager 214, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to manage the cache memory 112 (e.g., the core-local cache 114 and the shared cache 116). To do so, the cache manager 214 is configured to manage the addition and eviction of entries into and out of the cache memory 112. Accordingly, the cache manager 214, which may be embodied as or otherwise include a memory management unit, is further configured to record results of virtual address to physical address translations. In such embodiments, the translations may be stored in the cache line address data 202.
- the cache manager 214 is additionally configured to facilitate the fetching of data from main memory and the storage of cached data to main memory, as well as the demotion of data from the applicable core-local cache 114 to the shared cache 116 and the promotion of data from the shared cache 116 to the applicable core-local cache 114.
- the demotion manager 220, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to manage the demotion of data from the core-local cache 114 to the shared cache 116. To do so, the demotion manager 220 is configured to either transmit instructions to a cache memory manager (e.g., the cache manager 214) to demote (e.g., copy, evict, etc.) the processed data from the core-local cache 114 to the shared cache 116, or transmit a command to the cache line demote device 130 to demote the processed data from the core-local cache 114 to the shared cache 116. To determine whether to send the cache line demotion instruction to the cache manager 214 or the cache line demotion command to the cache line demote device 130, the demotion manager 220 is further configured to compare a size of a network packet to a predetermined packet size threshold.
- If the demotion manager 220 determines the network packet size is less than or equal to the packet size threshold, the demotion manager 220 is configured to transmit the cache line demotion instruction to the cache manager 214. Otherwise, if the demotion manager 220 determines the network packet size is greater than the packet size threshold, the demotion manager 220 is configured to transmit the cache line demotion command to the cache line demote device 130. Additionally, the demotion manager 220 is configured to include an identifier of each cache line, or a range of cache lines, to be demoted from the core-local cache 114 to the shared cache 116 in the cache line demotion instructions/commands.
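The size-based dispatch can be sketched as follows. The threshold value and helper names are hypothetical, inline CLDEMOTE hints stand in for the cache line demotion instruction path, and the branch direction follows the rationale given with FIG. 5 (per-line demote instructions are not efficient for larger packets, which are therefore offloaded).

```c
#include <immintrin.h>   /* _mm_cldemote; compile with -mcldemote */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE      64
#define PKT_SIZE_THRESH 256   /* hypothetical tuning parameter */

/* Stand-in for building and submitting a command to the cache line
 * demote device 130 (see the device sketches above). */
static void demote_cmd_to_device(const uint8_t *buf, size_t len)
{
    (void)buf; (void)len;
}

void release_packet_lines(const uint8_t *buf, size_t len)
{
    if (len <= PKT_SIZE_THRESH) {
        /* Small packet: a few inline demote hints are cheapest. */
        for (size_t off = 0; off < len; off += CACHE_LINE)
            _mm_cldemote(buf + off);
    } else {
        /* Large packet: offload the demotion to the demote device. */
        demote_cmd_to_device(buf, len);
    }
}
```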
- the demotion manager 220 may be configured as an offload device; however, in some embodiments, the functions described herein may be performed by, or the demotion manager 220 may otherwise form a portion of, the processor 108 or the processor cores 110. It should be appreciated that under such conditions in which the next cache location is known ahead of time, the demotion manager 220 may be configured to move the data to known core-local cache entries of the core-local cache associated with the next processor core in the packet processing pipeline.
- a method 300 for demoting cache lines to shared cache is shown which may be executed by a compute device (e.g., the network compute device 106 of FIGS. 1 and 2 ).
- the method 300 begins with block 302, in which the network compute device 106 determines whether to process a network packet (e.g., a processor 108 has polled the HFI 126 for the next packet to process). If so, the method 300 advances to block 304, in which the network compute device 106 identifies one or more packet processing operations to be performed on at least a portion of a network packet by a processor core 110.
- the network compute device 106 performs the identified packet processing operation(s) on the applicable portion of the network packet to be processed. It should be appreciated that, while described herein as being performed by a requesting processor core 110, one or more of the packet processing operations may be performed by any type of compute device/logic (e.g., an accelerator device/logic) that may need to access the cache memory 112.
- the network compute device 106 determines whether the requesting processor core 110, or applicable compute device/logic, has completed the identified packet processing operation(s), such as may be indicated by the requesting processor core 110. If so, the method 300 advances to block 310, in which the network compute device 106 determines which one or more cache lines in core-local cache 114 are associated with the processed network packet. Additionally, in block 312, the network compute device 106 identifies a size of the network packet. In block 314, the network compute device 106 compares the identified network packet size to a packet size threshold. In block 316, the network compute device 106 determines whether the identified network packet size is greater than the packet size threshold.
- If not, the method 300 branches to block 318, in which the network compute device 106 transmits a cache line demotion instruction to the cache manager 214 to demote the one or more cache lines associated with the processed network packet from the core-local cache 114 to the shared cache 116. Additionally, in block 320, the network compute device 106 includes a cache line identifier of each determined cache line in the core-local cache 114 in the cache line demotion instruction.
- Otherwise, if the identified network packet size is greater than the packet size threshold, the method 300 branches to block 322, in which the network compute device 106 transmits a cache line demotion command to the cache line demote device 130 to trigger a cache line demotion operation to demote one or more cache lines associated with the processed network packet from the core-local cache 114 to the shared cache 116. Additionally, in block 324, the network compute device 106 includes one or more cache line identifiers corresponding to the one or more cache lines to be demoted in the cache line demotion command.
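Blocks 302 through 324 condense into the control-flow sketch below. Every helper is a hypothetical stand-in for the step named in the comment, and the threshold direction again follows the FIG. 5 rationale.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PKT_SIZE_THRESH 256   /* hypothetical packet size threshold */

struct pkt { uint8_t *data; size_t len; };

/* Hypothetical stand-ins for the steps of method 300. */
static struct pkt *poll_next_packet(void)          { return NULL; }
static void perform_processing_ops(struct pkt *p)  { (void)p; }
static bool processing_complete(struct pkt *p)     { (void)p; return true; }
static void send_demote_instruction(struct pkt *p) { (void)p; }
static void send_demote_command(struct pkt *p)     { (void)p; }

void method_300(void)
{
    struct pkt *p;

    while ((p = poll_next_packet()) != NULL) {     /* block 302 */
        perform_processing_ops(p);                 /* blocks 304-306 */
        while (!processing_complete(p))            /* block 308 */
            ;                                      /* wait for the core */
        /* blocks 310-316: find the packet's cache lines, compare its size */
        if (p->len <= PKT_SIZE_THRESH)
            send_demote_instruction(p);            /* blocks 318-320 */
        else
            send_demote_command(p);                /* blocks 322-324 */
    }
}
```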
- the network compute device 106 establishes an illustrative environment 400 for demoting cache lines to shared cache 116 via cache line demote instructions and an illustrative environment 500 for demoting cache lines to shared cache 116 via cache line demote commands to a cache line demote device 130.
- the illustrative environment 400 includes the HFI 126, a processor core 110, the core-local cache 114, the shared cache 116, and the demote device 130 of FIG. 1 , as well as the cache manager 214 of FIG. 2 .
- Each of the illustrative core-local cache 114 and the shared cache 116 includes multiple cache entries.
- the core-local cache 114 includes multiple core-local cache entries 404.
- the illustrative core-local cache entries 404 include a first core-local cache entry designated as core-local cache entry (1) 404a, a second core-local cache entry designated as core-local cache entry (2) 404b, a third core-local cache entry designated as core-local cache entry (3) 404c, a fourth core-local cache entry designated as core-local cache entry (4) 404d, and a fifth core-local cache entry designated as core-local cache entry (N) 404e (i.e., the "Nth" core-local cache entry 404, wherein "N" is a positive integer and designates one or more additional core-local cache entries 404).
- the illustrative shared cache 116 includes multiple shared cache entries 406.
- the illustrative shared cache entries 406 include a first shared cache entry designated as shared cache entry (1) 406a, a second shared cache entry designated as shared cache entry (2) 406b, a third shared cache entry designated as shared cache entry (3) 406c, a fourth shared cache entry designated as shared cache entry (4) 406d, and a fifth shared cache entry designated as shared cache entry (N) 406e (i.e., the "Nth" shared cache entry 406, wherein "N" is a positive integer and designates one or more additional shared cache entries 406).
- the illustrative environment 500 includes the HFI 126, the processor core 110, the core-local cache 114, the shared cache 116, and the demote device 130 of FIG. 1 , as well as the cache manager 214 of FIG. 2 .
- the processor core 110 is configured to poll an available network packet for processing from the HFI 126 (e.g., via an HFI/host interface (not shown)) and perform some level of processing operation on at least a portion of the data of the network packet.
- the processor core 110 is further configured to provide some indication that one or more cache lines are to be demoted from the core-local cache 114 to the shared cache 116.
- the indication provided by the processor core 110 is in the form of one or more cache line demote instructions. It should be appreciated that each cache line demote instruction is usable to identify a cache line from the core-local cache 114 and demote the data to the shared cache 116. As such, it should be appreciated that such instructions may not be as efficient for larger packets. Accordingly, the processor core 110 is configured to, for larger blocks of data, utilize the cache line demote device 130 to offload the demote operation. To do so, referring again to FIG. 5, the processor core 110 is configured to transmit a cache line demotion command 502 to the cache line demote device 130 to trigger a cache line demotion operation to be performed by the cache line demote device 130, such as may be performed via a data read request, a DMA request, etc., or any other type of request that will result in the data being demoted to shared cache 116 as a side effect without wasting processor core cycles.
- The data in core-local cache line (1) 404a, core-local cache line (2) 404b, and core-local cache line (3) 404c is associated with the processed network packet, as indicated by the highlighted outline surrounding each of those core-local cache lines 404.
- the cache line demotion operation results in that data being demoted such that the data in core-local cache line (1) 404a is demoted to shared cache line (1) 406a, the data in core-local cache line (2) 404b is demoted to shared cache line (2) 406b, and the data in core-local cache line (3) 404c is demoted to shared cache line (3) 406c; however, it should be appreciated that, as a result of the cache line demotion operation, the demoted cache lines may be moved to any available shared cache lines 406.
Description
- Modern computing devices have become ubiquitous tools for personal, business, and social uses. As such, many modern computing devices are capable of connecting to various data networks, including the Internet, to transmit and receive data communications over the various data networks at varying rates of speed. To facilitate communications between computing devices, the data networks typically include one or more network computing devices (e.g., compute servers, storage servers, etc.) to route communications (e.g., via switches, routers, etc.) that enter/exit a network (e.g., north-south network traffic) and between network computing devices in the network (e.g., east-west network traffic). In present packet-switched network architectures, data is transmitted in the form of network packets between networked computing devices. At a high level, data is packetized into a network packet at one computing device and the resulting packet transmitted, via a transmission device (e.g., a network interface controller (NIC) of the computing device), to another computing device over a network.
- The concepts described herein are illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. Where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements.
- FIG. 1 is a simplified block diagram of at least one embodiment of a system for demoting cache lines to shared cache that includes a source compute device and a network compute device communicatively coupled via a network;
- FIG. 2 is a simplified block diagram of at least one embodiment of an environment of the network compute device of the system of FIG. 1;
- FIG. 3 is a simplified flow diagram of at least one embodiment of a method for demoting cache lines to shared cache that may be executed by the network compute device of FIGS. 1 and 2; and
- FIGS. 4 and 5 are simplified block diagrams of at least one embodiment of another environment of the network compute device of FIGS. 1 and 2 for demoting cache lines to shared cache.
- In the drawings, some structural or method features may be shown in specific arrangements and/or orderings. However, it should be appreciated that such specific arrangements and/or orderings may not be required. Rather, in some embodiments, such features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of a structural or method feature in a particular figure is not meant to imply that such feature is required in all embodiments and, in some embodiments, may not be included or may be combined with other features.
- Referring now to
FIG. 1 , in an illustrative embodiment, a system 100 for demoting cache lines to shared cache includes asource compute device 102 communicatively coupled to anetwork compute device 106 via anetwork 104. While illustratively shown as having a singlesource compute device 102 and a singlenetwork compute device 106, the system 100 may include multiplesource compute devices 102 and multiplenetwork compute devices 106, in other embodiments. It should be appreciated that thesource compute device 102 andnetwork compute device 106 have been illustratively designated herein as being one of a "source" and a "destination" for the purposes of providing clarity to the description and that thesource compute device 102 and/or thenetwork compute device 106 may be capable of performing any of the functions described herein. It should be further appreciated that thesource compute device 102 and thenetwork compute device 106 may reside in the same data center or high-performance computing (HPC) environment. In other words, thesource compute device 102 andnetwork compute device 106 may reside in thesame network 104 connected via one or more interconnects. - In use, the
source compute device 102 and thenetwork compute device 106 transmit and receive network traffic (e.g., network packets, frames, etc.) to/from each other. For example, thenetwork compute device 106 may receive a network packet from thesource compute device 102. Upon receipt of a network packet, thenetwork compute device 106, or more particularly a host fabric interface (HFI) 126 of thenetwork compute device 106, identifies one or more processing operations to be performed on at least a portion of the network packet and performs some level of processing thereon. To do so, aprocessor core 112 requests access to data which may have been previously stored or moved into shared cache memory, typically on-processor or near-processor cache. Thenetwork compute device 106 is configured to move the requested data to a core-local cache (e.g., the core-local cache 114) for quicker access to the requested data by the requestingprocessor core 112. - Oftentimes, more than one processing operation (e.g., security, network address translation (NAT), load-balancing, deep packet inspection (DPI), transmission control protocol (TCP) optimization, caching, Internet Protocol (IP) management, etc.) is performed by the network compute device, with each operation typically performed by a different processor core in a packet processing pipeline, such as a service function chain. Accordingly, the data accessed by one processor core needs to be released (e.g., demoted to the shared cache 116) upon processing completion in order for the next processor core to perform its designated processing operation.
- To do so, as will be described in further detail below, the
network compute device 106 is configured to either transmit instructions to a cache manager to demote cache line(s) from the core-local cache 114 to the sharedcache 116 or transmit a command to an offload device (see, e.g., the cache line offload device 130) to trigger a cache line demotion operation to be performed by the offload device to demote cache line(s) from the core-local cache 114 to the sharedcache 116, based on a size of the network packet. In other words, each processor core demotes the applicable packet cache lines to the sharedcache 116 once processing has been completed, which allows better cache reuse on a first processing core and saves cross-core snoops on a second processing core in the packet processing pipeline (e.g., modifying data) or input/output (I/O) pipeline. Accordingly, unlike present technologies, stalls due to cross-core snoops and cache pollution can be effectively avoided. Additionally, also unlike present technologies, the cost attributable to an ownership request when the requested data is not in the shared cache or otherwise inaccessible by the requesting processor core can be avoided. - The
network compute device 106 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a computer, a server (e.g., stand-alone, rack-mounted, blade, etc.), a sled (e.g., a compute sled, an accelerator sled, a storage sled, a memory sled, etc.), an enhanced or smart network interface controller (NIC)/HFI, a network appliance (e.g., physical or virtual), a router, switch (e.g., a disaggregated switch, a rack-mounted switch, a standalone switch, a fully managed switch, a partially managed switch, a full-duplex switch, and/or a half-duplex communication mode enabled switch), a web appliance, a distributed computing system, a processor-based system, and/or a multiprocessor system. - As shown in
FIG. 1 , the illustrativenetwork compute device 106 includes one ormore processors 108,memory 118, an I/O subsystem 120, one or moredata storage devices 122, communication circuitry 124, ademote device 130, and, in some embodiments, one or moreperipheral devices 128. It should be appreciated that thenetwork compute device 106 may include other or additional components, such as those commonly found in a typical computing device (e.g., various input/output devices and/or other components), in other embodiments. Additionally, in some embodiments, one or more of the illustrative components may be incorporated in, or otherwise form a portion of, another component. - The processor(s) 108 may be embodied as any type of device or collection of devices capable of performing the various compute functions as described herein. In some embodiments, the processor(s) 108 may be embodied as one or more multi-core processors, digital signal processors (DSPs), microcontrollers, or other processor(s) or processing/controlling circuit(s). In some embodiments, the processor(s) 108 may be embodied as, include, or otherwise be coupled to an integrated circuit, an embedded system, a field-programmable-array (FPGA) (e.g., reconfigurable circuitry), a system-on-a-chip (SOC), an application specific integrated circuit (ASIC), reconfigurable hardware or hardware circuitry, or other specialized hardware to facilitate performance of the functions described herein.
- The illustrative processor(s) 108 include multiple processor cores 110 (e.g., two processor cores, four processor cores, eight processor cores, sixteen processor cores, etc.) and a cache memory 112. Each of the processor cores 110 may be embodied as an independent logical execution unit capable of executing programmed instructions. It should be appreciated that, in some embodiments, the network compute device 106 (e.g., in supercomputer embodiments) may include thousands of processor cores. Each of the processor(s) 108 may be connected to a physical connector, or socket, on a motherboard (not shown) of the network compute device 106 that is configured to accept a single physical processor package (i.e., a multi-core physical integrated circuit). Further, each of the processor cores 110 is communicatively coupled to at least a portion of the cache memory 112 and functional units usable to independently execute programs, operations, threads, etc. It should be appreciated that the processor(s) 108 as described herein are not limited to being on the same die or socket.
- The cache memory 112 may be embodied as any type of cache that the processor 108 can access more quickly than the memory 118 (i.e., main memory), such as an on-die cache or on-processor cache. In other embodiments, the cache memory 112 may be an off-die cache, but reside on the same system-on-a-chip (SoC) as the processor 108. The illustrative cache memory 112 includes a multi-level cache architecture embodied as a core-local cache 114 and a shared cache 116. The core-local cache 114 may be embodied as a cache memory dedicated to a particular one of the processor cores 110. Accordingly, while illustratively shown as a single core-local cache 114, it should be appreciated that there may be at least one core-local cache 114 for each processor core 110, in some embodiments.
- The shared cache 116 may be embodied as a cache memory, typically larger than the core-local cache 114 and shared by all of the processor cores 110 of a processor 108. For example, in an illustrative embodiment, the core-local cache 114 may be embodied as a level 1 (L1) cache and a level 2 (L2) cache, while the shared cache 116 may be embodied as a level 3 (L3) cache. In such embodiments, it should be appreciated that the L1 cache may be embodied as any memory type local to a processor core 110, commonly referred to as a "primary cache," that is the fastest memory closest to the processor 108. It should be further appreciated that, in such embodiments, the L2 cache may be embodied as any type of memory local to a processor core 110, commonly referred to as a "mid-level cache," that is capable of feeding the L1 cache, having larger, slower memory than the L1 cache, but typically smaller, faster memory than the L3/shared cache 116 (i.e., the last-level cache (LLC)). In other embodiments, the multi-level cache architecture may include additional and/or alternative levels of cache memory. While not illustratively shown in FIG. 1, it should be further appreciated that the cache memory 112 includes a memory controller (see, e.g., the cache manager 214 of FIG. 2), which may be embodied as a controller circuit or other logic that serves as an interface between the processor 108 and the memory 118.
- The memory 118 may be embodied as any type of volatile or non-volatile memory or data storage capable of performing the functions described herein. In operation, the memory 118 may store various data and software used during operation of the network compute device 106, such as operating systems, applications, programs, libraries, and drivers. It should be appreciated that the memory 118 may be referred to as main memory (i.e., a primary memory). Volatile memory may be a storage medium that requires power to maintain the state of data stored by the medium. Non-limiting examples of volatile memory may include various types of random access memory (RAM), such as dynamic random access memory (DRAM) or static random access memory (SRAM).
- Each of the processor(s) 108 and the memory 118 are communicatively coupled to other components of the network compute device 106 via the I/O subsystem 120, which may be embodied as circuitry and/or components to facilitate input/output operations with the processor(s) 108, the memory 118, and other components of the network compute device 106. For example, the I/O subsystem 120 may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, integrated sensor hubs, firmware devices, communication links (e.g., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.), and/or other components and subsystems to facilitate the input/output operations. In some embodiments, the I/O subsystem 120 may form a portion of a SoC and be incorporated, along with one or more of the processors 108, the memory 118, and other components of the network compute device 106, on a single integrated circuit chip.
- The one or more data storage devices 122 may be embodied as any type of storage device(s) configured for short-term or long-term storage of data, such as, for example, memory devices and circuits, memory cards, hard disk drives, solid-state drives, or other data storage devices. Each data storage device 122 may include a system partition that stores data and firmware code for the data storage device 122. Each data storage device 122 may also include an operating system partition that stores data files and executables for an operating system.
- The communication circuitry 124 may be embodied as any communication circuit, device, or collection thereof, capable of enabling communications between the network compute device 106 and other computing devices, such as the source compute device 102, as well as any network communication enabling devices, such as an access point, switch, router, etc., to allow communication over the network 104. Accordingly, the communication circuitry 124 may be configured to use any one or more communication technologies (e.g., wireless or wired communication technologies) and associated protocols (e.g., Ethernet, Bluetooth®, Wi-Fi®, WiMAX, LTE, 5G, etc.) to effect such communication.
- It should be appreciated that, in some embodiments, the communication circuitry 124 may include specialized circuitry, hardware, or combination thereof to perform pipeline logic (e.g., hardware algorithms) for performing the functions described herein, including processing network packets (e.g., parse received network packets, determine destination computing devices for each received network packet, forward the network packets to a particular buffer queue of a respective host buffer of the network compute device 106, etc.), performing computational functions, etc.
- In some embodiments, performance of one or more of the functions of the communication circuitry 124 as described herein may be performed by specialized circuitry, hardware, or combination thereof of the communication circuitry 124, which may be embodied as a SoC or otherwise form a portion of a SoC of the network compute device 106 (e.g., incorporated on a single integrated circuit chip along with a processor 108, the memory 118, and/or other components of the network compute device 106). Alternatively, in some embodiments, the specialized circuitry, hardware, or combination thereof may be embodied as one or more discrete processing units of the network compute device 106, each of which may be capable of performing one or more of the functions described herein.
- The illustrative communication circuitry 124 includes the HFI 126, which may be embodied as one or more add-in-boards, daughtercards, network interface cards, controller chips, chipsets, or other devices that may be used by the network compute device 106 to connect with another compute device (e.g., the source compute device 102). In some embodiments, the HFI 126 may be embodied as part of a SoC that includes one or more processors, or included on a multichip package that also contains one or more processors. In some embodiments, the HFI 126 may include a local processor (not shown) and/or a local memory (not shown) that are both local to the HFI 126. In such embodiments, the local processor of the HFI 126 may be capable of performing one or more of the functions of a processor 108 described herein. Additionally or alternatively, in such embodiments, the local memory of the HFI 126 may be integrated into one or more components of the network compute device 106 at the board level, socket level, chip level, and/or other levels.
- The one or more peripheral devices 128 may include any type of device that is usable to input information into the network compute device 106 and/or receive information from the network compute device 106. The peripheral devices 128 may be embodied as any auxiliary device usable to input information into the network compute device 106, such as a keyboard, a mouse, a microphone, a barcode reader, an image scanner, etc., or output information from the network compute device 106, such as a display, a speaker, graphics circuitry, a printer, a projector, etc. It should be appreciated that, in some embodiments, one or more of the peripheral devices 128 may function as both an input device and an output device (e.g., a touchscreen display, a digitizer on top of a display screen, etc.). It should be further appreciated that the types of peripheral devices 128 connected to the network compute device 106 may depend on, for example, the type and/or intended use of the network compute device 106. Additionally or alternatively, in some embodiments, the peripheral devices 128 may include one or more ports, such as a USB port, for example, for connecting external peripheral devices to the network compute device 106.
- The cache line demote device 130 may be embodied as any type of firmware, software, and/or hardware device that is usable to initiate a cache line demotion from the core-local cache 114 to the shared cache 116. In some embodiments, the cache line demote device 130 may be embodied as, but is not limited to, a copy engine, a direct memory access (DMA) device usable to copy data, an offload read-capable device, etc. It should be appreciated that the cache line demote device 130 may be any type of device that is capable of reading or pretending to read data, so long as, when the device interacts with the data or otherwise requests access to the data, the cache lines associated with that data get demoted to the shared cache 116 as a side effect.
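The following sketch shows one way a driver might trigger such a side-effect demotion; every name in it (struct demote_desc, dev_submit_read(), offload_demote()) is a hypothetical device interface invented for illustration rather than an interface defined by this disclosure.

```c
#include <stdint.h>

/* Hypothetical descriptor for an offload engine that "reads" a
 * buffer; servicing the read causes the coherence protocol to
 * demote the touched lines from the owning core's local cache to
 * the shared cache as a side effect. */
struct demote_desc {
    uint64_t addr;   /* IOVA/physical address of the first line */
    uint32_t nlines; /* number of cache lines to touch */
};

/* Hypothetical submission routine of the copy/DMA engine. */
extern void dev_submit_read(const struct demote_desc *d);

void offload_demote(uint64_t addr, uint32_t nlines)
{
    struct demote_desc d = { .addr = addr, .nlines = nlines };
    dev_submit_read(&d); /* lines land in shared cache; no core cycles spent */
}
```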
- The source compute device 102 may be embodied as any type of computation or computer device capable of performing the functions described herein, including, without limitation, a smartphone, a mobile computing device, a tablet computer, a laptop computer, a notebook computer, a computer, a server (e.g., stand-alone, rack-mounted, blade, etc.), a sled (e.g., a compute sled, an accelerator sled, a storage sled, a memory sled, etc.), a network appliance (e.g., physical or virtual), a web appliance, a distributed computing system, a processor-based system, and/or a multiprocessor system. While not illustratively shown, it should be appreciated that the source compute device 102 includes similar and/or like components to those of the illustrative network compute device 106. As such, figures and descriptions of the like components are not repeated herein for clarity of the description, with the understanding that the description of the corresponding components provided above in regard to the network compute device 106 applies equally to the corresponding components of the source compute device 102. Of course, it should be appreciated that the computing devices may include additional and/or alternative components, depending on the embodiment.
- The network 104 may be embodied as any type of wired or wireless communication network, including but not limited to a wireless local area network (WLAN), a wireless personal area network (WPAN), an edge network (e.g., a multi-access edge computing (MEC) network), a fog network, a cellular network (e.g., Global System for Mobile Communications (GSM), Long-Term Evolution (LTE), 5G, etc.), a telephony network, a digital subscriber line (DSL) network, a cable network, a local area network (LAN), a wide area network (WAN), a global network (e.g., the Internet), or any combination thereof. It should be appreciated that, in such embodiments, the network 104 may serve as a centralized network and, in some embodiments, may be communicatively coupled to another network (e.g., the Internet). Accordingly, the network 104 may include a variety of other virtual and/or physical network computing devices (e.g., routers, switches, network hubs, servers, storage devices, compute devices, etc.), as needed to facilitate communication between the network compute device 106 and the source compute device 102, which are not shown to preserve clarity of the description.
- Referring now to FIG. 2, in use, the network compute device 106 establishes an environment 200 during operation. The illustrative environment 200 includes the processor(s) 108, the HFI 126, and the cache line demote device 130 of FIG. 1, as well as a cache manager 214 and a demotion manager 220. The illustrative HFI 126 includes a network traffic ingress/egress manager 208, the illustrative cache line demote device 130 includes an interface manager 210, and the illustrative processor(s) 108 include a packet process operation manager 212. The various components of the environment 200 may be embodied as hardware, firmware, software, or a combination thereof. As such, in some embodiments, one or more of the components of the environment 200 may be embodied as circuitry or a collection of electrical devices (e.g., network traffic ingress/egress management circuitry 208, demote device interface management circuitry 210, packet process operation management circuitry 212, cache management circuitry 214, demotion management circuitry 220, etc.).
- As illustratively shown, the network traffic ingress/egress management circuitry 208, the demote device interface management circuitry 210, the packet process operation management circuitry 212, the cache management circuitry 214, and the demotion management circuitry 220 form a portion of a particular component of the network compute device 106. However, while illustratively shown as being performed by a particular component of the network compute device 106, it should be appreciated that, in other embodiments, one or more functions described herein as being performed by the network traffic ingress/egress management circuitry 208, the demote device interface management circuitry 210, the packet process operation management circuitry 212, the cache management circuitry 214, and/or the demotion management circuitry 220 may be performed, at least in part, by one or more other components of the network compute device 106.
- Additionally, in some embodiments, one or more of the illustrative components may form a portion of another component and/or one or more of the illustrative components may be independent of one another. Further, in some embodiments, one or more of the components of the environment 200 may be embodied as virtualized hardware components or emulated architecture, which may be established and maintained by the HFI 126, the processor(s) 108, or other components of the network compute device 106. It should be appreciated that the network compute device 106 may include other components, sub-components, modules, sub-modules, logic, sub-logic, and/or devices commonly found in a computing device, which are not illustrated in FIG. 2 for clarity of the description.
- In the illustrative environment 200, the network compute device 106 additionally includes cache line address data 202, demotion data 204, and network packet data 206, each of which may be accessed by the various components and/or sub-components of the network compute device 106. Additionally, it should be appreciated that, in some embodiments, the data stored in, or otherwise represented by, each of the cache line address data 202, the demotion data 204, and the network packet data 206 may not be mutually exclusive relative to each other. For example, in some implementations, data stored in the cache line address data 202 may also be stored as a portion of one or more of the demotion data 204 and/or the network packet data 206, or in another alternative arrangement. As such, although the various data utilized by the network compute device 106 is described herein as particular discrete data, such data may be combined, aggregated, and/or otherwise form portions of a single or multiple data sets, including duplicative copies, in other embodiments.
- The network traffic ingress/egress manager 208, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to receive inbound and route/transmit outbound network traffic. To do so, the illustrative network traffic ingress/egress manager 208 is configured to facilitate inbound network communications (e.g., network traffic, network packets, network flows, etc.) to the network compute device 106 (e.g., from the source compute device 102). Accordingly, the network traffic ingress/egress manager 208 is configured to manage (e.g., create, modify, delete, etc.) connections to physical and virtual network ports (i.e., virtual network interfaces) of the network compute device 106 (e.g., via the communication circuitry 124), as well as the ingress buffers/queues associated therewith.
- Additionally, the network traffic ingress/egress manager 208 is configured to facilitate outbound network communications (e.g., network traffic, network packet streams, network flows, etc.) from the network compute device 106. To do so, the network traffic ingress/egress manager 208 is configured to manage (e.g., create, modify, delete, etc.) connections to physical and virtual network ports/interfaces of the network compute device 106 (e.g., via the communication circuitry 124), as well as the egress buffers/queues associated therewith. In some embodiments, at least a portion of the network packet (e.g., at least a portion of a header of the network packet, at least a portion of a payload of the network packet, a checksum, etc.) may be stored in the network packet data 206.
- The demote device interface manager 210, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to manage the interface of the cache line demote device 130. For example, the demote device interface manager 210 is configured to receive cache line demote commands from the processor(s) 108 that are usable to identify which cache line(s) are to be demoted from the core-local cache 114 to the shared cache 116. Additionally, the demote device interface manager 210 is configured to perform some operation (e.g., a read request) in response to having received a cache line demote command to demote one or more cache lines from the core-local cache 114 to the shared cache 116. It should be appreciated that the cache line demote command includes an identifier of each cache line that is to be demoted from the core-local cache 114 to the shared cache 116, and each identifier is usable by the cache line demote device 130 to demote (e.g., copy, evict, etc.) the applicable cache line(s).
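One plausible shape for such a demote command is sketched below; the field names and the fixed-size identifier array are illustrative assumptions rather than a format specified by this disclosure.

```c
#include <stdint.h>

#define DEMOTE_MAX_LINES 16 /* assumed per-command limit */

/* Illustrative cache line demote command: the requesting core plus
 * one identifier per cache line to be demoted from the core-local
 * cache 114 to the shared cache 116. */
struct cl_demote_cmd {
    uint32_t core_id;                   /* requesting processor core */
    uint32_t count;                     /* identifiers actually used */
    uint64_t line_id[DEMOTE_MAX_LINES]; /* cache line identifiers */
};
```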
- The packet process operation manager 212, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to identify which packet processing operations are to be performed on at least a portion of the data of a received network packet (e.g., a header field of the network packet, a portion of the payload of the network packet, etc.) and the associated processor core 110 by which each packet processing operation is to be performed. Additionally, in some embodiments, the packet process operation manager 212 may be configured to identify when each packet processing operation has completed and provide an indication of completion (e.g., to the demotion manager 220). It should be appreciated that, while described herein as being performed by an associated processor core 110, one or more of the packet processing operations may be performed by any type of compute device/logic (e.g., an accelerator device/logic) that may need to access the cache memory 112.
- The cache manager 214, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to manage the cache memory 112 (e.g., the core-local cache 114 and the shared cache 116). To do so, the cache manager 214 is configured to manage the addition and eviction of entries into and out of the cache memory 112. Accordingly, the cache manager 214, which may be embodied as or otherwise include a memory management unit, is further configured to record the results of virtual address to physical address translations. In such embodiments, the translations may be stored in the cache line address data 202. The cache manager 214 is additionally configured to facilitate the fetching of data from main memory and the storage of cached data to main memory, as well as the demotion of data from the applicable core-local cache 114 to the shared cache 116 and the promotion of data from the shared cache 116 to the applicable core-local cache 114.
- The demotion manager 220, which may be embodied as hardware, firmware, software, virtualized hardware, emulated architecture, and/or a combination thereof as discussed above, is configured to manage the demotion of data from the core-local cache 114 to the shared cache 116. To do so, the demotion manager 220 is configured to either transmit instructions to a cache memory manager (e.g., the cache manager 214) to demote (e.g., copy, evict, etc.) the processed data from the core-local cache 114 to the shared cache 116, or transmit a command to the cache line demote device 130 to demote the processed data from the core-local cache 114 to the shared cache 116. To determine whether to send the cache line demotion instruction to the cache manager 214 or the cache line demotion command to the cache line demote device 130, the demotion manager 220 is further configured to compare a size of a network packet to a predetermined packet size threshold.
- If the demotion manager 220 determines the network packet size is greater than the packet size threshold, the demotion manager 220 is configured to transmit the cache line demotion command to the cache line demote device 130. Otherwise, if the demotion manager 220 determines the network packet size is less than or equal to the packet size threshold, the demotion manager 220 is configured to transmit the cache line demotion instruction to the cache manager 214. Additionally, the demotion manager 220 is configured to include an identifier of each cache line, or a range of cache lines, to be demoted from the core-local cache 114 to the shared cache 116 in the cache line demotion instructions/commands. As illustratively shown, the demotion manager 220 may be configured as an offload device; however, in some embodiments, the functions described herein may be performed by, or the demotion manager 220 may otherwise form a portion of, the processor 108 or the processor cores 110. It should be appreciated that, under such conditions in which the next cache location is known ahead of time, the demotion manager 220 may be configured to move the data to known core-local cache entries of the core-local cache associated with the next processor core in the packet processing pipeline.
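Expressed in code, the demotion manager's policy reduces to a single comparison; in the sketch below, send_demote_instruction() and send_demote_command() stand in for the cache manager 214 path and the cache line demote device 130 path, respectively, and are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

extern void send_demote_instruction(const uint64_t *lines, size_t n); /* cache manager 214 path */
extern void send_demote_command(const uint64_t *lines, size_t n);     /* demote device 130 path */

/* Demotion manager policy: small packets are demoted inline via
 * per-line instructions; large packets are offloaded so the core
 * does not spend cycles issuing one demote per cache line. */
void demote_packet_lines(const uint64_t *lines, size_t n,
                         size_t pkt_size, size_t threshold)
{
    if (pkt_size > threshold)
        send_demote_command(lines, n);     /* offload to the demote device */
    else
        send_demote_instruction(lines, n); /* inline via the cache manager */
}
```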
- Referring now to FIG. 3, a method 300 for demoting cache lines to shared cache is shown which may be executed by a compute device (e.g., the network compute device 106 of FIGS. 1 and 2). The method 300 begins with block 302, in which the network compute device 106 determines whether to process a network packet (e.g., a processor 108 has polled the HFI 126 for the next packet to process). If so, the method 300 advances to block 304, in which the network compute device 106 identifies one or more packet processing operations to be performed on at least a portion of a network packet by a processor core 110. In block 306, the network compute device 106, or more particularly the requesting processor core 110, performs the identified packet processing operation(s) on the applicable portion of the network packet to be processed. It should be appreciated that, while described herein as being performed by a requesting processor core 110, one or more of the packet processing operations may be performed by any type of compute device/logic (e.g., an accelerator device/logic) that may need to access the cache memory 112.
- In block 308, the network compute device 106 determines whether the requesting processor core 110, or applicable compute device/logic, has completed the identified packet processing operation(s), such as may be indicated by the requesting processor core 110. If so, the method 300 advances to block 310, in which the network compute device 106 determines which one or more cache lines in the core-local cache 114 are associated with the processed network packet. Additionally, in block 312, the network compute device 106 identifies a size of the network packet. In block 314, the network compute device 106 compares the identified network packet size to a packet size threshold. In block 316, the network compute device 106 determines whether the identified network packet size is greater than the packet size threshold.
- If the network compute device 106 determines that the identified network packet size is less than or equal to the packet size threshold, the method 300 branches to block 318, in which the network compute device 106 is configured to transmit a cache line demotion instruction to the cache manager 214 to demote the one or more cache lines associated with the processed network packet from the core-local cache 114 to the shared cache 116. Additionally, in block 320, the network compute device 106 includes a cache line identifier of each determined cache line in the core-local cache 114 in the cache line demotion instruction. Referring back to block 316, if the network compute device 106 determines that the network packet size is greater than the packet size threshold, the method 300 branches to block 322, in which the network compute device 106 transmits a cache line demotion command to the cache line demote device 130 to trigger a cache line demotion operation to demote the one or more cache lines associated with the processed network packet from the core-local cache 114 to the shared cache 116. Additionally, in block 324, the network compute device 106 includes one or more cache line identifiers corresponding to the one or more cache lines to be demoted in the cache line demotion command.
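Tying blocks 302 through 324 together, the method might be expressed as the following loop; hfi_poll(), process_packet(), packet_lines(), and the threshold value are hypothetical helpers standing in for the corresponding blocks of FIG. 3, and demote_packet_lines() is the policy helper sketched above.

```c
#include <stddef.h>
#include <stdint.h>

struct pkt { size_t len; /* ... payload and metadata elided ... */ };

extern struct pkt *hfi_poll(void);                             /* block 302 */
extern void process_packet(struct pkt *p);                     /* blocks 304-306 */
extern const uint64_t *packet_lines(struct pkt *p, size_t *n); /* block 310 */
extern void demote_packet_lines(const uint64_t *lines, size_t n,
                                size_t pkt_size, size_t threshold);

#define PKT_SIZE_THRESHOLD 512u /* assumed threshold for blocks 312-316 */

void method_300_loop(void)
{
    for (;;) {
        struct pkt *p = hfi_poll();
        if (p == NULL)
            continue;               /* no packet ready for processing */
        process_packet(p);
        size_t n;
        const uint64_t *lines = packet_lines(p, &n);
        demote_packet_lines(lines, n, p->len,
                            PKT_SIZE_THRESHOLD); /* blocks 318-324 */
    }
}
```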
- Referring now to FIGS. 4 and 5, in use, the network compute device 106 establishes an illustrative environment 400 for demoting cache lines to the shared cache 116 via cache line demote instructions and an illustrative environment 500 for demoting cache lines to the shared cache 116 via cache line demote commands to the cache line demote device 130. Referring now to FIG. 4, the illustrative environment 400 includes the HFI 126, a processor core 110, the core-local cache 114, the shared cache 116, and the cache line demote device 130 of FIG. 1, as well as the cache manager 214 of FIG. 2. Each of the illustrative core-local cache 114 and the shared cache 116 includes multiple cache entries.
- As illustratively shown, the core-local cache 114 includes multiple core-local cache entries 404. The illustrative core-local cache entries 404 include a first core-local cache entry designated as core-local cache entry (1) 404a, a second core-local cache entry designated as core-local cache entry (2) 404b, a third core-local cache entry designated as core-local cache entry (3) 404c, a fourth core-local cache entry designated as core-local cache entry (4) 404d, and a fifth core-local cache entry designated as core-local cache entry (N) 404e (i.e., the "Nth" core-local cache entry 404, wherein "N" is a positive integer and designates one or more additional core-local cache entries 404). Similarly, the illustrative shared cache 116 includes multiple shared cache entries 406. The illustrative shared cache entries 406 include a first shared cache entry designated as shared cache entry (1) 406a, a second shared cache entry designated as shared cache entry (2) 406b, a third shared cache entry designated as shared cache entry (3) 406c, a fourth shared cache entry designated as shared cache entry (4) 406d, and a fifth shared cache entry designated as shared cache entry (N) 406e (i.e., the "Nth" shared cache entry 406, wherein "N" is a positive integer and designates one or more additional shared cache entries 406).
- Referring now to FIG. 5, similar to the illustrative environment of FIG. 4, the illustrative environment 500 includes the HFI 126, the processor core 110, the core-local cache 114, the shared cache 116, and the cache line demote device 130 of FIG. 1, as well as the cache manager 214 of FIG. 2. As described previously, the processor core 110 is configured to poll the HFI 126 (e.g., via an HFI/host interface (not shown)) for an available network packet for processing and perform some level of processing operation on at least a portion of the data of the network packet. As also described previously, upon completion of the processing operation, the processor core 110 is further configured to provide some indication that one or more cache lines are to be demoted from the core-local cache 114 to the shared cache 116.
- Referring back to FIG. 4, as illustratively shown, the indication provided by the processor core 110 is in the form of one or more cache line demote instructions. It should be appreciated that each cache line demote instruction is usable to identify a cache line from the core-local cache 114 and demote the data to the shared cache 116. As such, it should be appreciated that such instructions may not be as efficient for larger packets. Accordingly, the processor core 110 is configured to, for larger blocks of data, utilize the cache line demote device 130 to offload the demote operation. To do so, referring again to FIG. 5, the processor core 110 is configured to transmit a cache line demotion command 502 to the cache line demote device 130 to trigger a cache line demotion operation to be performed by the cache line demote device 130, such as may be performed via a data read request, a DMA request, etc., or any other type of request that will result in the data being demoted to the shared cache 116 as a side effect without wasting processor core cycles.
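For example, assuming 64-byte cache lines, a 1500-byte packet spans ceil(1500/64) = 24 cache lines, so demoting it inline costs the processor core roughly 24 demote instructions, whereas the offload path costs a single command 502 regardless of packet size; a 64-byte packet, by contrast, occupies a single cache line, for which one demote instruction is cheaper than submitting a command.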
- As illustratively shown in both FIGS. 4 and 5, the data in core-local cache line (1) 404a, core-local cache line (2) 404b, and core-local cache line (3) 404c is associated with the processed network packet, as indicated by the highlighted outline surrounding each of those core-local cache lines 404. As also illustratively shown, the cache line demotion operation results in that data being demoted such that the data in core-local cache line (1) 404a is demoted to shared cache line (1) 406a, the data in core-local cache line (2) 404b is demoted to shared cache line (2) 406b, and the data in core-local cache line (3) 404c is demoted to shared cache line (3) 406c; however, it should be appreciated that, as a result of the cache line demotion operation, the demoted cache lines may be moved to any available shared cache lines 406.
Claims (11)
- A compute device (106) for demoting cache lines to a shared cache, the compute device comprising:
one or more processors (108), wherein each of the one or more processors (108) includes a plurality of processor cores (110);
a cache memory (112), wherein the cache memory (112) includes a core-local cache (114) and a shared cache (116), wherein the core-local cache (114) includes a plurality of core-local cache lines (404a .. 404e), and wherein the shared cache includes a plurality of shared cache lines (406a .. 406e); and
a host fabric interface, HFI, (126) to receive a network packet,
characterized in that the compute device (106) further comprises:
a cache line demote device (130); and in that
a processor core of a processor of the one or more processors (108) is to:
retrieve at least a portion of data of the received network packet, wherein to retrieve the data comprises to move the data into one or more core-local cache lines of the plurality of core-local cache lines;
perform one or more processing operations on the data; and
transmit, subsequent to having completed the one or more processing operations on the data and subsequent to a determination by the processor core that a size of the received network packet is greater than a packet size threshold, a cache line demotion command to the cache line demote device (130), and
transmit, subsequent to having determined that the size of the received network packet is less than or equal to the packet size threshold, a cache line demote instruction to a cache manager (214) of the cache memory, and
wherein the cache line demote device (130) is to perform, in response to having received the cache line demotion command, a cache line demotion operation to demote the data from the one or more core-local cache lines to one or more shared cache lines of the shared cache (116), and
wherein the cache manager (214) is to demote the data from the one or more core-local cache lines to the one or more shared cache lines of the shared cache (116) based on the cache line demote instruction.
- The compute device of claim 1, wherein the cache line demote instruction bypasses the cache line demote device (130), and wherein to transmit the cache line demotion instruction includes to transmit one or more cache line identifiers corresponding to the one or more shared cache lines.
- The compute device of claim 1, wherein to perform the cache line demotion operation comprises to perform a read request or a direct memory access.
- The compute device of claim 1, wherein the cache line demotion command includes an indication of the core-local cache lines associated with the received network packet that are to be demoted to the shared cache.
- The compute device of claim 1, wherein the cache line demote device (130) comprises one of a copy engine, a direct memory access, DMA, device usable to copy data, or an offload device usable to perform a read operation.
- The compute device of claim 1, wherein to transmit the cache line demotion command includes to transmit one or more cache line identifiers corresponding to the one or more shared cache lines.
- A method (300) for demoting cache lines to a shared cache, the method comprising:
retrieving, by a processor of a compute device, at least a portion of data of a network packet received by a host fabric interface, HFI, of the compute device, wherein to retrieve the data comprises to move the data into one or more core-local cache lines of a plurality of core-local cache lines of a core-local cache of the compute device, and wherein the processor includes a plurality of processor cores;
performing (306), by a processor core of the plurality of processor cores, one or more processing operations on the data;
transmitting (322), by the processor core, subsequent to having completed the one or more processing operations on the data and in response to a determination by the processor core that a size of the received network packet is greater than a packet size threshold, a cache line demotion command to a cache line demote device of the compute device;
transmitting (318), by the processor core and subsequent to having determined that the size of the received network packet is less than or equal to the packet size threshold, a cache line demote instruction to a cache manager of a cache memory that includes the core-local cache and the shared cache;
performing, by the cache line demote device and in response to having received the cache line demotion command, a cache line demotion operation to demote the data from the one or more core-local cache lines to one or more shared cache lines of a shared cache of the compute device; and
demoting, by the cache manager, the data from the one or more core-local cache lines to the one or more shared cache lines of the shared cache based on the cache line demote instruction.
- The method of claim 7, wherein transmitting the cache line demotion instruction includes transmitting one or more cache line identifiers corresponding to the one or more shared cache lines.
- The method of claim 7, wherein performing the cache line demotion operation comprises performing one of a read request or a direct memory access.
- The method of claim 7, wherein transmitting the cache line demotion command includes transmitting one or more cache line identifiers corresponding to the one or more shared cache lines.
- One or more machine-readable storage media comprising a plurality of instructions stored thereon that, when executed by a compute device according to claim 1, cause the compute device to perform the method of any of claims 7-10.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/024,773 US10657056B2 (en) | 2018-06-30 | 2018-06-30 | Technologies for demoting cache lines to shared cache |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3588310A1 EP3588310A1 (en) | 2020-01-01 |
EP3588310B1 true EP3588310B1 (en) | 2021-12-08 |
Family
ID=65231651
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP19177464.5A Active EP3588310B1 (en) | 2018-06-30 | 2019-05-29 | Technologies for demoting cache lines to shared cache |
Country Status (3)
Country | Link |
---|---|
US (1) | US10657056B2 (en) |
EP (1) | EP3588310B1 (en) |
CN (1) | CN110659222A (en) |
Also Published As
Publication number | Publication date |
---|---|
CN110659222A (en) | 2020-01-07 |
US10657056B2 (en) | 2020-05-19 |
EP3588310A1 (en) | 2020-01-01 |
US20190042419A1 (en) | 2019-02-07 |