CN118051470A - Method for operating a computing device and storage device - Google Patents

Method for operating a computing device and storage device

Info

Publication number
CN118051470A
Authority
CN
China
Prior art keywords
computing
data
task
workload
storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311455047.2A
Other languages
Chinese (zh)
Inventor
玛丽·麦·阮
瑞卡·皮塔楚玛尼
奇亮奭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 18/121,586 (published as US 2024/0168819 A1)
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of CN118051470A

Abstract

Methods and storage devices for operating a computing device are disclosed. The method may comprise: performing, at a computing storage device, a first computing task of a workload using first data stored at the computing storage device, wherein the step of performing the first computing task of the workload may include generating second data; transmitting the second data from the computing storage device to a computing device using an interconnect fabric; and performing, at the computing device, a second computing task of the workload using the second data. The step of transmitting the second data may include transmitting the second data using a root complex of the interconnect fabric. The step of transmitting the second data may include transmitting the second data using a switch of the interconnect fabric. The step of transmitting the second data may include performing a peer-to-peer transmission. The step of transmitting the second data may include performing a direct memory access.

Description

Method for operating a computing device and storage device
The present application claims priority to and the benefit of U.S. provisional patent application Ser. No. 63/426,361, filed on November 17, 2022, and U.S. patent application Ser. No. 18/121,586, filed on March 14, 2023, both of which are incorporated herein by reference.
Technical Field
The present disclosure relates generally to computing devices, and more particularly to systems, methods, and apparatus for operating computing devices.
Background
A data processing system may include one or more computing devices (such as accelerators, computing storage devices, etc.). A computing device may store data in a memory (such as dynamic random access memory (DRAM)), in a storage medium (such as a flash memory medium), or the like. The computing device may include one or more computing resources that may enable the device to perform computing operations on data stored at the device. The computing operations may involve reading data from and/or writing data to the memory, the storage media, etc.
The above information disclosed in this background section is only for enhancement of understanding of the background of the inventive principles, and therefore it may contain information that does not constitute prior art.
Disclosure of Invention
A method may include: performing, at a computing storage device, a first computing task of a workload using first data stored at the computing storage device, wherein the step of performing the first computing task of the workload may include generating second data; transmitting the second data from the computing storage device to a computing device using an interconnect fabric; and performing, at the computing device, a second computing task of the workload using the second data. The step of transmitting the second data may include transmitting the second data using a root complex of the interconnect fabric. The step of transmitting the second data may include transmitting the second data using a switch of the interconnect fabric. The step of transmitting the second data may include performing a peer-to-peer transmission. The step of transmitting the second data may include performing a direct memory access. The method may further comprise: allocating the first computing task of the workload based on a size of the first data and a memory capacity of the computing device. The method may further comprise: allocating the first computing task of the workload based on a performance characteristic of the first computing task of the workload. The method may further comprise: allocating the first computing task of the workload based on an operational state of the computing device. The interconnect fabric may be connected to a host, and the method may further comprise: allocating the first computing task of the workload based on a memory capacity of the host. The interconnect fabric may be connected to a host, and the method may further comprise: allocating the first computing task of the workload based on an operational state of the host. The workload may comprise a machine learning workload, and the first computing task of the workload may comprise a reduction operation. The first computing task of the workload may include a sparse length summation operation. The method may further comprise: performing, at the computing storage device, a third computing task of the workload using the first data. The first data may be at least partially stored in a data structure, and the third computing task of the workload may include updating the data structure.
A storage device may include: a storage medium, at least one computing resource, an interconnect interface, and control circuitry configured to: perform, using at least one of the at least one computing resource, a computing task of a workload using first data stored at the storage device, wherein the computing task of the workload may include generating second data; and transmit the second data from the storage device to a computing device using the interconnect interface. The computing task may include a first computing task of the workload, and the control circuitry may be configured to: perform a second computing task of the workload using at least one of the at least one computing resource. The first data may be stored at least in part in a data structure, and the second computing task of the workload may include updating the data structure. The first computing task of the workload may include a summation operation, and the second computing task of the workload may include a gradient operation.
A method may include: determining a memory capacity of a first computing device connected to an interconnect fabric, wherein the interconnect fabric is connectable to a second computing device; selecting the first computing device based on the memory capacity of the first computing device and a size of first data for a workload, wherein the workload may include a first computing task and a second computing task, and the first computing task generates second data for the second computing task using at least a portion of the first data; transmitting at least a portion of the first data to the first computing device; and performing, by the first computing device, the first computing task of the workload based on the selection. The step of selecting the first computing device may also be based on a performance characteristic of the first computing device and a performance characteristic of the first computing task of the workload. The performance characteristic of the first computing task of the workload may include a latency characteristic. The step of selecting the first computing device may also be based on an operational state of the first computing device. The operational state of the first computing device may include a utilization of the first computing device. The selection of the first computing device may also be based on a persistence characteristic of the first data. The interconnect fabric may be configured for peer-to-peer communication. The first computing device may comprise a host or a storage device. The first computing task of the workload may include a summation operation, and the second computing task of the workload may include a gradient operation.
Drawings
The figures are not necessarily drawn to scale, and elements of similar structure or function may generally be represented by like reference numerals or portions thereof throughout the figures for illustrative purposes. The drawings are only intended to facilitate the description of the various embodiments described herein. The figures do not describe every aspect of the teachings disclosed herein and do not limit the scope of the claims. To prevent the drawings from becoming obscured, not all components, connections, etc. may be shown, and not all components may have reference numerals. However, the configuration of components may be readily apparent from the drawings. The accompanying drawings illustrate example embodiments of the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates an embodiment of a computing system according to a disclosed example embodiment.
FIG. 2 illustrates an embodiment of a computing system including a computing storage device according to a disclosed example embodiment.
FIG. 3 illustrates a first example embodiment of a computing system including a computing storage device according to disclosed example embodiments.
FIG. 4 illustrates an embodiment of a portion of a recommendation model training workload in accordance with a disclosed example embodiment.
Fig. 5 illustrates a second example embodiment of a computing system including a computing storage device according to example embodiments of the present disclosure.
Fig. 6 illustrates an embodiment of an allocation scheme according to a disclosed example embodiment.
FIG. 7 illustrates an embodiment of a method for assigning tasks to computing devices according to a disclosed example embodiment.
Fig. 8 illustrates a first example embodiment of an interconnect structure in accordance with the disclosed example embodiments.
Fig. 9 illustrates a second example embodiment of an interconnect structure in accordance with the disclosed example embodiments.
Fig. 10 illustrates an example embodiment of a host device according to an example embodiment of the disclosure.
FIG. 11 illustrates an example embodiment of a computing device according to an example embodiment of the disclosure.
Fig. 12 illustrates an embodiment of a method for operating a computing device, according to a disclosed example embodiment.
FIG. 13 illustrates an embodiment of a method for assigning tasks to computing devices according to a disclosed example embodiment.
Detailed Description
Some computing workloads may be divided into tasks, one or more of which may be performed by computing devices, such as a Central Processing Unit (CPU), accelerator devices, computing storage devices, and the like. For example, a Machine Learning (ML) workload, such as recommendation model training, may include a first task, such as Sparse Length Sum (SLS) computation, and a second task, such as interaction.
The first task, which may be performed by the CPU, for example, may read the input data from a data structure (e.g., an embedded table) that may be stored, for example, at a storage device. The first task may generate output data that may be used as input data for a second task that may be performed, for example, by the accelerator device. Depending on implementation details, such a task arrangement may involve relatively high overhead, e.g., transferring data from a storage device to a CPU and/or transferring data from a CPU to an accelerator device.
Some of the inventive principles disclosed relate to using a computing storage device to perform tasks related to reading data and/or writing data (e.g., reading relatively large and/or sparse data from and/or writing relatively large and/or sparse data to a relatively high capacity storage device and/or memory). For example, in a computing scheme according to a disclosed example embodiment, a first task of a workload may be performed at a computing storage device, which may read input data from a data structure (e.g., an embedded table) that may be stored at the computing storage device. Depending on implementation details, this may reduce overhead, for example, by reading the input data using a relatively high bandwidth internal data path of the computing storage.
Additionally or alternatively, output data from a first task executing at the computing storage device may be transferred to the accelerator device using the interconnect fabric to serve as input data for a second task of the workload. Depending on implementation details, this may reduce overhead, for example, by transferring data directly from the computing storage device to the accelerator device.
Some additional inventive principles disclosed relate to assigning one or more tasks of a workload to a computing device based on one or more characteristics of the task, one or more characteristics of the one or more computing devices, one or more operating states of the one or more computing devices, and the like. For example, in a task allocation scheme according to a disclosed example embodiment, one or more candidate computing devices may be selected for a task based on the candidate computing devices having sufficient memory and/or storage capacity to accommodate the amount of data associated with the task.
Additionally or alternatively, computing devices may be selected from candidate computing devices based on, for example, a latency specification of a task. Thus, if two candidate computing devices have sufficient memory and/or storage capacity to accommodate the amount of data associated with a task, and the task is relatively sensitive to delay, a first candidate computing device of the candidate computing devices having a relatively higher throughput may be selected for the task.
Additionally or alternatively, a computing device may be selected from candidate computing devices based on, for example, one or more utilization levels of the candidate computing devices. For example, if a first candidate computing device with a higher throughput has a relatively high utilization (e.g., is relatively busy), a second candidate computing device with a relatively lower throughput but lower utilization of the candidate computing devices may be selected for the task.
Additionally or alternatively, computing devices may be selected from candidate computing devices based on, for example, a persistence specification of the task. For example, if the data associated with the task includes persistent data, candidate computing devices having persistent memory and/or storage devices among the candidate computing devices may be selected for the task.
The present disclosure encompasses many of the inventive principles associated with operating a computing device. The principles disclosed herein may have independent utility and may be practiced separately and not every embodiment may utilize every principle. Furthermore, the principles may be implemented in various combinations, some of which may amplify some of the benefits of the various principles in a synergistic manner. For example, some embodiments, which may transfer output data from a first task executing at a computing storage device to an accelerator device for use by a second task, may also implement one or more additional features, such as assigning one or more tasks of a workload to the computing device based on one or more characteristics of the task, one or more characteristics of one or more computing devices, one or more operating states of one or more computing devices, and so forth.
For purposes of illustration, some embodiments may be described in the context of some specific implementation details, such as machine learning workloads, communication protocols such as Compute Express Link (CXL), and the like. However, the principles are not limited to these or any other specific implementation details.
Table 1 shows a first embodiment of a recommendation model training workload in accordance with the disclosed example embodiments.
TABLE 1
In some embodiments, the workload shown in table 1 may include one or more of the following tasks.
Task (1) may include one or more lookup operations, which may involve reading input data (e.g., categorical data) from one or more embedded tables. In some embodiments, the embedded tables may be relatively large, but the input data stored in the embedded tables may be relatively sparse. In some embodiments, the recommendation model may use embeddings to process sparse features that may represent categorical data. For example, one or more categorical features may be represented by one or more embedding vectors (e.g., rows of an embedding table). Additionally or alternatively, task (1) may include one or more Sparse Length Summation (SLS) computations, which may involve summing the input data read from one or more embedded tables. Depending on implementation details, the SLS operation may generate a relatively dense representation (e.g., of one or more features). (A sketch of the lookup and SLS computation is given after the task descriptions below.)
Task (2) may include one or more bottom multi-layer perceptron (MLP) operations to handle relatively dense features, continuous inputs, etc. In some embodiments, the bottom MLP operation may transform dense features, continuous inputs, etc., to generate one or more representations that may have the same or similar length as the one or more embedded vectors.
The task (3 a) may include one or more interactions (e.g., feature interactions) (e.g., by combining one or more outputs of one or more SLS operations and/or one or more outputs of one or more MLP operations). In some embodiments, the one or more interactive operations may include one or more stitching (concatenation) operations, summing operations, and the like.
Task (3 b) may include one or more top MLP operations. In some embodiments, one or more top MLP operations may receive one or more outputs from one or more interactions, e.g., to find event probabilities, capture one or more interactions of features, etc.
Task (4) may include one or more top MLP update operations that may update one or more parameters (e.g., weights, biases, etc.) of the one or more top MLPs using one or more outputs from the one or more interactions and/or top MLPs, e.g., using reverse pass information.
Task (5) may include one or more gradient computation operations to, for example, compute one or more gradients of one or more rows (e.g., one or more vectors) of one or more embedded tables. In some embodiments, gradient computation may use one or more SLS outputs (e.g., SLS output data) and/or embedded table data inputs. Additionally or alternatively, task (5) may include one or more write operations that may write one or more gradients to one or more rows (e.g., one or more vectors) of one or more embedded tables.
Task (6) may include one or more bottom MLP update operations that may update one or more parameters (e.g., weights, biases, etc.) of the one or more bottom MLPs (e.g., using reverse pass information).
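For illustration only, the following is a minimal Python sketch of the lookup operation (1a) and SLS computation (1b) of task (1); the table size, row indices, and batch layout are assumptions made for the example and are not taken from any particular embodiment.

    import numpy as np

    rng = np.random.default_rng(0)
    # Assumed, deliberately small embedded table for the example: 100,000 rows x 64 dimensions.
    embedding_table = rng.normal(size=(100_000, 64)).astype(np.float32)

    # Flattened row indices for a batch of two samples plus per-sample lengths
    # (the "sparse lengths"): sample 0 reads rows [1, 2, 6]; sample 1 reads rows [1, 3, 6].
    indices = np.array([1, 2, 6, 1, 3, 6])
    lengths = np.array([3, 3])

    def sparse_length_sum(table, indices, lengths):
        """Lookup (1a): gather the rows named by indices; SLS (1b): sum them per sample."""
        offsets = np.concatenate(([0], np.cumsum(lengths)))
        pooled = np.empty((len(lengths), table.shape[1]), dtype=table.dtype)
        for sample, (start, end) in enumerate(zip(offsets[:-1], offsets[1:])):
            pooled[sample] = table[indices[start:end]].sum(axis=0)
        return pooled

    sls_output = sparse_length_sum(embedding_table, indices, lengths)
    print(sls_output.shape)  # (2, 64): one relatively dense vector per sample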
The tasks shown in table 1 are not necessarily performed in the sequence indicated by the numbers and/or letters used to identify the tasks. Thus, some tasks may run in parallel (e.g., concurrently), while some other tasks may start based on output from another task. For example, in some embodiments, tasks (1) and (2) may run at least partially in parallel, while task (3) may not begin until tasks (1) and (2) are at least partially completed. In some embodiments, a synchronization mechanism may be used to coordinate tasks that may run at least partially in parallel. For example, the GPU running task (2) may send a notification when task (2) is at least partially completed, and the CPU running task (1) may send a notification when task (1) is at least partially completed, thereby enabling the GPU running task (3) to begin using one or more outputs of task (1) and/or task (2). In some embodiments, the synchronization mechanism may be implemented by a host (e.g., a CPU) and/or an application running on the host.
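For illustration only, the following minimal Python sketch shows one way a host-side coordinator could run tasks (1) and (2) concurrently and start task (3) only after both have signaled completion; the function names and the thread-pool-based synchronization are assumptions made for the example rather than a required mechanism.

    from concurrent.futures import ThreadPoolExecutor

    def task1_lookup_and_sls():               # e.g., offloaded to a computing storage device
        return "sls_output"

    def task2_bottom_mlp():                   # e.g., offloaded to a GPU
        return "bottom_mlp_output"

    def task3_interaction(sls_out, mlp_out):  # runs once both inputs are available
        return f"interaction({sls_out}, {mlp_out})"

    with ThreadPoolExecutor() as pool:
        f1 = pool.submit(task1_lookup_and_sls)  # task (1)
        f2 = pool.submit(task2_bottom_mlp)      # task (2), concurrent with task (1)
        # result() blocks until the corresponding task signals completion,
        # providing the synchronization described above.
        result = task3_interaction(f1.result(), f2.result())
    print(result)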
In some embodiments, the workload shown in table 1 may be used with a Deep Learning Recommendation Model (DLRM). In some embodiments, tasks (1), (2), (3 a), and/or (3 b) may be characterized as forward transfer operations, while tasks (4), (5), and/or (6) may be characterized as reverse transfer operations. Some embodiments may implement a back propagation process in which, for one or more forward passes through a model, one or more back passes may be performed, for example, to adjust one or more parameters (e.g., weights, biases, etc.) of the model.
FIG. 1 illustrates an embodiment of a computing system according to a disclosed example embodiment. The system 100 shown in fig. 1 may include one or more CPUs 102, one or more storage devices 104, and/or one or more graphics processing units (GPUs) 106. CPU 102 may include one or more computing units 108 and/or memory (e.g., DRAM) 110. The storage device 104 may include a storage medium 112. GPU 106 may include one or more computing units 114 and/or memory 116.
For illustration purposes, the system 100 may be configured for machine learning workloads, in particular, recommendation model training workloads (such as those shown in table 1). The workload may include a first task (1) that may be executed, for example, by one or more computing units 108 of the CPU 102. The first task (1) may include a lookup operation (1 a) in which one or more computing units 108 of the CPU 102 may read data (e.g., categorical data) from one or more embedded tables 120 stored in the memory 110 of the CPU 102 as indicated by arrow 115 and/or read data (e.g., categorical data) from one or more embedded tables 118 stored in the storage medium 112 of the storage device 104 as indicated by arrow 117. In some embodiments, data stored in one or more embedded tables 118 in the storage medium 112 of the storage device 104 may be transferred (e.g., copied) to the memory 110 of the CPU 102 as indicated by arrow 119.
Additionally or alternatively, the first task (1) may include a Sparse Length Sum (SLS) operation (1 b) in which one or more computing units 108 of the CPU 102 may perform one or more SLS computations on data obtained by the lookup operation. In some embodiments, SLS computation may involve summing data read from one or more embedded tables 118 and/or 120. The SLS operation may generate output data (e.g., SLS output) 122, and the output data 122 may be stored in, for example, the memory 110 of the CPU 102 as indicated by arrow 125.
In some embodiments, some or all of the memory 110 of the CPU 102 may operate as a cache for the storage medium 112 of the storage device 104. For example, most or all of the embedded tables used by the first task (1) may be stored in the storage medium 112 of the storage device 104, which may have a relatively large storage capacity. Some of the embedded tables, or portions thereof (e.g., more frequently accessed data, which may be referred to as hot data), may be cached in the memory 110 of the CPU 102.
The workload may include a second task (2) that may be performed, for example, by one or more computing units 114 of GPU 106. The second task (2) may include one or more bottom multi-layer perceptron (MLP) operations, which may use input data (e.g., relatively dense features, continuous inputs, etc.) stored in the memory 116.
The workload may include a third task (3) that may be performed, for example, by one or more computing units 114 of GPU 106. In some embodiments, the third task (3) may include one or more interactions (3 a). One or more outputs from the bottom MLP operation may be used as one or more inputs to the interoperation (3 a). Additionally or alternatively, output data (e.g., SLS output) 122a from the SLS operation may be stored in memory 116 and used as one or more inputs to the interoperation (3 a). Some or all of the SLS output data 122 stored in the memory 110 may be transferred from the memory 110 of the CPU 102 to the memory 116 of the GPU 106 as indicated by arrow 123 and stored as SLS output data 122a.
Additionally or alternatively, the third task (3) may include one or more top MLP operations (3 b). In some embodiments, one or more outputs from the interoperation may be used as one or more inputs to the top MLP operation.
The workload may include a fourth task (4) that may be performed, for example, by one or more computing units 114 of GPU 106. The fourth task (4) may include one or more update operations for one or more top MLPs (e.g., using reverse pass information to adjust one or more parameters, weights, biases, etc. of the top MLPs).
The workload may include a fifth task (5) which may be performed, for example, by one or more computing units 108 of the CPU 102. The fifth task (5) may include one or more embedded table update operations. The embedded table update operations may include one or more gradient calculations, which may use output data 122 from one or more SLS operations and/or data from one or more embedded tables 118 and/or 120 as inputs. The embedded table update operation may include one or more write operations in which one or more outputs from one or more gradient calculations may be written to one or more embedded tables 118 and/or 120. In some embodiments, one or more outputs from one or more gradient calculations performed by CPU 102 may be transferred (e.g., copied) to storage 104 as indicated by arrow 121.
The workload may include a sixth task (6) that may be performed, for example, by one or more computing units 114 of GPU 106. The sixth task (6) may include one or more update operations for one or more bottom MLPs (e.g., using reverse pass information to adjust one or more parameters, weights, biases, etc. of the bottom MLPs).
In some embodiments, GPU 106 may have a relatively greater computing power than CPU 102. However, some or all of the data stored in the embedded tables 118 and/or 120 may be too large to be stored in the memory 116 of the GPU 106 and/or the memory 110 of the CPU 102. Thus, some or all of the embedded tables may be stored in the storage medium 112 of one or more storage devices 104. The data stored in the embedded tables 118 in the one or more storage devices 104 may be partially processed by the CPU 102, for example, by copying a portion of the data stored in the embedded tables 118 from the storage devices 104 to the memory 110 of the CPU 102 as indicated by arrow 119. CPU 102 may use one or more computing units 108 to perform SLS operations on portions of embedded table data 120 that are copied to memory 110. One or more SLS operations may generate SLS output data 122 that may be smaller than the embedded table data 120. Thus, depending on implementation details, when SLS output data 122 is transferred as 122a from CPU 102 to GPU 106, SLS output data 122 may fit into memory 116 of GPU 106.
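As a rough, assumed-numbers illustration of this point, the following Python sketch compares the size of the embedded tables with the size of the SLS output for one batch; all of the values are hypothetical.

    # All sizes below are assumptions made for illustration only.
    num_rows  = 500_000_000   # total embedding rows across all tables
    dim       = 128           # embedding dimension
    bytes_per = 4             # float32
    batch     = 4096          # samples per training batch
    tables    = 26            # number of embedded tables (sparse features)

    table_bytes = num_rows * dim * bytes_per        # the full embedded tables
    sls_bytes   = batch * tables * dim * bytes_per  # one pooled vector per table per sample

    print(f"embedded tables: {table_bytes / 2**30:.1f} GiB")  # ~238 GiB: storage-device scale
    print(f"SLS output:      {sls_bytes / 2**20:.1f} MiB")    # ~52 MiB: fits easily in GPU memory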
However, transferring data from one or more storage devices 104 to CPU 102 may involve CPU utilization (e.g., overhead time involved in copying data in embedded tables 118 in storage devices 104 to memory 110 in CPU 102). Depending on implementation details, this may prevent CPU 102 from performing other operations (e.g., SLS operations) while CPU 102 is busy copying data to storage device 104 and/or copying data from storage device 104.
Further, transferring SLS output data 122 from CPU 102 to GPU 106 may involve CPU and/or GPU utilization (e.g., overhead time involved in copying SLS output data 122 from memory 110 in CPU 102 to memory 116 in GPU 106). Depending on implementation details, this may prevent GPU 106 from performing other operations (e.g., MLP operations, interactive operations, etc.) when GPU 106 is busy copying data to CPU 102 and/or from CPU 102.
FIG. 2 illustrates an embodiment of a computing system including a computing storage device according to a disclosed example embodiment. The system 200 shown in fig. 2 may include one or more computing devices 206, one or more hosts 202, and/or one or more computing storage devices 204. One or more computing devices 206, hosts 202, and/or computing storage devices 204 may communicate using an interconnect fabric 224. Computing device 206 may include one or more computing resources 214. The computing storage 204 may include one or more computing resources 227.
In some embodiments, system 200 may be configured for any type of workload that may involve a relatively large data storage capacity, and/or that is divided at least in part into one or more tasks that may involve accessing (e.g., reading and/or writing) a relatively large amount of stored data. For example, in some embodiments, the system 200 may be configured to perform a recommendation model training workload (such as the workload shown in table 1). However, in other embodiments, the system 200 may be configured for other types of workloads, including other machine learning workloads, artificial intelligence workloads, natural language processing (e.g., recognition, generation, etc.) workloads, and the like.
The computing storage 204 may perform a first computing task (or first task) 226 of the workload. The first computing task 226 may receive the first data 228 as input and generate the second data 230 as output. The first computing task 226 may be performed, for example, using one or more computing resources 227. The second data 230 may be transferred to the computing device 206 using the interconnect fabric 224 as indicated by arrow 232. In some embodiments, the interconnect fabric 224 may communicate the second data 230 directly to the computing device 206, e.g., without involvement of the host 202, intervention, processor utilization (e.g., CPU utilization), and so forth. The computing device 206 may perform a second computing task 234 of the workload using the second data 230 as input. The second computing task 234 may be performed, for example, using one or more computing resources 214.
The system 200 shown in fig. 2 may be used, for example, with any type of workload that may be divided into tasks that may be performed by the computing device 206 and/or the computing storage 204, which may involve relatively high data storage capacity and/or may involve read and/or write access to relatively high data storage capacity. In some embodiments, there may be overlap in operations performed by tasks divided from the workload. Thus, in some embodiments, a task partitioned from a workload may refer to a task partitioned at least in part from a workload.
Although the system 200 shown in fig. 2 is not limited to any particular type of workload, embodiments configured for recommendation model training workloads may operate as follows. The first task 226 may include one or more lookup and/or SLS computations, and/or the second task 234 may include one or more interactions. The first data 228 stored at the computing storage 204 may include one or more embedded tables, and/or the second data 230 may include SLS output data from one or more SLS computations in the first task 226. Some or all of the SLS output data 230 may be communicated (e.g., directly) to the computing device 206 using the interconnect fabric 224.
Depending on implementation details, performing one or more lookups and/or SLS computations at the computation storage 204 storing the embedded table data may reduce or eliminate data transfer overhead (e.g., data replication overhead involved in copying the embedded table data from the storage 104 to the CPU 102, such as shown by arrow 119 in fig. 1). Additionally or alternatively, transferring SLS output data from computing storage 204 to computing device 206 (e.g., directly from computing storage 204 to computing device 206) using interconnect fabric 224 may reduce or eliminate data transfer overhead (e.g., data replication overhead involved in copying SLS output data 122 from CPU 102 to GPU 106, such as shown by arrow 123 in fig. 1).
Computing storage 204 may be implemented using any type of memory and/or storage medium, including any type of solid state medium, magnetic medium, optical medium, etc. For example, in some embodiments, the storage device may be implemented as a solid state drive (SSD) based on NAND flash memory, persistent memory (PMEM) such as cross-gridded nonvolatile memory, memory with bulk resistance change, phase change memory (PCM), dynamic random access memory (DRAM), and/or the like, and/or any combination thereof.
Any of the computing storage devices disclosed herein may be implemented in any form factor (such as 3.5 inch, 2.5 inch, 1.8 inch, M.2, Enterprise and Data Center SSD Form Factor (EDSFF), NF1, etc.) using any connector configuration (such as Serial ATA (SATA), Small Computer System Interface (SCSI), Serial Attached SCSI (SAS), U.2, etc.). Any of the storage devices disclosed herein may be implemented in whole or in part with and/or used in combination with a server chassis, a server rack, a data room, a data center, an edge data center, a mobile edge data center, and/or any combination thereof. In some embodiments, the compute storage device may be implemented as a Compute Storage Drive (CSD), a Compute Storage Processor (CSP), and/or a Compute Storage Array (CSA).
In some embodiments, computing storage 204 may be implemented with devices other than storage (e.g., any type of device that may include or have access to memory, storage media, etc.) to store an amount of data that may be processed by one or more computing resources 227. Examples may include memory expansion devices and/or buffer devices (such as CXL type 2 and/or CXL type 3 devices, as well as CXL type 1 devices that may be capable of accessing memory, storage media, and the like).
The computing device 206 may be implemented with any type of device, such as an accelerator device, a storage device (e.g., a computing storage device), a network device (e.g., a network interface card (NIC)), a CPU, a GPU, a neural processing unit (NPU), a tensor processing unit (TPU), a data processing unit (DPU), etc., or a plurality and/or combination thereof.
The computing resources 227 and/or 214 may be implemented with any component or combination of components that may perform operations on data, such as combinational logic, sequential logic, a timer, a counter, a register, a state machine, a Complex Programmable Logic Device (CPLD), a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an embedded processor, a microcontroller, a Central Processing Unit (CPU), such as a Complex Instruction Set Computer (CISC) processor (e.g., an x86 processor), a Reduced Instruction Set Computer (RISC) processor (such as an ARM processor), or the like, and/or combinations thereof.
Host 202 may be implemented with any component or combination of components, such as a computing server, a storage server, a network server, a cloud server, etc., a node (such as a storage node), a computer (such as a workstation, a personal computer, a tablet computer, a smart phone, etc.), or multiples and/or combinations thereof.
In some embodiments, host 202 may control the overall operation of system 200 shown in fig. 2. For example, a recommendation application running on a CPU at a host may implement a recommendation model that may include one or more training workloads, inference workloads, and the like. The recommendation application may offload one or more tasks, operations, etc. (e.g., first task 226 and/or second task 234 as shown in fig. 2) to one or more computing devices 206 and/or computing storage 204. In some embodiments, the host 202 and/or a recommendation application running at the host 202 may configure the interconnect fabric 224 to perform data transfers between any of the components shown in fig. 2. For example, the host 202 may configure the interconnect fabric 224 to transfer the second data 230 directly from the computing storage 204 to the computing device 206.
The interconnect fabric 224 may be implemented with one or more interconnects, one or more networks, a network of networks (e.g., the internet), etc., or a combination thereof, using any type of interface and/or protocol. For example, interconnect fabric 224 may be implemented with Peripheral Component Interconnect Express (PCIe), NVMe over Fabrics (NVMe-oF), Ethernet, Transmission Control Protocol/Internet Protocol (TCP/IP), Direct Memory Access (DMA), Remote DMA (RDMA), RDMA over Converged Ethernet (RoCE), Fibre Channel, InfiniBand, Serial ATA (SATA), Small Computer System Interface (SCSI), Serial Attached SCSI (SAS), iWARP, Compute Express Link (CXL) and/or a coherence protocol (such as CXL.mem, CXL.cache, CXL.io, etc.), Gen-Z, Open Coherent Accelerator Processor Interface (OpenCAPI), Cache Coherent Interconnect for Accelerators (CCIX), etc., Advanced eXtensible Interface (AXI), any generation wireless network (including 2G, 3G, 4G, 5G, 6G, etc.), any generation WiFi, Bluetooth, Near Field Communication (NFC), etc., or combinations thereof. In some embodiments, the interconnect fabric 224 may include one or more root complexes, switches, hubs, nodes, routers, and the like.
In some embodiments, the interconnect fabric 224 may be configured to transfer data directly between components, e.g., without involvement of the host 202, intervention, processor utilization (e.g., CPU utilization), etc. For example, in embodiments implemented at least in part with CXL, the interconnect fabric 224 may be configured to communicate the second data 230 directly from the computing storage 204 to the computing device 206, e.g., using a CXL switch, PCIe root complex, PCIe switch, PCIe peer-to-peer (P2P) communication, CXL P2P communication, or the like.
FIG. 3 illustrates a first example embodiment of a computing system including a computing storage device according to disclosed example embodiments. The system 300 shown in fig. 3 may be used, for example, to implement the system 200 shown in fig. 2. For purposes of illustrating the inventive principles, the embodiment shown in fig. 3 may be described in the context of particular implementation details (such as a host implemented with a CPU, a computing device implemented with a GPU, and a workload implemented as a recommendation model training workload), but the inventive principles are not limited to these or any other implementation details.
Referring to fig. 3, a system 300 may include one or more GPUs or other computing devices 306, one or more CPUs or other hosts 302, and/or one or more computing storage devices 304. One or more GPUs 306, CPUs 302, and/or computing storage 304 can communicate using interconnect fabric 324.
GPU 306 may include memory 316 and/or one or more computing resources 314. The CPU 302 may include memory 310 and/or one or more computing resources 308. Computing storage 304 may include a storage medium 312, one or more computing resources 327, and/or a controller 342. The controller 342 may control one or more operations of the computing storage 304. In some embodiments, the controller 342 may be implemented at least in part with a media translation layer, such as a flash translation layer (FTL) in embodiments in which at least a portion of the storage medium 312 is implemented with flash memory.
In some embodiments, computing storage 304 may include memory 338 and/or memory manager 340 that may control one or more operations of memory 338. For example, memory manager 340 may control one or more accesses to memory 338 by one or more computing resources 327.
In some embodiments, CPU 302 may include allocation logic 336, where allocation logic 336 may at least partially control the allocation, scheduling, order, timing, etc. of one or more tasks, operations, etc. of one or more workloads executed by system 300.
Although the system 300 shown in fig. 3 is not limited to any particular type of workload, embodiments configured for recommendation model training workloads may operate as follows. The workload may include one or more tasks (such as the tasks shown in table 1).
The workload may include a first task (1) that may be performed, for example, by one or more computing resources 327 of computing storage 304. The first task (1) may include a lookup operation (1 a) in which one or more computing resources 327 may read data (e.g., categorical input data) from one or more embedded tables 318 stored in the storage medium 312 of the storage device 304 as indicated by arrow 329.
Additionally or alternatively, the first task (1) may include an SLS computation (1 b), in which the one or more computing resources 327 may use data obtained, for example, by the lookup operation (1 a) from one or more embedded tables 318 stored in the storage medium 312 of the storage device 304. In some embodiments, the SLS computation may involve summing data read from the one or more embedded tables 318. The SLS operation may generate output data (e.g., one or more SLS outputs) 350, and the output data 350 may be stored in the memory 338 of the computing storage 304, for example, as indicated by arrow 343.
Depending on implementation details, any of performing the first task (1), the lookup operation (1 a), and/or the SLS computation (1 b) at the computing storage 304 may reduce overhead, for example, because the computing storage 304 where input data for the task and/or operation is stored may have an "internal data path" (e.g., between the storage medium 312 and the computing resource 327) that may have a relatively high bandwidth.
In embodiments where computing storage 304 includes memory 338, all or a portion of memory 338 may be configured to operate as a cache 348 for the storage medium 312. For example, most or all of the embedded tables used by the first task (1) may be stored in the storage medium 312 of the storage device 304, which may have a relatively large storage capacity. Some of the embedded tables, or portions thereof (e.g., more frequently accessed data, which may be referred to as hot data), may be cached in the cache 348 of the memory 338. In some embodiments, for example, in response to a request to read data that is stored in the one or more embedded tables 318 in the storage medium 312 but is not present in the cache 348 (e.g., a cache miss), the data may be transferred (e.g., copied) to the memory 338 of the computing storage device 304 as indicated by arrow 346. Similarly, in some embodiments, data may be transferred (e.g., copied) from memory 338 to storage medium 312 as indicated by arrow 356, e.g., based on write-back and/or write-through operations of the cache 348. In some embodiments, transferring data between the storage medium 312 and the cache 348 may reduce overhead, for example, by utilizing an internal data path (e.g., between the storage medium 312 and the memory 338) that may have a relatively high bandwidth.
In embodiments in which the computing storage 304 includes memory 338, one or more of the first task (1), the lookup operation (1 a), and/or the SLS computation (1 b) may access data stored in the cache 348 (e.g., the embedded table data 344) as indicated by arrow 345. Depending on implementation details, this may reduce overhead, for example, because accessing data from memory 338 may be faster (e.g., with lower latency) than accessing data from storage medium 312. In some embodiments, accessing data from cache 348 may be faster (e.g., with lower latency) than accessing data from storage medium 312, for example, because memory 338 may have lower latency, and/or because it may utilize an internal data path (e.g., between memory 338 and one or more computing resources 327) that may have a relatively high bandwidth. Depending on implementation details, in addition to any resulting overhead reduction in performing the first task (1), the lookup operation (1 a), and/or the SLS computation (1 b) at the computing storage 304, there may be overhead reduction resulting from accessing data in the cache 348.
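For illustration only, the following minimal Python sketch shows one way a cache such as cache 348 could hold hot embedding rows using a least-recently-used policy with write-back on eviction; the media accessor callbacks and the eviction policy are assumptions made for the example, not a description of any particular device implementation.

    from collections import OrderedDict

    class HotRowCache:
        """Caches hot embedding rows in memory, backed by a slower storage medium."""
        def __init__(self, capacity, read_row_from_media, write_row_to_media):
            self.capacity = capacity
            self.read_row = read_row_from_media    # placeholder callback for a media read
            self.write_row = write_row_to_media    # placeholder callback for a media write
            self.rows = OrderedDict()              # row_id -> (vector, dirty flag)

        def get(self, row_id):
            if row_id not in self.rows:            # cache miss: fetch over the internal path
                self._insert(row_id, self.read_row(row_id), dirty=False)
            self.rows.move_to_end(row_id)          # mark as most recently used
            return self.rows[row_id][0]

        def put(self, row_id, vector):             # e.g., an embedded table update (task (5))
            self._insert(row_id, vector, dirty=True)

        def _insert(self, row_id, vector, dirty):
            self.rows[row_id] = (vector, dirty)
            self.rows.move_to_end(row_id)
            if len(self.rows) > self.capacity:     # evict the least recently used row
                old_id, (old_vec, old_dirty) = self.rows.popitem(last=False)
                if old_dirty:                      # write-back: flush modified rows to the media
                    self.write_row(old_id, old_vec)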
The workload may include a second task (2) that may be performed, for example, by one or more computing resources 314 of GPU 306. The second task (2) may include one or more bottom multi-layer perceptron (MLP) operations, which may operate, for example, using data (e.g., relatively dense features, continuous inputs, etc.) stored in the memory 316.
The workload may include a third task (3) that may be performed, for example, by one or more computing resources 314 of GPU 306. In some embodiments, the third task (3) may include one or more interactions (3 a). One or more outputs from the bottom MLP operation may be used as one or more inputs to the interoperation (3 a). Additionally or alternatively, SLS output data 350 from the SLS operation may be used as one or more inputs to the interoperation (3 a).
In some embodiments, SLS output data 350 from the SLS operations may be transferred (e.g., directly) from the storage medium 312 of the computing storage 304 and/or from the memory 338 to one or more computing resources 314 of the GPU 306 using the interconnect fabric 324 as indicated by arrow 352. For example, in some embodiments, interconnect fabric 324 may be configured to communicate SLS output data 350 directly from computing storage 304 to GPU 306, e.g., using CXL switches, PCIe root complexes, PCIe switches, PCIe peer-to-peer (P2P) communications, CXL P2P communications, and the like.
Depending on implementation details, transferring data (e.g., directly) from computing storage 304 to GPU 306 using interconnect fabric 324 may reduce overhead, such as by reducing or eliminating CPU utilization and/or GPU utilization involved in copying data from the CPU to the GPU (e.g., CPU utilization and/or GPU utilization associated with transferring SLS output data 122 from CPU 102 to GPU 106 as shown in fig. 1).
Additionally or alternatively, the third task (3) may include one or more top MLP operations (3 b). In some embodiments, one or more outputs from the interoperation (3 a) may be used as one or more inputs to the top MLP operation.
The workload may include a fourth task (4) that may be performed, for example, by one or more computing resources 314 of GPU 306. The fourth task (4) may include one or more update operations for one or more top MLPs (e.g., using reverse pass information to adjust one or more parameters, weights, biases, etc. of the top MLPs).
The workload may include a fifth task (5) that may be performed, for example, by one or more computing resources 327 of computing storage 304. The fifth task (5) may include one or more embedded table update operations. The embedded table update operations may include one or more gradient calculations, which may use output data 350 from one or more SLS operations and/or data from one or more embedded tables 318 and/or 344 as inputs. The embedded table update operation may include one or more write operations in which one or more outputs from one or more gradient calculations may be written to one or more embedded tables 318 in storage medium 312 as indicated by arrow 354 and/or to one or more embedded tables 344 in cache 348 of memory 338 as indicated by arrow 347.
Depending on implementation details, one or more write operations, as shown by arrows 354 and/or 347, may reduce overhead associated with the write operations, for example, by utilizing one or more internal data paths (e.g., between computing resources 327 and storage medium 312, as shown by arrow 354, and/or between computing resources 327 and memory 338, as shown by arrow 347) that may have a relatively high bandwidth.
The workload may include a sixth task (6) that may be performed, for example, by one or more computing resources 314 of GPU 306. The sixth task (6) may include one or more update operations for one or more bottom MLPs (e.g., using reverse pass information to adjust one or more parameters, weights, biases, etc. of the bottom MLPs).
Table 2 shows a second embodiment of a recommendation model training workload in accordance with the disclosed example embodiments. The embodiment shown in table 2 may include one or more tasks that may be similar to one or more tasks shown in table 1. However, in the embodiment shown in table 2, the fifth task (5) may include a sparse adjustment operation (5 b) in addition to one or more of the gradient calculation operation (5 a) and/or the embedded table write (5 c).
In some embodiments, the sparse adjustment operation (5 b) may adjust (e.g., optimize) one or more updates of one or more embedded tables. For example, in some embodiments, the sparse adjustment operation (5 b) may involve sorting row indices, accumulating and/or merging gradient updates (e.g., accumulating and/or merging updates for the same row into one update), applying the accumulated gradients, and so on. Depending on implementation details, this may provide certainty and/or accuracy (e.g., with low performance overhead).
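For illustration only, the following minimal Python sketch shows one way the sparse adjustment could be performed with a plain stochastic gradient descent update: row indices are sorted, gradients that touch the same row are accumulated into a single update, and the accumulated updates are applied; the shapes, the learning rate, and the choice of optimizer are assumptions made for the example.

    import numpy as np

    def sparse_adjust_and_update(table, row_indices, row_grads, lr=0.01):
        """table: (rows, dim); row_indices: (n,); row_grads: (n, dim)."""
        order = np.argsort(row_indices, kind="stable")  # sort the touched row indices
        idx, grads = row_indices[order], row_grads[order]
        unique_rows, first = np.unique(idx, return_index=True)
        merged = np.add.reduceat(grads, first, axis=0)  # accumulate/merge duplicates into one update per row
        table[unique_rows] -= lr * merged               # apply the accumulated gradients
        return unique_rows, merged

    # Example: rows 1 and 6 appear in both samples' index lists, so each receives a single merged update.
    table = np.zeros((7, 4), dtype=np.float32)
    rows  = np.array([1, 2, 6, 1, 3, 6])
    grads = np.ones((6, 4), dtype=np.float32)
    sparse_adjust_and_update(table, rows, grads)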
In some embodiments, any or all of task (1), task (5), operations (1 a), (1 b), (5 a), (5 b), and/or (5 c) may be performed by a computing storage device.
TABLE 2
FIG. 4 illustrates an embodiment of a portion of a recommendation model training workload in accordance with a disclosed example embodiment. The embodiment shown in fig. 4 may be used, for example, to implement at least a portion of the workload shown in table 2. For example, the embodiment shown in FIG. 4 may be used to implement some or all of tasks (1) and/or tasks (5) of Table 2.
Referring to fig. 4, the region above the dashed line 458 may be generally considered a tensor element, while the region below the dashed line 458 may be generally considered a gradient element. However, there may be overlap between regions, both conceptually and in terms of implementation details.
One or more categorical inputs (e.g., Sample 1, which may include input values 1, 2, and/or 6, and/or Sample 2, which may include input values 1, 3, and/or 6) may be applied to one or more vectors (e.g., Row 1, ..., Row 6) of embedded table 418. The lookup operation (which may correspond to task (1 a) in table 2) may read (e.g., from one or more rows of embedded table 418) one or more values that may be applied to one or more pooling operators 460. The one or more pooling operators 460 may implement, for example, one or more SLS operations (which may correspond to task (1 b) in table 2) to generate one or more output tensors (e.g., for Sample 1 and/or Sample 2).
In some embodiments, one or more gradient calculations and/or embedding gradient operations (which may correspond to task (5 a) in table 2) may be performed on the one or more output tensors (e.g., for Sample 1 and/or Sample 2) to generate one or more embedding gradients 464 (e.g., an embedding gradient for Sample 1 and/or an embedding gradient for Sample 2). The sparse adjustment operation 466 (which may correspond to task (5 b) in table 2) may be performed using the one or more embedding gradients 464 to generate update information 418a for one or more rows of the embedding table 418. In some embodiments, the sparse adjustment operation may be implemented with a sparse optimization operation. An embedded table write operation (which may correspond to task (5 c) in table 2) may be performed to write the update information 418a to the embedded table 418.
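As a worked illustration of the example of FIG. 4, under the assumption of sum pooling, the gradient of the loss with respect to each embedding row is the sum of the output-tensor gradients of the samples that read that row; this is why rows 1 and 6, which are read by both samples, receive accumulated (merged) updates:

    % Sum pooling over index set S_k for sample k, with S_1 = {1, 2, 6} and S_2 = {1, 3, 6} as in FIG. 4.
    \begin{aligned}
    s_k &= \sum_{i \in S_k} e_i, \qquad g_k = \frac{\partial L}{\partial s_k},\\
    \frac{\partial L}{\partial e_i} &= \sum_{k \,:\, i \in S_k} g_k,
    \quad\text{so}\quad
    \frac{\partial L}{\partial e_1} = \frac{\partial L}{\partial e_6} = g_1 + g_2,\qquad
    \frac{\partial L}{\partial e_2} = g_1,\qquad
    \frac{\partial L}{\partial e_3} = g_2.
    \end{aligned}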
FIG. 5 illustrates a second example embodiment of a computing system including a computing storage device according to an example embodiment of the disclosure. The system 500 shown in fig. 5 may be used, for example, to implement some or all of the system 200 shown in fig. 2, the system 300 shown in fig. 3, the workloads shown in table 1, table 2, and/or fig. 4, etc. For example, the computing storage 504 shown in fig. 5 may be used to perform one or more operations shown in fig. 4.
For purposes of illustrating the inventive principles, the embodiment shown in fig. 5 may be described in the context of particular implementation details (such as a host implemented with a CPU, a computing device implemented with a GPU, and a workload implemented as a recommendation model training workload), but the inventive principles are not limited to these or any other implementation details.
The system 500 shown in fig. 5 may include one or more components and/or operations that may be the same or similar to the components and/or operations shown in fig. 2 and/or 3 and may be indicated by reference numerals ending with the same numerals. However, in the embodiment shown in FIG. 5, CPU 502 may use one or more lookup inputs (e.g., embedded table indexes) 568 stored in memory 510 to determine one or more embedded vectors (e.g., rows of one or more embedded tables 518 and/or 544) to be accessed for lookup operation (1 a). One or more lookup inputs may be transferred from CPU 502 (e.g., directly) to computing storage 504 as indicated by arrow 531, for example, using interconnect fabric 524.
Also in system 500, one or more computing resources 527 of computing storage 504 may perform one or more gradient computing operations (5 a), e.g., as shown in fig. 4, to generate SLS output gradients 551 and/or table gradients 564 that may be stored in memory 538 as indicated by arrow 570. In some embodiments, the one or more computing resources 527 may perform one or more sparseness adjustment operations (5 b), for example, using one or more sparseness optimizers. In some embodiments, one or more computing resources 527 may perform one or more embedded table update operations (5 c), as indicated by arrows 554 and/or 547, for example, by writing update information to one or more embedded tables 544 and/or 518.
Fig. 6 illustrates an embodiment of an allocation scheme according to a disclosed example embodiment. The embodiment shown in fig. 6 may be implemented, for example, by allocation logic 336 shown in fig. 3 and/or allocation logic 536 shown in fig. 5.
Referring to fig. 6, the allocation logic 636 may receive a task 672 of a workload to be performed by a system that may include one or more computing devices, such as one or more GPUs 606, one or more CPUs 602, and/or one or more computing storage devices 604.
One or more computing devices 606, 602, and/or 604 may have one or more characteristics, such as memory and/or storage capacity, processing capacity (e.g., throughput, bandwidth, etc.), persistence characteristics (e.g., non-volatile and/or persistent memory and/or storage devices), and/or the like. In some embodiments, capacity may refer to available capacity (e.g., a portion of total capacity that may be unused and/or unallocated).
One or more computing devices 606, 602, and/or 604 may have one or more states, such as a utilization level (e.g., a percentage of processing capacity being used).
Task 672 may have one or more characteristics such as the amount of data associated with the task (e.g., the amount of data that may be stored by the computing device), a latency specification, a persistence specification, etc.
In some embodiments, assignment logic 636 may assign tasks to one or more of computing devices 606, 602, and/or 604 based on one or more characteristics of computing devices 606, 602, and/or 604 and/or one or more characteristics of task 672. In some embodiments, assigning tasks may refer to assigning tasks and/or determining one or more of the scheduling, order, timing, etc. of one or more tasks.
In some embodiments, assignment logic 636 may select one or more candidate computing devices for task 672. For example, task 672 may involve (e.g., require) 100 units (e.g., bytes (B), KB, MB, GB, TB, PB, etc.) of memory and/or storage to perform the task, and computing devices 606, 602, and 604 may have 50 units, 100 units, and 1000 units of available memory and/or storage capacity, respectively. Accordingly, the allocation logic 636 may select the CPU 602 and the computing storage 604 as candidates because the CPU 602 and the computing storage 604 may have sufficient memory and/or storage capacity to accommodate the data size of the task 672.
Additionally or alternatively, after selecting the two candidate computing devices 602 and 604, the assignment logic 636 may select one of the two candidate devices, e.g., based on the latency specification of task 672. For example, task 672 may have a latency specification of 0.05 units (e.g., seconds (s), ms, µs, ns, ps, etc.), and candidate computing devices 602 and 604 may have computational throughputs of 50 and 10, respectively. In some embodiments, computational throughput may be inversely related to latency, such that throughputs of 50 and 10 may correspond to latencies of 0.02 and 0.10, respectively. Thus, because the latency of 0.02 of the CPU 602 may be less than the latency specification of 0.05 of task 672, while the latency of 0.10 of the computing storage 604 may be greater than the latency specification of 0.05 of task 672, the allocation logic 636 may select the CPU 602 for task 672 (e.g., allocate task 672 to CPU 602, and in some implementations schedule task 672 for CPU 602).
Additionally or alternatively, after initially selecting the CPU 602 for task 672, the allocation logic 636 may modify the selection based on, for example, the utilization level of the CPU 602. For example, one or more computing resources within the CPU 602 may have a current utilization level of 99% (e.g., may be 99% busy with other tasks), while one or more computing resources within the computing storage 604 may have a utilization level of 5%. If task 672 were assigned to the CPU 602, task 672 might not perform acceptably because only one percent of the computing resources within the CPU 602 would be available to it. Accordingly, the allocation logic 636 may modify the selection to assign task 672 to the computing storage 604.
Additionally or alternatively, the allocation logic 636 may select, or modify the selection of, computing devices 606, 602, and/or 604 based on a persistence specification of task 672. For example, task 672 may have a data size of 10 units, a latency specification of 10 units, and a specification that data associated with the task be stored in persistent memory and/or storage. The computing devices 606, 602, and 604 may have available memory and/or storage capacities of 50 units, 100 units, and 1000 units, respectively, and latency characteristics of 2 units, 5 units, and 100 units, respectively. Furthermore, the available memory capacity of computing devices 606 and 602 may include only DRAM, while the available memory capacity of computing storage device 604 may include more than 10 units of non-volatile memory. Thus, even though any of computing devices 606, 602, and 604 may have sufficient data capacity and/or processing throughput to accommodate task 672, the allocation logic 636 may assign task 672 to the computing storage 604 (e.g., select the computing storage 604) because the computing storage 604 has sufficient persistent memory and/or storage to persistently store the data associated with task 672.
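The candidate-selection walk-through above may be summarized, under assumed data structures, by the following sketch. The Device class, its field names, and the 90% utilization threshold are illustrative assumptions; only the example quantities mirror the description.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    capacity: float             # available memory/storage units
    throughput: float           # computational throughput; latency ~ 1/throughput
    utilization: float          # fraction of computing resources currently busy
    persistent_capacity: float  # units of persistent memory/storage available

gpu = Device("GPU", 50, 100, 0.10, 0)
cpu = Device("CPU", 100, 50, 0.99, 0)
csd = Device("CSD", 1000, 10, 0.05, 20)

task_data, task_latency = 100, 0.05

# Capacity check: only the CPU and the computational storage device (CSD)
# can hold the task's 100 units of data.
candidates = [d for d in (gpu, cpu, csd) if d.capacity >= task_data]

# Latency check: the CPU's latency (1/50 = 0.02) meets the 0.05 specification,
# the CSD's (1/10 = 0.10) does not, so the CPU is selected initially.
meets_latency = [d for d in candidates if 1.0 / d.throughput <= task_latency]
selected = meets_latency[0]

# Utilization check: the CPU is 99% busy, so the selection is modified to the
# least-utilized capacity candidate, here the CSD.
UTILIZATION_THRESHOLD = 0.90
if selected.utilization > UTILIZATION_THRESHOLD:
    selected = min(candidates, key=lambda d: d.utilization)

print(selected.name)  # CSD
```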
Although the embodiments shown in fig. 3 and/or 5 may show allocation logic located at a host (e.g., a CPU), in other embodiments, allocation logic 636 may be located at a computing device, a computing storage device, and/or any other location. Further, in some embodiments, the allocation logic 636 may be distributed across multiple locations.
FIG. 7 illustrates an embodiment of a method for assigning tasks to computing devices according to a disclosed example embodiment. The embodiment shown in fig. 7 may be implemented with or for implementing any of the embodiments disclosed herein that include the allocation logic shown in fig. 3, 5, and/or 6.
Referring to fig. 7, the method may begin at operation 770, where the allocation logic may receive a task having one or more of a data amount, a latency specification, and/or a persistence specification. At operation 772, the allocation logic may select one or more candidate computing devices based on the amount of data used by the task and the available memory and/or storage capacity of the one or more candidate computing devices. For example, the allocation logic may select, as candidate devices, one or more computing devices having sufficient memory and/or storage capacity to accommodate the amount of data used by the task. If no computing device has sufficient memory and/or storage capacity to accommodate the amount of data used by the task, the method may terminate with an error. If only one computing device has sufficient memory and/or storage capacity to accommodate the amount of data used by the task, the allocation logic may allocate the task to that one computing device and terminate the method.
At operation 774, the allocation logic may select a computing device of the candidate computing devices based on the latency specification of the task and the computational throughput of one or more of the candidate computing devices. For example, the allocation logic may select a candidate computing device that may have the highest computational throughput, provided that the highest throughput is sufficient to meet the latency specification of the task. Alternatively, the allocation logic may select the candidate computing device that may have the lowest computational throughput that is still sufficient to meet the latency specification of the task. The method may terminate in error if none of the candidate computing devices has a computing throughput sufficient to meet the latency specification of the task.
At operation 776, the allocation logic may determine whether the initially selected candidate computing device has a utilization (e.g., percent utilization) that may exceed a threshold. If the initially selected candidate computing device has a utilization that exceeds the threshold, the allocation logic may modify the selection by selecting candidate computing devices that may have a utilization that does not exceed the threshold. If none of the candidate computing devices has a utilization that does not exceed the threshold, the method may terminate in error.
At operation 778, the allocation logic may modify the selection of candidate computing devices based on the persistence specification of the task and the persistence characteristics of the initially selected candidate computing device. For example, if the task has a persistence specification for data used by the task and the initially selected candidate computing device does not have sufficient persistent memory and/or storage capacity for that data, the allocation logic may modify the selection by selecting a candidate computing device that may have sufficient persistent memory and/or storage capacity to persistently store the data used by the task.
At operation 780, the allocation logic may assign the task to the selected candidate computing device. In some embodiments, the allocation logic may also determine one or more of a schedule, an order, a timing, etc., for the allocated task.
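Operations 770-780 may be expressed as a single allocation routine, such as the following sketch built on assumed task and device dictionaries. The field names, the AllocationError exception, and the fallback rules at operations 776 and 778 reflect one possible reading of the method and are not definitive.

```python
class AllocationError(Exception):
    pass

def allocate_task(task, devices, utilization_threshold=0.9):
    """Sketch of operations 770-780. `task` and each device are plain dicts,
    e.g. {"data": 100, "latency": 0.05, "persistent": False} and
    {"name": "CSD", "capacity": 1000, "throughput": 10,
     "utilization": 0.05, "persistent_capacity": 20}."""
    # Operation 772: candidates with enough available memory/storage.
    candidates = [d for d in devices if d["capacity"] >= task["data"]]
    if not candidates:
        raise AllocationError("no device can hold the task's data")
    if len(candidates) == 1:
        return candidates[0]

    # Operation 774: among candidates meeting the latency specification
    # (latency taken as 1/throughput), pick the highest-throughput one.
    fast_enough = [d for d in candidates
                   if 1.0 / d["throughput"] <= task["latency"]]
    if not fast_enough:
        raise AllocationError("no candidate meets the latency specification")
    selected = max(fast_enough, key=lambda d: d["throughput"])

    # Operation 776: if the initial pick is over-utilized, fall back to a
    # capacity candidate below the threshold (one possible reading).
    if selected["utilization"] > utilization_threshold:
        not_busy = [d for d in candidates
                    if d["utilization"] <= utilization_threshold]
        if not not_busy:
            raise AllocationError("all candidates exceed the utilization threshold")
        selected = min(not_busy, key=lambda d: d["utilization"])

    # Operation 778: honor a persistence specification, if present.
    if task.get("persistent") and selected["persistent_capacity"] < task["data"]:
        persistent = [d for d in candidates
                      if d["persistent_capacity"] >= task["data"]]
        if persistent:
            selected = persistent[0]

    # Operation 780: the caller assigns (and may schedule) the task here.
    return selected
```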
Table 3 shows an embodiment of a computing storage memory space according to a disclosed example embodiment. The embodiment shown in Table 3 may be implemented, for example, using a coherent interconnect, protocol, or the like (such as a CXL memory space).

Table 3
In some embodiments, the memory space scheme shown in Table 3 may be used in conjunction with one or more interconnect and/or protocol bias modes. For example, in embodiments implemented with CXL, if data is only or primarily accessed by a computing storage device, data such as SLS output gradients, table gradients, etc., may be stored in a private memory space and/or accessed in a device bias mode. Depending on implementation details, this may improve performance, for example, because it may enable the computing storage device to access the data without checking one or more other memory spaces (e.g., caches).
As another example, in an embodiment implemented with CXL, if shared data is readable by more than one device (e.g., shared SLS output readable by a GPU and a computing storage device), the shared data may be stored in a shared memory space (e.g., in the computing storage device) and/or accessed in a host-biased mode.
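One possible mapping implied by the two examples above may be sketched as follows. The mapping table, item names, and helper function are illustrative assumptions and are not a reconstruction of Table 3.

```python
# Illustrative only: one possible placement policy for data items discussed
# above, keyed by (memory space, CXL bias mode). Names are assumptions.
MEMORY_SPACE_MAP = {
    "sls_output_gradients": ("device-private", "device-bias"),
    "table_gradients":      ("device-private", "device-bias"),
    "shared_sls_outputs":   ("shared",         "host-bias"),
}

def placement_for(data_item):
    space, bias = MEMORY_SPACE_MAP[data_item]
    return {"memory_space": space, "bias_mode": bias}

print(placement_for("shared_sls_outputs"))
# {'memory_space': 'shared', 'bias_mode': 'host-bias'}
```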
Fig. 8 illustrates a first example embodiment of an interconnect structure in accordance with the disclosed example embodiments. The embodiment shown in fig. 8 may be used, for example, to implement any of the interconnect structures disclosed herein. For purposes of illustration, the embodiment shown in fig. 8 may be described in the context of one or more devices that may use the PCIe physical layer and/or one or more CXL protocols. However, the inventive principles may be implemented with any other interconnect, interface, protocol, etc., and are not limited to PCIe and/or CXL implementations.
The embodiment shown in fig. 8 may include a host 802 (which may be implemented, for example, with a CPU) having a root complex (e.g., PCIe root complex) 882. The first computing device 804 may be configured as a first endpoint and connected to the root complex 882, e.g., using one or more PCIe lanes 884. The second computing device 806 may be configured as a second endpoint and connect to the root complex 882, e.g., using one or more PCIe lanes 886. In some embodiments, data may be transferred from the first computing device 804 to the second computing device 806 as indicated by arrow 888, e.g., directly in a manner that may involve little or no utilization of a CPU at the host 802. For example, data transfer indicated by arrow 888 can be implemented using PCIe peer-to-peer (P2P) features, CXL direct memory access features (e.g., P2P direct memory access features), and so forth.
The embodiment shown in fig. 8 may be used, for example, to enable transfer of SLS output data from a computing storage device to a GPU as shown by arrow 352 in fig. 3 and/or arrow 552 in fig. 5.
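The data path of fig. 8 may be sketched, at the level of stubbed and hypothetical functions, as follows. None of the function names correspond to a real PCIe or CXL driver interface; they exist only to make the peer-to-peer versus host-staged control flow concrete.

```python
def p2p_dma_supported(src, dst):
    # Stub: a real system would query root-complex / switch capabilities.
    return True

def p2p_dma_copy(src, dst, buf):
    # Stub: stands in for a peer-to-peer DMA transfer (arrow 888).
    print(f"P2P DMA {len(buf)} bytes: {src} -> {dst}")

def staged_copy(src, dst, buf):
    # Stub: stands in for a copy staged through host memory.
    print(f"staged copy {len(buf)} bytes: {src} -> {dst}")

def transfer_sls_output(storage_dev, gpu_dev, buf):
    """Move an SLS output buffer from a computational storage endpoint to a
    GPU endpoint, preferring the direct peer-to-peer path."""
    if p2p_dma_supported(storage_dev, gpu_dev):
        p2p_dma_copy(storage_dev, gpu_dev, buf)        # little or no host CPU use
    else:
        staged_copy(storage_dev, "host_memory", buf)   # bounce through the host
        staged_copy("host_memory", gpu_dev, buf)

transfer_sls_output("csd_endpoint", "gpu_endpoint", bytearray(4096))
```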
Fig. 9 illustrates a second example embodiment of an interconnect structure in accordance with the disclosed example embodiments. The embodiment shown in fig. 9 may be used, for example, to implement any of the interconnect structures disclosed herein. For purposes of illustration, the embodiment shown in fig. 9 may be described in the context of one or more devices that may use the PCIe physical layer and/or one or more CXL protocols. However, the inventive principles may be implemented with any other interconnect, interface, protocol, etc., and are not limited to PCIe and/or CXL implementations.
The embodiment shown in fig. 9 may include a host 902 (which may be implemented, for example, with a CPU) having a root complex (e.g., PCIe root complex) 982 and a switch 990 (e.g., PCIe switch). Switch 990 may connect to root complex 982 using one or more PCIe lanes 992. The first computing device 904 may be configured as a first endpoint and connected to the switch 990, for example, using one or more PCIe lanes 984. The second computing device 906 may be configured as a second endpoint and connected to the switch 990, for example, using one or more PCIe lanes 986. In some embodiments, data may be transferred from the first computing device 904 to the second computing device 906 as indicated by arrow 988, for example, directly in a manner that may involve little or no utilization of a CPU at the host 902. For example, data transfer indicated by arrow 988 may be implemented using PCIe peer-to-peer (P2P) features, CXL direct memory access features (e.g., P2P direct memory access features), and so forth.
The embodiment shown in fig. 9 may be used, for example, to enable transfer of SLS output data from a computing storage device to a GPU as shown by arrow 352 in fig. 3 and/or arrow 552 in fig. 5.
Fig. 10 illustrates an example embodiment of a host device according to an example embodiment of the disclosure. The host device shown in fig. 10 may be used, for example, to implement any of the hosts disclosed herein. The host device 1000 shown in fig. 10 may include a processor 1002, a system memory 1006, one or more computing resources 1008, and/or a communication interface 1010, and the processor 1002 may include a memory controller 1004. Any or all of the components shown in fig. 10 may communicate over one or more system buses 1012. In some embodiments, one or more of the components shown in fig. 10 may be implemented using other components. In some embodiments, one or more computing resources 1008 may implement any computing resources disclosed herein including, for example, any computing resource 508 shown in fig. 5 and/or any computing resource for implementing CPU 602 shown in fig. 6.
FIG. 11 illustrates an example embodiment of a computing device according to an example embodiment of the disclosure. The embodiment 1100 shown in fig. 11 may be used, for example, to implement any of the computing devices disclosed herein. Computing device 1100 can include a device controller 1102, one or more computing resources 1108, device functional circuitry 1106, and a communication interface 1110. The components shown in fig. 11 may communicate via one or more device buses 1112.
The device function circuitry 1106 may comprise any hardware for implementing the primary functions of the device 1100. For example, if the device 1100 is implemented as a storage device, the device functional circuitry 1106 may include a storage medium (such as one or more flash memory devices, FTLs, etc.). As another example, if the device 1100 is implemented as a Network Interface Card (NIC), the device functional circuitry 1106 may include one or more modems, network interfaces, physical layers (PHYs), medium access control layers (MACs), and the like. As another example, if the device 1100 is implemented as an accelerator, the device functional circuitry 1106 may include one or more accelerator circuits, memory circuits, and the like.
Any of the functions described herein, including host functions, device functions, etc. (e.g., allocation logic 336, 536, and/or 636), as well as any of the functions described with respect to the embodiments shown in fig. 1-11, may be implemented with hardware, software, firmware, or any combination thereof, including, for example, hardware and/or software combinational logic, sequential logic, timers, counters, registers, state machines, volatile memory (such as DRAM and/or SRAM), non-volatile memory (including flash memory), persistent memory (such as cross-gridded non-volatile memory, memory with changes in bulk resistance, PCM, etc.), and/or any combination thereof, complex programmable logic devices (CPLDs), FPGAs, ASICs, CPUs (including CISC processors such as x86 processors and/or RISC processors such as ARM processors), GPUs, NPUs, TPUs, etc., executing instructions stored in any type of memory. In some embodiments, one or more components may be implemented as a system on a chip (SOC).
Fig. 12 illustrates an embodiment of a method for operating a computing device, according to a disclosed example embodiment. The method may begin at operation 1202. At operation 1204, the method may perform, at a computing storage device, a first computing task of a workload using first data stored at the computing storage device, wherein performing the first computing task of the workload includes generating second data. For example, in some embodiments, the workload may be implemented as a recommendation model training workload, and the first task may include performing SLS operations on data stored in one or more embedded tables stored at the computing storage device to generate one or more SLS outputs, as shown in fig. 3 and/or 5.
At operation 1206, the method may transfer the second data from the computing storage device to the computing device using the interconnect fabric. For example, one or more SLS outputs may be transmitted to one or more computing resources of a GPU (such as one or more computing resources of the GPU shown in fig. 3 and/or 5). At operation 1208, the method may perform a second computing task of the workload using the second data at the computing device. For example, one or more computing resources of the GPU may be used to perform interactions using one or more SLS outputs as shown in fig. 3 and/or 5. The method may end at operation 1210.
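The method of fig. 12 may be sketched end to end as follows, with the computing storage device and the GPU modeled as plain Python objects. The shapes, names, and the pairwise dot-product "interaction" step are illustrative assumptions rather than a definitive implementation.

```python
import numpy as np

class ComputationalStorage:
    def __init__(self, table):
        self.table = table                     # first data: an embedded table

    def sls(self, indices, lengths):
        # Operation 1204: first computing task, producing the second data.
        outputs, offset = [], 0
        for length in lengths:
            outputs.append(self.table[indices[offset:offset + length]].sum(axis=0))
            offset += length
        return np.stack(outputs)

class Gpu:
    def interact(self, sls_outputs):
        # Operation 1208: second computing task, using the second data.
        return sls_outputs @ sls_outputs.T     # pairwise feature interactions

csd = ComputationalStorage(np.random.rand(8, 4).astype(np.float32))
gpu = Gpu()

second_data = csd.sls(np.array([0, 3, 5, 7]), np.array([2, 2]))   # 1204
# Operation 1206: the transfer over the interconnect fabric is implicit here.
result = gpu.interact(second_data)                                  # 1208
```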
FIG. 13 illustrates an embodiment of a method for assigning tasks to computing devices according to a disclosed example embodiment. The method may begin at operation 1302. At operation 1304, the method may determine a memory capacity of a first computing device connected to an interconnect fabric, wherein the interconnect fabric is connected to a second computing device. For example, the allocation logic may determine the memory capacity of the GPU, CPU, and/or computing storage as shown in fig. 6.
At operation 1306, the method may select the first computing device based on a memory capacity of the first computing device and a size of first data for a workload, wherein the workload includes a first computing task and a second computing task, and the first computing task generates second data for the second computing task using at least a portion of the first data. For example, the allocation logic may allocate task 672 to a GPU, CPU, and/or computing storage as shown in fig. 6, wherein the workload may include first task 226 and second task 234 as shown in fig. 2.
At operation 1308, the method may transmit at least a portion of the first data to the first computing device. For example, the data 230 may be transferred from the computing storage 204 to the computing device 206 as shown in fig. 2. At operation 1310, the method may perform, by the first computing device, the first computing task of the workload based on the selection. For example, one or more of the GPU, CPU, and/or computing storage shown in fig. 6 may perform the task 672 allocated by the allocation logic 636 as shown in fig. 6. The method may end at operation 1312.
The embodiments shown in fig. 12 and 13, as well as all other embodiments described herein, are example operations and/or components. In some embodiments, some operations and/or components may be omitted, and/or other operations and/or components may be included. Furthermore, in some embodiments, the temporal and/or spatial order of operations and/or components may vary. Although some components and/or operations may be shown as separate components, in some embodiments, some components and/or operations shown separately may be integrated into a single component and/or operation and/or some components and/or operations shown as a single component and/or operation may be implemented with multiple components and/or operations.
Some embodiments disclosed above have been described in the context of various implementation details, but the principles disclosed are not limited to these or any other specific details. For example, some functions have been described as being implemented by a particular component, but in other embodiments, functions may be distributed among different systems and components at different locations and with various user interfaces. Particular embodiments have been described as having particular processes, operations, etc., but these terms also encompass embodiments in which a particular process, operation, etc., may be implemented with multiple processes, operations, etc., or embodiments in which multiple processes, operations, etc., may be integrated into a single process, step, etc. References to a component or element may refer to only a portion of the component or element. For example, a reference to a block may refer to an entire block or one or more sub-blocks. References to components or elements may refer to one or more components and elements, and references to multiple components or elements may refer to a single component or element. For example, a reference to a resource may refer to one or more resources, and a reference to a resource may refer to a single resource. Unless otherwise clear from the context, terms such as "first" and "second" are used in the present disclosure and claims for the purpose of distinguishing between elements that they modify and may not indicate any spatial or temporal order. In some embodiments, a reference to an element may refer to at least a portion of the element, e.g., "based on" may refer to "based at least in part on" and the like. The reference to the first element may not indicate the presence of the second element. The principles disclosed herein have independent utility and may be embodied separately and not every embodiment may utilize every principle. However, the principles may also be embodied in various combinations, some of which may amplify the benefits of the various principles in a synergistic manner. The various details and embodiments described above may be combined to produce additional embodiments according to the inventive principles of this patent disclosure.
Since the inventive principles of this patent disclosure may be modified in arrangement and detail without departing from the inventive concepts, such changes and modifications are considered to be within the scope of the appended claims.

Claims (20)

1. A method for operating a computing device, comprising:
performing, at a computing storage device, a first computing task of a workload using first data stored at the computing storage device, wherein the step of performing the first computing task of the workload includes generating second data;
transmitting the second data from the computing storage device to the computing device using an interconnect fabric; and
performing, at the computing device, a second computing task of the workload using the second data.
2. The method of claim 1, further comprising: allocating the first computing task of the workload based on a size of the first data and a memory capacity of the computing device.
3. The method of claim 1, further comprising: allocating the first computing task of the workload based on a performance characteristic of the first computing task of the workload.
4. The method of claim 1, further comprising: allocating the first computing task of the workload based on an operational state of the computing device.
5. The method of claim 1, wherein the interconnect fabric is connected to a host, the method further comprising: allocating the first computing task of the workload based on a memory capacity of the host.
6. The method of claim 1, wherein the interconnect fabric is connected to a host, the method further comprising: allocating the first computing task of the workload based on an operational state of the host.
7. The method according to claim 1, wherein:
the workload includes a machine learning workload; and
the first computing task of the workload includes a lookup operation.
8. The method of claim 1, wherein the first computing task of the workload comprises a sparse length summation operation.
9. The method of claim 1, further comprising: performing, at the computing storage device, a third computing task of the workload using the first data.
10. The method according to claim 9, wherein:
the first data is stored at least partially in a data structure; and
the third computing task of the workload includes updating the data structure.
11. A storage device, comprising:
a storage medium;
at least one computing resource;
an interconnect interface; and
control circuitry configured to:
perform, using at least one of the at least one computing resource, a computing task of a workload using first data stored at the storage device, wherein performing the computing task of the workload includes generating second data; and
transfer the second data from the storage device to a computing device using the interconnect interface.
12. The storage device of claim 11, wherein the computing task comprises a first computing task of the workload, and the control circuitry is configured to: perform a second computing task of the workload using at least one of the at least one computing resource.
13. The storage device of claim 12, wherein:
the first data is stored at least partially in a data structure; and
the second computing task of the workload includes updating the data structure.
14. The storage device of claim 12, wherein:
the first computing task of the workload includes a summing operation; and
the second computing task of the workload includes a gradient operation.
15. A method for operating a computing device, comprising:
determining a memory capacity of a first computing device connected to an interconnect fabric, wherein the interconnect fabric is connected to a second computing device;
selecting the first computing device based on the memory capacity of the first computing device and a size of first data for a workload, wherein the workload includes a first computing task and a second computing task, and the first computing task uses at least a portion of the first data to generate second data for the second computing task;
transmitting at least a portion of the first data to the first computing device; and
performing, by the first computing device, the first computing task of the workload based on the selection.
16. The method of claim 15, wherein selecting the first computing device is further based on a performance characteristic of the first computing device and a performance characteristic of a first computing task of the workload.
17. The method of claim 16, wherein the performance characteristics of the first computing task of the workload include delay characteristics.
18. The method of claim 15, wherein selecting the first computing device is further based on an operational state of the first computing device.
19. The method of claim 18, wherein the operational state of the first computing device comprises a utilization of the first computing device.
20. The method of claim 15, wherein selecting the first computing device is further based on a persistence characteristic of the first data.
