WO2022047802A1 - Processing-in-memory device and data processing method thereof - Google Patents

Processing-in-memory device and data processing method thereof

Info

Publication number
WO2022047802A1
Authority
WO
WIPO (PCT)
Prior art keywords
data elements
memory array
pim
register
sorting mode
Prior art date
Application number
PCT/CN2020/113839
Other languages
French (fr)
Inventor
Yawen Zhang
Tianchan GUAN
Xiaoxin Fan
Yuhao WANG
Hongzhong Zheng
Shuangchen Li
Chunsheng Liu
Original Assignee
Alibaba Group Holding Limited
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited filed Critical Alibaba Group Holding Limited
Priority to PCT/CN2020/113839 priority Critical patent/WO2022047802A1/en
Priority to CN202080102722.1A priority patent/CN115836346A/en
Publication of WO2022047802A1 publication Critical patent/WO2022047802A1/en


Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 7/00 Arrangements for writing information into, or reading information out from, a digital store
    • G11C 7/10 Input/output [I/O] data interface arrangements, e.g. I/O data control circuits, I/O data buffers
    • G11C 7/1006 Data managing, e.g. manipulating data before writing or reading out, data bus switches or control circuits therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/54 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using elements simulating biological cells, e.g. neuron
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11C STATIC STORES
    • G11C 11/00 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C 11/21 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C 11/34 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C 11/40 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C 11/401 Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming cells needing refreshing or charge regeneration, i.e. dynamic cells

Definitions

  • the top-k function can be applied in tasks of similarity search to find the K largest or K smallest elements among given elements (e.g., N elements) .
  • the top-k function can be used in a fast region-based convolutional neural network (RCNN) and the like.
  • the top-k function can be implemented using software.
  • a software-implemented top-k function is unable to process a great number of elements within a reasonable period, and thus is not suitable for some applications with strict latency requirements.
  • a large amount of data transfers between processing units and memory devices becomes a performance bottleneck in the top-k function, due to limited memory performance.
  • Embodiments of the present disclosure provide a processing in memory (PIM) device.
  • the PIM device includes a memory array configured to store data and a computing circuit.
  • the computing circuit is configured to execute a set of instructions to cause the PIM device to: select between multiple computation modes, which include a first sorting mode and a second sorting mode, based on a configuration from a host communicatively coupled to the PIM device; access data elements in a memory array of the PIM device; and in the first sorting mode or the second sorting mode, output top K data elements among the data elements to the memory array or to the host.
  • K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  • Embodiments of the present disclosure also provide a data processing method.
  • the data processing method includes: selecting between multiple computation modes for a processing-in-memory (PIM) device based on a configuration, wherein the multiple computation modes include a first sorting mode and a second sorting mode; accessing multiple data elements in a memory array of the PIM device; and in the first sorting mode or the second sorting mode, outputting top K data elements among the multiple data elements to the memory array or to a host communicatively coupled to the PIM device, in which K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  • Embodiments of the present disclosure also provide a non-transitory computer readable storage medium storing a set of instructions that are executable by one or more computing circuits of an apparatus to cause the apparatus to initiate a data processing method.
  • the data processing method includes: selecting between multiple computation modes based on a configuration, wherein the multiple computation modes include a first sorting mode and a second sorting mode; accessing multiple data elements in a memory array of the apparatus; and in the first sorting mode or the second sorting mode, outputting top K data elements among the multiple data elements to the memory array or to a host communicatively coupled to the apparatus.
  • K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  • Embodiments of the present disclosure also provide a system for processing data.
  • the system for processing data includes a host and multiple processing-in-memory (PIM) devices communicatively coupled to the host.
  • Any of the multiple PIM devices includes a memory array configured to store data and a computing circuit configured to execute a set of instructions to cause the PIM device to: select between multiple computation modes based on a configuration from the host, the multiple computation modes including a first sorting mode and a second sorting mode; access multiple data elements in a memory array of the PIM device; and in the first sorting mode or the second sorting mode, output top K data elements among the multiple data elements to the host.
  • K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  • FIG. 1 illustrates an exemplary processing in memory (PIM) block configuration, consistent with some embodiments of the present disclosure.
  • FIG. 2A illustrates an exemplary neural network accelerator architecture, consistent with some embodiments of the present disclosure.
  • FIG. 2B illustrates a schematic diagram of an exemplary cloud system incorporating a neural network accelerator architecture, consistent with some embodiments of the present disclosure.
  • FIG. 3 illustrates an exemplary memory tile configuration, consistent with some embodiments of the present disclosure.
  • FIG. 4 illustrates an exemplary PIM processing unit, consistent with some embodiments of the present disclosure.
  • FIG. 5 illustrates exemplary operations performed by the PIM processing unit in FIG. 4 for a top-k sorting method, consistent with some embodiments of the present disclosure.
  • FIG. 6 illustrates exemplary operations performed by the PIM processing unit in FIG. 4 for another top-k sorting method, consistent with some embodiments of the present disclosure.
  • FIG. 7A and FIG. 7B illustrate an exemplary PIM processing unit, consistent with some embodiments of the present disclosure.
  • FIG. 8 illustrates an exemplary PIM-based accelerator architecture, consistent with some embodiments of the present disclosure.
  • FIG. 9 illustrates exemplary operations performed by the PIM processing unit in FIG. 7A and FIG. 7B for a similarity search, consistent with some embodiments of the present disclosure.
  • FIG. 10 illustrates exemplary operations performed by the PIM processing unit in FIG. 7A and FIG. 7B for a similarity search, consistent with some embodiments of the present disclosure.
  • FIG. 11 and FIG. 12 illustrate exemplary operations performed by the PIM processing unit in FIG. 7A and FIG. 7B for a k-means clustering computation, consistent with some embodiments of the present disclosure.
  • FIG. 13 illustrates an exemplary flow diagram for performing a data processing method, consistent with some embodiments of the present disclosure.
  • FIG. 14 illustrates an exemplary flow diagram for performing a data processing method, consistent with some embodiments of the present disclosure.
  • the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component may include A or B, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or A and B. As a second example, if it is stated that a component may include A, B, or C, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
  • the term “exemplary” is used in the sense of “example” rather than “ideal.”
  • Embodiments of the present disclosure mitigate the problems stated above by providing devices and methods for data processing that perform top-k sorting, k-means clustering, or other similarity search computations.
  • unnecessary data movements can be reduced, and efficient and parallel computations can be achieved. Accordingly, the memory performance bottleneck in similarity search and k-means computations can be substantially reduced.
  • the proposed devices and methods for data processing can be applied to various applications having large databases and large amounts of data processing tasks, including various cloud systems utilizing AI computations.
  • AI-related applications can involve neural network-based machine learning (ML) or deep learning (DL) .
  • some embodiments can be utilized in neural network architectures, such as deep neural networks (DNNs) , convolutional neural networks (CNNs) , recurrent neural networks (RNNs) , or the like.
  • some embodiments can be configured for various processing architectures, such as data processing units (DPUs) , neural network processing units (NPUs) , graphics processing units (GPUs) , field programmable gate arrays (FPGAs) , tensor processing units (TPUs) , application-specific integrated circuits (ASICs) , any other types of heterogeneous accelerator processing units (HAPUs) , or the like.
  • an accelerator can be configured to accelerate top-k sorting computation, k-means clustering computation, or other computations performed in similarity search.
  • the accelerator can be configured to accelerate workload (e.g., neural network computing tasks) in any AI-related applications.
  • the accelerator having a Dynamic Random-Access Memory (DRAM) or an Embedded Dynamic Random-Access Memory (eDRAM) is known as a DRAM-based or an eDRAM-based accelerator.
  • FIG. 1 illustrates an exemplary processing in memory (PIM) block configuration, consistent with some embodiments of the present disclosure.
  • PIM block 100 includes a memory cell array 110, a block controller 120, a block row driver 131, and a block column driver 132.
  • PIM block 100 can be implemented based on various memory technologies including static random-access memory (SRAM) , resistive random-access memory (ReRAM) , etc.
  • Memory cell array 110 may include m number of rows r1 to rm and n number of columns c1 to cn. As shown in FIG. 1, a memory cell 111 can be connected between each of the m rows r1 to rm and each of the n columns c1 to cn.
  • data can be stored in multi-bit memristors in a crossbar memory.
  • Block row driver 131 and block column driver 132 may provide signals, such as voltage signals, to the m rows r1 to rm and the n columns c1 to cn for processing corresponding operations.
  • block row driver 131 and block column driver 132 may be configured to pass analog signals through memory cell 111.
  • the analog signals may have been converted from digital input data.
  • Block controller 120 may include an instruction register for storing instructions.
  • the instructions may indicate when block row driver 131 or block column driver 132 provides signals to a corresponding column or row, which signals are to be provided, etc.
  • Block controller 120 can decode instructions stored in the register into signals to be used by block row driver 131 or block column driver 132.
  • PIM block 100 may further include a row sense amplifier 141 or a column sense amplifier 142 for reading data out from, or storing data into, a memory cell.
  • row sense amplifier 141 and column sense amplifier 142 may store data for buffering.
  • PIM block 100 can further include a DAC 151 (digital-to-analog converter) or an ADC 152 (analog-to-digital converter) to convert input signals or output data between the analog domain and the digital domain.
  • row sense amplifier 141 or column sense amplifier 142 can be omitted because computations in PIM block 100 may be performed directly on the stored values in the memory cell without reading the values out or without using any sense amplifier.
  • PIM block 100 enables parallel computing by using memories as multiple SIMD (single instruction, multiple data) processing units.
  • PIM block 100 may support computational operations including bit-wise operations, additions, subtractions, multiplications, and divisions for both integer and floating-point values.
  • a first column c1 and a second column c2 can store a first vector A and a second vector B, respectively.
  • a vector operation result C from the addition of vectors A and B can be stored in a third column c3 by applying formatted signals to the first to third columns c1 to c3 and to corresponding rows for the length of the vectors A, B, and C.
  • one vector can be stored in multiple columns to represent n-bit element values. For example, a vector whose elements have 2-bit values can be stored in two columns of memory cells.
  • when the length of a vector exceeds the number of rows of memory cell array 110, which constitutes a memory block, the vector may be stored in multiple memory blocks. The multiple memory blocks may be configured to compute different vector segments in parallel. While embodiments are described in which the PIM architecture performs computational operations without arithmetic logic in addition to the memory cells, the present disclosure may also apply to PIM architectures that include arithmetic logic for performing arithmetic operations. As described above, computational operations such as addition, multiplication, etc., can be performed as column-wise vector calculations in the PIM architecture; a minimal sketch follows.
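  • As a minimal software analogy of the column-wise vector calculation described above (array sizes and values are illustrative assumptions, not from the patent), vectors live in columns and the result is written into a third column:

```python
# Hypothetical m x n memory cell array (rows r1..rm, columns c1..cn),
# modeled as a list of rows.
m, n = 8, 3
cells = [[0] * n for _ in range(m)]

# Store vector A in column c1 and vector B in column c2.
for r in range(m):
    cells[r][0] = r        # element of vector A
    cells[r][1] = r * 10   # element of vector B

# Column-wise addition: write result vector C into column c3,
# row by row, mimicking the in-memory formatted-signal computation.
for r in range(m):
    cells[r][2] = cells[r][0] + cells[r][1]

print([cells[r][2] for r in range(m)])  # vector C = A + B
```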
  • the disclosed embodiments provide a PIM accelerator architecture enabling efficient top-k operation, k-means clustering, or similarity search in large databases.
  • the top-k operation, i.e., finding the k largest or smallest elements from a set, is widely used for predictive modeling in information retrieval, machine learning, and data mining.
  • FIG. 2A illustrates an exemplary accelerator architecture 200, consistent with some embodiments of the present disclosure.
  • accelerator architecture 200 may be referred to as a neural network processing unit (NPU) architecture.
  • a neural network accelerator may also be referred to as a machine learning accelerator or deep learning accelerator.
  • accelerator architecture 200 may also be applied in PIM accelerators with various capabilities, such as accelerators for parallel graph processing, for database queries, or for other computation tasks.
  • accelerator architecture 200 can include a PIM accelerator 210, an interface 212, and the like. It is appreciated that PIM accelerator 210 can perform algorithmic operations based on communicated data.
  • PIM accelerator 210 can include one or more memory tiles 2024.
  • memory tiles 2024 can include a plurality of memory blocks for data storage and computation.
  • a memory block can be configured to perform one or more operations (e.g., multiplication, addition, multiply-accumulate, etc. ) on the communicated data.
  • each of the memory blocks included in memory tile 2024 may have the same configuration as PIM block 100 shown in FIG. 1. Due to its hierarchical design, PIM accelerator 210 can provide generality and scalability.
  • PIM accelerator 210 may include any number of memory tiles 2024 and each memory tile 2024 may have any number of memory blocks.
  • Interface 212 may serve as an inter-chip bus, providing communication between PIM accelerator 210 and host unit 222.
  • the inter-chip bus connects PIM accelerator 210 with other devices, such as the off-chip memory or peripherals.
  • accelerator architecture 200 can further include a DMA unit, which may be considered as a part of interface 212, or a separate component (not shown) in PIM accelerator 210, that assists with transferring data between host memory 224 and PIM accelerator 210.
  • DMA unit can assist with transferring data between multiple accelerators.
  • DMA unit can allow off-chip devices to access both on-chip and off-chip memory without causing a host CPU interrupt. Thus, DMA unit can also generate memory addresses and initiate memory read or write cycles.
  • DMA unit also can contain several hardware registers that can be written and read by the one or more processors, including a memory address register, a byte-count register, one or more control registers, and other types of registers. These registers can specify some combination of the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device) , the size of the transfer unit, or the number of bytes to transfer in one burst. It is appreciated that accelerator architecture 200 can include a second DMA unit, which can be used to transfer data between accelerator architectures, allowing multiple accelerator architectures to communicate directly without involving the host CPU.
  • While accelerator architecture 200 of FIG. 2A is described as including PIM accelerator 210 having memory blocks (e.g., PIM block 100 of FIG. 1) , it is appreciated that the disclosed embodiments may be applied to any type of memory block that supports arithmetic operations, for accelerating applications such as deep learning.
  • Accelerator architecture 200 can also communicate with a host unit 222.
  • Host unit 222 can be one or more processing units (e.g., an X86 central processing unit) .
  • PIM accelerator 210 can be considered as a coprocessor to host unit 222 in some embodiments.
  • host unit 222 may be associated with host memory 224.
  • host memory 224 may be an integral memory or an external memory associated with host unit 222.
  • Host memory 224 may be a local or a global memory.
  • host memory 224 may include host disk, which is an external memory configured to provide additional memory for host unit 222.
  • Host memory 224 can be a double data rate synchronous dynamic random-access memory (e.g., DDR SDRAM) or the like.
  • Host memory 224 can be configured to store a large amount of data with slower access speed, compared to the on-chip memory of PIM accelerator 210, acting as a higher-level cache. The data stored in host memory 224 may be transferred to PIM accelerator 210 to be used for various computation tasks or executing neural network models.
  • a host system 220 having host unit 222 and host memory 224 can comprise a compiler (not shown) .
  • the compiler is a program or computer software that transforms computer codes written in one programming language into instructions to create an executable program.
  • a compiler can perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, code optimization, and code generation, or combinations thereof.
  • the compiler may push one or more commands to host unit 222. Based on these commands, host unit 222 can assign any number of tasks to one or more memory tiles (e.g., memory tile 2024) or processing elements. Some of the commands may instruct a DMA unit to load instructions and data from host memory (e.g., host memory 224 of FIG. 2A) into accelerator (e.g., PIM accelerator 210 of FIG. 2A) . The instructions may be loaded to each memory tile (e.g., memory tile 2024 of FIG. 2A) assigned with the corresponding task, and the one or more memory tiles may process these instructions.
  • the first few instructions may instruct the memory tile to load/store data from host memory 224 into one or more of its local memories.
  • Each memory tile may then initiate the instruction pipeline, which involves fetching the instruction (e.g., via a fetch unit) from the local memory, decoding the instruction (e.g., via an instruction decoder) and generating local memory addresses (e.g., corresponding to an operand) , reading the source data, executing or loading/storing operations, and then writing back results.
  • FIG. 2B illustrates a schematic diagram of an exemplary cloud system incorporating a neural network accelerator architecture, consistent with some embodiments of the present disclosure.
  • cloud system 230 can provide cloud service with artificial intelligence (AI) capabilities and can include a plurality of computing servers (e.g., computing servers 232 and 234) .
  • computing server 232 can, for example, incorporate accelerator architecture 200 of FIG. 2A. Accelerator architecture 200 is shown in FIG. 2B in a simplified manner for clarity.
  • cloud system 230 can provide extended data processing capabilities.
  • cloud system 230 can provide AI capabilities of image recognition, facial recognition, translations, 3D modeling, or the like.
  • accelerator architecture 200 can be deployed to computing devices in other forms.
  • accelerator architecture 200 can also be integrated in a computing device, such as a smart phone, a tablet, and a wearable device.
  • FIG. 3 illustrates an exemplary memory tile configuration, consistent with some embodiments of the present disclosure.
  • a memory tile 300 may include a memory block assembly 310, a controller 320, a row driver 331, a column driver 332, a global buffer 340, an instruction storage 350, a data transfer table 360, and a block table 370.
  • Memory block assembly 310 may include a plurality of memory blocks arranged in a two-dimensional mesh consistent with embodiments of the present disclosure.
  • Controller 320 can provide commands to each memory block in memory block assembly 310 via row driver 331, column driver 332, and global buffer 340.
  • Row driver 331 is connected to each row of memory blocks in memory block assembly 310 and column driver 332 is connected to each column of memory blocks in memory block assembly 310.
  • block controller e.g., block controller 120 in FIG. 1 included in each memory block may be configured to receive commands via row driver 331 or column driver 332 from the controller 320 and to issue signals to a block row driver (e.g., block row driver 131 in FIG. 1) and a block column driver (e.g., block column driver 132 in FIG. 1) to perform corresponding operations in memory.
  • memory blocks can perform different operations independently by using block controllers in memory blocks of memory block assembly 310, and thus block-level parallel processing can be performed. Data can be efficiently transferred among memory cells arranged in rows and in columns in the corresponding memory block by the block controller.
  • global buffer 340 can be used to transfer data between memory blocks in memory block assembly 310.
  • controller 320 can use global buffer 340 when transferring data from one memory block to another memory block in memory block assembly 310.
  • global buffer 340 can be shared by all memory blocks in memory block assembly 310.
  • Global buffer 340 can be configured to store commands for each memory block to process assigned tasks in processing a neural network model.
  • controller 320 is configured to send commands stored in global buffer 340 to corresponding memory blocks via row driver 331 and column driver 332. In some embodiments, such commands can be transferred from host unit (e.g., host unit 222 of FIG. 2A) .
  • Global buffer 340 can be configured to store and send data to be used for processing assigned tasks to memory blocks.
  • the data stored in and sent from global buffer 340 can be transferred from host unit (e.g., host unit 222 of FIG. 2A) or other memory blocks in memory block assembly 310.
  • controller 320 is configured to store data from memory blocks in memory block assembly 310 into global buffer 340.
  • controller 320 can receive and store data of an entire row of one memory block in memory block assembly 310 into global buffer 340 in one cycle. Similarly, controller 320 can send data of an entire row from global buffer 340 to another memory block in one cycle.
  • memory tile 300 of FIG. 3 may include an instruction storage 350 configured to store instructions for executing a neural network model in memory block assembly 310 in a pipelined manner.
  • Instruction storage 350 may store instructions of computations or data movements between memory blocks in memory block assembly 310.
  • Controller 320 can be configured to access instruction storage 350 to retrieve the instructions stored in the instruction storage 350.
  • Instruction storage 350 may be configured to have a separate instruction segment assigned to each memory block.
  • memory tile 300 can include data transfer table 360 for recording data transfer in memory tile 300.
  • Data transfer table 360 can be configured to record data transfers between memory blocks.
  • data transfer table 360 may be configured to record pending data transfers.
  • memory tile 300 can include block table 370 for recording a memory block status.
  • Block table 370 can have a state field (State) storing a current status of the corresponding memory block.
  • each memory block in memory block assembly 310 can have one of several statuses, for example, an idle status, a computing status, and a ready status.
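  • As a small sketch of how the block table's state field might be modeled in software (the three state names come from the paragraph above; the class and table layout are illustrative assumptions):

```python
from enum import Enum, auto

class BlockState(Enum):
    IDLE = auto()       # no task assigned to the memory block
    COMPUTING = auto()  # the block is executing an operation
    READY = auto()      # results are available to be read out

# Hypothetical block table: one state entry per memory block.
block_table = {block_id: BlockState.IDLE for block_id in range(16)}
block_table[3] = BlockState.COMPUTING
```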
  • FIG. 4 illustrates an exemplary processing-in-memory (PIM) processing unit 400, consistent with some embodiments of the present disclosure.
  • PIM processing unit 400 may adopt the same or a similar architecture as accelerator architecture 200 shown in FIG. 2A and the memory tile configuration (e.g., memory tile 300) shown in FIG. 3.
  • PIM processing unit 400 may be referred to as a processing-in-memory data processing unit (PIM-DPU) .
  • PIM processing unit 400 can include a memory array 410, a memory interface 420, a computing circuit 430, a host interface 440, a configuration register 450, and a controller 460, which can be integrated on the same chip or the same die or embedded in the same package.
  • PIM processing unit 400 may be on a Dynamic Random-Access Memory (DRAM) die, in which the memory device is a DRAM or an eDRAM having a memory array 410 including memory cells arranged in rows and columns.
  • memory array 410 may be divided into multiple logical blocks or partitions, also called “trunks” for storing data. Each trunk includes one or more rows of memory array 410.
  • PIM processing unit 400 can include a 4-Gbit DRAM, but the present disclosure is not limited thereto.
  • PIM processing unit 400 may also include a DRAM unit having various capacities.
  • PIM processing unit 400 with DRAM or eDRAM unit(s) may be referred to as a DRAM-based or an eDRAM-based accelerator.
  • an external agent, such as a host, can communicate with PIM processing unit 400 through host interface 440. In some embodiments, host interface 440 can be a Peripheral Component Interconnect Express (PCI Express) interface, but the present disclosure is not limited thereto.
  • Configuration register 450 may store a configuration including parameters, such as the K value for a top-k sorting computation, the partition block size of memory block(s) in memory array 410 of the DRAM, etc.
  • Controller 460 can communicate with configuration register 450 to access the stored parameters, and accordingly instruct computing circuit 430 to perform a sequence of operations for various computations, such as a top-k sorting computation, a k-means clustering computation, or other computations for accelerating similarity search methods on large datasets.
  • in a top-k sorting computation, computing circuit 430 can calculate and output the first to the Kth maximum or minimum values in a dataset, in which K can be any integer.
  • in a k-means clustering computation, computing circuit 430 can partition N data points (or “observations” ) into K sets (or “clusters” ) so as to minimize the within-cluster sum of squares (WCSS) or the variance, in which K can be any integer greater than 1 and N can be any integer greater than K.
  • the K value for the top-k sorting or the k-means computation may be a number between 64 and 1500, but the present disclosure is not limited thereto.
  • the number of vectors in the dataset for the top-k sorting or the k-means computation may be around 10^8 to 10^10, but the present disclosure is not limited thereto.
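  • a minimal sketch of the configuration described above, assuming a simple container type (field names and defaults are illustrative, not from the patent):

```python
from dataclasses import dataclass

@dataclass
class PimConfig:
    k: int                     # K value for top-k sorting or k-means
    partition_block_size: int  # partition size of memory blocks in the array
    element_bits: int = 32     # assumed width of each data element

# Example: the host programs configuration register 450 before a run.
config = PimConfig(k=128, partition_block_size=1 << 20)
```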
  • instructions from controller 460 can be decoded and then executed by computing circuit 430.
  • Computing circuit 430 includes memory components and computation components for performing computations.
  • memory components in computing circuit 430 may include one or more vector registers 432 and one or more scalar registers 436.
  • computation components in computing circuit 430 may include one or more reducers 434 and a scalar arithmetic-logic unit (ALU) 438.
  • Computing circuit 430 can, via memory interface 420, read or write data from or to memory array 410.
  • memory interface 420 can be a wide input/output interface connecting memory array 410 and computing circuit 430, providing a 1024-bit read/write per cycle (e.g., 2 nanoseconds) .
  • FIG. 5 illustrates exemplary operations performed by PIM processing unit 400 for a top-k sorting method, consistent with some embodiments of the present disclosure.
  • PIM processing unit 400 performs a top-k sorting method where the K value is greater than the number of data elements a vector can hold. For example, a 1024-bit vector can hold 32 data elements if data is stored as 32-bit values.
  • controller 460 can instruct computing circuit 430 to perform the method illustrated in FIG. 5.
  • memory array 410 includes multiple logical blocks 412, 414, and 416 storing data elements.
  • Computing circuit 430 receives the data elements from logical blocks 412, 414, and 416, and further calculates a block maximum or minimum element for each of the logical blocks 412, 414, and 416.
  • computing circuit 430 can read a vector (e.g., a 1024-bit vector storing multiple data elements) , and compare a minimum value stored in the vector with a scalar value of the current minimum value in the current logical block. By repeating the processes above and reading each vector in the current logical block, computing circuit 430 can obtain the minimum element in the current logical block, and a block identification (ID) associated with this block minimum element. Computing circuit 430 may also perform similar operations to obtain the maximum element in the current logical block, and the block identification (ID) associated with this block maximum element.
  • Computing circuit 430 can store the minimum (or maximum) element of each of logical blocks 412, 414, and 416 in one entry of one of vector register(s) 432. Then, computing circuit 430 can use reducer(s) 434 to determine a global minimum (or maximum) element based on the block minimum (or maximum) elements of logical blocks 412, 414, and 416. For example, a min reducer may be used to determine the global minimum element, and a max reducer may be used to determine the global maximum element.
  • after storing the global minimum (or maximum) element as one of the top K data elements, computing circuit 430 disables the stored global minimum (or maximum) element and repeats the above operations to obtain a new block minimum (or maximum) element for the logical block associated with the stored, and disabled, global minimum (or maximum) element.
  • computing circuit 430 can again use reducer(s) 434 to determine a second global minimum (or maximum) element based on the block minimum (or maximum) elements of logical blocks 412, 414, and 416. Accordingly, computing circuit 430 can repeat the above operations for K cycles to obtain the first to the Kth global minimum (or maximum) elements, in order to determine the top K data elements, which may be the greatest K data elements or the smallest K data elements. A sketch of this mode follows.
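  • The following Python sketch is a minimal model of the first sorting mode under assumed data structures (logical blocks as plain lists, the vector register as a list of per-block minima), shown for the smallest-K case; the largest-K case is symmetric with max in place of min:

```python
import math

def top_k_large(blocks, k):
    """First sorting mode (FIG. 5): find the K smallest elements when K
    exceeds what one vector register can hold. Assumes the blocks hold
    at least K elements in total."""
    disabled = [set() for _ in blocks]  # positions already output

    def block_min(b):
        vals = [(v, i) for i, v in enumerate(blocks[b]) if i not in disabled[b]]
        return min(vals) if vals else (math.inf, None)

    # "Vector register": one block-minimum entry per logical block.
    mins = [block_min(b) for b in range(len(blocks))]
    result = []
    for _ in range(k):
        # "Min reducer": global minimum over the per-block minima.
        b = min(range(len(blocks)), key=lambda i: mins[i][0])
        val, idx = mins[b]
        result.append(val)
        disabled[b].add(idx)    # disable the element just output
        mins[b] = block_min(b)  # recompute only that block's minimum
    return result
```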
  • FIG. 6 illustrates exemplary operations performed by PIM processing unit 400 for another top-k sorting method, consistent with some embodiments of the present disclosure.
  • PIM processing unit 400 performs a top-k sorting method where the K value is smaller than or equal to the number of data elements one vector can hold. For example, if data is stored as 32-bit values in 1024-bit vectors, then in response to the K value programmed in configuration register 450 being smaller than or equal to 32, controller 460 can instruct computing circuit 430 to perform the method illustrated in FIG. 6.
  • vector register 432 is configured to store the current minimum K values.
  • Computing circuit 430 uses reducer(s) 434 (e.g., a max reducer) to get the maximum of the current minimum K values and stores this maximum value in scalar register(s) 436.
  • when computing circuit 430 reads a vector (e.g., a 1024-bit vector storing multiple data elements) from memory array 410, computing circuit 430 stores one or more minimum values in the vector in scalar register(s) 436. Scalar ALU 438 can communicate with scalar register(s) 436 and compare the one or more minimum values in the vector with the maximum of the current minimum K values. In response to the one or more minimum values in the vector being smaller than the maximum of the current minimum K values, computing circuit 430 can replace the maximum of the current minimum K values in vector register 432 with the one or more minimum values in the vector, and then recalculate the new maximum of the current minimum K values in vector register 432.
  • Computing circuit 430 can perform the above operations repeatedly until all data elements are read out and processed. Accordingly, the K values remaining in vector register 432 after this iterative process are the smallest K data elements.
  • Computing circuit 430 can perform similar operations to store the greatest K data elements in vector register 432, by storing the current maximum K values in vector register 432, comparing, by scalar ALU 438, the one or more maximum values of the vector read from memory array 410 with the minimum of the current maximum K values stored in scalar register(s) 436, and updating vector register(s) 432 based on the comparison result. Accordingly, computing circuit 430 can determine the top K data elements, which may be the greatest K data elements or the smallest K data elements. A sketch of this mode follows.
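  • a minimal Python sketch of the second sorting mode for the smallest-K case (assuming the first streamed vector alone holds at least K elements; the largest-K case swaps min and max):

```python
def top_k_small(vectors, k):
    """Second sorting mode (FIG. 6): keep the current K smallest values
    in a register-sized buffer while streaming vectors from memory."""
    stream = iter(vectors)
    register = sorted(next(stream))[:k]  # K initial data elements
    target = max(register)               # element that would be evicted

    for vector in stream:
        for value in vector:
            if value < target:
                # Replace the current maximum of the K minima, then
                # recompute the new maximum (the max reducer's role).
                register[register.index(target)] = value
                target = max(register)
    return register
```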
  • FIG. 7A and FIG. 7B illustrate an exemplary PIM processing unit 700 consistent with some embodiments of the present disclosure.
  • PIM processing unit 700 may adopt the same or a similar architecture as accelerator architecture 200 shown in FIG. 2A and the memory tile configuration (e.g., memory tile 300) shown in FIG. 3.
  • PIM processing unit 700 may include additional memory components and computation components for performing computations.
  • computing circuit 430 in PIM processing unit 700 may further include a Static Random-Access Memory (SRAM) 732, a decoder 734, and a Single Instruction Multiple Data (SIMD) processor 736 including one or more adders, subtractors, multipliers, multiply-accumulators, or any combination thereof.
  • FIG. 7B illustrates how memory components and computation components in PIM processing unit 700 communicate and cooperate to perform various computation tasks.
  • PIM processing unit 700 can use SRAM 732 and decoder 734 to perform Product Quantization (PQ) compression methods to compress or reconstruct the data received from memory array 410 or controller 460 for later data processing or operations in computing circuit 430.
  • computing circuit 430 may perform a similarity search or a k-means algorithm and use SIMD processor 736 and a sum reducer in reducer(s) 434 to calculate a distance between two vectors with high parallel-computation scalability. After the computation, computing circuit 430 may store the calculated distance value in vector register(s) 432 or scalar register(s) 436. Computing circuit 430 can perform the top-k sorting operations using its register(s) (e.g., vector register(s) 432 or scalar register(s) 436) and a max reducer in reducer(s) 434 based on a heap sorting algorithm or any other suitable algorithm. Detailed operations will be further discussed in the following paragraphs.
  • FIG. 8 illustrates a PIM-based accelerator architecture 800, consistent with some embodiments of the present disclosure.
  • PIM processing units 810a-810n can be realized by PIM processing unit 400 in FIG. 4 or PIM processing unit 700 in FIG. 7A and can provide high scalability and high capacity.
  • a PIM system may include multiple PIM processing units 810a-810n, and each of PIM processing units 810a-810n communicates with a host 820 through a handshake protocol, a double data rate (DDR) protocol, or any other suitable protocol.
  • PIM processing units 810a-810n do not have direct communications with one another. Alternatively stated, PIM processing units 810a-810n may only communicate with host 820, but the present disclosure is not limited thereto. In some other embodiments, it is also possible that some or all of PIM processing units 810a-810n directly communicate with one or more other PIM processing units 810a-810n through a suitable protocol. In some embodiments, PIM-based accelerator architecture 800 may include hundreds or thousands of PIM processing units 810a-810n according to different capacity requirements in various applications. In general, PIM processing units 810a-810n in FIG. 8 can handle and process many different types of highly parallel computations and send final computation results to host 820. Accordingly, the data communication between the PIM chips and host 820 can be reduced.
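  • a minimal sketch of this system organization, assuming each PIM unit exposes a hypothetical local_top_k() method that returns its local top-K distance values; the host then merges the local results:

```python
import heapq

def host_top_k(pim_units, query, k):
    """Each PIM unit sorts only the data in its own memory array, in
    parallel; the host merges the per-unit results into the global
    top K (smallest distances)."""
    local_results = []
    for unit in pim_units:
        local_results.extend(unit.local_top_k(query, k))  # hypothetical API
    return heapq.nsmallest(k, local_results)
```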
  • FIG. 9 illustrates exemplary operations performed by PIM processing unit 700 for a non-simultaneous computation to perform a similarity search, consistent with some embodiments of the present disclosure.
  • memory array 410 may include 4 DRAM blocks.
  • PIM processing unit 700 calculates the distance values between vectors stored in the DRAM blocks and performs a top-k sorting method to sort the calculated distance values.
  • computing circuit 430 can calculate the distance values between the vectors by SIMD processor 736 and a sum reducer in reducer(s) 434 with highly parallel computations.
  • the parallel computation outputs from SIMD processor 736 can be stored or accumulated in a vector accumulator 738.
  • the distance values accumulated or stored in vector accumulator 738 can then be written back to one of the DRAM blocks in memory array 410.
  • computing circuit 430 can access the distance values from memory array 410 and perform the top-k sorting operations using its register(s) (e.g., vector register(s) 432 or scalar register(s) 436) and reducer(s) 434 (e.g., a min reducer) based on a merge sorting algorithm or any other suitable algorithm.
  • the minimum value of each DRAM block and a corresponding tag can be stored in the register(s) in computing circuit 430.
  • computing circuit 430 can first use a min reducer in reducer(s) 434 to find the minimum value stored in the register(s), then continue to find the corresponding minimum value in the DRAM block storing the distance values, and output the minimum value to the host. That is, in the non-simultaneous computation in FIG. 9, the computation of the distance values and the top-k sorting operations are performed in different time periods.
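  • a minimal Python sketch of the non-simultaneous (two-phase) flow, using squared Euclidean distance as an assumed metric and a plain list standing in for the DRAM block that stores the distance values:

```python
def non_simultaneous_search(db_vectors, query, k):
    """FIG. 9 flow: phase 1 computes all distance values and writes them
    back; phase 2 runs top-k sorting over the stored distances."""
    # Phase 1: distance computation, results "written back" to a list.
    distances = [sum((a - b) ** 2 for a, b in zip(v, query))
                 for v in db_vectors]
    # Phase 2: top-k sorting, returning the indices of the k nearest.
    return sorted(range(len(distances)), key=distances.__getitem__)[:k]
```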
  • FIG. 10 illustrates exemplary operations performed by PIM processing unit 700 for a simultaneous computation to perform the similarity search, consistent with some embodiments of the present disclosure.
  • the distance values can be stored in the register(s) and are not written back to memory array 410.
  • Computing circuit 430 performs the top-k sorting operations using its register(s) (e.g., vector register(s) 432 or scalar register(s) 436) and reducer(s) 434 (e.g., a max reducer) based on a heap sorting algorithm.
  • computing circuit 430 can use the max reducer in reducer(s) 434 to maintain the minimum top-k heap and use the register(s) to store the heap. Accordingly, in the simultaneous computation in FIG. 10, the computation of the distance values and the top-k sorting operations are performed simultaneously within computing circuit 430.
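  • a minimal sketch of the simultaneous flow using a size-K max-heap, so distance computation and top-k maintenance overlap and no distance value is written back (squared Euclidean distance is again an assumed metric):

```python
import heapq

def simultaneous_search(db_vectors, query, k):
    """FIG. 10 flow: push each distance into a size-K max-heap as it is
    computed; the heap always holds the current K nearest candidates."""
    heap = []  # entries are (negated distance, index): acts as a max-heap
    for i, v in enumerate(db_vectors):
        d = sum((a - b) ** 2 for a, b in zip(v, query))
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:              # closer than the heap maximum
            heapq.heapreplace(heap, (-d, i))
    return sorted((-nd, i) for nd, i in heap)
```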
  • K-means clustering is a method of vector quantization which aims to partition n observations into k sets (e.g., clusters) .
  • Each observation is a vector and belongs to the cluster with the nearest mean. That is, each observation is assigned to the cluster with the closest cluster center (i.e., cluster centroid) .
  • FIG. 11 illustrates exemplary operations of the assignment step, in which PIM processing unit 700 assigns each vector to the cluster with the nearest mean (e.g., the cluster with the least squared Euclidean distance) .
  • FIG. 12 illustrates exemplary operations of the update step, in which PIM processing unit 700 recalculates means, or centroids (i.e., data points, imaginary or real, at centers of clusters) , for vectors assigned to each cluster.
  • centroids for the clusters can be calculated and defined based on the following equation:

    $$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j$$

    where $x_1$ to $x_n$ are the n vectors to be clustered, and $m_1^{(t)}$ to $m_k^{(t)}$ respectively indicate the centroids for the k clusters $S_1^{(t)}$ to $S_k^{(t)}$ in the t-th iteration.
  • memory array 410 stores vectors in its memory blocks 1010 and stores the current centroids (means) of clusters in one of its row buffers 1020.
  • since one of row buffers 1020 holds the centroids, the associated memory block will not be assigned to store vectors, so vectors stored in memory array 410 can be read out via their respective row buffers accordingly.
  • Computing circuit 430 can read out a feature vector in memory array 410 and all centroids in row buffer 1020 to register(s) 432 or 436.
  • computing circuit 430 can calculate the distance values between the feature vector and the centroids by SIMD processor 736 and a sum reducer in reducer(s) 434 with highly parallel computations, and store the calculated distances between the feature vector and the centroids in SRAM 732.
  • computing circuit 430 can use a min reducer in reducer(s) 434 to find the minimum value stored in SRAM 732 so as to assign the feature vector to the cluster with the nearest mean. Accordingly, computing circuit 430 can mark the feature vector with a cluster identification (ID), which indicates the centroid that is nearest to the feature vector, and write the feature vector with the cluster ID back to memory array 410.
  • computing circuit 430 reads out the vectors marked with a cluster ID in memory array 410 to register(s) 432 or 436, and calculates an updated centroid according to the one or more vectors marked with the corresponding cluster ID (e.g., the same cluster ID) by SIMD processor 736.
  • the updated centroid can then be written back to row buffers 1020.
  • random memory access in the k-means clustering computation can be reduced by changing the access order of vectors and centroids.
  • computing circuit 430 can cluster the vectors stored in memory array 410 until convergence is reached.
  • computing circuit 430 may output a cluster result to the host or store the cluster result to memory array 410.
  • FIG. 13 illustrates an exemplary flow diagram for performing a data processing method 1300 on a PIM-based accelerator architecture, consistent with some embodiments of the present disclosure.
  • the PIM-based accelerator architecture (e.g., accelerator architecture 200 in FIG. 2A, memory tile 300 in FIG. 3, and PIM-based accelerator architecture 800 in FIG. 8) can be used for performing top-k sorting, k-means clustering, or similarity search according to some embodiments of the present disclosure.
  • any PIM processing unit (e.g., PIM processing units 810a-810n in FIG. 8) in the PIM-based accelerator architecture can select between computation modes for the PIM processing unit based on a configuration from the host (e.g., host 820 in FIG. 8) .
  • the computation modes may include one or more of a first top-k sorting mode, a second top-k sorting mode, and a k-means clustering mode. According to the selected mode, the PIM processing unit can access data elements in the memory array to perform corresponding operations.
  • Data processing method 1300 in FIG. 13 illustrates operations for a top-k sorting computation performed by the PIM-based accelerator architecture when the top-k sorting mode is selected.
  • the PIM processing unit (e.g., PIM processing units 810a-810n in FIG. 8) in the PIM-based accelerator architecture receives the configuration from the host (e.g., host 820 in FIG. 8) communicatively connected to the PIM processing unit.
  • the PIM processing unit can select between computation modes based on the configuration.
  • the computation modes include a first top-k sorting mode and a second top-k sorting mode.
  • the PIM processing unit determines whether to operate in the first top-k sorting mode or the second top-k sorting mode. Specifically, the PIM processing unit compares a K value in the configuration with a threshold value. In response to the K value in the configuration being greater than the threshold value (step 1320 -Y) , the PIM processing unit selects the first top-k sorting mode and performs steps 1331-1338. In response to the K value in the configuration being smaller than or equal to the threshold value (step 1320 -N) , the PIM processing unit selects the second top-k sorting mode and performs steps 1341-1346.
  • at step 1331, the computing circuit of the PIM processing unit receives the data elements from multiple logical blocks in the memory array.
  • at step 1332, the computing circuit calculates, for each logical block, a block maximum (or minimum) element. For example, if the top-k sorting mode is used for determining the greatest K data elements, the computing circuit calculates the block maximum element. On the other hand, if the top-k sorting mode is used for determining the smallest K data elements, the computing circuit calculates the block minimum element.
  • at step 1333, the computing circuit stores the block maximum (or minimum) elements for the logical blocks in one or more vector registers in the computing circuit. Then, the computing circuit repeats steps 1334-1338 until the top K data elements are determined.
  • at step 1334, the computing circuit determines a global maximum (or minimum) element based on the block maximum (or minimum) elements for the logical blocks.
  • at step 1335, the computing circuit stores the determined global maximum (or minimum) element as one of the top K data elements.
  • at step 1336, the computing circuit disables the global maximum (or minimum) element in the associated logical block.
  • at step 1337, the computing circuit obtains a next block maximum (or minimum) element for the logical block associated with the stored, and disabled, global maximum (or minimum) element.
  • through steps 1336 and 1337, when the global maximum (or minimum) element is determined and stored, the computing circuit only needs to recalculate and update one new block maximum (or minimum) element for the corresponding logical block, and the block maximum (or minimum) elements for the other logical blocks can be reused in the next iteration to determine the next global maximum (or minimum) element.
  • at step 1338, the computing circuit determines whether all top K data elements have been determined and obtained. If all top K data elements are determined and stored (step 1338 - yes) , the computing circuit performs step 1350 and outputs the top K data elements among the data elements to the host or to the memory array. Otherwise (step 1338 - no) , steps 1334-1338 are repeated to sequentially determine and store the top K data elements.
  • at step 1341, in the second top-k sorting mode, the computing circuit of the PIM processing unit first stores K initial data elements from the memory array to a first register. Then, the computing circuit repeatedly updates the first register by repeating steps 1342-1346 until all the data elements from the memory array are received and processed.
  • at step 1342, the computing circuit selects a maximum (or minimum) element in the first register as a target element. For example, if the top-k sorting mode is used for determining the greatest K data elements, the computing circuit selects the minimum element in the first register as the target element. On the other hand, if the top-k sorting mode is used for determining the smallest K data elements, the computing circuit selects the maximum element in the first register as the target element. That is, the target element is the element that may be evicted from the first register and replaced by another data element during the following updating process.
  • at step 1343, the computing circuit determines a top K candidate from one or more remaining data elements received from the memory array.
  • the remaining data elements can be new data read from a vector of the DRAM data array that has not yet been processed.
  • the computing circuit can receive the vector from the memory array, and select the minimum (or maximum) element in the vector as the top K candidate.
  • at step 1344, the computing circuit compares the top K candidate and the target element. If the top-k sorting mode is used for determining the greatest K data elements, the computing circuit may determine whether the top K candidate is greater than the target element currently stored in the first register. If the top-k sorting mode is used for determining the smallest K data elements, the computing circuit may determine whether the top K candidate is smaller than the target element currently stored in the first register. Accordingly, the computing circuit can determine whether to replace the target element in the first register with the top K candidate based on the comparison result. Particularly, in some embodiments, the computing circuit can store the target element and the top K candidate in a second register in the computing circuit and compare the top K candidate with the target element by a scalar ALU to obtain the comparison result.
  • if the comparison result indicates that the target element should be replaced (step 1344 - yes) , the computing circuit performs step 1345 to replace the target element with the top K candidate. Otherwise (step 1344 - no) , step 1345 is bypassed and the data stored in the first register remains unchanged.
  • at step 1346, the computing circuit determines whether all data elements of interest in the memory array have been processed. If there are remaining data elements (e.g., data in a vector that has not been processed) to be processed (step 1346 - no) , steps 1342-1346 are repeated to update the current top K data elements stored in the first register. When all data elements of interest in the memory array have been processed (step 1346 - yes) , the computing circuit performs step 1350 and outputs the top K data elements stored in the first register to the host or to the memory array.
  • data processing method 1300 in FIG. 13 can realize a top-k sorting computation to output the K greatest (or smallest) data elements, in which the value of K may be greater or smaller than the number of data elements a vector in the vector registers can hold. Particularly, if the value of K is an integer greater than the threshold value, the first top-k sorting mode is selected, and if the value of K is an integer smaller than or equal to the threshold value, the second top-k sorting mode is selected, as in the dispatch sketch below.
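  • the mode selection of step 1320 can then be sketched as a simple dispatch, reusing the top_k_large and top_k_small sketches above (the threshold of 32 assumes 32-bit data in a 1024-bit vector register):

```python
def top_k(blocks, k, vector_capacity=32):
    """Step 1320: compare K against the number of elements one vector
    register can hold and select the sorting mode. Here each logical
    block doubles as one streamed vector for the second mode."""
    if k > vector_capacity:
        return top_k_large(blocks, k)  # first top-k sorting mode
    return top_k_small(blocks, k)      # second top-k sorting mode
```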
  • FIG. 14 illustrates an exemplary flow diagram for performing another data processing method 1400 on a PIM-based accelerator architecture, consistent with some embodiments of the present disclosure.
  • Data processing method 1400 in FIG. 14 illustrates operations for a k-means clustering computation performed by a PIM-based accelerator architecture (e.g., accelerator architecture 200 in FIG. 2A, memory tile 300 in FIG. 3, and PIM-based accelerator architecture 800 in FIG. 8) when the k-means clustering mode is selected.
  • At step 1410, at least one PIM processing unit (e.g., PIM processing units 810a-810n in FIG. 8) initializes centroids for clustering.
  • the PIM processing unit may provide K initial centroids and store the initial centroids in a row buffer of the memory array. Then, the PIM processing unit clusters multiple vectors stored in the memory array, by repeating an assignment step 1420 and an update step 1430.
  • in assignment step 1420, the PIM processing unit assigns each of the vectors to one of the current clusters based on distances between the vector points and the centroids of the clusters.
  • in update step 1430, the PIM processing unit updates each of the centroids for the clusters based on the corresponding vectors.
  • Each centroid is a mean of vector (s) assigned to the same cluster in assignment step 1420.
  • Assignment step 1420 includes sub-steps 1421-1425.
  • At sub-step 1421, the computing circuit receives the centroids from the row buffer.
  • At sub-step 1422, the computing circuit receives a feature vector selected from the vectors stored in the memory array.
  • At sub-step 1423, the computing circuit marks the feature vector with a cluster identification, in which the cluster identification indicates the centroid that is nearest to the feature vector.
  • At sub-step 1424, the computing circuit writes the feature vector with the cluster identification back to the memory array.
  • At sub-step 1425, the computing circuit determines whether all vectors of interest in the memory array have been assigned to an associated cluster. If there are remaining vectors to be processed (step 1425 – no), sub-steps 1421-1425 are repeated to assign the remaining vectors. When all vectors of interest in the memory array have been assigned (step 1425 – yes), the computing circuit enters update step 1430.
  • Update step 1430 also includes sub-steps 1431 and 1432.
  • At sub-step 1431, the computing circuit calculates an updated centroid according to the one or more vectors marked with the corresponding cluster identification.
  • At sub-step 1432, the computing circuit determines whether the centroids for all cluster identifications have been updated based on the latest assignment result obtained in step 1420. If there are remaining centroids to be updated (step 1432 – no), sub-steps 1431 and 1432 are repeated.
  • At step 1440, the computing circuit checks whether the centroids remain unchanged after update step 1430 in the current cycle. If one or more centroids have changed (step 1440 – no), the PIM processing unit repeats assignment step 1420 and update step 1430 until convergence is reached.
  • When the centroids are unchanged (step 1440 – yes), the PIM processing unit performs step 1450 and outputs a cluster result to the host or stores the cluster result in the memory array. Accordingly, by the above operations, data processing method 1400 in FIG. 14 can realize a k-means clustering computation and output the cluster result, as sketched below.
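For illustration, a minimal NumPy sketch of the overall flow of method 1400 follows; the random initialization policy, the Euclidean metric, and all names are assumptions of this sketch, with NumPy standing in for the SIMD processor and reducers.

```python
import numpy as np

def k_means(vectors, k, max_iters=100, seed=0):
    """Sketch of method 1400: initialize (step 1410), then alternate
    assignment (step 1420) and update (step 1430) until the centroids
    stop changing (step 1440)."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]  # 1410
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(max_iters):
        # Assignment step 1420: mark each vector with its nearest centroid.
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                  # cluster identifications
        # Update step 1430: each centroid becomes the mean of its cluster.
        updated = np.array([
            vectors[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(updated, centroids):            # step 1440: convergence
            break
        centroids = updated
    return labels, centroids                           # step 1450: cluster result
```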
  • The proposed devices and methods can take advantage of the high bandwidth of DRAM and guarantee efficient, parallel, and prompt computation. With the wide input/output (e.g., a 1024-bit read/write per cycle) between the memory array (e.g., DRAM data arrays) and the computing circuit, the memory performance bottleneck in similarity search and k-means computations can be substantially reduced.
  • The efficiency of similarity search can depend only on the ratio of bandwidth to memory capacity. As data increases, computation time can still be kept within tens of milliseconds for similarity searches, as the estimate below illustrates.
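As a rough, illustrative sanity check of that claim, the arithmetic below reuses the example figures cited elsewhere in this disclosure (a 4-Gbit array and a 1024-bit read per 2-ns cycle); the numbers are examples, not guaranteed specifications.

```python
# Back-of-the-envelope full-scan latency: memory capacity divided by bandwidth.
capacity_bits = 4 * 2**30        # example 4-Gbit DRAM array
bits_per_cycle = 1024            # wide I/O: 1024-bit read per cycle
cycle_ns = 2                     # example 2-ns cycle time

cycles = capacity_bits // bits_per_cycle     # ~4.2 million read cycles
scan_ms = cycles * cycle_ns / 1e6            # ~8.4 ms to stream the whole array
print(f"{scan_ms:.1f} ms per full scan")
```

Scaling the capacity and bandwidth together leaves this figure unchanged, which is the sense in which the efficiency depends only on their ratio.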
  • Embodiments of the present disclosure can be applied to many products, environments, and scenarios, such as Processing-in-Memory (PIM) devices, Processing-in-Memory for AI (PIM-AI) devices, Tensor Processing Units (TPUs), Data Processing Units (DPUs), and Neural network Processing Units (NPUs).
  • Embodiments of the disclosure also provide a computer program product.
  • the computer program product may include a non-transitory computer readable storage medium having computer readable program instructions thereon for causing a processor to carry out the above-described methods.
  • the computer readable storage medium may be a tangible device that can store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM) , a read-only memory (ROM) , an erasable programmable read-only memory (EPROM) , a static random access memory (SRAM) , a portable compact disc read-only memory (CD-ROM) , a digital versatile disk (DVD) , a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
  • the computer readable program instructions for carrying out the above-described methods may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language, and conventional procedural programming languages.
  • the computer readable program instructions may execute entirely on a computer system as a stand-alone software package, or partly on a first computer and partly on a second computer remote from the first computer. In the latter scenario, the second, remote computer may be connected to the first computer through any type of network, including a local area network (LAN) or a wide area network (WAN) .
  • the computer readable program instructions may be provided to a processor of a computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the above-described methods.
  • a block in the flow charts or diagrams may represent a software program, segment, or portion of code, which includes one or more executable instructions for implementing specific functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the diagrams or flow charts, and combinations of blocks in the diagrams and flow charts may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
  • a processing-in-memory (PIM) device comprising:
  • a memory array configured to store data; and
  • a computing circuit configured to execute a set of instructions to cause the PIM device to: select between a plurality of computation modes based on a configuration from a host communicatively coupled to the PIM device, wherein the plurality of computation modes comprise a first sorting mode and a second sorting mode; access a plurality of data elements in the memory array; and in the first sorting mode or the second sorting mode, output top K data elements among the plurality of data elements to the memory array or to the host, wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  • the computing circuit further comprising a vector register, wherein in the first sorting mode, the computing circuit receives the plurality of data elements from a plurality of logical blocks in the memory array, and wherein the computing circuit is further configured to execute the set of instructions to cause the PIM device to determine the top K data elements by:
  • the computing circuit comprising a second register and a scalar arithmetic-logic unit (ALU) , wherein the computing circuit is further configured to execute the set of instructions to cause the PIM device to determine whether to replace the target element in the first register with the candidate by:
  • each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
  • a host interface configured to communicatively couple the PIM device with the host
  • a configuration register configured to store the configuration from the host
  • a controller configured to send the set of instructions according to the configuration.
  • one or more registers configured to store data for computation
  • a memory device configured to store the set of instructions
  • a decoder configured to decode the set of instructions.
  • the computing circuit further comprises one or more single instruction multiple data (SIMD) units, one or more reducer units, one or more arithmetic-logic units (ALUs), or any combination thereof.
  • a data processing method comprising:
  • K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  • each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
  • when the PIM device is configured to operate in the k-means clustering mode, updating each of the plurality of centroids by calculating an updated centroid according to one or more vectors in the memory array marked with a corresponding cluster identification.
  • a non-transitory computer readable medium that stores a set of instructions that is executable by one or more computing circuits of an apparatus to cause the apparatus to initiate a data processing method, the data processing method comprising:
  • the plurality of computation modes comprise a first sorting mode and a second sorting mode
  • K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  • each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
  • a system for processing data comprising:
  • any of the plurality of PIM devices comprises a memory array configured to store data and a computing circuit configured to execute a set of instructions to cause the PIM device to:
  • the plurality of computation modes comprising a first sorting mode and a second sorting mode
  • K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Computer Hardware Design (AREA)
  • Memory System (AREA)

Abstract

A processing-in-memory (PIM) device includes a memory array configured to store data and a computing circuit. The computing circuit is configured to execute a set of instructions to cause the PIM device to: select between multiple computation modes, including a first sorting mode and a second sorting mode, based on a configuration from a host communicatively coupled to the PIM device; access data elements in the memory array of the PIM device; and in the first sorting mode or the second sorting mode, output top K data elements among the data elements to the memory array or to the host. K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.

Description

PROCESSING-IN-MEMORY DEVICE AND DATA PROCESSING METHOD THEREOF
BACKGROUND
Similarity search has been widely used in various areas of computing, including multimedia databases, data mining, machine learning, etc. A top-k function can be applied in tasks of similarity search to find K largest or K smallest elements among given elements (e.g., N elements) . For example, the top-k function can be used in a fast region-convolution neural network (RCNN) and the like. Conventionally, the top-k function can be implemented using software.
However, traditional software implementations of the top-k function are unable to process a great number of elements within a reasonable period, and thus are not suitable for some applications with strict latency requirements. With rapidly growing sizes of databases, a large amount of data transfers between processing units and memory devices becomes a performance bottleneck in the top-k function, due to limited memory performance.
SUMMARY OF THE DISCLOSURE
Embodiments of the present disclosure provide a processing in memory (PIM) device. The PIM device includes a memory array configured to store data and a computing circuit. The computing circuit is configured to execute a set of instructions to cause the PIM device to: select between multiple computation modes, which include a first sorting mode and a second sorting mode, based on a configuration from a host communicatively coupled to the PIM device; access data elements in a memory array of the PIM device; and in the first sorting mode or the second sorting mode, output top K data elements among the data elements to the memory array or to the host. K is an integer greater than a threshold value when the first sorting mode is  selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
Embodiments of the present disclosure also provide a data processing method. The data processing method includes: selecting between multiple computation modes for a processing-in-memory (PIM) device based on a configuration, wherein the multiple computation modes include a first sorting mode and a second sorting mode; accessing multiple data elements in a memory array of the PIM device; and in the first sorting mode or the second sorting mode, outputting top K data elements among the multiple data elements to the memory array or to a host communicatively coupled to the PIM device, in which K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
Embodiments of the present disclosure also provide a non-transitory computer readable storage media storing a set of instructions that are executable by one or more computing circuits of an apparatus to cause the apparatus to initiate a data processing method. The data processing method includes: selecting between multiple computation modes based on a configuration, wherein the multiple computation modes include a first sorting mode and a second sorting mode; accessing multiple data elements in a memory array of the apparatus; and in the first sorting mode or the second sorting mode, outputting top K data elements among the multiple data elements to the memory array or to a host communicatively connecting to the apparatus. K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
Embodiments of the present disclosure also provide a system for processing data. The system for processing data includes a host and multiple processing-in-memory (PIM) devices communicatively coupled to the host. Any of the multiple PIM devices includes a memory array configured to store data and a computing circuit configured to execute a set of instructions to cause the PIM device to: select between multiple computation modes based on a configuration from the host, the multiple computation modes including a first sorting mode and a second sorting mode; access multiple data elements in a memory array of the PIM device; and in the first sorting mode or the second sorting mode, output top K data elements among the multiple data elements to the host. K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
Additional features and advantages of the disclosed embodiments will be set forth in part in the following description, and in part will be apparent from the description, or may be learned by practice of the embodiments. The features and advantages of the disclosed embodiments may be realized and attained by the elements and combinations set forth in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosed embodiments, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an exemplary processing in memory (PIM) block configuration, consistent with some embodiments of the present disclosure.
FIG. 2A illustrates an exemplary neural network accelerator architecture, consistent with some embodiments of the present disclosure.
FIG. 2B illustrates a schematic diagram of an exemplary cloud system incorporating a neural network accelerator architecture, consistent with some embodiments of the present disclosure.
FIG. 3 illustrates an exemplary memory tile configuration, consistent with some embodiments of the present disclosure.
FIG. 4 illustrates an exemplary PIM processing unit, consistent with some embodiments of the present disclosure.
FIG. 5 illustrates exemplary operations performed by the PIM processing unit in FIG. 4 for a top-k sorting method, consistent with some embodiments of the present disclosure.
FIG. 6 illustrates exemplary operations performed by the PIM processing unit in FIG. 4 for another top-k sorting method, consistent with some embodiments of the present disclosure.
FIG. 7A and FIG. 7B illustrate an exemplary PIM processing unit, consistent with some embodiments of the present disclosure.
FIG. 8 illustrates an exemplary PIM-based accelerator architecture, consistent with some embodiments of the present disclosure.
FIG. 9 illustrates exemplary operations performed by the PIM processing unit in FIG. 7A and FIG. 7B for a similarity search, consistent with some embodiments of the present disclosure.
FIG. 10 illustrates exemplary operations performed by the PIM processing unit in FIG. 7A and FIG. 7B for a similarity search, consistent with some embodiments of the present disclosure.
FIG. 11 and FIG. 12 illustrate exemplary operations performed by the PIM processing unit in FIG. 7A and FIG. 7B for a k-means clustering computation, consistent with some embodiments of the present disclosure.
FIG. 13 illustrates an exemplary flow diagram for performing a data processing method, consistent with some embodiments of the present disclosure.
FIG. 14 illustrates an exemplary flow diagram for performing a data processing method, consistent with some embodiments of the present disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the disclosure. Instead, they are merely examples of apparatuses, systems, and methods consistent with aspects related to the disclosure as recited in the appended claims. The terms and definitions provided herein control, if in conflict with terms or definitions incorporated by reference.
Unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component may include A or B, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or A and B. As a second example, if it is stated that a component may include A, B, or C,  then, unless specifically stated otherwise or infeasible, the component may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C. The term “exemplary” is used in the sense of “example” rather than “ideal. ”
Nowadays, the sizes of databases and data processing tasks are growing significantly and rapidly in various applications. In addition, in order to provide satisfying user experiences, many applications aim to meet strict latency requirements. At present, many highly parallel and simple computations, such as similarity search and k-means computations, are limited by the bandwidth and capacity of memory components in the system, which have become major performance bottlenecks.
Embodiments of the present disclosure mitigate the problems stated above by providing devices and methods for data processing that perform top-k sorting, k-means clustering, or other similarity search computations. By processing in memory (PIM) technologies and high bandwidth of DRAM, unnecessary data movements can be reduced, and efficient and parallel computations can be achieved. Accordingly, memory performance bottleneck in similarity search and k-means computations can be substantially reduced. With the devices and the methods disclosed in various embodiments, as data increases, computation time can still be kept in an acceptable range, and the overall performance and efficiency for various computations can be improved. The proposed devices and methods for data processing can be applied for various applications having large databases and large amounts of data processing tasks, including various cloud systems utilizing AI computations.
Particularly, the embodiments disclosed herein can be used in various applications or environments, such as artificial intelligence (AI) training and inference, database, and big data analytic acceleration, or the like. AI-related applications can involve neural network-based  machine learning (ML) or deep learning (DL) . For example, some embodiments can be utilized in neural network architectures, such as deep neural networks (DNNs) , convolutional neural networks (CNNs) , recurrent neural networks (RNNs) , or the like. In addition, some embodiments can be configured for various processing architectures, such as data processing units (DPUs) , neural network processing units (NPUs) , graphics processing units (GPUs) , field programmable gate arrays (FPGAs) , tensor processing units (TPUs) , application-specific integrated circuits (ASICs) , any other types of heterogeneous accelerator processing units (HAPUs) , or the like.
The term “accelerator” as used herein refers to a hardware for accelerating certain computation. For example, an accelerator can be configured to accelerate top-k sorting computation, k-means clustering computation, or other computations performed in similarity search. In some embodiments, the accelerator can be configured to accelerate workload (e.g., neural network computing tasks) in any AI-related applications. The accelerator having a Dynamic Random-Access Memory (DRAM) or an Embedded Dynamic Random-Access Memory (eDRAM) is known as a DRAM-based or an eDRAM-based accelerator.
FIG. 1 illustrates an exemplary processing in memory (PIM) block configuration, consistent with some embodiments of the present disclosure. PIM block 100 includes a memory cell array 110, a block controller 120, a block row driver 131, and a block column driver 132. Although some embodiments will be illustrated using dynamic random-access memory (DRAM) as examples, it will be appreciated that PIM block 100 according to embodiments of the present disclosure can be implemented based on various memory technologies including static random-access memory (SRAM), resistive random-access memory (ReRAM), etc. Memory cell array 110 may include m rows r1 to rm and n columns c1 to cn. As shown in FIG. 1, a memory cell 111 can be connected between each of the m rows r1 to rm and each of the n columns c1 to cn. In some embodiments, data can be stored as multi-bit memristors in a crossbar memory.
Block row driver 131 and block column driver 132 may provide signals, such as voltage signals, to the m rows r1 to rm and the n columns c1 to cn for processing corresponding operations. In some embodiments, block row driver 131 and block column driver 132 may be configured to pass analog signals through memory cell 111. In some embodiments, the analog signals may have been converted from digital input data.
Block controller 120 may include an instruction register for storing instructions. In some embodiments, instructions may include instructions of when block row driver 131 or block column driver 132 provide signals to a corresponding column or row, which signals are to be provided, etc. Block controller 120 can decode instructions stored in the register into signals to be used by block row driver 131 or block column driver 132.
PIM block 100 may further include a row sense amplifier 141 or a column sense amplifier 142 for reading out data from a memory cell or for storing data into a memory cell. In some embodiments, row sense amplifier 141 and column sense amplifier 142 may store data for buffering. In some embodiments, PIM block 100 can further include a DAC 151 (digital-to-analog converter) or an ADC 152 (analog-to-digital converter) to convert input signals or output data between the analog domain and the digital domain. In some embodiments of the present disclosure, row sense amplifier 141 or column sense amplifier 142 can be omitted because computations in PIM block 100 may be performed directly on the stored values in the memory cells without reading the values out or without using any sense amplifier.
According to embodiments of the present disclosure, PIM block 100 enables parallel computing by using memories as multiple SIMD (single instruction, multiple data) processing units. PIM block 100 may support computational operations including bit-wise operations, additions, subtractions, multiplications, and divisions for both integer and floating-point values. For example, in memory cell array 110 of FIG. 1, a first column c1 and a second column c2 can store a first vector A and a second vector B, respectively. A vector operation result C from the addition of vectors A and B can be stored in a third column c3 by applying formatted signals to the first to third columns c1 to c3 and corresponding rows for the length of the vectors A, B, and C. Similarly, memory cell array 110 of FIG. 1 can also support vector multiplication and addition operations. For example, the computation C = aA + bB can be performed by applying a voltage signal corresponding to a multiplier a to the first column c1 and a voltage signal corresponding to a multiplier b to the second column c2, and by applying formatted signals to corresponding columns and rows to perform the addition and to save the result C in the third column c3.
In some embodiments, one vector can be stored in multiple columns for representing n-bit values for elements. For example, a vector whose elements have 2-bit values can be stored in two columns of memory cells. In some embodiments, when the length of a vector exceeds the number of rows of memory cell array 110, which constitutes a memory block, the vector may be stored in multiple memory blocks. The multiple memory blocks may be configured to compute different vector segments in parallel. While embodiments are described in which the PIM architecture performs computational operations without arithmetic logic in addition to the memory cells, the present disclosure may also apply to PIM architectures that include arithmetic logic for performing arithmetic operations. As shown above, computational operations such as addition, multiplication, etc., can also be performed as column-wise vector calculations in the PIM architecture. The disclosed embodiments provide a PIM accelerator architecture enabling efficient top-k operation, k-means clustering, or similarity search in large databases. In some embodiments, the top-k operation, i.e., finding the k largest or smallest elements from a set, can be widely used for predictive modeling in information retrieval, machine learning, and data mining.
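As a software-level illustration of the column-wise computation C = aA + bB described above, the toy NumPy model below treats each column of a small array as one stored vector; it is a behavioral sketch only and does not model the analog signaling or the in-array execution.

```python
import numpy as np

rows, a, b = 8, 3.0, 0.5
cells = np.zeros((rows, 3))              # toy memory cell array with 3 columns
cells[:, 0] = np.arange(rows)            # column c1 stores vector A
cells[:, 1] = np.arange(rows)[::-1]      # column c2 stores vector B
cells[:, 2] = a * cells[:, 0] + b * cells[:, 1]  # column c3 receives C = aA + bB
print(cells[:, 2])
```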
FIG. 2A illustrates an exemplary accelerator architecture 200, consistent with some embodiments of the present disclosure. In some embodiments, accelerator architecture 200 may be referred to as a neural network processing unit (NPU) architecture. In the context of this disclosure, a neural network accelerator may also be referred to as a machine learning accelerator or deep learning accelerator. In various embodiments, accelerator architecture 200 may also be applied in PIM accelerators with various capabilities, such as accelerators for parallel graph processing, for database queries, or for other computation tasks. As shown in FIG. 2A, accelerator architecture 200 can include a PIM accelerator 210, an interface 212, and the like. It is appreciated that PIM accelerator 210 can perform algorithmic operations based on communicated data.
PIM accelerator 210 can include one or more memory tiles 2024. In some embodiments, memory tiles 2024 can include a plurality of memory blocks for data storage and computation. A memory block can be configured to perform one or more operations (e.g., multiplication, addition, multiply-accumulate, etc. ) on the communicated data. In some embodiments, each of memory blocks included in memory tile 2024 may have the same configuration of PIM block 100 shown in FIG. 1. Due to a hierarchical design of PIM accelerator 210, PIM accelerator 210 can provide generality and scalability. PIM accelerator 210 may include any number of memory tiles 2024 and each memory tile 2024 may have any number of memory blocks.
Interface 212 (such as a PCIe interface) may serve as an inter-chip bus, providing communication between PIM accelerator 210 and host unit 222. The inter-chip bus connects PIM accelerator 210 with other devices, such as the off-chip memory or peripherals. In some embodiments, accelerator architecture 200 can further include a DMA unit, which may be considered as a part of interface 212, or a separate component (not shown) in PIM accelerator 210, that assists with transferring data between host memory 224 and PIM accelerator 210. In addition, the DMA unit can assist with transferring data between multiple accelerators. The DMA unit can allow off-chip devices to access both on-chip and off-chip memory without causing a host CPU interrupt. Thus, the DMA unit can also generate memory addresses and initiate memory read or write cycles. The DMA unit also can contain several hardware registers that can be written and read by the one or more processors, including a memory address register, a byte-count register, one or more control registers, and other types of registers. These registers can specify some combination of the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device), the size of the transfer unit, or the number of bytes to transfer in one burst. It is appreciated that accelerator architecture 200 can include a second DMA unit, which can be used to transfer data between other accelerator architectures to allow multiple accelerator architectures to communicate directly without involving the host CPU.
While accelerator architecture 200 of FIG. 2A is explained to include PIM accelerator 210 having memory blocks (e.g., PIM block 100 of FIG. 1) , it is appreciated that the disclosed embodiments may be applied to any type of memory blocks, which support arithmetic operations, for accelerating some applications such as deep learning.
Accelerator architecture 200 can also communicate with a host unit 222. Host unit 222 can be one or more processing units (e.g., an X86 central processing unit). PIM accelerator 210 can be considered as a coprocessor to host unit 222 in some embodiments.
As shown in FIG. 2A, host unit 222 may be associated with host memory 224. In some embodiments, host memory 224 may be an integral memory or an external memory associated with host unit 222. Host memory 224 may be a local or a global memory. In some embodiments, host memory 224 may include host disk, which is an external memory configured to provide additional memory for host unit 222. Host memory 224 can be a double data rate synchronous dynamic random-access memory (e.g., DDR SDRAM) or the like. Host memory 224 can be configured to store a large amount of data with slower access speed, compared to the on-chip memory of PIM accelerator 210, acting as a higher-level cache. The data stored in host memory 224 may be transferred to PIM accelerator 210 to be used for various computation tasks or executing neural network models.
In some embodiments, a host system 220 having host unit 222 and host memory 224 can comprise a compiler (not shown) . The compiler is a program or computer software that transforms computer codes written in one programming language into instructions to create an executable program. In machine learning applications, a compiler can perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, code optimization, and code generation, or combinations thereof.
In some embodiments, the compiler may push one or more commands to host unit 222. Based on these commands, host unit 222 can assign any number of tasks to one or more memory tiles (e.g., memory tile 2024) or processing elements. Some of the commands may  instruct a DMA unit to load instructions and data from host memory (e.g., host memory 224 of FIG. 2A) into accelerator (e.g., PIM accelerator 210 of FIG. 2A) . The instructions may be loaded to each memory tile (e.g., memory tile 2024 of FIG. 2A) assigned with the corresponding task, and the one or more memory tiles may process these instructions.
It is appreciated that the first few instructions may instruct to load/store data from host memory 224 into one or more local memories of the memory tile. Each memory tile may then initiate the instruction pipeline, which involves fetching the instruction (e.g., via a fetch unit) from the local memory, decoding the instruction (e.g., via an instruction decoder) and generating local memory addresses (e.g., corresponding to an operand) , reading the source data, executing or loading/storing operations, and then writing back results.
FIG. 2B illustrates a schematic diagram of an exemplary cloud system incorporating a neural network accelerator architecture, consistent with some embodiments of the present disclosure. As shown in FIG. 2B, cloud system 230 can provide cloud service with artificial intelligence (AI) capabilities and can include a plurality of computing servers (e.g., computing servers 232 and 234) . In some embodiments, computing server 232 can, for example, incorporate an accelerator architecture 200 of FIG. 2A. Accelerator architecture 200 is shown in FIG. 2B in a simplified manner for simplicity and clarity.
With the assistance of accelerator architecture 200, cloud system 230 can provide the extended data processing capabilities. For example, in some embodiments, cloud system 230 can provide AI capabilities of image recognition, facial recognition, translations, 3D modeling, or the like. It is appreciated that, accelerator architecture 200 can be deployed to computing devices in other forms. For example, accelerator architecture 200 can also be integrated in a computing device, such as a smart phone, a tablet, and a wearable device.
FIG. 3 illustrates an exemplary memory tile configuration, consistent with some embodiments of the present disclosure. A memory tile 300 may include a memory block assembly 310, a controller 320, a row driver 331, a column driver 332, a global buffer 340, an instruction storage 350, a data transfer table 360, and a block table 370. Memory block assembly 310 may include a plurality of memory blocks arranged in a two-dimensional mesh consistent with embodiments of the present disclosure.
Controller 320 can provide commands to each memory block in memory block assembly 310 via row driver 331, column driver 332, and global buffer 340. Row driver 331 is connected to each row of memory blocks in memory block assembly 310 and column driver 332 is connected to each column of memory blocks in memory block assembly 310. In some embodiments, block controller (e.g., block controller 120 in FIG. 1) included in each memory block may be configured to receive commands via row driver 331 or column driver 332 from the controller 320 and to issue signals to a block row driver (e.g., block row driver 131 in FIG. 1) and a block column driver (e.g., block column driver 132 in FIG. 1) to perform corresponding operations in memory. According to embodiments of the present disclosure, memory blocks can perform different operations independently by using block controllers in memory blocks of memory block assembly 310, and thus block-level parallel processing can be performed. Data can be efficiently transferred among memory cells arranged in rows and in columns in the corresponding memory block by the block controller.
In some embodiments, global buffer 340 can be used to transfer data between memory blocks in memory block assembly 310. For example, controller 320 can use global buffer 340 when transferring data from one memory block to another memory block in memory block assembly 310. According to some embodiments of the present disclosure, global buffer  340 can be shared by all memory blocks in memory block assembly 310. Global buffer 340 can be configured to store commands for each memory block to process assigned tasks in processing neural network model. In some embodiments, controller 320 is configured to send commands stored in global buffer 340 to corresponding memory blocks via row driver 331 and column driver 332. In some embodiments, such commands can be transferred from host unit (e.g., host unit 222 of FIG. 2A) . Global buffer 340 can be configured to store and send data to be used for processing assigned tasks to memory blocks. In some embodiments, the data stored in and sent from global buffer 340 can be transferred from host unit (e.g., host unit 222 of FIG. 2A) or other memory blocks in memory block assembly 310. In some embodiments, controller 320 is configured to store data from memory blocks in memory block assembly 310 into global buffer 340. In some embodiments, controller 320 can receive and store data of an entire row of one memory block in memory block assembly 310 into global buffer 340 in one cycle. Similarly, controller 320 can send data of an entire row from global buffer 340 to another memory block in one cycle.
In some embodiments, memory tile 300 of FIG. 3 may include an instruction storage 350 configured to store instructions for executing a neural network model in memory block assembly 310 in a pipelined manner. Instruction storage 350 may store instructions of computations or data movements between memory blocks in memory block assembly 310. Controller 320 can be configured to access instruction storage 350 to retrieve the instructions stored in the instruction storage 350. Instruction storage 350 may be configured to have a separate instruction segment assigned to each memory block. In some embodiments, memory tile 300 can include data transfer table 360 for recording data transfers in memory tile 300. Data transfer table 360 can be configured to record data transfers between memory blocks. In some embodiments, data transfer table 360 may be configured to record pending data transfers. In some embodiments, memory tile 300 can include block table 370 for recording a memory block status. Block table 370 can have a state field (State) storing a current status of the corresponding memory block. According to some embodiments of the present disclosure, during execution of the computation, each memory block in memory block assembly 310 can have one of several statuses, for example, an idle status, a computing status, and a ready status.
FIG. 4 illustrates an exemplary processing-in-memory (PIM) processing unit 400, consistent with some embodiments of the present disclosure. In some embodiments, PIM processing unit 400 may apply the same or similar architecture of accelerator architecture 200 shown in FIG. 2A and the memory tile configuration (e.g., memory tile 300) shown in FIG. 3. In some embodiments, PIM processing unit 400 may be referred to as a processing-in-memory data processing unit (PIM-DPU). PIM processing unit 400 can include a memory array 410, a memory interface 420, a computing circuit 430, a host interface 440, a configuration register 450, and a controller 460, which can be integrated on the same chip or the same die or embedded in the same package. For example, in some embodiments, PIM processing unit 400 may be on a Dynamic Random-Access Memory (DRAM) die, in which the memory device is a DRAM or an eDRAM having a memory array 410 including memory cells arranged in rows and columns. In addition, memory array 410 may be divided into multiple logical blocks or partitions, also called "trunks, " for storing data. Each trunk includes one or more rows of memory array 410. For example, PIM processing unit 400 can include a 4-Gbit DRAM, but the present disclosure is not limited thereto. PIM processing unit 400 may also include a DRAM unit having various capacities. In some embodiments, PIM processing unit 400 with DRAM or eDRAM unit(s) may be referred to as a DRAM-based or an eDRAM-based accelerator.
An external agent, such as a host, can be communicatively coupled to PIM processing unit 400 through a peripheral interface (e.g., host interface 440), and the two can transmit commands, instructions, or data to each other through the peripheral interface, so that the host can communicate with PIM processing unit 400 and program configuration register 450 to configure various parameters for performing computations. For example, host interface 440 can be a Peripheral Component Interconnect Express (PCI Express) interface, but the present disclosure is not limited thereto. Configuration register 450 may store a configuration including parameters, such as the K value for the top-k sorting computation, the partition block size of memory block(s) in memory array 410 of the DRAM, etc.
Controller 460 can communicate with configuration register 450 to access the stored parameters, and accordingly instruct computing circuit 430 to perform a sequence of operations to perform various computations, such as a top-k sorting computation, a k-means clustering computation, or other computations for accelerating similarity search methods on large datasets. For example, in the top-k sorting computation, computing circuit 430 can calculate and output the first to the Kth maximum or minimum values in a dataset, in which K can be any integer. In the k-means clustering computation, computing circuit 430 can partition N data points (or "observations") into K sets (or "clusters") so as to minimize the within-cluster sum of squares (WCSS) or the variance. K can be any integer greater than 1, and N can be any integer greater than K. In some applications, the K value for the top-k sorting or the k-means computation may be a number between 64 and 1500, but the present disclosure is not limited thereto. In some embodiments, the number of vectors in the dataset for the top-k sorting or the k-means computation may be around 10^8 to 10^10, but the present disclosure is not limited thereto.
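For reference, the within-cluster sum of squares minimized by the k-means computation can be written in the standard textbook form below; the formula is supplied here for clarity and is not reproduced from the disclosure:

$$\mathrm{WCSS} = \sum_{i=1}^{K} \sum_{x \in S_i} \left\lVert x - m_i \right\rVert^2,$$

where $S_1, \dots, S_K$ are the K clusters and $m_i$ is the centroid (mean) of cluster $S_i$.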
In some embodiments, instructions from controller 460 can be decoded and then executed by computing circuit 430. Computing circuit 430 includes memory components and computation components for performing computations. For example, memory components in computing circuit 430 may include one or more vector registers 432 and one or more scalar registers 436, and computation components in computing circuit 430 may include one or more reducers 434 and a scalar arithmetic-logic unit (ALU) 438. Computing circuit 430 can, via memory interface 420, read or write data from or to memory array 410. For example, memory interface 420 can be a wide input/output interface connecting memory array 410 and computing circuit 430, which provides a 1024-bit read/write per cycle (e.g., 2 nanoseconds).
Reference is now made to FIG. 5, which illustrates exemplary operations performed by PIM processing unit 400 for a top-k sorting method, consistent with some embodiments of the present disclosure. In the embodiments shown in FIG. 5, PIM processing unit 400 performs a top-k sorting method where the K value is greater than the number of data elements a vector can hold. For example, a 1024-bit vector can hold 32 data elements if the data elements are stored as 32-bit values. In response to the K value programmed in configuration register 450 being greater than 32, controller 460 can instruct computing circuit 430 to perform the method illustrated in FIG. 5.
In this scenario, memory array 410 includes multiple  logical blocks  412, 414, and 416 storing data elements. Computing circuit 430 receives the data elements from  logical blocks  412, 414, and 416, and further calculates a block maximum or minimum element for each of the  logical blocks  412, 414, and 416.
For example, computing circuit 430 can read a vector (e.g., a 1024-bit vector storing multiple data elements) , and compare a minimum value stored in the vector with a scalar value of the current minimum value in the current logical block. By repeating the processes  above and reading each vector in the current logical block, computing circuit 430 can obtain the minimum element in the current logical block, and a block identification (ID) associated with this block minimum element. Computing circuit 430 may also perform similar operations to obtain the maximum element in the current logical block, and the block identification (ID) associated with this block maximum element.
Computing circuit 430 can store the minimum (or maximum) element of each of  logical blocks  412, 414, and 416 in one entry of one of vector register (s) 432. Then, computing circuit 430 can use reducer (s) 434 to determine a global minimum (or maximum) element based on the block minimum (or maximum) elements of  logical blocks  412, 414, and 416. For example, a min reducer may be used to determine the global minimum element, and a max reducer may be used to determine the global maximum element.
After storing the global minimum (or maximum) element as one of the top K data elements, computing circuit 430 disables the stored global minimum (or maximum) element and repeats the above operations to obtain the new block minimum (or maximum) element for the logical block associated with the stored, and disabled, global minimum (or maximum) element.
After obtaining the new block minimum (or maximum) element for this logical block, computing circuit 430 can again use reducer (s) 434 to determine a second global minimum (or maximum) element based on the block minimum (or maximum) elements of  logical blocks  412, 414, and 416. Accordingly, computing circuit 430 can repeat above operations for K cycles to obtain the first to the Kth global minimum (or maximum) element, in order to determine the top K data elements, which may be the greatest K data elements or the smallest K data elements.
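The following Python sketch mirrors this block-wise flow: per-block extrema are reduced to a global extremum, which is then disabled in its block, for K cycles. The list-based blocks, the sentinel for exhausted blocks, and the assumption that K does not exceed the total element count are modeling choices of this sketch.

```python
def top_k_across_blocks(blocks, k, smallest=True):
    """Sketch of the FIG. 5 flow; each inner list stands in for one logical
    block of the memory array."""
    pick = min if smallest else max
    sentinel = float("inf") if smallest else float("-inf")
    blocks = [list(b) for b in blocks]    # local copies we can disable from
    best = [pick(b) if b else sentinel for b in blocks]  # per-block extrema
    result = []
    for _ in range(k):                    # K reduction cycles
        idx = best.index(pick(best))      # min/max reducer across the blocks
        result.append(best[idx])
        blocks[idx].remove(best[idx])     # disable the selected element
        best[idx] = pick(blocks[idx]) if blocks[idx] else sentinel  # rescan block
    return result
```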
Reference is now made to FIG. 6, which illustrates exemplary operations performed by PIM processing unit 400 for another top-k sorting method, consistent with some embodiments of the present disclosure. Compared to the embodiments shown in FIG. 5, in the embodiments shown in FIG. 6, PIM processing unit 400 performs a top-k sorting method where the K value is smaller than or equal to the number of data elements one vector can hold. For example, if data elements are stored as 32-bit values in 1024-bit vectors, then in response to the K value programmed in configuration register 450 being smaller than or equal to 32, controller 460 can instruct computing circuit 430 to perform the method illustrated in FIG. 6.
In this scenario, vector register 432 is configured to store the current minimum K values. Computing circuit 430 uses reducer (s) 434 (e.g., a max reducer) to get the maximum of the current minimum K values and stores this maximum value in scalar register (s) 436.
When computing circuit 430 reads a vector (e.g., a 1024-bit vector storing multiple data elements) from memory array 410, computing circuit 430 stores one or more minimum values in the vector in scalar register (s) 436. Scalar ALU 438 can communicate with scalar register (s) 436 and compare the one or more minimum values in the vector with the maximum of the current minimum K values. In response to the one or more minimum values in the vector being smaller than the maximum of the current minimum K values, computing circuit 430 can replace the maximum of the current minimum K values in vector register 432 with the one or more minimum values in the vector, and then recalculate the new maximum of the current minimum K values in vector register 432.
Computing circuit 430 can perform the operations above repeatedly until all data elements are read out and processed. Accordingly, the K values remaining in vector register 432 after this iterative process are the smallest K data elements.
Computing circuit 430 can perform similar operations to store the greatest K data elements in vector register 432, by storing the current maximum K values in vector register 432, comparing, by scalar ALU 438, the one or more maximum values of the vector read from memory array 410 with the minimum of the current maximum K values stored in scalar register (s) 436, and updating vector register (s) 432 based on the comparison result. Accordingly, computing circuit 430 can determine the top K data elements, which may be the greatest K data elements or the smallest K data elements.
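A NumPy sketch of this register-resident flow follows, assuming 1024-bit vectors of 32-bit data (32 elements per read) and a search for the smallest K; the array `current` and the float `bound` stand in for vector register 432 and scalar register 436, respectively.

```python
import numpy as np

def smallest_k_register(memory_rows, k):
    """Sketch of the FIG. 6 flow; each row stands in for one 1024-bit vector
    (32 data elements) read per cycle, with k assumed to be at most 32."""
    current = None                        # models vector register 432
    for row in memory_rows:
        row = np.asarray(row, dtype=float)
        if current is None:
            current = np.sort(row)[:k].copy()   # seed the current minimum K
            continue
        bound = current.max()             # max reducer -> scalar register 436
        for value in row:                 # scalar ALU comparisons
            if value < bound:
                current[current.argmax()] = value  # replace the current maximum
                bound = current.max()              # recompute the bound
    return np.sort(current)
```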
Reference is now made to FIG. 7A and FIG. 7B, which illustrate an exemplary PIM processing unit 700 consistent with some embodiments of the present disclosure. Similar to PIM processing unit 400 in FIG. 4, in some embodiments, PIM processing unit 700 may apply the same or similar architecture of accelerator architecture 200 shown in FIG. 2A and memory tile configuration (e.g., memory tile 300) shown in FIG. 3. Compared to PIM processing unit 400 in FIG. 4, in some embodiments, PIM processing unit 700 may include more memory components and computation components for performing computations. For example, computing circuit 430 in PIM processing unit 700 may further include a Static Random-Access Memory (SRAM) 732, a decoder 734, and a Single Instruction Multiple Data (SIMD) processor 736 including one or more adders, subtractors, multipliers, multiply-accumulators, or any combination thereof.
FIG. 7B illustrates how memory components and computation components in PIM processing unit 700 communicate and cooperate to perform various computation tasks. In some embodiments, PIM processing unit 700 can use SRAM 732 and decoder 734 to perform Product Quantization (PQ) compression methods to compress or reconstruct the data received  from memory array 410 or controller 460 for later data processing or operations in computing circuit 430.
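The disclosure does not spell out the PQ codec itself, so the Python sketch below shows only a generic product-quantization decode of the kind such a decoder could implement: each compressed vector is a tuple of codebook indices, and reconstruction concatenates the selected codewords. All names, shapes, and codebook sizes are assumptions of this sketch.

```python
import numpy as np

def pq_reconstruct(codes, codebooks):
    """Generic PQ decode: codes[m] indexes the m-th sub-codebook, and the
    reconstructed vector concatenates the selected codewords."""
    return np.concatenate([codebooks[m][codes[m]] for m in range(len(codes))])

# Toy example: an 8-dimensional vector split into 2 sub-vectors of 4 dims,
# each quantized against a 256-entry codebook (2 bytes per stored vector).
rng = np.random.default_rng(0)
codebooks = rng.standard_normal((2, 256, 4))   # [sub-space, codeword, dim]
codes = (17, 203)                              # one compressed vector
print(pq_reconstruct(codes, codebooks).shape)  # (8,)
```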
In some embodiments, computing circuit 430 may perform a similarity search or a k-means algorithm and use SIMD processor 736 and a sum reducer in reducer(s) 434 to calculate a distance between two vectors with high parallel computation scalability. After the computation, computing circuit 430 may store the calculated distance value in vector register(s) 432 or scalar register(s) 436. Computing circuit 430 can perform the top-k sorting operations by its register(s) (e.g., vector register(s) 432 or scalar register(s) 436) and a max reducer in reducer(s) 434 based on a heap sorting algorithm or any other suitable algorithm. Detailed operations are further discussed in the following paragraphs.
FIG. 8 illustrates a PIM-based accelerator architecture 800, consistent with some embodiments of the present disclosure. As shown in FIG. 8, in some embodiments, PIM processing units 810a-810n can be realized by PIM processing unit 400 in FIG. 4 or PIM processing unit 700 in FIG. 7A and can provide high scalability and high capacity. In PIM-based accelerator architecture 800 shown in FIG. 8, a PIM system may include multiple PIM processing units 810a-810n, and each of PIM processing units 810a-810n communicates with a host 820 through a handshake protocol, a double data rate (DDR) protocol, or any other suitable protocol.
In some embodiments, PIM processing units 810a-810n do not have direct communications. Alternatively stated, PIM processing units 810a-810n may only communicate with host 820, but the present disclosure is not limited thereto. In some other embodiments, it is also possible that part or all of PIM processing units 810a-810n directly communicate with another one or more PIM processing units 810a-810n through a proper protocol. In some  embodiments, PIM-based accelerator architecture 800 may include hundreds of or thousands of PIM processing units 810a-810n according to different capacity requirements in various applications. In general, PIM processing units 810a-810n in FIG. 8 can handle and process many different types of high-parallel computations and send final computation results to host 820. Accordingly, the data communication between PIM chips and host 820 can be reduced.
Reference is now made to FIG. 9, which illustrates exemplary operations performed by PIM processing unit 700 for a non-simultaneous computation to perform a similarity search, consistent with some embodiments of the present disclosure. As shown in FIG. 9, memory array 410 may include 4 DRAM blocks. In some embodiments, during the similarity search, PIM processing unit 700 calculates the distance values between vectors stored in the DRAM blocks and performs a top-k sorting method to sort the calculated distance values.
As discussed above, computing circuit 430 can calculate the distance values between the vectors by SIMD processor 736 and a sum reducer in reducer(s) 434 in highly parallel computations. The parallel computation outputs from SIMD processor 736 can be stored or accumulated in a vector accumulator 738.
For the non-simultaneous computation in FIG. 9, the distance values accumulated or stored in vector accumulator 738 can be written back to one of the DRAM blocks in memory array 410. Then, computing circuit 430 can access the distance values from memory array 410 and perform the top-k sorting operations using its register(s) (e.g., vector register(s) 432 or scalar register(s) 436) and reducer(s) 434 (e.g., a min reducer) based on a merge sorting algorithm or any other suitable algorithm. The minimum value of each DRAM block and a corresponding tag can be stored in the register(s) in computing circuit 430. Accordingly, computing circuit 430 can first use a min reducer in reducer(s) 434 to find the minimum value stored in the register(s), and then continue to find the corresponding minimum value in the DRAM block storing the distance values and output the minimum value to the host. That is, in the non-simultaneous computation in FIG. 9, the computation of the distance values and the top-k sorting operations are performed in different time periods.
Reference is now made to FIG. 10, which illustrates exemplary operations performed by PIM processing unit 700 for a simultaneous computation to perform the similarity search, consistent with some embodiments of the present disclosure. Compared to the non-simultaneous computation in FIG. 9, for the simultaneous computation shown in FIG. 10, the distance values can be stored in the register (s) and are not written back to memory array 410.
Computing circuit 430 performs the top-k sorting operations by its register(s) (e.g., vector register(s) 432 or scalar register(s) 436) and reducer(s) 434 (e.g., a max reducer) based on a heap sorting algorithm. Particularly, computing circuit 430 can use the max reducer in reducer(s) 434 to maintain the minimum top-k heap and use the register(s) to store the heap. Accordingly, in the simultaneous computation in FIG. 10, the computation of the distance values and the top-k sorting operations are performed simultaneously within computing circuit 430.
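As a software analogue of this simultaneous flow, the sketch below folds each distance into a bounded heap as soon as it is computed, so no distance value is written back to memory; the Python heap stands in for the register-stored heap, and the squared-Euclidean metric and all names are assumptions of this sketch.

```python
import heapq
import numpy as np

def nearest_k(query, vectors, k):
    """Sketch of the FIG. 10 flow: distance computation and top-k heap
    maintenance are interleaved within the same pass over the vectors."""
    query = np.asarray(query, dtype=float)
    heap = []  # max-heap of the k nearest so far, stored as (-distance, index)
    for idx, vec in enumerate(vectors):
        dist = float(np.sum((np.asarray(vec, dtype=float) - query) ** 2))
        if len(heap) < k:
            heapq.heappush(heap, (-dist, idx))
        elif -dist > heap[0][0]:          # closer than the current k-th nearest
            heapq.heapreplace(heap, (-dist, idx))
    return sorted((-negd, idx) for negd, idx in heap)
```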
Reference is now made to FIG. 11 and FIG. 12, which illustrate exemplary operations performed by PIM processing unit 700 for a k-means clustering computation, consistent with some embodiments of the present disclosure. K-means clustering is a method of vector quantization which aims to partition n observations into k sets (e.g., clusters) . Each observation is a vector and belongs to the cluster with a nearest mean. That is, each observation is assigned to the cluster with the closest cluster center (i.e., cluster centroid) .
Given an initial set of k means, the k-means clustering computation proceeds by alternating between an assignment step and an update step. FIG. 11 illustrates exemplary operations of the assignment step, in which PIM processing unit 700 assigns each vector to the cluster with the nearest mean (e.g., the cluster with the least squared Euclidean distance) . FIG. 12 illustrates exemplary operations of the update step, in which PIM processing unit 700 recalculates means, or centroids (i.e., data points, imaginary or real, at centers of clusters) , for vectors assigned to each cluster. For example, in some embodiments, centroids for clusters can be calculated and defined based on the following equation:
$$m_i^{(t+1)} = \frac{1}{\left| S_i^{(t)} \right|} \sum_{x_j \in S_i^{(t)}} x_j, \qquad i = 1, \ldots, k,$$
where $x_1$ to $x_n$ are the $n$ vectors to be clustered, and $m_1^{(t)}$ to $m_k^{(t)}$ respectively indicate the centroids for the $k$ clusters $S_1^{(t)}$ to $S_k^{(t)}$ in the $t$-th iteration.
As shown in FIG. 11, memory array 410 stores vectors in its memory blocks 1010 and stores the current centroids (means) of the clusters in one of its row buffers 1020. In some embodiments, when a row buffer 1020 is assigned to store the current centroids, the associated memory block is not assigned to store vectors, so vectors stored in memory array 410 can be read out via their respective row buffers accordingly. Computing circuit 430 can read out a feature vector from memory array 410 and all centroids from row buffer 1020 to register (s) 432 or 436. Accordingly, computing circuit 430 can calculate the distance values between the feature vector and the centroids using SIMD processor 736 and a sum reducer in reducer (s) 434 in highly parallel computations, and store the calculated distances between the feature vector and the centroids in SRAM 732.
Then, computing circuit 430 can use a min reducer in reducer (s) 434 to find the minimum value stored in SRAM 732 and thereby assign the feature vector to the cluster with the nearest mean. Accordingly, computing circuit 430 can mark the feature vector with a cluster identification (ID), which indicates the centroid that is nearest to the feature vector, and write the feature vector with the cluster ID back to memory array 410.
As shown in FIG. 12, in the update step, computing circuit 430 reads out vectors marked with a cluster ID from memory array 410 to register (s) 432 or 436, and calculates an updated centroid from the one or more vectors marked with the corresponding cluster ID (e.g., the same cluster ID) using SIMD processor 736. The updated centroid can then be written back to row buffer 1020. In the update step, changing the access order of vectors and centroids reduces random memory access in the k-means clustering computation.
By repeating the assignment step and the update step, computing circuit 430 can cluster the vectors stored in memory array 410 until convergence is reached. In response to the centroids being unchanged after the update step (e.g., no vector is assigned to a different cluster in the assignment step), computing circuit 430 may output a cluster result to the host or store the cluster result to memory array 410.
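For clarity, a compact software model of the assignment and update steps of FIGs. 11 and 12 is sketched below under simplifying assumptions: NumPy arrays stand in for memory blocks 1010 and row buffer 1020, and the convergence test is illustrative rather than a description of the hardware.

```python
import numpy as np

def kmeans_pim_model(vectors, centroids, max_iters=100):
    """vectors: (n, d) array standing in for memory blocks 1010;
    centroids: (k, d) array standing in for row buffer 1020."""
    ids = np.zeros(len(vectors), dtype=int)
    for _ in range(max_iters):
        # Assignment step: squared Euclidean distances (SIMD-style
        # broadcast), then a min reduction yields each cluster ID.
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        ids = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        new = np.array([vectors[ids == c].mean(axis=0)
                        if np.any(ids == c) else centroids[c]
                        for c in range(len(centroids))])
        if np.allclose(new, centroids):   # centroids unchanged: converged
            break
        centroids = new
    return ids, centroids
```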
FIG. 13 illustrates an exemplary flow diagram for performing a data processing method 1300 on a PIM-based accelerator architecture, consistent with some embodiments of the present disclosure. The PIM-based accelerator architecture (e.g., accelerator architecture 200 in FIG. 2A, memory tile 300 in FIG. 3, and PIM-based accelerator architecture 800 in FIG. 8) can be used for performing top-k sorting, k-means clustering, or similarity search according to some embodiments of the present disclosure. Particularly, any PIM processing unit (e.g., PIM processing units 810a-810n in FIG. 8) in the PIM-based accelerator architecture can select between computation modes for the PIM processing unit based on a configuration from the host (e.g., host 820 in FIG. 8) communicatively connected to the PIM processing unit. The computation modes may include one or more of a first top-k sorting mode, a second top-k sorting mode, and a k-means clustering mode. According to the selected mode, the PIM processing unit can access data elements in the memory array to perform corresponding operations.
Data processing method 1300 in FIG. 13 illustrates operations for a top-k sorting computation performed by the PIM-based accelerator architecture when a top-k sorting mode is selected. At step 1310, the PIM processing unit (e.g., PIM processing units 810a-810n in FIG. 8) in the PIM-based accelerator architecture receives the configuration from the host (e.g., host 820 in FIG. 8) communicatively connected to the PIM processing unit. The PIM processing unit can select between computation modes based on the configuration. In data processing method 1300, the computation modes include a first top-k sorting mode and a second top-k sorting mode.
At step 1320, the PIM processing unit determines whether to operate in the first top-k sorting mode or the second top-k sorting mode. Specifically, the PIM processing unit compares a K value in the configuration with a threshold value. In response to the K value in the configuration being greater than the threshold value (step 1320 -Y) , the PIM processing unit selects the first top-k sorting mode and performs steps 1331-1338. In response to the K value in the configuration being smaller than or equal to the threshold value (step 1320 -N) , the PIM processing unit selects the second top-k sorting mode and performs steps 1341-1346.
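A minimal sketch of this mode selection is shown below; the threshold value and the mode names are illustrative assumptions, since the disclosure does not fix a particular threshold.

```python
def select_topk_mode(k_value, threshold):
    # Large K: iterate over per-block extrema (first mode, steps 1331-1338).
    # Small K: keep all K candidates resident in a register
    # (second mode, steps 1341-1346).
    return "first_topk_mode" if k_value > threshold else "second_topk_mode"
```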
In the first top-k sorting mode, at step 1331, the computing circuit of the PIM processing unit receives the data elements from multiple logical blocks in the memory array.
At step 1332, the computing circuit calculates, for each logical block, a block maximum (or minimum) element. For example, if the top-k sorting mode is used for determining the K greatest data elements, the computing circuit calculates the block maximum element. On the other hand, if the top-k sorting mode is used for determining the K smallest data elements, the computing circuit calculates the block minimum element.
At step 1333, the computing circuit stores the block maximum (or minimum) elements for the logical blocks in one or more vector registers in the computing circuit. Then, the computing circuit repeats steps 1334-1338 until the top K data elements are determined.
At step 1334, the computing circuit determines a global maximum (or minimum) element based on the block maximum (or minimum) elements for the logical blocks. At step 1335, the computing circuit stores the determined global maximum (or minimum) element as one of the top K data elements. After the global maximum (or minimum) element is stored, at step 1336, the computing circuit disables the global maximum (or minimum) element in the associated logical block. At step 1337, the computing circuit obtains a next block maximum (or minimum) element for the logical block associated with the stored and disabled global maximum (or minimum) element.
That is, by steps 1336 and 1337, when the global maximum (or minimum) element is determined and stored, the computing circuit only needs to recalculate and update one new block maximum (or minimum) element for the corresponding logical block, and the block maximum (or minimum) elements for the other logical blocks can be reused in the next iteration to determine the next global maximum (or minimum) element.
At step 1338, the computing circuit determines whether all top K data elements have been determined and obtained. If all top K data elements are determined and stored (step 1338 -yes), the computing circuit performs step 1350 and outputs the top K data elements among the data elements to the host or to the memory array. Otherwise (step 1338 -no), steps 1334-1338 are repeated to sequentially determine and store the top K data elements.
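A minimal Python sketch of steps 1331-1338 for the K greatest data elements follows; Python lists stand in for the logical blocks and the vector register, and disabling an element is modeled by overwriting it with negative infinity.

```python
import math

def topk_first_mode(blocks, k):
    """blocks: list of lists standing in for the logical blocks.
    Returns the K greatest data elements."""
    blocks = [list(b) for b in blocks]         # work on local copies
    block_max = [max(b) for b in blocks]       # steps 1332-1333
    top_k = []
    while len(top_k) < k:                      # steps 1334-1338
        g = max(range(len(blocks)), key=block_max.__getitem__)
        top_k.append(block_max[g])             # steps 1334-1335
        blocks[g][blocks[g].index(block_max[g])] = -math.inf  # step 1336
        block_max[g] = max(blocks[g])          # step 1337: refresh one max
    return top_k

# Example: top 4 elements across 3 logical blocks.
print(topk_first_mode([[3, 9, 1], [7, 2], [8, 5, 6]], 4))  # [9, 8, 7, 6]
```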
When the second top-k sorting mode is selected based on the determination at step 1320, at step 1341, the computing circuit of the PIM processing unit first stores K initial data elements from the memory array to a first register. Then, the computing circuit repeatedly updates the first register by repeating steps 1342-1346, until all the data elements from the memory array are received and processed.
At step 1342, the computing circuit selects a maximum (or minimum) element in the first register as a target element. For example, if the top-k sorting mode is used for determining the K greatest data elements, the computing circuit selects the minimum element in the first register as the target element. On the other hand, if the top-k sorting mode is used for determining the K smallest data elements, the computing circuit selects the maximum element in the first register as the target element. That is, the target element is the element that may be evicted from the first register and replaced by another data element during the following updating process.
At step 1343, the computing circuit determines a top K candidate from one or more remaining data elements received from the memory array. For example, the remaining data elements can be new data read from a not-yet-processed vector of the DRAM data array. The computing circuit can receive the vector from the memory array and select the maximum (or minimum) element in the vector as the top K candidate (the maximum when determining the K greatest data elements, or the minimum when determining the K smallest data elements).
At step 1344, the computing circuit compares the top K candidate and the target element. If the top-k sorting mode is used for determining the K greatest data elements, the computing circuit may determine whether the top K candidate is greater than the target element currently stored in the first register. If the top-k sorting mode is used for determining the K smallest data elements, the computing circuit may determine whether the top K candidate is smaller than the target element currently stored in the first register. Accordingly, the computing circuit can determine whether to replace the target element in the first register with the top K candidate based on the comparison result. Particularly, in some embodiments, the computing circuit can store the target element and the top K candidate in a second register in the computing circuit and compare the top K candidate with the target element by a scalar ALU to obtain the comparison result.
If the computing circuit determines that the target element in the first register should be replaced (step 1344 –yes) , the computing circuit performs step 1345 to replace the target element with the top K candidate. Otherwise (step 1344 –no) , step 1345 is bypassed and the data stored in the first register remains unchanged.
At step 1346, the computing circuit determines whether all data elements of interest in the memory array are processed. If there are remaining data elements (e.g., data in a vector which has not been processed) to be processed (step 1346 –no) , steps 1342-1346 are repeated to update the current top K data elements stored in the first register. When all data elements of interest in the memory array are processed (step 1346 –yes) , the computing circuit performs step 1350 and outputs the top K data elements stored in the first register to the host or to the memory array.
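The following sketch models steps 1341-1346 for the K greatest data elements under simplifying assumptions: the data elements are flattened into one stream, and the per-vector pre-selection of a top K candidate at step 1343 is elided.

```python
def topk_second_mode(data_vectors, k):
    """data_vectors: iterable of lists, each standing in for one vector
    read from the memory array; the stream must hold at least K values."""
    stream = [x for vec in data_vectors for x in vec]
    register = stream[:k]                     # step 1341: K initial elements
    for cand in stream[k:]:                   # remaining data elements
        target = min(register)                # step 1342: target element
        if cand > target:                     # step 1344: scalar comparison
            register[register.index(target)] = cand   # step 1345: replace
    return sorted(register, reverse=True)     # step 1350: output top K

# Example: top 3 across two vectors.
print(topk_second_mode([[4, 1, 7], [3, 9, 2, 8]], 3))  # [9, 8, 7]
```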
Accordingly, by the above operations, data processing method 1300 in FIG. 13 can realize a top-k sorting computation to output the K greatest (or smallest) data elements, in which the value of K may be greater or smaller than the number of elements that a vector register can hold. Particularly, if the value of K is an integer greater than the threshold value, the first top-k sorting mode is selected, and if the value of K is an integer smaller than or equal to the threshold value, the second top-k sorting mode is selected.
FIG. 14 illustrates an exemplary flow diagram for performing another data processing method 1400 on a PIM-based accelerator architecture, consistent with some embodiments of the present disclosure. Data processing method 1400 in FIG. 14 illustrates operations for a k-means clustering computation performed by a PIM-based accelerator architecture (e.g., accelerator architecture 200 in FIG. 2A, memory tile 300 in FIG. 3, and PIM-based accelerator architecture 800 in FIG. 8) , when the k-means clustering mode is selected.
In the k-means clustering mode, at step 1410, at least one PIM processing unit (e.g., PIM processing units 810a-810n in FIG. 8) initializes centroids for clustering. For example, in order to have K clusters, the PIM processing unit may provide K initial centroids and store the initial centroids in a row buffer of the memory array. Then, the PIM processing unit clusters multiple vectors stored in the memory array by repeating an assignment step 1420 and an update step 1430. In assignment step 1420, the PIM processing unit assigns each of the vectors to one of the current clusters based on the distances between the vectors and the centroids of the clusters. In update step 1430, the PIM processing unit updates each of the centroids for the clusters based on the corresponding vectors. Each centroid is the mean of the vector (s) assigned to the same cluster in assignment step 1420.
Particularly, assignment step 1420 includes sub-steps 1421-1425. At step 1421, the computing circuit receives the centroids from the row buffer. At step 1422, the computing circuit receives a feature vector selected from the vectors from the memory array. At step 1423, the computing circuit marks a cluster identification to the feature vector, in which the cluster identification indicates the centroid that is nearest to the feature vector. At step 1424, the computing circuit writes the feature vector with the cluster identification back to the memory array.
At step 1425, the computing circuit determines whether all vectors of interest in the memory array have been assigned to an associated cluster. If there are remaining vectors to be processed (step 1425 –no), steps 1421-1425 are repeated to assign the remaining vectors. When all vectors of interest in the memory array are assigned (step 1425 –yes), the computing circuit enters update step 1430.
Update step 1430 also includes sub-steps 1431 and 1432. At step 1431, the computing circuit calculates an updated centroid according to one or more vectors marked with the corresponding cluster identification. At step 1432, the computing circuit determines whether the centroids for all cluster identifications have been updated based on the latest assignment result obtained in step 1420. If there are remaining centroids to be updated (step 1432 –no), steps 1431 and 1432 are repeated.
When all centroids are updated (step 1432 –yes), at step 1440, the computing circuit checks whether the centroids are unchanged after step 1430 in the current cycle. If one or more centroids have changed (step 1440 –no), the PIM processing unit repeats assignment step 1420 and update step 1430 until convergence is reached.
In response to the centroids being unchanged after the updating operation (step 1440 –yes), the PIM processing unit performs step 1450 and outputs a cluster result to the host or stores the cluster result to the memory array. Accordingly, by the above operations, data processing method 1400 in FIG. 14 can realize a k-means clustering computation and output the cluster result.
In view of the above, as proposed in various embodiments of the present disclosure, the proposed devices and methods can take advantage of the high bandwidth of DRAM and guarantee efficient, parallel, and prompt computation. With the wide input/output (e.g., a 1024-bit read/write per cycle) between the memory array (e.g., DRAM data arrays) and the computing circuit, the memory performance bottleneck in similarity search and k-means computations can be substantially reduced.
In addition, by performing the k-means clustering computation using the proposed devices and methods, unnecessary data movement between the memory array and the computing circuit can be minimized. Further, random memory access is reduced by storing the centroids in one of the row buffers and changing the access order of vectors and centroids in the update step of the k-means clustering computation. Accordingly, the overall efficiency of k-means clustering can be improved. In some embodiments, the efficiency of similarity search can depend only on the ratio of bandwidth to memory capacity. As data increases, computation time can still be kept within tens of milliseconds for similarity searches.
Embodiments of the present disclosure can be applied to many products, environments, and scenarios. For example, some embodiments of the present disclosure can be applied to Processing-in-Memory (PIM), such as Processing-in-Memory for AI (PIM-AI), that includes DRAM-based processing units. Some embodiments of the present disclosure can also be applied to a Tensor Processing Unit (TPU), Data Processing Unit (DPU), Neural network Processing Unit (NPU), or the like.
Embodiments of the disclosure also provide a computer program product. The computer program product may include a non-transitory computer readable storage medium having computer readable program instructions thereon for causing a processor to carry out the above-described methods.
The computer readable storage medium may be a tangible device that can store instructions for use by an instruction execution device. The computer readable storage medium  may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM) , a read-only memory (ROM) , an erasable programmable read-only memory (EPROM) , a static random access memory (SRAM) , a portable compact disc read-only memory (CD-ROM) , a digital versatile disk (DVD) , a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing.
The computer readable program instructions for carrying out the above-described methods may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language, and conventional procedural programming languages. The computer readable program instructions may execute entirely on a computer system as a stand-alone software package, or partly on a first computer and partly on a second computer remote from the first computer. In the latter scenario, the second, remote computer may be connected to the first computer through any type of network, including a local area network (LAN) or a wide area network (WAN) .
The computer readable program instructions may be provided to a processor of a computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the above-described methods.
The flow charts and diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods, and computer program products according to various embodiments of the specification. In this regard, a block in the flow charts or diagrams may represent a software program, segment, or portion of code, which includes one or more executable instructions for implementing specific functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the diagrams or flow charts, and combinations of blocks in the diagrams and flow charts, may be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It is appreciated that certain features of the specification, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the specification, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination or as suitable in any other described embodiment of the specification. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
The embodiments may further be described using the following clauses:
1. A processing-in-memory (PIM) device comprising:
a memory array configured to store data; and
a computing circuit configured to execute a set of instructions to cause the PIM device to:
select between a plurality of computation modes based on a configuration from a host communicatively coupled to the PIM device, wherein the plurality of computation modes comprise a first sorting mode and a second sorting mode;
access a plurality of data elements in a memory array of the PIM device; and
output top K data elements among the plurality of data elements to the memory array or to the host in the first sorting mode or the second sorting mode,
wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
2. The PIM device of clause 1, the computing circuit further comprising a vector register, wherein in the first sorting mode, the computing circuit receives the plurality of data elements from a plurality of logical blocks in the memory array, and wherein the computing circuit is further configured to execute the set of instructions to cause the PIM device to determine the top K data elements by:
calculating, for each of the plurality of logical blocks, a block maximum or minimum element;
storing the block maximum or minimum elements for the plurality of logical blocks in the vector register; and
repeating following operations until the top K data elements are determined:
determining a global maximum or minimum element based on the block maximum or minimum elements for the plurality of logical blocks;
storing the global maximum or minimum element as one of the top K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining a next block maximum or minimum element for the logical block associated with the disabled global maximum or minimum element.
3. The PIM device of  clauses  1 or 2, the computing circuit comprising a first register, wherein in the second sorting mode, the computing circuit is further configured to execute the set of instructions to cause the PIM device to determine the top K data elements by:
storing a plurality of initial data elements from the memory array to the first register; and
updating the first register until the plurality of data elements from the memory array are received and processed, by repeating following operations:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate from one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate based on a comparison result of the candidate and the target element.
4. The PIM device of clause 3, the computing circuit comprising a second register and a scalar arithmetic-logic unit (ALU) , wherein the computing circuit is further configured to execute the set of instructions to cause the PIM device to determine whether to replace the target element in the first register with the candidate by:
storing the target element and the candidate in the second register; and
comparing, by the scalar ALU, the candidate with the target element.
5. The PIM device of any one of clauses 1-4, wherein the computing circuit is further configured to execute the set of instructions to cause the PIM device to:
in response to a K value in the configuration being greater than the threshold value, select the first sorting mode; and
in response to the K value in the configuration being smaller than or equal to the threshold value, select the second sorting mode.
6. The PIM device of any one of clauses 1-5, wherein the plurality of computation modes comprise a k-means clustering mode, wherein in the k-means clustering mode, the computing circuit is further configured to execute the set of instructions to cause the PIM device to:
cluster a plurality of vectors stored in the memory array, by repeating:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
7. The PIM device of clause 6, wherein in the k-means clustering mode, the computing circuit is further configured to execute the set of instructions to cause the PIM device to assign each of the plurality of vectors by:
receiving, from a row buffer of the memory array, the plurality of centroids;
receiving, from the memory array, a feature vector selected from the plurality of vectors;
marking a cluster identification to the feature vector, the cluster identification indicating one of the plurality of centroids being nearest to the feature vector; and
writing the feature vector with the cluster identification back to the memory array.
8. The PIM device of clause 7, wherein in the k-means clustering mode, the computing circuit is further configured to execute the set of instructions to cause the PIM device to update each of the plurality of centroids by:
calculating an updated centroid according to one or more vectors in the memory array with a corresponding cluster identification.
9. The PIM device of any one of clauses 6-8, wherein in the k-means clustering mode, the computing circuit is further configured to execute the set of instructions to cause the PIM device to:
in response to the plurality of centroids being unchanged after updating each of the plurality of centroids, output a cluster result to the host or to the memory array.
10. The PIM device of any one of clauses 1-9, further comprising:
a host interface configured to communicate the PIM device with the host;
a configuration register configured to store the configuration from the host; and
a controller configured to send the set of instructions according to the configuration.
11. The PIM device of any one of clauses 1-10, wherein the computing circuit further comprises:
one or more registers configured to store data for computation;
a memory device configured to store the set of instructions; and
a decoder configured to decode the set of instructions.
12. The PIM device of any one of clauses 1-11, wherein the computing circuit further comprises one or more single instruction multiple data (SIMD) units, one or more reducer units, one or more arithmetic-logic unit (ALU) units, or any combination thereof.
13. The PIM device of any one of clauses 1-12, wherein the memory array comprises a DRAM array, and the PIM device further comprises an input/output interface communicating the DRAM array and the computing circuit.
14. A data processing method, comprising:
selecting between a plurality of computation modes for a processing-in-memory (PIM) device based on a configuration, wherein the plurality of computation modes comprise a first sorting mode and a second sorting mode;
accessing a plurality of data elements in a memory array of the PIM device; and
in the first sorting mode or the second sorting mode, outputting top K data elements among the plurality of data elements to the memory array or to a host communicatively coupled to the PIM device,
wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
15. The data processing method of clause 14, further comprising:
when the PIM device is configured to operate in the first sorting mode:
receiving the plurality of data elements from a plurality of logical blocks in the memory array; and
determining the top K data elements by:
calculating and storing, for each of the plurality of logical blocks, a block maximum or minimum element; and
repeating following operations until the top K data elements are determined:
determining a global maximum or minimum element based on the block maximum or minimum elements for the plurality of logical blocks;
storing the global maximum or minimum element as one of the top K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining a next block maximum or minimum element for the logical block associated with the disabled global maximum or minimum element.
16. The data processing method of clauses 14 or 15, further comprising:
when the PIM device is configured to operate in the second sorting mode, determining the top K data elements by:
storing a plurality of initial data elements from the memory array to a first register; and
updating the first register until the plurality of data elements from the memory array are received and processed, by repeating following operations:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate from one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate based on a comparison result of the candidate and the target element.
17. The data processing method of clause 16, wherein determining whether to replace the target element in the first register with the candidate comprises:
storing the target element and the candidate in a second register; and
comparing the candidate with the target element.
18. The data processing method of any one of clauses 14-17, wherein the plurality of computation modes further comprise a k-means clustering mode, the data processing method further comprising:
when the PIM device is configured to operate in the k-means clustering mode, cluster a plurality of vectors stored in the memory array by repeating:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
19. The data processing method of clause 18, wherein assigning each of the plurality of vectors comprises:
receiving the plurality of centroids;
receiving a feature vector selected from the plurality of vectors;
marking a cluster identification to the feature vector, the cluster identification indicating one of the plurality of centroids being nearest to the feature vector; and
writing the feature vector with the cluster identification back to the memory array.
20. The data processing method of clause 19, further comprising:
when the PIM device is configured to operate in the k-means clustering mode, updating each of the plurality of centroids by calculating an updated centroid according to one or more vectors in the memory array with a corresponding cluster identification.
21. The data processing method of any one of clauses 18-20, further comprising:
when the PIM device is configured to operate in the k-means clustering mode, in response to the plurality of centroids being unchanged after updating each of the plurality of centroids, output a cluster result to the host or to the memory array.
22. A non-transitory computer readable medium that stores a set of instructions that is executable by one or more computing circuits of an apparatus to cause the apparatus to initiate a data processing method, the data processing method comprising:
selecting between a plurality of computation modes based on a configuration, wherein the plurality of computation modes comprise a first sorting mode and a second sorting mode;
accessing a plurality of data elements in a memory array of the apparatus; and
in the first sorting mode or the second sorting mode, outputting top K data elements among the plurality of data elements to the memory array or to a host communicatively connecting to the apparatus,
wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
23. The non-transitory computer-readable medium of clause 22, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the first sorting mode:
receiving the plurality of data elements from a plurality of logical blocks in the memory array, and
determining the top K data elements by:
calculating and storing, for each of the plurality of logical blocks, a block maximum or minimum element; and
repeating following operations until the top K data elements are determined:
determining a global maximum or minimum element based on the block maximum or minimum elements for the plurality of logical blocks;
storing the global maximum or minimum element as one of the top K data elements;
disabling the global maximum or minimum element in its logical block; and
obtaining a next block maximum or minimum element for the logical block associated with the disabled global maximum or minimum element.
24. The non-transitory computer-readable medium of clauses 22 or 23, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the second sorting mode:
determining the top K data elements by:
storing a plurality of initial data elements from the memory array to a first register; and
updating the first register until the plurality of data elements from the memory array are received and processed, by repeating following operations:
selecting a maximum or minimum element in the first register as a target element;
determining a candidate from one or more remaining data elements received from the memory array; and
determining whether to replace the target element in the first register with the candidate based on a comparison result of the candidate and the target element.
25. The non-transitory computer-readable medium of clause 24, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to determine whether to replace the target element in the first register with the candidate by:
storing the target element and the candidate in a second register; and
comparing the candidate with the target element.
26. The non-transitory computer-readable medium of any one of clauses 22-25, wherein the plurality of computation modes comprise a k-means clustering mode, and the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the k-means clustering mode:
clustering a plurality of vectors stored in the memory array, by repeating:
assigning each of the plurality of vectors to one of a plurality of clusters; and
updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
27. The non-transitory computer-readable medium of clause 26, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to assign each of the plurality of vectors by:
receiving, from a row buffer of the memory array, the plurality of centroids;
receiving, from the memory array, a feature vector selected from the plurality of vectors;
marking a cluster identification to the feature vector, the cluster identification indicating one of the plurality of centroids being nearest to the feature vector; and
writing the feature vector with the cluster identification back to the memory array.
28. The non-transitory computer-readable medium of clause 27, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to update each of the plurality of centroids by:
calculating an updated centroid according to one or more vectors in the memory array with a corresponding cluster identification.
29. The non-transitory computer-readable medium of any one of clauses 26-28, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the k-means clustering mode:
in response to the plurality of centroids being unchanged after updating each of the plurality of centroids, outputting a cluster result to the host or to the memory array.
30. A system for processing data, comprising:
a host; and
a plurality of processing-in-memory (PIM) devices communicatively coupled to the host, wherein any of the plurality of PIM devices comprises a memory array configured to store data and a computing circuit configured to execute a set of instructions to cause the PIM device to:
select between a plurality of computation modes based on a configuration from the host, the plurality of computation modes comprising a first sorting mode and a second sorting mode;
access a plurality of data elements in a memory array of the PIM device; and
in the first sorting mode or the second sorting mode, output top K data elements among the plurality of data elements to the host,
wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
In the foregoing specification, embodiments have been described with reference to numerous specific details that can vary from implementation to implementation. Certain adaptations and modifications of the described embodiments can be made. Other embodiments can be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. It is intended that the specification and examples be considered as exemplary only. It is also intended that the sequence of steps shown in figures are only for illustrative purposes and are not intended to be limited to any particular sequence of steps. As such, those skilled in the art can appreciate that these steps can be performed in a different order while implementing the same method. In the drawings and specification, there have been disclosed exemplary embodiments. However, many variations and modifications can be made to these embodiments. Accordingly, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation, the scope of the embodiments being defined by the following claims.

Claims (22)

  1. A processing-in-memory (PIM) device comprising:
    a memory array configured to store data; and
    a computing circuit configured to execute a set of instructions to cause the PIM device to:
    select between a plurality of computation modes based on a configuration from a host communicatively coupled to the PIM device, wherein the plurality of computation modes comprise a first sorting mode and a second sorting mode;
    access a plurality of data elements in a memory array of the PIM device; and
    output top K data elements among the plurality of data elements to the memory array or to the host in the first sorting mode or the second sorting mode,
    wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  2. The PIM device of claim 1, the computing circuit further comprising a vector register, wherein in the first sorting mode, the computing circuit receives the plurality of data elements from a plurality of logical blocks in the memory array, and wherein the computing circuit is further configured to execute the set of instructions to cause the PIM device to determine the top K data elements by:
    calculating, for each of the plurality of logical blocks, a block maximum or minimum element;
    storing the block maximum or minimum elements for the plurality of logical blocks in the vector register; and
    repeating following operations until the top K data elements are determined:
    determining a global maximum or minimum element based on the block maximum or minimum elements for the plurality of logical blocks;
    storing the global maximum or minimum element as one of the top K data elements;
    disabling the global maximum or minimum element in its logical block; and
    obtaining a next block maximum or minimum element for the logical block associated with the disabled global maximum or minimum element.
  3. The PIM device of claim 1, the computing circuit comprising a first register, wherein in the second sorting mode, the computing circuit is further configured to execute the set of instructions to cause the PIM device to determine the top K data elements by:
    storing a plurality of initial data elements from the memory array to the first register; and
    updating the first register until the plurality of data elements from the memory array are received and processed, by repeating following operations:
    selecting a maximum or minimum element in the first register as a target element;
    determining a candidate from one or more remaining data elements received from the memory array; and
    determining whether to replace the target element in the first register with the candidate based on a comparison result of the candidate and the target element.
  4. The PIM device of claim 3, the computing circuit comprising a second register and a scalar arithmetic-logic unit (ALU) , wherein the computing circuit is further configured to execute the set of instructions to cause the PIM device to determine whether to replace the target element in the first register with the candidate by:
    storing the target element and the candidate in the second register; and
    comparing, by the scalar ALU, the candidate with the target element.
  5. The PIM device of claim 1, wherein the computing circuit is further configured to execute the set of instructions to cause the PIM device to:
    in response to a K value in the configuration being greater than the threshold value, select the first sorting mode; and
    in response to the K value in the configuration being smaller than or equal to the threshold value, select the second sorting mode.
  6. The PIM device of claim 1, wherein the plurality of computation modes comprise a k-means clustering mode, wherein in the k-means clustering mode, the computing circuit is further configured to execute the set of instructions to cause the PIM device to:
    cluster a plurality of vectors stored in the memory array, by repeating:
    assigning each of the plurality of vectors to one of a plurality of clusters; and
    updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
  7. The PIM device of claim 6, wherein in the k-means clustering mode, the computing circuit is further configured to execute the set of instructions to cause the PIM device to assign each of the plurality of vectors by:
    receiving, from a row buffer of the memory array, the plurality of centroids;
    receiving, from the memory array, a feature vector selected from the plurality of vectors;
    marking a cluster identification to the feature vector, the cluster identification indicating one of the plurality of centroids being nearest to the feature vector; and
    writing the feature vector with the cluster identification back to the memory array.
  8. The PIM device of claim 1, further comprising:
    a host interface configured to communicate the PIM device with the host;
    a configuration register configured to store the configuration from the host; and
    a controller configured to send the set of instructions according to the configuration.
  9. The PIM device of claim 1, wherein the computing circuit further comprises:
    one or more registers configured to store data for computation;
    a memory device configured to store the set of instructions; and
    a decoder configured to decode the set of instructions.
  10. The PIM device of claim 1, wherein the computing circuit further comprises one or more single instruction multiple data (SIMD) units, one or more reducer units, one or more arithmetic-logic unit (ALU) units, or any combination thereof.
  11. A data processing method, comprising:
    selecting between a plurality of computation modes for a processing-in-memory (PIM) device based on a configuration, wherein the plurality of computation modes comprise a first sorting mode and a second sorting mode;
    accessing a plurality of data elements in a memory array of the PIM device; and
    in the first sorting mode or the second sorting mode, outputting top K data elements among the plurality of data elements to the memory array or to a host communicatively coupled to the PIM device,
    wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  12. The data processing method of claim 11, further comprising:
    when the PIM device is configured to operate in the first sorting mode:
    receiving the plurality of data elements from a plurality of logical blocks in the memory array; and
    determining the top K data elements by:
    calculating and storing, for each of the plurality of logical blocks, a block maximum or minimum element; and
    repeating following operations until the top K data elements are determined:
    determining a global maximum or minimum element based on the block maximum or minimum elements for the plurality of logical blocks;
    storing the global maximum or minimum element as one of the top K data elements;
    disabling the global maximum or minimum element in its logical block; and
    obtaining a next block maximum or minimum element for the logical block associated with the disabled global maximum or minimum element.
  13. The data processing method of claim 11, further comprising:
    when the PIM device is configured to operate in the second sorting mode, determining the top K data elements by:
    storing a plurality of initial data elements from the memory array to a first register; and
    updating the first register until the plurality of data elements from the memory array are received and processed, by repeating following operations:
    selecting a maximum or minimum element in the first register as a target element;
    determining a candidate from one or more remaining data elements received from the memory array; and
    determining whether to replace the target element in the first register with the candidate based on a comparison result of the candidate and the target element.
  14. The data processing method of claim 13, wherein determining whether to replace the target element in the first register with the candidate comprises:
    storing the target element and the candidate in a second register; and
    comparing the candidate with the target element.
  15. The data processing method of claim 11, wherein the plurality of computation modes further comprise a k-means clustering mode, the data processing method further comprising:
    when the PIM device is configured to operate in the k-means clustering mode, cluster a plurality of vectors stored in the memory array by repeating:
    assigning each of the plurality of vectors to one of a plurality of clusters; and
    updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
  16. A non-transitory computer readable medium that stores a set of instructions that is executable by one or more computing circuits of an apparatus to cause the apparatus to initiate a data processing method, the data processing method comprising:
    selecting between a plurality of computation modes based on a configuration, wherein the plurality of computation modes comprise a first sorting mode and a second sorting mode;
    accessing a plurality of data elements in a memory array of the apparatus; and
    in the first sorting mode or the second sorting mode, outputting top K data elements among the plurality of data elements to the memory array or to a host communicatively connecting to the apparatus,
    wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
  17. The non-transitory computer-readable medium of claim 16, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the first sorting mode:
    receiving the plurality of data elements from a plurality of logical blocks in the memory array, and
    determining the top K data elements by:
    calculating and storing, for each of the plurality of logical blocks, a block maximum or minimum element; and
    repeating following operations until the top K data elements are determined:
    determining a global maximum or minimum element based on the block maximum or minimum elements for the plurality of logical blocks;
    storing the global maximum or minimum element as one of the top K data elements;
    disabling the global maximum or minimum element in its logical block; and
    obtaining a next block maximum or minimum element for the logical block associated with the disabled global maximum or minimum element.
  18. The non-transitory computer-readable medium of claim 16, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the second sorting mode:
    determining the top K data elements by:
    storing a plurality of initial data elements from the memory array to a first register; and
    updating the first register until the plurality of data elements from the memory array are received and processed, by repeating following operations:
    selecting a maximum or minimum element in the first register as a target element;
    determining a candidate from one or more remaining data elements received from the memory array; and
    determining whether to replace the target element in the first register with the candidate based on a comparison result of the candidate and the target element.
  19. The non-transitory computer-readable medium of claim 18, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to determine whether to replace the target element in the first register with the candidate by:
    storing the target element and the candidate in a second register; and
    comparing the candidate with the target element.
  20. The non-transitory computer-readable medium of claim 16, wherein the plurality of computation modes comprise a k-means clustering mode, and the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to further perform, in the k-means clustering mode:
    clustering a plurality of vectors stored in the memory array, by repeating:
    assigning each of the plurality of vectors to one of a plurality of clusters; and
    updating a plurality of centroids for the plurality of clusters, wherein each of the plurality of centroids is a mean of one or more corresponding vectors assigned to the same cluster.
  21. The non-transitory computer-readable medium of claim 20, wherein the set of instructions that is executable by the one or more computing circuits of the apparatus to cause the apparatus to assign each of the plurality of vectors by:
    receiving, from a row buffer of the memory array, the plurality of centroids;
    receiving, from the memory array, a feature vector selected from the plurality of vectors;
    marking a cluster identification to the feature vector, the cluster identification indicating one of the plurality of centroids being nearest to the feature vector; and
    writing the feature vector with the cluster identification back to the memory array.
  22. A system for processing data, comprising:
    a host; and
    a plurality of processing-in-memory (PIM) devices communicatively coupled to the host, wherein any of the plurality of PIM devices comprises a memory array configured to store data and a computing circuit configured to execute a set of instructions to cause the PIM device to:
    select between a plurality of computation modes based on a configuration from the host, the plurality of computation modes comprising a first sorting mode and a second sorting mode;
    access a plurality of data elements in a memory array of the PIM device; and
    in the first sorting mode or the second sorting mode, output top K data elements among the plurality of data elements to the host,
    wherein K is an integer greater than a threshold value when the first sorting mode is selected and is an integer smaller than or equal to the threshold value when the second sorting mode is selected.
PCT/CN2020/113839 2020-09-07 2020-09-07 Processing-in-memory device and data processing method thereof WO2022047802A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/113839 WO2022047802A1 (en) 2020-09-07 2020-09-07 Processing-in-memory device and data processing method thereof
CN202080102722.1A CN115836346A (en) 2020-09-07 2020-09-07 In-memory computing device and data processing method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/113839 WO2022047802A1 (en) 2020-09-07 2020-09-07 Processing-in-memory device and data processing method thereof

Publications (1)

Publication Number Publication Date
WO2022047802A1 true WO2022047802A1 (en) 2022-03-10

Family

ID=80492204

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113839 WO2022047802A1 (en) 2020-09-07 2020-09-07 Processing-in-memory device and data processing method thereof

Country Status (2)

Country Link
CN (1) CN115836346A (en)
WO (1) WO2022047802A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117076384B (en) * 2023-10-12 2024-02-02 清华大学无锡应用技术研究院 Computing device and in-memory computing acceleration system
CN117474062A (en) * 2023-12-28 2024-01-30 深圳市九天睿芯科技有限公司 Vector processor, neural network accelerator, chip and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160064045A1 (en) * 2014-09-03 2016-03-03 Micron Technology, Inc. Apparatuses and methods for storing a data value in multiple columns
US10013197B1 (en) * 2017-06-01 2018-07-03 Micron Technology, Inc. Shift skip
CN110197111A (en) * 2018-02-27 2019-09-03 意法半导体国际有限公司 Accelerator module for deep learning engine
CN110782028A (en) * 2018-07-24 2020-02-11 闪迪技术有限公司 Configurable precision neural network with differential binary non-volatile memory cell structure
CN111033522A (en) * 2017-09-08 2020-04-17 罗希特·塞思 Parallel neural processor for artificial intelligence
US20200193277A1 (en) * 2018-12-18 2020-06-18 Samsung Electronics Co., Ltd. Non-volatile memory device including arithmetic circuitry for neural network processing and neural network system including the same

Also Published As

Publication number Publication date
CN115836346A (en) 2023-03-21

Similar Documents

Publication Publication Date Title
US20240118892A1 (en) Apparatuses, methods, and systems for neural networks
US11625584B2 (en) Reconfigurable memory compression techniques for deep neural networks
US9886377B2 (en) Pipelined convolutional operations for processing clusters
WO2017124644A1 (en) Artificial neural network compression encoding device and method
US20210065005A1 (en) Systems and methods for providing vector-wise sparsity in a neural network
US11669443B2 (en) Data layout optimization on processing in memory architecture for executing neural network model
US11763156B2 (en) Neural network compression based on bank-balanced sparsity
US11586601B2 (en) Apparatus and method for representation of a sparse matrix in a neural network
US20190042411A1 (en) Logical operations
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
US11500811B2 (en) Apparatuses and methods for map reduce
WO2022047802A1 (en) Processing-in-memory device and data processing method thereof
WO2021080873A1 (en) Structured pruning for machine learning model
US11921814B2 (en) Method and device for matrix multiplication optimization using vector registers
US20220114270A1 (en) Hardware offload circuitry
Gupta et al. Thrifty: Training with hyperdimensional computing across flash hierarchy
Kang et al. XCelHD: An efficient GPU-powered hyperdimensional computing with parallelized training
US11562217B2 (en) Apparatuses and methods for approximating nonlinear function
WO2015094721A2 (en) Apparatuses and methods for writing masked data to a buffer
Long et al. Flex-PIM: A ferroelectric FET based vector matrix multiplication engine with dynamical bitwidth and floating point precision
US20230385258A1 (en) Dynamic random access memory-based content-addressable memory (dram-cam) architecture for exact pattern matching
Guo et al. A high-efficiency FPGA-based accelerator for binarized neural network
Gupta et al. Store-n-learn: Classification and clustering with hyperdimensional computing across flash hierarchy
US20210150311A1 (en) Data layout conscious processing in memory architecture for executing neural network model
Kim et al. ComPreEND: Computation pruning through predictive early negative detection for ReLU in a deep neural network accelerator

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20952044

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20952044

Country of ref document: EP

Kind code of ref document: A1